The problem statement:
Given a string, find the frequency of each word in that string.
Counting word frequency is a common programming task where we analyze a text document or string to determine how frequently each word appears within it. This exercise is useful for various applications, such as text analysis, natural language processing, and creating word clouds.
Here is a sample text for analysis, along with the word_frequency() function stub.
nursery_rhyme: str = 'The itsy bitsy spider crawled up the water spout. ' \
                     'Down came the rain, and washed the spider out. ' \
                     'Out came the sun, and dried up all the rain, ' \
                     'and the itsy bitsy spider went up the spout again.'
def word_frequency(text: str) -> dict[str, int]:
    ...
The above code can be used as a starting point for solving the problem. The output could look as follows.
{
    ...,
    'itsy': 2,
    'rain': 2,
    'the': 8,
    'up': 3,
    ...
}
💡 EXTRA TASK: Given a string, find the most common word in that string.
Solution
Looking at the provided code, we can see that our function has to take the text to process and return a dictionary of word-to-count mappings.
We will start by transforming the text: make it lowercase, strip the ',' and '.' characters so that only letters and whitespace remain, and split it into a list of tokens for further processing.
tokens: list[str] = text.lower().replace(',', '').replace('.', '').split()
This gives us a clean list of words to work with.
['the', 'itsy', 'bitsy', 'spider', 'crawled', 'up', 'the', 'water', 'spout', 'down', 'came', 'the', 'rain', 'and', 'washed', 'the', ...]
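A quick aside: the chained replace() calls are enough for this rhyme, which contains only commas and periods, but real text usually carries more punctuation. A sketch of a more general tokenizer built on the standard re module (not part of the solution below) could look like this.

import re

def tokenize(text: str) -> list[str]:
    # Keep only runs of lowercase letters; every other character
    # (punctuation, digits, whitespace) acts as a separator.
    return re.findall(r'[a-z]+', text.lower())

For the nursery rhyme above it yields exactly the same token list; be aware, though, that it would split a contraction such as "don't" into two tokens.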
Now we can use a dictionary to count the occurrences and aggregate the results.
freqs = {}
for token in tokens:
    if token in freqs:
        freqs[token] += 1
    else:
        freqs[token] = 1
The above code inserts 1 when a word is seen for the first time and increments the existing counter otherwise. The whole implementation of the word_frequency() function looks as follows.
def word_frequency(text: str) -> dict[str, int]:
    tokens: list[str] = text.lower().replace(',', '').replace('.', '').split()
    freqs = {}
    for token in tokens:
        if token in freqs:
            freqs[token] += 1
        else:
            freqs[token] = 1
    return freqs
Now let's use it and see the results.
❯ python3 main.py
{'the': 8, 'itsy': 2, 'bitsy': 2, 'spider': 3, 'crawled': 1, 'up': 3, 'water': 1, 'spout': 2, 'down': 1, 'came': 2, 'rain': 2, 'and': 3, 'washed': 1, 'out': 2, 'sun': 1, 'dried': 1, 'all': 1, 'went': 1, 'again': 1}
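As an aside, the standard library's collections.Counter performs exactly this insert-or-increment aggregation, so the counting loop could be collapsed into a single call. A minimal sketch:

from collections import Counter

def word_frequency(text: str) -> dict[str, int]:
    tokens = text.lower().replace(',', '').replace('.', '').split()
    # Counter is a dict subclass, so the declared return type still holds.
    return Counter(tokens)

Both versions produce the same mappings; the manual loop simply makes the mechanics explicit.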
Having those mappings, we can easily find the most common word in the text. We can use the built-in max() function for that purpose.
most_common_word = max(frequency, key=frequency.get)
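Here max() iterates over the dictionary's keys and compares them by the count that frequency.get returns for each one; if several words share the top count, the first one encountered wins. With the Counter variant sketched earlier, the same lookup is also available through most_common():

from collections import Counter

counts = Counter('the itsy bitsy spider and the rain and the sun'.split())
# most_common(1) returns a list of (word, count) pairs,
# here just the single most frequent word.
word, count = counts.most_common(1)[0]
print(word, count)  # the 3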
The whole code looks as follows.
nursery_rhyme: str = 'The itsy bitsy spider crawled up the water spout. ' \
                     'Down came the rain, and washed the spider out. ' \
                     'Out came the sun, and dried up all the rain, ' \
                     'and the itsy bitsy spider went up the spout again.'

def word_frequency(text: str) -> dict[str, int]:
    tokens: list[str] = text.lower().replace(',', '').replace('.', '').split()
    freqs = {}
    for token in tokens:
        if token in freqs:
            freqs[token] += 1
        else:
            freqs[token] = 1
    return freqs
frequency = word_frequency(nursery_rhyme)
print(f'Word frequencies {frequency}')
most_common_word = max(frequency, key=frequency.get)
print(f'The most common word is "{most_common_word}"')
Let's run the script and find the most common word.
Word frequencies {'the': 8, 'itsy': 2, 'bitsy': 2, 'spider': 3, 'crawled': 1, 'up': 3, 'water': 1, 'spout': 2, 'down': 1, 'came': 2, 'rain': 2, 'and': 3, 'washed': 1, 'out': 2, 'sun': 1, 'dried': 1, 'all': 1, 'went': 1, 'again': 1}
The most common word is "the"
So the most common word is "the". 😱
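Unsurprisingly, the article "the" wins. If you want the most frequent content word instead, one possible tweak (an extension beyond the original task, using an illustrative, non-exhaustive stop-word list) is to filter out function words before taking the maximum:

# Illustrative stop-word list; real applications use much larger ones.
STOP_WORDS = {'the', 'and', 'a', 'an', 'up', 'out', 'all'}

content_freqs = {word: count for word, count in frequency.items()
                 if word not in STOP_WORDS}
print(max(content_freqs, key=content_freqs.get))  # 'spider' (3 occurrences)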
Summary
The "Counting Word Frequency" exercise is a fundamental task in text analysis. It involves taking a text document or string, breaking it into individual words, and then determining how often each word appears within the text. This process, known as tokenization and frequency counting, is crucial for tasks like understanding document content, generating word clouds, or identifying keywords. By creating a dictionary that associates words with their frequencies, you can extract valuable insights from text data and pave the way for more advanced text analysis and natural language processing techniques.