The problem statement:
Given a string, find the frequency of each word in that string.
Counting word frequency is a common programming task where we analyze a text document or string to determine how frequently each word appears within it. This exercise is useful for various applications, such as text analysis, natural language processing, and creating word clouds.
Here is a sample text for analysis, along with the word_frequency() function stub.
nursery_rhyme: str = 'The itsy bitsy spider crawled up the water spout. ' \
                     'Down came the rain, and washed the spider out. ' \
                     'Out came the sun, and dried up all the rain, ' \
                     'and the itsy bitsy spider went up the spout again.'
def word_frequency(text: str) -> dict[str, int]:
    ...
The above code can be used as a starting point for solving the problem. The output could look as follows.
{
    ...,
    'itsy': 2,
    'rain': 2,
    'the': 8,
    'up': 3,
    ...
}
💡 EXTRA TASK: Given a string, find the most common word in that string.
Solution
Looking at the provided code, we can see that our function has to take the text to process and return a dictionary of word-to-count mappings.
We will start by transforming the text: make it lowercase, strip the ',' and '.' characters so that only letters and whitespace remain, and split it into a list of tokens for further processing.
tokens: list[str] = text.lower().replace(',', '').replace('.', '').split()
This gives us a clean list of words to work with.
['the', 'itsy', 'bitsy', 'spider', 'crawled', 'up', 'the', 'water', 'spout', 'down', 'came', 'the', 'rain', 'and', 'washed', 'the', ...]
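A quick aside: the chained replace() calls are enough for this rhyme, which contains only commas and periods, but real text usually carries more punctuation. A sketch of a more general tokenizer built on the standard re module (not part of the solution below) could look like this.

import re

def tokenize(text: str) -> list[str]:
    # Keep only runs of lowercase letters; every other character
    # (punctuation, digits, whitespace) acts as a separator.
    return re.findall(r'[a-z]+', text.lower())

For the nursery rhyme above it yields exactly the same token list; be aware, though, that it would split a contraction such as "don't" into two tokens.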
Now we can use a dictionary to count the occurrences and aggregate the results.
freqs = {}
for token in tokens:
    if token in freqs:
        freqs[token] += 1
    else:
        freqs[token] = 1
The above code inserts 1 when a word is seen for the first time and increments the existing counter otherwise. The whole implementation of the word_frequency() function looks as follows.
def word_frequency(text: str) -> dict[str, int]:
    tokens: list[str] = text.lower().replace(',', '').replace('.', '').split()
    freqs = {}
    for token in tokens:
        if token in freqs:
            freqs[token] += 1
        else:
            freqs[token] = 1
    return freqs
Now let's use it and see the results.
❯ python3 main.py
{'the': 8, 'itsy': 2, 'bitsy': 2, 'spider': 3, 'crawled': 1, 'up': 3, 'water': 1, 'spout': 2, 'down': 1, 'came': 2, 'rain': 2, 'and': 3, 'washed': 1, 'out': 2, 'sun': 1, 'dried': 1, 'all': 1, 'went': 1, 'again': 1}
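As an aside, the standard library's collections.Counter performs exactly this insert-or-increment aggregation, so the counting loop could be collapsed into a single call. A minimal sketch:

from collections import Counter

def word_frequency(text: str) -> dict[str, int]:
    tokens = text.lower().replace(',', '').replace('.', '').split()
    # Counter is a dict subclass, so the declared return type still holds.
    return Counter(tokens)

Both versions produce the same mappings; the manual loop simply makes the mechanics explicit.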
Having those mappings, we can easily find the most common word in the text. We can use the built-in max() function for that purpose.
most_common_word = max(frequency, key=frequency.get)
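Here max() iterates over the dictionary's keys and compares them by the count that frequency.get returns for each one; if several words share the top count, the first one encountered wins. With the Counter variant sketched earlier, the same lookup is also available through most_common():

from collections import Counter

counts = Counter('the itsy bitsy spider and the rain and the sun'.split())
# most_common(1) returns a list of (word, count) pairs,
# here just the single most frequent word.
word, count = counts.most_common(1)[0]
print(word, count)  # the 3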
The whole code looks as follows.
nursery_rhyme: str = 'The itsy bitsy spider crawled up the water spout. ' \
                     'Down came the rain, and washed the spider out. ' \
                     'Out came the sun, and dried up all the rain, ' \
                     'and the itsy bitsy spider went up the spout again.'

def word_frequency(text: str) -> dict[str, int]:
    tokens: list[str] = text.lower().replace(',', '').replace('.', '').split()
    freqs = {}
    for token in tokens:
        if token in freqs:
            freqs[token] += 1
        else:
            freqs[token] = 1
    return freqs
frequency = word_frequency(nursery_rhyme)
print(f'Word frequencies {frequency}')
most_common_word = max(frequency, key=frequency.get)
print(f'The most common word is "{most_common_word}"')
Let's run the script and find the most common word.
Word frequencies {'the': 8, 'itsy': 2, 'bitsy': 2, 'spider': 3, 'crawled': 1, 'up': 3, 'water': 1, 'spout': 2, 'down': 1, 'came': 2, 'rain': 2, 'and': 3, 'washed': 1, 'out': 2, 'sun': 1, 'dried': 1, 'all': 1, 'went': 1, 'again': 1}
The most common word is "the"
So the most common word is "the". 😱
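Unsurprisingly, the article "the" wins. If you want the most frequent content word instead, one possible tweak (an extension beyond the original task, using an illustrative, non-exhaustive stop-word list) is to filter out function words before taking the maximum:

# Illustrative stop-word list; real applications use much larger ones.
STOP_WORDS = {'the', 'and', 'a', 'an', 'up', 'out', 'all'}

content_freqs = {word: count for word, count in frequency.items()
                 if word not in STOP_WORDS}
print(max(content_freqs, key=content_freqs.get))  # 'spider' (3 occurrences)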
Summary
The "Counting Word Frequency" exercise is a fundamental task in text analysis. It involves taking a text document or string, breaking it into individual words, and then determining how often each word appears within the text. This process, known as tokenization and frequency counting, is crucial for tasks like understanding document content, generating word clouds, or identifying keywords. By creating a dictionary that associates words with their frequencies, you can extract valuable insights from text data and pave the way for more advanced text analysis and natural language processing techniques.