How can you quickly and easily generate word frequency statistics from a text, stored as key-value pairs?

There are many ways to solve this problem, but short yet clear one- or two-line solutions are generally preferred. Below, we present two such approaches.

Both techniques start by splitting the text at whitespace with the split() method. The resulting list may contain words with punctuation marks attached. We remove these with the strip() method, which trims the given characters from the beginning and end of each word and leaves anything inside the word untouched. To ensure that the same word is not counted separately depending on its capitalization, we convert all words to lowercase with the lower() method.
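To make the preprocessing step concrete, here is a minimal sketch run on a made-up sample sentence (the sentence and the variable names tokens, cleaned, and broader are assumptions chosen for illustration). It also shows one caveat: strip() stops at the first character that is not in the given set, so a token wrapped in parentheses keeps its punctuation unless the set is extended, for example with string.punctuation from the standard library.

```python
import string

# Hypothetical sample sentence, chosen to show a few edge cases.
sample = "Well-known tricks don't always work... (Really!)"
punctuation = '.,;:!?"'

# Step 1: split at runs of whitespace.
tokens = sample.split()

# Step 2: lowercase each token and trim punctuation from both ends.
cleaned = [t.lower().strip(punctuation) for t in tokens]

# Variation (an assumption, not part of the original recipe):
# trim every ASCII punctuation character instead of a hand-picked set.
broader = [t.lower().strip(string.punctuation) for t in tokens]

print(tokens)
print(cleaned)   # the last token is still '(really!)'
print(broader)   # the last token is now 'really'
```

The hyphen in "well-known" and the apostrophe in "don't" survive in both variants because strip() only looks at the ends of a string.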

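In the first approach, we use a dictionary comprehension combined with the count() method to count the occurrences of each word. Because count() rescans the entire list for every word, this approach is best suited to short texts. In the second approach, we import the Counter class from the collections module and use it to generate the frequency distribution in a single pass. Counter is a subclass of dict, so it already behaves like a dictionary, but if the resulting Counter instance doesn’t meet your needs, you can convert it to a regular dict.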

The code for each solution is shown below:


text = 'Say what you mean, mean what you say. And say it clearly!'
punctuation = '.,;:!?"'

# Solution 1: Using a dictionary comprehension and the count() method.
# Preprocessing: split the text at whitespace, remove punctuation,
# convert all words to lowercase.
words = [w.lower().strip(punctuation) for w in text.split()]
word_stats = {word: words.count(word) for word in words}

print(word_stats)
# Output: {'say': 3, 'what': 2, 'you': 2, 'mean': 2, 'and': 1, 'it': 1, 'clearly': 1}

# Solution 2: Using the Counter object from the collections module.
from collections import Counter
word_stats = Counter(w.lower().strip(punctuation) for w in text.split())

print(word_stats)
# Output: Counter({'say': 3, 'what': 2, 'you': 2, 'mean': 2, 'and': 1, 'it': 1, 'clearly': 1})

# Optional: Convert to a regular dictionary if needed.
word_stats = dict(word_stats)

print(word_stats)
# Output: {'say': 3, 'what': 2, 'you': 2, 'mean': 2, 'and': 1, 'it': 1, 'clearly': 1}
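Before converting the result to a regular dictionary, it may be worth checking whether the extra behavior of Counter is useful. The following sketch reuses the sample text from the listing and shows two differences; the names counts, top_word, and as_dict are assumptions introduced for this illustration.

```python
from collections import Counter

text = 'Say what you mean, mean what you say. And say it clearly!'
punctuation = '.,;:!?"'
counts = Counter(w.lower().strip(punctuation) for w in text.split())

# most_common(n) returns the n most frequent items as (word, count) pairs.
top_word = counts.most_common(1)
print(top_word)  # [('say', 3)]

# A Counter returns 0 for a missing key and does not add the key.
print(counts['python'])    # 0
print('python' in counts)  # False

# A regular dict raises KeyError for a missing key, so use get() with a default.
as_dict = dict(counts)
print(as_dict.get('python', 0))  # 0
```

If you only need to look up or iterate over the counts, either type works; the conversion mainly matters when other code expects a plain dict.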

If you’d like to learn more about the language features used in these code snippets, and see additional examples of their usage, the chapters “Public methods of built-in types” and “Counting the occurrence of items in a sequence – Counter” in the book Python Knowledge Building Step by Step are well worth exploring.
