819. Most Common Word

EasyArrayHash TableStringCounting

Problem Description

You are given a paragraph of text and a list of banned words. Your task is to find the most frequently occurring word in the paragraph that is not in the banned list.

The problem has the following requirements and guarantees:

Case-insensitive matching: Words in the paragraph should be treated as case-insensitive (e.g., "Bob" and "bob" are the same word)
Return lowercase: The answer should be returned in lowercase
Ignore punctuation: Words in the paragraph may be separated by punctuation symbols, which should be ignored when extracting words
Unique answer: It's guaranteed that there will be exactly one most frequent non-banned word
At least one valid word: There will always be at least one word that isn't banned

For example, if the paragraph is "Bob hit a ball, the hit BALL flew far after it was hit." and banned words are ["hit"], then:

Extract words: ["bob", "hit", "a", "ball", "the", "hit", "ball", "flew", "far", "after", "it", "was", "hit"]
Count frequencies excluding banned words: "ball" appears 2 times, "bob" appears 1 time, etc.
Return "ball" as it's the most frequent non-banned word

Quick Interview Experience

Help others by sharing your interview experience

Have you seen this problem before?

Intuition

The key insight is that we need to perform three main operations: extract valid words from the paragraph, count their frequencies, and filter out banned words.

First, we need to handle the text preprocessing. Since punctuation separates words and we need case-insensitive comparison, we can convert the entire paragraph to lowercase and extract only alphabetic sequences. Using a regular expression pattern [a-z]+ on the lowercased text elegantly solves both problems at once - it automatically skips punctuation and gives us clean words.

For the banned words, we can optimize lookups by converting the list into a set. Checking if a word is banned becomes an O(1) operation instead of O(n) for each word we examine.

To count word frequencies, we can use a Counter (or hashmap) to track how many times each word appears. This gives us the frequency data we need to identify the most common words.

The final step is finding the most frequent non-banned word. Instead of manually sorting and filtering, we can leverage the Counter's most_common() method which returns words ordered by frequency (highest first). We then iterate through this sorted list and return the first word that isn't in our banned set.

The elegance of this approach is that by using next() with a generator expression on the sorted frequency list, we automatically get the first (most frequent) word that passes our filter condition, which is guaranteed to exist per the problem constraints.

Solution Approach

The implementation uses a combination of regular expressions, set operations, and Python's Counter class to efficiently solve the problem.

Step 1: Convert banned list to a set

s = set(banned)

We convert the banned list into a set for O(1) lookup time when checking if a word is banned.

Step 2: Extract and count words

p = Counter(re.findall('[a-z]+', paragraph.lower()))

This line does multiple things:

paragraph.lower() converts the entire paragraph to lowercase
re.findall('[a-z]+', ...) uses regex to find all sequences of lowercase letters, effectively:
- Splitting the text by any non-letter characters (punctuation, spaces, etc.)
- Extracting only valid words as continuous letter sequences
Counter(...) creates a frequency map of all extracted words

The regex pattern [a-z]+ matches one or more consecutive lowercase letters, which perfectly handles punctuation separation. For example, "Bob's ball." becomes ["bob", "s", "ball"].

Step 3: Find the most common non-banned word

return next(word for word, _ in p.most_common() if word not in s)

This line:

p.most_common() returns a list of (word, count) tuples sorted by frequency in descending order
The generator expression (word for word, _ in ... if word not in s) iterates through the sorted list and yields only words that aren't in the banned set
next() returns the first word from this generator, which is guaranteed to be the most frequent non-banned word

The underscore _ is used as a placeholder since we don't need the count value, only the word itself.

Time Complexity: O(n + m) where n is the length of the paragraph and m is the number of unique words Space Complexity: O(m) for storing the word frequencies and banned set

Ready to land your dream job?

Unlock your dream job with a 5-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's walk through a small example to illustrate the solution approach.

Input:

paragraph = "Apple, apple pie; APPLE juice!"
banned = ["pie"]

Step 1: Convert banned list to set

s = set(["pie"])  # s = {"pie"}

This gives us O(1) lookup for checking banned words.

Step 2: Extract and count words

First, convert to lowercase:

"Apple, apple pie; APPLE juice!" → "apple, apple pie; apple juice!"

Then apply regex [a-z]+ to find all letter sequences:

re.findall('[a-z]+', "apple, apple pie; apple juice!")
→ ["apple", "apple", "pie", "apple", "juice"]

Create frequency counter:

Counter(["apple", "apple", "pie", "apple", "juice"])
→ {"apple": 3, "pie": 1, "juice": 1}

Step 3: Find most common non-banned word

Get words sorted by frequency:

p.most_common() → [("apple", 3), ("pie", 1), ("juice", 1)]

Iterate through and filter banned words:

Check "apple": not in {"pie"} ✓ → Return "apple"

Result: "apple" (appears 3 times, most frequent non-banned word)

The solution efficiently handles punctuation, case-insensitivity, and banned word filtering in just a few lines of code.

Solution Implementation

1from typing import List
2from collections import Counter
3import re
4
5class Solution:
6    def mostCommonWord(self, paragraph: str, banned: List[str]) -> str:
7        # Create a set of banned words for O(1) lookup
8        banned_words = set(banned)
9      
10        # Extract all words (lowercase letters only) from the paragraph
11        # Convert to lowercase and count frequency of each word
12        word_counts = Counter(re.findall(r'[a-z]+', paragraph.lower()))
13      
14        # Iterate through words sorted by frequency (most common first)
15        # Return the first word that is not in the banned list
16        for word, count in word_counts.most_common():
17            if word not in banned_words:
18                return word
19

1import java.util.HashMap;
2import java.util.HashSet;
3import java.util.Map;
4import java.util.Set;
5import java.util.regex.Matcher;
6import java.util.regex.Pattern;
7
8class Solution {
9    // Regex pattern to match sequences of lowercase letters
10    private static final Pattern WORD_PATTERN = Pattern.compile("[a-z]+");
11
12    public String mostCommonWord(String paragraph, String[] banned) {
13        // Create a set of banned words for O(1) lookup
14        Set<String> bannedWordsSet = new HashSet<>();
15        for (String word : banned) {
16            bannedWordsSet.add(word);
17        }
18      
19        // Map to count frequency of each valid word
20        Map<String, Integer> wordFrequencyMap = new HashMap<>();
21      
22        // Convert paragraph to lowercase and find all words using regex
23        Matcher matcher = WORD_PATTERN.matcher(paragraph.toLowerCase());
24        while (matcher.find()) {
25            String currentWord = matcher.group();
26          
27            // Skip if the word is in the banned list
28            if (bannedWordsSet.contains(currentWord)) {
29                continue;
30            }
31          
32            // Update the frequency count for the current word
33            wordFrequencyMap.put(currentWord, wordFrequencyMap.getOrDefault(currentWord, 0) + 1);
34        }
35      
36        // Find the word with the maximum frequency
37        int maxFrequency = Integer.MIN_VALUE;
38        String mostCommonWord = null;
39      
40        for (Map.Entry<String, Integer> entry : wordFrequencyMap.entrySet()) {
41            if (entry.getValue() > maxFrequency) {
42                maxFrequency = entry.getValue();
43                mostCommonWord = entry.getKey();
44            }
45        }
46      
47        return mostCommonWord;
48    }
49}
50

1class Solution {
2public:
3    string mostCommonWord(string paragraph, vector<string>& banned) {
4        // Create a set of banned words for O(1) lookup
5        unordered_set<string> bannedSet(banned.begin(), banned.end());
6      
7        // Map to store word frequencies
8        unordered_map<string, int> wordFrequency;
9      
10        // Track the most common word
11        string mostCommonWord;
12        int maxFrequency = 0;
13        int n = paragraph.size();
14      
15        // Iterate through the paragraph
16        for (int i = 0; i < n;) {
17            // Skip non-alphabetic characters
18            if (!isalpha(paragraph[i])) {
19                i++;
20                continue;
21            }
22          
23            // Extract the current word
24            int wordStart = i;
25            string currentWord;
26          
27            // Build the word character by character (converting to lowercase)
28            while (i < n && isalpha(paragraph[i])) {
29                currentWord.push_back(tolower(paragraph[i]));
30                i++;
31            }
32          
33            // Skip this word if it's in the banned list
34            if (bannedSet.count(currentWord)) {
35                continue;
36            }
37          
38            // Update word frequency
39            wordFrequency[currentWord]++;
40          
41            // Update the most common word if current word has higher frequency
42            if (wordFrequency[currentWord] > maxFrequency) {
43                mostCommonWord = currentWord;
44                maxFrequency = wordFrequency[currentWord];
45            }
46        }
47      
48        return mostCommonWord;
49    }
50};
51

1/**
2 * Finds the most frequent word in a paragraph that is not in the banned list.
3 * @param paragraph - The input text to analyze
4 * @param banned - Array of words to exclude from consideration
5 * @returns The most common non-banned word in lowercase
6 */
7function mostCommonWord(paragraph: string, banned: string[]): string {
8    // Convert the entire paragraph to lowercase for case-insensitive comparison
9    const normalizedText = paragraph.toLowerCase();
10  
11    // Map to store word frequencies
12    const wordFrequencyMap = new Map<string, number>();
13  
14    // Set of banned words for O(1) lookup
15    const bannedWordsSet = new Set<string>(banned);
16  
17    // Split the text by non-letter characters and process each word
18    const words = normalizedText.split(/[^a-z]/);
19  
20    for (const word of words) {
21        // Skip empty strings and banned words
22        if (word === '' || bannedWordsSet.has(word)) {
23            continue;
24        }
25      
26        // Update word frequency count
27        const currentCount = wordFrequencyMap.get(word) ?? 0;
28        wordFrequencyMap.set(word, currentCount + 1);
29    }
30  
31    // Find the word with maximum frequency
32    const wordFrequencyEntries = [...wordFrequencyMap.entries()];
33    const mostFrequentEntry = wordFrequencyEntries.reduce(
34        (maxEntry, currentEntry) => {
35            // Compare frequencies and return the entry with higher count
36            return currentEntry[1] > maxEntry[1] ? currentEntry : maxEntry;
37        },
38        ['', 0] // Initial value: empty word with 0 frequency
39    );
40  
41    // Return the most frequent word
42    return mostFrequentEntry[0];
43}
44

Time and Space Complexity

Time Complexity: O(n + m + k log k)

O(n) for converting the paragraph to lowercase where n is the length of the paragraph
O(n) for the regex findall operation to extract all words
O(m) for creating the banned words set where m is the number of banned words
O(k) for creating the Counter where k is the number of unique words found
O(k log k) for most_common() which internally sorts the words by frequency
O(k) in the worst case for iterating through sorted words to find the first non-banned word
Overall: O(n + m + k log k) where typically k << n

Space Complexity: O(n + m + k)

O(m) for storing the banned words set
O(n) for storing the lowercase paragraph temporarily
O(n) for storing the list of words extracted by regex
O(k) for the Counter dictionary storing unique words and their frequencies
Overall: O(n + m + k) where n is the paragraph length, m is the number of banned words, and k is the number of unique words

Learn more about how to find time and space complexity quickly.

Common Pitfalls

1. Incorrect Word Extraction with Special Characters

A common mistake is trying to split words using simple string operations like split() or split(' '), which fails to handle punctuation correctly.

Incorrect approach:

words = paragraph.lower().split()  # Won't handle punctuation
# "Bob's" remains as "bob's" instead of separating to "bob" and "s"

Why it fails: Words like "Bob's", "hit.", or "ball," would be treated as single tokens including the punctuation, leading to incorrect word counting.

Solution: Use regex pattern [a-z]+ to extract only letter sequences:

words = re.findall(r'[a-z]+', paragraph.lower())

2. Case Sensitivity Issues

Forgetting to normalize the case can lead to treating "Bob" and "bob" as different words.

Incorrect approach:

word_counts = Counter(re.findall(r'[a-zA-Z]+', paragraph))  # Mixed case
# "Bob" and "bob" counted separately

Solution: Always convert to lowercase before processing:

word_counts = Counter(re.findall(r'[a-z]+', paragraph.lower()))

3. Using List Instead of Set for Banned Words

Checking membership in a list is O(n) per lookup, which can significantly slow down the solution for large banned lists.

Inefficient approach:

for word, count in word_counts.most_common():
    if word not in banned:  # O(n) lookup if banned is a list
        return word

Solution: Convert banned list to set for O(1) lookups:

banned_words = set(banned)
if word not in banned_words:  # O(1) lookup

4. Manual Frequency Tracking Instead of Using Counter

Attempting to manually track the maximum frequency while iterating can lead to complex and error-prone code.

Error-prone approach:

max_count = 0
result = ""
for word in words:
    if word not in banned:
        count = words.count(word)  # Inefficient O(n) operation
        if count > max_count:
            max_count = count
            result = word

Solution: Use Counter's most_common() method which handles sorting by frequency automatically:

word_counts = Counter(words)
for word, _ in word_counts.most_common():
    if word not in banned_words:
        return word

Discover Your Strengths and Weaknesses: Take Our 5-Minute Quiz to Tailor Your Study Plan:

Problem: Given a list of tasks and a list of requirements, compute a sequence of tasks that can be performed, such that we complete every task once while satisfying all the requirements.

Which of the following method should we use to solve this problem?

Topological Sort

Uniform Cost Search

Monotonic Stack

Memoization

I don't know