819. Most Common Word
Problem Description
You are given a paragraph of text and a list of banned words. Your task is to find the most frequently occurring word in the paragraph that is not in the banned list.
The problem has the following requirements and guarantees:
- Case-insensitive matching: Words in the paragraph should be treated as case-insensitive (e.g., "Bob" and "bob" are the same word)
- Return lowercase: The answer should be returned in lowercase
- Ignore punctuation: Words in the paragraph may be separated by punctuation symbols, which should be ignored when extracting words
- Unique answer: It's guaranteed that there will be exactly one most frequent non-banned word
- At least one valid word: There will always be at least one word that isn't banned
For example, if the paragraph is "Bob hit a ball, the hit BALL flew far after it was hit." and the banned words are ["hit"], then:
- Extract words: ["bob", "hit", "a", "ball", "the", "hit", "ball", "flew", "far", "after", "it", "was", "hit"]
- Count frequencies excluding banned words: "ball" appears 2 times, "bob" appears 1 time, etc.
- Return "ball", since it is the most frequent non-banned word
Intuition
The key insight is that we need to perform three main operations: extract valid words from the paragraph, count their frequencies, and filter out banned words.
First, we need to handle the text preprocessing. Since punctuation separates words and we need case-insensitive comparison, we can convert the entire paragraph to lowercase and extract only alphabetic sequences. Using the regular expression pattern [a-z]+ on the lowercased text elegantly solves both problems at once: it automatically skips punctuation and gives us clean words.
For the banned words, we can optimize lookups by converting the list into a set. Checking whether a word is banned then becomes an O(1) operation instead of O(n) for each word we examine.
To count word frequencies, we can use a Counter (or hashmap) to track how many times each word appears. This gives us the frequency data we need to identify the most common words.
The final step is finding the most frequent non-banned word. Instead of manually sorting and filtering, we can leverage the Counter's most_common() method, which returns words ordered by frequency (highest first). We then iterate through this sorted list and return the first word that isn't in our banned set.
The elegance of this approach is that by using next() with a generator expression on the sorted frequency list, we automatically get the first (most frequent) word that passes our filter condition, which is guaranteed to exist per the problem constraints.
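As a minimal sketch of this idea (the variable names here are illustrative, not part of the final solution):

import re
from collections import Counter

paragraph = "Bob hit a ball, the hit BALL flew far after it was hit."
banned_set = {"hit"}

# Lowercase the text, strip punctuation, and count words in one pass
counts = Counter(re.findall(r"[a-z]+", paragraph.lower()))

# most_common() yields (word, count) pairs from highest to lowest count;
# next() takes the first word that is not banned
print(next(word for word, _ in counts.most_common() if word not in banned_set))  # ball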
Solution Approach
The implementation uses a combination of regular expressions, set operations, and Python's Counter class to efficiently solve the problem.
Step 1: Convert banned list to a set
s = set(banned)
We convert the banned list into a set for O(1) lookup time when checking whether a word is banned.
Step 2: Extract and count words
p = Counter(re.findall('[a-z]+', paragraph.lower()))
This line does multiple things:
- paragraph.lower() converts the entire paragraph to lowercase
- re.findall('[a-z]+', ...) uses regex to find all sequences of lowercase letters, effectively splitting the text on any non-letter characters (punctuation, spaces, etc.) and extracting only valid words as continuous letter sequences
- Counter(...) creates a frequency map of all the extracted words
The regex pattern [a-z]+ matches one or more consecutive lowercase letters, which neatly handles punctuation separation. For example, "Bob's ball." becomes ["bob", "s", "ball"].
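A quick way to see this extraction in action (the input string is just an example):

import re

print(re.findall(r"[a-z]+", "Bob's ball.".lower()))
# ['bob', 's', 'ball']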
Step 3: Find the most common non-banned word
return next(word for word, _ in p.most_common() if word not in s)
This line works as follows:
- p.most_common() returns a list of (word, count) tuples sorted by frequency in descending order
- The generator expression (word for word, _ in ... if word not in s) walks through the sorted list and yields only words that aren't in the banned set
- next() returns the first word from this generator, which is guaranteed to be the most frequent non-banned word
The underscore _ is used as a placeholder since we don't need the count value, only the word itself.
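For instance, with the word list from the problem description's example (the output shown is what Counter actually produces):

from collections import Counter

p = Counter(["bob", "hit", "a", "ball", "the", "hit", "ball",
             "flew", "far", "after", "it", "was", "hit"])
s = {"hit"}

print(p.most_common()[:3])  # [('hit', 3), ('ball', 2), ('bob', 1)]

# "hit" ranks first but is banned, so next() skips it and returns "ball"
print(next(word for word, _ in p.most_common() if word not in s))  # ball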
Time Complexity: O(n + m + k log k), where n is the length of the paragraph, m is the number of banned words, and k is the number of unique words (the k log k term comes from most_common() sorting by frequency; see the detailed breakdown below).
Space Complexity: O(n + m + k) for the lowercased text, the banned set, and the word frequencies.
Example Walkthrough
Let's walk through a small example to illustrate the solution approach.
Input:
- paragraph = "Apple, apple pie; APPLE juice!"
- banned = ["pie"]
Step 1: Convert banned list to set
s = set(["pie"]) # s = {"pie"}
This gives us O(1) lookup for checking banned words.
Step 2: Extract and count words
First, convert to lowercase:
"Apple, apple pie; APPLE juice!" → "apple, apple pie; apple juice!"
Then apply the regex [a-z]+ to find all letter sequences:
re.findall('[a-z]+', "apple, apple pie; apple juice!") → ["apple", "apple", "pie", "apple", "juice"]
Create frequency counter:
Counter(["apple", "apple", "pie", "apple", "juice"]) → {"apple": 3, "pie": 1, "juice": 1}
Step 3: Find most common non-banned word
Get words sorted by frequency:
p.most_common() → [("apple", 3), ("pie", 1), ("juice", 1)]
Iterate through and filter banned words:
- Check "apple": not in {"pie"} ✓ → Return "apple"
Result: "apple" (appears 3 times, the most frequent non-banned word)
The solution efficiently handles punctuation, case-insensitivity, and banned word filtering in just a few lines of code.
Solution Implementation
Python:
from typing import List
from collections import Counter
import re

class Solution:
    def mostCommonWord(self, paragraph: str, banned: List[str]) -> str:
        # Create a set of banned words for O(1) lookup
        banned_words = set(banned)

        # Extract all words (lowercase letters only) from the paragraph:
        # convert to lowercase, then count the frequency of each word
        word_counts = Counter(re.findall(r'[a-z]+', paragraph.lower()))

        # Iterate through words sorted by frequency (most common first)
        # and return the first word that is not in the banned list
        for word, count in word_counts.most_common():
            if word not in banned_words:
                return word
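A quick usage check of the Python version, reusing the walkthrough input from above:

solution = Solution()
print(solution.mostCommonWord("Apple, apple pie; APPLE juice!", ["pie"]))  # apple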
Java:
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Solution {
    // Regex pattern to match sequences of lowercase letters
    private static final Pattern WORD_PATTERN = Pattern.compile("[a-z]+");

    public String mostCommonWord(String paragraph, String[] banned) {
        // Create a set of banned words for O(1) lookup
        Set<String> bannedWordsSet = new HashSet<>();
        for (String word : banned) {
            bannedWordsSet.add(word);
        }

        // Map to count the frequency of each valid word
        Map<String, Integer> wordFrequencyMap = new HashMap<>();

        // Convert the paragraph to lowercase and find all words using the regex
        Matcher matcher = WORD_PATTERN.matcher(paragraph.toLowerCase());
        while (matcher.find()) {
            String currentWord = matcher.group();

            // Skip if the word is in the banned list
            if (bannedWordsSet.contains(currentWord)) {
                continue;
            }

            // Update the frequency count for the current word
            wordFrequencyMap.put(currentWord, wordFrequencyMap.getOrDefault(currentWord, 0) + 1);
        }

        // Find the word with the maximum frequency
        int maxFrequency = Integer.MIN_VALUE;
        String mostCommonWord = null;

        for (Map.Entry<String, Integer> entry : wordFrequencyMap.entrySet()) {
            if (entry.getValue() > maxFrequency) {
                maxFrequency = entry.getValue();
                mostCommonWord = entry.getKey();
            }
        }

        return mostCommonWord;
    }
}
C++:
#include <cctype>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
public:
    string mostCommonWord(string paragraph, vector<string>& banned) {
        // Create a set of banned words for O(1) lookup
        unordered_set<string> bannedSet(banned.begin(), banned.end());

        // Map to store word frequencies
        unordered_map<string, int> wordFrequency;

        // Track the most common word
        string mostCommonWord;
        int maxFrequency = 0;
        int n = paragraph.size();

        // Iterate through the paragraph
        for (int i = 0; i < n;) {
            // Skip non-alphabetic characters
            if (!isalpha(paragraph[i])) {
                i++;
                continue;
            }

            // Build the current word character by character (converting to lowercase)
            string currentWord;
            while (i < n && isalpha(paragraph[i])) {
                currentWord.push_back(tolower(paragraph[i]));
                i++;
            }

            // Skip this word if it's in the banned list
            if (bannedSet.count(currentWord)) {
                continue;
            }

            // Update the word's frequency
            wordFrequency[currentWord]++;

            // Update the most common word if the current word has a higher frequency
            if (wordFrequency[currentWord] > maxFrequency) {
                mostCommonWord = currentWord;
                maxFrequency = wordFrequency[currentWord];
            }
        }

        return mostCommonWord;
    }
};
TypeScript:
/**
 * Finds the most frequent word in a paragraph that is not in the banned list.
 * @param paragraph - The input text to analyze
 * @param banned - Array of words to exclude from consideration
 * @returns The most common non-banned word in lowercase
 */
function mostCommonWord(paragraph: string, banned: string[]): string {
    // Convert the entire paragraph to lowercase for case-insensitive comparison
    const normalizedText = paragraph.toLowerCase();

    // Map to store word frequencies
    const wordFrequencyMap = new Map<string, number>();

    // Set of banned words for O(1) lookup
    const bannedWordsSet = new Set<string>(banned);

    // Split the text by non-letter characters and process each word
    const words = normalizedText.split(/[^a-z]/);

    for (const word of words) {
        // Skip empty strings and banned words
        if (word === '' || bannedWordsSet.has(word)) {
            continue;
        }

        // Update the word's frequency count
        const currentCount = wordFrequencyMap.get(word) ?? 0;
        wordFrequencyMap.set(word, currentCount + 1);
    }

    // Find the word with the maximum frequency
    const wordFrequencyEntries = [...wordFrequencyMap.entries()];
    const mostFrequentEntry = wordFrequencyEntries.reduce(
        (maxEntry, currentEntry) => {
            // Keep whichever entry has the higher count
            return currentEntry[1] > maxEntry[1] ? currentEntry : maxEntry;
        },
        ['', 0] as [string, number] // Initial value: empty word with 0 frequency (typed so the comparison type-checks)
    );

    // Return the most frequent word
    return mostFrequentEntry[0];
}
Time and Space Complexity
Time Complexity: O(n + m + k log k)
- O(n) for converting the paragraph to lowercase, where n is the length of the paragraph
- O(n) for the regex findall operation to extract all words
- O(m) for creating the banned words set, where m is the number of banned words
- O(k) for creating the Counter, where k is the number of unique words found
- O(k log k) for most_common(), which internally sorts the words by frequency
- O(k) in the worst case for iterating through the sorted words to find the first non-banned word
- Overall: O(n + m + k log k), where typically k << n
Space Complexity: O(n + m + k)
- O(m) for storing the banned words set
- O(n) for temporarily storing the lowercase paragraph
- O(n) for storing the list of words extracted by the regex
- O(k) for the Counter dictionary storing unique words and their frequencies
- Overall: O(n + m + k), where n is the paragraph length, m is the number of banned words, and k is the number of unique words
Common Pitfalls
1. Incorrect Word Extraction with Special Characters
A common mistake is trying to split words using simple string operations like split()
or split(' ')
, which fails to handle punctuation correctly.
Incorrect approach:
words = paragraph.lower().split()  # Won't handle punctuation
# "Bob's" remains as "bob's" instead of separating into "bob" and "s"
Why it fails: Words like "Bob's", "hit.", or "ball," would be treated as single tokens that include the punctuation, leading to incorrect word counts.
Solution: Use the regex pattern [a-z]+ to extract only letter sequences:
words = re.findall(r'[a-z]+', paragraph.lower())
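To see the difference side by side (the input string is illustrative):

import re

text = "Bob's ball, hit."
print(text.lower().split())                 # ["bob's", 'ball,', 'hit.']
print(re.findall(r'[a-z]+', text.lower()))  # ['bob', 's', 'ball', 'hit']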
2. Case Sensitivity Issues
Forgetting to normalize the case can lead to treating "Bob" and "bob" as different words.
Incorrect approach:
word_counts = Counter(re.findall(r'[a-zA-Z]+', paragraph))  # Mixed case
# "Bob" and "bob" are counted separately
Solution: Always convert to lowercase before processing:
word_counts = Counter(re.findall(r'[a-z]+', paragraph.lower()))
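A small demonstration of the difference (the input text is illustrative):

import re
from collections import Counter

text = "Bob hit a ball, the hit BALL flew."

# Mixed case: "Bob" and "BALL" get their own keys
print(Counter(re.findall(r'[a-zA-Z]+', text)))

# Normalized: every occurrence collapses onto one lowercase key
print(Counter(re.findall(r'[a-z]+', text.lower())))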
3. Using List Instead of Set for Banned Words
Checking membership in a list is O(n) per lookup, which can significantly slow down the solution for large banned lists.
Inefficient approach:
for word, count in word_counts.most_common():
    if word not in banned:  # O(n) lookup if banned is a list
        return word
Solution: Convert banned list to set for O(1) lookups:
banned_words = set(banned)
if word not in banned_words: # O(1) lookup
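A rough micro-benchmark of the difference (sizes and timings are illustrative and vary by machine):

import timeit

banned_list = [f"word{i}" for i in range(10_000)]
banned_set = set(banned_list)

# Membership test near the end of the list vs. the same test on a set
print(timeit.timeit(lambda: "word9999" in banned_list, number=1_000))  # slow: linear scan
print(timeit.timeit(lambda: "word9999" in banned_set, number=1_000))   # fast: hash lookup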
4. Manual Frequency Tracking Instead of Using Counter
Attempting to manually track the maximum frequency while iterating can lead to complex and error-prone code.
Error-prone approach:
max_count = 0
result = ""
for word in words:
    if word not in banned:
        count = words.count(word)  # Inefficient O(n) scan for every word
        if count > max_count:
            max_count = count
            result = word
Solution: Use Counter's most_common() method, which handles the sorting by frequency automatically:
word_counts = Counter(words)
for word, _ in word_counts.most_common():
    if word not in banned_words:
        return word