3103. Find Trending Hashtags II 🔒
Problem Description
You have a table called Tweets that contains information about tweets posted in February 2024. Each row in the table has:
- user_id: the ID of the user who posted the tweet
- tweet_id: a unique identifier for each tweet (primary key)
- tweet_date: the date when the tweet was posted
- tweet: the actual text content of the tweet
Your task is to find the top 3 trending hashtags from all tweets posted in February 2024.
A hashtag is any word that starts with the # symbol (like #coding, #leetcode, etc.). Each tweet can contain zero, one, or multiple hashtags within its text.
To determine the trending hashtags, you need to:
- Extract all hashtags from tweets posted in February 2024
- Count how many times each hashtag appears across all tweets
- Return the 3 most frequently used hashtags
The results should be ordered by:
- First, by the count of each hashtag in descending order (most frequent first)
- If two hashtags have the same count, order them by the hashtag name in descending alphabetical order
The output should include two columns:
- hashtag: the hashtag text (including the # symbol)
- count: the number of times this hashtag appeared
For example, if the tweets contain #python 10 times, #sql 8 times, and #data 8 times, the result lists #python first, followed by #sql and then #data. The tie at 8 is broken by hashtag name in descending alphabetical order, so #sql comes before #data.
Intuition
The key insight here is that we need to process unstructured text data (tweets) to extract structured information (hashtags and their counts). Since hashtags follow a specific pattern - they always start with # followed by word characters - we can use pattern matching to identify them.
Think of this problem as a three-step pipeline:
First, we need to filter our data to only February 2024 tweets since we're interested in trends for that specific month. This is straightforward date filtering.
Second, we need to extract hashtags from the tweet text. Regular expressions are perfect for this because hashtags have a well-defined pattern: # followed by one or more word characters (letters, numbers, or underscores). The regex pattern #\w+ captures exactly this: the # symbol followed by \w+ (one or more word characters).
Third, once we have all hashtags extracted, we face a counting problem. Each tweet might contribute multiple hashtags, so we need to flatten our data structure from "tweets with lists of hashtags" to "a single list of all hashtags". Then counting becomes simple - just count how many times each unique hashtag appears.
The sorting requirement gives us our final step. Since we want trending hashtags, we sort by frequency (count) first. The secondary sort by hashtag name in descending order handles ties - ensuring consistent ordering when multiple hashtags have the same count.
The beauty of this approach is that it breaks down a complex text analysis problem into simple, manageable steps: filter → extract → flatten → count → sort. Each step has a clear purpose and uses the right tool for the job (date filtering, regex matching, aggregation, and sorting).
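To make the pipeline concrete before bringing in pandas, here is a minimal sketch using only the standard library. The sample rows are hypothetical, and the list-of-tuples layout is an assumption made for illustration, not the format the problem supplies.

```python
import re
from collections import Counter
from datetime import date

# Hypothetical sample rows: (tweet_date, tweet text)
rows = [
    (date(2024, 2, 5), "Learning #python and #sql today! #coding"),
    (date(2024, 2, 10), "Database optimization with #sql"),
    (date(2024, 3, 1), "#javascript is fun too"),
]

# filter: keep only tweets posted in February 2024
in_feb = [text for d, text in rows if (d.year, d.month) == (2024, 2)]

# extract + flatten: one flat list holding every hashtag occurrence
tags = [tag for text in in_feb for tag in re.findall(r"#\w+", text)]

# count, then sort by (count, hashtag), both in descending order
counts = Counter(tags)
top3 = sorted(counts.items(), key=lambda kv: (kv[1], kv[0]), reverse=True)[:3]
print(top3)
```

Sorting on the tuple (count, hashtag) with reverse=True applies the descending order to both keys at once, which is exactly the tie-breaking rule the problem asks for.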
Solution Approach
Following the approach of using regular expression matching, let's walk through the implementation step by step:
Step 1: Filter for February 2024 Tweets
tweets_feb_2024 = tweets[tweets["tweet_date"].between("2024-02-01", "2024-02-29")]
We use pandas' between() method to filter rows where the tweet_date falls within February 2024. This gives us only the relevant tweets for our analysis.
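A quick way to convince yourself that between() includes both endpoints by default is to test the boundary dates. The small DataFrame below is hypothetical sample data, and it assumes tweet_date is stored as a datetime column holding dates without a time-of-day component.

```python
import pandas as pd

# Hypothetical sample data: one row on each side of both boundaries
tweets = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "tweet_date": pd.to_datetime(
        ["2024-01-31", "2024-02-01", "2024-02-29", "2024-03-01"]
    ),
})

# between() is inclusive on both ends unless told otherwise
mask = tweets["tweet_date"].between("2024-02-01", "2024-02-29")
kept_ids = tweets.loc[mask, "tweet_id"].tolist()
print(kept_ids)
```

Only the rows dated February 1 and February 29 survive. If the column carried timestamps rather than plain dates, the upper bound would mean midnight on February 29, so a filter on year and month would likely be the safer choice.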
Step 2: Extract Hashtags Using Regular Expression
hashtags = tweets_feb_2024["tweet"].str.findall(r"#\w+")
The regular expression pattern #\w+ matches:
- # - the literal hashtag symbol
- \w+ - one or more word characters (letters, digits, or underscores)
The str.findall() method returns a list of all hashtags found in each tweet. For example, if a tweet contains "#python #coding", it returns ['#python', '#coding'].
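The following sketch shows what str.findall() returns for a few hypothetical tweets, including one with no hashtags and one where # is not followed by a word character.

```python
import pandas as pd

# Hypothetical tweet texts chosen to exercise the pattern
tweets_text = pd.Series([
    "Learning #python and #sql today! #coding",
    "No tags here",
    "C# is not a hashtag, but #c_sharp is",
])

# Each element of the result is a list of matches for that tweet
found = tweets_text.str.findall(r"#\w+").tolist()
print(found)
```

A tweet without hashtags yields an empty list rather than a missing value, and the # in "C#" is skipped because a space follows it.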
Step 3: Flatten the List of Hashtags
all_hashtags = [tag for sublist in hashtags for tag in sublist]
Since findall() returns a list for each tweet, we have a Series of lists. We use a list comprehension to flatten this into a single list containing all hashtags from all tweets. This transforms our data from a nested structure into a flat list that's easier to count.
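If the nested comprehension is hard to read, the same flattening can be sketched with itertools.chain.from_iterable from the standard library. This is an alternative to the comprehension above, not a change to the approach, and the nested list is hypothetical sample data.

```python
from itertools import chain

# Hypothetical per-tweet hashtag lists, as str.findall() would produce them
per_tweet = [["#python", "#sql"], [], ["#python"]]

# chain.from_iterable walks each inner list in order and skips empty ones
flat = list(chain.from_iterable(per_tweet))

# The nested comprehension from Step 3 gives the same result
flat_comprehension = [tag for sublist in per_tweet for tag in sublist]
print(flat)
```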
Step 4: Count Hashtag Occurrences
hashtag_counts = pd.Series(all_hashtags).value_counts().reset_index()
hashtag_counts.columns = ["hashtag", "count"]
We convert the flat list to a pandas Series and use value_counts() to count occurrences of each unique hashtag. The reset_index() call converts the result to a DataFrame with two columns: the hashtag and its count.
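Here is a small sketch of the counting step on a hypothetical flat list. Assigning the column names explicitly matters because the default names produced by value_counts().reset_index() have differed between pandas versions.

```python
import pandas as pd

# Hypothetical flat list of hashtags
all_hashtags = ["#python", "#sql", "#python", "#coding", "#sql", "#python"]

hashtag_counts = pd.Series(all_hashtags).value_counts().reset_index()
# Set both column names explicitly instead of relying on version-specific defaults
hashtag_counts.columns = ["hashtag", "count"]

# Convert to a plain dict just to inspect the result
as_dict = dict(zip(hashtag_counts["hashtag"], hashtag_counts["count"]))
print(as_dict)
```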
Step 5: Sort by Count and Hashtag Name
hashtag_counts = hashtag_counts.sort_values(
    by=["count", "hashtag"],
    ascending=[False, False]
)
We sort by two criteria:
- Primary: count in descending order (most frequent first)
- Secondary: hashtag name in descending order (to break ties consistently)
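The tie-breaking behavior is easy to check on the counts from the problem description (#python 10, #sql 8, #data 8). The DataFrame below is built by hand for illustration.

```python
import pandas as pd

# Counts taken from the example in the problem description
hashtag_counts = pd.DataFrame({
    "hashtag": ["#data", "#python", "#sql"],
    "count": [8, 10, 8],
})

# Both keys descending: higher counts first, then later names first on ties
ordered = hashtag_counts.sort_values(
    by=["count", "hashtag"],
    ascending=[False, False],
)
order = ordered["hashtag"].tolist()
print(order)
```

The two hashtags tied at 8 come out as #sql and then #data, because 's' sorts after 'd'.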
Step 6: Return Top 3 Trending Hashtags
top_3_hashtags = hashtag_counts.head(3)
Finally, we use head(3) to select only the top 3 rows from our sorted DataFrame.
The complete solution elegantly combines pandas operations with regular expression pattern matching to transform unstructured tweet text into structured trending hashtag data.
Example Walkthrough
Let's walk through the solution with a small example dataset:
Sample Tweets Table:
user_id | tweet_id | tweet_date | tweet |
---|---|---|---|
1 | 101 | 2024-02-05 | "Learning #python and #sql today! #coding" |
2 | 102 | 2024-02-10 | "Database optimization with #sql" |
3 | 103 | 2024-02-15 | "#python is amazing for data science" |
4 | 104 | 2024-02-20 | "Practicing #coding challenges #python" |
5 | 105 | 2024-03-01 | "#javascript is fun too" |
Step 1: Filter for February 2024
After filtering dates between "2024-02-01" and "2024-02-29", we exclude tweet_id 105 (March date):
- Remaining tweets: 101, 102, 103, 104
Step 2: Extract Hashtags
Using regex pattern #\w+ on each tweet:
- Tweet 101: ['#python', '#sql', '#coding']
- Tweet 102: ['#sql']
- Tweet 103: ['#python']
- Tweet 104: ['#coding', '#python']
Step 3: Flatten the Lists
Combine all hashtags into one list:
['#python', '#sql', '#coding', '#sql', '#python', '#coding', '#python']
Step 4: Count Occurrences
- #python: 3 times
- #sql: 2 times
- #coding: 2 times
Step 5: Sort by Count and Name
First sort by count (descending), then by name (descending) for ties:
- #python - count: 3
- #sql - count: 2 (alphabetically 's' > 'c' in descending order)
- #coding - count: 2
Step 6: Return Top 3
Since we only have 3 unique hashtags, all are returned:
hashtag | count |
---|---|
#python | 3 |
#sql | 2 |
#coding | 2 |
The key insight is how the secondary sort breaks the tie between #sql and #coding: both have count 2, but #sql comes first because 's' > 'c' when sorting in descending alphabetical order.
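The walkthrough can be reproduced end to end with the sketch below. It rebuilds the sample table by hand and applies the same six steps, assuming tweet_date is parsed as a datetime column.

```python
import pandas as pd

# The sample Tweets table from the walkthrough
tweets = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "tweet_id": [101, 102, 103, 104, 105],
    "tweet_date": pd.to_datetime(
        ["2024-02-05", "2024-02-10", "2024-02-15", "2024-02-20", "2024-03-01"]
    ),
    "tweet": [
        "Learning #python and #sql today! #coding",
        "Database optimization with #sql",
        "#python is amazing for data science",
        "Practicing #coding challenges #python",
        "#javascript is fun too",
    ],
})

# Steps 1-3: filter, extract, flatten
feb = tweets[tweets["tweet_date"].between("2024-02-01", "2024-02-29")]
per_tweet = feb["tweet"].str.findall(r"#\w+")
all_hashtags = [tag for tags in per_tweet for tag in tags]

# Steps 4-6: count, sort with tie-break, keep the top 3
counts = pd.Series(all_hashtags).value_counts().reset_index()
counts.columns = ["hashtag", "count"]
result = counts.sort_values(
    by=["count", "hashtag"], ascending=[False, False]
).head(3)
print(result.to_string(index=False))
```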
Solution Implementation
Python:
import pandas as pd


def find_trending_hashtags(tweets: pd.DataFrame) -> pd.DataFrame:
    """
    Find the top 3 trending hashtags from tweets in February 2024.

    Args:
        tweets: DataFrame containing tweet data with 'tweet_date' and 'tweet' columns

    Returns:
        DataFrame with top 3 hashtags and their counts
    """
    # Filter tweets to only include those from February 2024
    feb_2024_start = "2024-02-01"
    feb_2024_end = "2024-02-29"
    tweets_feb_2024 = tweets[tweets["tweet_date"].between(feb_2024_start, feb_2024_end)]

    # Extract all hashtags from tweet text using regex pattern
    # Pattern matches # followed by one or more word characters
    hashtag_pattern = r"#\w+"
    hashtags_series = tweets_feb_2024["tweet"].str.findall(hashtag_pattern)

    # Flatten the list of lists into a single list of all hashtags
    all_hashtags = []
    for hashtag_list in hashtags_series:
        for hashtag in hashtag_list:
            all_hashtags.append(hashtag)

    # Count the frequency of each hashtag
    hashtag_series = pd.Series(all_hashtags)
    hashtag_counts = hashtag_series.value_counts().reset_index()

    # Rename columns for clarity
    hashtag_counts.columns = ["hashtag", "count"]

    # Sort hashtags by count (descending) and then by hashtag name (descending)
    # This ensures consistent ordering when counts are equal
    hashtag_counts = hashtag_counts.sort_values(
        by=["count", "hashtag"],
        ascending=[False, False]
    )

    # Select only the top 3 trending hashtags
    top_3_hashtags = hashtag_counts.head(3)

    return top_3_hashtags

Java:
import java.time.LocalDate;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class TweetAnalyzer {

    /**
     * Find the top 3 trending hashtags from tweets in February 2024.
     *
     * @param tweets List containing tweet data with date and text
     * @return List with top 3 hashtags and their counts
     */
    public static List<HashtagCount> findTrendingHashtags(List<Tweet> tweets) {
        // Define the date range for February 2024
        LocalDate feb2024Start = LocalDate.of(2024, 2, 1);
        LocalDate feb2024End = LocalDate.of(2024, 2, 29);

        // Filter tweets to only include those from February 2024
        List<Tweet> tweetsFeb2024 = tweets.stream()
            .filter(tweet -> !tweet.getTweetDate().isBefore(feb2024Start) &&
                             !tweet.getTweetDate().isAfter(feb2024End))
            .collect(Collectors.toList());

        // Pattern to match hashtags: # followed by one or more word characters
        Pattern hashtagPattern = Pattern.compile("#\\w+");

        // Extract all hashtags from tweet text using regex pattern
        List<String> allHashtags = new ArrayList<>();
        for (Tweet tweet : tweetsFeb2024) {
            Matcher matcher = hashtagPattern.matcher(tweet.getTweet());
            while (matcher.find()) {
                allHashtags.add(matcher.group());
            }
        }

        // Count the frequency of each hashtag
        Map<String, Long> hashtagCounts = allHashtags.stream()
            .collect(Collectors.groupingBy(
                hashtag -> hashtag,
                Collectors.counting()
            ));

        // Convert map to list of HashtagCount objects for sorting
        List<HashtagCount> hashtagCountList = hashtagCounts.entrySet().stream()
            .map(entry -> new HashtagCount(entry.getKey(), entry.getValue()))
            .collect(Collectors.toList());

        // Sort hashtags by count (descending) and then by hashtag name (descending)
        // This ensures consistent ordering when counts are equal
        hashtagCountList.sort((a, b) -> {
            int countCompare = Long.compare(b.getCount(), a.getCount());
            if (countCompare != 0) {
                return countCompare;
            }
            return b.getHashtag().compareTo(a.getHashtag());
        });

        // Select only the top 3 trending hashtags
        List<HashtagCount> top3Hashtags = hashtagCountList.stream()
            .limit(3)
            .collect(Collectors.toList());

        return top3Hashtags;
    }

    // Helper class to represent a tweet
    static class Tweet {
        private final LocalDate tweetDate;
        private final String tweet;

        public Tweet(LocalDate tweetDate, String tweet) {
            this.tweetDate = tweetDate;
            this.tweet = tweet;
        }

        public LocalDate getTweetDate() {
            return tweetDate;
        }

        public String getTweet() {
            return tweet;
        }
    }

    // Helper class to represent hashtag with its count
    static class HashtagCount {
        private final String hashtag;
        private final long count;

        public HashtagCount(String hashtag, long count) {
            this.hashtag = hashtag;
            this.count = count;
        }

        public String getHashtag() {
            return hashtag;
        }

        public long getCount() {
            return count;
        }
    }
}

C++:
#include <algorithm>
#include <regex>
#include <string>
#include <unordered_map>
#include <vector>

// Structure to represent a tweet record
struct Tweet {
    std::string tweet_date;  // Date in "YYYY-MM-DD" format
    std::string tweet;       // Tweet content
};

// Structure to represent hashtag count
struct HashtagCount {
    std::string hashtag;
    int count;
};

// Custom DataFrame-like structure for tweets
class DataFrame {
public:
    std::vector<Tweet> data;

    // Constructor
    DataFrame() {}
    DataFrame(const std::vector<Tweet>& tweets) : data(tweets) {}
};

// Result DataFrame for hashtag counts
class HashtagDataFrame {
public:
    std::vector<HashtagCount> data;

    HashtagDataFrame() {}
    HashtagDataFrame(const std::vector<HashtagCount>& hashtags) : data(hashtags) {}
};

/**
 * Find the top 3 trending hashtags from tweets in February 2024.
 *
 * @param tweets DataFrame containing tweet data with 'tweet_date' and 'tweet' columns
 * @return DataFrame with top 3 hashtags and their counts
 */
HashtagDataFrame find_trending_hashtags(const DataFrame& tweets) {
    // Define the date range for February 2024
    std::string feb_2024_start = "2024-02-01";
    std::string feb_2024_end = "2024-02-29";

    // Filter tweets to only include those from February 2024
    std::vector<Tweet> tweets_feb_2024;
    for (const auto& tweet : tweets.data) {
        // ISO "YYYY-MM-DD" strings compare correctly as plain strings
        if (tweet.tweet_date >= feb_2024_start && tweet.tweet_date <= feb_2024_end) {
            tweets_feb_2024.push_back(tweet);
        }
    }

    // Regular expression pattern to match hashtags
    // Pattern matches # followed by one or more word characters
    std::regex hashtag_pattern(R"(#\w+)");

    // Extract all hashtags from tweet text using regex
    std::vector<std::string> all_hashtags;
    for (const auto& tweet : tweets_feb_2024) {
        // Find all hashtags in the current tweet
        std::sregex_iterator iter(tweet.tweet.begin(), tweet.tweet.end(), hashtag_pattern);
        std::sregex_iterator end;

        // Add each found hashtag to the list
        while (iter != end) {
            all_hashtags.push_back(iter->str());
            ++iter;
        }
    }

    // Count the frequency of each hashtag using a hash map
    std::unordered_map<std::string, int> hashtag_frequency;
    for (const auto& hashtag : all_hashtags) {
        hashtag_frequency[hashtag]++;
    }

    // Convert the frequency map to a vector for sorting
    std::vector<HashtagCount> hashtag_counts;
    for (const auto& pair : hashtag_frequency) {
        hashtag_counts.push_back({pair.first, pair.second});
    }

    // Sort hashtags by count (descending) and then by hashtag name (descending)
    // This ensures consistent ordering when counts are equal
    std::sort(hashtag_counts.begin(), hashtag_counts.end(),
              [](const HashtagCount& a, const HashtagCount& b) {
                  if (a.count != b.count) {
                      return a.count > b.count;  // Sort by count descending
                  }
                  return a.hashtag > b.hashtag;  // Sort by hashtag name descending
              });

    // Select only the top 3 trending hashtags
    std::vector<HashtagCount> top_3_hashtags;
    int limit = std::min(3, static_cast<int>(hashtag_counts.size()));
    for (int i = 0; i < limit; i++) {
        top_3_hashtags.push_back(hashtag_counts[i]);
    }

    // Return the result as a HashtagDataFrame
    return HashtagDataFrame(top_3_hashtags);
}

TypeScript:
// Plain-object stand-ins for the table rows, mirroring the helper types
// used in the Java and C++ versions (no DataFrame library is required).
interface Tweet {
    tweet_date: string; // Date in "YYYY-MM-DD" format
    tweet: string;      // Tweet content
}

interface HashtagCount {
    hashtag: string;
    count: number;
}

/**
 * Find the top 3 trending hashtags from tweets in February 2024.
 *
 * @param tweets - Array of tweet rows with 'tweet_date' and 'tweet' fields
 * @returns Array with the top 3 hashtags and their counts
 */
function find_trending_hashtags(tweets: Tweet[]): HashtagCount[] {
    // Define the date range for February 2024
    const feb2024Start = "2024-02-01";
    const feb2024End = "2024-02-29";

    // Filter tweets to only include those from February 2024
    // (ISO date strings compare correctly as plain strings)
    const tweetsFeb2024 = tweets.filter(
        (row) => row.tweet_date >= feb2024Start && row.tweet_date <= feb2024End
    );

    // Define regex pattern to match hashtags (# followed by one or more word characters)
    const hashtagPattern = /#\w+/g;

    // Extract all hashtags from tweet text and flatten into a single array
    const allHashtags: string[] = [];
    for (const row of tweetsFeb2024) {
        // match() returns null when a tweet has no hashtags
        const hashtagMatches = row.tweet.match(hashtagPattern);
        if (hashtagMatches) {
            allHashtags.push(...hashtagMatches);
        }
    }

    // Count the frequency of each hashtag using a Map
    const hashtagFrequency = new Map<string, number>();
    for (const hashtag of allHashtags) {
        hashtagFrequency.set(hashtag, (hashtagFrequency.get(hashtag) || 0) + 1);
    }

    // Convert frequency map to array of objects for sorting
    const hashtagCounts: HashtagCount[] = Array.from(
        hashtagFrequency.entries()
    ).map(([hashtag, count]) => ({ hashtag, count }));

    // Sort hashtags by count (descending) and then by hashtag name (descending)
    // This ensures consistent ordering when counts are equal
    hashtagCounts.sort((a, b) => {
        // First sort by count in descending order
        if (a.count !== b.count) {
            return b.count - a.count;
        }
        // Compare by code unit so ties are ordered the same way as in the
        // other versions; localeCompare would depend on the runtime locale
        if (a.hashtag === b.hashtag) {
            return 0;
        }
        return a.hashtag < b.hashtag ? 1 : -1;
    });

    // Select only the top 3 trending hashtags
    return hashtagCounts.slice(0, 3);
}

Time and Space Complexity
Time Complexity: O(n * m + h * log(h))
- O(n) for filtering tweets, where n is the total number of tweets in the DataFrame
- O(n * m) for extracting hashtags using regex, where m is the average length of each tweet string
- O(h) for flattening the list of hashtags, where h is the total number of hashtags extracted
- O(h) for counting hashtag occurrences using value_counts()
- O(h * log(h)) for sorting hashtags by count and then by hashtag name (only the unique hashtags are sorted, so h is an upper bound)
- O(1) for selecting the top 3 hashtags
The extraction and sorting steps dominate, making the overall time complexity O(n * m + h * log(h)).
Space Complexity: O(n + h)
- O(n) for storing the filtered tweets from February 2024
- O(h) for storing the extracted hashtags list
- O(h) for storing the flattened hashtags list
- O(u) for storing unique hashtag counts, where u ≤ h is the number of unique hashtags
- O(1) for storing the top 3 hashtags result
The overall space complexity is O(n + h), as we need to store the filtered tweets and all extracted hashtags.
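Since only three rows are returned, the full sort is not strictly required. As a sketch of an alternative (not the approach used in the solution above), heapq.nlargest from the standard library selects the top k items in roughly O(u * log(k)) time, where u is the number of unique hashtags. The flat list here is hypothetical sample data.

```python
import heapq
from collections import Counter

# Hypothetical flat list of extracted hashtags
all_hashtags = [
    "#python", "#sql", "#coding", "#sql",
    "#python", "#coding", "#python", "#go",
]
counts = Counter(all_hashtags)

# Rank by (count, hashtag); nlargest returns results from largest to smallest,
# so ties on count fall back to the hashtag name in descending order
top3 = heapq.nlargest(3, counts.items(), key=lambda kv: (kv[1], kv[0]))
print(top3)
```

With k fixed at 3, the selection step is effectively linear in the number of unique hashtags, although the regex extraction still dominates for long tweets.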
Common Pitfalls
1. Case Sensitivity in Hashtag Counting
A critical issue that often occurs is treating hashtags with different cases as separate entities. For example, #Python, #python, and #PYTHON would be counted as three different hashtags, even though they likely represent the same topic.
Why this happens: The default string operations and value_counts() are case-sensitive, treating each variation as unique.
Solution: Convert all hashtags to lowercase before counting:
# Extract and normalize hashtags to lowercase
hashtags_series = tweets_feb_2024["tweet"].str.findall(r"#\w+")
all_hashtags = []
for hashtag_list in hashtags_series:
    for hashtag in hashtag_list:
        all_hashtags.append(hashtag.lower())
2. Handling Empty or Null Tweet Content
If the tweet column contains null values, str.findall() itself does not raise an error, but the flattening step that follows it does.
Why this happens: For a null tweet, str.findall() returns NaN instead of a list, and NaN can't be iterated over in the flattening step. Empty strings are harmless because they simply produce an empty list.
Solution: Filter out null or empty tweets before processing:
# Remove rows where tweet is null or empty
tweets_feb_2024 = tweets_feb_2024[tweets_feb_2024["tweet"].notna()]
tweets_feb_2024 = tweets_feb_2024[tweets_feb_2024["tweet"].str.strip() != ""]
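To see the failure mode concretely, the sketch below runs str.findall() on a hypothetical Series containing a null and an empty string. It then uses fillna("") as one possible alternative to filtering the rows out.

```python
import pandas as pd

# Hypothetical tweet column with a null and an empty string
tweets_text = pd.Series(["#python rocks", None, ""])

# The null row comes back as a missing value, not as a list
raw = tweets_text.str.findall(r"#\w+")

# Replacing nulls with "" first makes every row produce a list
safe = tweets_text.fillna("").str.findall(r"#\w+")
all_hashtags = [tag for tags in safe for tag in tags]
print(all_hashtags)
```

Either approach works: dropping the null rows shrinks the data up front, while fillna("") keeps the row count unchanged.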
3. Incorrect Leap Year Handling for February
The code assumes February 2024 has 29 days, but if applied to a non-leap year, using February 29th as the end date could cause issues or miss data if the database strictly validates dates.
Why this happens: Not all years have February 29th, and some SQL databases might throw an error or handle the invalid date differently.
Solution: Use a more robust date filtering approach:
# Use month and year extraction for more reliable filtering
tweet_dates = pd.to_datetime(tweets["tweet_date"])
tweets_feb_2024 = tweets[(tweet_dates.dt.year == 2024) & (tweet_dates.dt.month == 2)]
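If explicit start and end dates are still preferred, the last day of the month can be computed instead of hard-coded. The helper below is a hypothetical sketch; month_bounds is a name introduced here for illustration, and it relies on calendar.monthrange from the standard library.

```python
import calendar
from datetime import date


def month_bounds(year: int, month: int) -> tuple[date, date]:
    """Return the first and last calendar day of the given month."""
    # monthrange returns (weekday of the first day, number of days in month)
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, 1), date(year, month, last_day)


start, end = month_bounds(2024, 2)
print(start, end)
```

The same helper yields February 28 for a non-leap year such as 2023, so the filter stays valid when the year changes.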
4. Regex Pattern Limitations
The pattern #\w+ only matches word characters (letters, digits, underscore), so hashtags written with other characters get truncated.
Why this happens: Tweet text may contain tags that use characters beyond \w, and the current pattern would cut #COVID-19 down to just #COVID.
Solution: Choose the pattern deliberately and check it against the hashtag definition the problem expects:
# ASCII-only variant: stricter than \w, which also matches Unicode letters in Python 3
hashtag_pattern = r"#[a-zA-Z0-9_]+"
# Or match up to whitespace or common punctuation, which keeps #COVID-19 intact
hashtag_pattern = r"#[^\s\.\,\!\?\;\:]+(?=\s|$|[\.,:;!?])"
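The difference is easiest to see side by side. The sketch below compares #\w+ with a hyphen-tolerant variant on a hypothetical tweet. The second pattern is an assumption introduced here for illustration, and whether hyphenated tags should count as a single hashtag depends on the problem's definition.

```python
import re

# Hypothetical tweet mixing plain, numeric, underscored, and hyphenated tags
text = "Stay safe #COVID-19, learn #python3 and #data_science!"

# Word characters only: stops at the hyphen
word_chars = re.findall(r"#\w+", text)

# Illustrative variant: also allows hyphen-joined groups of word characters
with_hyphen = re.findall(r"#\w+(?:-\w+)*", text)

print(word_chars)
print(with_hyphen)
```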
5. Performance Issues with Large Datasets
The list comprehension approach for flattening can be memory-intensive with millions of tweets.
Why this happens: Creating intermediate lists and iterating multiple times consumes significant memory and processing time.
Solution: Use pandas' built-in explode() method for better performance:
# Flatten with explode; tweets without hashtags yield NaN, which dropna() removes
hashtags_series = tweets_feb_2024["tweet"].str.findall(r"#\w+")
all_hashtags = hashtags_series.explode().dropna()
hashtag_counts = all_hashtags.value_counts().reset_index()
hashtag_counts.columns = ["hashtag", "count"]