Facebook Pixel

3103. Find Trending Hashtags II

HardDatabase
Leetcode Link

Problem Description

You have a table called Tweets that contains information about tweets posted in February 2024. Each row in the table has:

  • user_id: the ID of the user who posted the tweet
  • tweet_id: a unique identifier for each tweet (primary key)
  • tweet_date: the date when the tweet was posted
  • tweet: the actual text content of the tweet

Your task is to find the top 3 trending hashtags from all tweets posted in February 2024.

A hashtag is any word that starts with the # symbol (like #coding, #leetcode, etc.). Each tweet can contain zero, one, or multiple hashtags within its text.

To determine the trending hashtags, you need to:

  1. Extract all hashtags from tweets posted in February 2024
  2. Count how many times each hashtag appears across all tweets
  3. Return the 3 most frequently used hashtags

The results should be ordered by:

  • First, by the count of each hashtag in descending order (most frequent first)
  • If two hashtags have the same count, order them by the hashtag name in descending alphabetical order

The output should include two columns:

  • hashtag: the hashtag text (including the # symbol)
  • count: the number of times this hashtag appeared

For example, if the tweets contain hashtags like #python appearing 10 times, #sql appearing 8 times, and #data appearing 8 times, the result would show these three hashtags ordered by their counts, with #sql appearing before #data due to alphabetical ordering (descending).

Quick Interview Experience
Help others by sharing your interview experience
Have you seen this problem before?

Intuition

The key insight here is that we need to process unstructured text data (tweets) to extract structured information (hashtags and their counts). Since hashtags follow a specific pattern - they always start with # followed by word characters - we can use pattern matching to identify them.

Think of this problem as a three-step pipeline:

First, we need to filter our data to only February 2024 tweets since we're interested in trends for that specific month. This is straightforward date filtering.

Second, we need to extract hashtags from the tweet text. Regular expressions are perfect for this because hashtags have a well-defined pattern: # followed by one or more word characters (letters, numbers, or underscores). The regex pattern #\w+ captures exactly this - the # symbol followed by \w+ (one or more word characters).

Third, once we have all hashtags extracted, we face a counting problem. Each tweet might contribute multiple hashtags, so we need to flatten our data structure from "tweets with lists of hashtags" to "a single list of all hashtags". Then counting becomes simple - just count how many times each unique hashtag appears.

The sorting requirement gives us our final step. Since we want trending hashtags, we sort by frequency (count) first. The secondary sort by hashtag name in descending order handles ties - ensuring consistent ordering when multiple hashtags have the same count.

The beauty of this approach is that it breaks down a complex text analysis problem into simple, manageable steps: filter → extract → flatten → count → sort. Each step has a clear purpose and uses the right tool for the job (date filtering, regex matching, aggregation, and sorting).

Solution Approach

Following the approach of using regular expression matching, let's walk through the implementation step by step:

Step 1: Filter for February 2024 Tweets

tweets_feb_2024 = tweets[tweets["tweet_date"].between("2024-02-01", "2024-02-29")]

We use pandas' between() method to filter rows where the tweet_date falls within February 2024. This gives us only the relevant tweets for our analysis.

Step 2: Extract Hashtags Using Regular Expression

hashtags = tweets_feb_2024["tweet"].str.findall(r"#\w+")

The regular expression pattern #\w+ matches:

  • # - the literal hashtag symbol
  • \w+ - one or more word characters (letters, digits, or underscores)

The str.findall() method returns a list of all hashtags found in each tweet. For example, if a tweet contains "#python #coding", it returns ['#python', '#coding'].

Step 3: Flatten the List of Hashtags

all_hashtags = [tag for sublist in hashtags for tag in sublist]

Since findall() returns a list for each tweet, we have a Series of lists. We use list comprehension to flatten this into a single list containing all hashtags from all tweets. This transforms our data from nested structure to a flat list that's easier to count.

Step 4: Count Hashtag Occurrences

hashtag_counts = pd.Series(all_hashtags).value_counts().reset_index()
hashtag_counts.columns = ["hashtag", "count"]

We convert the flat list to a pandas Series and use value_counts() to count occurrences of each unique hashtag. The reset_index() converts this to a DataFrame with two columns: the hashtag and its count.

Step 5: Sort by Count and Hashtag Name

hashtag_counts = hashtag_counts.sort_values(
    by=["count", "hashtag"], ascending=[False, False]
)

We sort by two criteria:

  • Primary: count in descending order (most frequent first)
  • Secondary: hashtag name in descending order (to break ties consistently)

Step 6: Return Top 3 Trending Hashtags

top_3_hashtags = hashtag_counts.head(3)

Finally, we use head(3) to select only the top 3 rows from our sorted DataFrame.

The complete solution elegantly combines pandas operations with regular expression pattern matching to transform unstructured tweet text into structured trending hashtag data.

Ready to land your dream job?

Unlock your dream job with a 5-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's walk through the solution with a small example dataset:

Sample Tweets Table:

user_idtweet_idtweet_datetweet
11012024-02-05"Learning #python and #sql today! #coding"
21022024-02-10"Database optimization with #sql"
31032024-02-15"#python is amazing for data science"
41042024-02-20"Practicing #coding challenges #python"
51052024-03-01"#javascript is fun too"

Step 1: Filter for February 2024 After filtering dates between "2024-02-01" and "2024-02-29", we exclude tweet_id 105 (March date):

  • Remaining tweets: 101, 102, 103, 104

Step 2: Extract Hashtags Using regex pattern #\w+ on each tweet:

  • Tweet 101: ['#python', '#sql', '#coding']
  • Tweet 102: ['#sql']
  • Tweet 103: ['#python']
  • Tweet 104: ['#coding', '#python']

Step 3: Flatten the Lists Combine all hashtags into one list: ['#python', '#sql', '#coding', '#sql', '#python', '#coding', '#python']

Step 4: Count Occurrences

  • #python: 3 times
  • #sql: 2 times
  • #coding: 2 times

Step 5: Sort by Count and Name First sort by count (descending), then by name (descending) for ties:

  1. #python - count: 3
  2. #sql - count: 2 (alphabetically 's' > 'c' in descending order)
  3. #coding - count: 2

Step 6: Return Top 3 Since we only have 3 unique hashtags, all are returned:

hashtagcount
#python3
#sql2
#coding2

The key insight is how the secondary sort breaks the tie between #sql and #coding - both have count 2, but #sql comes first because 's' > 'c' when sorting in descending alphabetical order.

Solution Implementation

1import pandas as pd
2
3
4def find_trending_hashtags(tweets: pd.DataFrame) -> pd.DataFrame:
5    """
6    Find the top 3 trending hashtags from tweets in February 2024.
7
8    Args:
9        tweets: DataFrame containing tweet data with 'tweet_date' and 'tweet' columns
10
11    Returns:
12        DataFrame with top 3 hashtags and their counts
13    """
14    # Filter tweets to only include those from February 2024
15    feb_2024_start = "2024-02-01"
16    feb_2024_end = "2024-02-29"
17    tweets_feb_2024 = tweets[tweets["tweet_date"].between(feb_2024_start, feb_2024_end)]
18
19    # Extract all hashtags from tweet text using regex pattern
20    # Pattern matches # followed by one or more word characters
21    hashtag_pattern = r"#\w+"
22    hashtags_series = tweets_feb_2024["tweet"].str.findall(hashtag_pattern)
23
24    # Flatten the list of lists into a single list of all hashtags
25    all_hashtags = []
26    for hashtag_list in hashtags_series:
27        for hashtag in hashtag_list:
28            all_hashtags.append(hashtag)
29
30    # Count the frequency of each hashtag
31    hashtag_series = pd.Series(all_hashtags)
32    hashtag_counts = hashtag_series.value_counts().reset_index()
33
34    # Rename columns for clarity
35    hashtag_counts.columns = ["hashtag", "count"]
36
37    # Sort hashtags by count (descending) and then by hashtag name (descending)
38    # This ensures consistent ordering when counts are equal
39    hashtag_counts = hashtag_counts.sort_values(
40        by=["count", "hashtag"],
41        ascending=[False, False]
42    )
43
44    # Select only the top 3 trending hashtags
45    top_3_hashtags = hashtag_counts.head(3)
46
47    return top_3_hashtags
48
1import java.time.LocalDate;
2import java.util.*;
3import java.util.regex.Matcher;
4import java.util.regex.Pattern;
5import java.util.stream.Collectors;
6
7public class TweetAnalyzer {
8
9    /**
10     * Find the top 3 trending hashtags from tweets in February 2024.
11     *
12     * @param tweets List containing tweet data with date and text
13     * @return List with top 3 hashtags and their counts
14     */
15    public static List<HashtagCount> findTrendingHashtags(List<Tweet> tweets) {
16        // Define the date range for February 2024
17        LocalDate feb2024Start = LocalDate.of(2024, 2, 1);
18        LocalDate feb2024End = LocalDate.of(2024, 2, 29);
19
20        // Filter tweets to only include those from February 2024
21        List<Tweet> tweetsFeb2024 = tweets.stream()
22            .filter(tweet -> !tweet.getTweetDate().isBefore(feb2024Start) &&
23                           !tweet.getTweetDate().isAfter(feb2024End))
24            .collect(Collectors.toList());
25
26        // Pattern to match hashtags: # followed by one or more word characters
27        Pattern hashtagPattern = Pattern.compile("#\\w+");
28
29        // Extract all hashtags from tweet text using regex pattern
30        List<String> allHashtags = new ArrayList<>();
31        for (Tweet tweet : tweetsFeb2024) {
32            Matcher matcher = hashtagPattern.matcher(tweet.getTweet());
33            while (matcher.find()) {
34                allHashtags.add(matcher.group());
35            }
36        }
37
38        // Count the frequency of each hashtag
39        Map<String, Long> hashtagCounts = allHashtags.stream()
40            .collect(Collectors.groupingBy(
41                hashtag -> hashtag,
42                Collectors.counting()
43            ));
44
45        // Convert map to list of HashtagCount objects for sorting
46        List<HashtagCount> hashtagCountList = hashtagCounts.entrySet().stream()
47            .map(entry -> new HashtagCount(entry.getKey(), entry.getValue()))
48            .collect(Collectors.toList());
49
50        // Sort hashtags by count (descending) and then by hashtag name (descending)
51        // This ensures consistent ordering when counts are equal
52        hashtagCountList.sort((a, b) -> {
53            int countCompare = Long.compare(b.getCount(), a.getCount());
54            if (countCompare != 0) {
55                return countCompare;
56            }
57            return b.getHashtag().compareTo(a.getHashtag());
58        });
59
60        // Select only the top 3 trending hashtags
61        List<HashtagCount> top3Hashtags = hashtagCountList.stream()
62            .limit(3)
63            .collect(Collectors.toList());
64
65        return top3Hashtags;
66    }
67
68    // Helper class to represent a tweet
69    static class Tweet {
70        private LocalDate tweetDate;
71        private String tweet;
72
73        public Tweet(LocalDate tweetDate, String tweet) {
74            this.tweetDate = tweetDate;
75            this.tweet = tweet;
76        }
77
78        public LocalDate getTweetDate() {
79            return tweetDate;
80        }
81
82        public String getTweet() {
83            return tweet;
84        }
85    }
86
87    // Helper class to represent hashtag with its count
88    static class HashtagCount {
89        private String hashtag;
90        private long count;
91
92        public HashtagCount(String hashtag, long count) {
93            this.hashtag = hashtag;
94            this.count = count;
95        }
96
97        public String getHashtag() {
98            return hashtag;
99        }
100
101        public long getCount() {
102            return count;
103        }
104    }
105}
106
1#include <string>
2#include <vector>
3#include <unordered_map>
4#include <algorithm>
5#include <regex>
6#include <chrono>
7#include <sstream>
8
9// Structure to represent a tweet record
10struct Tweet {
11    std::string tweet_date;  // Date in "YYYY-MM-DD" format
12    std::string tweet;        // Tweet content
13};
14
15// Structure to represent hashtag count
16struct HashtagCount {
17    std::string hashtag;
18    int count;
19};
20
21// Custom DataFrame-like structure for tweets
22class DataFrame {
23public:
24    std::vector<Tweet> data;
25
26    // Constructor
27    DataFrame() {}
28    DataFrame(const std::vector<Tweet>& tweets) : data(tweets) {}
29};
30
31// Result DataFrame for hashtag counts
32class HashtagDataFrame {
33public:
34    std::vector<HashtagCount> data;
35
36    HashtagDataFrame() {}
37    HashtagDataFrame(const std::vector<HashtagCount>& hashtags) : data(hashtags) {}
38};
39
40/**
41 * Find the top 3 trending hashtags from tweets in February 2024.
42 *
43 * @param tweets DataFrame containing tweet data with 'tweet_date' and 'tweet' columns
44 * @return DataFrame with top 3 hashtags and their counts
45 */
46HashtagDataFrame find_trending_hashtags(const DataFrame& tweets) {
47    // Define the date range for February 2024
48    std::string feb_2024_start = "2024-02-01";
49    std::string feb_2024_end = "2024-02-29";
50
51    // Filter tweets to only include those from February 2024
52    std::vector<Tweet> tweets_feb_2024;
53    for (const auto& tweet : tweets.data) {
54        // Check if tweet date is within February 2024 range
55        if (tweet.tweet_date >= feb_2024_start && tweet.tweet_date <= feb_2024_end) {
56            tweets_feb_2024.push_back(tweet);
57        }
58    }
59
60    // Regular expression pattern to match hashtags
61    // Pattern matches # followed by one or more word characters
62    std::regex hashtag_pattern(R"(#\w+)");
63
64    // Extract all hashtags from tweet text using regex
65    std::vector<std::string> all_hashtags;
66    for (const auto& tweet : tweets_feb_2024) {
67        // Find all hashtags in the current tweet
68        std::sregex_iterator iter(tweet.tweet.begin(), tweet.tweet.end(), hashtag_pattern);
69        std::sregex_iterator end;
70
71        // Add each found hashtag to the list
72        while (iter != end) {
73            all_hashtags.push_back(iter->str());
74            ++iter;
75        }
76    }
77
78    // Count the frequency of each hashtag using a hash map
79    std::unordered_map<std::string, int> hashtag_frequency;
80    for (const auto& hashtag : all_hashtags) {
81        hashtag_frequency[hashtag]++;
82    }
83
84    // Convert the frequency map to a vector for sorting
85    std::vector<HashtagCount> hashtag_counts;
86    for (const auto& pair : hashtag_frequency) {
87        hashtag_counts.push_back({pair.first, pair.second});
88    }
89
90    // Sort hashtags by count (descending) and then by hashtag name (descending)
91    // This ensures consistent ordering when counts are equal
92    std::sort(hashtag_counts.begin(), hashtag_counts.end(),
93        [](const HashtagCount& a, const HashtagCount& b) {
94            if (a.count != b.count) {
95                return a.count > b.count;  // Sort by count descending
96            }
97            return a.hashtag > b.hashtag;  // Sort by hashtag name descending
98        });
99
100    // Select only the top 3 trending hashtags
101    std::vector<HashtagCount> top_3_hashtags;
102    int limit = std::min(3, static_cast<int>(hashtag_counts.size()));
103    for (int i = 0; i < limit; i++) {
104        top_3_hashtags.push_back(hashtag_counts[i]);
105    }
106
107    // Return the result as a HashtagDataFrame
108    return HashtagDataFrame(top_3_hashtags);
109}
110
1import * as pd from 'pandas-js';
2
3/**
4 * Find the top 3 trending hashtags from tweets in February 2024.
5 *
6 * @param tweets - DataFrame containing tweet data with 'tweet_date' and 'tweet' columns
7 * @returns DataFrame with top 3 hashtags and their counts
8 */
9function find_trending_hashtags(tweets: pd.DataFrame): pd.DataFrame {
10    // Define the date range for February 2024
11    const feb2024Start: string = "2024-02-01";
12    const feb2024End: string = "2024-02-29";
13
14    // Filter tweets to only include those from February 2024
15    const tweetsFeb2024: pd.DataFrame = tweets.filter(
16        (row: any) => row.tweet_date >= feb2024Start && row.tweet_date <= feb2024End
17    );
18
19    // Define regex pattern to match hashtags (# followed by one or more word characters)
20    const hashtagPattern: RegExp = /#\w+/g;
21
22    // Extract all hashtags from tweet text and flatten into a single array
23    const allHashtags: string[] = [];
24
25    // Iterate through each tweet to extract hashtags
26    tweetsFeb2024.get('tweet').forEach((tweetText: string) => {
27        // Find all hashtags in the current tweet
28        const hashtagMatches: string[] | null = tweetText.match(hashtagPattern);
29
30        // Add found hashtags to the collection
31        if (hashtagMatches) {
32            hashtagMatches.forEach((hashtag: string) => {
33                allHashtags.push(hashtag);
34            });
35        }
36    });
37
38    // Count the frequency of each hashtag using a Map
39    const hashtagFrequency: Map<string, number> = new Map();
40    allHashtags.forEach((hashtag: string) => {
41        const currentCount: number = hashtagFrequency.get(hashtag) || 0;
42        hashtagFrequency.set(hashtag, currentCount + 1);
43    });
44
45    // Convert frequency map to array of objects for sorting
46    const hashtagCounts: Array<{hashtag: string, count: number}> = Array.from(
47        hashtagFrequency.entries()
48    ).map(([hashtag, count]) => ({
49        hashtag: hashtag,
50        count: count
51    }));
52
53    // Sort hashtags by count (descending) and then by hashtag name (descending)
54    // This ensures consistent ordering when counts are equal
55    hashtagCounts.sort((a, b) => {
56        // First sort by count in descending order
57        if (a.count !== b.count) {
58            return b.count - a.count;
59        }
60        // If counts are equal, sort by hashtag name in descending order
61        return b.hashtag.localeCompare(a.hashtag);
62    });
63
64    // Select only the top 3 trending hashtags
65    const top3Hashtags: Array<{hashtag: string, count: number}> = hashtagCounts.slice(0, 3);
66
67    // Create a new DataFrame with the top 3 hashtags
68    const resultDataFrame: pd.DataFrame = new pd.DataFrame(top3Hashtags);
69
70    return resultDataFrame;
71}
72

Time and Space Complexity

Time Complexity: O(n * m + h * log(h))

  • O(n) for filtering tweets where n is the total number of tweets in the DataFrame
  • O(n * m) for extracting hashtags using regex, where m is the average length of each tweet string
  • O(h) for flattening the list of hashtags, where h is the total number of hashtags extracted
  • O(h) for counting hashtag occurrences using value_counts()
  • O(h * log(h)) for sorting hashtags by count and then by hashtag name
  • O(1) for selecting the top 3 hashtags

The dominant operation is the sorting step, making the overall time complexity O(n * m + h * log(h)).

Space Complexity: O(n + h)

  • O(n) for storing the filtered tweets from February 2024
  • O(h) for storing the extracted hashtags list
  • O(h) for storing the flattened hashtags list
  • O(u) for storing unique hashtag counts, where u ≤ h is the number of unique hashtags
  • O(1) for storing the top 3 hashtags result

The overall space complexity is O(n + h) as we need to store the filtered tweets and all extracted hashtags.

Learn more about how to find time and space complexity quickly.

Common Pitfalls

1. Case Sensitivity in Hashtag Counting

A critical issue that often occurs is treating hashtags with different cases as separate entities. For example, #Python, #python, and #PYTHON would be counted as three different hashtags, even though they likely represent the same topic.

Why this happens: The default string operations and value_counts() are case-sensitive, treating each variation as unique.

Solution: Convert all hashtags to lowercase before counting:

# Extract and normalize hashtags to lowercase
hashtags_series = tweets_feb_2024["tweet"].str.findall(r"#\w+")
all_hashtags = []
for hashtag_list in hashtags_series:
    for hashtag in hashtag_list:
        all_hashtags.append(hashtag.lower())

2. Handling Empty or Null Tweet Content

If the tweet column contains null values or empty strings, the str.findall() operation will fail or produce unexpected results.

Why this happens: Null values in the tweet column will cause str.findall() to return NaN values, which can't be iterated over in the flattening step.

Solution: Filter out null or empty tweets before processing:

# Remove rows where tweet is null or empty
tweets_feb_2024 = tweets_feb_2024[tweets_feb_2024["tweet"].notna()]
tweets_feb_2024 = tweets_feb_2024[tweets_feb_2024["tweet"].str.strip() != ""]

3. Incorrect Leap Year Handling for February

The code assumes February 2024 has 29 days, but if applied to a non-leap year, using February 29th as the end date could cause issues or miss data if the database strictly validates dates.

Why this happens: Not all years have February 29th, and some SQL databases might throw an error or handle the invalid date differently.

Solution: Use a more robust date filtering approach:

# Use month and year extraction for more reliable filtering
tweets_feb_2024 = tweets[
    (pd.to_datetime(tweets["tweet_date"]).dt.year == 2024) &
    (pd.to_datetime(tweets["tweet_date"]).dt.month == 2)
]

4. Regex Pattern Limitations

The pattern #\w+ only matches word characters (letters, numbers, underscore) but real-world hashtags might contain other valid characters that get truncated.

Why this happens: Hashtags on platforms like Twitter can include characters beyond \w, and the current pattern would split #COVID-19 into just #COVID.

Solution: Use a more comprehensive regex pattern:

# Pattern that captures more realistic hashtag formats
hashtag_pattern = r"#[a-zA-Z0-9_]+"
# Or if you want to match until whitespace or punctuation
hashtag_pattern = r"#[^\s\.\,\!\?\;\:]+(?=\s|$|[\.,:;!?])"

5. Performance Issues with Large Datasets

The list comprehension approach for flattening can be memory-intensive with millions of tweets.

Why this happens: Creating intermediate lists and iterating multiple times consumes significant memory and processing time.

Solution: Use pandas' built-in explode() method for better performance:

# More efficient flattening using explode
hashtags_series = tweets_feb_2024["tweet"].str.findall(r"#\w+")
hashtags_df = hashtags_series.explode().reset_index(drop=True)
hashtag_counts = hashtags_df.value_counts().reset_index()
Discover Your Strengths and Weaknesses: Take Our 5-Minute Quiz to Tailor Your Study Plan:

Which data structure is used in a depth first search?


Recommended Readings

Want a Structured Path to Master System Design Too? Don’t Miss This!

Load More