3103. Find Trending Hashtags II 🔒

HardDatabase

Problem Description

We have a table named Tweets that consists of several columns: user_id, tweet_id, tweet_date, and tweet. The tweet_id is the primary key in this table, meaning each value in this column is unique. Every row in the table represents an individual tweet, containing information about the user who tweeted, a unique identifier for the tweet, the date when the tweet was posted, and the content of the tweet. Importantly, it is specified that all tweet dates are valid dates occurring in February 2024.

The task is to find the top 3 trending hashtags mentioned in these tweets during February 2024. Each tweet may contain multiple hashtags. The result should be a listing of these top trending hashtags, ranked primarily by their frequency of occurrence in descending order. In the case of ties in frequency, hashtags should be further sorted in descending order by the hashtag text itself.

Quick Interview Experience

Help others by sharing your interview experience

Have you seen this problem before?

Intuition

To solve this problem, the key is to analyze the content of the tweets to extract and count the hashtags.

Filtering Tweets by Date: Since tweets are already restricted to February 2024, we confirm that we are working with the relevant dataset.
Extracting Hashtags: Each tweet can contain zero, one, or several hashtags. Using regular expressions, we can parse each tweet's content to extract all hashtags.
Counting Hashtags: After extracting the hashtags, we count how often each appears. This provides a frequency distribution of hashtags across all tweets.
Sorting Hashtags: The hashtags are then sorted in descending order by their frequency count. When counts are equal, the hashtags themselves are sorted in descending order lexicographically.
Selecting Top Hashtags: Finally, we choose the top three hashtags based on the sorted order.

This structured approach ensures that we efficiently gather the most relevant hashtags in terms of popularity, adhering strictly to the problem's requirements.

Solution Approach

The solution leverages the power of Python and its libraries to process and analyze tweet data efficiently:

Filtering Tweets by Date: First, the data is filtered to ensure we only consider tweets from February 2024. This is done using the pandas library to filter rows where the tweet_date is within the specified date range.

tweets_feb_2024 = tweets[tweets["tweet_date"].between("2024-02-01", "2024-02-29")]

Extracting Hashtags: Each tweet is scanned for hashtags using regular expression matching. The expression #\w+ identifies all words that follow a hash symbol (#). This is accomplished with the str.findall method in pandas, which creates a series of lists of hashtags.
```
hashtags = tweets_feb_2024["tweet"].str.findall(r"#\w+")
```
Flattening the List of Hashtags: To facilitate counting, the list of hashtag lists is flattened into a single list of individual hashtags. This transformation is critical for further computation and is achieved using a list comprehension.
```
all_hashtags = [tag for sublist in hashtags for tag in sublist]
```

Counting the Hashtags: Using pandas.Series, the frequency of each hashtag is counted, and the results are converted to a DataFrame. This step essentially summarizes the occurrence of each hashtag.

hashtag_counts = pd.Series(all_hashtags).value_counts().reset_index()
hashtag_counts.columns = ["hashtag", "count"]

Sorting the Hashtags: Next, the hashtag frequency counts are sorted in descending order. The primary key for sorting is count, followed by hashtag to handle ties, both in descending order to meet problem requirements.

hashtag_counts = hashtag_counts.sort_values(by=["count", "hashtag"], ascending=[False, False])

Selecting Top 3 Hashtags: Finally, the top three hashtags are extracted from the sorted list. This final selection represents our result of the most popular hashtags.
```
top_3_hashtags = hashtag_counts.head(3)
```

This approach provides an efficient and effective means of determining trending hashtags by parsing and processing tweets. It captures both the popularity and the relative ranking in case of ties.

Ready to land your dream job?

Unlock your dream job with a 2-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's illustrate the solution approach using a simplified example where we have a subset of Tweets data consisting of three hypothetical tweets from February 2024:

user_id	tweet_id	tweet_date	tweet
1	101	2024-02-03	"Loving the #WinterWonderland! #SnowDay"
2	102	2024-02-15	"Happy #ValentinesDay! Spread the love! #Love"
3	103	2024-02-28	"#WinterWonderland is magical this year! #SnowDay #Magical"

Filtering Tweets by Date: Since all tweets already fall within February 2024, we proceed without further filtering:

tweets_feb_2024 = tweets[tweets["tweet_date"].between("2024-02-01", "2024-02-29")]

Extracting Hashtags: We use a regular expression to extract hashtags from each tweet:

hashtags = tweets_feb_2024["tweet"].str.findall(r"#\w+")

This gives us:

[ ['#WinterWonderland', '#SnowDay'], ['#ValentinesDay', '#Love'], ['#WinterWonderland', '#SnowDay', '#Magical'] ]

Flattening the List of Hashtags: We flatten the nested list into a single list of hashtags:

all_hashtags = [tag for sublist in hashtags for tag in sublist]

Resulting in:

['#WinterWonderland', '#SnowDay', '#ValentinesDay', '#Love', '#WinterWonderland', '#SnowDay', '#Magical']

Counting the Hashtags: Using pandas.Series, we tally up the frequency of each hashtag:

hashtag_counts = pd.Series(all_hashtags).value_counts().reset_index()
hashtag_counts.columns = ["hashtag", "count"]

This results in:

| hashtag           | count |
|-------------------|-------|
| #WinterWonderland | 2     |
| #SnowDay          | 2     |
| #ValentinesDay    | 1     |
| #Love             | 1     |
| #Magical          | 1     |

Sorting the Hashtags: We sort the hashtags by frequency and then by alphabetic order if frequencies tie:

hashtag_counts = hashtag_counts.sort_values(by=["count", "hashtag"], ascending=[False, False])

Thus, the sorted list becomes:

| hashtag           | count |
|-------------------|-------|
| #WinterWonderland | 2     |
| #SnowDay          | 2     |
| #ValentinesDay    | 1     |
| #Love             | 1     |
| #Magical          | 1     |

Selecting Top 3 Hashtags: We select the top three hashtags from the sorted list:

top_3_hashtags = hashtag_counts.head(3)

The final output is:

| hashtag           | count |
|-------------------|-------|
| #WinterWonderland | 2     |
| #SnowDay          | 2     |
| #ValentinesDay    | 1     |

Thus, the top 3 trending hashtags for February 2024 are #WinterWonderland, #SnowDay, and #ValentinesDay.

Solution Implementation

1import pandas as pd
2
3def find_trending_hashtags(tweets: pd.DataFrame) -> pd.DataFrame:
4    # Filter the tweets DataFrame to include only those tweets from February 2024
5    tweets_feb_2024 = tweets[tweets["tweet_date"].between("2024-02-01", "2024-02-29")]
6
7    # Extract hashtags from the tweets column using a regular expression
8    hashtags = tweets_feb_2024["tweet"].str.findall(r"#\w+")
9
10    # Flatten the list of lists of hashtags into a single list
11    all_hashtags = [tag for sublist in hashtags for tag in sublist]
12
13    # Count the frequency of each hashtag using `value_counts` and reset the index
14    hashtag_counts = pd.Series(all_hashtags).value_counts().reset_index()
15    hashtag_counts.columns = ["hashtag", "count"]  # Rename the DataFrame columns
16
17    # Sort the hashtags by count in descending order, and break ties using alphabetical order
18    hashtag_counts = hashtag_counts.sort_values(
19        by=["count", "hashtag"], ascending=[False, False]
20    )
21
22    # Retrieve the top 3 hashtags with the highest count
23    top_3_hashtags = hashtag_counts.head(3)
24
25    return top_3_hashtags
26

1import java.util.*;
2import java.util.regex.*;
3import java.util.stream.*;
4import java.text.*;
5import java.time.*;
6import java.time.format.*;
7
8public class TrendingHashtags {
9
10    public static List<Map.Entry<String, Long>> findTrendingHashtags(List<Map<String, String>> tweets) {
11        // Define the date formatter for parsing the tweet dates
12        DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
13
14        // Filter the tweets to include only those from February 2024
15        List<Map<String, String>> tweetsFeb2024 = tweets.stream()
16            .filter(tweet -> {
17                LocalDate tweetDate = LocalDate.parse(tweet.get("tweet_date"), formatter);
18                LocalDate startDate = LocalDate.of(2024, Month.FEBRUARY, 1);
19                LocalDate endDate = LocalDate.of(2024, Month.FEBRUARY, 29);
20                return (tweetDate.isEqual(startDate) || tweetDate.isAfter(startDate)) && tweetDate.isBefore(endDate.plusDays(1));
21            })
22            .collect(Collectors.toList());
23
24        // Extract hashtags from the tweet text using a regular expression
25        List<String> allHashtags = new ArrayList<>();
26        Pattern hashtagPattern = Pattern.compile("#\\w+");
27      
28        for (Map<String, String> tweet : tweetsFeb2024) {
29            Matcher matcher = hashtagPattern.matcher(tweet.get("tweet"));
30            while (matcher.find()) {
31                allHashtags.add(matcher.group());
32            }
33        }
34
35        // Count the frequency of each hashtag
36        Map<String, Long> hashtagCounts = allHashtags.stream()
37            .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
38
39        // Sort the hashtags first by count in descending order and then alphabetically
40        List<Map.Entry<String, Long>> sortedHashtags = hashtagCounts.entrySet().stream()
41            .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
42            .thenComparing(Map.Entry.comparingByKey()))
43            .collect(Collectors.toList());
44
45        // Retrieve the top 3 hashtags
46        return sortedHashtags.subList(0, Math.min(3, sortedHashtags.size()));
47    }
48
49    public static void main(String[] args) {
50        // Example usage of findTrendingHashtags
51        List<Map<String, String>> tweets = List.of(
52            Map.of("tweet_date", "2024-02-10", "tweet", "What a game! #sports #game"),
53            Map.of("tweet_date", "2024-02-11", "tweet", "Loving the new features #tech #innovation"),
54            Map.of("tweet_date", "2024-02-15", "tweet", "Great meal tonight #foodie #culinary"),
55            Map.of("tweet_date", "2024-02-15", "tweet", "Another beautiful day #sunshine #nature"),
56            Map.of("tweet_date", "2024-02-16", "tweet", "Tech is amazing #technology #innovation")
57        );
58
59        List<Map.Entry<String, Long>> trendingHashtags = findTrendingHashtags(tweets);
60
61        // Print the top 3 trending hashtags
62        for (Map.Entry<String, Long> entry : trendingHashtags) {
63            System.out.println("Hashtag: " + entry.getKey() + ", Count: " + entry.getValue());
64        }
65    }
66}
67

1#include <iostream>
2#include <vector>
3#include <map>
4#include <algorithm>
5#include <regex>
6#include <string>
7#include <sstream>
8
9// Structure to hold tweet information
10struct Tweet {
11    std::string tweetText;
12    std::string tweetDate;
13};
14
15// Function to extract hashtags from a tweet using regex
16std::vector<std::string> extractHashtags(const std::string& tweet) {
17    std::regex hashtagRegex("#\\w+");
18    std::smatch matches;
19    std::vector<std::string> hashtags;
20
21    std::string::const_iterator searchStart(tweet.cbegin());
22    while (regex_search(searchStart, tweet.cend(), matches, hashtagRegex)) {
23        hashtags.push_back(matches[0]);
24        searchStart = matches.suffix().first;
25    }
26    return hashtags;
27}
28
29// Function to find trending hashtags from a list of tweets
30std::vector<std::pair<std::string, int>> findTrendingHashtags(const std::vector<Tweet>& tweets) {
31    std::map<std::string, int> hashtagFrequency;
32
33    // Iterate through tweets
34    for (const Tweet& tweet : tweets) {
35        // Check if tweet date is in February 2024
36        if (tweet.tweetDate.compare(0, 7, "2024-02") == 0) {
37            // Extract hashtags from tweet
38            std::vector<std::string> hashtags = extractHashtags(tweet.tweetText);
39          
40            // Count hashtag frequency
41            for (const std::string& hashtag : hashtags) {
42                hashtagFrequency[hashtag]++;
43            }
44        }
45    }
46
47    // Convert map to vector of pairs and sort them by frequency and lexicographically
48    std::vector<std::pair<std::string, int>> hashtagCounts(hashtagFrequency.begin(), hashtagFrequency.end());
49
50    std::sort(hashtagCounts.begin(), hashtagCounts.end(), [](const auto& lhs, const auto& rhs) {
51        if (lhs.second == rhs.second) {
52            return lhs.first < rhs.first; // Sort lexicographically if counts are equal
53        }
54        return lhs.second > rhs.second; // Sort by count in descending order
55    });
56
57    // Get top 3 hashtags
58    if (hashtagCounts.size() > 3) {
59        hashtagCounts.resize(3);
60    }
61
62    return hashtagCounts;
63}
64
65int main() {
66    // Example usage
67    std::vector<Tweet> tweets = {
68        {"I love coding! #coding #fun", "2024-02-15"},
69        {"Another beautiful day! #sunshine #happy", "2024-02-20"},
70        {"Let's code more. #coding", "2024-02-25"},
71        {"Learn algorithms. #coding", "2024-02-27"}
72    };
73
74    std::vector<std::pair<std::string, int>> trendingHashtags = findTrendingHashtags(tweets);
75
76    // Display top 3 hashtags
77    for (const auto& pair : trendingHashtags) {
78        std::cout << pair.first << ": " << pair.second << std::endl;
79    }
80
81    return 0;
82}
83

1import * as pd from "pandas-js";
2
3// Define a function to find trending hashtags from a DataFrame of tweets
4function findTrendingHashtags(tweets: pd.DataFrame): pd.DataFrame {
5    // Filter tweets to select only those from February, 2024
6    const tweetsFeb2024 = tweets.get(tweets["tweet_date"].between("2024-02-01", "2024-02-29")._values);
7
8    // Extract hashtags from the 'tweet' column using a regular expression
9    const hashtags = tweetsFeb2024.get("tweet").apply((tweet: string) => {
10        return tweet.match(/#\w+/g) || [];
11    });
12
13    // Flatten the array of arrays of hashtags into a single array
14    const allHashtags: string[] = [];
15    hashtags.forEach((hashtagArray: string[]) => {
16        allHashtags.push(...hashtagArray);
17    });
18
19    // Count the frequency of each hashtag using a dictionary
20    const hashtagCounts: { [key: string]: number } = {};
21    allHashtags.forEach((tag: string) => {
22        if (hashtagCounts[tag]) {
23            hashtagCounts[tag]++;
24        } else {
25            hashtagCounts[tag] = 1;
26        }
27    });
28
29    // Convert the dictionary to an array of objects for sorting
30    let hashtagCountsArray = Object.keys(hashtagCounts).map((tag: string) => {
31        return { hashtag: tag, count: hashtagCounts[tag] };
32    });
33
34    // Sort hashtags by count in descending order, break ties using alphabetical order
35    hashtagCountsArray = hashtagCountsArray.sort((a, b) => {
36        if (b.count === a.count) {
37            return a.hashtag.localeCompare(b.hashtag);
38        }
39        return b.count - a.count;
40    });
41
42    // Select the top 3 hashtags
43    const top3Hashtags = hashtagCountsArray.slice(0, 3);
44
45    // Return the top 3 hashtags as a DataFrame
46    return new pd.DataFrame(top3Hashtags);
47}
48

Time and Space Complexity

The code performs the following operations:

Filtering Tweets: Filtering the DataFrame for dates in February 2024 involves checking each row in the tweet_date column, resulting in a time complexity of O(n).
Extracting Hashtags: Extracting hashtags using str.findall involves a regex operation on each tweet. If m is the average number of characters per tweet, the time complexity is O(n * m).
Flattening List of Hashtags: Flattening all hashtags into a single list is an operation with a time complexity of O(k), where k is the total number of hashtags extracted.
Counting Hashtags: Counting occurrences using pandas.Series.value_counts processes the list of hashtags, giving a time complexity of O(k).
Sorting Hashtag Counts: Sorting the results based on counts and hashtag values involves sorting the hashtags, which has a time complexity of O(k log k).
Selecting Top Hashtags: Selecting the top 3 hashtags involves slicing the DataFrame, which is O(1).

Overall Time Complexity:
The dominant terms are extracting hashtags and sorting them. Therefore, the overall time complexity of the code is O(n * m + k log k).

Space Complexity:

Data Storage: Storing hashtags and their counts requires additional space. If k is the total number of hashtags, this space complexity will be O(k).
Intermediate DataFrames and Lists: Additional space for filtered DataFrames and lists of hashtags also contributes, maintaining a space complexity of O(k).

Overall Space Complexity:
The overall space complexity is O(k) due to the storage of hashtags and counts.

Learn more about how to find time and space complexity quickly.

Discover Your Strengths and Weaknesses: Take Our 2-Minute Quiz to Tailor Your Study Plan: