3103. Find Trending Hashtags II 🔒
Problem Description
We have a table named Tweets
that consists of several columns: user_id
, tweet_id
, tweet_date
, and tweet
. The tweet_id
is the primary key in this table, meaning each value in this column is unique. Every row in the table represents an individual tweet, containing information about the user who tweeted, a unique identifier for the tweet, the date when the tweet was posted, and the content of the tweet. Importantly, it is specified that all tweet dates are valid dates occurring in February 2024.
The task is to find the top 3 trending hashtags mentioned in these tweets during February 2024. Each tweet may contain multiple hashtags. The result should be a listing of these top trending hashtags, ranked primarily by their frequency of occurrence in descending order. In the case of ties in frequency, hashtags should be further sorted in descending order by the hashtag text itself.
Intuition
To solve this problem, the key is to analyze the content of the tweets to extract and count the hashtags.
-
Filtering Tweets by Date: Since tweets are already restricted to February 2024, we confirm that we are working with the relevant dataset.
-
Extracting Hashtags: Each tweet can contain zero, one, or several hashtags. Using regular expressions, we can parse each tweet's content to extract all hashtags.
-
Counting Hashtags: After extracting the hashtags, we count how often each appears. This provides a frequency distribution of hashtags across all tweets.
-
Sorting Hashtags: The hashtags are then sorted in descending order by their frequency count. When counts are equal, the hashtags themselves are sorted in descending order lexicographically.
-
Selecting Top Hashtags: Finally, we choose the top three hashtags based on the sorted order.
This structured approach ensures that we efficiently gather the most relevant hashtags in terms of popularity, adhering strictly to the problem's requirements.
Solution Approach
The solution leverages the power of Python and its libraries to process and analyze tweet data efficiently:
-
Filtering Tweets by Date: First, the data is filtered to ensure we only consider tweets from February 2024. This is done using the
pandas
library to filter rows where thetweet_date
is within the specified date range.tweets_feb_2024 = tweets[tweets["tweet_date"].between("2024-02-01", "2024-02-29")]
-
Extracting Hashtags: Each tweet is scanned for hashtags using regular expression matching. The expression
#\w+
identifies all words that follow a hash symbol (#
). This is accomplished with thestr.findall
method inpandas
, which creates a series of lists of hashtags.hashtags = tweets_feb_2024["tweet"].str.findall(r"#\w+")
-
Flattening the List of Hashtags: To facilitate counting, the list of hashtag lists is flattened into a single list of individual hashtags. This transformation is critical for further computation and is achieved using a list comprehension.
all_hashtags = [tag for sublist in hashtags for tag in sublist]
-
Counting the Hashtags: Using
pandas.Series
, the frequency of each hashtag is counted, and the results are converted to a DataFrame. This step essentially summarizes the occurrence of each hashtag.hashtag_counts = pd.Series(all_hashtags).value_counts().reset_index() hashtag_counts.columns = ["hashtag", "count"]
-
Sorting the Hashtags: Next, the hashtag frequency counts are sorted in descending order. The primary key for sorting is
count
, followed byhashtag
to handle ties, both in descending order to meet problem requirements.hashtag_counts = hashtag_counts.sort_values(by=["count", "hashtag"], ascending=[False, False])
-
Selecting Top 3 Hashtags: Finally, the top three hashtags are extracted from the sorted list. This final selection represents our result of the most popular hashtags.
top_3_hashtags = hashtag_counts.head(3)
This approach provides an efficient and effective means of determining trending hashtags by parsing and processing tweets. It captures both the popularity and the relative ranking in case of ties.
Ready to land your dream job?
Unlock your dream job with a 2-minute evaluator for a personalized learning plan!
Start EvaluatorExample Walkthrough
Let's illustrate the solution approach using a simplified example where we have a subset of Tweets
data consisting of three hypothetical tweets from February 2024:
user_id | tweet_id | tweet_date | tweet |
---|---|---|---|
1 | 101 | 2024-02-03 | "Loving the #WinterWonderland! #SnowDay" |
2 | 102 | 2024-02-15 | "Happy #ValentinesDay! Spread the love! #Love" |
3 | 103 | 2024-02-28 | "#WinterWonderland is magical this year! #SnowDay #Magical" |
-
Filtering Tweets by Date: Since all tweets already fall within February 2024, we proceed without further filtering:
tweets_feb_2024 = tweets[tweets["tweet_date"].between("2024-02-01", "2024-02-29")]
-
Extracting Hashtags: We use a regular expression to extract hashtags from each tweet:
hashtags = tweets_feb_2024["tweet"].str.findall(r"#\w+")
This gives us:
[ ['#WinterWonderland', '#SnowDay'], ['#ValentinesDay', '#Love'], ['#WinterWonderland', '#SnowDay', '#Magical'] ]
-
Flattening the List of Hashtags: We flatten the nested list into a single list of hashtags:
all_hashtags = [tag for sublist in hashtags for tag in sublist]
Resulting in:
['#WinterWonderland', '#SnowDay', '#ValentinesDay', '#Love', '#WinterWonderland', '#SnowDay', '#Magical']
-
Counting the Hashtags: Using
pandas.Series
, we tally up the frequency of each hashtag:hashtag_counts = pd.Series(all_hashtags).value_counts().reset_index() hashtag_counts.columns = ["hashtag", "count"]
This results in:
| hashtag | count | |-------------------|-------| | #WinterWonderland | 2 | | #SnowDay | 2 | | #ValentinesDay | 1 | | #Love | 1 | | #Magical | 1 |
-
Sorting the Hashtags: We sort the hashtags by frequency and then by alphabetic order if frequencies tie:
hashtag_counts = hashtag_counts.sort_values(by=["count", "hashtag"], ascending=[False, False])
Thus, the sorted list becomes:
| hashtag | count | |-------------------|-------| | #WinterWonderland | 2 | | #SnowDay | 2 | | #ValentinesDay | 1 | | #Love | 1 | | #Magical | 1 |
-
Selecting Top 3 Hashtags: We select the top three hashtags from the sorted list:
top_3_hashtags = hashtag_counts.head(3)
The final output is:
| hashtag | count | |-------------------|-------| | #WinterWonderland | 2 | | #SnowDay | 2 | | #ValentinesDay | 1 |
Thus, the top 3 trending hashtags for February 2024 are #WinterWonderland
, #SnowDay
, and #ValentinesDay
.
Solution Implementation
1import pandas as pd
2
3def find_trending_hashtags(tweets: pd.DataFrame) -> pd.DataFrame:
4 # Filter the tweets DataFrame to include only those tweets from February 2024
5 tweets_feb_2024 = tweets[tweets["tweet_date"].between("2024-02-01", "2024-02-29")]
6
7 # Extract hashtags from the tweets column using a regular expression
8 hashtags = tweets_feb_2024["tweet"].str.findall(r"#\w+")
9
10 # Flatten the list of lists of hashtags into a single list
11 all_hashtags = [tag for sublist in hashtags for tag in sublist]
12
13 # Count the frequency of each hashtag using `value_counts` and reset the index
14 hashtag_counts = pd.Series(all_hashtags).value_counts().reset_index()
15 hashtag_counts.columns = ["hashtag", "count"] # Rename the DataFrame columns
16
17 # Sort the hashtags by count in descending order, and break ties using alphabetical order
18 hashtag_counts = hashtag_counts.sort_values(
19 by=["count", "hashtag"], ascending=[False, False]
20 )
21
22 # Retrieve the top 3 hashtags with the highest count
23 top_3_hashtags = hashtag_counts.head(3)
24
25 return top_3_hashtags
26
1import java.util.*;
2import java.util.regex.*;
3import java.util.stream.*;
4import java.text.*;
5import java.time.*;
6import java.time.format.*;
7
8public class TrendingHashtags {
9
10 public static List<Map.Entry<String, Long>> findTrendingHashtags(List<Map<String, String>> tweets) {
11 // Define the date formatter for parsing the tweet dates
12 DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd");
13
14 // Filter the tweets to include only those from February 2024
15 List<Map<String, String>> tweetsFeb2024 = tweets.stream()
16 .filter(tweet -> {
17 LocalDate tweetDate = LocalDate.parse(tweet.get("tweet_date"), formatter);
18 LocalDate startDate = LocalDate.of(2024, Month.FEBRUARY, 1);
19 LocalDate endDate = LocalDate.of(2024, Month.FEBRUARY, 29);
20 return (tweetDate.isEqual(startDate) || tweetDate.isAfter(startDate)) && tweetDate.isBefore(endDate.plusDays(1));
21 })
22 .collect(Collectors.toList());
23
24 // Extract hashtags from the tweet text using a regular expression
25 List<String> allHashtags = new ArrayList<>();
26 Pattern hashtagPattern = Pattern.compile("#\\w+");
27
28 for (Map<String, String> tweet : tweetsFeb2024) {
29 Matcher matcher = hashtagPattern.matcher(tweet.get("tweet"));
30 while (matcher.find()) {
31 allHashtags.add(matcher.group());
32 }
33 }
34
35 // Count the frequency of each hashtag
36 Map<String, Long> hashtagCounts = allHashtags.stream()
37 .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
38
39 // Sort the hashtags first by count in descending order and then alphabetically
40 List<Map.Entry<String, Long>> sortedHashtags = hashtagCounts.entrySet().stream()
41 .sorted(Map.Entry.<String, Long>comparingByValue(Comparator.reverseOrder())
42 .thenComparing(Map.Entry.comparingByKey()))
43 .collect(Collectors.toList());
44
45 // Retrieve the top 3 hashtags
46 return sortedHashtags.subList(0, Math.min(3, sortedHashtags.size()));
47 }
48
49 public static void main(String[] args) {
50 // Example usage of findTrendingHashtags
51 List<Map<String, String>> tweets = List.of(
52 Map.of("tweet_date", "2024-02-10", "tweet", "What a game! #sports #game"),
53 Map.of("tweet_date", "2024-02-11", "tweet", "Loving the new features #tech #innovation"),
54 Map.of("tweet_date", "2024-02-15", "tweet", "Great meal tonight #foodie #culinary"),
55 Map.of("tweet_date", "2024-02-15", "tweet", "Another beautiful day #sunshine #nature"),
56 Map.of("tweet_date", "2024-02-16", "tweet", "Tech is amazing #technology #innovation")
57 );
58
59 List<Map.Entry<String, Long>> trendingHashtags = findTrendingHashtags(tweets);
60
61 // Print the top 3 trending hashtags
62 for (Map.Entry<String, Long> entry : trendingHashtags) {
63 System.out.println("Hashtag: " + entry.getKey() + ", Count: " + entry.getValue());
64 }
65 }
66}
67
1#include <iostream>
2#include <vector>
3#include <map>
4#include <algorithm>
5#include <regex>
6#include <string>
7#include <sstream>
8
9// Structure to hold tweet information
10struct Tweet {
11 std::string tweetText;
12 std::string tweetDate;
13};
14
15// Function to extract hashtags from a tweet using regex
16std::vector<std::string> extractHashtags(const std::string& tweet) {
17 std::regex hashtagRegex("#\\w+");
18 std::smatch matches;
19 std::vector<std::string> hashtags;
20
21 std::string::const_iterator searchStart(tweet.cbegin());
22 while (regex_search(searchStart, tweet.cend(), matches, hashtagRegex)) {
23 hashtags.push_back(matches[0]);
24 searchStart = matches.suffix().first;
25 }
26 return hashtags;
27}
28
29// Function to find trending hashtags from a list of tweets
30std::vector<std::pair<std::string, int>> findTrendingHashtags(const std::vector<Tweet>& tweets) {
31 std::map<std::string, int> hashtagFrequency;
32
33 // Iterate through tweets
34 for (const Tweet& tweet : tweets) {
35 // Check if tweet date is in February 2024
36 if (tweet.tweetDate.compare(0, 7, "2024-02") == 0) {
37 // Extract hashtags from tweet
38 std::vector<std::string> hashtags = extractHashtags(tweet.tweetText);
39
40 // Count hashtag frequency
41 for (const std::string& hashtag : hashtags) {
42 hashtagFrequency[hashtag]++;
43 }
44 }
45 }
46
47 // Convert map to vector of pairs and sort them by frequency and lexicographically
48 std::vector<std::pair<std::string, int>> hashtagCounts(hashtagFrequency.begin(), hashtagFrequency.end());
49
50 std::sort(hashtagCounts.begin(), hashtagCounts.end(), [](const auto& lhs, const auto& rhs) {
51 if (lhs.second == rhs.second) {
52 return lhs.first < rhs.first; // Sort lexicographically if counts are equal
53 }
54 return lhs.second > rhs.second; // Sort by count in descending order
55 });
56
57 // Get top 3 hashtags
58 if (hashtagCounts.size() > 3) {
59 hashtagCounts.resize(3);
60 }
61
62 return hashtagCounts;
63}
64
65int main() {
66 // Example usage
67 std::vector<Tweet> tweets = {
68 {"I love coding! #coding #fun", "2024-02-15"},
69 {"Another beautiful day! #sunshine #happy", "2024-02-20"},
70 {"Let's code more. #coding", "2024-02-25"},
71 {"Learn algorithms. #coding", "2024-02-27"}
72 };
73
74 std::vector<std::pair<std::string, int>> trendingHashtags = findTrendingHashtags(tweets);
75
76 // Display top 3 hashtags
77 for (const auto& pair : trendingHashtags) {
78 std::cout << pair.first << ": " << pair.second << std::endl;
79 }
80
81 return 0;
82}
83
1import * as pd from "pandas-js";
2
3// Define a function to find trending hashtags from a DataFrame of tweets
4function findTrendingHashtags(tweets: pd.DataFrame): pd.DataFrame {
5 // Filter tweets to select only those from February, 2024
6 const tweetsFeb2024 = tweets.get(tweets["tweet_date"].between("2024-02-01", "2024-02-29")._values);
7
8 // Extract hashtags from the 'tweet' column using a regular expression
9 const hashtags = tweetsFeb2024.get("tweet").apply((tweet: string) => {
10 return tweet.match(/#\w+/g) || [];
11 });
12
13 // Flatten the array of arrays of hashtags into a single array
14 const allHashtags: string[] = [];
15 hashtags.forEach((hashtagArray: string[]) => {
16 allHashtags.push(...hashtagArray);
17 });
18
19 // Count the frequency of each hashtag using a dictionary
20 const hashtagCounts: { [key: string]: number } = {};
21 allHashtags.forEach((tag: string) => {
22 if (hashtagCounts[tag]) {
23 hashtagCounts[tag]++;
24 } else {
25 hashtagCounts[tag] = 1;
26 }
27 });
28
29 // Convert the dictionary to an array of objects for sorting
30 let hashtagCountsArray = Object.keys(hashtagCounts).map((tag: string) => {
31 return { hashtag: tag, count: hashtagCounts[tag] };
32 });
33
34 // Sort hashtags by count in descending order, break ties using alphabetical order
35 hashtagCountsArray = hashtagCountsArray.sort((a, b) => {
36 if (b.count === a.count) {
37 return a.hashtag.localeCompare(b.hashtag);
38 }
39 return b.count - a.count;
40 });
41
42 // Select the top 3 hashtags
43 const top3Hashtags = hashtagCountsArray.slice(0, 3);
44
45 // Return the top 3 hashtags as a DataFrame
46 return new pd.DataFrame(top3Hashtags);
47}
48
Time and Space Complexity
The code performs the following operations:
-
Filtering Tweets: Filtering the DataFrame for dates in February 2024 involves checking each row in the
tweet_date
column, resulting in a time complexity ofO(n)
. -
Extracting Hashtags: Extracting hashtags using
str.findall
involves a regex operation on each tweet. Ifm
is the average number of characters per tweet, the time complexity isO(n * m)
. -
Flattening List of Hashtags: Flattening all hashtags into a single list is an operation with a time complexity of
O(k)
, wherek
is the total number of hashtags extracted. -
Counting Hashtags: Counting occurrences using
pandas.Series.value_counts
processes the list of hashtags, giving a time complexity ofO(k)
. -
Sorting Hashtag Counts: Sorting the results based on counts and hashtag values involves sorting the hashtags, which has a time complexity of
O(k log k)
. -
Selecting Top Hashtags: Selecting the top 3 hashtags involves slicing the DataFrame, which is
O(1)
.
Overall Time Complexity:
The dominant terms are extracting hashtags and sorting them. Therefore, the overall time complexity of the code is O(n * m + k log k)
.
Space Complexity:
-
Data Storage: Storing hashtags and their counts requires additional space. If
k
is the total number of hashtags, this space complexity will beO(k)
. -
Intermediate DataFrames and Lists: Additional space for filtered DataFrames and lists of hashtags also contributes, maintaining a space complexity of
O(k)
.
Overall Space Complexity:
The overall space complexity is O(k)
due to the storage of hashtags and counts.
Learn more about how to find time and space complexity quickly.
A person thinks of a number between 1 and 1000. You may ask any number questions to them, provided that the question can be answered with either "yes" or "no".
What is the minimum number of questions you needed to ask so that you are guaranteed to know the number that the person is thinking?
Recommended Readings
Coding Interview Patterns Your Personal Dijkstra's Algorithm to Landing Your Dream Job The goal of AlgoMonster is to help you get a job in the shortest amount of time possible in a data driven way We compiled datasets of tech interview problems and broke them down by patterns This way
Recursion Recursion is one of the most important concepts in computer science Simply speaking recursion is the process of a function calling itself Using a real life analogy imagine a scenario where you invite your friends to lunch https algomonster s3 us east 2 amazonaws com recursion jpg You first
Runtime Overview When learning about algorithms and data structures you'll frequently encounter the term time complexity This concept is fundamental in computer science and offers insights into how long an algorithm takes to complete given a certain input size What is Time Complexity Time complexity represents the amount of time
Want a Structured Path to Master System Design Too? Don’t Miss This!