2882. Drop Duplicate Rows
Problem Description
You are given a DataFrame called customers with three columns: customer_id (integer), name (string), and email (string). The DataFrame contains customer information, but there's an issue: some customers share the same email address, resulting in duplicate email entries.
Your task is to clean this DataFrame by removing duplicate email addresses while keeping only the first occurrence of each unique email. When multiple rows have the same email, you should retain the row that appears first in the DataFrame and remove all subsequent rows with that same email.
For example, if customer_id 4 (Alice) and customer_id 5 (Finn) both have the email "john@example.com", you would keep Alice's record (since it appears first) and remove Finn's record.
The solution uses pandas' drop_duplicates() method with the subset=['email'] parameter, which:
- Identifies all rows with duplicate values in the 'email' column
- Keeps the first occurrence of each unique email (default behavior)
- Removes all subsequent duplicate occurrences
- Returns the cleaned DataFrame
The resulting DataFrame will have unique email addresses, with each email appearing exactly once, preserving the original order and keeping the first customer associated with each email address.
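To make this concrete, here is a minimal runnable sketch (the two-row sample data is assumed purely for illustration):

import pandas as pd

customers = pd.DataFrame({
    'customer_id': [4, 5],
    'name': ['Alice', 'Finn'],
    'email': ['john@example.com', 'john@example.com'],
})

# Alice's row is kept; Finn's duplicate email is dropped
print(customers.drop_duplicates(subset=['email']))
#    customer_id   name             email
# 0            4  Alice  john@example.com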
Intuition
When we need to handle duplicate data, the key insight is recognizing that pandas provides built-in functionality specifically designed for this common data cleaning task. The problem explicitly asks us to remove duplicate emails while keeping the first occurrence - this is a standard deduplication operation.
The thought process flows naturally:
- We need to identify which column contains the duplicates → the email column
- We want to keep some rows and remove others based on this column → this suggests filtering or dropping operations
- The requirement to keep the "first occurrence" is a crucial hint → pandas preserves the first instance by default in deduplication operations
Rather than manually iterating through rows and tracking which emails we've seen (which would be inefficient), we can leverage pandas' drop_duplicates() method. This method is designed exactly for this scenario: it examines specified columns for duplicate values and removes redundant rows.

The beauty of drop_duplicates(subset=['email']) is its simplicity and efficiency. By specifying subset=['email'], we tell pandas to consider only the email column when determining duplicates, ignoring whether the customer_id or name might differ. The method automatically keeps the first row it encounters for each unique email value and drops all subsequent rows with the same email, which perfectly matches our requirements.

This approach is both intuitive and optimal because it uses a single, purpose-built function rather than complex logic with loops or conditional statements. It's a direct translation of the problem statement into code: "remove duplicate emails, keep the first one" becomes drop_duplicates(subset=['email']).
Solution Approach
The implementation leverages pandas' built-in drop_duplicates() method to efficiently remove duplicate email entries from the DataFrame. Here's how the solution works step by step:
Method Signature:
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    return customers.drop_duplicates(subset=['email'])
Breaking down the implementation:

1. Function Input: The function receives a pandas DataFrame customers containing customer data with the columns customer_id, name, and email.

2. Core Operation - drop_duplicates():
   - This method scans through the DataFrame to identify duplicate values
   - The subset=['email'] parameter tells pandas to consider only the 'email' column when checking for duplicates
   - Without this parameter, it would check whether entire rows are identical across all columns

3. Default Behavior - Keep First:
   - drop_duplicates() has an implicit parameter keep='first' (the default value)
   - When duplicates are found, the first occurrence is retained
   - The method preserves the original row order, so "first" means the row with the smallest index

4. Internal Algorithm:
   - Pandas internally builds a hash table of email values as it iterates through the DataFrame
   - For each row, it checks whether the email has been seen before
   - If the email is new, the row is marked to keep
   - If the email already exists, the row is marked for removal
   - Finally, a new DataFrame containing only the rows marked to keep is returned

5. Return Value (see the short demonstration after this list):
   - The method returns a new DataFrame with duplicate emails removed
   - The original DataFrame remains unchanged (non-destructive operation)
   - The resulting DataFrame maintains the original structure and column names
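The following short demonstration (sample data assumed) shows both the default keep='first' behavior and the fact that the original DataFrame is left untouched:

import pandas as pd

df = pd.DataFrame({'email': ['a@x.com', 'b@x.com', 'a@x.com']})
deduped = df.drop_duplicates(subset=['email'])  # keep='first' is the default

print(len(df))       # 3 -- the original DataFrame is unchanged
print(len(deduped))  # 2 -- the second 'a@x.com' row was dropped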
This single-line solution is both memory-efficient and time-efficient, with a time complexity of O(n) where n is the number of rows in the DataFrame.
Example Walkthrough
Let's walk through a concrete example to see how the solution works:
Initial DataFrame (customers):

| customer_id | name    | email           |
|-------------|---------|-----------------|
| 1           | Alice   | alice@email.com |
| 2           | Bob     | bob@email.com   |
| 3           | Charlie | alice@email.com |
| 4           | David   | david@email.com |
| 5           | Eve     | bob@email.com   |
Step 1: Identify Duplicates in Email Column
The algorithm scans the 'email' column and identifies which emails appear more than once:
- alice@email.com appears at rows 1 and 3
- bob@email.com appears at rows 2 and 5
- david@email.com appears only at row 4
Step 2: Mark First Occurrences to Keep
For each unique email, the first occurrence is marked to keep:
- Row 1 (Alice with alice@email.com) - KEEP (first occurrence)
- Row 2 (Bob with bob@email.com) - KEEP (first occurrence)
- Row 3 (Charlie with alice@email.com) - DROP (duplicate of row 1)
- Row 4 (David with david@email.com) - KEEP (only occurrence)
- Row 5 (Eve with bob@email.com) - DROP (duplicate of row 2)
Step 3: Apply drop_duplicates(subset=['email'])
When we execute customers.drop_duplicates(subset=['email']), pandas removes rows 3 and 5.
Final Result:

| customer_id | name  | email           |
|-------------|-------|-----------------|
| 1           | Alice | alice@email.com |
| 2           | Bob   | bob@email.com   |
| 4           | David | david@email.com |
Notice that:
- Each email now appears exactly once
- We kept Alice (not Charlie) for alice@email.com because Alice appeared first
- We kept Bob (not Eve) for bob@email.com because Bob appeared first
- The relative order of remaining rows is preserved (1, 2, 4)
This demonstrates how the solution efficiently removes duplicate emails while maintaining the first customer associated with each unique email address.
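A quick way to verify the walkthrough end to end (using the sample data above):

import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'email': ['alice@email.com', 'bob@email.com', 'alice@email.com',
              'david@email.com', 'bob@email.com'],
})

result = customers.drop_duplicates(subset=['email'])
print(result['customer_id'].tolist())  # [1, 2, 4]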
Solution Implementation
Python:

import pandas as pd


def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    """
    Remove duplicate rows based on the email column from the customers DataFrame.

    Args:
        customers: A pandas DataFrame containing customer information with an 'email' column

    Returns:
        A pandas DataFrame with duplicate emails removed (keeping the first occurrence)
    """
    # Drop duplicate rows based on the 'email' column
    # By default, keeps the first occurrence and removes subsequent duplicates
    return customers.drop_duplicates(subset=['email'])
Java:

import java.util.*;

public class Solution {
    /**
     * Remove duplicate rows based on the email column from the customers list.
     *
     * @param customers A list of Customer objects containing customer information with an email field
     * @return A list of Customer objects with duplicate emails removed (keeping the first occurrence)
     */
    public List<Customer> dropDuplicateEmails(List<Customer> customers) {
        // Set to track emails we've already seen
        Set<String> seenEmails = new HashSet<>();

        // List to store the result with duplicates removed
        List<Customer> result = new ArrayList<>();

        // Iterate through customers and keep only the first occurrence of each email
        for (Customer customer : customers) {
            // add() returns true only if the email was not already in the set
            if (seenEmails.add(customer.getEmail())) {
                result.add(customer);
            }
        }

        return result;
    }

    /**
     * Customer class to represent a row in the DataFrame
     */
    public static class Customer {
        private String email;
        // Other fields can be added as needed

        public Customer(String email) {
            this.email = email;
        }

        public String getEmail() {
            return email;
        }

        public void setEmail(String email) {
            this.email = email;
        }
    }
}
Approach 2: Using Java Streams (more concise)
import java.util.*;
import java.util.stream.Collectors;

public class Solution {
    /**
     * Remove duplicate rows based on the email column from the customers list.
     *
     * @param customers A list of Map objects representing customer records with an "email" key
     * @return A list of Map objects with duplicate emails removed (keeping the first occurrence)
     */
    public List<Map<String, Object>> dropDuplicateEmails(List<Map<String, Object>> customers) {
        // Set to track emails we've already seen; the stream itself preserves
        // the original encounter order of the records
        Set<String> seenEmails = new HashSet<>();

        // Filter the customers list to keep only the first occurrence of each email
        return customers.stream()
                .filter(customer -> {
                    // Extract the email from the current customer map
                    String email = (String) customer.get("email");
                    // Keep this customer only if we haven't seen this email before
                    return seenEmails.add(email);
                })
                .collect(Collectors.toList());
    }
}
C++:

#include <vector>
#include <string>
#include <unordered_set>

// Structure to represent a customer record
struct Customer {
    int id;
    std::string name;
    std::string email;
    // Add other fields as needed
};

// Class to represent a DataFrame-like structure for customers
class CustomersDataFrame {
public:
    std::vector<Customer> data;

    // Constructor
    CustomersDataFrame(const std::vector<Customer>& customers) : data(customers) {}

    // Default constructor
    CustomersDataFrame() {}
};

/**
 * Remove duplicate rows based on the email column from the customers DataFrame.
 *
 * @param customers A CustomersDataFrame containing customer information with an 'email' field
 * @return A CustomersDataFrame with duplicate emails removed (keeping the first occurrence)
 */
CustomersDataFrame dropDuplicateEmails(const CustomersDataFrame& customers) {
    // Result DataFrame to store unique customers
    CustomersDataFrame result;

    // Set to track emails we've already seen
    std::unordered_set<std::string> seen_emails;

    // Iterate through all customers in the input DataFrame
    for (const auto& customer : customers.data) {
        // Keep this customer only if the email has not been seen before
        if (seen_emails.find(customer.email) == seen_emails.end()) {
            seen_emails.insert(customer.email);
            result.data.push_back(customer);
        }
        // If the email already exists, skip this customer (remove duplicate)
    }

    // Return the DataFrame with duplicates removed
    return result;
}
TypeScript:

// pandas is a Python library; in TypeScript we model the data as an
// array of plain record objects instead

/**
 * Remove duplicate rows based on the email column from the customer records.
 *
 * @param customers - An array of customer records, each with an 'email' property
 * @returns An array with duplicate emails removed (keeping the first occurrence)
 */
function dropDuplicateEmails(customers: any[]): any[] {
    // Set to track unique emails we've already seen
    const seenEmails = new Set<string>();

    // Array to store the result with duplicates removed
    const result: any[] = [];

    // Iterate through each customer record
    for (const customer of customers) {
        // Keep this customer only if we haven't seen the email before
        if (!seenEmails.has(customer.email)) {
            seenEmails.add(customer.email);
            result.push(customer);
        }
        // Skip subsequent occurrences of the same email
    }

    // Return the filtered array with duplicate emails removed
    return result;
}
Time and Space Complexity
Time Complexity: O(n), where n is the number of rows in the DataFrame.

The drop_duplicates() method scans through all rows in the DataFrame to identify duplicates in the 'email' column. Internally, pandas uses a hash table to track seen values, which provides O(1) average-case lookup time for each row. Processing all n rows therefore results in O(n) time complexity.
Space Complexity: O(n) in the worst case.

The space complexity consists of:
- The hash table used internally by drop_duplicates() to track unique email values: O(k), where k is the number of unique emails (at most n)
- The returned DataFrame: O(k), where k is the number of rows retained after deduplication
- In the worst case, where all emails are unique, k = n, giving O(n) space complexity
- Note that this analysis assumes the operation creates a new DataFrame rather than modifying in place
Common Pitfalls
1. Modifying the Original DataFrame Unintentionally
A common mistake is assuming that drop_duplicates() modifies the DataFrame in place. By default, it returns a new DataFrame, leaving the original unchanged.
Incorrect approach:
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    customers.drop_duplicates(subset=['email'])  # This doesn't modify 'customers'
    return customers  # Returns the original DataFrame, duplicates still present
Correct approach:
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    return customers.drop_duplicates(subset=['email'])  # Return the new DataFrame
Alternative with in-place modification:
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    customers.drop_duplicates(subset=['email'], inplace=True)
    return customers  # Now returns the modified DataFrame
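As a side note, the pandas maintainers generally discourage inplace=True; returning a new DataFrame, as in the correct approach above, is the more idiomatic choice.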
2. Case Sensitivity in Email Comparison
Email addresses are effectively case-insensitive in practice (john@example.com and JOHN@example.com usually reach the same inbox), but drop_duplicates() compares them as distinct string values.
Problem scenario:
# These would be considered different emails:
# "john@example.com"
# "John@example.com"
# "JOHN@EXAMPLE.COM"
Solution - Normalize emails before deduplication:
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Create a temporary lowercase email column for comparison
    customers['email_lower'] = customers['email'].str.lower()
    # Drop duplicates based on the normalized column
    result = customers.drop_duplicates(subset=['email_lower'])
    # Remove the temporary column
    return result.drop(columns=['email_lower'])
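An equivalent variant worth knowing (a sketch using a boolean mask, so no temporary column is needed):

import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # duplicated() marks every occurrence after the first as True;
    # negating the mask keeps only the first row for each lowercased email
    return customers[~customers['email'].str.lower().duplicated()]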
3. Handling NULL/NaN Values in Email Column
If the email column contains NaN or None values, be aware that drop_duplicates() treats all missing values as equal to one another: only the first row with a missing email is kept, and every later row with a missing email is dropped, even though those rows are not duplicates of any real address.
Problem scenario:
# Of multiple rows with NaN emails, only the first survives:
# customer_id=1, email=NaN   <- kept
# customer_id=2, email=NaN   <- dropped, though it isn't a true duplicate
Solution - Remove or handle NaN values explicitly:
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Option 1: Remove rows with missing emails first (if such records are invalid)
    customers = customers.dropna(subset=['email'])
    return customers.drop_duplicates(subset=['email'])

def dropDuplicateEmailsKeepMissing(customers: pd.DataFrame) -> pd.DataFrame:
    # Option 2: Deduplicate only the rows that have an email,
    # and keep every row whose email is missing
    has_email = customers['email'].notna()
    deduped = customers[has_email].drop_duplicates(subset=['email'])
    return pd.concat([deduped, customers[~has_email]]).sort_index()
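A quick check of this NaN behavior (a small sketch; worth verifying on your pandas version):

import pandas as pd
import numpy as np

df = pd.DataFrame({'customer_id': [1, 2, 3],
                   'email': ['a@x.com', np.nan, np.nan]})

# Row 3's missing email is treated as a duplicate of row 2's
print(df.drop_duplicates(subset=['email'])['customer_id'].tolist())  # [1, 2]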
4. Keeping the Wrong Occurrence
Sometimes business logic requires keeping the last occurrence or a specific occurrence based on other criteria (e.g., most recent customer_id).
Solution - Specify the keep parameter:
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Keep the last occurrence instead of the first
    return customers.drop_duplicates(subset=['email'], keep='last')

def dropDuplicateEmailsLowestId(customers: pd.DataFrame) -> pd.DataFrame:
    # Or sort by customer_id first so the lowest ID is kept for each email
    customers_sorted = customers.sort_values('customer_id')
    return customers_sorted.drop_duplicates(subset=['email'])
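A small demonstration of the difference (sample data assumed):

import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'email': ['a@x.com', 'a@x.com', 'b@x.com'],
})

print(df.drop_duplicates(subset=['email'])['customer_id'].tolist())
# [1, 3] -- keep='first' (the default)
print(df.drop_duplicates(subset=['email'], keep='last')['customer_id'].tolist())
# [2, 3] -- keep='last'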
5. Leading/Trailing Whitespace in Emails
Emails with extra spaces might not be recognized as duplicates.
Problem scenario:
# These would be considered different:
# "john@example.com"
# " john@example.com"
# "john@example.com "
Solution - Strip whitespace before deduplication:
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Clean the email column by stripping whitespace
    customers['email'] = customers['email'].str.strip()
    return customers.drop_duplicates(subset=['email'])
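In practice, pitfalls 2 and 5 are often handled together. One possible combined normalization (a sketch, not part of the original solution):

import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Strip whitespace and lowercase before comparing, without
    # modifying the stored email values
    normalized = customers['email'].str.strip().str.lower()
    return customers[~normalized.duplicated()]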