
2882. Drop Duplicate Rows

Problem Description

You are given a DataFrame called customers with three columns: customer_id (integer), name (string), and email (string). The DataFrame contains customer information, but there's an issue - some customers share the same email address, resulting in duplicate email entries.

Your task is to clean this DataFrame by removing duplicate email addresses while keeping only the first occurrence of each unique email. When multiple rows have the same email, you should retain the row that appears first in the DataFrame and remove all subsequent rows with that same email.

For example, if customer_id 4 (Alice) and customer_id 5 (Finn) both have the email "john@example.com", you would keep Alice's record (since it appears first) and remove Finn's record.

The solution uses pandas' drop_duplicates() method with the subset=['email'] parameter, which:

  1. Identifies all rows with duplicate values in the 'email' column
  2. Keeps the first occurrence of each unique email (default behavior)
  3. Removes all subsequent duplicate occurrences
  4. Returns the cleaned DataFrame

The resulting DataFrame will have unique email addresses, with each email appearing exactly once, preserving the original order and keeping the first customer associated with each email address.


Intuition

When we need to handle duplicate data, the key insight is recognizing that pandas provides built-in functionality specifically designed for this common data cleaning task. The problem explicitly asks us to remove duplicate emails while keeping the first occurrence - this is a standard deduplication operation.

The thought process flows naturally:

  1. We need to identify which column contains the duplicates → the email column
  2. We want to keep some rows and remove others based on this column → this suggests filtering or dropping operations
  3. The requirement to keep the "first occurrence" is a crucial hint → pandas typically preserves the first instance by default in deduplication operations

Rather than manually iterating through rows and tracking which emails we've seen (which would be inefficient), we can leverage pandas' drop_duplicates() method. This method is designed exactly for this scenario - it examines specified columns for duplicate values and intelligently removes redundant rows.
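For contrast, the manual bookkeeping described above might look like the following sketch (`drop_duplicate_emails_manual` is a hypothetical name, not part of the problem):

```python
import pandas as pd

def drop_duplicate_emails_manual(customers: pd.DataFrame) -> pd.DataFrame:
    # Walk the rows in order, keeping a row only the first time its email appears
    seen = set()
    keep_mask = []
    for email in customers['email']:
        keep_mask.append(email not in seen)
        seen.add(email)
    # The boolean mask selects only first occurrences, preserving row order
    return customers[keep_mask]
```

drop_duplicates() performs the same bookkeeping in optimized native code, which is why the built-in is preferred over an explicit loop.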

The beauty of drop_duplicates(subset=['email']) is its simplicity and efficiency. By specifying subset=['email'], we tell pandas to only consider the email column when determining duplicates, ignoring whether the customer_id or name might be different. The method automatically keeps the first row it encounters for each unique email value and drops all subsequent rows with the same email, which perfectly matches our requirements.

This approach is both intuitive and optimal because it uses a single, purpose-built function rather than complex logic with loops or conditional statements. It's a direct translation of the problem statement into code: "remove duplicate emails, keep the first one" becomes drop_duplicates(subset=['email']).

Solution Approach

The implementation leverages pandas' built-in drop_duplicates() method to efficiently remove duplicate email entries from the DataFrame. Here's how the solution works step by step:

Implementation:

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    return customers.drop_duplicates(subset=['email'])

Breaking down the implementation:

  1. Function Input: The function receives a pandas DataFrame customers containing customer data with columns: customer_id, name, and email.

  2. Core Operation - drop_duplicates():

    • This method scans through the DataFrame to identify duplicate values
    • The subset=['email'] parameter tells pandas to only consider the 'email' column when checking for duplicates
    • Without this parameter, it would check if entire rows are identical across all columns
  3. Default Behavior - Keep First:

    • drop_duplicates() has an implicit parameter keep='first' (the default value)
    • This means when duplicates are found, the first occurrence is retained
    • The method preserves the original row order, so "first" means the row with the smallest index
  4. Internal Algorithm:

    • Pandas internally creates a hash table of email values as it iterates through the DataFrame
    • For each row, it checks if the email has been seen before
    • If the email is new, the row is marked to keep
    • If the email already exists, the row is marked for removal
    • Finally, it returns a new DataFrame containing only the rows marked to keep
  5. Return Value:

    • The method returns a new DataFrame with duplicate emails removed
    • The original DataFrame remains unchanged (non-destructive operation)
    • The resulting DataFrame maintains the original structure and column names

This single-line solution is both memory-efficient and time-efficient, with a time complexity of O(n) where n is the number of rows in the DataFrame.
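The non-destructive behavior noted in step 5 can be checked directly (a minimal sketch with made-up sample data):

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 2, 3],
    'email': ['a@x.com', 'b@x.com', 'a@x.com'],
})
deduped = df.drop_duplicates(subset=['email'])
# The original DataFrame still has all 3 rows; the returned one has 2
print(len(df), len(deduped))  # 3 2
```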


Example Walkthrough

Let's walk through a concrete example to see how the solution works:

Initial DataFrame (customers):

| customer_id | name    | email              |
|-------------|---------|-------------------|
| 1           | Alice   | alice@email.com   |
| 2           | Bob     | bob@email.com     |
| 3           | Charlie | alice@email.com   |
| 4           | David   | david@email.com   |
| 5           | Eve     | bob@email.com     |

Step 1: Identify Duplicates in the Email Column

The algorithm scans the 'email' column and identifies which emails appear more than once:

  • alice@email.com appears at rows 1 and 3
  • bob@email.com appears at rows 2 and 5
  • david@email.com appears only at row 4

Step 2: Mark First Occurrences to Keep

For each unique email, the first occurrence is marked to keep:

  • Row 1 (Alice with alice@email.com) - KEEP (first occurrence)
  • Row 2 (Bob with bob@email.com) - KEEP (first occurrence)
  • Row 3 (Charlie with alice@email.com) - DROP (duplicate of row 1)
  • Row 4 (David with david@email.com) - KEEP (only occurrence)
  • Row 5 (Eve with bob@email.com) - DROP (duplicate of row 2)

Step 3: Apply drop_duplicates(subset=['email'])

When we execute customers.drop_duplicates(subset=['email']), pandas removes rows 3 and 5.

Final Result:

| customer_id | name  | email            |
|-------------|-------|------------------|
| 1           | Alice | alice@email.com  |
| 2           | Bob   | bob@email.com    |
| 4           | David | david@email.com  |

Notice that:

  • Each email now appears exactly once
  • We kept Alice (not Charlie) for alice@email.com because Alice appeared first
  • We kept Bob (not Eve) for bob@email.com because Bob appeared first
  • The relative order of remaining rows is preserved (1, 2, 4)

This demonstrates how the solution efficiently removes duplicate emails while maintaining the first customer associated with each unique email address.
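The walkthrough above can be reproduced directly with the same sample data as the tables:

```python
import pandas as pd

customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'David', 'Eve'],
    'email': ['alice@email.com', 'bob@email.com', 'alice@email.com',
              'david@email.com', 'bob@email.com'],
})
result = customers.drop_duplicates(subset=['email'])
print(result['customer_id'].tolist())  # [1, 2, 4]
```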

Solution Implementation

Python

import pandas as pd


def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    """
    Remove duplicate rows based on the 'email' column of the customers DataFrame.

    Args:
        customers: A pandas DataFrame containing customer information with an 'email' column

    Returns:
        A pandas DataFrame with duplicate emails removed (keeping the first occurrence)
    """
    # Drop duplicate rows based on the 'email' column.
    # By default, keeps the first occurrence and removes subsequent duplicates.
    return customers.drop_duplicates(subset=['email'])

Java

import java.util.*;

public class Solution {
    /**
     * Remove duplicate rows based on the email field from the customers list.
     *
     * @param customers A list of Customer objects containing customer information with an email field
     * @return A list of Customer objects with duplicate emails removed (keeping the first occurrence)
     */
    public List<Customer> dropDuplicateEmails(List<Customer> customers) {
        // Set to track emails we've already seen
        Set<String> seenEmails = new HashSet<>();

        // List to store the result with duplicates removed
        List<Customer> result = new ArrayList<>();

        // Keep only the first occurrence of each email
        for (Customer customer : customers) {
            // add() returns true only for emails not seen before
            if (seenEmails.add(customer.getEmail())) {
                result.add(customer);
            }
        }

        return result;
    }

    /**
     * Customer class to represent a row in the DataFrame
     */
    public static class Customer {
        private String email;
        // Other fields can be added as needed

        public Customer(String email) {
            this.email = email;
        }

        public String getEmail() {
            return email;
        }

        public void setEmail(String email) {
            this.email = email;
        }
    }
}

Approach 2: Using Java Streams (more concise)

import java.util.*;
import java.util.stream.Collectors;

public class Solution {
    /**
     * Remove duplicate rows based on the "email" key from the customers list.
     *
     * @param customers A list of Map objects representing customer records with an "email" key
     * @return A list of Map objects with duplicate emails removed (keeping the first occurrence)
     */
    public List<Map<String, Object>> dropDuplicateEmails(List<Map<String, Object>> customers) {
        // Set to track emails we've already seen
        // (a plain HashSet suffices; the sequential stream preserves encounter order)
        Set<String> seenEmails = new HashSet<>();

        // Filter the stream, keeping only the first occurrence of each email
        return customers.stream()
            .filter(customer -> {
                String email = (String) customer.get("email");
                // add() returns true only if this email has not been seen before
                return seenEmails.add(email);
            })
            .collect(Collectors.toList());
    }
}

C++

#include <string>
#include <unordered_set>
#include <vector>

// Structure to represent a customer record
struct Customer {
    int id;
    std::string name;
    std::string email;
    // Add other fields as needed
};

// Class to represent a DataFrame-like structure for customers
class CustomersDataFrame {
public:
    std::vector<Customer> data;

    // Constructor
    CustomersDataFrame(const std::vector<Customer>& customers) : data(customers) {}

    // Default constructor
    CustomersDataFrame() {}
};

/**
 * Remove duplicate rows based on the email field from the customers DataFrame.
 *
 * @param customers A CustomersDataFrame containing customer information with an 'email' field
 * @return A CustomersDataFrame with duplicate emails removed (keeping the first occurrence)
 */
CustomersDataFrame dropDuplicateEmails(const CustomersDataFrame& customers) {
    // Result DataFrame to store unique customers
    CustomersDataFrame result;

    // Set to track emails we've already seen
    std::unordered_set<std::string> seen_emails;

    // Iterate through all customers in the input DataFrame
    for (const auto& customer : customers.data) {
        // insert() reports whether the email was newly added
        if (seen_emails.insert(customer.email).second) {
            // First occurrence: keep this customer
            result.data.push_back(customer);
        }
        // Otherwise the email is a duplicate and the row is skipped
    }

    // Return the DataFrame with duplicates removed
    return result;
}

TypeScript

// pandas is a Python-only library; in TypeScript we operate on an array of records instead.

interface Customer {
    customer_id: number;
    name: string;
    email: string;
}

/**
 * Remove duplicate records based on the email field, keeping the first occurrence.
 *
 * @param customers - An array of customer records with an 'email' field
 * @returns An array with duplicate emails removed (keeping the first occurrence)
 */
function dropDuplicateEmails(customers: Customer[]): Customer[] {
    // Set to track unique emails we've already seen
    const seenEmails = new Set<string>();

    // Array to store the result with duplicates removed
    const result: Customer[] = [];

    for (const customer of customers) {
        // Keep this record only if its email has not been seen before
        if (!seenEmails.has(customer.email)) {
            seenEmails.add(customer.email);
            result.push(customer);
        }
        // Subsequent occurrences of the same email are skipped
    }

    return result;
}


Time and Space Complexity

Time Complexity: O(n) where n is the number of rows in the DataFrame.

The drop_duplicates() method needs to scan through all rows in the DataFrame to identify duplicates based on the 'email' column. Internally, pandas uses a hash table to track seen values, which provides O(1) average-case lookup time for each row. Therefore, processing all n rows results in O(n) time complexity.

Space Complexity: O(n) in the worst case.

The space complexity consists of:

  • The hash table used internally by drop_duplicates() to track unique email values: O(k) where k is the number of unique emails (at most n)
  • The returned DataFrame: O(k) where k is the number of unique rows retained after deduplication
  • In the worst case where all emails are unique, k = n, resulting in O(n) space complexity
  • Note that this analysis assumes the operation creates a new DataFrame rather than modifying in-place

Common Pitfalls

1. Modifying the Original DataFrame Unintentionally

A common mistake is assuming that drop_duplicates() modifies the DataFrame in place. By default, it returns a new DataFrame, leaving the original unchanged.

Incorrect approach:

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    customers.drop_duplicates(subset=['email'])  # This doesn't modify 'customers'
    return customers  # Returns the original DataFrame with duplicates still present

Correct approach:

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    return customers.drop_duplicates(subset=['email'])  # Return the new DataFrame

Alternative with in-place modification:

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    customers.drop_duplicates(subset=['email'], inplace=True)
    return customers  # Now returns the modified DataFrame

2. Case Sensitivity in Email Comparison

Email addresses are usually treated as case-insensitive in practice (john@example.com and JOHN@example.com typically reach the same inbox), but drop_duplicates() compares them as raw strings and therefore treats them as different values.

Problem scenario:

# These would be considered different emails:
# "john@example.com"
# "John@example.com"
# "JOHN@EXAMPLE.COM"

Solution - Normalize emails before deduplication:

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Build a lowercase helper column on a copy (assign does not modify the caller's DataFrame)
    result = customers.assign(email_lower=customers['email'].str.lower())
    # Drop duplicates based on the normalized column
    result = result.drop_duplicates(subset=['email_lower'])
    # Remove the temporary helper column
    return result.drop(columns=['email_lower'])

3. Handling NULL/NaN Values in Email Column

If the email column contains NaN or None values, drop_duplicates() treats each NaN as unique, potentially keeping multiple rows with missing emails.

Problem scenario:

# Multiple rows with NaN emails would all be kept:
# customer_id=1, email=NaN
# customer_id=2, email=NaN  # This would also be kept!

Solution - Remove or handle NaN values explicitly:

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Option 1: Remove rows with NaN emails first
    customers = customers.dropna(subset=['email'])
    return customers.drop_duplicates(subset=['email'])

def dropDuplicateEmailsKeepOneNaN(customers: pd.DataFrame) -> pd.DataFrame:
    # Option 2: Collapse all NaN emails into a single placeholder before deduplication,
    # so at most one row with a missing email survives
    customers = customers.assign(email=customers['email'].fillna('NO_EMAIL'))
    return customers.drop_duplicates(subset=['email'])

4. Keeping the Wrong Occurrence

Sometimes business logic requires keeping the last occurrence or a specific occurrence based on other criteria (e.g., most recent customer_id).

Solution - Specify the keep parameter:

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Keep the last occurrence instead of the first
    return customers.drop_duplicates(subset=['email'], keep='last')

def dropDuplicateEmailsLowestId(customers: pd.DataFrame) -> pd.DataFrame:
    # Or sort by customer_id first, so the lowest ID is kept for each email
    customers_sorted = customers.sort_values('customer_id')
    return customers_sorted.drop_duplicates(subset=['email'])

5. Leading/Trailing Whitespace in Emails

Emails with extra spaces might not be recognized as duplicates.

Problem scenario:

# These would be considered different:
# "john@example.com"
# " john@example.com"
# "john@example.com "

Solution - Strip whitespace before deduplication:

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame:
    # Strip surrounding whitespace on a copy before comparing
    # (assign avoids mutating the caller's DataFrame)
    customers = customers.assign(email=customers['email'].str.strip())
    return customers.drop_duplicates(subset=['email'])
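Where several of these pitfalls apply at once, they can be combined into one defensive variant. The following is a sketch; the particular normalization choices (dropping NaN rows, lowercasing, stripping whitespace) are assumptions beyond the original problem statement:

```python
import pandas as pd

def drop_duplicate_emails_robust(customers: pd.DataFrame) -> pd.DataFrame:
    # Drop rows with missing emails, then normalize case and whitespace
    cleaned = customers.dropna(subset=['email'])
    cleaned = cleaned.assign(email_key=cleaned['email'].str.strip().str.lower())
    # Deduplicate on the normalized key, then discard the helper column
    cleaned = cleaned.drop_duplicates(subset=['email_key'])
    return cleaned.drop(columns=['email_key'])
```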