2883. Drop Missing Data
Problem Description
You are given a DataFrame called students with three columns:
- student_id: an integer representing the unique identifier for each student
- name: a string (object type) containing the student's name
- age: an integer representing the student's age
Some rows in this DataFrame have missing values (null/NaN) in the name column. Your task is to clean the data by removing every row where the name column is missing.
For example, if your DataFrame looks like this:
student_id | name | age
1 | "Alice" | 20
2 | null | 21
3 | "Bob" | 22
After removing rows with missing names, the result should be:
student_id | name | age
1 | "Alice" | 20
3 | "Bob" | 22
The solution uses pandas' notnull() method to filter the DataFrame. The expression students['name'].notnull() creates a boolean mask that is True for rows where the name is not null and False for rows where the name is missing. By using this mask to index the DataFrame (students[...]), we keep only the rows where the name column has valid values.
Intuition
When working with DataFrames that contain missing data, we need a way to identify and filter out incomplete records. The key insight is that pandas provides built-in methods to detect null values, which makes this task straightforward.
Think of the DataFrame as a table where each row represents a complete student record. If a student's name is missing, that record is incomplete and should be excluded from our clean dataset.
The natural approach is to:
- First identify which rows have valid (non-null) names
- Then keep only those rows
This leads us to boolean indexing, a pandas feature where we build a True/False mask with one entry per row. The notnull() method does exactly what we need: it returns True for rows where the name exists and False where it's missing.
When we apply this mask to the original DataFrame with bracket notation, students[boolean_mask], pandas drops every row where the mask is False and returns only the rows with complete name information.
This approach is more elegant and efficient than manually iterating through rows or using explicit loops. It leverages pandas' vectorized operations, which are optimized for performance on large datasets.
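To make the comparison with explicit loops concrete, here is a minimal sketch that filters the same rows both ways and checks that the results agree. The sample data and the variable names (keep_labels, looped, masked) are illustrative assumptions, not part of the original problem.

```python
import pandas as pd

# Hypothetical sample data; None marks a missing name.
students = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Alice", None, "Bob"],
    "age": [20, 21, 22],
})

# Explicit loop: collect the index labels of rows whose name is present.
keep_labels = [
    label for label, row in students.iterrows() if pd.notna(row["name"])
]
looped = students.loc[keep_labels]

# Vectorized: one boolean mask computed over the whole column.
masked = students[students["name"].notnull()]

print(looped.equals(masked))  # True
```

Both versions select the same rows; the mask version simply avoids visiting each row in Python code.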
Solution Approach
The implementation uses pandas' built-in filtering mechanism with boolean indexing. Here's how the solution works step by step:
- Access the name column: We first access the name column of the DataFrame using students['name']. This gives us a pandas Series containing all the name values.
- Apply the notnull() method: We call notnull() on this Series: students['name'].notnull(). This method checks each element in the Series and returns a boolean Series of the same length, where:
  - True indicates the value is not null (valid name)
  - False indicates the value is null (missing name)
- Use boolean indexing: We use the boolean Series as a filter by passing it inside square brackets: students[students['name'].notnull()]. This is pandas' boolean indexing pattern, which returns a new DataFrame containing only the rows where the corresponding boolean value is True.
The complete implementation in one line:
return students[students['name'].notnull()]
This pattern is efficient because:
- It operates on the entire column at once (vectorized operation)
- No explicit loops are needed
- The null check and the row selection run in compiled code inside pandas and NumPy rather than in a Python-level loop, which makes them fast
- It returns a new DataFrame without modifying the original
An alternative is the dropna() method, restricted to the name column with the subset argument:
return students.dropna(subset=['name'])
Both approaches achieve the same result, but the boolean indexing approach shown in the solution is more explicit about what condition we're filtering on, making the code more readable and maintainable.
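The equivalence of the two approaches can be checked directly. The sketch below uses hypothetical data in which one row also lacks an age (the problem itself only has missing names), to show why the subset argument matters: dropna() with no arguments drops a row if any column is missing.

```python
import pandas as pd

# Hypothetical sample data: row 2 lacks a name, row 3 lacks an age.
students = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Alice", None, "Bob"],
    "age": [20, 21, None],
})

by_mask = students[students["name"].notnull()]
by_dropna = students.dropna(subset=["name"])
print(by_mask.equals(by_dropna))  # True

# Without subset, any missing value in any column drops the row.
any_missing = students.dropna()
print(any_missing["student_id"].tolist())  # [1]
```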
Example Walkthrough
Let's walk through a concrete example to understand how the solution works.
Initial DataFrame:
student_id | name | age
1 | "Alice" | 20
2 | NaN | 21
3 | "Bob" | 22
4 | NaN | 19
5 | "Charlie" | 23
Step 1: Apply notnull() to the name column
When we execute students['name'].notnull(), it examines each value in the name column:
- Row 1: "Alice" is not null → True
- Row 2: NaN is null → False
- Row 3: "Bob" is not null → True
- Row 4: NaN is null → False
- Row 5: "Charlie" is not null → True
This creates a boolean Series:
1 True
2 False
3 True
4 False
5 True
Step 2: Use the boolean Series as a filter
When we apply students[students['name'].notnull()], pandas uses this boolean Series as a mask:
- It keeps rows where the mask is True (rows 1, 3, and 5)
- It removes rows where the mask is False (rows 2 and 4)
Final Result:
student_id | name | age
1 | "Alice" | 20
3 | "Bob" | 22
5 | "Charlie" | 23
The rows with student_id 2 and 4 have been successfully removed because their name values were null. The resulting DataFrame contains only students with valid names, maintaining all other column data intact.
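The walkthrough can be reproduced with a short script. This is a sketch that assumes the DataFrame is built with pandas' default index, so the row labels run from 0 to 4 rather than 1 to 5; None stands in for the NaN entries.

```python
import pandas as pd

# The five-row example from the walkthrough; None marks a missing name.
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["Alice", None, "Bob", None, "Charlie"],
    "age": [20, 21, 22, 19, 23],
})

# Step 1: boolean mask over the name column.
mask = students["name"].notnull()
print(mask.tolist())  # [True, False, True, False, True]

# Step 2: keep only the rows where the mask is True.
result = students[mask]
print(result["student_id"].tolist())  # [1, 3, 5]
print(result["age"].tolist())         # [20, 22, 23]
```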
Solution Implementation
import pandas as pd


def dropMissingData(students: pd.DataFrame) -> pd.DataFrame:
    """
    Remove rows from the DataFrame where the 'name' column contains missing values.

    Args:
        students: A pandas DataFrame containing student data with a 'name' column

    Returns:
        A pandas DataFrame with rows containing null names removed
    """
    # Filter the DataFrame to keep only rows where 'name' is not null
    # notnull() returns a boolean Series indicating which values are not missing
    return students[students['name'].notnull()]

import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class DataProcessor {

    /**
     * Remove rows from the data structure where the 'name' column contains missing values.
     *
     * @param students A list of maps representing student data with a 'name' field
     * @return A list of maps with rows containing null names removed
     */
    public static List<Map<String, Object>> dropMissingData(List<Map<String, Object>> students) {
        // Filter the list to keep only entries where 'name' is not null.
        // Map.get returns null both when the key is absent and when it maps to null.
        return students.stream()
                .filter(student -> student.get("name") != null)
                .collect(Collectors.toList());
    }
}

#include <vector>
#include <string>
#include <optional>

// Structure to represent a student record
struct Student {
    std::optional<std::string> name;  // Using optional to handle null/missing values
    // Other fields can be added as needed
};

// Class to represent a DataFrame-like structure for students
class StudentsDataFrame {
public:
    std::vector<Student> data;

    // Constructor
    StudentsDataFrame(const std::vector<Student>& students) : data(students) {}

    // Default constructor
    StudentsDataFrame() = default;
};

/**
 * Remove rows from the DataFrame where the 'name' column contains missing values.
 *
 * @param students A StudentsDataFrame containing student data with a 'name' column
 * @return A StudentsDataFrame with rows containing null names removed
 */
StudentsDataFrame dropMissingData(const StudentsDataFrame& students) {
    // Create a new DataFrame to store the filtered results
    StudentsDataFrame result;

    // Keep only the records whose name has a value (is not null)
    for (const auto& student : students.data) {
        if (student.name.has_value()) {
            result.data.push_back(student);
        }
    }

    return result;
}

// TypeScript doesn't have pandas, so this first version is a conceptual translation only.
// It will not compile as written: there is no built-in DataFrame type or boolean-mask indexing.

/**
 * Remove rows from the DataFrame where the 'name' column contains missing values.
 *
 * @param students - A DataFrame containing student data with a 'name' column
 * @returns A DataFrame with rows containing null names removed
 */
function dropMissingData(students: DataFrame): DataFrame {
    // Filter the DataFrame to keep only rows where 'name' is not null
    // notnull() returns a boolean Series indicating which values are not missing
    return students[students['name'].notnull()];
}

Note: TypeScript doesn't have a direct equivalent to the pandas DataFrame. In a real TypeScript implementation, you would need to do one of the following:
- Use a library like Danfo.js for DataFrame operations
- Implement custom data structures
- Use regular JavaScript arrays and objects with type definitions

A more practical TypeScript approach without pandas:

// Define the Student type for type safety
type Student = {
    name: string | null | undefined;
    [key: string]: any;  // Allow other properties
};

/**
 * Remove rows from the array where the 'name' property contains missing values.
 *
 * @param students - An array of student objects with a 'name' property
 * @returns An array with students containing null/undefined names removed
 */
function dropMissingData(students: Student[]): Student[] {
    // Loose inequality (!= null) rejects both null and undefined
    return students.filter(student => student.name != null);
}

Time and Space Complexity
Time Complexity: O(n), where n is the number of rows in the DataFrame. The operation students['name'].notnull() checks each row to determine whether the name column holds a null value, creating a boolean mask. The boolean indexing students[...] then filters the DataFrame with this mask, which requires one more pass over all rows.
Space Complexity: O(n), where n is the number of rows in the DataFrame. The intermediate boolean mask created by students['name'].notnull() requires O(n) space to store one boolean per row. In addition, the function returns a new DataFrame containing the non-null rows, which in the worst case (when no nulls exist) holds all n rows, so the total space is O(n).
Common Pitfalls
1. Modifying the Original DataFrame Unintentionally
A common mistake is attempting to modify the DataFrame in-place without understanding the implications. For instance, if you assign the filtered result back to a slice of the original DataFrame or forget to return/reassign the result:
# Pitfall - This doesn't actually modify the DataFrame
def dropMissingData(students: pd.DataFrame) -> pd.DataFrame:
    students[students['name'].notnull()]  # Missing return statement
    return students  # Returns the original, unmodified DataFrame
Solution: Always return the filtered DataFrame or explicitly use in-place operations:
# Option 1: Return the filtered DataFrame (recommended)
return students[students['name'].notnull()]

# Option 2: Use dropna with inplace=True (modifies original)
students.dropna(subset=['name'], inplace=True)
return students
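The difference between the two options is easiest to see from the caller's side. In this sketch the helper names drop_by_mask and drop_in_place are hypothetical, chosen only to contrast the two behaviours.

```python
import pandas as pd

def drop_by_mask(students: pd.DataFrame) -> pd.DataFrame:
    # Returns a new DataFrame; the caller's object is left untouched.
    return students[students["name"].notnull()]

def drop_in_place(students: pd.DataFrame) -> pd.DataFrame:
    # Mutates the caller's DataFrame and returns the same object.
    students.dropna(subset=["name"], inplace=True)
    return students

original = pd.DataFrame({
    "student_id": [1, 2, 3],
    "name": ["Alice", None, "Bob"],
    "age": [20, 21, 22],
})

filtered = drop_by_mask(original)
print(len(original), len(filtered))        # 3 2

mutated = drop_in_place(original)
print(len(original), mutated is original)  # 2 True
```

If other code still holds a reference to the input DataFrame, the in-place version changes what that code sees, which is why returning a new DataFrame is usually the safer default.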
2. Confusing notnull() with notna()
notnull() and notna() are aliases and return identical results in pandas, as do isnull() and isna(). Neither spelling is deprecated, but mixing them within one codebase can confuse readers:
# These are equivalent but mixing styles reduces readability
students[students['name'].notnull()]   # OK
students[~students['name'].isnull()]   # OK but more complex
students[students['name'].notna()]     # OK (newer alias)
Solution: Stick to one consistent approach throughout your codebase. The notna() and isna() methods are the more modern aliases, but notnull() and isnull() are still widely used and perfectly valid.
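A quick check, using a small hypothetical Series, confirms that all four spellings produce the same mask:

```python
import pandas as pd

# Hypothetical name column with one missing entry.
names = pd.Series(["Alice", None, "Bob"])

via_notnull = names.notnull()
via_notna = names.notna()
via_isnull = ~names.isnull()
via_isna = ~names.isna()

print(via_notnull.tolist())  # [True, False, True]
print(
    via_notnull.equals(via_notna)
    and via_notnull.equals(via_isnull)
    and via_notnull.equals(via_isna)
)  # True
```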
3. Not Handling Empty Strings as Missing Data
The notnull() method only treats NaN, None, pd.NA, and pd.NaT as missing. Empty strings '' are considered valid non-null values:
# This DataFrame has an empty string in row 2
# student_id | name    | age
# 1          | "Alice" | 20
# 2          | ""      | 21   <- Empty string, not null!
# 3          | "Bob"   | 22

# notnull() will NOT remove the empty string row
students[students['name'].notnull()]  # Row 2 remains
Solution: If empty strings should also be treated as missing data, combine conditions:
# Remove both null values and empty strings
return students[(students['name'].notnull()) & (students['name'] != '')]

# Or convert empty strings to NaN first (note: this overwrites the name column of the input)
students['name'] = students['name'].replace('', pd.NA)
return students[students['name'].notnull()]
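The following sketch contrasts the plain null filter with the combined condition on hypothetical data that contains both an empty string and a true null. It uses the non-mutating variant, so the input DataFrame is left unchanged.

```python
import pandas as pd

# Hypothetical data: row 2 has an empty string, row 3 has a real null.
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4],
    "name": ["Alice", "", None, "Bob"],
    "age": [20, 21, 22, 23],
})

names = students["name"]

# notnull() alone keeps the empty-string row.
only_null_removed = students[names.notnull()]
print(only_null_removed["student_id"].tolist())  # [1, 2, 4]

# Combining both conditions removes nulls and empty strings.
strict = students[names.notnull() & (names != "")]
print(strict["student_id"].tolist())             # [1, 4]
```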
4. Assuming Column Exists Without Validation
If the 'name' column doesn't exist, the code will raise a KeyError:
# Pitfall - No check for column existence
return students[students['name'].notnull()]  # KeyError if 'name' doesn't exist
Solution: Add defensive programming checks:
def dropMissingData(students: pd.DataFrame) -> pd.DataFrame:
    if 'name' not in students.columns:
        raise ValueError("DataFrame must contain a 'name' column")
    return students[students['name'].notnull()]
5. Index Reset Confusion
After filtering, the DataFrame retains the original index values, which may not be consecutive:
# After filtering, index might be: [0, 2, 4, 5] instead of [0, 1, 2, 3]
filtered = students[students['name'].notnull()]
Solution: Reset the index if consecutive indices are needed:
return students[students['name'].notnull()].reset_index(drop=True)
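As an illustration, the sketch below filters a hypothetical five-row DataFrame with a default index and prints the index labels before and after reset_index(drop=True):

```python
import pandas as pd

# Hypothetical data with missing names in the second and fourth rows.
students = pd.DataFrame({
    "student_id": [1, 2, 3, 4, 5],
    "name": ["Alice", None, "Bob", None, "Charlie"],
    "age": [20, 21, 22, 19, 23],
})

filtered = students[students["name"].notnull()]
print(filtered.index.tolist())    # [0, 2, 4]

# drop=True discards the old labels instead of adding them as a column.
renumbered = filtered.reset_index(drop=True)
print(renumbered.index.tolist())  # [0, 1, 2]
print(list(renumbered.columns))   # ['student_id', 'name', 'age']
```

Whether to reset the index depends on the consumer: code that joins results back to the original rows may want the original labels, while positional access usually expects consecutive ones.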