609. Find Duplicate File in System
Problem Description
You are given a list of strings where each string represents directory information. Each string contains:
- A directory path
- One or more files with their contents
The format of each string is:
"root/d1/d2/.../dm f1.txt(f1_content) f2.txt(f2_content) ... fn.txt(fn_content)"
This means:
- The directory path is root/d1/d2/.../dm
- There are n files (f1.txt, f2.txt, ..., fn.txt) in this directory
- Each file has its content shown in parentheses (f1_content, f2_content, ..., fn_content)
Your task is to find all duplicate files in the file system. Duplicate files are files that have the exact same content, regardless of their names or locations.
You need to return groups of duplicate file paths, where:
- Each group contains at least 2 files with identical content
- Each file path is in the format "directory_path/file_name.txt"
For example, if the input is:
["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root 4.txt(efgh)"]
The output would be:
[["root/a/1.txt","root/c/3.txt"], ["root/a/2.txt","root/4.txt"]]
This is because:
- Files 1.txt and 3.txt both contain "abcd"
- Files 2.txt and 4.txt both contain "efgh"
The solution uses a hash map to group files by their content. It parses each input string to extract the directory path, file names, and contents, then builds the complete file paths and groups them by content. Finally, it returns only the groups that contain more than one file (duplicates).
Intuition
The key insight is that we need to identify files with identical content, regardless of their location or name. This immediately suggests using a hash map (dictionary) where we can group files by their content.
Think of it like organizing documents in a filing system - instead of organizing by name or location, we want to organize by what's written inside them. Every document with the same content goes into the same folder.
The natural approach is:
- Use the file content as the key in our hash map
- Store all file paths that have this content as the value (a list of paths)
This way, when we encounter a file, we can quickly check if we've seen this content before and add the current file path to the appropriate group.
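As a minimal sketch of this grouping idea in Python (the (path, content) pairs here are hypothetical, pre-parsed inputs for illustration):

from collections import defaultdict

# Hypothetical pre-parsed (path, content) pairs
files = [("root/a/1.txt", "abcd"), ("root/c/3.txt", "abcd"), ("root/4.txt", "efgh")]

groups = defaultdict(list)  # content -> list of file paths with that content
for path, content in files:
    groups[content].append(path)

duplicates = [paths for paths in groups.values() if len(paths) > 1]
# duplicates == [['root/a/1.txt', 'root/c/3.txt']]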
The parsing strategy becomes clear when we look at the input format. Each string has:
- First part: the directory path (everything before the first space)
- Remaining parts: files in the format filename.txt(content)
For each file, we need to (see the sketch after this list):
- Extract the filename (everything before the ()
- Extract the content (everything between ( and ))
- Combine the directory path with the filename to get the full path
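A minimal sketch of that per-file parse, using one hypothetical token:

token = "1.txt(abcd)"  # one file entry from an input string
i = token.find('(')    # index of the opening parenthesis (5 here)
filename = token[:i]          # "1.txt"
content = token[i + 1:-1]     # "abcd" (drops both parentheses)
full_path = "root/a" + '/' + filename  # "root/a/1.txt"; directory assumed for illustration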
After processing all files, we'll have a hash map where each key (content) maps to a list of all file paths with that content. Since we only want duplicates, we filter out any groups with just one file.
This approach is efficient because:
- We process each file exactly once: O(n), where n is the total number of files
- Hash map operations (insert/lookup) are O(1) on average
- We avoid comparing every file with every other file, which would be O(n²)
Solution Approach
The implementation uses a hash map (Python's defaultdict) to group files by their content:
1. Initialize the data structure: create a defaultdict(list) where each key will be file content and each value will be a list of file paths with that content.
2. Parse each path string:
   - Split each path string by spaces using p.split()
   - The first element ps[0] is the directory path
   - All remaining elements ps[1:] are files with their contents
3. Extract file information: for each file string in the format filename.txt(content) (an equivalent regex-based parse is sketched after this list):
   - Find the position of ( using f.find('(')
   - Extract the filename: f[:i] (everything before the ()
   - Extract the content: f[i + 1 : -1] (everything between ( and ), excluding the parentheses)
4. Build the complete file path:
   - Combine directory and filename: ps[0] + '/' + name
   - Add this complete path to the list for the corresponding content: d[content].append(path)
5. Filter for duplicates:
   - Return only groups with more than one file: [v for v in d.values() if len(v) > 1]
   - This list comprehension filters out any content that appears in only one file
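As an aside, the step-3 extraction could also be done with a regular expression instead of find() and slicing; this is just an alternative sketch, not what the implementations below use:

import re

# Matches "filename(content)"; the content may be empty
FILE_RE = re.compile(r'^(?P<name>[^(]+)\((?P<content>.*)\)$')

match = FILE_RE.match("1.txt(abcd)")  # one hypothetical token
filename = match.group('name')        # "1.txt"
content = match.group('content')      # "abcd"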
Example walkthrough: if the input is ["root/a 1.txt(abc) 2.txt(xyz)", "root/b 3.txt(abc)"]:
- First string: splits into ["root/a", "1.txt(abc)", "2.txt(xyz)"]
  - 1.txt(abc) → filename: 1.txt, content: abc, path: root/a/1.txt
  - 2.txt(xyz) → filename: 2.txt, content: xyz, path: root/a/2.txt
- Second string: splits into ["root/b", "3.txt(abc)"]
  - 3.txt(abc) → filename: 3.txt, content: abc, path: root/b/3.txt
Hash map after processing:
- "abc": ["root/a/1.txt", "root/b/3.txt"]
- "xyz": ["root/a/2.txt"]
Final output: [["root/a/1.txt", "root/b/3.txt"]] (only groups with 2+ files)
Example Walkthrough
Let's walk through a small example to illustrate the solution approach.
Input: ["root/a 1.txt(hello) 2.txt(world)", "root/b 3.txt(hello)", "root/c/d 4.txt(hello) 5.txt(foo)"]
Step 1: Initialize our hash map
Create an empty hash map to group files by content:
content_map = {}
Step 2: Process first string "root/a 1.txt(hello) 2.txt(world)"
Split by spaces: ["root/a", "1.txt(hello)", "2.txt(world)"]
- Directory: root/a
- Files to process: ["1.txt(hello)", "2.txt(world)"]
For 1.txt(hello):
- Find ( at position 5
- Filename: 1.txt (characters before position 5)
- Content: hello (between parentheses)
- Full path: root/a/1.txt
- Update map: content_map["hello"] = ["root/a/1.txt"]
For 2.txt(world):
- Filename: 2.txt
- Content: world
- Full path: root/a/2.txt
- Update map: content_map["world"] = ["root/a/2.txt"]
Step 3: Process second string "root/b 3.txt(hello)"
Split: ["root/b", "3.txt(hello)"]
- Directory: root/b
For 3.txt(hello):
- Filename: 3.txt
- Content: hello
- Full path: root/b/3.txt
- Update map: content_map["hello"] = ["root/a/1.txt", "root/b/3.txt"]
Step 4: Process third string "root/c/d 4.txt(hello) 5.txt(foo)"
Split: ["root/c/d", "4.txt(hello)", "5.txt(foo)"]
- Directory: root/c/d
For 4.txt(hello):
- Full path: root/c/d/4.txt
- Update map: content_map["hello"] = ["root/a/1.txt", "root/b/3.txt", "root/c/d/4.txt"]
For 5.txt(foo):
- Full path: root/c/d/5.txt
- Update map: content_map["foo"] = ["root/c/d/5.txt"]
Step 5: Filter for duplicates
Final hash map state:
- "hello": ["root/a/1.txt", "root/b/3.txt", "root/c/d/4.txt"] (3 files - duplicates!)
- "world": ["root/a/2.txt"] (1 file - not a duplicate)
- "foo": ["root/c/d/5.txt"] (1 file - not a duplicate)
Return only groups with 2+ files:
Output: [["root/a/1.txt", "root/b/3.txt", "root/c/d/4.txt"]]
The algorithm successfully identified that three files contain "hello" and grouped them together, while filtering out the unique files.
Solution Implementation
Python:

from collections import defaultdict
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Dictionary to store content as key and list of file paths as value
        content_to_paths = defaultdict(list)

        # Process each path string in the input
        for path_string in paths:
            # Split the path string into directory and file components
            path_components = path_string.split()
            directory = path_components[0]

            # Process each file in the directory
            for file_info in path_components[1:]:
                # Find the position of the opening parenthesis to separate filename and content
                paren_index = file_info.find('(')

                # Extract filename and content
                filename = file_info[:paren_index]
                file_content = file_info[paren_index + 1:-1]  # Remove parentheses

                # Build the full file path and add it to the dictionary
                full_path = f"{directory}/{filename}"
                content_to_paths[file_content].append(full_path)

        # Return groups of duplicate files (groups with more than one file)
        return [file_paths for file_paths in content_to_paths.values() if len(file_paths) > 1]
Java:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        // Map to store file content as key and list of file paths as value
        Map<String, List<String>> contentToPathsMap = new HashMap<>();

        // Process each directory path string
        for (String pathString : paths) {
            // Split the string to get the directory and files
            // Format: "directory file1(content1) file2(content2) ..."
            String[] pathComponents = pathString.split(" ");
            String directory = pathComponents[0];

            // Process each file in the directory (starting from index 1)
            for (int i = 1; i < pathComponents.length; i++) {
                String fileInfo = pathComponents[i];

                // Find the opening parenthesis to separate filename from content
                int openParenIndex = fileInfo.indexOf('(');

                // Extract the content between parentheses
                String fileContent = fileInfo.substring(openParenIndex + 1, fileInfo.length() - 1);

                // Build the complete file path: directory/filename
                String fileName = fileInfo.substring(0, openParenIndex);
                String fullPath = directory + '/' + fileName;

                // Add the file path to the list of paths with the same content
                contentToPathsMap.computeIfAbsent(fileContent, key -> new ArrayList<>()).add(fullPath);
            }
        }

        // Collect groups of duplicate files
        List<List<String>> duplicateGroups = new ArrayList<>();

        // Add only groups with more than one file (actual duplicates)
        for (List<String> filePaths : contentToPathsMap.values()) {
            if (filePaths.size() > 1) {
                duplicateGroups.add(filePaths);
            }
        }

        return duplicateGroups;
    }
}
C++:

#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Solution {
public:
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        // Map to store content as key and list of file paths with that content as value
        unordered_map<string, vector<string>> contentToFilePaths;

        // Process each path string in the input
        for (auto& pathString : paths) {
            // Split the path string by spaces to get the directory and files
            vector<string> tokens = split(pathString, ' ');
            string directory = tokens[0];

            // Process each file (starting from index 1, as index 0 is the directory)
            for (size_t i = 1; i < tokens.size(); ++i) {
                // Find the position of the opening parenthesis to separate filename and content
                int openParenPos = tokens[i].find('(');

                // Extract the content between parentheses (excluding the parentheses)
                string fileContent = tokens[i].substr(openParenPos + 1,
                                                      tokens[i].size() - openParenPos - 2);

                // Extract the filename (before the opening parenthesis)
                string fileName = tokens[i].substr(0, openParenPos);

                // Construct the full file path
                string fullPath = directory + '/' + fileName;

                // Group files by their content
                contentToFilePaths[fileContent].push_back(fullPath);
            }
        }

        // Collect groups of duplicate files
        vector<vector<string>> result;
        for (auto& [content, filePaths] : contentToFilePaths) {
            // Only include groups with more than one file (duplicates)
            if (filePaths.size() > 1) {
                result.push_back(filePaths);
            }
        }

        return result;
    }

private:
    // Helper function to split a string by a delimiter character
    vector<string> split(string& str, char delimiter) {
        vector<string> tokens;
        stringstream stringStream(str);
        string token;

        // Extract tokens separated by the delimiter
        while (getline(stringStream, token, delimiter)) {
            tokens.push_back(token);
        }

        return tokens;
    }
};
TypeScript:

/**
 * Finds groups of files with duplicate content across different paths
 * @param paths - Array of strings where each string contains a directory path followed by files with their content
 * @returns Array of arrays, where each inner array contains paths to files with identical content
 */
function findDuplicate(paths: string[]): string[][] {
    // Map to store content as key and array of file paths as value
    const contentToPathsMap = new Map<string, string[]>();

    // Process each path string
    for (const pathString of paths) {
        // Split the path string into the directory and files
        // The first element is the directory; the rest are files with content
        const [directory, ...files] = pathString.split(' ');

        // Process each file in the current directory
        for (const fileWithContent of files) {
            // Extract filename and content by splitting on parentheses
            // filter(Boolean) removes empty strings from the result
            const [fileName, fileContent] = fileWithContent.split(/\(|\)/g).filter(Boolean);

            // Get the existing paths for this content, or initialize an empty array
            const existingPaths = contentToPathsMap.get(fileContent) ?? [];

            // Add the full path (directory + filename) to the list
            existingPaths.push(directory + '/' + fileName);

            // Update the map with the new paths array
            contentToPathsMap.set(fileContent, existingPaths);
        }
    }

    // Return only groups with duplicate files (more than one path)
    return [...contentToPathsMap.values()].filter(pathGroup => pathGroup.length > 1);
}
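As a quick sanity check, running the Python solution above on the problem's sample input should reproduce the expected output (the order of groups may vary, since it follows the hash map's iteration order):

solution = Solution()
print(solution.findDuplicate([
    "root/a 1.txt(abcd) 2.txt(efgh)",
    "root/c 3.txt(abcd)",
    "root 4.txt(efgh)",
]))
# [['root/a/1.txt', 'root/c/3.txt'], ['root/a/2.txt', 'root/4.txt']]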
Time and Space Complexity
Time Complexity: O(n * m), where n is the total number of paths in the input list and m is the average length of each path string.
- The outer loop iterates through all n paths
- For each path, we split the string, which takes O(m) time
- For each file in the path (say k files on average), we:
  - Find the '(' character: O(length of filename)
  - Slice the string to extract the name and content: O(length of filename)
  - Append to the dictionary list: O(1) amortized
- The final list comprehension iterates through all unique contents in the dictionary, which is at most O(total number of files)
- Overall: O(n * m), where the string operations dominate
Space Complexity: O(n * m)
- The dictionary d stores all file paths grouped by content
- In the worst case, every file has unique content, so we store all file paths
- Each stored path includes the directory path plus the filename
- The total space used is proportional to the sum of all characters in all file paths
- The output list could contain all file paths if they all share the same content
- Overall: O(n * m) for storing all the path strings in the dictionary and the output
Common Pitfalls
1. Incorrect String Parsing with Multiple Spaces
A common mistake is assuming the input string is always perfectly formatted with single spaces. If there are extra spaces or inconsistent spacing, split() without arguments handles this correctly, but developers might reach for split(' '), which creates empty strings in the result.
Pitfall Example:
# Wrong approach
path_components = path_string.split(' ')  # Creates empty strings if multiple spaces exist
# If the input is "root/a  1.txt(abc)" (double space), this gives ["root/a", "", "1.txt(abc)"]
Solution:
# Correct approach
path_components = path_string.split()  # Handles any whitespace correctly
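A quick check in a Python REPL confirms the difference (the double space in the sample string is deliberate):

>>> "root/a  1.txt(abc)".split(' ')
['root/a', '', '1.txt(abc)']
>>> "root/a  1.txt(abc)".split()
['root/a', '1.txt(abc)']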
2. Hardcoding Parenthesis Removal
Another pitfall is extracting content by hardcoding indices instead of finding the parenthesis positions, which fails as soon as filenames or contents vary in length.
Pitfall Example:
# Wrong approach - assumes parentheses are always at specific positions
content = file_info[-5:-1]  # Assumes content is always 4 characters
Solution:
# Correct approach - dynamically find the parenthesis position
paren_index = file_info.find('(')
filename = file_info[:paren_index]
content = file_info[paren_index + 1:-1]
3. Not Handling Edge Cases
Failing to consider edge cases like:
- Empty content in files: file.txt()
- Files with identical names in different directories
- A single file with unique content (which should not appear in the output)
Pitfall Example:
# Wrong - returns all groups including single files
return list(content_to_paths.values())
Solution:
# Correct - only return groups with duplicates
return [file_paths for file_paths in content_to_paths.values() if len(file_paths) > 1]
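For the empty-content edge case, the slicing approach already behaves sensibly: the content comes out as an empty string, which is still a valid dictionary key. A small check:

token = "file.txt()"  # hypothetical file with empty content
i = token.find('(')
content = token[i + 1:-1]  # "" - empty string, still usable as a key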
4. Path Concatenation Issues
Using incorrect path concatenation that might create double slashes or missing slashes.
Pitfall Example:
# Wrong approaches
full_path = directory + filename          # Missing slash
full_path = directory + '//' + filename  # Double slash
Solution:
# Correct approach
full_path = f"{directory}/{filename}"
# Or using os.path.join for platform independence (though not needed for this problem)