609. Find Duplicate File in System
Problem Description
The problem presents a scenario where we have a file system that consists of several directories and files. Each directory can have multiple files with their contents. We are given a list of directory info strings, where each string has the following format: a path to the directory followed by one or more files, with each file's name and content enclosed in parentheses.
Our task is to find all the duplicate files in the file system. A duplicate file is defined as a file that has the exact same content as one or more other files, regardless of the file or directory names. The output should be a list of groups, with each group containing the file paths of all files that have identical content.
Here's a breakdown of a directory info string:
"root/d1/d2/.../dm"
represents the path to the directory.f1.txt(f1_content)
represents a file within the directory, wheref1.txt
is the file name, andf1_content
is the content of the file. There can be multiple such file representations in a single directory info string.
The solution should return groups of file paths for all sets of duplicate files. Each file path in the output is formatted as "directory_path/file_name.txt"
.
Intuition
To solve this problem, we need to efficiently group the files based on their contents. Here's a straightforward approach to tackle this:
- Iterate through each directory info string in the given paths list.
- For each directory info string, separate the directory path from the files. Then, for each file, extract the name and content.
- Use a data structure that allows easy grouping of files with the same content. A dictionary (or hashmap) is a good choice for this task. We can use the file content as the key and a list of file paths as the value.
- For each file, append the full file path (directory path + '/' + file name) to the list in the dictionary corresponding to the file's content.
- Once all files are processed, we only need the lists that have more than one file path, since these represent duplicate files.
With this approach, we're using the file contents as a unique identifier to find duplicates, which satisfies the condition that duplicate files consist of at least two files that have the same content.
The solution code creates a defaultdict of lists, goes through each file, extracts the content, and appends the full path to the corresponding list in the dictionary. Finally, it builds the result by including only the lists with more than one file path, as these are the duplicates.
The intuition behind this approach is that by mapping file contents to file paths, we can quickly identify which files are duplicates without having to compare the contents of every file with every other file, which would be less efficient.
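As a minimal sketch of this grouping idea, assume the files have already been parsed into (full_path, content) pairs (the pairs below are made up for illustration):

from collections import defaultdict

# Hypothetical, already-parsed (full_path, content) pairs
parsed = [
    ("root/a/1.txt", "abcd"),
    ("root/c/3.txt", "abcd"),
    ("root/a/2.txt", "efgh"),
]

groups = defaultdict(list)
for full_path, content in parsed:
    groups[content].append(full_path)  # the content itself is the grouping key

print([v for v in groups.values() if len(v) > 1])
# [['root/a/1.txt', 'root/c/3.txt']]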
Solution Approach
The solution uses a simple yet effective approach, leveraging a defaultdict to group files by their contents. The implementation steps are as follows:
- Initialize a defaultdict: We start by creating a defaultdict of lists. This particular data structure is chosen because it simplifies appending to a list for a given key without first checking whether the key exists in the dictionary.
- Split the input: Iterate through each entry in the paths list. For every path string, we split it into components using the split() method. The first component is the directory path, and the rest are files with content.
- Process each file: For each file, find the index of the opening parenthesis '(' to separate the file name from its content.
- Extract name and content: The substring before the parenthesis is the file name, and the substring within the parentheses is the file content (a short parsing sketch follows this list).
- Construct the full path and group by content: Use the extracted name and content to form the full file path (concatenating the directory path, '/', and the file name). Append this path to the list in the defaultdict associated with the file content.
- Filter out unique files: After all paths have been processed, we filter the defaultdict to include only lists with more than one path, meaning these paths share duplicate contents.
- Return the result: The final step is to return the lists of file paths that have duplicates.
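To make these string operations concrete, here is a minimal sketch of parsing a single directory info string (the sample string is illustrative, not taken from any real file system):

entry = "root/a 1.txt(abcd) 2.txt(efgh)"  # sample directory info string

parts = entry.split()            # ['root/a', '1.txt(abcd)', '2.txt(efgh)']
directory = parts[0]             # 'root/a'

file_detail = parts[1]           # '1.txt(abcd)'
i = file_detail.find('(')        # index of the opening parenthesis
name = file_detail[:i]           # '1.txt'
content = file_detail[i + 1:-1]  # 'abcd' (drops both parentheses)
print(directory + '/' + name, content)  # root/a/1.txt abcd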
The algorithm's complexity is efficient as it consists mostly of linear operations: splitting strings, indexing, and dictionary operations. defaultdict lookups and list appends are constant-time operations on average, making the solution scalable for a large number of files and directories.
The solution code completes this process with a concise script that translates the conceptual steps into Python fluently, leveraging Python's built-in string manipulation and dictionary methods to create an elegant and efficient solution.
Here is the main part of the code with comments to demonstrate the implementation of the approach:
from collections import defaultdict
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        d = defaultdict(list)  # Step 1: Initialize a defaultdict
        for p in paths:  # Step 2: Split the input
            ps = p.split()
            for f in ps[1:]:
                i = f.find('(')  # Step 3: Process each file
                # Step 4: Extract name and content
                name, content = f[:i], f[i + 1 : -1]
                # Step 5: Construct the full path and group by content
                d[content].append(ps[0] + '/' + name)
        # Step 6: Filter out unique files
        return [v for v in d.values() if len(v) > 1]
The key to this solution is the grouping of files by their content using a dictionary, making the task of finding duplicates straightforward.
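As a quick usage check (assuming the class above, with the imports shown), the snippet below can be run on a small two-entry input:

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)"]
print(Solution().findDuplicate(paths))
# [['root/a/1.txt', 'root/c/3.txt']]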
Example Walkthrough
Let's illustrate the solution approach with a small example.
Given the following input list of directory info strings:
paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
This list tells us we have 4 paths:
- "root/a" contains "1.txt" with content "abcd" and "2.txt" with content "efgh".
- "root/c" contains "3.txt" with content "abcd".
- "root/c/d" contains "4.txt" with content "efgh".
- "root" contains "4.txt" with content "efgh".
We need to find duplicate files (files with the same content) and group them.
- Step 1: Initialize a defaultdict called "d".
- Step 2: Split each path string based on space to separate the directory path and the files within it.
- Step 3: For each file in the directory string, find the parenthesis to separate the file name and content.
- Step 4: Extract the file name which is before the parenthesis and the file content which is within the parentheses.
- Step 5: Form the full file path by concatenating the directory path with the file name and append this path to the corresponding list in the defaultdict indicated by the file content.
- Step 6: After processing all paths, we filter the defaultdict to include only those lists that have more than one file path, which indicate duplicates.
For the given example, the process will work as follows:
- The first entry "root/a 1.txt(abcd) 2.txt(efgh)" is split into ["root/a", "1.txt(abcd)", "2.txt(efgh)"].
  - The file "1.txt(abcd)" is processed to get the name "1.txt" and content "abcd".
  - The full path becomes "root/a/1.txt" and is appended to the list for content "abcd".
  - Similarly for "2.txt(efgh)", producing the full path "root/a/2.txt" and appending to the list for content "efgh".
- The same process follows for each entry in paths. After processing, the defaultdict d looks like this:
{
    "abcd": ["root/a/1.txt", "root/c/3.txt"],  # Duplicate content
    "efgh": ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]  # Duplicate content
}
- In Step 6, we keep only the lists with more than one file path. Our final output will be:
[
    ["root/a/1.txt", "root/c/3.txt"],
    ["root/a/2.txt", "root/c/d/4.txt", "root/4.txt"]
]
These are the groups of duplicate files. The list for the "abcd" content shows that "1.txt" in the "root/a" directory is a duplicate of "3.txt" in the "root/c" directory; the same holds for the three files in the "efgh" content group.
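To double-check the trace above, you can print the intermediate dictionary before the final filter; this small debugging sketch reuses the variable names from the solution code:

from collections import defaultdict

paths = ["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)",
         "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]

d = defaultdict(list)
for p in paths:
    ps = p.split()
    for f in ps[1:]:
        i = f.find('(')
        d[f[i + 1:-1]].append(ps[0] + '/' + f[:i])

print(dict(d))
# {'abcd': ['root/a/1.txt', 'root/c/3.txt'],
#  'efgh': ['root/a/2.txt', 'root/c/d/4.txt', 'root/4.txt']}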
Solution Implementation
from collections import defaultdict
from typing import List

class Solution:
    def findDuplicate(self, paths: List[str]) -> List[List[str]]:
        # Initialize a dictionary which maps file content to a list of file paths
        content_to_paths = defaultdict(list)

        # Iterate over each path in the input list
        for path_string in paths:
            # Split the path string into the directory and the file detail parts
            path_parts = path_string.split()
            directory = path_parts[0]

            # For each file in the current directory path
            for file_detail in path_parts[1:]:
                # Find the index of the opening bracket '(' which starts the file content
                content_start_idx = file_detail.find('(')
                # Extract the file name and content
                file_name = file_detail[:content_start_idx]
                content = file_detail[content_start_idx + 1 : -1]  # Exclude the closing bracket

                # Append the full file path to the list of paths corresponding to the content
                full_path = directory + '/' + file_name
                content_to_paths[content].append(full_path)

        # Filter and return only the lists of file paths with more than one file path
        # This indicates that files are duplicates as per their content
        return [paths_list for paths_list in content_to_paths.values() if len(paths_list) > 1]
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class Solution {
    public List<List<String>> findDuplicate(String[] paths) {
        // Create a map to hold file content as key and a list of file paths as value
        Map<String, List<String>> contentToPathsMap = new HashMap<>();

        // Loop over each path in the input array
        for (String path : paths) {
            // Split the current path into directory and files with content
            String[] parts = path.split(" ");

            // Skip the first part as it is the directory path and iterate over the files
            for (int i = 1; i < parts.length; ++i) {
                // Find the index of the first '(' to separate file name and content
                int contentStartIndex = parts[i].indexOf('(');
                // Extract the content between parentheses
                String content = parts[i].substring(contentStartIndex + 1, parts[i].length() - 1);
                // Construct the full file path using the directory and the file name
                String fullPath = parts[0] + '/' + parts[i].substring(0, contentStartIndex);
                // Add the full file path to the list of paths for the current content
                contentToPathsMap.computeIfAbsent(content, k -> new ArrayList<>()).add(fullPath);
            }
        }

        // Prepare the result list for duplicate files
        List<List<String>> duplicates = new ArrayList<>();

        // Loop over the entries in the content to paths map
        for (List<String> pathsList : contentToPathsMap.values()) {
            // Only consider lists with more than one file path (indicating duplicates)
            if (pathsList.size() > 1) {
                duplicates.add(pathsList);
            }
        }

        // Return the list of duplicates
        return duplicates;
    }
}
#include <vector>
#include <string>
#include <unordered_map>
#include <sstream>
using namespace std;

class Solution {
public:
    // Finds duplicates in the file system given a list of directory paths
    vector<vector<string>> findDuplicate(vector<string>& paths) {
        // This map stores file content as the key and a list of file paths with that content as the value
        unordered_map<string, vector<string>> contentToFilePathsMap;

        // Iterate over each path provided in the input
        for (auto& path : paths) {
            // Split the path string on spaces. The first element is the directory path and the remaining elements are files
            vector<string> pathAndFiles = split(path, ' ');
            // Iterate over the files
            for (size_t i = 1; i < pathAndFiles.size(); ++i) {
                // Find the position of the first parenthesis, which marks the start of the file content
                size_t contentStartPos = pathAndFiles[i].find('(');
                // Extract the file content (excluding the enclosing parentheses)
                string content = pathAndFiles[i].substr(contentStartPos + 1, pathAndFiles[i].size() - contentStartPos - 2);
                // Build the full file path by concatenating the directory path and the file name
                string fullPath = pathAndFiles[0] + '/' + pathAndFiles[i].substr(0, contentStartPos);
                // Store the full path in the map under the corresponding content key
                contentToFilePathsMap[content].push_back(fullPath);
            }
        }

        // Vector to store the final result
        vector<vector<string>> duplicates;
        // Iterate over the map to check for duplicate contents
        for (auto& mapEntry : contentToFilePathsMap) {
            // If there is more than one file with the same content, it is considered a duplicate
            if (mapEntry.second.size() > 1) {
                // Add the list of file paths with the same content to the result
                duplicates.push_back(mapEntry.second);
            }
        }

        // Return the list of duplicate file groups
        return duplicates;
    }

    // A helper function to split a string by a given delimiter
    vector<string> split(const string& str, char delimiter) {
        vector<string> tokens;
        stringstream ss(str);
        string token;

        // Split the string by the delimiter and push each token into the vector
        while (getline(ss, token, delimiter)) {
            tokens.push_back(token);
        }
        return tokens;
    }
};
function findDuplicate(paths: string[]): string[][] {
    // Map to store the content of the files and their paths
    const fileContentMap = new Map<string, string[]>();

    // Iterate over each path provided
    for (const path of paths) {
        // Split the path into root directory and files
        const [rootDirectory, ...files] = path.split(' ');

        // Iterate over each file in the current path
        for (const file of files) {
            // Split the file information into name and content (the Boolean filter removes empty strings caused by splitting)
            const [fileName, fileContent] = file.split(/\(|\)/g).filter(Boolean);

            // Get the existing list of file paths with the same content, or initialize an empty one
            const existingPaths = fileContentMap.get(fileContent) ?? [];

            // Add the new file path to the list of paths for this content
            existingPaths.push(`${rootDirectory}/${fileName}`);

            // Update the map with the new list of paths for this content
            fileContentMap.set(fileContent, existingPaths);
        }
    }

    // Keep only the file contents that map to more than one path and return them
    return [...fileContentMap.values()].filter(paths => paths.length > 1);
}
Time and Space Complexity
Time Complexity
The time complexity of the function is determined by several factors:
- Splitting each path string, which takes O(N) time, where N is the total length of the string.
- Iterating through each file within each path, which takes O(M) time, where M is the number of files across all paths.
- Finding the index of the first parenthesis in each file name, which takes O(K) time, where K is the length of the file name.
- Inserting the path and file name into the dictionary, which takes O(1) per operation, assuming the average case for hash table inserts.
Let's assume P is the number of paths, each path contains at most F files, and the longest file entry is L characters. The initial split takes O(P*L), and then each file of each path is accessed and parsed, which is O(P*F*L).
Thus, the overall time complexity is O(P*L + P*F*L), which simplifies to O(P*F*L), as we assume each file entry is also bounded by the same L.
Space Complexity
The space complexity includes:
- The dictionary storing all the file contents as keys and the associated file paths as values. In the worst case, each file has unique content, so the space complexity would be O(M), where M is the number of files.
- Temporary variables for processing, such as the name and content of each file. These do not significantly add to the space complexity, as they are overwritten for each file.
Thus, the space complexity is primarily dictated by the size of the dictionary, which in the worst case is O(P*F) to store all unique file content-path pairs. Therefore, the space complexity is O(P*F).
Note: If file content is very large, it could dominate the space complexity, and we might express it as O(P*F*C), where C is the average content size.
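When contents are very large, one common space optimization (a variant, not part of the solution above) is to key the dictionary on a fixed-size digest of the content rather than the content itself; the helper name below is hypothetical, and a production version would still need to guard against the theoretical possibility of hash collisions:

import hashlib
from collections import defaultdict
from typing import Iterable, List, Tuple

def group_by_digest(parsed_files: Iterable[Tuple[str, str]]) -> List[List[str]]:
    """Group file paths by a SHA-256 digest of their content.

    parsed_files: (full_path, content) pairs, as produced by the parsing step.
    """
    groups = defaultdict(list)
    for full_path, content in parsed_files:
        # A 32-byte digest replaces the (possibly huge) content as the key
        digest = hashlib.sha256(content.encode()).hexdigest()
        groups[digest].append(full_path)
    return [v for v in groups.values() if len(v) > 1]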