Leetcode 609. Find Duplicate File in System

Problem

Given a list of directory info which includes information about the directory path, files, and their contents in the directory, the task is to find all the groups of duplicate files based on their paths.

Here is an example:

1
2plaintext
3For instance, given the below input,
4["root/a 1.txt(abcd) 2.txt(efgh)", "root/c 3.txt(abcd)", "root/c/d 4.txt(efgh)", "root 4.txt(efgh)"]
5
6The expected output would be:
7[["root/a/2.txt","root/c/d/4.txt","root/4.txt"],["root/a/1.txt","root/c/3.txt"]]

In the above example, the files 2.txt, 4.txt in directory root/c/d and 4.txt in root directory have the same content efgh. Therefore, they are grouped together. Similarly, the files 1.txt in directory root/a and 3.txt in directory root/c have the same content abcd. So they are grouped together.

Solution

The problem can be solved by using a hashtable i.e., unordered_map in C++ which stores the content of the file as the key and the corresponding file paths as its value.

Algorithm

  1. Loop over the given paths,
  2. For each path, split the string into root directory and file content.
  3. For each file content, extract the file name and content.
  4. Form the complete file path with root directory and the file name.
  5. Store this complete file path in the hashmap where the key is the file content.
  6. After collecting all paths corresponding to each unique file content in hashmap, check for the content keys which have more than one file path.
  7. For each file content with more than one file path, these paths are considered as a group and add these groups to the response.

Python

1
2python
3from collections import defaultdict
4import re
5
6class Solution:
7
8    def findDuplicate(self, paths):
9        m = defaultdict(list)
10        for path in paths:
11            data = path.split()
12            root = data[0]
13            for file in data[1:]:
14                name, content = re.match(r"(.+)\((.+)\)", file).groups()
15                m[content].append(root + '/' + name)
16        return [x for x in m.values() if len(x) > 1]

Java

1
2java
3public List<List<String>> findDuplicate(String[] paths) {
4     Map<String, List<String>> map = new HashMap<>();
5     for (String path : paths) {
6         String[] values = path.split(" ");
7         for (int i = 1; i < values.length; i++) {
8             String[] name_cont = values[i].split("\\(");
9             name_cont[1] = name_cont[1].replace(")", "");
10             List<String> list = map.getOrDefault(name_cont[1], new ArrayList<String>());
11             list.add(values[0] + "/" + name_cont[0]);
12             map.put(name_cont[1], list);
13         }
14     }
15     return map.values().stream().filter(e -> e.size() > 1).collect(Collectors.toList());
16 }

C++

1
2c++
3class Solution {
4public:
5    vector<vector<string>> findDuplicate(vector<string>& paths) {
6        unordered_map<string, vector<string>> files;
7        for (string path : paths) {
8            stringstream ss(path);
9            string root;
10            string s;
11            getline(ss, root, ' ');
12            while (getline(ss, s, ' ')) {
13                string fileName = root + '/' + s.substr(0, s.find('('));
14                string fileContent = s.substr(s.find('(') + 1, s.find(')') - s.find('(') - 1);
15                files[fileContent].push_back(fileName);
16            }
17        }
18        vector<vector<string>> res;
19        for (auto file : files) {
20            if (file.second.size() > 1)
21                res.push_back(file.second);
22        }
23        return res;
24     }
25};

C#

1
2csharp
3public class Solution {
4    public IList<IList<string>> FindDuplicate(string[] paths) {
5        var map = new Dictionary<string, List<string>>();
6        foreach(var path in paths)
7        {
8            var details = path.Split(' ');
9            for(var i = 1; i < details.Length; i++)
10            {
11                var file = details[i].Split('(');
12                var content = file[1].Substring(0, file[1].Length - 1);
13                if(!map.ContainsKey(content))
14                {
15                    map.Add(content, new List<string>());
16                }
17                map[content].Add(details[0] + "/" + file[0]);
18            }
19        }
20        var result = new List<IList<string>>();
21        foreach(var kvp in map)
22        {
23            if(kvp.Value.Count > 1)
24            {
25                result.Add(kvp.Value);
26            }
27        }
28        return result;
29    }
30}

Javascript

1
2javascript
3/**
4 * @param {string[]} paths
5 * @return {string[][]}
6 */
7var findDuplicate = function(paths) {
8    var map = {};
9    for(var path of paths) {
10        var files = path.split(" ");
11        var dir = files.shift();
12        for(var file of files) {
13            var idx = file.indexOf("(");
14            var content = file.substring(idx + 1, file.length - 1);
15            if(!map[content]) {
16                map[content] = [];
17            }
18            map[content].push(dir + "/" + file.substring(0, idx));
19        }
20    }
21    return Object.values(map).filter(dups => dups.length > 1);
22};

Please note that there is no fixed output for the problem because the output does not need to be in a specific order.The solution to this problem involves creating a hash map where the function hashes contents of each file and maps it to a list of files with that same content. This is done by iterating through the directory paths, splitting the information in each path to a root directory path and file content. For each file content, we get the file name and the content, form a complete file path and add it to the hash map. If any file content in the hash map has more than one path, these paths are grouped together and returned as the result.

Let's consider each of these solutions in more detail:

Python

The Python solution uses regular expressions to split the file name and content. It employs a dictionary m with default lists as values. For each directory path, it splits the path into root and file parts and further extracts the file name and content. It then adds the complete file path to the dictionary keyed by the content. At the end, it builds a list of values from the dictionary that have more than one file path, and this is returned as the result.

Java

The Java solution is similar to the Python solution but does not use regular expressions. Instead, it uses the split function of the String class to break down the path into root and file parts. Then for each file, it extracts the name and content and adds the full path to a map keyed by the content. The map is used to generate the final list of groups having the same file content.

C++

The C++ solution employs a data structure called stringstream to split the path string into the root directory and file content. The dot (.) operator and relevant substring functions are used to extract file name and content. These paths are stored in an unordered map keyed by the file content. Finally, a vector of vectors captures groups of paths with the same content.

C#

The C# solution uses the Split method of the string class to break down the directory path into root and file parts. It employs a dictionary similar to the Java solution. Lists are used instead of sets to keep the duplicate paths. It finally generates a list of groups with the same file content using a key-value pair traversal.

Javascript

The Javascript solution is analogous to the Java and C# solutions, only the language syntax is different. Here the indexOf function is used to determine the division between the file name and content. If the file content does not exist in the map, an empty array is added against the file content key. Then, the full file path is pushed into this array. Finally, the arrays having more than one path are extracted and returned as the result.

This problem demonstrates the practical usage of hash table (hash map) data structure. In addition to this, familiarity with string manipulation methods and regular expression handling is beneficial to solve such problems.


Got a question? Ask the Teaching Assistant anything you don't understand.

Still not clear? Ask in the Forum,  Discord or Submit the part you don't understand to our editors.


TA 👨‍🏫