1948. Delete Duplicate Folders in System
Problem Description
You have a file system with many duplicate folders due to a bug. Given a 2D array `paths` where each `paths[i]` represents an absolute path to a folder (e.g., `["one", "two", "three"]` represents `/one/two/three`), you need to identify and remove all duplicate folder structures.
Two folders are considered identical if they contain the same non-empty set of identical subfolders, regardless of their location in the file system. When folders are identified as identical, both folders and all their subfolders must be marked for deletion.
For example, consider these two folder structures:
- `/a` with subfolders `/a/x`, `/a/x/y`, `/a/z`
- `/b` with subfolders `/b/x`, `/b/x/y`, `/b/z`

These folders `/a` and `/b` are identical because they have the same subfolder structure (`x`, `x/y`, and `z`). Therefore, all these paths would be marked for deletion:
- `/a`, `/a/x`, `/a/x/y`, `/a/z`
- `/b`, `/b/x`, `/b/x/y`, `/b/z`
However, if `/b` had an additional subfolder `/b/w`, then `/a` and `/b` would no longer be identical. Note that even in this case, `/a/x` and `/b/x` could still be identical if they have the same internal structure.
The deletion process runs only once - any folders that become identical after the initial deletion are not removed.
Your task is to return the paths of all remaining folders after deleting all marked folders. The paths can be returned in any order.
Intuition
To identify duplicate folder structures, we need a way to represent each folder's complete subfolder structure in a comparable format. Think of it like creating a "fingerprint" for each folder based on its contents.
The key insight is that two folders are identical if they have the same subfolders with the same names and those subfolders are also structurally identical. This suggests a recursive approach - we need to compare not just immediate children, but the entire subtree structure.
We can represent each folder's structure as a string that captures both the names of its subfolders and their internal structures. For example, if a folder has subfolders `x` and `z`, and `x` has a subfolder `y`, we could represent this as something like `"x(y())z()"`. The parentheses capture the hierarchical structure.
To build this representation efficiently, we can:
- First construct a tree (trie) from all the paths to understand the folder structure
- Traverse the tree from bottom to top (post-order), building string representations for each subtree
- Use these string representations as keys to identify duplicates - if two different folders generate the same string representation, they have identical structures
The beauty of this approach is that by processing folders bottom-up, we ensure that when we create a folder's representation, we already know the representations of all its subfolders. This allows us to detect duplicates at any level of the hierarchy.
Once we identify which folders are duplicates (by finding matching string representations), we mark them for deletion. Finally, we traverse the tree again to collect only the paths that weren't marked for deletion.
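The fingerprint idea can be sketched on its own, using a nested dict as a stand-in for the folder tree (the names and the `signature` helper here are illustrative, not part of the solution below):

```python
# Minimal sketch of the folder "fingerprint": serialize a nested dict
# bottom-up, sorting children so insertion order doesn't matter.
def signature(folder: dict) -> str:
    # An empty folder (no subfolders) serializes to the empty string.
    if not folder:
        return ""
    # Emit name(child_signature) per child, sorted for a canonical order.
    return "".join(f"{name}({signature(sub)})" for name, sub in sorted(folder.items()))

a = {"x": {"y": {}}, "z": {}}  # /a with /a/x, /a/x/y, /a/z
b = {"z": {}, "x": {"y": {}}}  # same structure, different insertion order
print(signature(a))  # x(y())z()
assert signature(a) == signature(b)
```

Because both trees produce the same string, a dictionary keyed on these signatures spots the duplicate immediately.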
Solution Approach
The solution uses a Trie data structure combined with DFS traversal to efficiently identify and remove duplicate folder structures.
Step 1: Build the Trie Structure
First, we create a `Trie` class with:
- `children`: a dictionary mapping folder names to their child nodes
- `deleted`: a boolean flag to mark nodes for deletion

We insert all paths into the trie:

```python
root = Trie()
for path in paths:
    cur = root
    for name in path:
        if name not in cur.children:
            cur.children[name] = Trie()
        cur = cur.children[name]
```
Step 2: Generate Unique Signatures for Each Subtree
We use a DFS function `dfs(node)` that:
- Returns an empty string for leaf nodes (folders with no subfolders)
- For non-leaf nodes, creates a string representation by:
  - Recursively getting signatures for all child nodes
  - Combining each child's name with its signature in the format `"name(child_signature)"`
  - Sorting these representations alphabetically to ensure consistent ordering
  - Concatenating them into a single string
```python
def dfs(node: Trie) -> str:
    if not node.children:
        return ""
    subs: List[str] = []
    for name, child in node.children.items():
        subs.append(f"{name}({dfs(child)})")
    s = "".join(sorted(subs))
```

For example, a folder with subfolders `x` (containing `y`) and `z` (empty) would generate `"x(y())z()"`.
Step 3: Identify and Mark Duplicates
We maintain a global dictionary `g` that maps each unique signature to the first node that generated it:
- If a signature already exists in `g`, we've found a duplicate: mark both the current node and the previously seen node as deleted
- Otherwise, add the signature and node to the dictionary

```python
if s in g:
    node.deleted = g[s].deleted = True
else:
    g[s] = node
```
Step 4: Collect Remaining Paths
Finally, we perform another DFS traversal `dfs2(node)` to collect all non-deleted paths:
- Skip any node marked as `deleted`
- Build paths by maintaining a running list of folder names
- Add complete paths to the result when we reach valid nodes
- Use backtracking (append/pop) to explore all paths

```python
def dfs2(node: Trie) -> None:
    if node.deleted:
        return
    if path:
        ans.append(path[:])
    for name, child in node.children.items():
        path.append(name)
        dfs2(child)
        path.pop()
```
The time complexity is `O(n × m)`, where `n` is the number of paths and `m` is the average path length, as we need to traverse all paths twice. The space complexity is also `O(n × m)` for storing the trie and the signature dictionary.
Example Walkthrough
Let's walk through a small example to illustrate the solution approach.
Input: paths = [["a","x"], ["a","x","y"], ["a","z"], ["b","x"], ["b","x","y"], ["b","z"], ["b","w"]]
Step 1: Build the Trie
We construct a trie from all paths:
```
root
├── a
│   ├── x
│   │   └── y
│   └── z
└── b
    ├── x
    │   └── y
    ├── z
    └── w
```
Step 2: Generate Signatures (Bottom-up DFS)
Starting from leaf nodes and working up:
- Node `y` under `/a/x`: no children → signature `""`
- Node `y` under `/b/x`: no children → signature `""`
- Node `z` under `/a`: no children → signature `""`
- Node `z` under `/b`: no children → signature `""`
- Node `w` under `/b`: no children → signature `""`
- Node `x` under `/a`: has child `y` with signature `""` → create `"y()"` → signature `"y()"`
- Node `x` under `/b`: has child `y` with signature `""` → create `"y()"` → signature `"y()"`
- Node `a`: has children `x` (signature `"y()"`) and `z` (signature `""`) → create `["x(y())", "z()"]` → sort alphabetically: `["x(y())", "z()"]` → signature `"x(y())z()"`
- Node `b`: has children `x` (signature `"y()"`), `z` (signature `""`), and `w` (signature `""`) → create `["x(y())", "z()", "w()"]` → sort alphabetically: `["w()", "x(y())", "z()"]` → signature `"w()x(y())z()"`
Step 3: Identify Duplicates
As we generate signatures, we check each non-leaf signature against the dictionary. Note that empty folders return `""` before the dictionary lookup, so they are never registered or compared; this matches the rule that identical folders must share a non-empty set of subfolders:
- `/a/x` signature `"y()"`: first occurrence, store in dictionary
- `/b/x` signature `"y()"`: already exists! Mark both `/a/x` and `/b/x` as deleted
- `/a` signature `"x(y())z()"`: first occurrence, store in dictionary
- `/b` signature `"w()x(y())z()"`: first occurrence, store in dictionary
Step 4: Mark Deletions
After identifying duplicates:
- `/a/x` and `/b/x` are marked as deleted (same signature `"y()"`)
- All their subfolders (`/a/x/y` and `/b/x/y`) are automatically removed when we skip deleted nodes
- `/a` and `/b` have different signatures due to `/b/w`, so they remain
- The empty folders `/a/z`, `/b/z`, and `/b/w` were never compared, so they also remain
Step 5: Collect Remaining Paths
- Start at root
- Visit `/a` (not deleted) → add `["a"]` to result
  - Try `/a/x` → deleted, skip entire subtree
  - Visit `/a/z` (not deleted) → add `["a","z"]` to result
- Visit `/b` (not deleted) → add `["b"]` to result
  - Try `/b/x` → deleted, skip entire subtree
  - Visit `/b/z` (not deleted) → add `["b","z"]` to result
  - Visit `/b/w` (not deleted) → add `["b","w"]` to result
Output: `[["a"], ["a","z"], ["b"], ["b","z"], ["b","w"]]`
The key insight: `/a/x` and `/b/x` had identical structures (both containing only `y`), so they were removed along with all their subfolders. The parent folders `/a` and `/b` survived because `/b` had an extra subfolder `w`, making their structures different, and the empty folders `/a/z`, `/b/z`, and `/b/w` survived because empty folders are never compared against each other.
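To double-check the walkthrough end-to-end, the whole approach fits in a compact sketch (same trie-plus-signature technique as the full implementations below; the helper name `delete_duplicate_folder` and the list-based node encoding are just for this demo):

```python
def delete_duplicate_folder(paths):
    # Each trie node is a pair: [children dict, deleted flag].
    root = [{}, False]
    for path in paths:
        cur = root
        for name in path:
            cur = cur[0].setdefault(name, [{}, False])

    seen = {}  # signature -> first node that produced it

    def serialize(node):
        if not node[0]:  # empty folders are never registered or compared
            return ""
        sig = "".join(sorted(f"{n}({serialize(c)})" for n, c in node[0].items()))
        if sig in seen:
            node[1] = seen[sig][1] = True  # mark both duplicates
        else:
            seen[sig] = node
        return sig

    serialize(root)

    ans, path = [], []

    def collect(node):
        if node[1]:  # skip deleted subtrees entirely
            return
        if path:
            ans.append(path[:])
        for name, child in node[0].items():
            path.append(name)
            collect(child)
            path.pop()

    collect(root)
    return ans

paths = [["a","x"], ["a","x","y"], ["a","z"], ["b","x"],
         ["b","x","y"], ["b","z"], ["b","w"]]
print(delete_duplicate_folder(paths))
# [['a'], ['a', 'z'], ['b'], ['b', 'z'], ['b', 'w']]
```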
Solution Implementation
```python
from typing import Dict, List


class Trie:
    def __init__(self):
        # Dictionary to store child nodes, where key is folder name
        self.children: Dict[str, "Trie"] = {}
        # Flag to mark if this subtree should be deleted as duplicate
        self.deleted: bool = False


class Solution:
    def deleteDuplicateFolder(self, paths: List[List[str]]) -> List[List[str]]:
        # Build the trie structure from all paths
        root = Trie()
        for path in paths:
            current_node = root
            for folder_name in path:
                # Create new node if it doesn't exist
                if folder_name not in current_node.children:
                    current_node.children[folder_name] = Trie()
                current_node = current_node.children[folder_name]

        # Dictionary to store serialized subtree signatures and their first occurrence
        signature_to_node: Dict[str, Trie] = {}

        def serialize_and_mark_duplicates(node: Trie) -> str:
            """
            Serialize the subtree rooted at node and mark duplicate subtrees.
            Returns a unique string representation of the subtree structure.
            """
            # Empty folder has empty signature
            if not node.children:
                return ""

            # Build signature from all child subtrees
            subtree_signatures: List[str] = []
            for folder_name, child_node in node.children.items():
                child_signature = serialize_and_mark_duplicates(child_node)
                # Format: foldername(child_subtree_signature)
                subtree_signatures.append(f"{folder_name}({child_signature})")

            # Sort to ensure same structure gives same signature
            current_signature = "".join(sorted(subtree_signatures))

            # Check if this subtree structure has been seen before
            if current_signature in signature_to_node:
                # Mark both occurrences for deletion
                node.deleted = True
                signature_to_node[current_signature].deleted = True
            else:
                # First occurrence of this structure
                signature_to_node[current_signature] = node

            return current_signature

        def collect_remaining_paths(node: Trie, current_path: List[str], result: List[List[str]]) -> None:
            """
            Traverse the trie and collect all non-deleted paths.
            """
            # Skip deleted subtrees
            if node.deleted:
                return

            # Add current path to result (except for root which has empty path)
            if current_path:
                result.append(current_path[:])

            # Recursively process all children
            for folder_name, child_node in node.children.items():
                current_path.append(folder_name)
                collect_remaining_paths(child_node, current_path, result)
                current_path.pop()

        # Step 1: Identify and mark duplicate subtrees
        serialize_and_mark_duplicates(root)

        # Step 2: Collect all non-deleted paths
        result: List[List[str]] = []
        collect_remaining_paths(root, [], result)

        return result
```
```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

class Trie {
    // Map to store child directories, key is directory name, value is the Trie node
    Map<String, Trie> children;
    // Flag to mark if this subtree should be deleted due to duplication
    boolean isDeleted;

    public Trie() {
        children = new HashMap<>();
        isDeleted = false;
    }
}

class Solution {
    public List<List<String>> deleteDuplicateFolder(List<List<String>> paths) {
        // Build the Trie structure from all paths
        Trie root = new Trie();
        for (List<String> path : paths) {
            Trie currentNode = root;
            for (String folderName : path) {
                // Create child node if it doesn't exist
                if (!currentNode.children.containsKey(folderName)) {
                    currentNode.children.put(folderName, new Trie());
                }
                currentNode = currentNode.children.get(folderName);
            }
        }

        // Map to store serialized subtree structure -> first occurrence of that structure
        Map<String, Trie> subtreeMap = new HashMap<>();

        // First DFS: Serialize each subtree and mark duplicates
        Function<Trie, String> serializeAndMarkDuplicates = new Function<Trie, String>() {
            @Override
            public String apply(Trie node) {
                // Leaf nodes have empty serialization
                if (node.children.isEmpty()) {
                    return "";
                }

                // Collect serialized representations of all children
                List<String> childSerializations = new ArrayList<>();
                for (Map.Entry<String, Trie> entry : node.children.entrySet()) {
                    String childName = entry.getKey();
                    Trie childNode = entry.getValue();
                    // Format: "folderName(childSerialization)"
                    childSerializations.add(childName + "(" + apply(childNode) + ")");
                }

                // Sort to ensure same subtree structure produces same serialization
                Collections.sort(childSerializations);
                String serialization = String.join("", childSerializations);

                // Check if this subtree structure has been seen before
                if (subtreeMap.containsKey(serialization)) {
                    // Mark both occurrences as deleted
                    node.isDeleted = true;
                    subtreeMap.get(serialization).isDeleted = true;
                } else {
                    // First occurrence of this structure
                    subtreeMap.put(serialization, node);
                }

                return serialization;
            }
        };

        serializeAndMarkDuplicates.apply(root);

        // Collect remaining paths after deletion
        List<List<String>> result = new ArrayList<>();
        List<String> currentPath = new ArrayList<>();

        // Second DFS: Collect all non-deleted paths
        Function<Trie, Void> collectRemainingPaths = new Function<Trie, Void>() {
            @Override
            public Void apply(Trie node) {
                // Skip deleted subtrees
                if (node.isDeleted) {
                    return null;
                }

                // Add current path to result (except for root which has empty path)
                if (!currentPath.isEmpty()) {
                    result.add(new ArrayList<>(currentPath));
                }

                // Recursively process all children
                for (Map.Entry<String, Trie> entry : node.children.entrySet()) {
                    currentPath.add(entry.getKey());
                    apply(entry.getValue());
                    currentPath.remove(currentPath.size() - 1); // Backtrack
                }

                return null;
            }
        };

        collectRemainingPaths.apply(root);

        return result;
    }
}
```
```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;

class Trie {
public:
    // Map from folder name to its corresponding Trie node
    unordered_map<string, Trie*> children;
    // Flag to mark if this subtree is a duplicate and should be deleted
    bool isDeleted = false;
};

class Solution {
public:
    vector<vector<string>> deleteDuplicateFolder(vector<vector<string>>& paths) {
        // Build the Trie structure from all paths
        Trie* root = new Trie();

        // Insert all paths into the Trie
        for (auto& path : paths) {
            Trie* currentNode = root;
            for (auto& folderName : path) {
                // Create new node if folder doesn't exist
                if (currentNode->children.find(folderName) == currentNode->children.end()) {
                    currentNode->children[folderName] = new Trie();
                }
                currentNode = currentNode->children[folderName];
            }
        }

        // Map to store serialized subtree structure to detect duplicates
        // Key: serialized subtree string, Value: pointer to first occurrence of this structure
        unordered_map<string, Trie*> subtreeMap;

        // DFS to serialize each subtree and mark duplicates (C++23 deducing-this lambda)
        auto serializeAndMarkDuplicates = [&](this auto&& serializeAndMarkDuplicates, Trie* node) -> string {
            // Leaf nodes have empty serialization
            if (node->children.empty()) {
                return "";
            }

            // Collect serialized representations of all children
            vector<string> childSerializations;
            for (auto& [childName, childNode] : node->children) {
                string childSerialization = childName + "(" + serializeAndMarkDuplicates(childNode) + ")";
                childSerializations.push_back(childSerialization);
            }

            // Sort to ensure same subtree structure produces same serialization
            sort(childSerializations.begin(), childSerializations.end());

            // Concatenate all child serializations
            string serializedSubtree = "";
            for (auto& childSerialization : childSerializations) {
                serializedSubtree += childSerialization;
            }

            // Check if this subtree structure already exists
            if (subtreeMap.contains(serializedSubtree)) {
                // Mark both occurrences as deleted
                node->isDeleted = true;
                subtreeMap[serializedSubtree]->isDeleted = true;
            } else {
                // First occurrence of this subtree structure
                subtreeMap[serializedSubtree] = node;
            }

            return serializedSubtree;
        };

        // Serialize and mark duplicate subtrees
        serializeAndMarkDuplicates(root);

        // Collect remaining paths after deletion
        vector<vector<string>> result;
        vector<string> currentPath;

        // DFS to collect all non-deleted paths
        auto collectPaths = [&](this auto&& collectPaths, Trie* node) -> void {
            // Skip deleted subtrees
            if (node->isDeleted) {
                return;
            }

            // Add current path to result (except for root which has empty path)
            if (!currentPath.empty()) {
                result.push_back(currentPath);
            }

            // Recursively process all children
            for (auto& [childName, childNode] : node->children) {
                currentPath.push_back(childName);
                collectPaths(childNode);
                currentPath.pop_back();
            }
        };

        // Collect all remaining paths
        collectPaths(root);

        return result;
    }
};
```
```typescript
// Global Trie node interface
interface TrieNode {
    children: { [key: string]: TrieNode };
    deleted: boolean;
}

function deleteDuplicateFolder(paths: string[][]): string[][] {
    // Create root node of the Trie
    const root: TrieNode = {
        children: {},
        deleted: false
    };

    // Build Trie from all paths
    for (const path of paths) {
        let currentNode = root;
        for (const folderName of path) {
            // Create child node if it doesn't exist
            if (!currentNode.children[folderName]) {
                currentNode.children[folderName] = {
                    children: {},
                    deleted: false
                };
            }
            currentNode = currentNode.children[folderName];
        }
    }

    // Map to store serialized subtree structures and their corresponding nodes
    const subtreeMap: { [serialized: string]: TrieNode } = {};

    // DFS to serialize each subtree and mark duplicates
    const serializeAndMarkDuplicates = (node: TrieNode): string => {
        // Base case: leaf node returns empty string
        if (Object.keys(node.children).length === 0) {
            return '';
        }

        // Serialize all child subtrees
        const serializedChildren: string[] = [];
        for (const [folderName, childNode] of Object.entries(node.children)) {
            const childSerialization = serializeAndMarkDuplicates(childNode);
            serializedChildren.push(`${folderName}(${childSerialization})`);
        }

        // Sort to ensure consistent serialization for identical structures
        serializedChildren.sort();
        const serializedSubtree = serializedChildren.join('');

        // Check if this subtree structure already exists
        if (subtreeMap[serializedSubtree]) {
            // Mark both duplicate subtrees for deletion
            node.deleted = true;
            subtreeMap[serializedSubtree].deleted = true;
        } else {
            // Store first occurrence of this subtree structure
            subtreeMap[serializedSubtree] = node;
        }

        return serializedSubtree;
    };

    // Execute first DFS to identify and mark duplicates
    serializeAndMarkDuplicates(root);

    // Result array to store remaining paths
    const result: string[][] = [];
    // Current path being constructed during traversal
    const currentPath: string[] = [];

    // DFS to collect all non-deleted paths
    const collectRemainingPaths = (node: TrieNode): void => {
        // Skip deleted subtrees
        if (node.deleted) {
            return;
        }

        // Add current path to result if it's not empty (not root)
        if (currentPath.length > 0) {
            result.push([...currentPath]);
        }

        // Traverse all children
        for (const [folderName, childNode] of Object.entries(node.children)) {
            currentPath.push(folderName);
            collectRemainingPaths(childNode);
            currentPath.pop();
        }
    };

    // Execute second DFS to collect remaining paths
    collectRemainingPaths(root);

    return result;
}
```
Time and Space Complexity
Time Complexity: `O(N * M * L + N * M * log M)`

Where:
- `N` = number of paths
- `M` = average number of folders in each path (depth)
- `L` = average length of folder names

Breaking down the operations:
- Building the Trie: `O(N * M)` for iterating through all paths and inserting each folder
- First DFS (serialization):
  - Each node is visited once: `O(total nodes)`
  - At each node, we create a serialization string by iterating through children and sorting them
  - Sorting at each node: `O(K * log K)` where `K` is the number of children
  - String concatenation: `O(K * L)` for each node
  - Overall: `O(N * M * L + N * M * log M)` in the worst case
- Second DFS (collecting results): `O(N * M)` for visiting non-deleted nodes

The dominant term is `O(N * M * L + N * M * log M)` due to the serialization and sorting operations.

Space Complexity: `O(N * M * L)`

Breaking down the space usage:
- Trie structure: `O(N * M)` for storing all nodes
- Signature hashmap (`signature_to_node`) for duplicate detection: `O(N * M * L)` for storing serialized strings as keys
- Recursion stack depth: `O(M)`, the maximum depth of the folder structure
- Result storage: `O(N * M)` for storing the output paths
- Temporary strings during serialization: `O(M * L)` per recursive call

The dominant term is `O(N * M * L)` due to storing the serialized representations in the hashmap.
Common Pitfalls
1. Incorrect Handling of Empty Folders
A critical pitfall is misunderstanding what constitutes an "empty" folder. In this problem, a folder is considered empty (leaf node) only if it has NO subfolders at all. However, developers often mistakenly treat folders that appear in the input paths as having content.
Problem Example:
```python
# Given paths: [["a"], ["a", "x"], ["b"], ["b", "x"]]
# Incorrect assumption: /a and /b are non-empty because they appear as paths
# Correct understanding: /a/x and /b/x are empty (leaf nodes)
```
Solution: Always check `node.children` to determine if a folder is empty, not whether it appears as a complete path in the input.
2. Signature Collision Due to Poor Serialization
Without proper delimiters in the serialization format, different folder structures can generate identical signatures.
Problem Example:
```python
# Without delimiters (signature = name + child_signature):
# /a/bc (name "a", empty child "bc")  -> serializes to "abc"
# /ab/c (name "ab", empty child "c")  -> also serializes to "abc"
# Different structures, identical signatures - a collision!
```
Solution: Always use delimiters like parentheses in the format `"name(child_signature)"` to ensure unique signatures:
```python
subtree_signatures.append(f"{folder_name}({child_signature})")
```
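The collision is easy to reproduce; `bad_sig` and `good_sig` below are hypothetical helpers written for this demo only:

```python
def bad_sig(tree: dict) -> str:
    # No delimiters: just concatenate names and child signatures.
    return "".join(name + bad_sig(sub) for name, sub in sorted(tree.items()))

def good_sig(tree: dict) -> str:
    # Parentheses delimit each child's subtree.
    return "".join(f"{name}({good_sig(sub)})" for name, sub in sorted(tree.items()))

t1 = {"a": {"bc": {}}}   # /a/bc
t2 = {"ab": {"c": {}}}   # /ab/c
assert bad_sig(t1) == bad_sig(t2) == "abc"  # collision!
assert good_sig(t1) == "a(bc())"
assert good_sig(t2) == "ab(c())"            # delimiters keep them distinct
```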
3. Forgetting to Sort Child Signatures
The order of children in the trie depends on the input order, but identical structures should generate the same signature regardless of order.
Problem Example:
```python
# /a has children: x, y
# /b has children: y, x (different order)
# Without sorting: signatures would be "x()y()" vs "y()x()" - incorrectly different!
```
Solution: Always sort the child signatures before concatenation:
```python
current_signature = "".join(sorted(subtree_signatures))
```
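A quick check of why sorting matters (the child lists here are illustrative):

```python
# /a inserted its children as x then y; /b inserted y then x.
subs_a = ["x()", "y()"]
subs_b = ["y()", "x()"]
assert "".join(subs_a) != "".join(subs_b)  # order-dependent without sorting
assert "".join(sorted(subs_a)) == "".join(sorted(subs_b)) == "x()y()"
```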
4. Modifying the Trie During First Traversal
Attempting to delete nodes immediately during the duplicate detection phase can corrupt the tree structure and cause missed duplicates.
Problem Example:
```python
# Wrong approach - deleting immediately:
if current_signature in signature_to_node:
    del node.children  # DON'T do this - it breaks the traversal!
```
Solution: Use a two-phase approach:
- First pass: Mark nodes with
deleted = True
flag - Second pass: Collect only non-deleted paths
5. Including Root Path in Results
The root node represents the file system root ("/") and should never be included in the output.
Problem Example:
```python
# Wrong - might add empty path for root:
def collect_paths(node, path, result):
    result.append(path[:])  # This adds [] for root!
```
Solution: Check if the current path is non-empty before adding:
```python
if current_path:  # Only add non-empty paths
    result.append(current_path[:])
```
6. Not Creating a Copy of Path When Adding to Results
Python lists are mutable references. Without copying, all results would point to the same list object.
Problem Example:
```python
# Wrong - all results will be the same (last) path:
result.append(current_path)  # Adds reference, not copy!
```
Solution: Always create a copy when adding to results:
```python
result.append(current_path[:])  # Creates a shallow copy
```
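The aliasing bug is visible in a two-line trace (a standalone demo, not part of the solution):

```python
result_wrong, result_right, path = [], [], []
for name in ["a", "b"]:
    path.append(name)
    result_wrong.append(path)     # stores a reference to the live list
    result_right.append(path[:])  # stores a snapshot of this moment
assert result_wrong == [["a", "b"], ["a", "b"]]  # both entries were mutated
assert result_right == [["a"], ["a", "b"]]
```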