Facebook Pixel

1948. Delete Duplicate Folders in System

HardTrieArrayHash TableStringHash Function
Leetcode Link

Problem Description

You have a file system with many duplicate folders due to a bug. Given a 2D array paths where each paths[i] represents an absolute path to a folder (e.g., ["one", "two", "three"] represents /one/two/three), you need to identify and remove all duplicate folder structures.

Two folders are considered identical if they contain the exact same set of non-empty subfolders with the same structure, regardless of their location in the file system. When folders are identified as identical, both folders and all their subfolders must be marked for deletion.

For example, consider these two folder structures:

  • /a with subfolders /a/x, /a/x/y, /a/z
  • /b with subfolders /b/x, /b/x/y, /b/z

These folders /a and /b are identical because they have the same subfolder structure (x, x/y, and z). Therefore, all these paths would be marked for deletion:

  • /a, /a/x, /a/x/y, /a/z
  • /b, /b/x, /b/x/y, /b/z

However, if /b had an additional subfolder /b/w, then /a and /b would no longer be identical. Note that even in this case, /a/x and /b/x could still be identical if they have the same internal structure.

The deletion process runs only once - any folders that become identical after the initial deletion are not removed.

Your task is to return the paths of all remaining folders after deleting all marked folders. The paths can be returned in any order.

Quick Interview Experience
Help others by sharing your interview experience
Have you seen this problem before?

Intuition

To identify duplicate folder structures, we need a way to represent each folder's complete subfolder structure in a comparable format. Think of it like creating a "fingerprint" for each folder based on its contents.

The key insight is that two folders are identical if they have the same subfolders with the same names and those subfolders are also structurally identical. This suggests a recursive approach - we need to compare not just immediate children, but the entire subtree structure.

We can represent each folder's structure as a string that captures both the names of its subfolders and their internal structures. For example, if a folder has subfolders x and z, and x has a subfolder y, we could represent this as something like "x(y())z()". The parentheses capture the hierarchical structure.

To build this representation efficiently, we can:

  1. First construct a tree (trie) from all the paths to understand the folder structure
  2. Traverse the tree from bottom to top (post-order), building string representations for each subtree
  3. Use these string representations as keys to identify duplicates - if two different folders generate the same string representation, they have identical structures

The beauty of this approach is that by processing folders bottom-up, we ensure that when we create a folder's representation, we already know the representations of all its subfolders. This allows us to detect duplicates at any level of the hierarchy.

Once we identify which folders are duplicates (by finding matching string representations), we mark them for deletion. Finally, we traverse the tree again to collect only the paths that weren't marked for deletion.

Learn more about Trie patterns.

Solution Approach

The solution uses a Trie data structure combined with DFS traversal to efficiently identify and remove duplicate folder structures.

Step 1: Build the Trie Structure

First, we create a [Trie](/problems/trie_intro) class with:

  • children: A dictionary mapping folder names to their child nodes
  • deleted: A boolean flag to mark nodes for deletion

We insert all paths into the trie:

root = Trie()
for path in paths:
    cur = root
    for name in path:
        if cur.children[name] is None:
            cur.children[name] = Trie()
        cur = cur.children[name]

Step 2: Generate Unique Signatures for Each Subtree

We use a DFS function dfs(node) that:

  1. Returns an empty string for leaf nodes (folders with no subfolders)
  2. For non-leaf nodes, creates a string representation by:
    • Recursively getting signatures for all child nodes
    • Combining each child's name with its signature in format: "name(child_signature)"
    • Sorting these representations alphabetically to ensure consistent ordering
    • Concatenating them into a single string
def dfs(node: [Trie](/problems/trie_intro)) -> str:
    if not node.children:
        return ""
    subs: List[str] = []
    for name, child in node.children.items():
        subs.append(f"{name}({dfs(child)})")
    s = "".join(sorted(subs))

For example, a folder with subfolders x (containing y) and z (empty) would generate: "x(y())z()"

Step 3: Identify and Mark Duplicates

We maintain a global dictionary g that maps each unique signature to the first node that generated it:

  • If a signature already exists in g, we've found a duplicate
  • Mark both the current node and the previously seen node as deleted
  • Otherwise, add the signature and node to the dictionary
if s in g:
    node.deleted = g[s].deleted = True
else:
    g[s] = node

Step 4: Collect Remaining Paths

Finally, we perform another DFS traversal dfs2(node) to collect all non-deleted paths:

  • Skip any node marked as deleted
  • Build paths by maintaining a running list of folder names
  • Add complete paths to the result when we reach valid nodes
  • Use backtracking (append/pop) to explore all paths
def dfs2(node: [Trie](/problems/trie_intro)) -> None:
    if node.deleted:
        return
    if path:
        ans.append(path[:])
    for name, child in node.children.items():
        path.append(name)
        dfs2(child)
        path.pop()

The time complexity is O(n Γ— m) where n is the number of paths and m is the average path length, as we need to traverse all paths twice. The space complexity is also O(n Γ— m) for storing the trie and the signature dictionary.

Ready to land your dream job?

Unlock your dream job with a 5-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's walk through a small example to illustrate the solution approach.

Input: paths = [["a","x"], ["a","x","y"], ["a","z"], ["b","x"], ["b","x","y"], ["b","z"], ["b","w"]]

Step 1: Build the Trie

We construct a trie from all paths:

root
β”œβ”€β”€ a
β”‚   β”œβ”€β”€ x
β”‚   β”‚   └── y
β”‚   └── z
└── b
    β”œβ”€β”€ x
    β”‚   └── y
    β”œβ”€β”€ z
    └── w

Step 2: Generate Signatures (Bottom-up DFS)

Starting from leaf nodes and working up:

  • Node y under /a/x: No children β†’ signature = ""

  • Node y under /b/x: No children β†’ signature = ""

  • Node z under /a: No children β†’ signature = ""

  • Node z under /b: No children β†’ signature = ""

  • Node w under /b: No children β†’ signature = ""

  • Node x under /a: Has child y with signature ""

    • Create: "y()"
    • Signature = "y()"
  • Node x under /b: Has child y with signature ""

    • Create: "y()"
    • Signature = "y()"
  • Node a: Has children x (signature "y()") and z (signature "")

    • Create: ["x(y())", "z()"]
    • Sort alphabetically: ["x(y())", "z()"]
    • Signature = "x(y())z()"
  • Node b: Has children x (signature "y()"), z (signature ""), and w (signature "")

    • Create: ["x(y())", "z()", "w()"]
    • Sort alphabetically: ["w()", "x(y())", "z()"]
    • Signature = "w()x(y())z()"

Step 3: Identify Duplicates

As we generate signatures, we check for duplicates:

  • /a/x/y signature "": First occurrence, store in dictionary
  • /b/x/y signature "": Already exists! Mark both as deleted βœ—
  • /a/z signature "": Already exists! Mark both as deleted βœ—
  • /b/z signature "": Already exists! Mark both as deleted βœ—
  • /b/w signature "": Already exists! Mark all as deleted βœ—
  • /a/x signature "y()": First occurrence, store in dictionary
  • /b/x signature "y()": Already exists! Mark both as deleted βœ—
  • /a signature "x(y())z()": First occurrence, store in dictionary
  • /b signature "w()x(y())z()": First occurrence, store in dictionary

Step 4: Mark Deletions

After identifying duplicates:

  • /a/x and /b/x are marked as deleted (same signature "y()")
  • All their subfolders are automatically deleted when we skip deleted nodes
  • /a and /b have different signatures due to /b/w, so they remain

Step 5: Collect Remaining Paths

Traversing the trie, skipping deleted nodes:

  • Start at root
  • Visit /a (not deleted) β†’ add ["a"] to result
    • Try /a/x β†’ deleted, skip entire subtree
    • Try /a/z β†’ deleted, skip
  • Visit /b (not deleted) β†’ add ["b"] to result
    • Try /b/x β†’ deleted, skip entire subtree
    • Try /b/z β†’ deleted, skip
    • Visit /b/w (not deleted) β†’ add ["b","w"] to result

Output: [["a"], ["b"], ["b","w"]]

The key insight: /a/x and /b/x had identical structures (both containing only y), so they were removed along with all their subfolders. The parent folders /a and /b survived because /b had an extra subfolder w, making their structures different.

Solution Implementation

1from collections import defaultdict
2from typing import Dict, List
3
4class Trie:
5    def __init__(self):
6        # Dictionary to store child nodes, where key is folder name
7        self.children: Dict[str, "Trie"] = {}
8        # Flag to mark if this subtree should be deleted as duplicate
9        self.deleted: bool = False
10
11
12class Solution:
13    def deleteDuplicateFolder(self, paths: List[List[str]]) -> List[List[str]]:
14        # Build the trie structure from all paths
15        root = Trie()
16        for path in paths:
17            current_node = root
18            for folder_name in path:
19                # Create new node if it doesn't exist
20                if folder_name not in current_node.children:
21                    current_node.children[folder_name] = Trie()
22                current_node = current_node.children[folder_name]
23      
24        # Dictionary to store serialized subtree signatures and their first occurrence
25        signature_to_node: Dict[str, Trie] = {}
26      
27        def serialize_and_mark_duplicates(node: Trie) -> str:
28            """
29            Serialize the subtree rooted at node and mark duplicate subtrees.
30            Returns a unique string representation of the subtree structure.
31            """
32            # Empty folder has empty signature
33            if not node.children:
34                return ""
35          
36            # Build signature from all child subtrees
37            subtree_signatures: List[str] = []
38            for folder_name, child_node in node.children.items():
39                child_signature = serialize_and_mark_duplicates(child_node)
40                # Format: foldername(child_subtree_signature)
41                subtree_signatures.append(f"{folder_name}({child_signature})")
42          
43            # Sort to ensure same structure gives same signature
44            current_signature = "".join(sorted(subtree_signatures))
45          
46            # Check if this subtree structure has been seen before
47            if current_signature in signature_to_node:
48                # Mark both occurrences for deletion
49                node.deleted = True
50                signature_to_node[current_signature].deleted = True
51            else:
52                # First occurrence of this structure
53                signature_to_node[current_signature] = node
54          
55            return current_signature
56      
57        def collect_remaining_paths(node: Trie, current_path: List[str], result: List[List[str]]) -> None:
58            """
59            Traverse the trie and collect all non-deleted paths.
60            """
61            # Skip deleted subtrees
62            if node.deleted:
63                return
64          
65            # Add current path to result (except for root which has empty path)
66            if current_path:
67                result.append(current_path[:])
68          
69            # Recursively process all children
70            for folder_name, child_node in node.children.items():
71                current_path.append(folder_name)
72                collect_remaining_paths(child_node, current_path, result)
73                current_path.pop()
74      
75        # Step 1: Identify and mark duplicate subtrees
76        serialize_and_mark_duplicates(root)
77      
78        # Step 2: Collect all non-deleted paths
79        result: List[List[str]] = []
80        collect_remaining_paths(root, [], result)
81      
82        return result
83
1class Trie {
2    // Map to store child directories, key is directory name, value is the Trie node
3    Map<String, Trie> children;
4    // Flag to mark if this subtree should be deleted due to duplication
5    boolean isDeleted;
6
7    public Trie() {
8        children = new HashMap<>();
9        isDeleted = false;
10    }
11}
12
13class Solution {
14    public List<List<String>> deleteDuplicateFolder(List<List<String>> paths) {
15        // Build the Trie structure from all paths
16        Trie root = new Trie();
17        for (List<String> path : paths) {
18            Trie currentNode = root;
19            for (String folderName : path) {
20                // Create child node if it doesn't exist
21                if (!currentNode.children.containsKey(folderName)) {
22                    currentNode.children.put(folderName, new Trie());
23                }
24                currentNode = currentNode.children.get(folderName);
25            }
26        }
27
28        // Map to store serialized subtree structure -> first occurrence of that structure
29        Map<String, Trie> subtreeMap = new HashMap<>();
30
31        // First DFS: Serialize each subtree and mark duplicates
32        Function<Trie, String> serializeAndMarkDuplicates = new Function<Trie, String>() {
33            @Override
34            public String apply(Trie node) {
35                // Leaf nodes have empty serialization
36                if (node.children.isEmpty()) {
37                    return "";
38                }
39              
40                // Collect serialized representations of all children
41                List<String> childSerializations = new ArrayList<>();
42                for (Map.Entry<String, Trie> entry : node.children.entrySet()) {
43                    String childName = entry.getKey();
44                    Trie childNode = entry.getValue();
45                    // Format: "folderName(childSerialization)"
46                    childSerializations.add(childName + "(" + apply(childNode) + ")");
47                }
48              
49                // Sort to ensure same subtree structure produces same serialization
50                Collections.sort(childSerializations);
51                String serialization = String.join("", childSerializations);
52              
53                // Check if this subtree structure has been seen before
54                if (subtreeMap.containsKey(serialization)) {
55                    // Mark both occurrences as deleted
56                    node.isDeleted = true;
57                    subtreeMap.get(serialization).isDeleted = true;
58                } else {
59                    // First occurrence of this structure
60                    subtreeMap.put(serialization, node);
61                }
62              
63                return serialization;
64            }
65        };
66
67        serializeAndMarkDuplicates.apply(root);
68
69        // Collect remaining paths after deletion
70        List<List<String>> result = new ArrayList<>();
71        List<String> currentPath = new ArrayList<>();
72
73        // Second DFS: Collect all non-deleted paths
74        Function<Trie, Void> collectRemainingPaths = new Function<Trie, Void>() {
75            @Override
76            public Void apply(Trie node) {
77                // Skip deleted subtrees
78                if (node.isDeleted) {
79                    return null;
80                }
81              
82                // Add current path to result (except for root which has empty path)
83                if (!currentPath.isEmpty()) {
84                    result.add(new ArrayList<>(currentPath));
85                }
86              
87                // Recursively process all children
88                for (Map.Entry<String, Trie> entry : node.children.entrySet()) {
89                    currentPath.add(entry.getKey());
90                    apply(entry.getValue());
91                    currentPath.remove(currentPath.size() - 1); // Backtrack
92                }
93              
94                return null;
95            }
96        };
97
98        collectRemainingPaths.apply(root);
99
100        return result;
101    }
102}
103
1class Trie {
2public:
3    // Map from folder name to its corresponding Trie node
4    unordered_map<string, Trie*> children;
5    // Flag to mark if this subtree is a duplicate and should be deleted
6    bool isDeleted = false;
7};
8
9class Solution {
10public:
11    vector<vector<string>> deleteDuplicateFolder(vector<vector<string>>& paths) {
12        // Build the Trie structure from all paths
13        Trie* root = new Trie();
14      
15        // Insert all paths into the Trie
16        for (auto& path : paths) {
17            Trie* currentNode = root;
18            for (auto& folderName : path) {
19                // Create new node if folder doesn't exist
20                if (currentNode->children.find(folderName) == currentNode->children.end()) {
21                    currentNode->children[folderName] = new Trie();
22                }
23                currentNode = currentNode->children[folderName];
24            }
25        }
26
27        // Map to store serialized subtree structure to detect duplicates
28        // Key: serialized subtree string, Value: pointer to first occurrence of this structure
29        unordered_map<string, Trie*> subtreeMap;
30
31        // DFS to serialize each subtree and mark duplicates
32        auto serializeAndMarkDuplicates = [&](this auto&& serializeAndMarkDuplicates, Trie* node) -> string {
33            // Leaf nodes have empty serialization
34            if (node->children.empty()) {
35                return "";
36            }
37
38            // Collect serialized representations of all children
39            vector<string> childSerializations;
40            for (auto& [childName, childNode] : node->children) {
41                string childSerialization = childName + "(" + serializeAndMarkDuplicates(childNode) + ")";
42                childSerializations.push_back(childSerialization);
43            }
44          
45            // Sort to ensure same subtree structure produces same serialization
46            sort(childSerializations.begin(), childSerializations.end());
47          
48            // Concatenate all child serializations
49            string serializedSubtree = "";
50            for (auto& childSerialization : childSerializations) {
51                serializedSubtree += childSerialization;
52            }
53
54            // Check if this subtree structure already exists
55            if (subtreeMap.contains(serializedSubtree)) {
56                // Mark both occurrences as deleted
57                node->isDeleted = true;
58                subtreeMap[serializedSubtree]->isDeleted = true;
59            } else {
60                // First occurrence of this subtree structure
61                subtreeMap[serializedSubtree] = node;
62            }
63          
64            return serializedSubtree;
65        };
66
67        // Serialize and mark duplicate subtrees
68        serializeAndMarkDuplicates(root);
69
70        // Collect remaining paths after deletion
71        vector<vector<string>> result;
72        vector<string> currentPath;
73
74        // DFS to collect all non-deleted paths
75        auto collectPaths = [&](this auto&& collectPaths, Trie* node) -> void {
76            // Skip deleted subtrees
77            if (node->isDeleted) {
78                return;
79            }
80          
81            // Add current path to result (except for root which has empty path)
82            if (!currentPath.empty()) {
83                result.push_back(currentPath);
84            }
85          
86            // Recursively process all children
87            for (auto& [childName, childNode] : node->children) {
88                currentPath.push_back(childName);
89                collectPaths(childNode);
90                currentPath.pop_back();
91            }
92        };
93
94        // Collect all remaining paths
95        collectPaths(root);
96
97        return result;
98    }
99};
100
1// Global Trie node interface
2interface TrieNode {
3    children: { [key: string]: TrieNode };
4    deleted: boolean;
5}
6
7function deleteDuplicateFolder(paths: string[][]): string[][] {
8    // Create root node of the Trie
9    const root: TrieNode = {
10        children: {},
11        deleted: false
12    };
13
14    // Build Trie from all paths
15    for (const path of paths) {
16        let currentNode = root;
17        for (const folderName of path) {
18            // Create child node if it doesn't exist
19            if (!currentNode.children[folderName]) {
20                currentNode.children[folderName] = {
21                    children: {},
22                    deleted: false
23                };
24            }
25            currentNode = currentNode.children[folderName];
26        }
27    }
28
29    // Map to store serialized subtree structures and their corresponding nodes
30    const subtreeMap: { [serialized: string]: TrieNode } = {};
31
32    // DFS to serialize each subtree and mark duplicates
33    const serializeAndMarkDuplicates = (node: TrieNode): string => {
34        // Base case: leaf node returns empty string
35        if (Object.keys(node.children).length === 0) {
36            return '';
37        }
38
39        // Serialize all child subtrees
40        const serializedChildren: string[] = [];
41        for (const [folderName, childNode] of Object.entries(node.children)) {
42            const childSerialization = serializeAndMarkDuplicates(childNode);
43            serializedChildren.push(`${folderName}(${childSerialization})`);
44        }
45      
46        // Sort to ensure consistent serialization for identical structures
47        serializedChildren.sort();
48        const serializedSubtree = serializedChildren.join('');
49
50        // Check if this subtree structure already exists
51        if (subtreeMap[serializedSubtree]) {
52            // Mark both duplicate subtrees for deletion
53            node.deleted = true;
54            subtreeMap[serializedSubtree].deleted = true;
55        } else {
56            // Store first occurrence of this subtree structure
57            subtreeMap[serializedSubtree] = node;
58        }
59      
60        return serializedSubtree;
61    };
62
63    // Execute first DFS to identify and mark duplicates
64    serializeAndMarkDuplicates(root);
65
66    // Result array to store remaining paths
67    const result: string[][] = [];
68    // Current path being constructed during traversal
69    const currentPath: string[] = [];
70
71    // DFS to collect all non-deleted paths
72    const collectRemainingPaths = (node: TrieNode): void => {
73        // Skip deleted subtrees
74        if (node.deleted) {
75            return;
76        }
77      
78        // Add current path to result if it's not empty (not root)
79        if (currentPath.length > 0) {
80            result.push([...currentPath]);
81        }
82      
83        // Traverse all children
84        for (const [folderName, childNode] of Object.entries(node.children)) {
85            currentPath.push(folderName);
86            collectRemainingPaths(childNode);
87            currentPath.pop();
88        }
89    };
90
91    // Execute second DFS to collect remaining paths
92    collectRemainingPaths(root);
93
94    return result;
95}
96

Time and Space Complexity

Time Complexity: O(N * M * L + N * M * log(M))

Where:

  • N = number of paths
  • M = average number of folders in each path (depth)
  • L = average length of folder names

Breaking down the operations:

  1. Building the Trie: O(N * M) - iterating through all paths and inserting each folder
  2. First DFS (serialization):
    • Each node is visited once: O(total nodes)
    • At each node, we create a serialization string by iterating through children and sorting them
    • Sorting at each node: O(K * log(K)) where K is the number of children
    • String concatenation: O(K * L) for each node
    • Overall: O(N * M * L + N * M * log(M)) in the worst case
  3. Second DFS (collecting results): O(N * M) - visiting non-deleted nodes

The dominant term is O(N * M * L + N * M * log(M)) due to the serialization and sorting operations.

Space Complexity: O(N * M * L)

Breaking down the space usage:

  1. Trie structure: O(N * M) - storing all nodes
  2. HashMap g for duplicate detection: O(N * M * L) - storing serialized strings as keys
  3. Recursion stack depth: O(M) - maximum depth of the folder structure
  4. Result storage: O(N * M) - storing the output paths
  5. Temporary strings during serialization: O(M * L) per recursive call

The dominant term is O(N * M * L) due to storing the serialized representations in the hashmap.

Learn more about how to find time and space complexity quickly.

Common Pitfalls

1. Incorrect Handling of Empty Folders

A critical pitfall is misunderstanding what constitutes an "empty" folder. In this problem, a folder is considered empty (leaf node) only if it has NO subfolders at all. However, developers often mistakenly treat folders that appear in the input paths as having content.

Problem Example:

# Given paths: [["a"], ["a", "x"], ["b"], ["b", "x"]]
# Incorrect assumption: /a and /b are non-empty because they appear as paths
# Correct understanding: /a/x and /b/x are empty (leaf nodes)

Solution: Always check node.children to determine if a folder is empty, not whether it appears as a complete path in the input.

2. Signature Collision Due to Poor Serialization

Without proper delimiters in the serialization format, different folder structures can generate identical signatures.

Problem Example:

# Without delimiters:
# /a/bc with empty child β†’ signature: "bc"
# /ab/c with empty child β†’ signature: "abc"  
# These are different but might generate the same signature!

# Incorrect serialization:
signature = name + child_signature  # No delimiter!

Solution: Always use delimiters like parentheses in the format "name(child_signature)" to ensure unique signatures:

subtree_signatures.append(f"{folder_name}({child_signature})")

3. Forgetting to Sort Child Signatures

The order of children in the trie depends on the input order, but identical structures should generate the same signature regardless of order.

Problem Example:

# /a has children: x, y
# /b has children: y, x (different order)
# Without sorting: signatures would be "x()y()" vs "y()x()" - incorrectly different!

Solution: Always sort the child signatures before concatenation:

current_signature = "".join(sorted(subtree_signatures))

4. Modifying the Trie During First Traversal

Attempting to delete nodes immediately during the duplicate detection phase can corrupt the tree structure and cause missed duplicates.

Problem Example:

# Wrong approach - deleting immediately:
if current_signature in signature_to_node:
    # DON'T do this:
    del node.children  # This breaks the traversal!

Solution: Use a two-phase approach:

  1. First pass: Mark nodes with deleted = True flag
  2. Second pass: Collect only non-deleted paths

5. Including Root Path in Results

The root node represents the file system root ("/") and should never be included in the output.

Problem Example:

# Wrong - might add empty path for root:
def collect_paths(node, path, result):
    result.append(path[:])  # This adds [] for root!

Solution: Check if the current path is non-empty before adding:

if current_path:  # Only add non-empty paths
    result.append(current_path[:])

6. Not Creating a Copy of Path When Adding to Results

Python lists are mutable references. Without copying, all results would point to the same list object.

Problem Example:

# Wrong - all results will be the same (last) path:
result.append(current_path)  # Adds reference, not copy!

Solution: Always create a copy when adding to results:

result.append(current_path[:])  # Creates a shallow copy
Discover Your Strengths and Weaknesses: Take Our 5-Minute Quiz to Tailor Your Study Plan:

What are the two properties the problem needs to have for dynamic programming to be applicable? (Select 2)


Recommended Readings

Want a Structured Path to Master System Design Too? Don’t Miss This!

Load More