652. Find Duplicate Subtrees


Problem Description

In this problem, we are given a binary tree and we have to identify all the subtrees that appear more than once within that tree. A subtree consists of a node and all of its descendants. Subtrees are considered duplicates if they have identical structure and node values. That is to say, if you took two duplicate subtrees and compared them, they would be indistinguishable in terms of both their shape and the values held in each node. The problem requires returning a list where each element is the root node of one of the duplicate subtrees. If there are multiple duplicates of the same subtree, we only need to include one instance of it in our answer.

Intuition

The core idea of finding duplicate subtrees is to use a traversal algorithm to serialize or represent each subtree in a unique way so that we can compare them directly. Serialization means converting the structure of a subtree into a string (or any other form of a data structure that can be used for comparison purposes) which, if two serialized strings are the same, ensures that the subtrees they represent are identical.

A common method is to use a post-order depth-first search (DFS) to serialize the tree. Post-order traversal means we visit a node's left child, then its right child, and finally the node itself. We define a helper function dfs to perform the traversal and serialize the subtrees as it recursively processes the tree. For each node, we create a string that includes the node's value and the serialization of its left and right children. This way, identical subtrees yield the same serialized string.

We store the frequency of each serialized subtree in a counter (a hashmap), and if we come across the same serialization again, it means we have found a duplicate subtree. We keep a list called ans where we add the root of a subtree when we encounter its serialization the second time (when the count becomes 2). We continue this process until every subtree has been serialized and processed, ending up with a list of roots for all duplicate subtrees.

Learn more about Tree, Depth-First Search and Binary Tree patterns.

Not Sure What to Study? Take the 2-min Quiz to Find Your Missing Piece:

Given a sorted array of integers and an integer called target, find the element that equals to the target and return its index. Select the correct code that fills the ___ in the given code snippet.

1def binary_search(arr, target):
2    left, right = 0, len(arr) - 1
3    while left ___ right:
4        mid = (left + right) // 2
5        if arr[mid] == target:
6            return mid
7        if arr[mid] < target:
8            ___ = mid + 1
9        else:
10            ___ = mid - 1
11    return -1
12
1public static int binarySearch(int[] arr, int target) {
2    int left = 0;
3    int right = arr.length - 1;
4
5    while (left ___ right) {
6        int mid = left + (right - left) / 2;
7        if (arr[mid] == target) return mid;
8        if (arr[mid] < target) {
9            ___ = mid + 1;
10        } else {
11            ___ = mid - 1;
12        }
13    }
14    return -1;
15}
16
1function binarySearch(arr, target) {
2    let left = 0;
3    let right = arr.length - 1;
4
5    while (left ___ right) {
6        let mid = left + Math.trunc((right - left) / 2);
7        if (arr[mid] == target) return mid;
8        if (arr[mid] < target) {
9            ___ = mid + 1;
10        } else {
11            ___ = mid - 1;
12        }
13    }
14    return -1;
15}
16

Solution Approach

The proposed solution involves implementing a post-order depth-first search (DFS) to traverse the tree, serialize each subtree, and use a counter to detect duplicates. The steps in the implementation and the data structures involved are described below:

  1. Define a helper function dfs(root) that will:

    • Take the current root of a subtree as an input.
    • Base case: If the current root is None, return a special placeholder ('#') to indicate a null node.
    • Recur for the left child and right child, serializing both subtrees.
    • Construct the serialized version of the current subtree, denoted by v, using the value of the current node and the representations of the left and right subtrees in a comma-separated format (e.g., val,left_repr,right_repr).
  2. Use a Counter, which is a special dictionary from Python's collections module that keeps track of the frequency of each serialized subtree representation.

  3. In the main findDuplicateSubtrees method, define an empty list ans to store the roots of detected duplicate subtrees.

  4. Call the helper function dfs(root) to start the process from the root of the entire tree.

  5. Within the dfs function, after getting the serialized representation of the current subtree:

    • Increment the count for this serialized form in the counter.
    • If the count is exactly 2, it means we've found a duplicate for the first time, so we append the current root to ans (we only add the subtrees when their count hits 2 to ensure we don't add duplicates of duplicates).
  6. Once the DFS and serialization process completes, return the list ans which now contains the roots of all duplicate subtrees.

The choice of using a post-order DFS is strategic for a couple of reasons:

  • When you process a node, its left and right subtrees have already been serialized and compared, so the comparison of the structure (shape) is naturally included.
  • The construction of strings for serialization allows an accurate and efficiently comparable representation of each subtree, making it easy to use a counter to detect duplicates.

By using serialization and a hash-based counter, this approach ensures that the resulting list contains each duplicate subtree exactly once, following the problem's constraints.

Discover Your Strengths and Weaknesses: Take Our 2-Minute Quiz to Tailor Your Study Plan:

Depth first search can be used to find whether two components in a graph are connected.

Example Walkthrough

Let's use a small binary tree to illustrate the solution approach:

Consider the following binary tree:

1    1
2   / \
3  2   3
4 /   / \
54   2   4
6   /
7  4

In this tree, the leftmost subtree 2 -> 4 appears again as the left subtree of node 3. We'll walk through the solution steps to find this duplicate subtree.

  1. We define a helper function dfs which takes a node of the binary tree and serializes the subtree rooted at that node.

  2. Starting with the root, dfs performs post-order traversal. This means it will first process all nodes in the left subtree, then the right subtree, and finally, the current node.

  3. At each node, the dfs function:

    • Returns the string '#' if the node is None.
    • Calls itself recursively to get the serialized string of the left subtree.
    • Calls itself recursively to get the serialized string of the right subtree.
    • Combines the value of the current node and the serialized strings from step 3 in a comma-separated format. For the root node '1', assuming its left serialization is '2,#,4' and right is '3,2,#,4,#,4', the serialized string would be '1,2,#,4,3,2,#,4,#,4'.
  4. We use a Counter to track the occurrence of each serialized subtree. Initially it is empty.

  5. Whenever we serialize a subtree in our dfs function:

    • We add the serialization to the Counter.
    • If we come across a serialized subtree that already has a count of 1, which means this is the second time we have found this subtree, we add its root node to the answer list ans.

Following the traversal, we end up with the following serializations after dfs function is completed:

  • 4 (for the leaves with value 4)
  • 2,#,4 (for the subtrees rooted at the nodes with value 2)
  • 3,2,#,4,#,4 (for the subtree rooted at the node with value 3)
  • 1,2,#,4,3,2,#,4,#,4 (for the whole tree)

The 'Counter' at the end of traversal would have the serialized strings with their frequencies:

  • '#': 3 (placeholder for null, for each node without one child or more)
  • 4: 3 (for each leaf node)
  • 2,#,4: 2 (for the subtrees that appear twice)
  • The rest will have a count of 1 because they don't have duplicates.

Since 2,#,4 has a frequency of 2, we only add the root node of its first instance to the ans list, which contains the node with the value '2' that is the left child of the root node '1'.

  1. At the end of traversal, the ans list contains the root node of the duplicate subtree found, which is just the node with value '2' in our example.

After we have finished traversing the entire tree, the solution would be the node with value '2' that appeared as the left child of root '1', because the structure 2 -> 4 is duplicated in the tree.

Not Sure What to Study? Take the 2-min Quiz:

Which algorithm should you use to find a node that is close to the root of the tree?

Python Solution

1from collections import Counter
2from typing import Optional, List
3
4# Definition for a binary tree node.
5class TreeNode:
6    def __init__(self, val=0, left=None, right=None):
7        self.val = val
8        self.left = left
9        self.right = right
10
11class Solution:
12    def findDuplicateSubtrees(self, root: Optional[TreeNode]) -> List[Optional[TreeNode]]:
13        """
14        Finds all duplicate subtrees in a binary tree.
15      
16        Args:
17        - root: The root node of the binary tree.
18      
19        Returns:
20        - A list of tree nodes where each node represents the root of a duplicate subtree.
21        """
22      
23        def traverse(node):
24            """
25            Performs a post-order traversal of the tree to serialize each subtree.
26          
27            Args:
28            - node: The current node being visited.
29          
30            Returns:
31            - A string representing the serialized form of the subtree rooted at the current node.
32            """
33            if node is None:
34                return '#'
35            # Serialize the subtree rooted at this node
36            serialized_subtree = f'{node.val},{traverse(node.left)},{traverse(node.right)}'
37            # Count the occurrence of this serialized subtree
38            subtree_counter[serialized_subtree] += 1
39            # If this is the second time we've seen this subtree, add it to the answer
40            if subtree_counter[serialized_subtree] == 2:
41                duplicate_subtrees.append(node)
42            return serialized_subtree
43
44        # List to hold the root nodes of duplicate subtrees
45        duplicate_subtrees = []
46        # Counter to track the number of times a serialized subtree occurs
47        subtree_counter = Counter()
48        # Start the recursive traversal from the root
49        traverse(root)
50      
51        return duplicate_subtrees
52

Java Solution

1import java.util.HashMap;
2import java.util.Map;
3import java.util.List;
4import java.util.ArrayList;
5
6// Definition for a binary tree node.
7class TreeNode {
8    int val; // The value of the node
9    TreeNode left; // Pointer to the left child
10    TreeNode right; // Pointer to the right child
11  
12    // Node constructor
13    TreeNode() {}
14
15    // Node constructor with just the value
16    TreeNode(int val) { this.val = val; }
17
18    // Node constructor with value and children
19    TreeNode(int val, TreeNode left, TreeNode right) {
20        this.val = val;
21        this.left = left;
22        this.right = right;
23    }
24}
25
26class Solution {
27    // This Map will store the serialization of each subtree and count its occurrences
28    private Map<String, Integer> subtreeCounter;
29
30    // This List will store all the duplicate subtrees finded
31    private List<TreeNode> duplicateSubtrees;
32
33    // This method will find all duplicate subtrees in the binary tree with root 'root'
34    public List<TreeNode> findDuplicateSubtrees(TreeNode root) {
35        subtreeCounter = new HashMap<>();
36        duplicateSubtrees = new ArrayList<>();
37        traverseAndSerialize(root); // Deserialize the tree and find duplicates
38        return duplicateSubtrees;
39    }
40
41    // This recursive method serializes a subtree rooted at 'root' in pre-order
42    // and uses that serialization to identify duplicate subtrees
43    private String traverseAndSerialize(TreeNode root) {
44        if (root == null) {
45            return "#"; // Using # to denote a null node
46        }
47      
48        // Serialize the current subtree
49        String serialization = root.val + "," + traverseAndSerialize(root.left) + "," + traverseAndSerialize(root.right);
50      
51        // Update the count of the current subtree serialization in the map
52        subtreeCounter.put(serialization, subtreeCounter.getOrDefault(serialization, 0) + 1);
53      
54        // If the count is 2 (meaning it was 1 before this increment), we found a duplicate
55        if (subtreeCounter.get(serialization) == 2) {
56            duplicateSubtrees.add(root);
57        }
58      
59        // Return the serialization of this subtree
60        return serialization;
61    }
62}
63

C++ Solution

1#include <unordered_map>
2#include <vector>
3#include <string>
4
5using namespace std;
6
7// Definition for a binary tree node.
8struct TreeNode {
9    int val;
10    TreeNode *left;
11    TreeNode *right;
12    TreeNode() : val(0), left(nullptr), right(nullptr) {}
13    TreeNode(int x) : val(x), left(nullptr), right(nullptr) {}
14    TreeNode(int x, TreeNode *left, TreeNode *right) : val(x), left(left), right(right) {}
15};
16
17class Solution {
18public:
19    unordered_map<string, int> subtreeCounter;  // Used for counting occurrences of subtrees.
20    vector<TreeNode*> duplicateSubtrees;        // Stores the roots of duplicate subtrees.
21
22    // Function to find all duplicate subtrees in a binary tree.
23    vector<TreeNode*> findDuplicateSubtrees(TreeNode* root) {
24        traverseAndSerialize(root);
25        return duplicateSubtrees;
26    }
27
28    // Helper function to perform DFS traversal and serialize each subtree.
29    string traverseAndSerialize(TreeNode* node) {
30        if (!node) return "#";  // Use "#" to represent null pointers.
31
32        // Serialize the current subtree rooted at node.
33        string serialization = to_string(node->val) + "," + traverseAndSerialize(node->left) + "," + traverseAndSerialize(node->right);
34
35        // Increment the count for this serialized subtree.
36        ++subtreeCounter[serialization];
37
38        // If this is the second time we've seen this subtree, add it to the answer.
39        if (subtreeCounter[serialization] == 2) {
40            duplicateSubtrees.push_back(node);
41        }
42
43        // Return the serialization of this subtree.
44        return serialization;
45    }
46};
47

Typescript Solution

1// The TreeNode class defines the structure of a node in the binary tree.
2class TreeNode {
3    val: number;
4    left: TreeNode | null;
5    right: TreeNode | null;
6    constructor(val?: number, left?: TreeNode | null, right?: TreeNode | null) {
7        this.val = val === undefined ? 0 : val;
8        this.left = left === undefined ? null : left;
9        this.right = right === undefined ? null : right;
10    }
11}
12
13/**
14 * Finds all duplicate subtrees in a binary tree.
15 * @param root - The root node of the binary tree.
16 * @returns An array of TreeNode, each representing a root of a duplicate subtree.
17 */
18function findDuplicateSubtrees(root: TreeNode | null): Array<TreeNode | null> {
19    // Mapping of tree serialization to its frequency.
20    const serializationCounts = new Map<string, number>();
21    // Results array to store the root nodes of duplicate subtrees.
22    const result: Array<TreeNode | null> = [];
23
24    /**
25     * A depth-first search function that serializes the tree as it traverses,
26     * and records the frequency of each serialization.
27     * @param node - The current node being processed.
28     * @returns - A string that uniquely represents the subtree rooted at the current node.
29     */
30    function dfs(node: TreeNode | null): string {
31        // Base case for a null node.
32        if (node == null) {
33            return '#';
34        }
35        // Serialization for non-null node: value, left subtree serialization, right subtree serialization.
36        const serialization = `${node.val},${dfs(node.left)},${dfs(node.right)}`;
37        // Update frequency count or set to 1 if seeing this serialization for the first time.
38        serializationCounts.set(serialization, (serializationCounts.get(serialization) ?? 0) + 1);
39        // If the serialization count is exactly 2, add the node to the results.
40        if (serializationCounts.get(serialization) === 2) {
41            result.push(node);
42        }
43        return serialization;
44    }
45
46    // Invoke the dfs function starting from the root to fill in the results.
47    dfs(root);
48
49    // Return the filled results array.
50    return result;
51}
52
Fast Track Your Learning with Our Quick Skills Quiz:

Which type of traversal does breadth first search do?

Time and Space Complexity

Time Complexity:

The main operation in the provided code is a Depth-First Search (DFS) through the binary tree which involves visiting every node exactly once and constructing a serialization string for each node by recursively obtaining the serialization of its left and right children. The time complexity for visiting all nodes in the tree is O(N), where N is the number of nodes in the tree. However, constructing the serialization string for a node involves concatenation operations, which in Python have a time complexity that depends on the length of the strings being concatenated. In the worst case, these strings can get as large as O(N) when the tree degenerates into a linked list, with N being the number of nodes. This concatenation happens for every node, making the time complexity O(N^2) in the worst case.

Additionally, the serialization string for each node is stored in a hash table (counter), and each insertion operation can be considered O(1) on average, though in the worst case, due to hash collisions, it could be O(N). However, this is rare and typically the average case is expected.

Therefore, the average time complexity of the DFS is O(N^2) due to the serialization concatenation, but it could potentially degrade to O(N^3) if every hash insert operation takes O(N) time, which is a very pessimistic assumption and unlikely in practice.

Space Complexity:

The space complexity of the algorithm comes from three main sources:

  1. The recursion stack used by DFS, which goes as deep as the height of the tree. In the worst case (degenerate tree), this height is O(N), contributing O(N) to the space complexity.

  2. The counter hash table which stores the serialization strings for nodes. In the worst case, every node has a unique serialization, so the space used by the counter can grow up to O(N^2) since every serialization can be of length O(N).

  3. The list of answers (ans), which holds a reference to each node that has a duplicate subtree. Since each node could potentially be a part of this list just once, this contributes O(N) to space complexity.

Considering these parts, the overall space complexity of the solution is O(N^2) due to the storage required for the serialization strings in the hash table.

Learn more about how to find time and space complexity quickly.


Recommended Readings


Got a question? Ask the Teaching Assistant anything you don't understand.

Still not clear? Ask in the Forum,  Discord or Submit the part you don't understand to our editors.


TA 👨‍🏫