1044. Longest Duplicate Substring

HardStringBinary SearchSuffix ArraySliding WindowHash FunctionRolling Hash
Leetcode Link

Problem Description

The problem requires us to find the longest duplicated substring within a given string s; these are substrings that appear more than once. A substring is considered a consecutive sequence of characters within the string. The key challenge here is handling the potential overlap of occurrences—this means that the same characters can be part of different duplicated substrings as long as those substrings start at different positions in s.

For example, in the string "banana", the substring "ana" is duplicated with overlapping occurrences: "banana" contains "ana" starting at the second character and again starting at the fourth character.

If no such substring exists, our function should return an empty string. The desired solution can be any one of the longest duplicated substrings if there are multiple with the same maximum length.

Intuition

The intuition behind the solution involves a [binary search](/problems/binary-search-speedrun) algorithm combined with a rolling hash technique or a hash set to quickly check for duplicate substrings.

Binary search is used to reduce the search space for the length of the longest duplicated substring. Instead of checking every possible length, we can start in the middle of the possible range of lengths and then increase or decrease our search based on whether we find a duplicate of that length.

Hash sets are used to store previously seen substrings of a certain length. As we increase or decrease the length of substrings we are looking at, we can add these to the set or check against it to see if we have seen them before. If a substring is already in the set, it means we've found a duplicate.

The process works as follows:

  • We define left as the minimum length of the substring and right as the maximum possible length of a duplicated substring, which is n - 1 (where n is the length of the input string).
  • We use binary search to find the largest mid value such that mid is the length of a possible duplicated substring. We check if there is a duplicate substring of this length using the helper function check.
  • The check function iterates through all possible substrings of length l (where l is the current midpoint of the binary search), adding them to a hash set or returning one immediately if a duplicate is found.

In this solution, we leverage the efficiency of hash sets and the power of binary search to arrive at an efficient solution.

Learn more about Binary Search and Sliding Window patterns.

Solution Approach

The solution provided is a classic application of binary search with a rolling hash or a hash set to efficiently find the longest duplicated substring. Below is an explanation of the code implementation:

  1. Binary Search:

    • Instead of checking all possible lengths, binary search is used to find the length of the longest duplicate substring. We initialize left as 0 and right as n - 1, where n is the length of the input string s.
    • Within a loop, we calculate mid as the average of left and right, rounded up. The >> 1 is a bitwise operation that effectively divides by 2. We are rounding up to ensure the search space reduces on each iteration.
    • We check if there is a duplicate with the length mid using a helper function. If a duplicate exists, left is set to mid. If not, right is set to mid - 1.
    • This process continues until left < right is no longer true.
  2. Helper Function - check:

    • This function takes a length l and checks if any substring of this length occurs more than once.
    • We use of a set to store substrings hashes, which helps in constant-time lookups to check for duplicates.
    • The function iterates over all possible substrings of length l, creating a substring t by slicing the input string s from index i to index i + l.
    • If t is already in the vis, we have found our duplicate and return it immediately. Otherwise, we add t to vis and continue checking the next substrings.
    • If no duplicate is found, an empty string is returned.
  3. Maintaining the Longest Duplicate Substring:

    • We maintain the variable ans to store the longest duplicate substring found so far.
    • After each successful check, we update ans if the length of the found duplicate t is greater than the length of ans.
  4. Efficiency:

    • By using binary search, we reduce the potential search space from n(n-1)/2 to log(n), which is a significant improvement, particularly for larger strings.
    • The use of a set to store hashes implies an O(1) search time for duplicates. However, the creation of substrings and their hashes, done once per substring per iteration, has its own cost.
  5. Termination and Result:

    • The loop terminates when the binary search space collapses, meaning left is equal to right, which also implies we have either found the longest duplicate substring or determined that no duplicate exists.
    • The final result is returned as ans, which is the longest duplicate found during the binary search process.

This approach is efficient for large strings because it avoids inspecting every possible substring directly. Instead, it quickly narrows down the range of possible lengths and checks for duplicates using a HashSet, considerably reducing the time complexity.

Ready to land your dream job?

Unlock your dream job with a 2-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's take a small example to illustrate the solution approach. Consider the string "axaxa":

  1. Initial Setup:

    • We initialize left to 0 and right to 4 (since the length of "axaxa" is 5, and the max length of a duplicate substring can be n - 1).
  2. First Iteration of Binary Search:

    • Calculate mid as (0 + 4) / 2, which is 2.
    • Check for duplicates of length mid (2). Here, we find that "ax" is a substring starting at index 0, and it also appears starting at index 2.
  3. Update Binary Search Bounds:

    • Since we found a duplicate, we update left to mid, making it 2. We keep right at 4.
  4. Second Iteration of Binary Search:

    • Calculate the new mid as (2 + 4) / 2, which is 3.
    • Check for duplicates of length 3. We see "axa" starting at index 0 and again at index 2, so a duplicate is found.
  5. Keep Updating Bounds:

    • We update left to the new mid, making left now 3. Right remains 4.
  6. Third Iteration of Binary Search:

    • Calculate mid as (3 + 4) / 2, which gets rounded down to 3 (since we need integers for substrings lengths).
    • Since mid is still 3 and we've already found a duplicate for length 3, the bounds don't need to be changed.
  7. No Further Progress:

    • As left and right are no longer converging, the loop will terminate.
  8. Termination and Result:

    • In this case, the binary search would stop here as it cannot progress further and we have found that the longest duplicated substring is "axa".

By employing binary search, we quickly identify the longest duplicate without checking every length. We start from the midpoint and either increase or decrease depending on the search results. Thus, the longest duplicated substring in "axaxa" is "axa".

Solution Implementation

1class Solution:
2    def longestDupSubstring(self, s: str) -> str:
3        # Helper function to check if a duplicate substring of given length exists
4        def has_duplicate(length):
5            seen = set()
6            for i in range(n - length + 1):
7                # Extract a substring of given length
8                substring = s[i : i + length]
9                # If this substring has already been seen, return it as a duplicate
10                if substring in seen:
11                    return substring
12                # If not seen, add this substring to the set of seen substrings
13                seen.add(substring)
14            # Return an empty string if no duplicate is found
15            return ''
16      
17        # Length of input string
18        n = len(s)
19        # Initialize binary search bounds
20        left, right = 0, n
21        # variable to store the longest duplicate substring found so far
22        longest_dup = ''
23      
24        # Perform binary search to find the length of the longest duplicate substring
25        while left < right:
26            # Calculate mid-point
27            mid = (left + right + 1) // 2
28            # Use the helper function to check for a duplicate of the current mid length
29            current_dup = has_duplicate(mid)
30            # Keep the longest duplicate seen so far
31            longest_dup = current_dup or longest_dup
32          
33            # If a duplicate of the current length exists, search in the upper half
34            if current_dup:
35                left = mid
36            # If no duplicate of the current length exists, search in the lower half
37            else:
38                right = mid - 1
39      
40        # Return the longest duplicate substring found
41        return longest_dup
42
1import java.util.HashSet;
2import java.util.Set;
3
4public class Solution {
5    // Precomputed hashes and powers of the base
6    private long[] powers;
7    private long[] hashes;
8
9    // Finds the longest duplicate substring in a given string
10    public String longestDupSubstring(String s) {
11        int base = 131; // Base value for polynomial hash calculation
12        int n = s.length(); // Length of the input string
13        powers = new long[n + 10];
14        hashes = new long[n + 10];
15        powers[0] = 1;
16      
17        // Precompute the powers and hashes
18        for (int i = 0; i < n; ++i) {
19            powers[i + 1] = powers[i] * base;
20            hashes[i + 1] = hashes[i] * base + s.charAt(i);
21        }
22      
23        String longestDuplicate = ""; // Store the longest duplicate substring
24        int leftBound = 0, rightBound = n; // Define search bounds
25
26        // Perform binary search on substring length
27        while (leftBound < rightBound) {
28            int mid = (leftBound + rightBound + 1) >>> 1; // Middle point (length of substring)
29            String duplicate = checkForDuplicate(s, mid);
30            if (!duplicate.isEmpty()) {
31                leftBound = mid; // If a duplicate is found, search in the upper half
32                longestDuplicate = duplicate;
33            } else {
34                rightBound = mid - 1; // Otherwise, search in the lower half
35            }
36        }
37      
38        return longestDuplicate; // Return the longest duplicate substring found
39    }
40
41    // Checks for a duplicate substring of a given length
42    private String checkForDuplicate(String s, int len) {
43        int n = s.length();
44        Set<Long> seenHashes = new HashSet<>(); // Set to store previously encountered hashes
45
46        // Iterate over each possible substring of the given length
47        for (int i = 1; i + len - 1 <= n; ++i) {
48            int j = i + len - 1;
49            // Compute the hash for the current substring
50            long currentHash = hashes[j] - hashes[i - 1] * powers[j - i + 1];
51            if (seenHashes.contains(currentHash)) { 
52                // If the hash is already in the set, we found a duplicate
53                return s.substring(i - 1, j);
54            }
55            // Add the current hash to the set
56            seenHashes.add(currentHash);
57        }
58        // If no duplicate is found, return an empty string
59        return "";
60    }
61}
62
1#include <string>
2#include <unordered_set>
3using namespace std;
4
5typedef unsigned long long ull;
6
7class Solution {
8public:
9    ull power[30010];
10    ull hash[30010];
11
12    // Function to find the longest duplicated substring in the given string s.
13    string longestDupSubstring(string s) {
14        int base = 131; // Base used for polynomial rolling hash function.
15        int n = s.size();
16        // Pre-computing the powers of base.
17        power[0] = 1;
18        for (int i = 0; i < n; ++i) {
19            power[i + 1] = power[i] * base;
20            // Calculating hash values for the prefix ending at current character.
21            hash[i + 1] = hash[i] * base + s[i];
22        }
23        // Binary search on the length of the duplicated substring.
24        int left = 0, right = n;
25        string result = "";
26        while (left < right) {
27            // Trying out the middle length in the binary search.
28            int mid = (left + right + 1) >> 1;
29            // Checking if there's a duplicated substring of length 'mid'.
30            string candidate = check(s, mid);
31            if (candidate.empty())
32                right = mid - 1; // If not found, look for shorter substrings.
33            else {
34                left = mid; // If found, look for longer substrings.
35                result = candidate;
36            }
37        }
38        return result;
39    }
40
41private:
42    // Check function to find a duplicated substring of given length.
43    string check(const string& s, int len) {
44        int n = s.size();
45        unordered_set<ull> visited; // Hashes of substrings we have seen.
46        for (int i = 1; i + len - 1 <= n; ++i) {
47            int j = i + len - 1;
48            // Compute the hash of the current substring using a sliding window.
49            ull current_hash = hash[j] - hash[i - 1] * power[j - i + 1];
50            // Check if the current substring's hash has already been seen.
51            if (visited.count(current_hash))
52                return s.substr(i - 1, len); // If so, return the duplicated substring.
53            visited.insert(current_hash);
54        }
55        return ""; // No duplicated substring of this length exists.
56    }
57};
58
1// Define a base for the polynomial rolling hash function
2const base = 131;
3
4// Arrays for storing power values of the base and hash values of the prefixes
5let power: bigint[] = [];
6let hash: bigint[] = [];
7
8// Function to pre-compute the powers of the base up to the given length
9function computePowers(length: number): void {
10    power[0] = BigInt(1);
11    for (let i = 0; i < length; ++i) {
12        power[i + 1] = power[i] * BigInt(base);
13    }
14}
15
16// Function to pre-compute the hash values of the prefixes of the string
17function computeHashes(s: string): void {
18    const n = s.length;
19    hash[0] = BigInt(0);
20    for (let i = 0; i < n; ++i) {
21        hash[i + 1] = hash[i] * BigInt(base) + BigInt(s.charCodeAt(i));
22    }
23}
24
25// Function to find the longest duplicated substring in the given string
26function longestDupSubstring(s: string): string {
27    const n = s.length;
28
29    // Pre-compute powers and hashes
30    computePowers(n);
31    computeHashes(s);
32
33    // Binary search for the longest duplicated substring length
34    let left = 0, right = n;
35    let result = "";
36
37    while (left < right) {
38        const mid = Math.floor((left + right + 1) / 2);
39        const candidate = check(s, mid);
40      
41        if (candidate) {
42            left = mid;
43            result = candidate;
44        } else {
45            right = mid - 1;
46        }
47    }
48  
49    return result;
50}
51
52// Function to check for a duplicated substring of the given length
53function check(s: string, len: number): string | null {
54    const n = s.length;
55    const visited = new Set<bigint>();
56
57    for (let i = 1; i + len - 1 <= n; ++i) {
58        const j = i + len - 1;
59        const currentHash = hash[j] - hash[i - 1] * power[len];
60      
61        if (visited.has(currentHash)) {
62            return s.substring(i - 1, i - 1 + len);
63        }
64      
65        visited.add(currentHash);
66    }
67  
68    return null;
69}
70
71// Example usage
72const inputString = "banana";
73const result = longestDupSubstring(inputString);
74console.log(result); // Outputs the longest duplicated substring, if found.
75

Time and Space Complexity

Time Complexity

The provided code uses a binary search combined with a rolling hash (or in this case, a direct string check) for finding the longest duplicated substring in a given string. The main while loop (which is effectively the binary search) will run for O(log n) iterations, where n is the length of the input string s.

Inside the binary search, there's a call to the check function which runs for each possible substring length determined by mid. The check function iterates for each possible starting position of a substring (n - l + 1 times, with l being the length of the current substring we're checking). For each iteration, it checks if that substring has already been seen (O(1) operation for a hash set lookup on average), and if not, it adds the substring to the set (O(l) operation for hashing the substring). Therefore, the time complexity of each check call is O((n - l + 1) * l).

Since the outer loop runs O(log n) times and the check function is O((n - l + 1) * l) for each iteration, and l ranges from 0 to n, the overall time complexity is O(n log n * l), which can also be simplified to O(n^2 log n) as the worst-case scenario when l is close to n.

Space Complexity

The space complexity is dominated by the vis set, which at most will store n substrings of length l. Since Python strings are immutable and slicing a string creates a new string, the space required to store each individual substring is O(l), where l can be at most n.

Hence, in the worst case, the space complexity is O(n^2), occurring when l is close to n and we store all possible substrings of that length.

Learn more about how to find time and space complexity quickly using problem constraints.


Discover Your Strengths and Weaknesses: Take Our 2-Minute Quiz to Tailor Your Study Plan:
Question 1 out of 10

Which of these properties could exist for a graph but not a tree?


Recommended Readings

Want a Structured Path to Master System Design Too? Don’t Miss This!