1044. Longest Duplicate Substring

HardStringBinary SearchSuffix ArraySliding WindowHash FunctionRolling Hash

Problem Description

You are given a string s and need to find the longest substring that appears at least twice in the string. These duplicate substrings can overlap with each other.

A duplicated substring is a contiguous sequence of characters from s that occurs 2 or more times anywhere in the string. For example, in the string "banana", the substring "ana" appears twice (at positions 1-3 and 3-5), and these occurrences overlap at position 3.

Your task is to return any duplicated substring that has the maximum possible length. If no substring appears more than once in s, return an empty string "".

The solution uses binary search on the length of the potential duplicate substring. The check function verifies if there exists a duplicate substring of a given length l by iterating through all possible substrings of that length and using a set to track which substrings have been seen. If a substring is encountered that's already in the set, it means we've found a duplicate.

The binary search works by:

Setting the search range from 0 to the length of the string
For each middle value mid, checking if a duplicate substring of length mid exists
If found, searching for longer duplicates (moving left pointer up)
If not found, searching for shorter duplicates (moving right pointer down)
Keeping track of the longest valid duplicate found so far in ans

Quick Interview Experience

Help others by sharing your interview experience

Have you seen this problem before?

Intuition

The key insight is that if a substring of length k appears twice in the string, then we know the answer is at least k characters long. However, there might be an even longer duplicate substring of length k+1, k+2, and so on. We want to find the maximum such length.

A brute force approach would check all possible substrings, but this would be inefficient. Instead, we can observe that the problem has a monotonic property: if there exists a duplicate substring of length k, then there must also exist duplicate substrings of all lengths from 1 to k-1. This is because any substring of a duplicate substring is also a duplicate substring.

This monotonic property suggests using binary search on the answer. We can search for the maximum length L such that there exists a duplicate substring of length L.

For each candidate length during binary search, we need to verify if any substring of that length appears twice. We can do this efficiently by:

Generating all substrings of the given length
Using a hash set to track which substrings we've seen
If we encounter a substring that's already in the set, we've found a duplicate

The binary search narrows down the search space from O(n) possible lengths to O(log n) checks. For each check, we examine O(n) substrings, and comparing strings takes O(L) time where L is the current length being checked. This gives us a more efficient solution than checking all possible substring pairs.

The approach elegantly combines the efficiency of binary search with the simplicity of hash-based duplicate detection, making it practical even for longer strings.

Learn more about Binary Search and Sliding Window patterns.

Solution Approach

The solution implements a binary search combined with a hash set-based duplicate detection function.

Binary Search Setup:

Initialize left = 0 and right = n (length of string) as the search boundaries for possible substring lengths
Use ans to keep track of the longest duplicate substring found so far

Binary Search Logic: The main loop continues while left < right:

Calculate mid = (left + right + 1) >> 1 (equivalent to (left + right + 1) // 2)
- The +1 ensures we round up to avoid infinite loops when left and right are adjacent
Call check(mid) to verify if a duplicate substring of length mid exists
Update the answer: ans = t or ans keeps the non-empty result (either the new found substring t or the previous ans)
Adjust search boundaries:
- If a duplicate of length mid is found, search for longer substrings: left = mid
- Otherwise, search for shorter substrings: right = mid - 1

The check Function: This helper function determines if any substring of length l appears at least twice:

Initialize an empty set vis to track seen substrings
Iterate through all possible starting positions from 0 to n - l
For each position i, extract substring t = s[i : i + l]
If t is already in vis, we found a duplicate - return it immediately
Otherwise, add t to vis and continue
If no duplicate is found after checking all positions, return empty string

Time Complexity Analysis:

Binary search performs O(log n) iterations
Each check call examines O(n) substrings
String slicing and hashing takes O(L) time for length L
Overall: O(n * L * log n) where L is the average substring length checked

The algorithm efficiently narrows down the search space using binary search while maintaining simplicity through straightforward string comparison using Python's built-in set operations.

Ready to land your dream job?

Unlock your dream job with a 5-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's trace through the algorithm with the string s = "banana".

Initial Setup:

String: "banana" (length n = 6)
Binary search range: left = 0, right = 6
Answer: ans = ""

Iteration 1:

mid = (0 + 6 + 1) >> 1 = 3
Check if any substring of length 3 appears twice:
- i=0: "ban" → add to vis = {"ban"}
- i=1: "ana" → add to vis = {"ban", "ana"}
- i=2: "nan" → add to vis = {"ban", "ana", "nan"}
- i=3: "ana" → already in vis! Return "ana"
Found duplicate "ana", so ans = "ana"
Search for longer: left = 3, right = 6

Iteration 2:

mid = (3 + 6 + 1) >> 1 = 5
Check if any substring of length 5 appears twice:
- i=0: "banan" → add to vis = {"banan"}
- i=1: "anana" → add to vis = {"banan", "anana"}
No duplicates found, return ""
No duplicate found, search for shorter: left = 3, right = 4

Iteration 3:

mid = (3 + 4 + 1) >> 1 = 4
Check if any substring of length 4 appears twice:
- i=0: "bana" → add to vis = {"bana"}
- i=1: "anan" → add to vis = {"bana", "anan"}
- i=2: "nana" → add to vis = {"bana", "anan", "nana"}
No duplicates found, return ""
No duplicate found, search for shorter: left = 3, right = 3

Termination:

left = right = 3, loop ends
Return ans = "ana"

The algorithm correctly identifies "ana" as the longest duplicated substring. It appears at positions [1:4] and [3:6], with an overlap at position 3. Through binary search, we efficiently narrowed down from checking all possible lengths to finding the optimal length in just 3 iterations.

Solution Implementation

1class Solution:
2    def longestDupSubstring(self, s: str) -> str:
3        """
4        Find the longest duplicate substring in a given string.
5        Uses binary search on the length of substring and checks for duplicates.
6
7        Args:
8            s: Input string to search for duplicate substrings
9
10        Returns:
11            The longest duplicate substring, or empty string if none exists
12        """
13
14        def has_duplicate_of_length(length: int) -> str:
15            """
16            Check if there exists a duplicate substring of given length.
17
18            Args:
19                length: The length of substring to check for duplicates
20
21            Returns:
22                The duplicate substring if found, otherwise empty string
23            """
24            seen_substrings = set()
25
26            # Iterate through all possible substrings of given length
27            for start_idx in range(string_length - length + 1):
28                # Extract substring of current length starting at start_idx
29                current_substring = s[start_idx : start_idx + length]
30
31                # Check if we've seen this substring before
32                if current_substring in seen_substrings:
33                    return current_substring
34
35                # Add current substring to the set of seen substrings
36                seen_substrings.add(current_substring)
37
38            return ''
39
40        # Initialize variables
41        string_length = len(s)
42        left_bound = 0
43        right_bound = string_length
44        longest_duplicate = ''
45
46        # Binary search on the length of the duplicate substring
47        while left_bound < right_bound:
48            # Calculate mid point (ceiling division)
49            mid_length = (left_bound + right_bound + 1) >> 1
50
51            # Check if duplicate substring of mid_length exists
52            found_substring = has_duplicate_of_length(mid_length)
53
54            # Update longest_duplicate if a duplicate was found
55            longest_duplicate = found_substring or longest_duplicate
56
57            if found_substring:
58                # If duplicate found, search for longer substrings
59                left_bound = mid_length
60            else:
61                # If no duplicate found, search for shorter substrings
62                right_bound = mid_length - 1
63
64        return longest_duplicate
65

1class Solution {
2    // Arrays to store powers of base and hash values
3    private long[] basePowers;
4    private long[] prefixHash;
5
6    public String longestDupSubstring(String s) {
7        // Constants for rolling hash
8        final int BASE = 131;
9        int n = s.length();
10
11        // Initialize arrays with extra space to avoid boundary issues
12        basePowers = new long[n + 10];
13        prefixHash = new long[n + 10];
14
15        // Precompute powers of base and prefix hash values
16        basePowers[0] = 1;
17        for (int i = 0; i < n; i++) {
18            basePowers[i + 1] = basePowers[i] * BASE;
19            prefixHash[i + 1] = prefixHash[i] * BASE + s.charAt(i);
20        }
21
22        // Binary search on the length of the longest duplicate substring
23        String longestDuplicate = "";
24        int left = 0;
25        int right = n;
26
27        while (left < right) {
28            // Calculate mid point (bias towards right when even)
29            int mid = (left + right + 1) >> 1;
30
31            // Check if there exists a duplicate substring of length mid
32            String duplicateFound = checkDuplicateOfLength(s, mid);
33
34            if (duplicateFound.length() > 0) {
35                // Found a duplicate, try longer lengths
36                left = mid;
37                longestDuplicate = duplicateFound;
38            } else {
39                // No duplicate found, try shorter lengths
40                right = mid - 1;
41            }
42        }
43
44        return longestDuplicate;
45    }
46
47    /**
48     * Checks if there exists a duplicate substring of given length.
49     * Uses rolling hash to efficiently compare substrings.
50     *
51     * @param s the input string
52     * @param length the length of substring to check
53     * @return the duplicate substring if found, empty string otherwise
54     */
55    private String checkDuplicateOfLength(String s, int length) {
56        int n = s.length();
57        Set<Long> seenHashes = new HashSet<>();
58
59        // Check all possible substrings of given length
60        for (int startPos = 1; startPos + length - 1 <= n; startPos++) {
61            int endPos = startPos + length - 1;
62
63            // Calculate hash value for substring [startPos-1, endPos-1] using prefix hash
64            // Formula: hash(s[i:j]) = prefixHash[j] - prefixHash[i-1] * base^(j-i+1)
65            long hashValue = prefixHash[endPos] - prefixHash[startPos - 1] * basePowers[endPos - startPos + 1];
66
67            // Check if we've seen this hash before (indicating a duplicate)
68            if (seenHashes.contains(hashValue)) {
69                // Return the duplicate substring (convert to 0-based indexing)
70                return s.substring(startPos - 1, endPos);
71            }
72
73            // Add current hash to the set
74            seenHashes.add(hashValue);
75        }
76
77        // No duplicate found
78        return "";
79    }
80}
81

1typedef unsigned long long ULL;
2
3class Solution {
4public:
5    // Arrays for polynomial hash computation
6    ULL powers[30010];      // powers[i] stores base^i
7    ULL prefixHash[30010];  // prefixHash[i] stores hash of s[0...i-1]
8
9    string longestDupSubstring(string s) {
10        // Initialize hash parameters
11        const int BASE = 131;  // Prime base for polynomial rolling hash
12        int n = s.size();
13
14        // Precompute powers of base
15        powers[0] = 1;
16        for (int i = 0; i < n; ++i) {
17            powers[i + 1] = powers[i] * BASE;
18            prefixHash[i + 1] = prefixHash[i] * BASE + s[i];
19        }
20
21        // Binary search on the length of duplicate substring
22        int left = 0, right = n;
23        string result = "";
24
25        while (left < right) {
26            // Try middle length (bias towards larger when even)
27            int mid = (left + right + 1) >> 1;
28            string duplicateFound = findDuplicateOfLength(s, mid);
29
30            if (duplicateFound.empty()) {
31                // No duplicate of this length, try smaller lengths
32                right = mid - 1;
33            } else {
34                // Found duplicate, try larger lengths
35                left = mid;
36                result = duplicateFound;
37            }
38        }
39
40        return result;
41    }
42
43private:
44    // Check if there exists a duplicate substring of given length
45    // Returns the duplicate substring if found, empty string otherwise
46    string findDuplicateOfLength(string& s, int targetLength) {
47        int n = s.size();
48        unordered_set<ULL> seenHashes;
49
50        // Slide window of size targetLength through the string
51        for (int startIdx = 1; startIdx + targetLength - 1 <= n; ++startIdx) {
52            int endIdx = startIdx + targetLength - 1;
53
54            // Calculate hash of substring s[startIdx-1...endIdx-1]
55            // Using formula: hash(s[i...j]) = prefixHash[j+1] - prefixHash[i] * base^(j-i+1)
56            ULL currentHash = prefixHash[endIdx] - prefixHash[startIdx - 1] * powers[endIdx - startIdx + 1];
57
58            // Check if we've seen this hash before (duplicate found)
59            if (seenHashes.count(currentHash)) {
60                return s.substr(startIdx - 1, targetLength);
61            }
62
63            seenHashes.insert(currentHash);
64        }
65
66        return "";  // No duplicate of this length found
67    }
68};
69

1// Type alias for unsigned long long equivalent (using BigInt in TypeScript)
2type ULL = bigint;
3
4// Arrays for polynomial hash computation
5const powers: ULL[] = new Array(30010);      // powers[i] stores base^i
6const prefixHash: ULL[] = new Array(30010);  // prefixHash[i] stores hash of s[0...i-1]
7
8function longestDupSubstring(s: string): string {
9    // Initialize hash parameters
10    const BASE: bigint = 131n;  // Prime base for polynomial rolling hash
11    const n: number = s.length;
12
13    // Precompute powers of base
14    powers[0] = 1n;
15    for (let i = 0; i < n; ++i) {
16        powers[i + 1] = powers[i] * BASE;
17        prefixHash[i + 1] = prefixHash[i] * BASE + BigInt(s.charCodeAt(i));
18    }
19
20    // Binary search on the length of duplicate substring
21    let left: number = 0;
22    let right: number = n;
23    let result: string = "";
24
25    while (left < right) {
26        // Try middle length (bias towards larger when even)
27        const mid: number = (left + right + 1) >> 1;
28        const duplicateFound: string = findDuplicateOfLength(s, mid);
29
30        if (duplicateFound === "") {
31            // No duplicate of this length, try smaller lengths
32            right = mid - 1;
33        } else {
34            // Found duplicate, try larger lengths
35            left = mid;
36            result = duplicateFound;
37        }
38    }
39
40    return result;
41}
42
43// Check if there exists a duplicate substring of given length
44// Returns the duplicate substring if found, empty string otherwise
45function findDuplicateOfLength(s: string, targetLength: number): string {
46    const n: number = s.length;
47    const seenHashes: Set<ULL> = new Set();
48
49    // Slide window of size targetLength through the string
50    for (let startIdx = 1; startIdx + targetLength - 1 <= n; ++startIdx) {
51        const endIdx: number = startIdx + targetLength - 1;
52
53        // Calculate hash of substring s[startIdx-1...endIdx-1]
54        // Using formula: hash(s[i...j]) = prefixHash[j+1] - prefixHash[i] * base^(j-i+1)
55        const currentHash: ULL = prefixHash[endIdx] - prefixHash[startIdx - 1] * powers[endIdx - startIdx + 1];
56
57        // Check if we've seen this hash before (duplicate found)
58        if (seenHashes.has(currentHash)) {
59            return s.substring(startIdx - 1, startIdx - 1 + targetLength);
60        }
61
62        seenHashes.add(currentHash);
63    }
64
65    return "";  // No duplicate of this length found
66}
67

Time and Space Complexity

Time Complexity: O(n² * log n)

The algorithm uses binary search to find the longest duplicate substring. The binary search runs for O(log n) iterations where n is the length of the string. In each iteration, the check function is called.

The check function iterates through all possible substrings of length l, which takes O(n - l + 1) ≈ O(n) iterations. For each substring, it performs:

String slicing s[i : i + l] which takes O(l) time
Set membership check and insertion which takes O(l) time on average (due to string hashing)

Therefore, the check function has time complexity O(n * l). In the worst case, l can be up to n/2 (when the longest duplicate substring is half the string length), making each check call O(n²) in the worst case.

Overall time complexity: O(log n) * O(n²) = O(n² * log n)

Space Complexity: O(n²)

The space complexity is dominated by the vis set in the check function. In the worst case:

The set can store up to O(n) different substrings
Each substring can have length up to O(n)
Therefore, the total space used can be O(n²)

Additionally, the ans variable stores a string of length up to O(n), but this doesn't change the overall space complexity.

Overall space complexity: O(n²)

Learn more about how to find time and space complexity quickly.

Common Pitfalls

1. Memory Limit Exceeded with Large Strings

The current solution stores full substring strings in a set, which can consume excessive memory when dealing with very long strings (e.g., 100,000 characters). Each substring of length L requires O(L) space, and with potentially O(n) substrings stored, memory usage can explode.

Solution: Use rolling hash (Rabin-Karp algorithm) instead of storing actual substrings:

def has_duplicate_of_length_optimized(length: int) -> str:
    if length == 0:
        return ''

    # Use rolling hash instead of storing strings
    base = 26
    mod = 2**63 - 1

    # Calculate initial hash for first substring
    hash_val = 0
    base_power = 1
    for i in range(length):
        hash_val = (hash_val * base + ord(s[i])) % mod
        if i < length - 1:
            base_power = (base_power * base) % mod

    seen_hashes = {hash_val}

    # Roll the hash through the string
    for i in range(length, len(s)):
        # Remove leftmost character and add rightmost character
        hash_val = (hash_val - ord(s[i - length]) * base_power) % mod
        hash_val = (hash_val * base + ord(s[i])) % mod

        if hash_val in seen_hashes:
            # Found potential duplicate, verify to avoid hash collision
            return s[i - length + 1 : i + 1]
        seen_hashes.add(hash_val)

    return ''

2. Incorrect Binary Search Boundaries

A common mistake is using mid = (left + right) // 2 without the +1, which can cause infinite loops when left = right - 1. If a duplicate of length mid exists, setting left = mid would keep the search stuck.

Solution: Always use mid = (left + right + 1) // 2 when updating left = mid in the success case, or alternatively use mid = (left + right) // 2 with left = mid + 1.

3. Off-by-One Error in Substring Extraction

Incorrectly calculating the range for substring extraction, such as using range(n - length) instead of range(n - length + 1), would miss the last valid substring.

Solution: Ensure the loop iterates through range(n - length + 1) to include all valid starting positions:

for start_idx in range(string_length - length + 1):  # Correct
    # Not: range(string_length - length)  # Would miss last substring

4. Hash Collision Issues (When Using Rolling Hash)

If implementing rolling hash optimization, hash collisions can lead to false positives where different substrings produce the same hash value.

Solution: When a hash match is found, perform a secondary verification or use multiple hash functions:

if hash_val in seen_hashes:
    # Verify the actual substring match
    candidate = s[i - length + 1 : i + 1]
    # Additional verification logic here
    for start_pos in seen_positions[hash_val]:
        if s[start_pos : start_pos + length] == candidate:
            return candidate

5. Performance Degradation with Short Repeated Characters

For strings with many repeated characters (e.g., "aaaaaaa..."), the check function becomes inefficient as almost every substring of a given length might need to be compared.

Solution: Add early termination conditions or use more sophisticated string matching algorithms like suffix arrays or suffix trees for better worst-case performance.

Discover Your Strengths and Weaknesses: Take Our 5-Minute Quiz to Tailor Your Study Plan: