187. Repeated DNA Sequences

MediumBit ManipulationHash TableStringSliding WindowHash FunctionRolling Hash

Problem Description

You are given a DNA sequence represented as a string containing only the characters 'A', 'C', 'G', and 'T'. Your task is to find all substring sequences that are exactly 10 letters long and appear more than once in the DNA string.

For example, if the DNA sequence is "ACGAATTCCG", you need to identify any 10-character substring that occurs at least twice in the sequence.

The function should:

Take a string s representing the DNA sequence as input
Return a list of all 10-letter substrings that appear more than once in s
The substrings can be returned in any order

The solution uses a hash table to track the frequency of each 10-letter substring. It slides a window of size 10 across the string, counting occurrences of each substring. When a substring's count reaches exactly 2 (meaning it's been seen twice), it gets added to the result list. This ensures each repeated sequence appears only once in the output, even if it occurs more than twice in the original string.

Quick Interview Experience

Help others by sharing your interview experience

Have you seen this problem before?

Intuition

The key insight is that we need to identify duplicate substrings of a fixed length (10 characters). When dealing with finding duplicates, a hash table naturally comes to mind as it allows us to track what we've seen before in constant time.

Since we're looking for all 10-letter sequences that appear more than once, we need to examine every possible 10-character substring in the DNA sequence. This suggests using a sliding window approach - we can move through the string character by character, extracting each 10-character window as we go.

The challenge is not just finding duplicates, but also ensuring we don't report the same sequence multiple times if it appears three or more times. For instance, if a sequence appears 5 times, we only want it in our result once, not 4 times.

To handle this elegantly, we can count the occurrences of each substring. The trick is to add a substring to our result only when its count reaches exactly 2. This way:

When we see it the first time (count = 1), we don't add it
When we see it the second time (count = 2), we add it to results
When we see it the third, fourth, or more times (count > 2), we don't add it again

This approach ensures each repeated sequence appears exactly once in our output, regardless of how many times it actually occurs in the DNA string.

Solution Approach

The solution uses a hash table combined with a sliding window technique to efficiently find all repeated 10-letter DNA sequences.

Implementation Steps:

Initialize a Counter: We create a hash table cnt using Python's Counter() to track the frequency of each 10-letter substring. This data structure automatically handles counting and initializing new keys.
Set up the result list: We initialize an empty list ans to store the sequences that appear more than once.
Sliding Window Iteration: We iterate through the string using a sliding window of size 10:
- The loop runs from index i = 0 to i = len(s) - 10, ensuring we don't go out of bounds
- At each position i, we extract the substring t = s[i : i + 10]
Count and Check: For each extracted substring t:
- We increment its count in the hash table: cnt[t] += 1
- We check if the count equals exactly 2: if cnt[t] == 2
- If it's the second occurrence, we add it to our answer: ans.append(t)
Return Result: After processing all possible 10-letter substrings, we return the ans list containing all sequences that appeared more than once.

Why this works:

The sliding window ensures we examine every possible 10-letter substring in O(n) iterations
The hash table gives us O(1) lookup and update for counting occurrences
The "count == 2" check ensures each repeated sequence is added to the result exactly once, avoiding duplicates in the output

Time Complexity: O(n) where n is the length of the DNA string, as we iterate through the string once and perform constant-time operations for each substring.

Space Complexity: O(n) in the worst case, as we might need to store up to n-9 unique substrings in the hash table.

Ready to land your dream job?

Unlock your dream job with a 5-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's walk through the solution with a small example DNA sequence: "AAAAACCCCCAAAAACCCCCC"

Step 1: Initialize data structures

Create an empty counter cnt = {}
Create an empty result list ans = []

Step 2: Sliding window process

We'll extract each 10-letter substring and track its count:

i = 0: Extract "AAAAACCCCC"
- cnt["AAAAACCCCC"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 1: Extract "AAAACCCCCA"
- cnt["AAAACCCCCA"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 2: Extract "AAACCCCCAA"
- cnt["AAACCCCCAA"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 3: Extract "AACCCCCAAA"
- cnt["AACCCCCAAA"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 4: Extract "ACCCCCAAAA"
- cnt["ACCCCCAAAA"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 5: Extract "CCCCCAAAAA"
- cnt["CCCCCAAAAA"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 6: Extract "CCCCAAAAAC"
- cnt["CCCCAAAAAC"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 7: Extract "CCCAAAAACC"
- cnt["CCCAAAAACC"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 8: Extract "CCAAAAACCC"
- cnt["CCAAAAACCC"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 9: Extract "CAAAAACCCC"
- cnt["CAAAAACCCC"] = 1 (first occurrence)
- Count is 1, don't add to result
i = 10: Extract "AAAAACCCCC"
- cnt["AAAAACCCCC"] = 2 (second occurrence!)
- Count is 2, add to result: ans = ["AAAAACCCCC"]
i = 11: Extract "AAAACCCCCC"
- cnt["AAAACCCCCC"] = 1 (first occurrence)
- Count is 1, don't add to result

Step 3: Return result

Final answer: ["AAAAACCCCC"]

The substring "AAAAACCCCC" appears at positions 0 and 10 in the original string, making it a repeated 10-letter sequence. Our algorithm correctly identifies it by adding it to the result only when we encounter it for the second time.

Solution Implementation

1class Solution:
2    def findRepeatedDnaSequences(self, s: str) -> List[str]:
3        """
4        Find all 10-letter-long sequences that occur more than once in a DNA string.
5      
6        Args:
7            s: DNA string containing only 'A', 'C', 'G', 'T' characters
8          
9        Returns:
10            List of all 10-letter sequences that appear at least twice
11        """
12        from collections import Counter
13        from typing import List
14      
15        # Dictionary to count occurrences of each 10-letter substring
16        sequence_count = Counter()
17      
18        # List to store sequences that appear more than once
19        result = []
20      
21        # Iterate through all possible 10-letter substrings
22        # Range stops at len(s) - 9 to ensure we can extract 10 characters
23        for i in range(len(s) - 9):
24            # Extract current 10-letter substring
25            current_sequence = s[i:i + 10]
26          
27            # Increment count for this sequence
28            sequence_count[current_sequence] += 1
29          
30            # Add to result only when count reaches exactly 2
31            # This ensures each repeated sequence is added only once
32            if sequence_count[current_sequence] == 2:
33                result.append(current_sequence)
34      
35        return result
36

1class Solution {
2    public List<String> findRepeatedDnaSequences(String s) {
3        // HashMap to store the frequency count of each 10-character DNA sequence
4        Map<String, Integer> sequenceCount = new HashMap<>();
5      
6        // List to store sequences that appear more than once
7        List<String> repeatedSequences = new ArrayList<>();
8      
9        // Iterate through all possible 10-character substrings in the DNA string
10        // Loop runs from index 0 to (length - 10), inclusive
11        for (int i = 0; i <= s.length() - 10; i++) {
12            // Extract the current 10-character substring starting at index i
13            String currentSequence = s.substring(i, i + 10);
14          
15            // Increment the count for this sequence and get the new count value
16            // merge() adds 1 to existing value, or sets to 1 if key doesn't exist
17            int newCount = sequenceCount.merge(currentSequence, 1, Integer::sum);
18          
19            // If this sequence has been seen exactly twice, add it to the result
20            // Using == 2 ensures each repeated sequence is added only once
21            if (newCount == 2) {
22                repeatedSequences.add(currentSequence);
23            }
24        }
25      
26        // Return the list of DNA sequences that appear more than once
27        return repeatedSequences;
28    }
29}
30

1class Solution {
2public:
3    vector<string> findRepeatedDnaSequences(string s) {
4        // Hash map to store frequency of each 10-letter substring
5        unordered_map<string, int> substringFrequency;
6      
7        // Result vector to store repeated DNA sequences
8        vector<string> result;
9      
10        // Calculate the number of possible 10-letter substrings
11        int numSubstrings = s.size() - 10 + 1;
12      
13        // Iterate through all possible starting positions for 10-letter substrings
14        for (int i = 0; i < numSubstrings; ++i) {
15            // Extract the current 10-letter substring
16            string currentSubstring = s.substr(i, 10);
17          
18            // Increment the frequency count for this substring
19            substringFrequency[currentSubstring]++;
20          
21            // If this substring appears exactly twice, add it to the result
22            // (This ensures each repeated sequence is added only once)
23            if (substringFrequency[currentSubstring] == 2) {
24                result.emplace_back(currentSubstring);
25            }
26        }
27      
28        return result;
29    }
30};
31

1/**
2 * Finds all 10-letter-long sequences (substrings) that occur more than once in a DNA string
3 * @param s - The input DNA string containing only 'A', 'C', 'G', and 'T' characters
4 * @returns An array of all repeated 10-letter DNA sequences
5 */
6function findRepeatedDnaSequences(s: string): string[] {
7    const stringLength: number = s.length;
8  
9    // Map to store the count of each 10-letter sequence
10    const sequenceCountMap: Map<string, number> = new Map<string, number>();
11  
12    // Array to store sequences that appear more than once
13    const repeatedSequences: string[] = [];
14  
15    // Iterate through all possible 10-letter sequences in the string
16    for (let startIndex = 0; startIndex <= stringLength - 10; startIndex++) {
17        // Extract the current 10-letter sequence
18        const currentSequence: string = s.slice(startIndex, startIndex + 10);
19      
20        // Update the count for this sequence in the map
21        const currentCount: number = (sequenceCountMap.get(currentSequence) ?? 0) + 1;
22        sequenceCountMap.set(currentSequence, currentCount);
23      
24        // Add to result array only when the sequence appears for the second time
25        // This ensures each repeated sequence is added only once
26        if (currentCount === 2) {
27            repeatedSequences.push(currentSequence);
28        }
29    }
30  
31    return repeatedSequences;
32}
33

Time and Space Complexity

The time complexity is O(n × 10), where n is the length of the string s. This is because:

The loop iterates n - 10 + 1 times, which is approximately n iterations
In each iteration, we extract a substring of length 10 using s[i : i + 10], which takes O(10) time
The hash table operations (lookup and insertion) for strings of length 10 also take O(10) time
Therefore, the total time complexity is O(n × 10)

The space complexity is O(n × 10), where n is the length of the string s. This is because:

In the worst case, all possible 10-character substrings are unique and stored in the Counter dictionary
There can be at most n - 10 + 1 unique substrings, which is approximately n
Each substring stored takes O(10) space
The answer list ans can contain at most n - 10 + 1 substrings in the worst case, each of length 10
Therefore, the total space complexity is O(n × 10)

Common Pitfalls

1. Adding Duplicates to the Result

A common mistake is adding the same sequence to the result multiple times when it appears more than twice in the DNA string.

Incorrect approach:

# WRONG: This adds the sequence every time it's seen after the first occurrence
if sequence_count[current_sequence] > 1:
    result.append(current_sequence)

This would add "AAAAAAAAAA" three times if it appears four times in the string.

Correct approach:

# RIGHT: Only add when count equals exactly 2
if sequence_count[current_sequence] == 2:
    result.append(current_sequence)

2. Using a Set Instead of Checking Count == 2

Some might try using a set to avoid duplicates:

Inefficient approach:

result = set()
for i in range(len(s) - 9):
    current_sequence = s[i:i + 10]
    sequence_count[current_sequence] += 1
    if sequence_count[current_sequence] > 1:
        result.add(current_sequence)
return list(result)

While this works, it's less efficient because:

Converting between set and list adds overhead
The set needs to check for duplicates on every add operation
The elegance of the "count == 2" solution is lost

3. Incorrect Window Boundaries

Wrong boundary calculation:

# WRONG: This causes index out of bounds
for i in range(len(s) - 10 + 1):  # Should be len(s) - 9
    current_sequence = s[i:i + 10]

Or even worse:

# WRONG: Misses the last valid substring
for i in range(len(s) - 11):
    current_sequence = s[i:i + 10]

Correct approach:

# RIGHT: len(s) - 9 ensures we can always extract 10 characters
for i in range(len(s) - 9):
    current_sequence = s[i:i + 10]

4. Not Handling Edge Cases

Forgetting to handle strings shorter than 10 characters:

Incomplete solution:

def findRepeatedDnaSequences(self, s: str) -> List[str]:
    # Missing edge case check
    sequence_count = Counter()
    result = []
  
    for i in range(len(s) - 9):  # This becomes range(negative) if len(s) < 10
        # ...

Better solution:

def findRepeatedDnaSequences(self, s: str) -> List[str]:
    if len(s) < 10:
        return []
  
    sequence_count = Counter()
    result = []
    # ... rest of the code

5. Using Regular Dictionary Instead of Counter

Using a regular dict requires manual initialization:

More verbose approach:

sequence_count = {}
for i in range(len(s) - 9):
    current_sequence = s[i:i + 10]
    # Need to check if key exists first
    if current_sequence not in sequence_count:
        sequence_count[current_sequence] = 0
    sequence_count[current_sequence] += 1

Using Counter() or defaultdict(int) eliminates this boilerplate and potential KeyError issues.

Discover Your Strengths and Weaknesses: Take Our 5-Minute Quiz to Tailor Your Study Plan: