187. Repeated DNA Sequences
Problem Description
You are given a DNA sequence represented as a string containing only the characters 'A'
, 'C'
, 'G'
, and 'T'
. Your task is to find all substring sequences that are exactly 10
letters long and appear more than once in the DNA string.
For example, if the DNA sequence is "ACGAATTCCG"
, you need to identify any 10-character substring that occurs at least twice in the sequence.
The function should:
- Take a string
s
representing the DNA sequence as input - Return a list of all 10-letter substrings that appear more than once in
s
- The substrings can be returned in any order
The solution uses a hash table to track the frequency of each 10-letter substring. It slides a window of size 10 across the string, counting occurrences of each substring. When a substring's count reaches exactly 2 (meaning it's been seen twice), it gets added to the result list. This ensures each repeated sequence appears only once in the output, even if it occurs more than twice in the original string.
Intuition
The key insight is that we need to identify duplicate substrings of a fixed length (10 characters). When dealing with finding duplicates, a hash table naturally comes to mind as it allows us to track what we've seen before in constant time.
Since we're looking for all 10-letter sequences that appear more than once, we need to examine every possible 10-character substring in the DNA sequence. This suggests using a sliding window approach - we can move through the string character by character, extracting each 10-character window as we go.
The challenge is not just finding duplicates, but also ensuring we don't report the same sequence multiple times if it appears three or more times. For instance, if a sequence appears 5 times, we only want it in our result once, not 4 times.
To handle this elegantly, we can count the occurrences of each substring. The trick is to add a substring to our result only when its count reaches exactly 2
. This way:
- When we see it the first time (count = 1), we don't add it
- When we see it the second time (count = 2), we add it to results
- When we see it the third, fourth, or more times (count > 2), we don't add it again
This approach ensures each repeated sequence appears exactly once in our output, regardless of how many times it actually occurs in the DNA string.
Solution Approach
The solution uses a hash table combined with a sliding window technique to efficiently find all repeated 10-letter DNA sequences.
Implementation Steps:
-
Initialize a Counter: We create a hash table
cnt
using Python'sCounter()
to track the frequency of each 10-letter substring. This data structure automatically handles counting and initializing new keys. -
Set up the result list: We initialize an empty list
ans
to store the sequences that appear more than once. -
Sliding Window Iteration: We iterate through the string using a sliding window of size 10:
- The loop runs from index
i = 0
toi = len(s) - 10
, ensuring we don't go out of bounds - At each position
i
, we extract the substringt = s[i : i + 10]
- The loop runs from index
-
Count and Check: For each extracted substring
t
:- We increment its count in the hash table:
cnt[t] += 1
- We check if the count equals exactly 2:
if cnt[t] == 2
- If it's the second occurrence, we add it to our answer:
ans.append(t)
- We increment its count in the hash table:
-
Return Result: After processing all possible 10-letter substrings, we return the
ans
list containing all sequences that appeared more than once.
Why this works:
- The sliding window ensures we examine every possible 10-letter substring in O(n) iterations
- The hash table gives us O(1) lookup and update for counting occurrences
- The "count == 2" check ensures each repeated sequence is added to the result exactly once, avoiding duplicates in the output
Time Complexity: O(n) where n is the length of the DNA string, as we iterate through the string once and perform constant-time operations for each substring.
Space Complexity: O(n) in the worst case, as we might need to store up to n-9 unique substrings in the hash table.
Ready to land your dream job?
Unlock your dream job with a 3-minute evaluator for a personalized learning plan!
Start EvaluatorExample Walkthrough
Let's walk through the solution with a small example DNA sequence: "AAAAACCCCCAAAAACCCCCC"
Step 1: Initialize data structures
- Create an empty counter
cnt = {}
- Create an empty result list
ans = []
Step 2: Sliding window process
We'll extract each 10-letter substring and track its count:
-
i = 0: Extract
"AAAAACCCCC"
cnt["AAAAACCCCC"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 1: Extract
"AAAACCCCCA"
cnt["AAAACCCCCA"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 2: Extract
"AAACCCCCAA"
cnt["AAACCCCCAA"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 3: Extract
"AACCCCCAAA"
cnt["AACCCCCAAA"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 4: Extract
"ACCCCCAAAA"
cnt["ACCCCCAAAA"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 5: Extract
"CCCCCAAAAA"
cnt["CCCCCAAAAA"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 6: Extract
"CCCCAAAAAC"
cnt["CCCCAAAAAC"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 7: Extract
"CCCAAAAACC"
cnt["CCCAAAAACC"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 8: Extract
"CCAAAAACCC"
cnt["CCAAAAACCC"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 9: Extract
"CAAAAACCCC"
cnt["CAAAAACCCC"] = 1
(first occurrence)- Count is 1, don't add to result
-
i = 10: Extract
"AAAAACCCCC"
cnt["AAAAACCCCC"] = 2
(second occurrence!)- Count is 2, add to result:
ans = ["AAAAACCCCC"]
-
i = 11: Extract
"AAAACCCCCC"
cnt["AAAACCCCCC"] = 1
(first occurrence)- Count is 1, don't add to result
Step 3: Return result
- Final answer:
["AAAAACCCCC"]
The substring "AAAAACCCCC"
appears at positions 0 and 10 in the original string, making it a repeated 10-letter sequence. Our algorithm correctly identifies it by adding it to the result only when we encounter it for the second time.
Solution Implementation
1class Solution:
2 def findRepeatedDnaSequences(self, s: str) -> List[str]:
3 """
4 Find all 10-letter-long sequences that occur more than once in a DNA string.
5
6 Args:
7 s: DNA string containing only 'A', 'C', 'G', 'T' characters
8
9 Returns:
10 List of all 10-letter sequences that appear at least twice
11 """
12 from collections import Counter
13 from typing import List
14
15 # Dictionary to count occurrences of each 10-letter substring
16 sequence_count = Counter()
17
18 # List to store sequences that appear more than once
19 result = []
20
21 # Iterate through all possible 10-letter substrings
22 # Range stops at len(s) - 9 to ensure we can extract 10 characters
23 for i in range(len(s) - 9):
24 # Extract current 10-letter substring
25 current_sequence = s[i:i + 10]
26
27 # Increment count for this sequence
28 sequence_count[current_sequence] += 1
29
30 # Add to result only when count reaches exactly 2
31 # This ensures each repeated sequence is added only once
32 if sequence_count[current_sequence] == 2:
33 result.append(current_sequence)
34
35 return result
36
1class Solution {
2 public List<String> findRepeatedDnaSequences(String s) {
3 // HashMap to store the frequency count of each 10-character DNA sequence
4 Map<String, Integer> sequenceCount = new HashMap<>();
5
6 // List to store sequences that appear more than once
7 List<String> repeatedSequences = new ArrayList<>();
8
9 // Iterate through all possible 10-character substrings in the DNA string
10 // Loop runs from index 0 to (length - 10), inclusive
11 for (int i = 0; i <= s.length() - 10; i++) {
12 // Extract the current 10-character substring starting at index i
13 String currentSequence = s.substring(i, i + 10);
14
15 // Increment the count for this sequence and get the new count value
16 // merge() adds 1 to existing value, or sets to 1 if key doesn't exist
17 int newCount = sequenceCount.merge(currentSequence, 1, Integer::sum);
18
19 // If this sequence has been seen exactly twice, add it to the result
20 // Using == 2 ensures each repeated sequence is added only once
21 if (newCount == 2) {
22 repeatedSequences.add(currentSequence);
23 }
24 }
25
26 // Return the list of DNA sequences that appear more than once
27 return repeatedSequences;
28 }
29}
30
1class Solution {
2public:
3 vector<string> findRepeatedDnaSequences(string s) {
4 // Hash map to store frequency of each 10-letter substring
5 unordered_map<string, int> substringFrequency;
6
7 // Result vector to store repeated DNA sequences
8 vector<string> result;
9
10 // Calculate the number of possible 10-letter substrings
11 int numSubstrings = s.size() - 10 + 1;
12
13 // Iterate through all possible starting positions for 10-letter substrings
14 for (int i = 0; i < numSubstrings; ++i) {
15 // Extract the current 10-letter substring
16 string currentSubstring = s.substr(i, 10);
17
18 // Increment the frequency count for this substring
19 substringFrequency[currentSubstring]++;
20
21 // If this substring appears exactly twice, add it to the result
22 // (This ensures each repeated sequence is added only once)
23 if (substringFrequency[currentSubstring] == 2) {
24 result.emplace_back(currentSubstring);
25 }
26 }
27
28 return result;
29 }
30};
31
1/**
2 * Finds all 10-letter-long sequences (substrings) that occur more than once in a DNA string
3 * @param s - The input DNA string containing only 'A', 'C', 'G', and 'T' characters
4 * @returns An array of all repeated 10-letter DNA sequences
5 */
6function findRepeatedDnaSequences(s: string): string[] {
7 const stringLength: number = s.length;
8
9 // Map to store the count of each 10-letter sequence
10 const sequenceCountMap: Map<string, number> = new Map<string, number>();
11
12 // Array to store sequences that appear more than once
13 const repeatedSequences: string[] = [];
14
15 // Iterate through all possible 10-letter sequences in the string
16 for (let startIndex = 0; startIndex <= stringLength - 10; startIndex++) {
17 // Extract the current 10-letter sequence
18 const currentSequence: string = s.slice(startIndex, startIndex + 10);
19
20 // Update the count for this sequence in the map
21 const currentCount: number = (sequenceCountMap.get(currentSequence) ?? 0) + 1;
22 sequenceCountMap.set(currentSequence, currentCount);
23
24 // Add to result array only when the sequence appears for the second time
25 // This ensures each repeated sequence is added only once
26 if (currentCount === 2) {
27 repeatedSequences.push(currentSequence);
28 }
29 }
30
31 return repeatedSequences;
32}
33
Time and Space Complexity
The time complexity is O(n × 10)
, where n
is the length of the string s
. This is because:
- The loop iterates
n - 10 + 1
times, which is approximatelyn
iterations - In each iteration, we extract a substring of length 10 using
s[i : i + 10]
, which takesO(10)
time - The hash table operations (lookup and insertion) for strings of length 10 also take
O(10)
time - Therefore, the total time complexity is
O(n × 10)
The space complexity is O(n × 10)
, where n
is the length of the string s
. This is because:
- In the worst case, all possible 10-character substrings are unique and stored in the
Counter
dictionary - There can be at most
n - 10 + 1
unique substrings, which is approximatelyn
- Each substring stored takes
O(10)
space - The answer list
ans
can contain at mostn - 10 + 1
substrings in the worst case, each of length 10 - Therefore, the total space complexity is
O(n × 10)
Common Pitfalls
1. Adding Duplicates to the Result
A common mistake is adding the same sequence to the result multiple times when it appears more than twice in the DNA string.
Incorrect approach:
# WRONG: This adds the sequence every time it's seen after the first occurrence if sequence_count[current_sequence] > 1: result.append(current_sequence)
This would add "AAAAAAAAAA" three times if it appears four times in the string.
Correct approach:
# RIGHT: Only add when count equals exactly 2 if sequence_count[current_sequence] == 2: result.append(current_sequence)
2. Using a Set Instead of Checking Count == 2
Some might try using a set to avoid duplicates:
Inefficient approach:
result = set()
for i in range(len(s) - 9):
current_sequence = s[i:i + 10]
sequence_count[current_sequence] += 1
if sequence_count[current_sequence] > 1:
result.add(current_sequence)
return list(result)
While this works, it's less efficient because:
- Converting between set and list adds overhead
- The set needs to check for duplicates on every add operation
- The elegance of the "count == 2" solution is lost
3. Incorrect Window Boundaries
Wrong boundary calculation:
# WRONG: This causes index out of bounds
for i in range(len(s) - 10 + 1): # Should be len(s) - 9
current_sequence = s[i:i + 10]
Or even worse:
# WRONG: Misses the last valid substring
for i in range(len(s) - 11):
current_sequence = s[i:i + 10]
Correct approach:
# RIGHT: len(s) - 9 ensures we can always extract 10 characters
for i in range(len(s) - 9):
current_sequence = s[i:i + 10]
4. Not Handling Edge Cases
Forgetting to handle strings shorter than 10 characters:
Incomplete solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
# Missing edge case check
sequence_count = Counter()
result = []
for i in range(len(s) - 9): # This becomes range(negative) if len(s) < 10
# ...
Better solution:
def findRepeatedDnaSequences(self, s: str) -> List[str]:
if len(s) < 10:
return []
sequence_count = Counter()
result = []
# ... rest of the code
5. Using Regular Dictionary Instead of Counter
Using a regular dict requires manual initialization:
More verbose approach:
sequence_count = {}
for i in range(len(s) - 9):
current_sequence = s[i:i + 10]
# Need to check if key exists first
if current_sequence not in sequence_count:
sequence_count[current_sequence] = 0
sequence_count[current_sequence] += 1
Using Counter()
or defaultdict(int)
eliminates this boilerplate and potential KeyError issues.
Which data structure is used in a depth first search?
Recommended Readings
https assets algo monster cover_photos stack svg Sliding Window Maximum Monotonic Stack We have an array and a sliding window defined by a start index and an end index The sliding window moves from left of the array to right There are always k elements in the window The window
Coding Interview Patterns Your Personal Dijkstra's Algorithm to Landing Your Dream Job The goal of AlgoMonster is to help you get a job in the shortest amount of time possible in a data driven way We compiled datasets of tech interview problems and broke them down by patterns This way
Recursion Recursion is one of the most important concepts in computer science Simply speaking recursion is the process of a function calling itself Using a real life analogy imagine a scenario where you invite your friends to lunch https assets algo monster recursion jpg You first call Ben and ask
Want a Structured Path to Master System Design Too? Don’t Miss This!