187. Repeated DNA Sequences
Problem Description
In the given LeetCode problem, we are tasked with analyzing a string that represents a DNA sequence. A DNA sequence consists of a series of nucleotides, which are denoted by the characters 'A', 'C', 'G', and 'T'. Our goal is to find all the unique substrings of length 10 that repeat more than once within this DNA sequence. We need to return these substrings in any order, without accounting for their frequency beyond the first occurrence.
Intuition
Encountering this problem, the first intuition is to track the frequency of every possible 10-letter-long substring within the larger DNA string. To efficiently look up these sequences, a HashTable (Python's Counter
dictionary can be used here for convenience) is a natural choice because it allows insertions and lookups in constant time, assuming a good hash distribution.
The thinking process goes something like this:
-
We want to iterate over the sequence only once and keep track of every 10-letter sequence as we go. This is a sliding window problem.
-
Each time we encounter a new 10-letter sequence, we add it to the HashTable. If the sequence already exists in the table, we increment its count.
-
Instead of waiting until the end to find out which sequences have occurred more than once, we can check the count as we go. If a sequence's count reaches 2, we add it to the answer list. We only need to do this once per sequence because the problem asks for sequences that occur 'more than once,' not the specific number of times they occur.
-
When a sequence's count exceeds 2, we don't need to do anything more with it, so we can ignore it. This avoids adding duplicates to our results list and keeps the operation more efficient.
-
Finally, we return the list of repeating 10-letter sequences.
By using this approach, we optimize for both time and space by not keeping track of sequences that don't matter (those with counts less than 2 after processing) and by not duplicating entries in our results list.
Learn more about Sliding Window patterns.
Solution Approach
The implementation leverages a HashTable to track 10-letter sequences, which provides an efficient way to check for duplicate sequences as we iterate through the DNA string. Here's how the solution unfolds:
-
Define a variable
n
which represents the range we need to check within the string. Since we're looking for sequences of length 10, we setn
tolen(s) - 10
. This ensures we don't go over the string length when checking for the substrings at the end of the DNA sequence. -
Initialize a
Counter
(a special HashTable from Python's collections library) to keep track of the counts of all 10-letter sequences we find. -
Define an empty list
ans
, which will store the sequences that repeat more than once. -
Iterate over the DNA string using a range that goes from
0
ton+1
. This range represents the starting index of each 10-letter sequence we want to examine. -
Inside the loop, extract a substring
sub
of length 10 starting from the current indexi
. -
Increment the count of this substring in the
Counter
. TheCounter
automatically handles the creation of new keys if the substring hasn't been seen before. -
Check if the count of the current substring has reached 2. If it has, it means this sequence has been encountered once before and now is the second occurrence. It is now qualified as a repeating sequence and should be appended to the
ans
list. -
Continue this process until we've checked all possible starting indices of 10-letter sequences in the DNA string.
-
Once the loop completes, return the
ans
list holding all the sequences that repeated more than once.
The application of this method ensures we only store and return unique sequences that fulfill the problem's conditions. The usage of the Counter
allows for both efficient insertion and lookup operations, where the worst-case time complexity is O(n * 10)
accounting for the substring operation in each iteration. The space complexity is O(n)
, as we potentially store information about each 10-letter sequence present in the DNA string.
Ready to land your dream job?
Unlock your dream job with a 2-minute evaluator for a personalized learning plan!
Start EvaluatorExample Walkthrough
Let’s consider a small example to illustrate the solution approach using a given DNA string: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
.
-
We initialize
n
aslen(s) - 10
, which equates to20
since the string length is30
. This signifies that we will check for 10-letter sequences starting at indices0
to20
. -
We then create a
Counter
to keep track of the sequences. -
We initialize an empty list
ans
to store our answers. -
We begin iterating from index
0
to20
(inclusive) to check all possible 10-letter-long sequences.-
At index
0
, we get the substringsub = "AAAAACCCCC"
. We increment its count in theCounter
. It's the first time we've seen it, so its count is now1
. -
Moving to index
1
, the substring issub = "AAAACCCCCA"
. We increment its count in theCounter
. -
This process continues with new substrings as the index increases.
-
At index
10
, we reach the substringsub = "AAAAACCCCC"
, which was previously seen at index0
. Its count in theCounter
becomes2
. Since the count is2
, we add this sequence to theans
list. -
At index
15
, the substringsub = "CCCCCAAAAA"
appears. This is also the second time we've encountered it, so its count becomes2
, and it is added to theans
list.
-
-
After iterating through all starting indices from
0
to20
, we finish withans
containing the sequences that appeared more than once. In this case,ans = ["AAAAACCCCC", "CCCCCAAAAA"]
. -
Finally, we return the list
ans
as our answer, which contains the repeated sequences, satisfying the problem's conditions.
This example demonstrates the process of using the sliding window technique, combined with a Counter
, to efficiently determine sequences that repeat more than once within a given DNA string. By checking and updating counts as we iterate, we're able to avoid unnecessary work and produce our answer without redundancies.
Solution Implementation
1from collections import Counter
2from typing import List
3
4class Solution:
5 def findRepeatedDnaSequences(self, dna_sequence: str) -> List[str]:
6 # Initialize a counter to keep track of the DNA sequence occurrences
7 sequence_count = Counter()
8 # Initialize an empty list to store the repeated sequences
9 repeated_sequences = []
10 # Calculate the number of possible substrings of length 10
11 num_substrings = len(dna_sequence) - 10
12 # Iterate over all the substrings of length 10 in the DNA sequence
13 for i in range(num_substrings + 1):
14 # Extract a substring of length 10
15 subsequence = dna_sequence[i: i + 10]
16 # Increment the count for this particular substring
17 sequence_count[subsequence] += 1
18 # If the count reaches 2, it means we've found a repeated sequence
19 if sequence_count[subsequence] == 2:
20 # Add it to the list of repeated sequences
21 repeated_sequences.append(subsequence)
22 # Return the list of repeated sequences
23 return repeated_sequences
24
1import java.util.HashMap;
2import java.util.List;
3import java.util.ArrayList;
4import java.util.Map;
5
6class Solution {
7 public List<String> findRepeatedDnaSequences(String s) {
8 // Calculate the number of 10-character substrings in 's'
9 int substringLengthLimit = s.length() - 10;
10 // Create a map to store counts of each DNA sequence
11 Map<String, Integer> sequenceCounts = new HashMap<>();
12 // List to store the DNA sequences that appear more than once
13 List<String> repeatedSequences = new ArrayList<>();
14
15 // Iterate through the string checking for 10-character long substrings
16 for (int i = 0; i <= substringLengthLimit; ++i) {
17 // Extract the 10-character substring starting at index 'i'
18 String currentSubstring = s.substring(i, i + 10);
19 // Increase the count of the current DNA sequence in the map
20 sequenceCounts.put(currentSubstring, sequenceCounts.getOrDefault(currentSubstring, 0) + 1);
21 // If this is the second occurrence of the DNA sequence, add it to the answer list
22 if (sequenceCounts.get(currentSubstring) == 2) {
23 repeatedSequences.add(currentSubstring);
24 }
25 }
26 // Return the list of repeated DNA sequences
27 return repeatedSequences;
28 }
29}
30
1#include <vector>
2#include <string>
3#include <unordered_map>
4
5class Solution {
6public:
7 std::vector<std::string> findRepeatedDnaSequences(std::string s) {
8 // Use an unordered_map for efficient lookup and insertion
9 std::unordered_map<std::string, int> sequenceCount;
10
11 // Calculate the number of possible starting indices for 10-letter sequences
12 int possibleStarts = s.size() - 9;
13
14 // Prepare a vector to store the resulting repeated sequences
15 std::vector<std::string> repeats;
16
17 // Loop through the string to examine all possible 10-letter sequences
18 for (int i = 0; i < possibleStarts; ++i) {
19 // Extract a 10-letter substring starting at index i
20 std::string subsequence = s.substr(i, 10);
21
22 // Increment the count for this particular sequence
23 // If the count becomes 2, it implies this is the second time we've seen it,
24 // so we add it to the list of repeats
25 if (++sequenceCount[subsequence] == 2) {
26 repeats.push_back(subsequence);
27 }
28 }
29
30 // Return the list of repeated sequences
31 return repeats;
32 }
33};
34
1function findRepeatedDnaSequences(dnaSequence: string): string[] {
2 // Define the length of the sequence.
3 const sequenceLength = dnaSequence.length;
4 // Create a map to keep track of all 10-letter sequences.
5 const sequenceCountMap = new Map<string, number>();
6 // Initialize the array to store the result.
7 const repeatedSequences = [];
8 // Iterate over the DNA sequence to the point where a 10-letter sequence can be extracted.
9 for (let i = 0; i <= sequenceLength - 10; i++) {
10 // Extract the current 10-letter sequence.
11 const currentSequence = dnaSequence.slice(i, i + 10);
12 // Check if the current sequence is already in the map.
13 if (sequenceCountMap.has(currentSequence)) {
14 // If it's in the map and the value is true, add it to the results.
15 // The value is set to true when the sequence is seen for the second time.
16 if (sequenceCountMap.get(currentSequence) === 1) {
17 repeatedSequences.push(currentSequence);
18 // Increment the count to indicate that the sequence has been added to the results.
19 // This prevents adding the same sequence more than once.
20 sequenceCountMap.set(currentSequence, 2);
21 }
22 } else {
23 // If it's a new sequence, add it to the map with a count of 1,
24 // indicating that it has been seen for the first time.
25 sequenceCountMap.set(currentSequence, 1);
26 }
27 }
28 // Return the array containing all repeated 10-letter sequences found.
29 return repeatedSequences;
30}
31
Time and Space Complexity
The provided code snippet finds all the 10-letter-long sequences in the string s
that appear more than once. To analyze the time and space complexity of this Python code, we will review each significant portion of the code.
Time Complexity:
The code loops through the string s
, starting from index 0
to index n
(which is len(s) - 10
). In each iteration, it creates a substring of length 10
and increases the corresponding counter for this substring. Since the loop runs for (len(s) - 10) + 1
times and each operation within the loop (creating a substring and updating the counter) can be considered as O(1), the overall time complexity is O(n - 10 + 1)
which simplifies to O(n)
where n
is the length of the input string s
.
Space Complexity:
The space complexity is determined by the additional space required by the algorithm which is primarily dependent on the cnt
and ans
variables.
- The
Counter
object will at most store each unique 10-letter-long sequence present ins
. In the worst case, this will ben - 9
unique sequences if every sequence is unique. This gives O(n) as the space complexity contribution fromcnt
. - The
ans
list stores each 10-letter sequence that appears exactly twice. In the worst case, if every possible 10-letter sequence appeared exactly twice, this would yieldO(n/10)
space complexity which simplifies toO(n)
.
By combining the above considerations, the overall space complexity is O(n)
.
Learn more about how to find time and space complexity quickly using problem constraints.
Which of the following is a good use case for backtracking?
Recommended Readings
https algomonster s3 us east 2 amazonaws com cover_photos stack svg Sliding Window Maximum Monotonic Stack We have an array and a sliding window defined by a start index and an end index The sliding window moves from left of the array to right There are always k elements in
LeetCode Patterns Your Personal Dijkstra's Algorithm to Landing Your Dream Job The goal of AlgoMonster is to help you get a job in the shortest amount of time possible in a data driven way We compiled datasets of tech interview problems and broke them down by patterns This way we
Recursion Recursion is one of the most important concepts in computer science Simply speaking recursion is the process of a function calling itself Using a real life analogy imagine a scenario where you invite your friends to lunch https algomonster s3 us east 2 amazonaws com recursion jpg You first
Want a Structured Path to Master System Design Too? Don’t Miss This!