433. Minimum Genetic Mutation
Problem Description
In this problem, we are given a start gene string and an end gene string, each 8 characters long composed of the characters 'A', 'C', 'G', and 'T'. We also have a bank of valid gene mutations. A mutation is a change of a single character in the gene string, and a gene must be in the bank to be considered a valid gene mutation.
The task is to determine the minimum number of mutations required to change the start gene into the end gene. If it's not possible to reach the end gene from the start gene by only using valid mutations from the bank, we must return -1. It is worth noting that the start gene is considered valid even if it's not present in the bank, but every step after must be a mutation existing in the bank.
To clarify: a gene string mutation is represented as a single character change from the set of possible characters. For example, changing 'AACCGGTT' to 'AACCGGTA' is considered one mutation.
Intuition
To solve this problem, we can use either Breadth-First Search (BFS) or Depth-First Search (DFS), but BFS is more convenient for finding the shortest path in an unweighted graph, which is analogous to finding the minimum number of mutations needed.
The intuition behind the solution is to treat each gene string as a node in a graph and each valid mutation as an edge connecting two nodes. We use BFS to explore the graph level by level, starting from the start gene and working towards the end gene. BFS is ideal for this scenario because it finds the shortest path (minimum mutations) from the start to the end node.
Here are the key steps of the BFS approach:
- Initialize a set from the bank to quickly check if a mutation is valid.
- Begin with a queue initialized with the start gene and a mutation count of 0.
- Use a dictionary to define mutation possibilities for each character.
- Dequeue an element and iterate over its characters, changing one character at a time according to the mutation possibilities.
- If a new mutation is valid (in the bank), enqueue it with a mutation count incremented by one.
- If we reach the end gene, return the mutation count.
- If the queue is exhausted without finding the end gene, return -1.
By using this BFS approach, we explore all possible gene mutations in the shortest number of steps, ensuring the minimum number of mutations needed to achieve the end gene, if possible.
Learn more about Breadth-First Search patterns.
Solution Approach
The solution uses Breadth-First Search (BFS), which is an algorithm for traversing or searching tree or graph data structures. It explores the neighbor nodes at the present depth prior to moving on to nodes at the next depth level. The implementation includes the following algorithms, data structures, and patterns:
-
Set: A set is initialized from the bank, which provides O(1) complexity for checking the existence of a gene string.
-
Queue (Deque): The deque data structure is used as a queue to implement BFS. Items are appended at one end and popped from the other end, mimicking a queue's First In, First Out (FIFO) behavior.
-
Hash Map: A dictionary
mp
maps each character to the other three characters, indicating the possible mutations from one character. This is used to easily determine the viable mutations at any position of a gene string. -
Level-by-Level Traversal: The algorithm traverses the graph level by level. It considers all gene strings that can be reached from the current gene string in a single mutation as neighbors and explores them in the BFS manner.
-
Early Stopping: If at any point the
end
gene is found during the exploration, the function immediately returns the number of mutations (steps) taken to reach this state. -
Graph Pruning: To avoid revisiting gene strings and thus entering cycles, every time a new valid gene string is found in the bank, it is removed from the set. This ensures that each gene string is visited only once.
The implementation steps are as follows:
- Create a
set
of the gene bank to facilitate quick look-up. - Initialize a
deque
with a tuple containing the start gene and an initial step count of 0. - Define a
mp
dictionary representing the mutation map from each character to the other three possible characters. - Start the BFS loop by popping an element from the queue:
- For each character
v
in the current gene stringt
, loop through the possible charactersj
inmp[v]
to generate new strings. - Construct a new gene string
next
by replacing the character at positioni
withj
. - If
next
is found in the set, append it to the queue with the step count incremented and remove it from the set to prevent future visits.
- For each character
- Continue the loop until the queue is empty or the end gene is reached.
Here is the complete BFS implementation in code:
1s = set(bank) # Convert bank into a set for O(1) access
2q = deque([(start, 0)]) # Initialize the queue with a tuple of start gene and step counter
3mp = {'A': 'TCG', 'T': 'ACG', 'C': 'ATG', 'G': 'ATC'} # Mutation map
4
5# BFS loop
6while q:
7 t, step = q.popleft() # Pop the next gene string and current step count
8 if t == end: # End reached
9 return step
10 for i, v in enumerate(t): # Iterate through each character in the string
11 for j in mp[v]: # Go through all possible mutations for the character
12 next = t[:i] + j + t[i + 1:] # Create the new mutated string
13 if next in s: # Check if the mutation is valid
14 q.append((next, step + 1)) # Enqueue the new gene with incremented steps
15 s.remove(next) # Remove the visited gene from the set
16return -1 # Return -1 if end can't be reached
This code utilizes BFS to find the shortest path from startGene
to endGene
, which corresponds to the minimum mutations required to transform one into the other.
Ready to land your dream job?
Unlock your dream job with a 2-minute evaluator for a personalized learning plan!
Start EvaluatorExample Walkthrough
Let's demonstrate the solution approach with an example:
Assume our start gene is AACCGGTT
and our end gene is AACCGGTA
. The bank of valid gene mutations includes AACCGGTA
, AACCGGTC
, and AACCGGCT
.
-
The set of bank mutations is
{AACCGGTA, AACCGGTC, AACCGGCT}
. Since we can check the presence of a gene in O(1) time, this helps us quickly validate new mutations. -
We initialize our queue (deque) with the start gene and a mutation count of 0:
deque([('AACCGGTT', 0)])
. -
Our mutation map
mp
is{'A': 'TCG', 'T': 'ACG', 'C': 'ATG', 'G': 'ATC'}
, showing all possible single character mutations for each character.
Now, we begin the BFS:
- We dequeue the first element:
('AACCGGTT', 0)
. - Beginning with the first character, we check all other characters, but since no mutation is present in the bank for a change in the first character, we move to the second character and so on.
- We finally reach the last character 'T'. The possible mutations for 'T' are 'A', 'C', and 'G'.
- We mutate the last character to 'A' (resulting in
AACCGGTA
) and check if it's in the bank. It is, so we enqueue it with a mutation count incremented by one:deque([('AACCGGTA', 1)])
. - We remove
AACCGGTA
from the set to avoid revisiting this gene string. - Continuing the BFS, we dequeue
('AACCGGTA', 1)
and find it's the end gene. The search is over, and we return 1.
Hence, we have found that the minimum number of mutations to get from AACCGGTT
to AACCGGTA
using the given gene bank is 1.
Solution Implementation
1from collections import deque
2
3class Solution:
4 def minMutation(self, start: str, end: str, bank: list[str]) -> int:
5 # Convert the bank into a set for faster lookup
6 bank_set = set(bank)
7
8 # Initialize a queue with the starting sequence and initial step count of 0
9 queue = deque([(start, 0)])
10
11 # Map to store the possible mutations for each gene
12 mutation_map = {'A': 'TCG', 'T': 'ACG', 'C': 'ATG', 'G': 'ATC'}
13
14 # Perform BFS to find the minimum mutation steps
15 while queue:
16 current_sequence, step_count = queue.popleft()
17
18 # If the current sequence matches the end sequence, we return the steps taken
19 if current_sequence == end:
20 return step_count
21
22 # Try all possible single-gene mutations of the current sequence
23 for i, gene in enumerate(current_sequence):
24 for mutation in mutation_map[gene]:
25 # Form the new sequence by replacing the gene at index i with the mutation
26 next_sequence = current_sequence[:i] + mutation + current_sequence[i + 1:]
27 # If the new sequence is in the bank, add it to the queue and remove from bank
28 if next_sequence in bank_set:
29 queue.append((next_sequence, step_count + 1))
30 bank_set.remove(next_sequence)
31
32 # If we reach this point, there is no way to mutate start into end with the given bank
33 return -1
34
1// Import statements for the HashSet, HashMap, and LinkedList
2import java.util.HashSet;
3import java.util.HashMap;
4import java.util.Set;
5import java.util.Map;
6import java.util.Deque;
7import java.util.LinkedList;
8
9// Pair class implementation (as Pair class was not explicitly provided or imported)
10class Pair<K, V> {
11 private K key;
12 private V value;
13
14 public Pair(K key, V value) {
15 this.key = key;
16 this.value = value;
17 }
18
19 public K getKey() {
20 return key;
21 }
22
23 public V getValue() {
24 return value;
25 }
26}
27
28class Solution {
29 // Function to find the minimum number of mutations to transform start string into end string
30 public int minMutation(String start, String end, String[] bank) {
31 // Create a set from the given bank array for efficient lookups
32 Set<String> bankSet = new HashSet<>();
33 for (String gene : bank) {
34 bankSet.add(gene);
35 }
36
37 // Map to hold the possible mutations for each gene character
38 Map<Character, String> mutationMap = new HashMap<>(4);
39 mutationMap.put('A', "TCG");
40 mutationMap.put('T', "ACG");
41 mutationMap.put('C', "ATG");
42 mutationMap.put('G', "ATC");
43
44 // Initialize a queue to perform Breadth-First Search (BFS)
45 Deque<Pair<String, Integer>> queue = new LinkedList<>();
46 queue.offer(new Pair<>(start, 0)); // Insert start gene with step 0 into the queue
47
48 // Loop until the queue is empty
49 while (!queue.isEmpty()) {
50 // Poll the first pair from the queue
51 Pair<String, Integer> current = queue.poll();
52 String currentGene = current.getKey();
53 int currentStep = current.getValue();
54
55 // If the current gene matches the end gene, return the number of steps taken
56 if (end.equals(currentGene)) {
57 return currentStep;
58 }
59
60 // Iterate through all characters of the current gene
61 for (int i = 0; i < currentGene.length(); ++i) {
62 // Explore each possible mutation based on the mapping
63 for (char mutation : mutationMap.get(currentGene.charAt(i)).toCharArray()) {
64 // Construct the new mutated gene
65 String mutatedGene = currentGene.substring(0, i) + mutation + currentGene.substring(i + 1);
66
67 // If the mutated gene is in the bank
68 if (bankSet.contains(mutatedGene)) {
69 // Add the mutated gene with an incremented step count into the queue
70 queue.offer(new Pair<>(mutatedGene, currentStep + 1));
71 // Remove the mutated gene from the set to prevent revisiting
72 bankSet.remove(mutatedGene);
73 }
74 }
75 }
76 }
77
78 // If no mutation leads to the end gene, return -1
79 return -1;
80 }
81}
82
1#include <unordered_set>
2#include <unordered_map>
3#include <queue>
4#include <string>
5#include <vector>
6
7class Solution {
8public:
9 int minMutation(std::string start, std::string end, std::vector<std::string>& bank) {
10 // Create a set to store valid genes from the bank for O(1) access
11 std::unordered_set<std::string> validGenes(bank.begin(), bank.end());
12
13 // Map to hold potential mutations for each gene character
14 std::unordered_map<char, std::string> mutationsMap;
15 mutationsMap['A'] = "TCG";
16 mutationsMap['T'] = "ACG";
17 mutationsMap['C'] = "ATG";
18 mutationsMap['G'] = "ATC";
19
20 // Queue to perform BFS, storing gene strings and their mutation step counts
21 std::queue<std::pair<std::string, int>> mutationQueue;
22 mutationQueue.push({start, 0});
23
24 // Perform BFS
25 while (!mutationQueue.empty()) {
26 auto current = mutationQueue.front();
27 mutationQueue.pop();
28 std::string gene = current.first;
29 int steps = current.second;
30
31 // Return the number of steps if the end gene is reached
32 if (gene == end) return steps;
33
34 // Try mutating each character of the gene string
35 for (size_t i = 0; i < gene.size(); ++i) {
36 // Check each possible mutation for the current character
37 for (char mut : mutationsMap[gene[i]]) {
38 std::string nextGene = gene.substr(0, i) + mut + gene.substr(i + 1);
39
40 // If mutation is valid and in the bank, add it to the queue
41 if (validGenes.count(nextGene)) {
42 mutationQueue.push({nextGene, steps + 1});
43 // Remove the gene from set to prevent revisiting
44 validGenes.erase(nextGene);
45 }
46 }
47 }
48 }
49
50 // If the end gene cannot be reached, return -1
51 return -1;
52 }
53};
54
1function minMutation(start: string, end: string, bank: string[]): number {
2 // Initialize a queue and start with the starting sequence
3 const sequenceQueue: string[] = [start];
4
5 // This variable will hold the number of mutations needed
6 let mutationCount: number = 0;
7
8 // Process the queue until it's empty
9 while (sequenceQueue.length > 0) {
10 // Get the number of sequences to process in this round
11 const numberOfSequences: number = sequenceQueue.length;
12
13 // Process all sequences in the current round
14 for (let i = 0; i < numberOfSequences; i++) {
15 // Remove the first sequence from the queue to process it
16 const currentSequence: string = sequenceQueue.shift();
17
18 // If the current sequence is the target end sequence, return mutation count
19 if (currentSequence === end) {
20 return mutationCount;
21 }
22
23 // Iterate through the bank in reverse order to remove elements while iterating
24 for (let j = bank.length - 1; j >= 0; j--) {
25 const bankSequence: string = bank[j];
26 let difference: number = 0;
27
28 // Compare the current sequence with the bank sequence and count the differences
29 for (let k = 0; k < 8; k++) {
30 if (currentSequence[k] !== bankSequence[k]) {
31 difference++;
32 }
33 }
34
35 // A valid mutation is one where there is exactly one difference
36 if (difference === 1) {
37 // Add the new mutation to the queue and remove it from the bank
38 sequenceQueue.push(...bank.splice(j, 1));
39 }
40 }
41 }
42 // Increment the mutation counter after each round
43 mutationCount++;
44 }
45
46 // If no mutation path was found to reach the end sequence, return -1
47 return -1;
48}
49
Time and Space Complexity
The given code snippet is a solution to the genetic mutation problem, which finds the minimum number of mutations needed to transform the start gene sequence into the end gene sequence using a given set of gene mutations from the bank.
Time Complexity
To analyze time complexity, let's look at the components involved:
-
The BFS traversal: The algorithm uses a queue to perform a breadth-first search.
-
Checking each character and trying all possible mutations: For each gene sequence, the algorithm checks each of its
N
characters (whereN
is the length of the gene sequence, typically 8 in the problem constraint) and attempts to mutate it to all other possible characters (3 possibilities per character, since there are 4 possible nucleotides). -
Checking and removing gene sequences from the set: Upon each successful mutation, it checks if the new sequence is in the set and removes it if present.
Considering that there are M
gene sequences in the bank:
- In the worst case, the algorithm might have to visit all
M
sequences in the bank. - For each sequence, we perform
N
character checks and3
possible mutations. - Each mutation check involves an
O(1)
set containment check andO(1)
removal operation (since it's a set).
Thus, the total time complexity can be estimated as O(M * N * 3)
, which simplifies to O(M * N)
as constants can be ignored in Big O notation.
Space Complexity
The primary space-consuming structures are:
- The queue used for BFS which, in the worst case, might hold all
M
sequences before mutation at the same level. - The set
s
, also holding theM
gene sequences from the bank. - The
mp
dictionary with a constant size of 4 (not dependent onN
orM
), therefore can be consideredO(1)
.
So the space complexity is dominated by the queue and the set, both of which may hold up to M
gene sequences. Hence, the space complexity is O(M)
.
Learn more about how to find time and space complexity quickly using problem constraints.
You are given an array of intervals where intervals[i] = [start_i, end_i]
represent the start and end of the ith
interval. You need to merge all overlapping intervals and return an array of the non-overlapping intervals that cover all the intervals in the input.
Recommended Readings
https algomonster s3 us east 2 amazonaws com cover_photos bfs svg Breadth First Search on Trees Hopefully by this time you've drunk enough DFS Kool Aid to understand its immense power and seen enough visualization to create a call stack in your mind Now let me introduce the companion spell
Patterns The Shortest Path Algorithm for Coding Interviews The goal of AlgoMonster is to help you get a job in the shortest amount of time possible in a data driven way We compiled datasets of tech interview problems and broke them down by patterns This way we can determine the
Recursion Recursion is one of the most important concepts in computer science Simply speaking recursion is the process of a function calling itself Using a real life analogy imagine a scenario where you invite your friends to lunch https algomonster s3 us east 2 amazonaws com recursion jpg You first
Got a question?ย Ask the Monster Assistantย anything you don't understand.
Still not clear? ย Submitย the part you don't understand to our editors. Or join ourย Discord and ask the community.