433. Minimum Genetic Mutation
Problem Description
You have a gene string represented by an 8-character string containing only the characters 'A'
, 'C'
, 'G'
, and 'T'
.
You need to find the minimum number of mutations required to change a starting gene string (startGene
) into a target gene string (endGene
). A mutation is defined as changing exactly one character in the gene string.
For example: "AACCGGTT" → "AACCGGTA"
is one mutation (the last character changed from 'T'
to 'A'
).
There's an additional constraint: you're given a gene bank (bank
) which contains all valid gene mutations. Each intermediate gene string during the mutation process must exist in this bank to be considered valid. The starting gene is assumed to be valid and may or may not be in the bank.
The task is to return the minimum number of mutations needed to transform startGene
to endGene
. If it's impossible to reach the target gene through valid mutations, return -1
.
Key points:
- Each gene string is exactly 8 characters long
- Only characters
'A'
,'C'
,'G'
,'T'
are allowed - Each mutation changes exactly one character
- All intermediate genes must be in the bank (except possibly the starting gene)
- Find the shortest mutation path from start to end
Flowchart Walkthrough
First, let's pin down the algorithm using the Flowchart. Here's a step-by-step walkthrough:
Is it a graph?
- Yes: This problem can be modeled as a graph where each gene string is a node, and edges exist between gene strings that differ by exactly one character. We need to find the shortest path from
startGene
toendGene
through valid intermediate genes in the bank.
Is it a tree?
- No: The graph is not a tree. Multiple gene strings can potentially mutate into the same gene string, and there can be cycles (gene A → gene B → gene C → gene A is possible if all mutations are valid).
Is the problem related to directed acyclic graphs?
- No: The graph can have cycles, so it's not a DAG. A gene mutation can potentially be reversed through a series of other mutations.
Is the problem related to shortest paths?
- Yes: We need to find the minimum number of mutations (shortest path) from the starting gene to the ending gene.
Is the graph Weighted?
- No: Each mutation (edge) has the same cost of 1. Every single character change counts as exactly one mutation, making this an unweighted graph.
BFS
- Yes: For unweighted shortest path problems, BFS is the optimal choice.
Conclusion: The flowchart suggests using BFS (Breadth-First Search) for finding the minimum number of mutations from startGene
to endGene
. BFS is perfect here because:
- We're dealing with an unweighted graph (each mutation has equal cost)
- We need the shortest path between two nodes
- BFS explores nodes level by level, guaranteeing the first time we reach the target is via the shortest path
Intuition
Think of this problem as finding the shortest path in a maze where each room (gene string) has doors to other rooms that differ by exactly one character. We start in one room (startGene
) and want to reach another room (endGene
) using the minimum number of steps.
The key insight is that each gene string can be connected to other gene strings that differ by exactly one character. This creates a graph structure where:
- Nodes = gene strings
- Edges = exist between genes that differ by exactly one character
- Edge weight = 1 (each mutation counts as one step)
Since we want the minimum number of mutations and each mutation has the same cost (1), we're essentially looking for the shortest path in an unweighted graph. BFS is perfect for this because it explores all genes at distance 1 first, then distance 2, and so on. This level-by-level exploration guarantees that the first time we encounter the target gene, we've found it via the shortest path.
The approach becomes clear:
- Start from
startGene
and explore all valid genes that are one mutation away - From each of those, explore genes that are one more mutation away
- Continue until we find
endGene
or exhaust all possibilities
We use a queue to maintain the BFS order and a visited set to avoid revisiting the same gene (which would create unnecessary cycles). For each gene we process, we check all genes in the bank to see which ones differ by exactly one character - these become our next candidates to explore.
The beauty of BFS here is that it naturally tracks the mutation count through its level-by-level traversal. When we finally reach endGene
, the depth level tells us exactly how many mutations were needed.
Solution Approach
The implementation uses BFS with a queue and visited set to find the shortest mutation path:
Data Structures Used:
deque
(double-ended queue): Stores tuples of(gene_string, mutation_count)
for BFS traversalset
(vis): Tracks visited gene strings to avoid revisiting and creating cycles
Algorithm Steps:
-
Initialize BFS structures:
- Create a queue
q
with the starting gene and mutation count 0:[(startGene, 0)]
- Create a visited set
vis
containing the starting gene to mark it as explored
- Create a queue
-
BFS traversal: While the queue is not empty:
- Dequeue the front element to get current
gene
and itsdepth
(number of mutations so far) - Check if we've reached the target: if
gene == endGene
, returndepth
- Dequeue the front element to get current
-
Explore neighbors: For each gene string
nxt
in the bank:- Calculate the difference between current gene and bank gene using:
sum(a != b for a, b in zip(gene, nxt))
- If the difference is exactly 1 (one mutation away) AND this gene hasn't been visited:
- Add
(nxt, depth + 1)
to the queue for future exploration - Mark
nxt
as visited by adding tovis
- Add
- Calculate the difference between current gene and bank gene using:
-
Return result:
- If we exit the while loop without finding
endGene
, return-1
(no valid mutation path exists)
- If we exit the while loop without finding
Key Implementation Details:
-
The difference calculation
sum(a != b for a, b in zip(gene, nxt))
efficiently counts character mismatches between two genes by zipping them character by character and summing the boolean differences. -
Using
depth + 1
when enqueueing ensures each level of BFS correctly tracks the mutation count from the start. -
The visited set prevents exploring the same gene multiple times, which is crucial for efficiency and avoiding infinite loops.
-
We don't need to generate all possible one-character mutations explicitly; instead, we check against the bank directly, which is more efficient when the bank is small.
The time complexity is O(N * M * K)
where N is the bank size, M is the gene length (8), and K is the maximum queue size. The space complexity is O(N)
for the visited set and queue.
Ready to land your dream job?
Unlock your dream job with a 5-minute evaluator for a personalized learning plan!
Start EvaluatorExample Walkthrough
Let's walk through a concrete example to illustrate the BFS solution approach:
Input:
startGene = "AACCGGTT"
endGene = "AACCGGTA"
bank = ["AACCGGTA", "AACCGCTT", "AACCGGAT"]
Step-by-step BFS Process:
Initial Setup:
- Queue:
[("AACCGGTT", 0)]
- Visited:
{"AACCGGTT"}
Iteration 1:
- Dequeue:
("AACCGGTT", 0)
- Check if this is endGene? No, continue
- Check neighbors in bank:
- "AACCGGTA": differs by 1 char (position 7: T→A) ✓ Valid neighbor
- "AACCGCTT": differs by 2 chars (positions 5,6: G→C, G→T) ✗ Not a neighbor
- "AACCGGAT": differs by 1 char (position 6: T→A) ✓ Valid neighbor
- Add valid unvisited neighbors to queue
- Queue:
[("AACCGGTA", 1), ("AACCGGAT", 1)]
- Visited:
{"AACCGGTT", "AACCGGTA", "AACCGGAT"}
Iteration 2:
- Dequeue:
("AACCGGTA", 1)
- Check if this is endGene? Yes! Return depth = 1
Result: The minimum number of mutations is 1.
The mutation path was: "AACCGGTT" → "AACCGGTA"
(changed last character from T to A).
Why BFS guarantees the shortest path: Notice that BFS explored all genes at distance 1 before moving to distance 2. Since we found the target at distance 1, we know this is the shortest possible path. If there were a longer path (like going through "AACCGGAT" first), BFS wouldn't explore it because it found the target earlier.
Solution Implementation
1from collections import deque
2from typing import List
3
4class Solution:
5 def minMutation(self, startGene: str, endGene: str, bank: List[str]) -> int:
6 """
7 Find the minimum number of mutations needed to mutate from startGene to endGene.
8 Each mutation changes exactly one character, and the result must be in the gene bank.
9
10 Args:
11 startGene: The starting gene string
12 endGene: The target gene string
13 bank: List of valid gene strings that can be used as intermediate mutations
14
15 Returns:
16 Minimum number of mutations needed, or -1 if impossible
17 """
18 # Initialize BFS queue with starting gene and depth 0
19 queue = deque([(startGene, 0)])
20
21 # Keep track of visited genes to avoid cycles
22 visited = {startGene}
23
24 # Perform BFS to find shortest path
25 while queue:
26 current_gene, current_depth = queue.popleft()
27
28 # Check if we've reached the target gene
29 if current_gene == endGene:
30 return current_depth
31
32 # Try all possible next mutations from the bank
33 for next_gene in bank:
34 # Count the number of character differences
35 difference_count = sum(
36 char1 != char2
37 for char1, char2 in zip(current_gene, next_gene)
38 )
39
40 # If exactly one character differs and we haven't visited this gene
41 if difference_count == 1 and next_gene not in visited:
42 queue.append((next_gene, current_depth + 1))
43 visited.add(next_gene)
44
45 # No valid mutation path found
46 return -1
47
1class Solution {
2 public int minMutation(String startGene, String endGene, String[] bank) {
3 // Initialize BFS queue with the starting gene
4 Deque<String> queue = new ArrayDeque<>();
5 queue.offer(startGene);
6
7 // Track visited genes to avoid cycles
8 Set<String> visited = new HashSet<>();
9 visited.add(startGene);
10
11 // Track the number of mutations (BFS depth)
12 int mutationSteps = 0;
13
14 // BFS traversal
15 while (!queue.isEmpty()) {
16 // Process all genes at current level
17 int levelSize = queue.size();
18 for (int i = 0; i < levelSize; i++) {
19 String currentGene = queue.poll();
20
21 // Check if we've reached the target gene
22 if (currentGene.equals(endGene)) {
23 return mutationSteps;
24 }
25
26 // Try mutating to each gene in the bank
27 for (String candidateGene : bank) {
28 // Count differences between current gene and candidate gene
29 // Allow at most 1 difference (single mutation)
30 int allowedDifferences = 2; // Start with 2 to check for exactly 1 difference
31 for (int j = 0; j < 8 && allowedDifferences > 0; j++) {
32 if (currentGene.charAt(j) != candidateGene.charAt(j)) {
33 allowedDifferences--;
34 }
35 }
36
37 // If exactly one mutation difference and not visited, add to queue
38 if (allowedDifferences > 0 && !visited.contains(candidateGene)) {
39 queue.offer(candidateGene);
40 visited.add(candidateGene);
41 }
42 }
43 }
44 // Increment mutation steps after processing current level
45 mutationSteps++;
46 }
47
48 // No valid mutation path found
49 return -1;
50 }
51}
52
1class Solution {
2public:
3 int minMutation(string startGene, string endGene, vector<string>& bank) {
4 // BFS queue storing pairs of (current gene string, mutation depth)
5 queue<pair<string, int>> bfsQueue;
6 bfsQueue.push({startGene, 0});
7
8 // Set to track visited genes to avoid cycles
9 unordered_set<string> visited;
10 visited.insert(startGene);
11
12 // Perform BFS to find minimum mutations
13 while (!bfsQueue.empty()) {
14 // Extract current gene and its mutation depth
15 auto [currentGene, mutationCount] = bfsQueue.front();
16 bfsQueue.pop();
17
18 // Check if we've reached the target gene
19 if (currentGene == endGene) {
20 return mutationCount;
21 }
22
23 // Try all possible next mutations from the gene bank
24 for (const string& candidateGene : bank) {
25 // Count differences between current gene and candidate gene
26 // We need exactly 1 difference for a valid mutation
27 int allowedDifferences = 2; // Start with 2 to check for exactly 1 difference
28 for (int position = 0; position < 8 && allowedDifferences > 0; ++position) {
29 if (currentGene[position] != candidateGene[position]) {
30 allowedDifferences--;
31 }
32 }
33
34 // If exactly one character differs (allowedDifferences == 1) and not visited
35 if (allowedDifferences == 1 && !visited.contains(candidateGene)) {
36 visited.insert(candidateGene);
37 bfsQueue.push({candidateGene, mutationCount + 1});
38 }
39 }
40 }
41
42 // No valid mutation path found
43 return -1;
44 }
45};
46
1/**
2 * Finds the minimum number of mutations to transform startGene to endGene
3 * using valid gene mutations from the bank
4 * @param startGene - The starting gene string
5 * @param endGene - The target gene string
6 * @param bank - Array of valid intermediate gene mutations
7 * @returns Minimum number of mutations needed, or -1 if transformation is impossible
8 */
9function minMutation(startGene: string, endGene: string, bank: string[]): number {
10 // BFS queue storing [current gene, number of mutations to reach it]
11 const queue: [string, number][] = [[startGene, 0]];
12
13 // Set to track visited genes to avoid revisiting
14 const visited = new Set<string>([startGene]);
15
16 // Process queue using BFS
17 for (const [currentGene, mutationCount] of queue) {
18 // Check if we've reached the target gene
19 if (currentGene === endGene) {
20 return mutationCount;
21 }
22
23 // Try all possible next mutations from the bank
24 for (const nextGene of bank) {
25 // Count allowed differences (max 1 difference for valid mutation)
26 let allowedDifferences = 2;
27
28 // Check character differences between current and next gene
29 for (let charIndex = 0; charIndex < 8 && allowedDifferences > 0; charIndex++) {
30 if (currentGene[charIndex] !== nextGene[charIndex]) {
31 allowedDifferences--;
32 }
33 }
34
35 // If exactly 1 difference found and gene not visited, add to queue
36 if (allowedDifferences === 1 && !visited.has(nextGene)) {
37 queue.push([nextGene, mutationCount + 1]);
38 visited.add(nextGene);
39 }
40 }
41 }
42
43 // No valid transformation path found
44 return -1;
45}
46
Time and Space Complexity
Time Complexity: O(n * m^2)
where n
is the length of each gene string and m
is the size of the gene bank.
- The BFS explores at most
m
genes from the bank (worst case: visiting all genes in the bank) - For each gene dequeued, we compare it with all genes in the bank, which takes
O(m)
iterations - For each comparison, we calculate the difference between two genes using
zip
and sum, which takesO(n)
time - Therefore, the total time complexity is
O(m * m * n) = O(n * m^2)
Note: The reference answer suggests O(C * n * m)
where C = 4
represents generating all possible mutations by changing each position to one of 4 possible characters. However, this code takes a different approach by directly comparing with genes in the bank rather than generating mutations, resulting in O(n * m^2)
complexity.
Space Complexity: O(m)
where m
is the size of the gene bank.
- The queue can contain at most
m
genes (all genes from the bank) - The visited set can also contain at most
m + 1
genes (all genes from the bank plus the start gene) - Each gene string has length
n
, but we're storing references to existing strings rather than creating new ones - Therefore, the space complexity is
O(m)
for the queue and visited set
Common Pitfalls
1. Not Checking if Target Gene Exists in Bank
A critical pitfall is assuming the target gene (endGene
) must exist in the bank for a valid mutation path to exist. The algorithm will fail to find a valid path if endGene
is not in the bank, even when it should return a valid mutation count.
Why this happens: The BFS only explores genes that are in the bank, so if the target isn't in the bank, it will never be reached through the neighbor exploration loop.
Example scenario:
startGene = "AACCGGTT"
endGene = "AACCGGTA"
bank = ["AACCGGTT"]
The algorithm would return -1
even though the target is just one mutation away from the start.
Solution: Add an early check to ensure the target gene exists in the bank (unless it's the same as the start gene):
# Add this check at the beginning of the function if endGene not in bank and endGene != startGene: return -1
2. Inefficient Neighbor Generation
Another pitfall is generating all possible mutations (8 positions × 4 characters = 32 possibilities) and then checking if each exists in the bank. This becomes inefficient when the bank is small.
Inefficient approach:
# Don't do this - generates unnecessary mutations
for i in range(8):
for char in ['A', 'C', 'G', 'T']:
if char != current_gene[i]:
mutated = current_gene[:i] + char + current_gene[i+1:]
if mutated in bank and mutated not in visited:
# Process mutation
Better approach (as shown in the solution): Iterate through the bank directly and check if each gene is exactly one mutation away. This is more efficient when the bank size is smaller than the number of possible mutations.
3. Using List Instead of Set for Bank Lookup
If the bank remains as a list, the in
operation for checking membership becomes O(N) for each check, significantly slowing down the algorithm.
Solution: Convert the bank to a set for O(1) lookup if you need to check membership frequently:
bank_set = set(bank)
# Now checking 'gene in bank_set' is O(1) instead of O(N)
What is the best way of checking if an element exists in an unsorted array once in terms of time complexity? Select the best that applies.
Recommended Readings
https assets algo monster cover_photos bfs svg Breadth First Search on Trees Hopefully by this time you've drunk enough DFS Kool Aid to understand its immense power and seen enough visualization to create a call stack in your mind Now let me introduce the companion spell Breadth First Search BFS
Coding Interview Patterns Your Personal Dijkstra's Algorithm to Landing Your Dream Job The goal of AlgoMonster is to help you get a job in the shortest amount of time possible in a data driven way We compiled datasets of tech interview problems and broke them down by patterns This way
Recursion Recursion is one of the most important concepts in computer science Simply speaking recursion is the process of a function calling itself Using a real life analogy imagine a scenario where you invite your friends to lunch https assets algo monster recursion jpg You first call Ben and ask
Want a Structured Path to Master System Design Too? Don’t Miss This!