943. Find the Shortest Superstring


Problem Description

The challenge is to find the shortest possible string that contains all the given strings in the array words as substrings. It's important to note that none of the strings in words is a substring of any other string in the array, simplifying the problem. The task is to construct a superstring that concatenates all individual strings in such a way that they overlap wherever possible to reduce the overall length. If there are several shortest superstrings, returning any one of them is acceptable.

Intuition

To solve this problem, we think about overlaps between pairs of strings to determine which strings should be concatenated together. To start, we need to know how to combine two strings together to achieve the maximum overlap, which means finding how a string can partially overlap the end of another string.

The intuition for the solution approach involves calculating the overlap between each pair of strings, which is the maximum prefix of one string that is also a suffix of the other. This forms a graph-like structure where the overlap value can represent the edge weight. To find the shortest superstring, we aim to visit each node (string) while maximizing the total weight of the edges we have traversed.

To find the order of strings that maximizes the overlaps, we can use dynamic programming. A dynamic programming table dp[mask][i] can be constructed where mask is a bitmask representing the set of strings included, and i is the last string added. The value represents the maximum overlap we can get so far using the strings in mask, ending with the string i.

However, since the exact order of strings that leads to this maximum value is also required to construct the superstring, we need a way to backtrack from the maximum overlap value to the start. This is done using another table p[mask][i] that records the second to last string added before i.

Once the dynamic programming table is filled, we can reconstruct the path from the maximum value, and then we add the remaining strings without overlaps to complete the superstring. The superstring is then constructed by taking the first string in order and adding the suffixes of all subsequent strings according to the overlap.

The final output is constructed by joining these strings. This way, we arrive at a superstring with the smallest possible length that still includes all strings from the words array as substrings.

Learn more about Dynamic Programming and Bitmask patterns.

Not Sure What to Study? Take the 2-min Quiz to Find Your Missing Piece

Is the following code DFS or BFS?

1void search(Node root) {
2  if (!root) return;
3  visit(root);
4  root.visited = true;
5  for (Node node in root.adjacent) {
6    if (!node.visited) {
7      search(node);
8    }
9  }
10}

Solution Approach

The Reference Solution Approach above uses a combination of dynamic programming with bit masking and graph theory to construct the shortest superstring. Here's a detailed walkthrough of the implementation:

  1. Graph Creation with Overlaps:

    • A 2D array g[][] is created to store the overlaps. g[i][j] means the length of the overlap between the end of word i and the start of word j.
    • Nested loops iterate through each pair of words a and b in words, and for each pair, a third loop finds the maximum overlap.
  2. Dynamic Programming (DP) Table Initialization:

    • A 2D DP array dp[][] is declared with 1 << n rows and n columns. Each cell dp[mask][i] will store the maximum overlap obtainable with the last string in the subset mask being words[i].
    • To reconstruct the optimal combination of string overlaps, a predecessor array p[][] is maintained to track the choice of previous word in the optimal path.
  3. DP and Bitmasking:

    • The algorithm iterates over all possible subsets i (represented as bitmasks) and each word j. For each subset-word pair, it calculates the maximum overlap by trying to add word j to the subset pi (where pi is the subset without word j).
    • The idea is to transition from a smaller subset (without word j) to a larger subset (with word j) and update the DP table with the max overlap.
  4. Reconstructing the Path:

    • After populating the DP table, the last word j in the optimal path (leading to the maximum overlap in the full set) is found.
    • From there, using the p table, the algorithm reconstructs the order of strings, starting from the full set (all bits set) and moving backwards by removing the current word from the mask and jumping to its predecessor.
  5. Building the Shortest Superstring:

    • Once the order is found, each word is added to the result in a specific way:
      • The first word is added in full.
      • Every subsequent word is added, minus the prefix that overlaps with the end of the current superstring (this overlap length is stored in g[i][j]).
    • Any leftover words which were not in the set (which didn't form part of any optimal path) are simply appended to the end, because words are guaranteed not to be substrings of one another.

This solution uses dynamic programming for subset optimization combined with graph theory concepts (finding the longest path in an edge-weighted graph, where edges represent word overlaps). The superstring is formed by selecting substrings that give the maximum overlap when concatenated together. This results in space and time complexity that is exponential relative to the number of strings due to the bitmasking approach (O(n^2 * 2^n)), which is acceptable here given the problem's constraints.

Discover Your Strengths and Weaknesses: Take Our 2-Minute Quiz to Tailor Your Study Plan:

Which of the following is the prefix sum of array [1, 2, 3, 4, 5]?

Example Walkthrough

Let's consider a small example with the words array consisting of ["catg", "ctaagt", "gcta", "ttca", "atgcatc"]. Here's how the solution approach operates on this example:

Step 1: Graph Creation with Overlaps

  • For each pair of words, we calculate the maximum overlap. For instance:
    • The overlap of catg and ctaagt is cta, with length 3.
    • The overlap of gcta and ttca is tca, with length 3.
    • ... and so on.

This results in a 2D overlap array g[][].

Step 2: Dynamic Programming (DP) Table Initialization

  • Create a dp table with 1 << 5 rows (since there are 5 strings) and 5 columns.
  • Create a p predecessor table with the same dimensions to trace the optimal path.

Step 3: DP and Bitmasking

  • Iterate over all subsets of words (represented by bitmasks) and try to extend each subset by adding one word to determine the best overlap.
  • When trying to add ctaagt to a subset containing catg, check the dp value of the subset without ctaagt and the g[i][j] overlap value; update dp and p if a larger overlap is achieved.

Step 4: Reconstructing the Path

  • Once the dp table is fully populated, trace back from the full set 11111 to find the sequence resulting in the maximum overlap.
  • For this example, let’s say the order obtained after reconstruction is ["gcta", "ttca", "ctaagt", "catg", "atgcatc"].

Step 5: Building the Shortest Superstring

  • Start by adding gcta in full to the result.
  • Then append ttca minus the prefix tca, resulting in the superstring gcttca.
  • Next, add ctaagt minus the prefix cta, resulting in gcttcaagt.
  • Add catg minus the prefix catg (no overlap with the last added string), so gcttcaagtcatg.
  • Lastly, add atgcatc minus the prefix atgcatc (again no overlap), completing the superstring as gcttcaagtcatgatgcatc.

The final superstring gcttcaagtcatgatgcatc contains all strings from the words array as substrings, and it's the shortest possible string that does so according to our calculations.

This example illustrates the concept of using dynamic programming with bit masking and graph theory to construct the optimal string. The DP table dp[mask][i] holds the maximum overlap value, and p[mask][i] helps in tracing back the string sequence. We then build the superstring using these calculated overlaps, ensuring each string is represented as a substring while minimizing the total length.

Solution Implementation

1from typing import List
2
3class Solution:
4    def shortestSuperstring(self, words: List[str]) -> str:
5        # Number of words
6        num_words = len(words)
7      
8        # Graph g where g[i][j] represents the length of overlap if word i is followed by word j
9        overlap = [[0] * num_words for _ in range(num_words)]
10      
11        # Calculate the maximum overlap of each pair of words
12        for i, word1 in enumerate(words):
13            for j, word2 in enumerate(words):
14                if i != j:
15                    # Check the maximum prefix of word2 that's a suffix of word1
16                    for k in range(min(len(word1), len(word2)), 0, -1):
17                        if word1[-k:] == word2[:k]:
18                            overlap[i][j] = k
19                            break
20      
21        # Dynamic programming table, where dp[mask][i] holds the highest overlap with mask and ending with word i
22        dp = [[0] * num_words for _ in range(1 << num_words)]
23      
24        # Parent table to reconstruct the path
25        parent = [[-1] * num_words for _ in range(1 << num_words)]
26      
27        # Fill dp[] and parent[]
28        for mask in range(1 << num_words):
29            for j in range(num_words):
30                if (mask >> j) & 1:
31                    # Previous mask before adding word j
32                    prev_mask = mask ^ (1 << j)
33                    for k in range(num_words):
34                        if (prev_mask >> k) & 1:
35                            value = dp[prev_mask][k] + overlap[k][j]
36                            if value > dp[mask][j]:
37                                dp[mask][j] = value
38                                parent[mask][j] = k
39      
40        # Recover the last word in the optimal arrangement
41        last = max(range(num_words), key=lambda i: dp[-1][i])
42      
43        # Recover the order of words in the optimal superstring
44        order = []
45        mask = (1 << num_words) - 1
46        while last != -1:
47            prev = parent[mask][last]
48            order.append(last)
49            mask ^= (1 << last)
50            last = prev
51        order.reverse()
52      
53        # Add words that are not in the optimal order (could happen if all overlaps are 0)
54        order.extend(j for j in range(num_words) if j not in order)
55      
56        # Build the shortest superstring based on the order
57        result = [words[order[0]]]
58        for i in range(1, len(order)):
59            index = order[i]
60            prev_index = order[i - 1]
61            result.append(words[index][overlap[prev_index][index]:])
62      
63        return ''.join(result)
64
65# An example of how to use the class would be:
66# solution = Solution()
67# superstring = solution.shortestSuperstring(["abc", "bcd", "cde"])
68
1import java.util.ArrayList;
2import java.util.Arrays;
3import java.util.Collections;
4import java.util.HashSet;
5import java.util.List;
6import java.util.Set;
7
8class Solution {
9    public String shortestSuperstring(String[] words) {
10        int wordCount = words.length;
11        int[][] overlap = new int[wordCount][wordCount];
12      
13        // Calculate the overlap between pairs of words
14        for (int i = 0; i < wordCount; ++i) {
15            for (int j = 0; j < wordCount; ++j) {
16                if (i != j) {
17                    for (int k = Math.min(words[i].length(), words[j].length()); k > 0; --k) {
18                        if (words[i].endsWith(words[j].substring(0, k))) {
19                            overlap[i][j] = k;
20                            break;
21                        }
22                    }
23                }
24            }
25        }
26      
27        // Initialize the dynamic programming tables
28        int[][] dp = new int[1 << wordCount][wordCount];
29        int[][] parent = new int[1 << wordCount][wordCount];
30      
31        // Fill the parent table with -1 to indicate no parent
32        for (int i = 0; i < (1 << wordCount); ++i) {
33            Arrays.fill(parent[i], -1);
34        }
35      
36        // Build up the dp table which stores maximum overlap
37        for (int bitmask = 0; bitmask < (1 << wordCount); ++bitmask) {
38            for (int j = 0; j < wordCount; ++j) {
39                // Check if word j is in the set represented by bitmask
40                if ((bitmask & (1 << j)) > 0) {
41                    int prevState = bitmask ^ (1 << j);
42                    for (int k = 0; k < wordCount; ++k) {
43                        if ((prevState & (1 << k)) > 0) {
44                            int overlapValue = dp[prevState][k] + overlap[k][j];
45                            if (overlapValue > dp[bitmask][j]) {
46                                dp[bitmask][j] = overlapValue;
47                                parent[bitmask][j] = k;
48                            }
49                        }
50                    }
51                }
52            }
53        }
54      
55        // Reconstruct the path and compute the answer
56        int[] order = findOptimalOrder(dp, parent, wordCount);
57        StringBuilder result = new StringBuilder(words[order[0]]);
58        for (int i = 1; i < wordCount; ++i) {
59            int o = overlap[order[i - 1]][order[i]];
60            result.append(words[order[i]].substring(o));
61        }
62        return result.toString();
63    }
64  
65    // Helper function to find the optimal order of words
66    private int[] findOptimalOrder(int[][] dp, int[][] parent, int wordCount) {
67        List<Integer> path = new ArrayList<>();
68        int bitmask = (1 << wordCount) - 1;
69        int lastWordInPath = findLastWord(dp, wordCount);
70      
71        // Reconstruct the path backwards using parent pointers
72        for (int i = lastWordInPath; bitmask != 0; ) {
73            int temp = bitmask;
74            path.add(i);
75            bitmask &= ~(1 << i);
76            i = parent[temp][i];
77        }
78      
79        // Add any unused words
80        Set<Integer> usedWords = new HashSet<>(path);
81        for (int i = 0; i < wordCount; ++i) {
82            if (!usedWords.contains(i)) {
83                path.add(i);
84            }
85        }
86      
87        // Reverse the order since we were moving backwards
88        Collections.reverse(path);
89      
90        // Convert the List<Integer> to an array
91        return path.stream().mapToInt(i -> i).toArray();
92    }
93  
94    // Helper function to find the last word in the optimal path
95    private int findLastWord(int[][] dp, int wordCount) {
96        int lastWordIndex = 0;
97        // Choose the word that ends up with the largest overlap
98        for (int i = 0; i < wordCount; ++i) {
99            if (dp[(1 << wordCount) - 1][i] > dp[(1 << wordCount) - 1][lastWordIndex]) {
100                lastWordIndex = i;
101            }
102        }
103        return lastWordIndex;
104    }
105}
106
1#include <vector>
2#include <string>
3#include <unordered_set>
4#include <algorithm>
5
6using namespace std;
7
8class Solution {
9public:
10    string shortestSuperstring(vector<string>& words) {
11        int wordCount = words.size();
12        // Initialize the graph to store overlaps
13        vector<vector<int>> overlaps(wordCount, vector<int>(wordCount, 0));
14
15        // Calculate overlap for each pair of words
16        for (int i = 0; i < wordCount; ++i) {
17            for (int j = 0; j < wordCount; ++j) {
18                if (i != j) {
19                    for (int overlapSize = min(words[i].size(), words[j].size()); overlapSize > 0; --overlapSize) {
20                        if (words[i].substr(words[i].size() - overlapSize) == words[j].substr(0, overlapSize)) {
21                            overlaps[i][j] = overlapSize;
22                            break;
23                        }
24                    }
25                }
26            }
27        }
28
29        // Dynamic Programming to store max overlap
30        vector<vector<int>> dp(1 << wordCount, vector<int>(wordCount, 0));
31        vector<vector<int>> parent(1 << wordCount, vector<int>(wordCount, -1));
32
33        // Fill the dp table
34        for (int mask = 0; mask < (1 << wordCount); ++mask) { // For each combination of words
35            for (int j = 0; j < wordCount; ++j) {
36                if ((mask >> j) & 1) {
37                    int previousMask = mask ^ (1 << j);
38                    for (int k = 0; k < wordCount; ++k) {
39                        if ((previousMask >> k) & 1) {
40                            int val = dp[previousMask][k] + overlaps[k][j];
41                            if (val > dp[mask][j]) {
42                                dp[mask][j] = val;
43                                parent[mask][j] = k;
44                            }
45                        }
46                    }
47                }
48            }
49        }
50
51        // Reconstruct the path
52        vector<int> path;
53        int j = 0;
54
55        // Find the maximum overlap
56        for (int i = 0; i < wordCount; ++i) {
57            if (dp[(1 << wordCount) - 1][i] > dp[(1 << wordCount) - 1][j]) {
58                j = i;
59            }
60        }
61
62        // Build the path backwards
63        for (int mask = (1 << wordCount) - 1; parent[mask][j] != -1; ) {
64            int prev = parent[mask][j];
65            path.push_back(j);
66            mask ^= (1 << j);
67            j = prev;
68        }
69        path.push_back(j);
70
71        // Include the remaining words
72        unordered_set<int> visited(path.begin(), path.end());
73        for (int i = 0; i < wordCount; ++i) {
74            if (!visited.count(i)) {
75                path.push_back(i);
76            }
77        }
78
79        // Construct the shortest superstring
80        reverse(path.begin(), path.end());
81        string superstring = words[path[0]];
82        for (int i = 1; i < wordCount; ++i) {
83            int overlap = overlaps[path[i - 1]][path[i]];
84            superstring += words[path[i]].substr(overlap);
85        }
86
87        return superstring;
88    }
89};
90
1// Importing necessary functions from other modules
2import { vector, unordered_set } from 'somelibrary'; // Placeholder, replace with actual library if needed.
3
4function calculateOverlaps(words: string[]): number[][] {
5    const wordCount = words.length;
6    let overlaps: number[][] = vector(wordCount, vector(wordCount, 0));
7
8    // Calculate overlap for each pair of words
9    for (let i = 0; i < wordCount; ++i) {
10        for (let j = 0; j < wordCount; ++j) {
11            if (i !== j) {
12                for (let overlapSize = Math.min(words[i].length, words[j].length); overlapSize > 0; --overlapSize) {
13                    if (words[i].endsWith(words[j].substring(0, overlapSize))) {
14                        overlaps[i][j] = overlapSize;
15                        break;
16                    }
17                }
18            }
19        }
20    }
21
22    return overlaps;
23}
24
25function findShortestSuperstring(words: string[]): string {
26    const wordCount = words.length;
27    const overlaps = calculateOverlaps(words);
28
29    let dp: number[][] = vector(1 << wordCount, vector(wordCount, 0));
30    let parent: number[][] = vector(1 << wordCount, vector<number>(wordCount, -1));
31
32    // Fill the dp table
33    for (let mask = 0; mask < (1 << wordCount); ++mask) {
34        for (let j = 0; j < wordCount; ++j) {
35            if ((mask >> j) & 1) {
36                let previousMask = mask ^ (1 << j);
37                for (let k = 0; k < wordCount; ++k) {
38                    if ((previousMask >> k) & 1) {
39                        let val = dp[previousMask][k] + overlaps[k][j];
40                        if (val > dp[mask][j]) {
41                            dp[mask][j] = val;
42                            parent[mask][j] = k;
43                        }
44                    }
45                }
46            }
47        }
48    }
49
50    // Reconstruct the path
51    let path: number[] = [];
52    let j = 0;
53
54    // Find the maximum overlap
55    for (let i = 0; i < wordCount; ++i) {
56        if (dp[(1 << wordCount) - 1][i] > dp[(1 << wordCount) - 1][j]) {
57            j = i;
58        }
59    }
60
61    // Build the path backwards
62    for (let mask = (1 << wordCount) - 1; parent[mask][j] !== -1; ) {
63        let prev = parent[mask][j];
64        path.push(j);
65        mask ^= (1 << j);
66        j = prev;
67    }
68    path.push(j);
69
70    // Include the remaining words
71    let visited = new unordered_set<number>(path);
72    for (let i = 0; i < wordCount; ++i) {
73        if (!visited.has(i)) {
74            path.push(i);
75        }
76    }
77
78    // Construct the shortest superstring
79    path.reverse();
80    let superstring = words[path[0]];
81    for (let i = 1; i < path.length; ++i) {
82        let overlap = overlaps[path[i - 1]][path[i]];
83        superstring += words[path[i]].substring(overlap);
84    }
85
86    return superstring;
87}
88
Not Sure What to Study? Take the 2-min Quiz

Consider the classic dynamic programming of fibonacci numbers, what is the recurrence relation?

Time and Space Complexity

The given code aims to find the shortest superstring that can be formed by a given list of strings by performing concatenations in a certain way.

Time Complexity

The time complexity of the code is O(n^2 * 2^n + n^2 * W), where 'n' is the number of words and 'W' is the maximum length of a word.

  1. The first nested for-loop, with g = [[0] * n for _ in range(n)], runs in O(n^2 * W) time. This loop calculates the overlap between each pair of words.

  2. The second nested for-loop updates the dynamic programming (DP) table dp and predecessor table p, and it runs in O(n^2 * 2^n) time. This happens because for each of the 2^n subsets of words, it calculates the maximum overlap by checking pairs of words, and there are n^2 potential pairs.

  3. The procedure to find the highest overlap at the last bit-mask ((1 << n) - 1) is O(n).

  4. Reconstructing the solution path has a time complexity of O(n), since it reconstructs the full path of word concatenations.

  5. Finally, the output string construction with list comprehensions has O(n) complexity but depends on the lengths of the words for actual string operations.

Considering all these steps, the first for loop and the DP table update are the most time-consuming parts.

Space Complexity

The space complexity of the code is O(n * 2^n + n^2), which comprises:

  1. O(n^2) for storing the overlap between pairs of words in g.

  2. O(n * 2^n) for storing the DP values in dp.

  3. O(n * 2^n) for storing the predecessors in p.

The DP and predecessor tables account for the largest portion of the space used, as they store information for each subset of the power set of 'n' words.

In summary, the overall space complexity mirrors the structure of the DP and predecessor tables.

Learn more about how to find time and space complexity quickly using problem constraints.

Fast Track Your Learning with Our Quick Skills Quiz:

What data structure does Breadth-first search typically uses to store intermediate states?


Recommended Readings


Got a question? Ask the Teaching Assistant anything you don't understand.

Still not clear? Ask in the Forum,  Discord or Submit the part you don't understand to our editors.

←
↑TA đŸ‘šâ€đŸ«