820. Short Encoding of Words

MediumTrieArrayHash TableString
Leetcode Link

Problem Description

The given problem is about finding the shortest reference string that can encode an array of words such that:

  • words.length == indices.length, where indices is an array of indices.
  • The reference string s must end with the character '#'.
  • For each index indices[i], the substring of s starting from indices[i] and ending before the next '#' should be exactly words[i].

The goal is to return the minimum length of such a reference string s that fulfills the above conditions.

Intuition

In order to encode the words array with the shortest reference string, we have to look for opportunities to overlap words. If a word is a suffix of another, we can encode both with the longer word in the reference string. For example, if "time" is in the array, then encoding "me" doesn't require extra space; it is the last two letters of "time."

The solution uses a Trie structure, as it is a natural fit for dealing with prefixes and suffixes. A Trie is an ordered tree structure, which is used to store a dynamic set or associative array where the keys are usually strings. Instead of storing whole words in each node, we store characters, and each path down the tree represents a word.

  1. We first reverse each word and insert it into the Trie. We reverse the words because we want to group words by suffixes (the end of the words) rather than prefixes (the start of the words), since words that share the same suffix can be encoded together.

  2. For every word, we traverse from its last character to the first character, corresponding to a path from the root to a leaf in the Trie. If the current character doesn't exist in the Trie at the current level, we create a new Trie node.

  3. After all words are in the Trie, we need to find the encoded string's length. We do this by performing a depth-first search (DFS) on the Trie.

  4. The DFS traversal keeps count of the length of each word that cannot be further combined into another word (a leaf node in the Trie). This is identified when a child node doesn't exist for the current path when traversing the Trie, indicating the end of an encoded word.

  5. Whenever we reach a leaf node during DFS, we add the depth of the node (the length of the word) plus 1 to the result (the plus 1 accounts for the '#' at the end of every word).

  6. The sum of these lengths for all the leaf nodes gives us the length of the shortest possible reference string.

In summary, by using a Trie to group suffixes, we avoid redundant encoding of words that are suffixes of other words. By performing a DFS, we calculate the sum of lengths of all unique word encodings (plus the '#' characters), achieving the desired result.

Solution Approach

The solution to this problem uses a Trie data structure and a depth-first search (DFS) algorithm to find the minimum length of a reference string that can encode a given array of words. Here is a step-by-step implementation of the solution:

  1. Trie Data Structure Setup: We start by defining a basic Trie node class that will hold an array of 26 elements for each letter in the lowercase English alphabet. Each element can point to another Trie node, representing the next character in a sequence of characters.

    1class Trie:
    2    def __init__(self) -> None:
    3        self.children = [None] * 26
  2. Reversing Words and Building the Trie: We initialize a root Trie node. For each word in the input list, we reverse the word and insert it into the Trie. The insertion is done by traversing the Trie nodes character by character from the reversed word starting at the root, creating new nodes when necessary. This loops from the end of the word to the beginning:

    1root = Trie()
    2for w in words:
    3    cur = root
    4    for i in range(len(w) - 1, -1, -1):
    5        idx = ord(w[i]) - ord('a')
    6        if cur.children[idx] == None:
    7            cur.children[idx] = Trie()
    8        cur = cur.children[idx]
  3. Depth-First Search for Encoding Length: Once the Trie is built, we use the DFS algorithm to find all leaf nodes (nodes representing the last character of a unique word). The DFS function takes two arguments: the current node and the length of the word found till now (l), which starts at 1 to account for the terminating '#'. The DFS then proceeds as follows:

    • If a child node doesn't exist for any character, we know the current node is a leaf. When a leaf is encountered, the word length plus 1 for the terminating '#' is added to the total count (ans).
    • If a child node exists, recursively call DFS on the child, incrementing the length variable to track the growing word length.
    • The recursion terminated at leaf nodes, where it adds length to the total count.
    1def dfs(self, cur: Trie, l: int) -> int:
    2    isLeaf, ans = True, 0
    3    for i in range(26):
    4        if cur.children[i] != None:
    5            isLeaf = False
    6            ans += self.dfs(cur.children[i], l + 1)
    7    if isLeaf:
    8        ans += l
    9    return ans
  4. Calculating the Minimum Length: Finally, the DFS is initiated from the root of the Trie:

    1return self.dfs(root, 1)

The total count (ans) returned from the DFS is the length of the shortest reference string that can be generated from the given array of words. The DFS ensures that only the lengths of words that are not suffixes of any other word in the Trie are added, ensuring that the final answer is the minimum possible length.

💪
Level Up Your
Algo Skills

Example Walkthrough

Let's consider a small example to illustrate the solution approach. Suppose we are given the array of words ["time", "me", "bell"]. We need to find the shortest reference string that encodes these words according to the given problem description.

Following the solution approach, we proceed with these steps:

1. Reversing Words and Building the Trie

We start by initializing our Trie and reverse each word to focus on suffixes. This would give us the reversed words: ["emit", "em", "lleb"]. We then insert these words into the Trie by creating new nodes as necessary.

The Trie would look like this (represented conceptually, and not actual nodes):

1      root
2       |
3       e   <- 'time' and 'me'
4      /
5     m
6     | \
7     i  # <- end of 'me'
8    / 
9   t   <- end of 'time'
10  /
11 #      <- '#' represents the end of each word
12 |
13 l
14 |
15 l
16 |
17 e   <- end of 'bell'
18 |
19#      <- end of 'lleb'

2. Depth-First Search for Encoding Length

We define and use our dfs function to traverse the Trie and find all leaf nodes. In our Trie, the leaves are found at the end of "emit", "em", and "lleb".

As we traverse, we perform these steps:

  • For "emit", we reach the end and find a leaf node. The length of the string plus 1 for '#' (5) is added.
  • For "em", since "emit" has already been accounted for, it does not add any extra length.
  • For "lleb", we reach another leaf node. The length of the string plus 1 for '#' (5) is added.

After calculating the lengths for all leaf nodes, we find the total length.

3. Calculating the Minimum Length

This traversal would give us:

  • Length of "time#": 5 characters
  • Length of "bell#": 5 characters

And since "me" is a suffix of "time", it does not add any additional length.

Therefore, the length of the shortest reference string is the sum of the lengths of individual words with '#' (which are not suffixes of any other word) is 5 (time#) + 5 (bell#) = 10.

Thus, our answer, in this case, would be 10, with the reference string being time#bell#.

Python Solution

1class TrieNode:
2    def __init__(self):
3        # Each node holds 26 possible children, one for each letter of the alphabet
4        self.children = [None] * 26
5
6
7class Solution:
8    def minimumLengthEncoding(self, words: List[str]) -> int:
9        # Create a trie that will be used to store the words in reverse order
10        root = TrieNode()
11        for word in words:
12            current_node = root
13            # Insert each word into the trie in reverse order
14            for char in reversed(word):
15                index = ord(char) - ord('a')  # Convert character to index (0-25)
16                if current_node.children[index] is None:
17                    current_node.children[index] = TrieNode()
18                current_node = current_node.children[index]
19              
20        # Start the depth-first search from the root with an initial length of 1
21        return self.dfs(root, 1)
22
23    def dfs(self, current_node: TrieNode, length: int) -> int:
24        is_leaf = True  # Initialize is_leaf to True; it becomes False if the node has children
25        encoding_length = 0  # This will accumulate the total encoding length
26      
27        for i in range(26):
28            if current_node.children[i] is not None:
29                is_leaf = False  # The current node is not a leaf as it has children
30                # Recursively call dfs for each child, incrementing the length by 1
31                encoding_length += self.dfs(current_node.children[i], length + 1)
32              
33        if is_leaf:
34            # If a leaf node is reached, add the length of the word plus the '#' sign
35            encoding_length += length
36          
37        # Return the total encoding length for the current subtree
38        return encoding_length
39

Java Solution

1class TrieNode {
2    // Each TrieNode contains an array of children TrieNodes representing each character in the alphabet
3    TrieNode[] children = new TrieNode[26];
4}
5
6class Solution {
7    // Function to compute the minimum length encoding of an array of words
8    public int minimumLengthEncoding(String[] words) {
9        TrieNode root = new TrieNode(); // The root of the Trie
10        // Insert each word into the Trie in reverse
11        for (String word : words) {
12            TrieNode current = root;
13            for (int i = word.length() - 1; i >= 0; i--) {
14                int index = word.charAt(i) - 'a'; // Compute the index for the character
15                if (current.children[index] == null) {
16                    // Create a new TrieNode if it does not exist
17                    current.children[index] = new TrieNode();
18                }
19                current = current.children[index]; // Move to the child TrieNode
20            }
21        }
22        // Perform depth-first search on the Trie to calculate the total encoding length
23        return calculateEncodingLengthDFS(root, 1);
24    }
25
26    // Helper DFS function to calculate encoding length, l represents the current length of the word
27    private int calculateEncodingLengthDFS(TrieNode node, int depth) {
28        boolean isLeaf = true; // A flag to check if the current node is a leaf node
29        int totalLength = 0; // Initialize the total length of the encoding
30
31        // Iterate through all the possible children
32        for (int i = 0; i < 26; i++) {
33            if (node.children[i] != null) {
34                isLeaf = false; // If a child node is found, this is not a leaf node
35                // Recursively calculate the length for the child node
36                totalLength += calculateEncodingLengthDFS(node.children[i], depth + 1);
37            }
38        }
39
40        // If the node is a leaf node, add the current depth to the total length (plus 1 for the '#')
41        if (isLeaf) {
42            totalLength += depth; // add the depth that represents the length of the word plus the '#'
43        }
44      
45        return totalLength; // Return the total encoding length
46    }
47}
48

C++ Solution

1struct TrieNode {
2    TrieNode* children[26] = {nullptr}; // A node in a Trie for lowercase English letters.
3};
4
5class Solution {
6public:
7    // Calculates the minimum length of a string that encodes input words such that 
8    // no word is a suffix of another. Each word is followed by a '#' character in the encoding.
9    int minimumLengthEncoding(vector<string>& words) {
10        TrieNode* root = new TrieNode(); // The root of the Trie.
11        // Insert all words into the Trie in reverse order (since we want to check suffixes).
12        for (const auto& word : words) {
13            TrieNode* current = root;
14            for (int i = word.size() - 1; i >= 0; --i) {
15                int index = word[i] - 'a'; // Convert char to index (0-25).
16                if (current->children[index] == nullptr) {
17                    current->children[index] = new TrieNode(); // Create a new node if necessary.
18                }
19                current = current->children[index]; // Move to the child node.
20            }
21        }
22        // Start Depth-First Search from the root. The initial length argument is 1 for the '#' character.
23        return depthFirstSearch(root, 1);
24    }
25
26private:
27    // Depth-First Search to find the total length of the encoding.
28    int depthFirstSearch(TrieNode* currentNode, int depth) {
29        bool isLeaf = true; // A flag to check if the current node is a leaf.
30        int totalLength = 0; // Accumulation of the lengths of the encoded words.
31
32        // Visit all children nodes.
33        for (int i = 0; i < 26; ++i) {
34            if (currentNode->children[i] != nullptr) {
35                isLeaf = false; // Current node has a child, so it's not a leaf.
36                totalLength += depthFirstSearch(currentNode->children[i], depth + 1); // Recurse with incremented depth.
37            }
38        }
39      
40        // If the node is a leaf, add the depth (word length + '#') to the total length of the encoding.
41        if (isLeaf) {
42            totalLength += depth;
43        }
44        return totalLength; // Return the calculated total length.
45    }
46};
47

Typescript Solution

1// Define the TrieNode structure.
2interface TrieNode {
3  children: (TrieNode | null)[];
4}
5
6// Initializes a new TrieNode with all children set to null.
7function createTrieNode(): TrieNode {
8  return { children: Array(26).fill(null) };
9}
10
11// The root of the Trie.
12const root: TrieNode = createTrieNode();
13
14// Inserts all words into the Trie in reverse order to make suffixes come first.
15function insertWords(words: string[]): void {
16  for (const word of words) {
17    let current: TrieNode = root;
18    // Iterate over the word in reverse order.
19    for (let i = word.length - 1; i >= 0; --i) {
20      const index: number = word.charCodeAt(i) - 'a'.charCodeAt(0);
21      // Create a new node if necessary.
22      if (current.children[index] === null) {
23        current.children[index] = createTrieNode();
24      }
25      // Move to the child node.
26      current = current.children[index]!;
27    }
28  }
29}
30
31// Calculates the minimum length encoding for input words.
32function minimumLengthEncoding(words: string[]): number {
33  // Insert all words into the Trie.
34  insertWords(words);
35  // Initial length is 1 for the '#' at the end of each word.
36  return depthFirstSearch(root, 1);
37}
38
39// Performs Depth-First Search to compute the total length of the encoding.
40function depthFirstSearch(currentNode: TrieNode, depth: number): number {
41  let isLeaf: boolean = true;
42  let totalLength: number = 0;
43
44  // Visit all child nodes.
45  for (let i: number = 0; i < 26; ++i) {
46    if (currentNode.children[i]) {
47      isLeaf = false; // Node has a child and is not a leaf.
48      // Recursively search and increment total length.
49      totalLength += depthFirstSearch(currentNode.children[i]!, depth + 1);
50    }
51  }
52
53  if (isLeaf) {
54    // If node is a leaf, add the depth which is word length + '#' to total length.
55    totalLength += depth;
56  }
57
58  return totalLength; // Return the total computed length.
59}
60

Time and Space Complexity

Time Complexity

The time complexity of inserting a word into the Trie is O(L) where L is the length of the word. The for loop in minimumLengthEncoding method iterates over all the words and each character of each word, giving us a complexity of O(N*L) where N is the number of words and L is the average length of the words.

The dfs method has a time complexity of O(N*L) as well for the worst case, where each node is visited once. In the worst case, we can consider the depth of the Trie could be as long as the longest word, and each node could have 26 children if every character leads to a new word.

Therefore, the overall time complexity of the minimumLengthEncoding method is O(N*L).

Space Complexity

The space complexity is dictated by the storage needed for the Trie. Each node could potentially have 26 children, and the maximum depth of the Trie is L, the maximum length of a word. In the worst case, we'd have a completely full Trie where each node has 26 children, giving us a space complexity of O(26^L).

The recursive dfs method also contributes to the space complexity due to the call stack, which in the worst case could add up to O(L) depth for recursion.

Thus, the total worst-case space complexity is O(26^L + L), where O(26^L) is the dominant term.

😈
Become an
Algo Monster

Got a question? Ask the Teaching Assistant anything you don't understand.

Still not clear? Ask in the Forum,  Discord or Submit the part you don't understand to our editors.


TA 👨‍🏫