1531. String Compression II


Problem Description

The problem is an optimization task based on modifying the run-length encoding (RLE) of a string. Run-length encoding is a simple form of lossless data compression where sequences of the same character are replaced with a single character and a count of that character's repetitions. The task is to reduce the length of the encoded string by strategically deleting up to k characters from the original string. The challenge is to determine which characters to remove in order to achieve the smallest possible length of the RLE version of the string.

For instance, if we are given s = "aabccc" and k = 2, by removing a single 'b' and one 'c', we can transform 's' to "aac" which can be RLE encoded as "a2c", the resulting length is 3, which is the minimum possible.

Intuition

To solve this problem, dynamic programming is used to build up a solution by solving smaller subproblems. The dynamic programming array dp[i][k] is defined to store the length of the optimal compression for the substring starting at the i-th character of s with at most k characters deleted.

The solution iterates over the string, keeping track of the frequency of characters encountered (count[]), and calculates the minimum RLE length that can be obtained by either keeping or deleting characters. Deleting characters comes with the trade-off of possibly increasing the size due to the frequency of a run being less compressed (e.g., "aaabb" to "aab" increases the length when encoded because "a2b2" becomes "a2b1").

The key is to determine whether to keep a character in the current run or to delete some characters to potentially form a longer run of a different character that comes later. This decision is made by considering the lengths of all possible encodings that can result from either keeping or deleting characters and choosing the one with the minimum length.

The recursive function compression performs this computation and memoizes the results in dp to avoid repeated calculations for the same subproblems. Additionally, the function getLength calculates the RLE length of a sequence with a given frequency. Since run lengths only add characters to the encoding when the count is 2 or more, the function handles single characters and different number magnitude representations separately (1 digit, 2 digits, 3 digits).

In essence, the algorithm involves exploring different scenarios of deletion while memoizing results to find the optimal compression length efficiently.

Learn more about Dynamic Programming patterns.

Solution Approach

The solution utilizes a top-down recursive approach with memoization, a common technique in dynamic programming to avoid recalculating the same outcomes multiple times. The crux of the solution lies in the compression method and the use of the dp array alongside the count array, which help to bookkeep the best results while iterating through the possible deletions and runs.

Dynamic Programming Array dp

A 2D array dp of size n by k+1 is used to store the lengths of optimal compressions for different subproblems, where n is the length of the original string and k is the maximum number of characters we can delete.

  • dp[i][k] represents the minimal length of the RLE string starting from the i-th index of s with at most k deletions allowed.
  • Initialized with a large constant K_MAX representing an unattainable maximum length, suggesting that the corresponding subproblem has not yet been solved.

The compression Method

This method takes as input the string s, the current index i, and the remaining deletion allowance k. It recurses through the string, trying all possible deletions and keeping track of the resulting RLE length.

Base Cases

  • When k < 0: Invalid, since we cannot delete more characters than allowed. Return K_MAX.
  • When i == s.length() || s.length() - i <= k: We've reached the end of the string, or we have enough deletions left to delete the remaining characters. In either case, no RLE length addition is needed, so return 0.
  • If dp[i][k] != K_MAX, return the memoized value, avoiding unnecessary recomputation.

Recursive Case

With maxFreq initialized to 0 and count to count character frequencies, the idea is to iterate from the current index i to the end of the string with index j:

  • Increase count for the current character, and update maxFreq with the maximum frequency seen so far.
  • Recurse with the updated parameters: compression(s, j + 1, k - (j - i + 1 - maxFreq)).
  • The new k is reduced by the number of deletions necessary to keep the maxFreq characters in the current range undisrupted.
  • Calculate and update the optimal dp[i][k] with the current best attempt, which includes the RLE length of the current char (getLength(maxFreq)) and the optimal compression length of the remaining string given the current deletions.

The getLength Method

The getLength method is just a helper that translates frequencies into the additional length needed in the RLE string.

  • 1 requires no additional digit, as we do not encode single characters with a count.
  • 2-9 adds 1 digit.
  • 10-99 adds 2 digits.
  • 100+ adds 3 digits.

Combining this logic with the compression routine and the dp array with memoization, the method getLengthOfOptimalCompression builds the solution from smaller subproblems and caches these solutions to use in larger problems. This reduces the time and computational complexity significantly, allowing us to find the optimal length of the RLE string after a maximum of k deletions.

Running Time Complexity

The algorithm operates in O(n^2 * k) time since for each character in the string (n), it looks at all characters that follow it (n), and this is done for every possible number of deletions (k). The use of memoization improves the time complexity from exponential, which it would have been if it was a simple recursive solution without memoization.

Ready to land your dream job?

Unlock your dream job with a 2-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's illustrate the solution approach using a small example, say s = "aaabba" and k = 2.

Step 1: Initialize the dp array to represent unsolved subproblems. For this example, our dp array will have n = 6 (length of s) rows and k+1 = 3 columns to account for 0 to 2 deletions.

Step 2: Call the compression function starting with the full string and all deletions available, compression(s, 0, 2).

Step 3: Look at the base cases: we're not yet at the end of the string nor have we used all k deletions, and dp[0][2] is uninitialized, so we move to the recursive case.

Step 4: With i = 0 and maxFreq = 0, we start examining runs of characters starting with the first 'a'. We can keep all three 'a's without any deletion, so maxFreq becomes 3 and count['a'] is also 3.

Step 5: As we have 3 'a's consecutively with no deletions, dp[0][2] will consider keeping them as is, and then recursively compute compression(s, 3, 2) for the next part of the string, which starts with "bba".

Step 6: Recursively, we process the next segment "bba" starting from i = 3. With two deletions remaining, the optimal decision is to delete two 'b's to only have "a". The compression function is then called with compression(s, 6, 0).

Step 7: The compression(s, 6, 0) call reaches the end of the string, so it returns 0 as per the base case.

Step 8: Now we add the length of 'a' run to the result of the recursion. Since we had 3 'a's initially, getLength(3) returns 1 (the additional digit for the count is '3'), resulting in dp[0][2] = 1 + 0 = 1.

Step 9: Returning to the first part of the string, we must consider the outcomes if we had deleted one of the 'a's. If we did, making maxFreq = 2, we'd end up with a string "aabba" and the recursive call compression(s, 1, 1) must be made.

Step 10: This process continues, exploring combinations of keeping and deleting characters at each step, memoizing the results in the dp array to avoid duplicate computations.

Step 11: After the recursive calls are finished, we look at dp[0][2] to get the answer, which stores the compressed string of minimum length which can be formed by starting at index 0 of s with at most 2 deletions.

In our case, the RLE of the original s = "aaabba" without deletions would be a3b2a1, and with the two deletions allowing us to remove the 'b's, the optimal RLE would be a3, yielding the smallest length of 2, which is the answer.

Solution Implementation

1from typing import List
2
3class Solution:
4    K_MAX = 101
5
6    def getLengthOfOptimalCompression(self, s: str, k: int) -> int:
7        # dp[i][k] represents the length of the optimal compression
8        # of substring s[i:] with at most k deletions
9        self.dp = [[self.K_MAX for _ in range(k + 1)] for _ in range(len(s))]
10
11        # Start compression from index 0 with k deletions allowed
12        return self._compress(s, 0, k)
13
14    def _compress(self, s: str, start_idx: int, k: int) -> int:
15        # Return an impossibly large value if deletions allowed are negative
16        if k < 0:
17            return self.K_MAX
18
19        # If we are at the end of the string or we can delete all remaining characters, return 0
20        if start_idx == len(s) or len(s) - start_idx <= k:
21            return 0
22
23        # If we have already computed this state, return its value to avoid recomputation
24        if self.dp[start_idx][k] != self.K_MAX:
25            return self.dp[start_idx][k]
26
27        max_frequency = 0
28        frequency_count = [0] * 128  # Frequency array to count occurrences of each character
29
30        # Try to find the optimal compression by maintaining a window [start_idx..j]
31        for j in range(start_idx, len(s)):
32            # Update frequency of the current character and maxFrequency found so far
33            frequency_count[ord(s[j])] += 1
34            max_frequency = max(max_frequency, frequency_count[ord(s[j])])
35
36            # Calculate the minimal length by keeping character s[j] (most frequent so far) and deleting others
37            self.dp[start_idx][k] = min(
38                self.dp[start_idx][k],
39                self._getLength(max_frequency) + self._compress(s, j + 1, k - (j - start_idx + 1 - max_frequency))
40            )
41
42        # Return the computed value for this state
43        return self.dp[start_idx][k]
44
45    def _getLength(self, max_frequency: int) -> int:
46        # Calculates the length of the compressed string after compression
47        if max_frequency == 1:
48            return 1  # Single character (e.g. "a")
49        if max_frequency < 10:
50            return 2  # Single digit frequency followed by character (e.g. "9a")
51        if max_frequency < 100:
52            return 3  # Two digit frequency followed by character (e.g. "23b")
53        return 4  # Three digit frequency followed by character (e.g. "100c")
54
1class Solution {
2    private static final int K_MAX = 101;
3    private int[][] dp;
4
5    public int getLengthOfOptimalCompression(String s, int k) {
6        // dp[i][k] represents the length of the optimal compression of substring s[i..n) with at most k deletions
7        dp = new int[s.length()][k + 1];
8      
9        // Initialize all elements of dp array to K_MAX
10        for (int[] row : dp) {
11            Arrays.fill(row, K_MAX);
12        }
13
14        // Start compression from index 0 with k deletions allowed
15        return compress(s, 0, k);
16    }
17
18    private int compress(final String s, int startIdx, int k) {
19        // If deletions allowed are negative, return an impossibly large value (K_MAX)
20        if (k < 0) {
21            return K_MAX;
22        }
23
24        // If we are at the end of the string or we can delete all remaining characters, return 0
25        if (startIdx == s.length() || s.length() - startIdx <= k) {
26            return 0;
27        }
28
29        // If we have already computed this state, return its value to avoid recomputation
30        if (dp[startIdx][k] != K_MAX) {
31            return dp[startIdx][k];
32        }
33
34        int maxFrequency = 0;
35        int[] frequencyCount = new int[128]; // Frequency array to count occurrences of each character
36
37        // Try to find the optimal compression by maintaining a window [startIdx..j]
38        for (int j = startIdx; j < s.length(); ++j) {
39            // Update frequency of the current character and maxFrequency found so far
40            maxFrequency = Math.max(maxFrequency, ++frequencyCount[s.charAt(j)]);
41
42            // Calculate the minimum length by keeping character s.charAt(j) (which is most frequent so far) and deleting others
43            dp[startIdx][k] = Math.min(
44                dp[startIdx][k], 
45                getLength(maxFrequency) + compress(s, j + 1, k - (j - startIdx + 1 - maxFrequency))
46            );
47        }
48
49        // Return the computed value for this state
50        return dp[startIdx][k];
51    }
52
53    // Calculates the length of the compressed string after compression
54    private int getLength(int maxFrequency) {
55        if (maxFrequency == 1) {
56            return 1; // Single character (e.g. "a")
57        }
58        if (maxFrequency < 10) {
59            return 2; // Single digit frequency followed by character (e.g. "9a")
60        }
61        if (maxFrequency < 100) {
62            return 3; // Two digit frequency followed by character (e.g. "23b")
63        }
64        return 4; // Three digit frequency followed by character (e.g. "100c")
65    }
66}
67
1#include <vector>
2#include <string>
3#include <algorithm>
4#include <cstring>
5using namespace std;
6
7class Solution {
8private:
9    static constexpr int K_MAX = 101; // Maximum value for K_MAX as defined
10    vector<vector<int>> dp; // dp table to store results of subproblems
11
12    // Calculates the length of compressed string given a character frequency
13    int getLength(int maxFrequency) {
14        if (maxFrequency == 1) {
15            return 1; // Single character (e.g. "a")
16        }
17        if (maxFrequency < 10) {
18            return 2; // Single digit frequency followed by character (e.g. "9a")
19        }
20        if (maxFrequency < 100) {
21            return 3; // Two digit frequency followed by character (e.g. "23b")
22        }
23        return 4; // Three digit frequency followed by character (e.g. "100c")
24    }
25
26    // Recursive function to compute the length of optimal compression
27    int compress(const string& s, int startIdx, int k) {
28        if (k < 0) {
29            return K_MAX; // If k is negative, we cannot do further deletions; return max value
30        }
31
32        if (startIdx == s.size() || s.size() - startIdx <= k) {
33            return 0; // If we reach the end or can delete all remaining, no more compression needed
34        }
35
36        if (dp[startIdx][k] != K_MAX) {
37            return dp[startIdx][k]; // Return precomputed result if we have already solved this subproblem
38        }
39
40        int maxFrequency = 0;
41        vector<int> frequencyCount(128, 0); // ASCII-based frequency count
42
43        for (int j = startIdx; j < s.size(); ++j) {
44            maxFrequency = max(maxFrequency, ++frequencyCount[s[j]]);
45
46            // Calculate minimum length by keeping the most frequent character and deleting the rest
47            dp[startIdx][k] = min(
48                dp[startIdx][k], 
49                getLength(maxFrequency) + compress(s, j + 1, k - (j - startIdx + 1 - maxFrequency))
50            );
51        }
52
53        return dp[startIdx][k]; // Return the calculated value
54    }
55
56public:
57    // Main function called to get the length of the optimal compression
58    int getLengthOfOptimalCompression(string s, int k) {
59        dp.resize(s.length(), vector<int>(k + 1, K_MAX)); // Initialize DP table with K_MAX values
60
61        // Start the compression process from index 0 with 'k' deletions allowed
62        return compress(s, 0, k);
63    }
64};
65
1const K_MAX = 101;
2let dp: number[][];
3
4function getLengthOfOptimalCompression(s: string, k: number): number {
5    // Initialize the dp array with s.length rows and k + 1 columns set to K_MAX
6    dp = Array.from({ length: s.length }, () => Array(k + 1).fill(K_MAX));
7
8    // Start the compression process from the first character with the given limit on deletions
9    return compress(s, 0, k);
10}
11
12function compress(s: string, startIndex: number, k: number): number {
13    // Invalid case with negative available deletions
14    if (k < 0) {
15        return K_MAX;
16    }
17
18    // If we're at the end of the string or we can delete all remaining characters
19    if (startIndex >= s.length || s.length - startIndex <= k) {
20        return 0;
21    }
22
23    // Return previously computed value to avoid redundant calculations
24    if (dp[startIndex][k] !== K_MAX) {
25        return dp[startIndex][k];
26    }
27
28    let maxFrequency = 0;
29    const frequencyCount = new Array(128).fill(0); // ASCII-based frequency array
30
31    // Iterate over the substring to find optimal compression
32    for (let j = startIndex; j < s.length; ++j) {
33        // Increment the frequency of the current character
34        const charIndex = s.charCodeAt(j);
35        maxFrequency = Math.max(maxFrequency, ++frequencyCount[charIndex]);
36
37        // Consider keeping the most frequent character so far and deleting the rest
38        dp[startIndex][k] = Math.min(
39            dp[startIndex][k],
40            getCompressedLength(maxFrequency) + compress(s, j + 1, k - (j - startIndex + 1 - maxFrequency))
41        );
42    }
43
44    // Return the optimal length found for this state
45    return dp[startIndex][k];
46}
47
48function getCompressedLength(maxFrequency: number): number {
49    // Determine the length of the compressed string given the character frequency
50    if (maxFrequency === 1) {
51        return 1;
52    } else if (maxFrequency < 10) {
53        return 2;
54    } else if (maxFrequency < 100) {
55        return 3;
56    } else {
57        return 4;
58    }
59}
60

Time and Space Complexity

Time Complexity

The time complexity of this approach is primarily determined by the two nested loops and the recursive compression method. Let's denote n as the length of the string s.

  • The outer loop runs from i to n.
  • For each i, the inner loop runs from i to n, and in each iteration, the compression method is called.
  • The compression method itself, in the worst case, could be called up to n times for varying values of j and k.

Due to these nested loops, the basic operation count is O(n^2) for the loops, multiplied by the cost of recursion for each loop iteration. The recursion, however, is optimized with memoization (dynamic programming), so each state (i, k) is solved only once. There are n * k states since i ranges from 0 to n and k ranges from 0 to k.

Thus, the overall time complexity is O(n^2 * k).

Space Complexity

The space complexity is governed by the memoization table dp and the recursive call stack.

  • The dp table is a 2D array with dimensions n (for the length of string s) by k + 1 (for the maximum number of deletions plus one), which gives us a space complexity of O(n * k).

  • The recursive compression method does not use any additional significant space, but there is a call stack overhead due to recursion. In the worst case, it can go n calls deep, as the recursion may span from the start to the end of the string s.

Considering both factors, the space complexity is O(n * k) for the memoization table plus O(n) for the stack space, which combines to O(n * k). Since the call stack space is typically smaller and dominated by the larger n * k term, we simplify the overall space complexity to O(n * k).

Learn more about how to find time and space complexity quickly using problem constraints.


Discover Your Strengths and Weaknesses: Take Our 2-Minute Quiz to Tailor Your Study Plan:
Question 1 out of 10

A person thinks of a number between 1 and 1000. You may ask any number questions to them, provided that the question can be answered with either "yes" or "no".

What is the minimum number of questions you needed to ask so that you are guaranteed to know the number that the person is thinking?


Recommended Readings

Want a Structured Path to Master System Design Too? Don’t Miss This!