393. UTF-8 Validation
Problem Description
You are given an integer array data
where each integer represents a byte of data (only the least significant 8 bits of each integer are used). Your task is to determine whether this array represents a valid UTF-8 encoding.
UTF-8 is a variable-length character encoding that uses 1 to 4 bytes to represent each character. The encoding follows these rules:
For a 1-byte character:
- The byte pattern is
0xxxxxxx
- The first bit is
0
, followed by 7 bits representing the character
For a 2-byte character:
- First byte:
110xxxxx
(starts with110
) - Second byte:
10xxxxxx
(starts with10
)
For a 3-byte character:
- First byte:
1110xxxx
(starts with1110
) - Second byte:
10xxxxxx
(starts with10
) - Third byte:
10xxxxxx
(starts with10
)
For a 4-byte character:
- First byte:
11110xxx
(starts with11110
) - Second byte:
10xxxxxx
(starts with10
) - Third byte:
10xxxxxx
(starts with10
) - Fourth byte:
10xxxxxx
(starts with10
)
The x
represents bits that can be either 0
or 1
and contain the actual character data.
The solution works by tracking how many continuation bytes (bytes starting with 10
) we expect to see. For each byte in the array:
- If we're expecting continuation bytes (
cnt > 0
), we check if the current byte starts with10
. If it does, we decrement our counter; if not, the encoding is invalid. - If we're not expecting continuation bytes (
cnt = 0
), we examine the current byte to determine what type of character it starts:- If it starts with
0
(checked byv >> 7 == 0
), it's a 1-byte character - If it starts with
110
(checked byv >> 5 == 0b110
), it's the start of a 2-byte character, so we expect 1 continuation byte - If it starts with
1110
(checked byv >> 4 == 0b1110
), it's the start of a 3-byte character, so we expect 2 continuation bytes - If it starts with
11110
(checked byv >> 3 == 0b11110
), it's the start of a 4-byte character, so we expect 3 continuation bytes - Any other pattern is invalid
- If it starts with
The encoding is valid only if we've processed all expected continuation bytes by the end of the array (cnt == 0
).
Intuition
The key insight is that UTF-8 encoding has a very specific structure that we can validate sequentially. When we read UTF-8 bytes, we can always determine from the first byte of a character how many bytes should follow it.
Think of it like reading a book where each sentence tells you exactly how many words it contains at the beginning. If a sentence says "3 words:", you expect exactly 2 more words to follow. Similarly, in UTF-8, the first byte's bit pattern tells us how many continuation bytes to expect.
The natural approach is to process the array byte by byte, maintaining a state that tracks whether we're:
- Starting to read a new character (expecting a start byte)
- In the middle of reading a multi-byte character (expecting continuation bytes)
This leads us to use a counter variable. When cnt = 0
, we're looking for the start of a new character. When cnt > 0
, we're expecting that many continuation bytes.
The bit manipulation comes from needing to check specific bit patterns. To check if a byte starts with certain bits, we can right-shift the byte and compare with the expected pattern:
- To check if a byte starts with
10
, we shift right by 6 bits and check if it equals0b10
- To check if a byte starts with
110
, we shift right by 5 bits and check if it equals0b110
- And so on...
The beauty of this approach is that it processes the array in a single pass with constant space. We don't need to look ahead or store previous bytes - we just need to know how many continuation bytes we're still expecting. This makes the solution both time and space efficient.
The validation fails immediately when we encounter an unexpected byte pattern, making the algorithm fail-fast. At the end, we must have cnt = 0
to ensure we didn't end in the middle of a multi-byte character.
Solution Approach
We implement a single-pass algorithm using a counter variable cnt
to track the expected number of continuation bytes (bytes starting with 10
).
Algorithm Steps:
-
Initialize the counter: Start with
cnt = 0
, indicating we're not currently processing any multi-byte character. -
Process each byte in the data array:
Case 1: We're expecting continuation bytes (
cnt > 0
):- Check if the current byte starts with
10
by usingv >> 6 != 0b10
- If it doesn't start with
10
, returnfalse
(invalid encoding) - If it does, decrement
cnt
by 1 (one less continuation byte to expect)
Case 2: We're starting a new character (
cnt == 0
):- Determine the character type by checking bit patterns from most specific to least specific:
- Check if it's a 1-byte character:
v >> 7 == 0
(starts with0
)- Set
cnt = 0
(no continuation bytes needed)
- Set
- Check if it's a 2-byte character:
v >> 5 == 0b110
(starts with110
)- Set
cnt = 1
(expect 1 continuation byte)
- Set
- Check if it's a 3-byte character:
v >> 4 == 0b1110
(starts with1110
)- Set
cnt = 2
(expect 2 continuation bytes)
- Set
- Check if it's a 4-byte character:
v >> 3 == 0b11110
(starts with11110
)- Set
cnt = 3
(expect 3 continuation bytes)
- Set
- If none of the above patterns match, return
false
(invalid start byte)
- Check if it's a 1-byte character:
- Check if the current byte starts with
-
Final validation: After processing all bytes, check if
cnt == 0
. This ensures we didn't end in the middle of a multi-byte character.
Why the bit-shifting works:
v >> 6
shifts the byte right by 6 bits, leaving only the top 2 bits, which we compare with0b10
v >> 5
shifts right by 5 bits, leaving the top 3 bits, which we compare with0b110
v >> 4
shifts right by 4 bits, leaving the top 4 bits, which we compare with0b1110
v >> 3
shifts right by 3 bits, leaving the top 5 bits, which we compare with0b11110
Time Complexity: O(n)
where n
is the length of the data array, as we process each byte exactly once.
Space Complexity: O(1)
as we only use a single variable cnt
to maintain state.
The elegance of this solution lies in its state machine-like behavior: we're either expecting continuation bytes or determining what type of character starts next, with immediate validation at each step.
Ready to land your dream job?
Unlock your dream job with a 3-minute evaluator for a personalized learning plan!
Start EvaluatorExample Walkthrough
Let's walk through the solution with the array data = [197, 130, 1]
.
First, let's convert these to binary to see the bit patterns:
- 197 =
11000101
- 130 =
10000010
- 1 =
00000001
Step 1: Process byte 197 (11000101
)
- Current state:
cnt = 0
(starting a new character) - Check bit patterns:
197 >> 7 = 1
(not 0, so not a 1-byte character)197 >> 5 = 6 = 0b110
✓ (matches 2-byte character pattern)
- This is the start of a 2-byte character
- Set
cnt = 1
(expect 1 continuation byte)
Step 2: Process byte 130 (10000010
)
- Current state:
cnt = 1
(expecting a continuation byte) - Check if it starts with
10
:130 >> 6 = 2 = 0b10
✓ (valid continuation byte)
- Decrement
cnt = 0
(no more continuation bytes expected)
Step 3: Process byte 1 (00000001
)
- Current state:
cnt = 0
(starting a new character) - Check bit patterns:
1 >> 7 = 0
✓ (matches 1-byte character pattern)
- This is a complete 1-byte character
- Keep
cnt = 0
Step 4: Final validation
- All bytes processed,
cnt = 0
✓ - Return
true
- this is valid UTF-8!
The array represents two valid UTF-8 characters:
- A 2-byte character (bytes 197 and 130)
- A 1-byte character (byte 1)
Counter-example with invalid encoding:
Let's try data = [240, 130]
(binary: 11110000
, 10000010
):
Step 1: Process 240 (11110000
)
cnt = 0
, check patterns:240 >> 3 = 30 = 0b11110
✓ (4-byte character)- Set
cnt = 3
(expect 3 continuation bytes)
Step 2: Process 130 (10000010
)
cnt = 3
, valid continuation byte- Decrement
cnt = 2
Step 3: Final validation
- No more bytes, but
cnt = 2
(still expecting 2 more bytes!) - Return
false
- invalid UTF-8!
Solution Implementation
1class Solution:
2 def validUtf8(self, data: List[int]) -> bool:
3 # Track the number of continuation bytes expected
4 remaining_bytes = 0
5
6 for byte_value in data:
7 if remaining_bytes > 0:
8 # Check if current byte is a valid continuation byte (10xxxxxx)
9 if byte_value >> 6 != 0b10:
10 return False
11 remaining_bytes -= 1
12 else:
13 # Determine the type of UTF-8 character by checking leading bits
14 if byte_value >> 7 == 0:
15 # 1-byte character (0xxxxxxx)
16 remaining_bytes = 0
17 elif byte_value >> 5 == 0b110:
18 # 2-byte character (110xxxxx)
19 remaining_bytes = 1
20 elif byte_value >> 4 == 0b1110:
21 # 3-byte character (1110xxxx)
22 remaining_bytes = 2
23 elif byte_value >> 3 == 0b11110:
24 # 4-byte character (11110xxx)
25 remaining_bytes = 3
26 else:
27 # Invalid UTF-8 starting byte
28 return False
29
30 # All bytes should be consumed (no incomplete characters)
31 return remaining_bytes == 0
32
1class Solution {
2 /**
3 * Validates if a sequence of integers represents a valid UTF-8 encoding.
4 *
5 * @param data Array of integers where each integer represents 1 byte of data
6 * @return true if the data represents valid UTF-8 encoding, false otherwise
7 */
8 public boolean validUtf8(int[] data) {
9 // Counter to track remaining continuation bytes expected
10 int remainingBytes = 0;
11
12 // Process each byte in the data array
13 for (int currentByte : data) {
14 if (remainingBytes > 0) {
15 // We're expecting a continuation byte (format: 10xxxxxx)
16 // Check if the two most significant bits are '10'
17 if (currentByte >> 6 != 0b10) {
18 return false;
19 }
20 remainingBytes--;
21 } else {
22 // We're at the start of a new UTF-8 character
23 // Determine the type based on the leading bits
24
25 if (currentByte >> 7 == 0) {
26 // 1-byte character (0xxxxxxx)
27 remainingBytes = 0;
28 } else if (currentByte >> 5 == 0b110) {
29 // 2-byte character (110xxxxx)
30 remainingBytes = 1;
31 } else if (currentByte >> 4 == 0b1110) {
32 // 3-byte character (1110xxxx)
33 remainingBytes = 2;
34 } else if (currentByte >> 3 == 0b11110) {
35 // 4-byte character (11110xxx)
36 remainingBytes = 3;
37 } else {
38 // Invalid UTF-8 start byte
39 return false;
40 }
41 }
42 }
43
44 // All bytes processed - valid only if no continuation bytes are pending
45 return remainingBytes == 0;
46 }
47}
48
1class Solution {
2public:
3 bool validUtf8(vector<int>& data) {
4 int remainingBytes = 0; // Number of remaining bytes to process for current UTF-8 character
5
6 for (int& byte : data) {
7 if (remainingBytes > 0) {
8 // Check if current byte is a valid continuation byte (format: 10xxxxxx)
9 if (byte >> 6 != 0b10) {
10 return false;
11 }
12 remainingBytes--;
13 } else {
14 // Determine the type of UTF-8 character by checking leading bits
15 if (byte >> 7 == 0) {
16 // 1-byte character (format: 0xxxxxxx)
17 remainingBytes = 0;
18 } else if (byte >> 5 == 0b110) {
19 // 2-byte character (format: 110xxxxx)
20 remainingBytes = 1;
21 } else if (byte >> 4 == 0b1110) {
22 // 3-byte character (format: 1110xxxx)
23 remainingBytes = 2;
24 } else if (byte >> 3 == 0b11110) {
25 // 4-byte character (format: 11110xxx)
26 remainingBytes = 3;
27 } else {
28 // Invalid UTF-8 starting byte
29 return false;
30 }
31 }
32 }
33
34 // All bytes should be processed (no incomplete UTF-8 character)
35 return remainingBytes == 0;
36 }
37};
38
1/**
2 * Validates if a sequence of integers represents a valid UTF-8 encoding
3 * @param data - Array of integers representing bytes (0-255)
4 * @returns true if the data represents valid UTF-8 encoding, false otherwise
5 */
6function validUtf8(data: number[]): boolean {
7 // Counter for remaining continuation bytes expected
8 let remainingBytes: number = 0;
9
10 for (const byte of data) {
11 if (remainingBytes > 0) {
12 // Check if current byte is a valid continuation byte (10xxxxxx)
13 if (byte >> 6 !== 0b10) {
14 return false;
15 }
16 remainingBytes--;
17 } else {
18 // Determine the type of UTF-8 character by checking leading bits
19 if (byte >> 7 === 0) {
20 // 1-byte character (0xxxxxxx)
21 remainingBytes = 0;
22 } else if (byte >> 5 === 0b110) {
23 // 2-byte character (110xxxxx)
24 remainingBytes = 1;
25 } else if (byte >> 4 === 0b1110) {
26 // 3-byte character (1110xxxx)
27 remainingBytes = 2;
28 } else if (byte >> 3 === 0b11110) {
29 // 4-byte character (11110xxx)
30 remainingBytes = 3;
31 } else {
32 // Invalid UTF-8 start byte
33 return false;
34 }
35 }
36 }
37
38 // All bytes should be consumed (no incomplete multi-byte sequence)
39 return remainingBytes === 0;
40}
41
Time and Space Complexity
Time Complexity: O(n)
, where n
is the length of the array data
. The algorithm iterates through each element in the data array exactly once. For each element, it performs a constant number of bitwise operations and comparisons (checking v >> 6
, v >> 7
, v >> 5
, v >> 4
, v >> 3
), which are all O(1)
operations. Therefore, the total time complexity is O(n) * O(1) = O(n)
.
Space Complexity: O(1)
. The algorithm uses only a fixed amount of extra space regardless of the input size. It maintains a single integer variable cnt
to track the number of continuation bytes expected, and a loop variable v
. These variables do not scale with the input size, resulting in constant space complexity.
Common Pitfalls
1. Incorrect Order of Bit Pattern Checking
One of the most common mistakes is checking the bit patterns in the wrong order. Since we're using right-shift operations, the patterns become less specific as we shift more bits. If we check them in the wrong order, we might incorrectly classify bytes.
Incorrect approach:
# WRONG: Checking from least specific to most specific if byte_value >> 3 == 0b11110: # 4-byte character remaining_bytes = 3 elif byte_value >> 4 == 0b1110: # 3-byte character remaining_bytes = 2 elif byte_value >> 5 == 0b110: # 2-byte character remaining_bytes = 1 elif byte_value >> 7 == 0: # 1-byte character remaining_bytes = 0
Why this fails: A byte like 11111000
would incorrectly match the 4-byte pattern first, even though it's actually an invalid UTF-8 byte (starts with 11111
).
Correct approach: Always check from most bits to least bits (or reorganize the conditions appropriately):
if byte_value >> 7 == 0: # Check 1-byte first remaining_bytes = 0 elif byte_value >> 5 == 0b110: # Then 2-byte remaining_bytes = 1 elif byte_value >> 4 == 0b1110: # Then 3-byte remaining_bytes = 2 elif byte_value >> 3 == 0b11110: # Finally 4-byte remaining_bytes = 3 else: return False
2. Forgetting to Mask the Input Bytes
The problem states that "only the least significant 8 bits of each integer are used," but the input integers might have values greater than 255. Failing to handle this can lead to incorrect validation.
Incorrect approach:
for byte_value in data: # Using byte_value directly without masking if byte_value >> 6 != 0b10: return False
Correct approach:
for num in data: byte_value = num & 0xFF # Mask to get only the least significant 8 bits if byte_value >> 6 != 0b10: return False
3. Not Handling Invalid Starting Bytes (e.g., 10xxxxxx)
A byte starting with 10
should only appear as a continuation byte, never as the start of a character. Some implementations forget to check for this invalid case.
Incomplete validation:
# Missing check for bytes starting with 10xxxxxx as initial bytes if byte_value >> 7 == 0: remaining_bytes = 0 elif byte_value >> 5 == 0b110: remaining_bytes = 1 # ... other cases # Missing else clause to catch invalid patterns like 10xxxxxx
Correct approach: Always include an else
clause to catch invalid starting bytes:
if byte_value >> 7 == 0: remaining_bytes = 0 elif byte_value >> 5 == 0b110: remaining_bytes = 1 elif byte_value >> 4 == 0b1110: remaining_bytes = 2 elif byte_value >> 3 == 0b11110: remaining_bytes = 3 else: return False # Catches invalid patterns including 10xxxxxx
4. Off-by-One Error in Continuation Byte Count
It's easy to confuse the total number of bytes in a character with the number of continuation bytes needed.
Incorrect:
elif byte_value >> 5 == 0b110: remaining_bytes = 2 # WRONG: 2-byte character needs only 1 continuation byte
Correct:
elif byte_value >> 5 == 0b110: remaining_bytes = 1 # 2-byte character = 1 start byte + 1 continuation byte
5. Not Validating Complete Character at Array End
Forgetting the final check can result in accepting incomplete UTF-8 sequences.
Incomplete solution:
def validUtf8(self, data):
remaining_bytes = 0
for byte_value in data:
# ... processing logic
# Missing: return remaining_bytes == 0
return True # WRONG: Always returns True even for incomplete sequences
Correct approach: Always verify that no continuation bytes are pending:
def validUtf8(self, data):
remaining_bytes = 0
for byte_value in data:
# ... processing logic
return remaining_bytes == 0 # Must check that we're not mid-character
Which two pointer techniques do you use to check if a string is a palindrome?
Recommended Readings
Coding Interview Patterns Your Personal Dijkstra's Algorithm to Landing Your Dream Job The goal of AlgoMonster is to help you get a job in the shortest amount of time possible in a data driven way We compiled datasets of tech interview problems and broke them down by patterns This way
Recursion Recursion is one of the most important concepts in computer science Simply speaking recursion is the process of a function calling itself Using a real life analogy imagine a scenario where you invite your friends to lunch https assets algo monster recursion jpg You first call Ben and ask
Runtime Overview When learning about algorithms and data structures you'll frequently encounter the term time complexity This concept is fundamental in computer science and offers insights into how long an algorithm takes to complete given a certain input size What is Time Complexity Time complexity represents the amount of time
Want a Structured Path to Master System Design Too? Don’t Miss This!