Facebook Pixel

393. UTF-8 Validation

MediumBit ManipulationArray
Leetcode Link

Problem Description

You are given an integer array data where each integer represents a byte of data (only the least significant 8 bits of each integer are used). Your task is to determine whether this array represents a valid UTF-8 encoding.

UTF-8 is a variable-length character encoding that uses 1 to 4 bytes to represent each character. The encoding follows these rules:

For a 1-byte character:

  • The byte pattern is 0xxxxxxx
  • The first bit is 0, followed by 7 bits representing the character

For a 2-byte character:

  • First byte: 110xxxxx (starts with 110)
  • Second byte: 10xxxxxx (starts with 10)

For a 3-byte character:

  • First byte: 1110xxxx (starts with 1110)
  • Second byte: 10xxxxxx (starts with 10)
  • Third byte: 10xxxxxx (starts with 10)

For a 4-byte character:

  • First byte: 11110xxx (starts with 11110)
  • Second byte: 10xxxxxx (starts with 10)
  • Third byte: 10xxxxxx (starts with 10)
  • Fourth byte: 10xxxxxx (starts with 10)

The x represents bits that can be either 0 or 1 and contain the actual character data.

The solution works by tracking how many continuation bytes (bytes starting with 10) we expect to see. For each byte in the array:

  • If we're expecting continuation bytes (cnt > 0), we check if the current byte starts with 10. If it does, we decrement our counter; if not, the encoding is invalid.
  • If we're not expecting continuation bytes (cnt = 0), we examine the current byte to determine what type of character it starts:
    • If it starts with 0 (checked by v >> 7 == 0), it's a 1-byte character
    • If it starts with 110 (checked by v >> 5 == 0b110), it's the start of a 2-byte character, so we expect 1 continuation byte
    • If it starts with 1110 (checked by v >> 4 == 0b1110), it's the start of a 3-byte character, so we expect 2 continuation bytes
    • If it starts with 11110 (checked by v >> 3 == 0b11110), it's the start of a 4-byte character, so we expect 3 continuation bytes
    • Any other pattern is invalid

The encoding is valid only if we've processed all expected continuation bytes by the end of the array (cnt == 0).

Quick Interview Experience
Help others by sharing your interview experience
Have you seen this problem before?

Intuition

The key insight is that UTF-8 encoding has a very specific structure that we can validate sequentially. When we read UTF-8 bytes, we can always determine from the first byte of a character how many bytes should follow it.

Think of it like reading a book where each sentence tells you exactly how many words it contains at the beginning. If a sentence says "3 words:", you expect exactly 2 more words to follow. Similarly, in UTF-8, the first byte's bit pattern tells us how many continuation bytes to expect.

The natural approach is to process the array byte by byte, maintaining a state that tracks whether we're:

  1. Starting to read a new character (expecting a start byte)
  2. In the middle of reading a multi-byte character (expecting continuation bytes)

This leads us to use a counter variable. When cnt = 0, we're looking for the start of a new character. When cnt > 0, we're expecting that many continuation bytes.

The bit manipulation comes from needing to check specific bit patterns. To check if a byte starts with certain bits, we can right-shift the byte and compare with the expected pattern:

  • To check if a byte starts with 10, we shift right by 6 bits and check if it equals 0b10
  • To check if a byte starts with 110, we shift right by 5 bits and check if it equals 0b110
  • And so on...

The beauty of this approach is that it processes the array in a single pass with constant space. We don't need to look ahead or store previous bytes - we just need to know how many continuation bytes we're still expecting. This makes the solution both time and space efficient.

The validation fails immediately when we encounter an unexpected byte pattern, making the algorithm fail-fast. At the end, we must have cnt = 0 to ensure we didn't end in the middle of a multi-byte character.

Solution Approach

We implement a single-pass algorithm using a counter variable cnt to track the expected number of continuation bytes (bytes starting with 10).

Algorithm Steps:

  1. Initialize the counter: Start with cnt = 0, indicating we're not currently processing any multi-byte character.

  2. Process each byte in the data array:

    Case 1: We're expecting continuation bytes (cnt > 0):

    • Check if the current byte starts with 10 by using v >> 6 != 0b10
    • If it doesn't start with 10, return false (invalid encoding)
    • If it does, decrement cnt by 1 (one less continuation byte to expect)

    Case 2: We're starting a new character (cnt == 0):

    • Determine the character type by checking bit patterns from most specific to least specific:
      • Check if it's a 1-byte character: v >> 7 == 0 (starts with 0)
        • Set cnt = 0 (no continuation bytes needed)
      • Check if it's a 2-byte character: v >> 5 == 0b110 (starts with 110)
        • Set cnt = 1 (expect 1 continuation byte)
      • Check if it's a 3-byte character: v >> 4 == 0b1110 (starts with 1110)
        • Set cnt = 2 (expect 2 continuation bytes)
      • Check if it's a 4-byte character: v >> 3 == 0b11110 (starts with 11110)
        • Set cnt = 3 (expect 3 continuation bytes)
      • If none of the above patterns match, return false (invalid start byte)
  3. Final validation: After processing all bytes, check if cnt == 0. This ensures we didn't end in the middle of a multi-byte character.

Why the bit-shifting works:

  • v >> 6 shifts the byte right by 6 bits, leaving only the top 2 bits, which we compare with 0b10
  • v >> 5 shifts right by 5 bits, leaving the top 3 bits, which we compare with 0b110
  • v >> 4 shifts right by 4 bits, leaving the top 4 bits, which we compare with 0b1110
  • v >> 3 shifts right by 3 bits, leaving the top 5 bits, which we compare with 0b11110

Time Complexity: O(n) where n is the length of the data array, as we process each byte exactly once.

Space Complexity: O(1) as we only use a single variable cnt to maintain state.

The elegance of this solution lies in its state machine-like behavior: we're either expecting continuation bytes or determining what type of character starts next, with immediate validation at each step.

Ready to land your dream job?

Unlock your dream job with a 3-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's walk through the solution with the array data = [197, 130, 1].

First, let's convert these to binary to see the bit patterns:

  • 197 = 11000101
  • 130 = 10000010
  • 1 = 00000001

Step 1: Process byte 197 (11000101)

  • Current state: cnt = 0 (starting a new character)
  • Check bit patterns:
    • 197 >> 7 = 1 (not 0, so not a 1-byte character)
    • 197 >> 5 = 6 = 0b110 ✓ (matches 2-byte character pattern)
  • This is the start of a 2-byte character
  • Set cnt = 1 (expect 1 continuation byte)

Step 2: Process byte 130 (10000010)

  • Current state: cnt = 1 (expecting a continuation byte)
  • Check if it starts with 10:
    • 130 >> 6 = 2 = 0b10 ✓ (valid continuation byte)
  • Decrement cnt = 0 (no more continuation bytes expected)

Step 3: Process byte 1 (00000001)

  • Current state: cnt = 0 (starting a new character)
  • Check bit patterns:
    • 1 >> 7 = 0 ✓ (matches 1-byte character pattern)
  • This is a complete 1-byte character
  • Keep cnt = 0

Step 4: Final validation

  • All bytes processed, cnt = 0
  • Return true - this is valid UTF-8!

The array represents two valid UTF-8 characters:

  1. A 2-byte character (bytes 197 and 130)
  2. A 1-byte character (byte 1)

Counter-example with invalid encoding:

Let's try data = [240, 130] (binary: 11110000, 10000010):

Step 1: Process 240 (11110000)

  • cnt = 0, check patterns:
  • 240 >> 3 = 30 = 0b11110 ✓ (4-byte character)
  • Set cnt = 3 (expect 3 continuation bytes)

Step 2: Process 130 (10000010)

  • cnt = 3, valid continuation byte
  • Decrement cnt = 2

Step 3: Final validation

  • No more bytes, but cnt = 2 (still expecting 2 more bytes!)
  • Return false - invalid UTF-8!

Solution Implementation

1class Solution:
2    def validUtf8(self, data: List[int]) -> bool:
3        # Track the number of continuation bytes expected
4        remaining_bytes = 0
5      
6        for byte_value in data:
7            if remaining_bytes > 0:
8                # Check if current byte is a valid continuation byte (10xxxxxx)
9                if byte_value >> 6 != 0b10:
10                    return False
11                remaining_bytes -= 1
12            else:
13                # Determine the type of UTF-8 character by checking leading bits
14                if byte_value >> 7 == 0:
15                    # 1-byte character (0xxxxxxx)
16                    remaining_bytes = 0
17                elif byte_value >> 5 == 0b110:
18                    # 2-byte character (110xxxxx)
19                    remaining_bytes = 1
20                elif byte_value >> 4 == 0b1110:
21                    # 3-byte character (1110xxxx)
22                    remaining_bytes = 2
23                elif byte_value >> 3 == 0b11110:
24                    # 4-byte character (11110xxx)
25                    remaining_bytes = 3
26                else:
27                    # Invalid UTF-8 starting byte
28                    return False
29      
30        # All bytes should be consumed (no incomplete characters)
31        return remaining_bytes == 0
32
1class Solution {
2    /**
3     * Validates if a sequence of integers represents a valid UTF-8 encoding.
4     * 
5     * @param data Array of integers where each integer represents 1 byte of data
6     * @return true if the data represents valid UTF-8 encoding, false otherwise
7     */
8    public boolean validUtf8(int[] data) {
9        // Counter to track remaining continuation bytes expected
10        int remainingBytes = 0;
11      
12        // Process each byte in the data array
13        for (int currentByte : data) {
14            if (remainingBytes > 0) {
15                // We're expecting a continuation byte (format: 10xxxxxx)
16                // Check if the two most significant bits are '10'
17                if (currentByte >> 6 != 0b10) {
18                    return false;
19                }
20                remainingBytes--;
21            } else {
22                // We're at the start of a new UTF-8 character
23                // Determine the type based on the leading bits
24              
25                if (currentByte >> 7 == 0) {
26                    // 1-byte character (0xxxxxxx)
27                    remainingBytes = 0;
28                } else if (currentByte >> 5 == 0b110) {
29                    // 2-byte character (110xxxxx)
30                    remainingBytes = 1;
31                } else if (currentByte >> 4 == 0b1110) {
32                    // 3-byte character (1110xxxx)
33                    remainingBytes = 2;
34                } else if (currentByte >> 3 == 0b11110) {
35                    // 4-byte character (11110xxx)
36                    remainingBytes = 3;
37                } else {
38                    // Invalid UTF-8 start byte
39                    return false;
40                }
41            }
42        }
43      
44        // All bytes processed - valid only if no continuation bytes are pending
45        return remainingBytes == 0;
46    }
47}
48
1class Solution {
2public:
3    bool validUtf8(vector<int>& data) {
4        int remainingBytes = 0;  // Number of remaining bytes to process for current UTF-8 character
5      
6        for (int& byte : data) {
7            if (remainingBytes > 0) {
8                // Check if current byte is a valid continuation byte (format: 10xxxxxx)
9                if (byte >> 6 != 0b10) {
10                    return false;
11                }
12                remainingBytes--;
13            } else {
14                // Determine the type of UTF-8 character by checking leading bits
15                if (byte >> 7 == 0) {
16                    // 1-byte character (format: 0xxxxxxx)
17                    remainingBytes = 0;
18                } else if (byte >> 5 == 0b110) {
19                    // 2-byte character (format: 110xxxxx)
20                    remainingBytes = 1;
21                } else if (byte >> 4 == 0b1110) {
22                    // 3-byte character (format: 1110xxxx)
23                    remainingBytes = 2;
24                } else if (byte >> 3 == 0b11110) {
25                    // 4-byte character (format: 11110xxx)
26                    remainingBytes = 3;
27                } else {
28                    // Invalid UTF-8 starting byte
29                    return false;
30                }
31            }
32        }
33      
34        // All bytes should be processed (no incomplete UTF-8 character)
35        return remainingBytes == 0;
36    }
37};
38
1/**
2 * Validates if a sequence of integers represents a valid UTF-8 encoding
3 * @param data - Array of integers representing bytes (0-255)
4 * @returns true if the data represents valid UTF-8 encoding, false otherwise
5 */
6function validUtf8(data: number[]): boolean {
7    // Counter for remaining continuation bytes expected
8    let remainingBytes: number = 0;
9  
10    for (const byte of data) {
11        if (remainingBytes > 0) {
12            // Check if current byte is a valid continuation byte (10xxxxxx)
13            if (byte >> 6 !== 0b10) {
14                return false;
15            }
16            remainingBytes--;
17        } else {
18            // Determine the type of UTF-8 character by checking leading bits
19            if (byte >> 7 === 0) {
20                // 1-byte character (0xxxxxxx)
21                remainingBytes = 0;
22            } else if (byte >> 5 === 0b110) {
23                // 2-byte character (110xxxxx)
24                remainingBytes = 1;
25            } else if (byte >> 4 === 0b1110) {
26                // 3-byte character (1110xxxx)
27                remainingBytes = 2;
28            } else if (byte >> 3 === 0b11110) {
29                // 4-byte character (11110xxx)
30                remainingBytes = 3;
31            } else {
32                // Invalid UTF-8 start byte
33                return false;
34            }
35        }
36    }
37  
38    // All bytes should be consumed (no incomplete multi-byte sequence)
39    return remainingBytes === 0;
40}
41

Time and Space Complexity

Time Complexity: O(n), where n is the length of the array data. The algorithm iterates through each element in the data array exactly once. For each element, it performs a constant number of bitwise operations and comparisons (checking v >> 6, v >> 7, v >> 5, v >> 4, v >> 3), which are all O(1) operations. Therefore, the total time complexity is O(n) * O(1) = O(n).

Space Complexity: O(1). The algorithm uses only a fixed amount of extra space regardless of the input size. It maintains a single integer variable cnt to track the number of continuation bytes expected, and a loop variable v. These variables do not scale with the input size, resulting in constant space complexity.

Common Pitfalls

1. Incorrect Order of Bit Pattern Checking

One of the most common mistakes is checking the bit patterns in the wrong order. Since we're using right-shift operations, the patterns become less specific as we shift more bits. If we check them in the wrong order, we might incorrectly classify bytes.

Incorrect approach:

# WRONG: Checking from least specific to most specific
if byte_value >> 3 == 0b11110:  # 4-byte character
    remaining_bytes = 3
elif byte_value >> 4 == 0b1110:  # 3-byte character
    remaining_bytes = 2
elif byte_value >> 5 == 0b110:   # 2-byte character
    remaining_bytes = 1
elif byte_value >> 7 == 0:       # 1-byte character
    remaining_bytes = 0

Why this fails: A byte like 11111000 would incorrectly match the 4-byte pattern first, even though it's actually an invalid UTF-8 byte (starts with 11111).

Correct approach: Always check from most bits to least bits (or reorganize the conditions appropriately):

if byte_value >> 7 == 0:         # Check 1-byte first
    remaining_bytes = 0
elif byte_value >> 5 == 0b110:   # Then 2-byte
    remaining_bytes = 1
elif byte_value >> 4 == 0b1110:  # Then 3-byte
    remaining_bytes = 2
elif byte_value >> 3 == 0b11110: # Finally 4-byte
    remaining_bytes = 3
else:
    return False

2. Forgetting to Mask the Input Bytes

The problem states that "only the least significant 8 bits of each integer are used," but the input integers might have values greater than 255. Failing to handle this can lead to incorrect validation.

Incorrect approach:

for byte_value in data:
    # Using byte_value directly without masking
    if byte_value >> 6 != 0b10:
        return False

Correct approach:

for num in data:
    byte_value = num & 0xFF  # Mask to get only the least significant 8 bits
    if byte_value >> 6 != 0b10:
        return False

3. Not Handling Invalid Starting Bytes (e.g., 10xxxxxx)

A byte starting with 10 should only appear as a continuation byte, never as the start of a character. Some implementations forget to check for this invalid case.

Incomplete validation:

# Missing check for bytes starting with 10xxxxxx as initial bytes
if byte_value >> 7 == 0:
    remaining_bytes = 0
elif byte_value >> 5 == 0b110:
    remaining_bytes = 1
# ... other cases
# Missing else clause to catch invalid patterns like 10xxxxxx

Correct approach: Always include an else clause to catch invalid starting bytes:

if byte_value >> 7 == 0:
    remaining_bytes = 0
elif byte_value >> 5 == 0b110:
    remaining_bytes = 1
elif byte_value >> 4 == 0b1110:
    remaining_bytes = 2
elif byte_value >> 3 == 0b11110:
    remaining_bytes = 3
else:
    return False  # Catches invalid patterns including 10xxxxxx

4. Off-by-One Error in Continuation Byte Count

It's easy to confuse the total number of bytes in a character with the number of continuation bytes needed.

Incorrect:

elif byte_value >> 5 == 0b110:
    remaining_bytes = 2  # WRONG: 2-byte character needs only 1 continuation byte

Correct:

elif byte_value >> 5 == 0b110:
    remaining_bytes = 1  # 2-byte character = 1 start byte + 1 continuation byte

5. Not Validating Complete Character at Array End

Forgetting the final check can result in accepting incomplete UTF-8 sequences.

Incomplete solution:

def validUtf8(self, data):
    remaining_bytes = 0
    for byte_value in data:
        # ... processing logic
    # Missing: return remaining_bytes == 0
    return True  # WRONG: Always returns True even for incomplete sequences

Correct approach: Always verify that no continuation bytes are pending:

def validUtf8(self, data):
    remaining_bytes = 0
    for byte_value in data:
        # ... processing logic
    return remaining_bytes == 0  # Must check that we're not mid-character
Discover Your Strengths and Weaknesses: Take Our 3-Minute Quiz to Tailor Your Study Plan:

Which two pointer techniques do you use to check if a string is a palindrome?


Recommended Readings

Want a Structured Path to Master System Design Too? Don’t Miss This!

Load More