Facebook Pixel

1410. HTML Entity Parser

MediumHash TableString
Leetcode Link

Problem Description

You need to implement an HTML entity parser that converts HTML entities in a string back to their corresponding special characters.

The parser should recognize and replace these 6 HTML entities:

  • ""
  • ''
  • &&
  • >>
  • &lt;<
  • &frasl;/

Given an input string text, you should scan through it and whenever you encounter one of these HTML entities, replace it with its corresponding character. All other characters in the string should remain unchanged.

For example:

  • Input: "&amp; is an HTML entity" would return "& is an HTML entity"
  • Input: "x &gt; y &amp;&amp; x &lt; y is always false" would return "x > y && x < y is always false"

The solution uses a hash table to map each HTML entity to its corresponding character. It then iterates through the string character by character. At each position, it checks if any HTML entity starts at that position (checking substrings of lengths 1 to 7 since the longest entity &frasl; has 7 characters). If an entity is found, it adds the corresponding character to the result and jumps past the entity. If no entity is found at the current position, it simply adds the current character to the result and moves to the next position.

Quick Interview Experience
Help others by sharing your interview experience
Have you seen this problem before?

Intuition

The key observation is that HTML entities always start with & and end with ;. When we encounter an & character while scanning the string, we need to check if it's the beginning of a valid HTML entity.

Since we have a fixed set of 6 entities to recognize, we can use a hash table for quick lookups. The challenge is that entities have different lengths - from 4 characters (&gt;, &lt;) to 7 characters (&frasl;). We can't just check a fixed length substring at each position.

The approach is to scan the string position by position. At each position, we try to match the longest possible entity first, then progressively shorter ones. Why check from shortest to longest (1 to 7 characters)? Because we want to find any valid entity that starts at the current position. Once we find a match, we can immediately process it and jump to the position right after the entity.

If no entity is found starting at the current position (the else clause after the for loop), it means the current character is just a regular character that should be included as-is in the output.

This greedy approach ensures we correctly parse entities without accidentally treating parts of longer entities as shorter ones. For example, if we have &amp; in the text, we want to recognize it as a complete entity for &, not mistakenly process just &am or leave p; unprocessed.

The time complexity is O(n) where n is the length of the string, since we visit each character at most once, and the entity checking at each position is bounded by a constant (maximum 7 checks).

Solution Approach

The solution uses a hash table combined with a simulation approach to parse the HTML entities.

Step 1: Initialize the hash table

d = {
    '&quot;': '"',
    '&apos;': "'",
    '&amp;': "&",
    "&gt;": '>',
    "&lt;": '<',
    "&frasl;": '/',
}

We create a dictionary d that maps each HTML entity string to its corresponding character. This gives us O(1) lookup time when checking if a substring is a valid entity.

Step 2: Set up the traversal variables

i, n = 0, len(text)
ans = []
  • i is our current position pointer in the string
  • n stores the length of the text for boundary checking
  • ans is a list to collect our result characters (using a list is more efficient than string concatenation)

Step 3: Main parsing loop

while i < n:
    for l in range(1, 8):
        j = i + l
        if text[i:j] in d:
            ans.append(d[text[i:j]])
            i = j
            break
    else:
        ans.append(text[i])
        i += 1

The algorithm works as follows:

  • For each position i, we try to match substrings of increasing lengths (1 to 7 characters)
  • We extract the substring text[i:j] where j = i + l
  • If this substring exists in our hash table d, we've found a valid entity:
    • Append the corresponding character to our result
    • Move the pointer i to position j (skip past the entire entity)
    • Break out of the inner loop
  • If no entity is found (the else clause executes when the for loop completes without breaking):
    • The current character is not part of an entity
    • Append the character at position i to the result
    • Move to the next position (i += 1)

Step 4: Return the result

return ''.join(ans)

Finally, we join all characters in the ans list to form the final parsed string.

The algorithm efficiently handles overlapping patterns and ensures that we always match the correct entity at each position, processing the string in a single pass with O(n) time complexity.

Ready to land your dream job?

Unlock your dream job with a 5-minute evaluator for a personalized learning plan!

Start Evaluator

Example Walkthrough

Let's trace through the example: "&amp; is fun"

Initial Setup:

  • Input text: "&amp; is fun"
  • Hash table d contains all 6 HTML entities
  • i = 0, n = 11, ans = []

Iteration 1 (i = 0):

  • Current character: '&' at position 0
  • Check substrings of length 1-7 starting at position 0:
    • Length 1: "&" → not in hash table
    • Length 2: "&a" → not in hash table
    • Length 3: "&am" → not in hash table
    • Length 4: "&amp" → not in hash table
    • Length 5: "&amp;"found in hash table!
  • Action:
    • Append '&' (the value for "&amp;") to ans
    • Set i = 5 (jump past the entire entity)
    • ans = ['&']

Iteration 2 (i = 5):

  • Current character: ' ' (space) at position 5
  • Check substrings of length 1-7:
    • Length 1: " " → not in hash table
    • Length 2: " i" → not in hash table
    • ... (continues checking, none match)
  • No entity found, so:
    • Append ' ' to ans
    • Set i = 6
    • ans = ['&', ' ']

Iterations 3-8 (i = 6 to 11):

  • Process characters 'i', 's', ' ', 'f', 'u', 'n' one by one
  • None form valid entities, so each is added directly to ans
  • Final ans = ['&', ' ', 'i', 's', ' ', 'f', 'u', 'n']

Result:

  • Join the list: "& is fun"
  • The HTML entity &amp; was successfully converted to &

This walkthrough demonstrates how the algorithm:

  1. Recognizes multi-character entities by checking different substring lengths
  2. Skips past entire entities once found (jumping from position 0 to 5)
  3. Handles regular characters by simply appending them when no entity matches

Solution Implementation

1class Solution:
2    def entityParser(self, text: str) -> str:
3        """
4        Parse HTML entities in text and replace them with corresponding characters.
5      
6        Args:
7            text: Input string containing HTML entities
8          
9        Returns:
10            String with HTML entities replaced by their actual characters
11        """
12        # Dictionary mapping HTML entities to their corresponding characters
13        entity_map = {
14            '&quot;': '"',
15            '&apos;': "'",
16            '&amp;': "&",
17            "&gt;": '>',
18            "&lt;": '<',
19            "&frasl;": '/',
20        }
21      
22        # Initialize position index and get text length
23        position = 0
24        text_length = len(text)
25        result = []
26      
27        # Process each character in the text
28        while position < text_length:
29            # Try to match entities of different lengths (1 to 7 characters)
30            # The longest entity is "&frasl;" which is 7 characters
31            entity_found = False
32          
33            for entity_length in range(1, 8):
34                end_position = position + entity_length
35                potential_entity = text[position:end_position]
36              
37                # Check if the substring matches any HTML entity
38                if potential_entity in entity_map:
39                    # Replace entity with its corresponding character
40                    result.append(entity_map[potential_entity])
41                    position = end_position
42                    entity_found = True
43                    break
44          
45            # If no entity was found, append the current character as-is
46            if not entity_found:
47                result.append(text[position])
48                position += 1
49      
50        # Join all characters to form the final string
51        return ''.join(result)
52
1class Solution {
2    public String entityParser(String text) {
3        // Create a mapping of HTML entities to their corresponding characters
4        Map<String, String> entityMap = new HashMap<>();
5        entityMap.put("&quot;", "\"");   // Quotation mark
6        entityMap.put("&apos;", "'");    // Apostrophe
7        entityMap.put("&amp;", "&");     // Ampersand
8        entityMap.put("&gt;", ">");      // Greater than
9        entityMap.put("&lt;", "<");      // Less than
10        entityMap.put("&frasl;", "/");   // Forward slash
11      
12        // StringBuilder to construct the result string
13        StringBuilder result = new StringBuilder();
14      
15        // Initialize pointer for traversing the input text
16        int currentIndex = 0;
17        int textLength = text.length();
18      
19        // Process each character in the text
20        while (currentIndex < textLength) {
21            boolean entityFound = false;
22          
23            // Check for possible HTML entities starting at current position
24            // Maximum entity length is 7 characters (e.g., "&frasl;" has 7 chars)
25            for (int entityLength = 1; entityLength < 8; entityLength++) {
26                int endIndex = currentIndex + entityLength;
27              
28                // Ensure we don't go beyond the text boundary
29                if (endIndex <= textLength) {
30                    // Extract substring to check if it's a valid entity
31                    String potentialEntity = text.substring(currentIndex, endIndex);
32                  
33                    // If the substring matches a known entity, replace it
34                    if (entityMap.containsKey(potentialEntity)) {
35                        result.append(entityMap.get(potentialEntity));
36                        currentIndex = endIndex;  // Move pointer past the entity
37                        entityFound = true;
38                        break;
39                    }
40                }
41            }
42          
43            // If no entity was found, append the current character as-is
44            if (!entityFound) {
45                result.append(text.charAt(currentIndex));
46                currentIndex++;
47            }
48        }
49      
50        // Return the parsed string
51        return result.toString();
52    }
53}
54
1class Solution {
2public:
3    string entityParser(string text) {
4        // Map HTML entities to their corresponding characters
5        unordered_map<string, string> entityMap = {
6            {"&quot;", "\""},   // Quotation mark
7            {"&apos;", "'"},    // Apostrophe
8            {"&amp;", "&"},     // Ampersand
9            {"&gt;", ">"},      // Greater than
10            {"&lt;", "<"},      // Less than
11            {"&frasl;", "/"}    // Forward slash
12        };
13      
14        string result = "";
15        int currentIndex = 0;
16        int textLength = text.size();
17      
18        // Process each character in the text
19        while (currentIndex < textLength) {
20            bool entityFound = false;
21          
22            // Try to match entities of different lengths (1 to 7 characters)
23            // The longest entity is "&frasl;" with 7 characters
24            for (int entityLength = 1; entityLength < 8; ++entityLength) {
25                int endIndex = currentIndex + entityLength;
26              
27                // Check if we're within bounds
28                if (endIndex <= textLength) {
29                    // Extract potential entity substring
30                    string potentialEntity = text.substr(currentIndex, entityLength);
31                  
32                    // Check if this substring is a valid entity
33                    if (entityMap.count(potentialEntity)) {
34                        // Replace entity with its corresponding character
35                        result += entityMap[potentialEntity];
36                        currentIndex = endIndex;
37                        entityFound = true;
38                        break;
39                    }
40                }
41            }
42          
43            // If no entity was found, append the current character as-is
44            if (!entityFound) {
45                result += text[currentIndex++];
46            }
47        }
48      
49        return result;
50    }
51};
52
1/**
2 * Parses HTML entities in a string and replaces them with their corresponding characters
3 * @param text - The input string containing HTML entities
4 * @returns The parsed string with HTML entities replaced
5 */
6function entityParser(text: string): string {
7    // Map of HTML entities to their corresponding characters
8    const htmlEntityMap: Record<string, string> = {
9        '&quot;': '"',
10        '&apos;': "'",
11        '&amp;': '&',
12        '&gt;': '>',
13        '&lt;': '<',
14        '&frasl;': '/',
15    };
16
17    // Result string to build the parsed output
18    let result: string = '';
19    // Current position in the input text
20    let currentIndex: number = 0;
21    const textLength: number = text.length;
22
23    // Process each character in the text
24    while (currentIndex < textLength) {
25        let entityFound: boolean = false;
26      
27        // Try to match HTML entities of different lengths (1 to 7 characters)
28        // The longest entity in our map is 7 characters (&frasl;)
29        for (let entityLength: number = 1; entityLength < 8; ++entityLength) {
30            const endIndex: number = currentIndex + entityLength;
31          
32            // Check if we're within bounds
33            if (endIndex <= textLength) {
34                // Extract potential entity substring
35                const potentialEntity: string = text.substring(currentIndex, endIndex);
36              
37                // Check if this substring is a valid HTML entity
38                if (htmlEntityMap.hasOwnProperty(potentialEntity)) {
39                    // Replace entity with its corresponding character
40                    result += htmlEntityMap[potentialEntity];
41                    currentIndex = endIndex;
42                    entityFound = true;
43                    break;
44                }
45            }
46        }
47
48        // If no entity was found, add the current character as-is
49        if (!entityFound) {
50            result += text[currentIndex++];
51        }
52    }
53
54    return result;
55}
56

Time and Space Complexity

The time complexity is O(n × l), where n is the length of the input string text and l is the maximum length of an HTML entity (which is 7 for "&frasl;").

For each position i in the string, the algorithm checks up to l different substring lengths (from 1 to 7) to find a matching entity in the dictionary. In the worst case, we iterate through all n positions and for each position, we check l substrings, resulting in O(n × l) time complexity.

The space complexity is O(n) for the output list ans that stores the parsed result, plus O(m × l) for the dictionary d where m is the number of entities (6 in this case) and l is the average length of the entities. Since the dictionary size is constant (6 entities), the dictionary space can be considered O(1). Therefore, the overall space complexity is O(n) due to the output storage.

Note: The reference answer states space complexity as O(l), which appears to consider only the constant space used by the dictionary and temporary variables, excluding the output space. If we exclude the output space (which is a common practice in complexity analysis), then the space complexity would indeed be O(l) or more precisely O(1) since l is a constant (maximum 7).

Learn more about how to find time and space complexity quickly.

Common Pitfalls

1. Greedy Matching Leading to Incorrect Parsing

One of the most common pitfalls in this problem is attempting to use a greedy approach that always matches the longest possible entity starting with &. This can lead to incorrect parsing when entities are nested or when & appears without forming a valid entity.

Problem Example:

# Incorrect greedy approach
text = "&amp;gt;"
# Expected output: "&gt;" (parse &amp; first, then gt; remains as is)
# Wrong greedy output: "&amp;gt;" (tries to match longest, finds nothing)

Why it happens: If you try to find the longest match starting from & to the next ;, you might miss valid shorter entities that should be parsed first.

Solution: The correct approach checks all possible lengths (1-7) at each position, not just looking for the longest match:

# Correct: Check each possible length
for entity_length in range(1, 8):
    potential_entity = text[position:position + entity_length]
    if potential_entity in entity_map:
        # Found valid entity, process it
        break

2. Incorrect Handling of Incomplete or Invalid Entities

Another pitfall is mishandling strings like &amp (missing semicolon) or &something; (invalid entity).

Problem Example:

text = "x &amp y"  # Missing semicolon
# Should output: "x &amp y" (no valid entity, keep as is)
# Wrong output might be: "x & y" (incorrectly assuming &amp is valid)

Solution: Only recognize exact matches from the predefined entity map:

# Don't use pattern matching or regex that might be too permissive
# Only exact dictionary lookup ensures correctness
if potential_entity in entity_map:  # Exact match required
    result.append(entity_map[potential_entity])

3. Buffer Overflow When Checking Substrings

When checking for entities near the end of the string, accessing text[position:position + 7] won't cause an error in Python, but in other languages or implementations, you need to be careful about bounds.

Problem Example:

text = "test&"
# At position 4, checking text[4:11] is safe in Python (returns "&")
# But conceptually, we're checking beyond string bounds

Solution: Python handles this gracefully with string slicing, but be aware that:

  • text[i:j] where j > len(text) simply returns text[i:]
  • This is actually beneficial as it naturally handles edge cases
  • In other languages, add explicit bounds checking:
for entity_length in range(1, min(8, text_length - position + 1)):
    # This ensures we don't check beyond available characters

4. Using String Concatenation Instead of List for Building Result

Problem Example:

# Inefficient approach
result = ""
while position < text_length:
    result += character  # String concatenation is O(n) each time

Why it's a problem: Strings are immutable in Python, so each concatenation creates a new string object, leading to O(n²) time complexity for the entire parsing.

Solution: Use a list to collect characters, then join at the end:

result = []
# ... append characters to result list ...
return ''.join(result)  # O(n) operation done once

This approach maintains O(n) time complexity for the entire solution.

Discover Your Strengths and Weaknesses: Take Our 5-Minute Quiz to Tailor Your Study Plan:

Consider the classic dynamic programming of longest increasing subsequence:

Find the length of the longest subsequence of a given sequence such that all elements of the subsequence are sorted in increasing order.

For example, the length of LIS for [50, 3, 10, 7, 40, 80] is 4 and LIS is [3, 7, 40, 80].

What is the recurrence relation?


Recommended Readings

Want a Structured Path to Master System Design Too? Don’t Miss This!

Load More