1410. HTML Entity Parser
Problem Description
You need to implement an HTML entity parser that converts HTML entities in a string back to their corresponding special characters.
The parser should recognize and replace these 6 HTML entities:
"
→"
'
→'
&
→&
>
→>
<
→<
⁄
→/
Given an input string text
, you should scan through it and whenever you encounter one of these HTML entities, replace it with its corresponding character. All other characters in the string should remain unchanged.
For example:
- Input:
"& is an HTML entity"
would return"& is an HTML entity"
- Input:
"x > y && x < y is always false"
would return"x > y && x < y is always false"
The solution uses a hash table to map each HTML entity to its corresponding character. It then iterates through the string character by character. At each position, it checks if any HTML entity starts at that position (checking substrings of lengths 1 to 7 since the longest entity ⁄
has 7 characters). If an entity is found, it adds the corresponding character to the result and jumps past the entity. If no entity is found at the current position, it simply adds the current character to the result and moves to the next position.
Intuition
The key observation is that HTML entities always start with &
and end with ;
. When we encounter an &
character while scanning the string, we need to check if it's the beginning of a valid HTML entity.
Since we have a fixed set of 6 entities to recognize, we can use a hash table for quick lookups. The challenge is that entities have different lengths - from 4 characters (>
, <
) to 7 characters (⁄
). We can't just check a fixed length substring at each position.
The approach is to scan the string position by position. At each position, we try to match the longest possible entity first, then progressively shorter ones. Why check from shortest to longest (1 to 7 characters)? Because we want to find any valid entity that starts at the current position. Once we find a match, we can immediately process it and jump to the position right after the entity.
If no entity is found starting at the current position (the else
clause after the for
loop), it means the current character is just a regular character that should be included as-is in the output.
This greedy approach ensures we correctly parse entities without accidentally treating parts of longer entities as shorter ones. For example, if we have &
in the text, we want to recognize it as a complete entity for &
, not mistakenly process just &am
or leave p;
unprocessed.
The time complexity is O(n)
where n
is the length of the string, since we visit each character at most once, and the entity checking at each position is bounded by a constant (maximum 7 checks).
Solution Approach
The solution uses a hash table combined with a simulation approach to parse the HTML entities.
Step 1: Initialize the hash table
d = { '"': '"', ''': "'", '&': "&", ">": '>', "<": '<', "⁄": '/', }
We create a dictionary d
that maps each HTML entity string to its corresponding character. This gives us O(1) lookup time when checking if a substring is a valid entity.
Step 2: Set up the traversal variables
i, n = 0, len(text)
ans = []
i
is our current position pointer in the stringn
stores the length of the text for boundary checkingans
is a list to collect our result characters (using a list is more efficient than string concatenation)
Step 3: Main parsing loop
while i < n:
for l in range(1, 8):
j = i + l
if text[i:j] in d:
ans.append(d[text[i:j]])
i = j
break
else:
ans.append(text[i])
i += 1
The algorithm works as follows:
- For each position
i
, we try to match substrings of increasing lengths (1 to 7 characters) - We extract the substring
text[i:j]
wherej = i + l
- If this substring exists in our hash table
d
, we've found a valid entity:- Append the corresponding character to our result
- Move the pointer
i
to positionj
(skip past the entire entity) - Break out of the inner loop
- If no entity is found (the
else
clause executes when thefor
loop completes without breaking):- The current character is not part of an entity
- Append the character at position
i
to the result - Move to the next position (
i += 1
)
Step 4: Return the result
return ''.join(ans)
Finally, we join all characters in the ans
list to form the final parsed string.
The algorithm efficiently handles overlapping patterns and ensures that we always match the correct entity at each position, processing the string in a single pass with O(n) time complexity.
Ready to land your dream job?
Unlock your dream job with a 5-minute evaluator for a personalized learning plan!
Start EvaluatorExample Walkthrough
Let's trace through the example: "& is fun"
Initial Setup:
- Input text:
"& is fun"
- Hash table
d
contains all 6 HTML entities i = 0
,n = 11
,ans = []
Iteration 1 (i = 0):
- Current character:
'&'
at position 0 - Check substrings of length 1-7 starting at position 0:
- Length 1:
"&"
→ not in hash table - Length 2:
"&a"
→ not in hash table - Length 3:
"&am"
→ not in hash table - Length 4:
"&"
→ not in hash table - Length 5:
"&"
→ found in hash table!
- Length 1:
- Action:
- Append
'&'
(the value for"&"
) toans
- Set
i = 5
(jump past the entire entity) ans = ['&']
- Append
Iteration 2 (i = 5):
- Current character:
' '
(space) at position 5 - Check substrings of length 1-7:
- Length 1:
" "
→ not in hash table - Length 2:
" i"
→ not in hash table - ... (continues checking, none match)
- Length 1:
- No entity found, so:
- Append
' '
toans
- Set
i = 6
ans = ['&', ' ']
- Append
Iterations 3-8 (i = 6 to 11):
- Process characters
'i'
,'s'
,' '
,'f'
,'u'
,'n'
one by one - None form valid entities, so each is added directly to
ans
- Final
ans = ['&', ' ', 'i', 's', ' ', 'f', 'u', 'n']
Result:
- Join the list:
"& is fun"
- The HTML entity
&
was successfully converted to&
This walkthrough demonstrates how the algorithm:
- Recognizes multi-character entities by checking different substring lengths
- Skips past entire entities once found (jumping from position 0 to 5)
- Handles regular characters by simply appending them when no entity matches
Solution Implementation
1class Solution:
2 def entityParser(self, text: str) -> str:
3 """
4 Parse HTML entities in text and replace them with corresponding characters.
5
6 Args:
7 text: Input string containing HTML entities
8
9 Returns:
10 String with HTML entities replaced by their actual characters
11 """
12 # Dictionary mapping HTML entities to their corresponding characters
13 entity_map = {
14 '"': '"',
15 ''': "'",
16 '&': "&",
17 ">": '>',
18 "<": '<',
19 "⁄": '/',
20 }
21
22 # Initialize position index and get text length
23 position = 0
24 text_length = len(text)
25 result = []
26
27 # Process each character in the text
28 while position < text_length:
29 # Try to match entities of different lengths (1 to 7 characters)
30 # The longest entity is "⁄" which is 7 characters
31 entity_found = False
32
33 for entity_length in range(1, 8):
34 end_position = position + entity_length
35 potential_entity = text[position:end_position]
36
37 # Check if the substring matches any HTML entity
38 if potential_entity in entity_map:
39 # Replace entity with its corresponding character
40 result.append(entity_map[potential_entity])
41 position = end_position
42 entity_found = True
43 break
44
45 # If no entity was found, append the current character as-is
46 if not entity_found:
47 result.append(text[position])
48 position += 1
49
50 # Join all characters to form the final string
51 return ''.join(result)
52
1class Solution {
2 public String entityParser(String text) {
3 // Create a mapping of HTML entities to their corresponding characters
4 Map<String, String> entityMap = new HashMap<>();
5 entityMap.put(""", "\""); // Quotation mark
6 entityMap.put("'", "'"); // Apostrophe
7 entityMap.put("&", "&"); // Ampersand
8 entityMap.put(">", ">"); // Greater than
9 entityMap.put("<", "<"); // Less than
10 entityMap.put("⁄", "/"); // Forward slash
11
12 // StringBuilder to construct the result string
13 StringBuilder result = new StringBuilder();
14
15 // Initialize pointer for traversing the input text
16 int currentIndex = 0;
17 int textLength = text.length();
18
19 // Process each character in the text
20 while (currentIndex < textLength) {
21 boolean entityFound = false;
22
23 // Check for possible HTML entities starting at current position
24 // Maximum entity length is 7 characters (e.g., "⁄" has 7 chars)
25 for (int entityLength = 1; entityLength < 8; entityLength++) {
26 int endIndex = currentIndex + entityLength;
27
28 // Ensure we don't go beyond the text boundary
29 if (endIndex <= textLength) {
30 // Extract substring to check if it's a valid entity
31 String potentialEntity = text.substring(currentIndex, endIndex);
32
33 // If the substring matches a known entity, replace it
34 if (entityMap.containsKey(potentialEntity)) {
35 result.append(entityMap.get(potentialEntity));
36 currentIndex = endIndex; // Move pointer past the entity
37 entityFound = true;
38 break;
39 }
40 }
41 }
42
43 // If no entity was found, append the current character as-is
44 if (!entityFound) {
45 result.append(text.charAt(currentIndex));
46 currentIndex++;
47 }
48 }
49
50 // Return the parsed string
51 return result.toString();
52 }
53}
54
1class Solution {
2public:
3 string entityParser(string text) {
4 // Map HTML entities to their corresponding characters
5 unordered_map<string, string> entityMap = {
6 {""", "\""}, // Quotation mark
7 {"'", "'"}, // Apostrophe
8 {"&", "&"}, // Ampersand
9 {">", ">"}, // Greater than
10 {"<", "<"}, // Less than
11 {"⁄", "/"} // Forward slash
12 };
13
14 string result = "";
15 int currentIndex = 0;
16 int textLength = text.size();
17
18 // Process each character in the text
19 while (currentIndex < textLength) {
20 bool entityFound = false;
21
22 // Try to match entities of different lengths (1 to 7 characters)
23 // The longest entity is "⁄" with 7 characters
24 for (int entityLength = 1; entityLength < 8; ++entityLength) {
25 int endIndex = currentIndex + entityLength;
26
27 // Check if we're within bounds
28 if (endIndex <= textLength) {
29 // Extract potential entity substring
30 string potentialEntity = text.substr(currentIndex, entityLength);
31
32 // Check if this substring is a valid entity
33 if (entityMap.count(potentialEntity)) {
34 // Replace entity with its corresponding character
35 result += entityMap[potentialEntity];
36 currentIndex = endIndex;
37 entityFound = true;
38 break;
39 }
40 }
41 }
42
43 // If no entity was found, append the current character as-is
44 if (!entityFound) {
45 result += text[currentIndex++];
46 }
47 }
48
49 return result;
50 }
51};
52
1/**
2 * Parses HTML entities in a string and replaces them with their corresponding characters
3 * @param text - The input string containing HTML entities
4 * @returns The parsed string with HTML entities replaced
5 */
6function entityParser(text: string): string {
7 // Map of HTML entities to their corresponding characters
8 const htmlEntityMap: Record<string, string> = {
9 '"': '"',
10 ''': "'",
11 '&': '&',
12 '>': '>',
13 '<': '<',
14 '⁄': '/',
15 };
16
17 // Result string to build the parsed output
18 let result: string = '';
19 // Current position in the input text
20 let currentIndex: number = 0;
21 const textLength: number = text.length;
22
23 // Process each character in the text
24 while (currentIndex < textLength) {
25 let entityFound: boolean = false;
26
27 // Try to match HTML entities of different lengths (1 to 7 characters)
28 // The longest entity in our map is 7 characters (⁄)
29 for (let entityLength: number = 1; entityLength < 8; ++entityLength) {
30 const endIndex: number = currentIndex + entityLength;
31
32 // Check if we're within bounds
33 if (endIndex <= textLength) {
34 // Extract potential entity substring
35 const potentialEntity: string = text.substring(currentIndex, endIndex);
36
37 // Check if this substring is a valid HTML entity
38 if (htmlEntityMap.hasOwnProperty(potentialEntity)) {
39 // Replace entity with its corresponding character
40 result += htmlEntityMap[potentialEntity];
41 currentIndex = endIndex;
42 entityFound = true;
43 break;
44 }
45 }
46 }
47
48 // If no entity was found, add the current character as-is
49 if (!entityFound) {
50 result += text[currentIndex++];
51 }
52 }
53
54 return result;
55}
56
Time and Space Complexity
The time complexity is O(n × l)
, where n
is the length of the input string text
and l
is the maximum length of an HTML entity (which is 7 for "⁄"
).
For each position i
in the string, the algorithm checks up to l
different substring lengths (from 1 to 7) to find a matching entity in the dictionary. In the worst case, we iterate through all n
positions and for each position, we check l
substrings, resulting in O(n × l)
time complexity.
The space complexity is O(n)
for the output list ans
that stores the parsed result, plus O(m × l)
for the dictionary d
where m
is the number of entities (6 in this case) and l
is the average length of the entities. Since the dictionary size is constant (6 entities), the dictionary space can be considered O(1)
. Therefore, the overall space complexity is O(n)
due to the output storage.
Note: The reference answer states space complexity as O(l)
, which appears to consider only the constant space used by the dictionary and temporary variables, excluding the output space. If we exclude the output space (which is a common practice in complexity analysis), then the space complexity would indeed be O(l)
or more precisely O(1)
since l
is a constant (maximum 7).
Learn more about how to find time and space complexity quickly.
Common Pitfalls
1. Greedy Matching Leading to Incorrect Parsing
One of the most common pitfalls in this problem is attempting to use a greedy approach that always matches the longest possible entity starting with &
. This can lead to incorrect parsing when entities are nested or when &
appears without forming a valid entity.
Problem Example:
# Incorrect greedy approach text = "&gt;" # Expected output: ">" (parse & first, then gt; remains as is) # Wrong greedy output: "&gt;" (tries to match longest, finds nothing)
Why it happens: If you try to find the longest match starting from &
to the next ;
, you might miss valid shorter entities that should be parsed first.
Solution: The correct approach checks all possible lengths (1-7) at each position, not just looking for the longest match:
# Correct: Check each possible length
for entity_length in range(1, 8):
potential_entity = text[position:position + entity_length]
if potential_entity in entity_map:
# Found valid entity, process it
break
2. Incorrect Handling of Incomplete or Invalid Entities
Another pitfall is mishandling strings like &
(missing semicolon) or &something;
(invalid entity).
Problem Example:
text = "x & y" # Missing semicolon # Should output: "x & y" (no valid entity, keep as is) # Wrong output might be: "x & y" (incorrectly assuming & is valid)
Solution: Only recognize exact matches from the predefined entity map:
# Don't use pattern matching or regex that might be too permissive # Only exact dictionary lookup ensures correctness if potential_entity in entity_map: # Exact match required result.append(entity_map[potential_entity])
3. Buffer Overflow When Checking Substrings
When checking for entities near the end of the string, accessing text[position:position + 7]
won't cause an error in Python, but in other languages or implementations, you need to be careful about bounds.
Problem Example:
text = "test&" # At position 4, checking text[4:11] is safe in Python (returns "&") # But conceptually, we're checking beyond string bounds
Solution: Python handles this gracefully with string slicing, but be aware that:
text[i:j]
wherej > len(text)
simply returnstext[i:]
- This is actually beneficial as it naturally handles edge cases
- In other languages, add explicit bounds checking:
for entity_length in range(1, min(8, text_length - position + 1)):
# This ensures we don't check beyond available characters
4. Using String Concatenation Instead of List for Building Result
Problem Example:
# Inefficient approach result = "" while position < text_length: result += character # String concatenation is O(n) each time
Why it's a problem: Strings are immutable in Python, so each concatenation creates a new string object, leading to O(n²) time complexity for the entire parsing.
Solution: Use a list to collect characters, then join at the end:
result = [] # ... append characters to result list ... return ''.join(result) # O(n) operation done once
This approach maintains O(n) time complexity for the entire solution.
Consider the classic dynamic programming of longest increasing subsequence:
Find the length of the longest subsequence of a given sequence such that all elements of the subsequence are sorted in increasing order.
For example, the length of LIS for [50, 3, 10, 7, 40, 80]
is 4
and LIS is
[3, 7, 40, 80]
.
What is the recurrence relation?
Recommended Readings
Coding Interview Patterns Your Personal Dijkstra's Algorithm to Landing Your Dream Job The goal of AlgoMonster is to help you get a job in the shortest amount of time possible in a data driven way We compiled datasets of tech interview problems and broke them down by patterns This way
Recursion Recursion is one of the most important concepts in computer science Simply speaking recursion is the process of a function calling itself Using a real life analogy imagine a scenario where you invite your friends to lunch https assets algo monster recursion jpg You first call Ben and ask
Runtime Overview When learning about algorithms and data structures you'll frequently encounter the term time complexity This concept is fundamental in computer science and offers insights into how long an algorithm takes to complete given a certain input size What is Time Complexity Time complexity represents the amount of time
Want a Structured Path to Master System Design Too? Don’t Miss This!