Leetcode 1410. HTML Entity Parser

Problem Explanation

The goal of this problem is to create a HTML entity parser which receives an input string with HTML entities and returns a string where all HTML entities are replaced by their corresponding characters.

The HTML entities this parser should be able to handle are:

  • " => "
  • ' => '
  • & => &
  • > => >
  • &lt; => <
  • &frasl; => /

Other occurences of & and ; that do not form a valid entity should be kept as they are in the input string.

Let's say we have the following input string:

1
2
3"Stay home! Practice on Leetcode :)"

Since there are no HTML entities in this string, it should be returned unchanged.

Now consider the following string:

1
2
3"and I quote: &quot;...&quot;"

The output of the parser should be:

1
2
3"and I quote: \"...\""

The constrains of the problem are:

  • The length of the string can be at most 10510^5.
  • The string can contain any of the 256 ASCII characters.

Solution Approach

To solve this problem, we will use the following steps:

  1. First, create a hash map (unordered_map in C++, dict in Python, HashMap in Java, etc.) to store the mapping between the HTML entities and their corresponding characters.

  2. Then, we will go through the input string and for each character:

  • If the character is &, it means we found the start of a potential HTML entity. We will store its position in the ampersandIndex variable.
  • If the character is ; and there's an & detected before (ampersandIndex is not -1), it could indicate we found an HTML entity:
    • So, we take the substring from ampersandIndex to the current index (included), and we check if it matches any of the keys of our hash map. If the substring is a valid entity, we replace it by the corresponding character, else we leave it as is.
    • Add the string from the start of the previous entity or the start of the string and the current element to the result string.
  • If the character is neither & nor ;, we continue to the next character.
  1. Add the remaining part of the text to the result string.

  2. Finally, we return the result string.

This is an efficient approach because its time complexity is linear in the size of the text, which is important given the constraints of the problem.

Solutions

Python

1
2python
3class Solution:
4    def entityParser(self, text: str) -> str:
5      # Setup a map for HTML entities to their corresponding characters
6      entityToChar = {"&quot;": '"', "&apos;": "'", "&amp;": '&', "&gt;": '>', "&lt;": '<', "&frasl;": '/'}
7      result = ''
8      # j is used to track the start of the last matched entity
9      # ampersandIndex is used to track the last unmatched &
10      j = 0
11      ampersandIndex = -1
12
13      for i in range(len(text)):
14        # Found & (start of a potential HTML entity)
15        if text[i] == '&':
16          ampersandIndex = i
17        # Found ; and there was an unmatched & before
18        elif text[i] == ';' and ampersandIndex >= j:
19          sub = text[ampersandIndex:i + 1]
20          result += text[j:ampersandIndex]
21          result += entityToChar.get(sub, sub)  # If sub is a valid entity, replace it
22          j = i + 1  # Move j to the next character after ;
23
24      # Add the remaining part of the string that was not covered in the loop
25      return result + text[j:]

Java

1
2java
3class Solution {
4    public String entityParser(String text) {
5        HashMap<String, Character> entityToChar = new HashMap<String, Character>(){{
6            put("&quot;", '"'); put("&apos;", '\''); put("&amp;", '&'); 
7            put("&gt;", '>'); put("&lt;", '<'); put("&frasl;", '/');
8        }};
9
10        StringBuilder result = new StringBuilder();
11        int ampersandIndex = -1;
12        int j = 0;
13
14        for (int i = 0; i < text.length(); i++) {
15            if (text.charAt(i) == '&') {
16                ampersandIndex = i;
17            } else if (text.charAt(i) == ';' && ampersandIndex >= j) {
18                String sub = text.substring(ampersandIndex, i + 1);
19                result.append(text.substring(j, ampersandIndex));
20                result.append(entityToChar.getOrDefault(sub, sub.charAt(0)));
21                j = i + 1;
22            }
23        }
24        result.append(text.substring(j));
25        return result.toString();
26    }
27}

JavaScript

1
2javascript
3var entityParser = function(text) {
4  let entityToChar = {"&quot;": '"', "&apos;": "'", "&amp;": '&', "&gt;": '>', "&lt;": '<', "&frasl;": '/'};
5  let result = '';
6  let j = 0;
7  let ampersandIndex = -1;
8  
9  for (let i = 0; i < text.length; i++) {
10    if (text[i] === '&') {
11      ampersandIndex = i;
12    } else if (text[i] === ';' && ampersandIndex >= j) {
13      let sub = text.substring(ampersandIndex, i + 1);
14      result += text.substring(j, ampersandIndex);
15      result += entityToChar[sub] !== undefined ? entityToChar[sub] : sub;
16      j = i + 1;
17    }
18  }
19
20  result += text.substring(j);
21  return result;
22};

C++

1
2c++
3#include <unordered_map>
4
5class Solution {
6public:
7    string entityParser(string text) {
8        unordered_map<string, char> entityToChar = {{"&quot;", '"'}, {"&apos;", '\''}, {"&amp;", '&'}, 
9                                                    {"&gt;", '>'}, {"&lt;", '<'}, {"&frasl;", '/'}};
10        string result;
11        int ampersandIndex = -1;
12        int j = 0;
13
14        for (int i = 0; i < text.size(); ++i) {
15            if (text[i] == '&') {
16                ampersandIndex = i;
17            } else if (text[i] == ';' && ampersandIndex >= j) {
18                string sub = text.substr(ampersandIndex, i - ampersandIndex + 1);
19                result += text.substr(j, ampersandIndex - j);
20                result += entityToChar.count(sub) ? entityToChar[sub] : sub;
21                j = i + 1;
22            }
23        }
24        result += text.substr(j);
25        return result;
26    }
27};

C#

1
2csharp
3public class Solution {
4    public string EntityParser(string text) {
5        Dictionary<string, char> entityToChar = new Dictionary<string, char>() {
6            {"&quot;", '"'}, {"&apos;", '\''}, {"&amp;", '&'}, {"&gt;", '>'}, 
7            {"&lt;", '<'}, {"&frasl;", '/'}
8        };
9        string result = "";
10        int ampersandIndex = -1;
11        int j = 0;
12
13        for (int i = 0; i < text.Length; i++) {
14            if (text[i] == '&') {
15                ampersandIndex = i;
16            } else if (text[i] == ';' && ampersandIndex >= j) {
17                string sub = text.Substring(ampersandIndex, i - ampersandIndex + 1);
18                result += text.Substring(j, ampersandIndex - j);
19                result += entityToChar.ContainsKey(sub) ? entityToChar[sub].ToString() : sub;
20                j = i + 1;
21            }
22        }
23        result += text.Substring(j);
24        return result;
25    }
26}

In summary, this solution reads the string from beginning to the end and looks for potential HTML entities based on the presence of & and ; characters. If an HTML entity is found, it is replaced in the output string, otherwise the characters are kept. At the end, it puts the remaining of the string to the result.The key to this solution is the use of a hash map to quickly check if a substring is a valid HTML entity or not, to provide the replacement character if it is valid, or keep the original substring if it is not.

The recognition of HTML entities is done by looking at '&' and ';' characters as possible beginning and end (respectively) of an entity. It is important to note that if we see an '&' character, we just record its position but do not immediately decide if it is the start of a valid HTML entity. We have to wait until we find the next ';' character to check if the substring from the recorded '&' to the current ';' forms a valid HTML entity.

This solution works well under given constraints due to the efficient dictionary-lookup and that the main operation (checking and replacing HTML entities) is performed with a linear scan of the string. Moreover, the replacement of substrings by individual characters in the result string does not increase the length of the string, ensuring the solution remains efficient.

In conclusion, the HTML entity parser problem boils down to implementing a simple state machine. With a few hash map operations and string manipulation routines, we can parse and replace HTML entities from any given string.

This exercise is a good showcase of how certain problems can be solved by replacing multiple string patterns with other replacements in one pass. This can be an important skill in tackling problems related to text processing and manipulation, where we are often required to replace certain patterns in an efficient way.

Remember, making good use of data structures (like a hash map in this case) can often lead to efficient and clean solutions to your problems.


Got a question? Ask the Teaching Assistant anything you don't understand.

Still not clear? Ask in the Forum,  Discord or Submit the part you don't understand to our editors.


TA 👨‍🏫