Leetcode 1236. Web Crawler
Problem Explanation
Here we are asked to implement a basic web crawler. A web crawler is a bot or script that systematically browses the World Wide Web in an automated manner. Crawlers are mainly used to create a copy of the visited pages for later processing, for purposes ranging from analysing the data on those pages to indexing them so they become searchable.
Our program takes a start url and must find all links on that url's page that share the start url's hostname. It should keep following those links to discover further links, and so on, until every reachable page whose hostname matches that of the start url has been visited. The hostname of a url is everything after the http:// or https:// up to the next slash '/'.
As an example, given the following urls:
urls = [ "http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.google.com", "http://news.yahoo.com/us" ]
and the startUrl = "http://news.yahoo.com/news/topics/", our program will fetch all pages among these urls that have the hostname "news.yahoo.com", i.e. every page except "http://news.google.com". The program ignores the "news.google.com" url and returns all urls under "news.yahoo.com".
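To make the hostname check concrete, here is a quick Python sketch (separate from the judged solutions below) that extracts the hostname by splitting the url on '/' and filters the example urls accordingly:

```python
# Hypothetical helper for illustration only: the hostname is the third piece
# produced by splitting on '/', e.g.
# "http://news.yahoo.com/news/topics/".split("/")
#   -> ["http:", "", "news.yahoo.com", "news", "topics", ""]
def get_hostname(url: str) -> str:
    return url.split("/")[2]

urls = [
    "http://news.yahoo.com",
    "http://news.yahoo.com/news",
    "http://news.yahoo.com/news/topics/",
    "http://news.google.com",
    "http://news.yahoo.com/us",
]
start_url = "http://news.yahoo.com/news/topics/"
host = get_hostname(start_url)

print([u for u in urls if get_hostname(u) == host])
# ['http://news.yahoo.com', 'http://news.yahoo.com/news',
#  'http://news.yahoo.com/news/topics/', 'http://news.yahoo.com/us']
```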
Solution Approach
In our solution, we implement a Breadth-first search (BFS) traversal using a queue, together with an unordered set that tracks visited urls so we never process a duplicate. We first add the start url to the queue. As we dequeue each url, we get all links on its page via the HtmlParser's getUrls function. For each link, we check whether it has been seen before and whether its hostname matches that of the start url. If the link is new and matching, we push it onto the queue for processing and add it to the visited set so it is not processed again if it appears on another page.
C++ solution
```cpp
class Solution {
 public:
  vector<string> crawl(string startUrl, HtmlParser htmlParser) {
    // Initialize a queue with the start url
    queue<string> q{{startUrl}};
    // Initialize a set to keep track of visited urls
    unordered_set<string> seen{{startUrl}};
    // Extract the hostname
    const string hostname = getHostname(startUrl);

    // Iterate while the queue isn't empty
    while (!q.empty()) {
      const string currUrl = q.front();
      q.pop();
      // Iterate over the list of urls in the current page
      for (const string& url : htmlParser.getUrls(currUrl)) {
        // If the url has not been visited before, proceed
        if (seen.count(url) == 0) {
          // If the url has the same hostname, add it to visited and the queue
          if (url.find(hostname) != string::npos) {
            q.push(url);
            seen.insert(url);
          }
        }
      }
    }
    // Return visited urls
    return {begin(seen), end(seen)};
  }

 private:
  string getHostname(const string& url) {
    // The hostname sits between "//" and the next '/'
    const int firstSlash = url.find_first_of('/');
    const int thirdSlash = url.find_first_of('/', firstSlash + 2);
    return url.substr(firstSlash + 2, thirdSlash - firstSlash - 2);
  }
};
```
Python
```python
class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        # Extract hostname from startUrl
        host = startUrl.split("/")[2]
        seen = []

        def dfs(startUrl: str):
            # Add link to seen
            seen.append(startUrl)
            # Check all links in startUrl
            for link in htmlParser.getUrls(startUrl):
                # If they're in the same host but haven't been seen yet, visit them
                if link not in seen and link.split("/")[2] == host:
                    dfs(link)

        dfs(startUrl)

        return seen
```
Java
```java
public class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        // Extract the hostname
        String hostname = startUrl.split("/")[2];
        Set<String> result = new HashSet<>();
        result.add(startUrl);
        crawl(startUrl, hostname, htmlParser, result);
        return new ArrayList<>(result);
    }

    private void crawl(String startUrl, String hostname, HtmlParser htmlParser, Set<String> result) {
        // Extract urls in the current page
        List<String> urls = htmlParser.getUrls(startUrl);
        for (String url : urls) {
            // If the url has the same hostname and hasn't been visited
            if (url.contains(hostname) && !result.contains(url)) {
                // Add url to visited and continue the DFS
                result.add(url);
                crawl(url, hostname, htmlParser, result);
            }
        }
    }
}
```
JavaScript
```javascript
/**
 * @param {string} startUrl
 * @param {HtmlParser} htmlParser
 * @return {string[]}
 */
const crawl = function(startUrl, htmlParser) {
    // Initialize queue and visited set with startUrl
    let queue = [startUrl],
        seen = new Set(queue);
    // Extract hostname of startUrl
    let hostname = getHostname(startUrl);

    while (queue.length !== 0) {
        // Fetch and remove the first url in the queue
        let curr = queue.shift();
        for (const url of htmlParser.getUrls(curr)) {
            // If url is new and has a matching hostname, add it to visited and queue
            if (!seen.has(url) && getHostname(url) === hostname) {
                seen.add(url);
                queue.push(url);
            }
        }
    }
    // Return visited urls
    return Array.from(seen);
};

// Helper function to get the hostname of a url
function getHostname(url) {
    return url.split('/')[2];
}
```
Now that you have a good understanding of how to solve the problem, let's also look at the time complexity of these solutions. Although the exact cost depends on the structure of the URLs and how many links each page contains, it is essentially linear: O(N), where N is the number of valid URLs sharing the startUrl's hostname (plus the total number of links examined across those pages).
Why?
Every URL needs to be processed only once and is kept in a 'visited' set (or list) to prevent reprocessing. Each htmlParser.getUrls call, regardless of the language, involves parsing the HTML of the page behind that URL, which is roughly O(M), where M is the size of the page (e.g., in kilobytes or megabytes). In the LeetCode setting the parser is provided, so each getUrls call can be treated as a black box; in a real crawler, downloading and parsing each page is what dominates the runtime.
In any case, the time complexity, while proportional to N and M, does not involve any quadratic or exponential factors attributable to the algorithm itself. Depending on the specific use case and requirements, the crawling can be sped up further by issuing concurrent requests to fetch multiple pages at the same time, or by caching responses so that pages referenced from several places are only downloaded once. For a website with thousands or tens of thousands of linked URLs, these optimizations can cut the runtime by a significant margin. However, be aware that very high request rates may be interpreted as an attack by the servers of the website being crawled and can result in IP bans.
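As a rough illustration of the concurrency idea, here is a minimal sketch of a level-by-level crawler built on Python's ThreadPoolExecutor. The fetch_links function is a hypothetical placeholder for "download a page and return the links it contains"; it is not part of the problem's HtmlParser interface, and real code would add error handling and rate limiting.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urlparse

def fetch_links(url: str) -> list[str]:
    # Hypothetical placeholder: a real crawler would download the page at
    # `url` and extract the hrefs it contains.
    return []

def crawl_concurrently(start_url: str, max_workers: int = 8) -> list[str]:
    hostname = urlparse(start_url).netloc
    seen = {start_url}
    frontier = [start_url]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            # Fetch the current BFS level in parallel.
            next_frontier = []
            for links in pool.map(fetch_links, frontier):
                for link in links:
                    if link not in seen and urlparse(link).netloc == hostname:
                        seen.add(link)
                        next_frontier.append(link)
            frontier = next_frontier
    return list(seen)
```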
Always keep in mind the ethical aspects and the website's terms of service when building a web crawler. Respect the rules specified in the website's robots.txt file, and do not overload the servers with excessively high request rates.
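For instance, Python's standard library includes a robots.txt parser; a minimal sketch of checking it before crawling a page could look like the following (the user-agent string "MyCrawler/1.0" is just an illustrative value):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("http://news.yahoo.com/robots.txt")
rp.read()  # download and parse the site's robots.txt

url = "http://news.yahoo.com/news/topics/"
if rp.can_fetch("MyCrawler/1.0", url):
    print("robots.txt allows crawling", url)
else:
    print("robots.txt disallows crawling", url)
```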