Leetcode 1236. Web Crawler

Problem Explanation

Here we are asked to implement a basic Web Crawler. A web crawler is a bot or a script that systematically browses the world wide web in an automated manner. They are mainly used to create a copy of all the visited pages for later processing. One can use a web crawler for various purposes, from analysing the data on those pages to indexing these pages to make them searchable.

Our program must input a start url and then find all links contained in that url's page that are of the same hostname as the start url. It should continually follow those links to find more links and so on until every page is visited and whose hostname matches that of the start url. A hostname of a url is everything after the http:// or https:// up to the trailing slash '/'.

As an example, given the following urls:

urls = [ "http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.yahoo.com/news/topics/", "http://news.google.com", "http://news.yahoo.com/us" ]

and the startUrl = "http://news.yahoo.com/news/topics/", our program will fetch all pages in these urls that have the hostname "news.yahoo.com" which are all the pages that do not include "news.google.com". The program ignores the "news.google.com" url and returns all urls under "news.yahoo.com"

Solution Approach

In our solution, we implement a Bread-first search (BFS) traversal algorithm using a queue and an unordered set to keep track of visited sites and avoid visiting duplicates. We first add the start url to the queue and as we process each url from the queue, we get all the links in the current url's page using the HtmlParser's getUrls function, for each link, we check if its hostname matches that of the start url and if it is not a duplicate link (not visited before). If it is a matching and new link, we add the url to the queue for processing and add to the visited set so we do not process it again in case it appears in another page.

C++ solution

1
2cpp
3public:
4  vector<string> crawl(string startUrl, HtmlParser htmlParser) {
5    // Initialize a queue with the start url
6    queue<string> q{{startUrl}};
7    // Initialize a set to keep track of visited urls
8    unordered_set<string> seen{{startUrl}};
9    // Extract the hostname
10    const string& hostname = getHostname(startUrl);
11
12    // iterate while queue isn't empty
13    while (!q.empty()) {
14      const string currUrl = q.front();
15      q.pop();
16      // Iterate over the list of urls in the current page
17      for (const string& url : htmlParser.getUrls(currUrl)) {
18        // If url is not visited before proceed
19        if (seen.count(url) == 0)
20          // If url has the same hostname then add it to visited and queue
21          if (url.find(hostname) != string::npos) {
22            q.push(url);
23            seen.insert(url);
24          }
25       }
26    }
27    // Return visited urls
28    return {begin(seen), end(seen)};
29  }
30
31private:
32  string getHostname(const string& url) {
33    // Extract the hostname 
34    const int firstSlash = url.find_first_of('/');
35    const int thirdSlash = url.find_first_of('/', firstSlash + 2);
36    return url.substr(firstSlash + 2, thirdSlash - firstSlash - 2);
37  }
38});

Python

1
2python
3class Solution:
4    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
5        
6        # extract hostname from startUrl
7        host = startUrl.split("/")[2]
8        seen = []
9
10        def dfs(startUrl: str):
11            # add link to seen
12            seen.append(startUrl)
13            # check all links in startUrl.
14            for link in htmlParser.getUrls(startUrl):
15                # if they're in the same host but haven't been seen yet, visit them
16                if link not in seen and link.split("/")[2] == host:
17                    dfs(link)
18
19        dfs(startUrl)
20
21        return seen

Java

1
2java
3public class Solution {
4    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
5        // Extract the hostname
6        String hostname = startUrl.split("/")[2];
7        Set<String> result = new HashSet<>();
8        result.add(startUrl);
9        crawl(startUrl, hostname, htmlParser, result);
10        return new ArrayList<>(result);
11    }
12
13    private void  crawl(String startUrl, String hostname, HtmlParser htmlParser, Set<String> result) {
14        // Extract urls in current page
15        List<String> urls = htmlParser.getUrls(startUrl);
16        for(String url: urls) {
17            // If the url has the same hostname and hasn't been visited
18            if(url.contains(hostname) && !result.contains(url)) {
19                // Add url to visited and continue dfs
20                result.add(url);
21                crawl(url, hostname, htmlParser, result);
22           }
23        }
24    }
25}

JavaScript

1
2javascript
3/**
4 * @param {string} startUrl
5 * @param {HtmlParser} htmlParser
6 * @return {string[]}
7 */
8
9const crawl = function(startUrl, htmlParser) {
10    // Initialize queue and visited set with startUrl
11    let queue = [startUrl],
12        seen = new Set(queue);
13    // Extract hostname of startUrl
14    let hostname = getHostname(startUrl);
15
16    while(queue.length != 0) {
17        // Fetch and remove the first url in queue
18        let curr = queue.shift();
19        for (const url of htmlParser.getUrls(curr)) {
20            // If url is new and has matching hostname, add it to visited and queue
21            if (!seen.has(url) && getHostname(url) === hostname) {
22                seen.add(url);
23                queue.push(url);
24            }
25        }
26    }
27    // Return visited urls
28    return Array.from(seen);
29};
30
31// Helper function to get hostname of url
32function getHostname(url){
33    let hostname = url.split('/')[2];
34    return hostname;
35}

Now that you have a good understanding of how to solve the problem, let's also understand the time complexity of these solutions. Although the time complexity can vary based on the structure of the URLs and the number of URLs in total, it will largely be linear (O(N)), where N is the number of valid URLs having the same hostname as the startUrl.

Why?

Every URL needs to be processed only once and kept in a 'visited' set or list to prevent reprocessing. The 'getUrls' function call, whether it is performed using Python's htmlParser.getUrls, JavaScript's htmlParser.getUrls or Java's htmlParser.getUrls, will involve parsing the webpage's HTML associated with the URL and would theoretically involve a time complexity of O(M) where M is the size of the webpage (e.g., in kilobytes, megabytes). However, in practice, as large parts of HTML are likely to be similar across different web pages of the same website, the parsing can be faster than O(M).

In any case, the time complexity, while proportional to N and M, does not involve any quadratic or exponential factors attributable to the algorithm itself. Depending on the specific use-case and requirements, it could be possible to optimize the web-page crawling even further by using concurrent requests to fetch multiple webpages at the same time, or by caching commonly found parts of the website's HTML structure to speed up the parsing. For a website having thousands or tens of thousands of linked URLs, these optimizations can reduce the runtime by a significant margin. However, be aware that such high request rates will likely be seen as an attack by the servers of the website being crawled, and could result in IP bans.

Always keep in mind the ethical aspects and terms of service use when building a web crawler. Respect the rules specified in the website's robots.txt file, and do not overload the servers with excessively high request rates.


Got a question? Ask the Teaching Assistant anything you don't understand.

Still not clear? Ask in the Forum,  Discord or Submit the part you don't understand to our editors.


TA 👨‍🏫