1242. Web Crawler Multithreaded


Problem Explanation

Our task is to build a web crawler, in any programming language, using the provided HtmlParser interface. The crawler must be multi-threaded and must visit every link under the same hostname as startUrl. Multi-threaded means the design should fetch pages on several threads simultaneously rather than one page at a time.

The constraints on what a hostname can look like are an important clue: every URL uses the http protocol with no port, so the hostname is easy to pick out of the URL string. Our input consists of a list of urls and a startUrl to begin crawling from.
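All of the solutions below lean on that constraint to extract the hostname: split the URL on '/' and take the third piece. Here is a quick illustration in Python, using a URL from the example below:

```python
url = "http://news.yahoo.com/news"
hostname = url.split('/')[2]   # ['http:', '', 'news.yahoo.com', 'news'] -> 'news.yahoo.com'
print(hostname)                # news.yahoo.com
```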

Here is a step-by-step visual explanation of how to approach this problem:

Let's consider this simple example input:

urls = [ "http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.google.com" ]

startUrl = "http://news.yahoo.com"

First, our crawler starts from startUrl ("http://news.yahoo.com") and fetches its links using HtmlParser.getUrls(url). Every link under the same hostname is pushed onto the queue for further crawling, so after this step the result contains "http://news.yahoo.com" and "http://news.yahoo.com/news"; links to other hostnames, such as "http://news.google.com", are ignored.

Then, working in parallel, the crawler takes the next URL from the queue ("http://news.yahoo.com/news") and fetches its links. That page links to no new URL under the same hostname, so nothing more is added to the queue.

In the end, our final output is ["http://news.yahoo.com", "http://news.yahoo.com/news"].
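To make the walkthrough concrete, here is the plain single-threaded BFS that the multithreaded solutions below build on. This is a minimal Python sketch; `get_urls` is a stand-in for the HtmlParser.getUrls call, not part of the actual interface.

```python
from collections import deque

def crawl_single_threaded(start_url, get_urls):
    # get_urls(url) stands in for htmlParser.getUrls(url)
    hostname = start_url.split('/')[2]
    visited = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for next_url in get_urls(url):
            # Keep only links on the same hostname that we have not seen yet
            if next_url.split('/')[2] == hostname and next_url not in visited:
                visited.add(next_url)
                queue.append(next_url)
    return list(visited)
```

Run on the example above, this yields ["http://news.yahoo.com", "http://news.yahoo.com/news"], with any link to "http://news.google.com" filtered out because its hostname differs. The only drawback is that pages are fetched one at a time, which is exactly what the multithreaded versions fix.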

Solution

The problem can be solved with a Breadth-First Search (BFS) that runs concurrently on several threads: spin up a pool of worker threads, and have each one repeatedly take a URL from a shared queue, fetch its links, and enqueue any unseen link on the same hostname. A shared visited set guarantees that no URL is crawled twice.
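Here is a minimal sketch of that worker pattern in Python. It is not one of the official solutions below: `get_urls` is a stand-in for HtmlParser.getUrls, the thread count is an arbitrary choice, and a pending-work counter plus a condition variable tell idle workers when the crawl is truly finished.

```python
import threading
from collections import deque

def crawl_concurrently(start_url, get_urls, num_threads=8):
    hostname = start_url.split('/')[2]
    visited = {start_url}
    queue = deque([start_url])
    pending = [0]                      # URLs taken from the queue but not finished yet
    cond = threading.Condition()

    def worker():
        while True:
            with cond:
                # Sleep while the queue is empty but other workers may still add URLs.
                while not queue and pending[0] > 0:
                    cond.wait()
                if not queue:          # queue empty and nothing in flight: all done
                    return
                url = queue.popleft()
                pending[0] += 1
            for next_url in get_urls(url):          # fetch outside the lock
                if next_url.split('/')[2] != hostname:
                    continue
                with cond:
                    if next_url not in visited:
                        visited.add(next_url)
                        queue.append(next_url)
                        cond.notify()               # wake a sleeping worker
            with cond:
                pending[0] -= 1
                cond.notify_all()                   # let idle workers re-check for exit

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return list(visited)
```

The exit condition matters: a worker only stops when the queue is empty and no other worker is still mid-fetch, because exiting on an empty queue alone could end the crawl before in-flight pages have had a chance to enqueue new links.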

Let's look at the solutions for different languages.

C# Solution

```csharp
public class Solution {
    public IList<string> Crawl(string startUrl, HtmlParser htmlParser) {
        HashSet<string> ans = new HashSet<string>();
        string PREFIX = startUrl.Split("/")[2];   // hostname of the start URL
        Queue<string> queue = new Queue<string>();
        queue.Enqueue(startUrl);
        ans.Add(startUrl);
        // Spin up several workers; each runs the same BFS loop over the shared queue.
        Parallel.ForEach(Enumerable.Range(0, Environment.ProcessorCount * 2), _ => {
            while (true) {
                string current = null;
                lock (queue) {
                    if (queue.Count == 0) return;
                    current = queue.Dequeue();
                }
                foreach (var nextUrl in htmlParser.GetUrls(current)) {
                    // Skip URLs on a different hostname.
                    if (!nextUrl.Split("/")[2].Equals(PREFIX)) {
                        continue;
                    }
                    lock (ans) {
                        // HashSet.Add returns false if the URL was already seen.
                        if (!ans.Add(nextUrl)) {
                            continue;
                        }
                    }
                    lock (queue) {
                        queue.Enqueue(nextUrl);
                    }
                }
            }
        });
        return ans.ToList();
    }
}
```

Python Solution

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        hostname = startUrl.split('/')[2]
        visited = {startUrl}
        lock = threading.Lock()

        def fetch(url):
            """Fetch one page and return its not-yet-visited same-host links."""
            new_urls = []
            for next_url in htmlParser.getUrls(url):
                if next_url.split('/')[2] != hostname:
                    continue
                with lock:
                    if next_url in visited:
                        continue
                    visited.add(next_url)
                new_urls.append(next_url)
            return new_urls

        # BFS level by level: every URL of the current level is fetched in parallel.
        with ThreadPoolExecutor(max_workers=16) as executor:
            level = [startUrl]
            while level:
                futures = [executor.submit(fetch, url) for url in level]
                level = [url for f in futures for url in f.result()]
        return list(visited)
```

Java Solution

```java
class Solution {
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        Set<String> result = ConcurrentHashMap.newKeySet();
        String hostname = getHostname(startUrl);

        ExecutorService executor = Executors.newFixedThreadPool(64);
        result.add(startUrl);
        crawl(result, startUrl, hostname, executor, htmlParser);
        executor.shutdown();

        try {
            executor.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        return new ArrayList<>(result);
    }

    private String getHostname(String url) {
        // "http://" is 7 characters long, so the hostname ends at the next '/'.
        int idx = url.indexOf('/', 7);
        return (idx != -1) ? url.substring(0, idx) : url;
    }

    private void crawl(Set<String> result, String start, String hostname, ExecutorService executor, HtmlParser htmlParser) {
        List<Future<?>> futures = new ArrayList<>();
        for (String url : htmlParser.getUrls(start)) {
            // result.add(...) is atomic, so each new URL is submitted exactly once.
            if (getHostname(url).equals(hostname) && result.add(url)) {
                futures.add(executor.submit(() -> crawl(result, url, hostname, executor, htmlParser)));
            }
        }
        for (Future<?> f : futures) {
            try {
                f.get();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}
```

JavaScript Solution

```javascript
/**
 * @param {string} startUrl
 * @param {HtmlParser} htmlParser
 * @return {string[]}
 */
var crawl = function(startUrl, htmlParser) {
    // Initialize host and visited set
    const host = startUrl.split("/")[2];
    const visited = new Set([startUrl]);

    // Crawler worker: visit a URL once, then recurse into its links
    const worker = (url) => {
        // Check if URL has been visited
        if (visited.has(url)) {
            return;
        }
        // Check if URL is on the same host
        if (url.split("/")[2] !== host) {
            return;
        }
        // Add to visited, then crawl the page's links
        visited.add(url);
        htmlParser.getUrls(url).forEach(worker);
    };

    // Crawl every link reachable from the start page
    htmlParser.getUrls(startUrl).forEach(worker);

    // Return final result from visited set
    return [...visited];
};
```

Please note that JavaScript does not support native multi-threading (although worker threads are available in Node.js). This is a single-threaded solution but it explains the basic idea.

C++ Solution

```cpp
class Solution {
public:
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        queue<string> q{{startUrl}};
        unordered_set<string> seen{{startUrl}};
        string hostname = getHostname(startUrl);

        vector<thread> threads;
        mutex mtx;
        condition_variable cv;

        auto worker = [&]() {
            while (true) {
                unique_lock<mutex> lock(mtx);
                // Wait up to 30ms for new work; a worker that stays idle then exits.
                cv.wait_for(lock, 30ms, [&]() { return !q.empty(); });

                if (q.empty())
                    return;

                auto url = q.front(); q.pop();

                lock.unlock();
                auto urls = htmlParser.getUrls(url);   // fetch outside the lock

                for (const auto& nextUrl : urls) {
                    if (getHostname(nextUrl) == hostname) {
                        lock_guard<mutex> guard(mtx);
                        if (seen.insert(nextUrl).second)
                            q.push(nextUrl);
                    }
                }

                lock.lock();
                cv.notify_all();   // wake workers waiting for new URLs
            }
        };

        for (unsigned i = 0; i < thread::hardware_concurrency(); ++i)
            threads.emplace_back(worker);

        for (auto& t : threads) t.join();

        return {seen.begin(), seen.end()};
    }
private:
    string getHostname(const string& url) {
        // The hostname runs from the start of the URL to the first '/' after "http://".
        return url.substr(0, url.find('/', 7));
    }
};
```

These solutions use multithreading to parallelize the crawl: many pages under the same hostname can be fetched concurrently instead of waiting for each page to finish loading before moving on to the next one.

The C++ solution follows the same overall approach as the others, but adds a condition variable and a mutex for thread synchronization. The queue holds the URLs that still need to be visited; when a thread finishes its current task, it waits (cv.wait_for(...)) for more URLs to be added to the queue. If the queue is not empty, it pops a URL and continues. Without a timeout, a thread would hang indefinitely once all URLs have been visited, so a 30 ms timeout is used: a worker that still finds the queue empty after waiting simply exits.

The number of threads is chosen dynamically based on the hardware (thread::hardware_concurrency()), so the crawler uses as many threads as the machine supports; a high-end server will spawn more workers than a home PC.

Every thread runs the same worker function, and a mutex guards every operation that modifies the shared queue and visited set, so they are never corrupted by concurrent access.

In conclusion, web crawling is a powerful way to retrieve data from many pages under a particular host, and multi-threading saves both time and resources. By implementing the crawler in Python, JavaScript, Java, C#, and C++, and by understanding the differences and limitations of each language's threading model, developers can pick the right tool for the job. Whether you're scraping websites for data analysis or indexing pages for a search engine, these solutions are a solid starting point for an efficient crawler.
