1242. Web Crawler Multithreaded
Problem Explanation
Our task is to build a web crawler in any programming language, using the provided HtmlParser interface. We're supposed to build a multi-threaded web crawler that can crawl through all links under the same hostname as the startUrl
. By multi-threaded, it means that we need to design a solution that can work on multiple threads simultaneously and fetch the pages, rather than fetching one by one.
We're given some constraints on what hostname can be, which is an important clue about the problem. In our input, we have multiple urls, and a query/url to start with.
Here is a step-by-step visual explanation of how to approach this problem:
Let's consider this simple example input:
urls = [ "http://news.yahoo.com", "http://news.yahoo.com/news", "http://news.google.com" ]
startUrl = "http://news.yahoo.com"
First, our crawler starts from startUrl which is "http://news.yahoo.com" and fetches all its urls using HtmlParser.getUrls(url). If there are more urls under the same host, it will push them in the queue for further crawling. So the output will be: "news.yahoo.com" "news.yahoo.com/news"
Then, the crawler, in parallel will consider next url in queue and fetches its urls. As there is no url, it will not add anything to the queue.
In the end, our final output will be ["news.yahoo.com", "news.yahoo.com/news"]
Solution
The problem can be solved by implementing a Breadth-First Search (BFS) algorithm running concurrently in different threads. At the start, spin up the threads, then use BFS on each thread. We ensure that we don't visit the same URL twice by keeping track of URLs in a visited set.
Let's look at the solutions for different languages.
C## Solution
1
2csharp
3public class Solution {
4 public IList<string> Crawl(string startUrl, HtmlParser htmlParser) {
5 HashSet<string> ans = new HashSet<string>();
6 string PREFIX = startUrl.Split("/")[2];
7 Queue<string> queue = new Queue<string>();
8 queue.Enqueue(startUrl);
9 ans.Add(startUrl);
10 Parallel.ForEach(Enumerable.Range(0, Environment.ProcessorCount * 2), _ => {
11 while (true) {
12 string current = null;
13 lock (queue) {
14 if (queue.Count == 0) return;
15 current = queue.Dequeue();
16 }
17 foreach (var nextUrl in htmlParser.GetUrls(current)) {
18 if (!nextUrl.Split("/")[2].Equals(PREFIX) || ans.Contains(nextUrl)) {
19 continue;
20 };
21 lock(ans) {
22 ans.Add(nextUrl);
23 }
24 lock(queue) {
25 queue.Enqueue(nextUrl);
26 }
27 };
28 }
29 });
30 return ans.ToList();
31 }
32}
Python Solution
1 2python 3class Solution: 4 def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]: 5 import threading 6 visited = set() 7 visited_lock = threading.Lock() 8 crawl = [False] 9 crawl_lock = threading.Condition() 10 11 def worker(url, htmlParser): 12 nonlocal crawl 13 while True: 14 my_html = None 15 with visited_lock: 16 for html in htmlParser.getUrls(url): 17 hostname = html.split('/')[2] 18 if hostname == startUrl.split('/')[2] and html not in visited: 19 visited.add(html) 20 my_html = html 21 break 22 if my_html != None: 23 worker(my_html, htmlParser) 24 else: 25 with crawl_lock: 26 crawl[0] = crawl[0] - 1 27 if crawl[0] == 0: 28 crawl_lock.notify_all() 29 crawl_lock.wait() 30 31 with crawl_lock: 32 with visited_lock: 33 visited.add(startUrl) 34 crawl[0] = threading.active_count() 35 threading.Thread(target=worker, args=[startUrl, htmlParser]).start() 36 crawl_lock.wait() 37 return list(visited)
Java Solution
1
2java
3class Solution {
4 public List<String> crawl(String startUrl, HtmlParser htmlParser) {
5 Set<String> result = ConcurrentHashMap.newKeySet();
6 String hostname = getHostname(startUrl);
7
8 ExecutorService executor = Executors.newFixedThreadPool(64);
9 result.add(startUrl);
10 crawl(result, startUrl, hostname, executor, htmlParser);
11 executor.shutdown();
12
13 try {
14 executor.awaitTermination(10, TimeUnit.SECONDS);
15 } catch (InterruptedException e) {
16 e.printStackTrace();
17 }
18
19 return new ArrayList<>(result);
20 }
21
22 private String getHostname(String url) {
23 int idx = url.indexOf('/', 7);
24 return (idx != -1) ? url.substring(0, idx) : url;
25 }
26
27 private void crawl(Set<String> result, String start, String hostname, ExecutorService executor, HtmlParser htmlParser) {
28 List<Future> futures = new ArrayList<>();
29 for (String url : htmlParser.getUrls(start)) {
30 if (url.startsWith(hostname) && result.add(url)) {
31 futures.add(executor.submit(() -> crawl(result, url, hostname, executor,htmlParser)));
32 }
33 }
34 for (Future f : futures) {
35 try {
36 f.get();
37 } catch (Exception e) {
38 e.printStackTrace();
39 }
40 }
41 }
42}
JavaScript Solution
1
2javascript
3// Import `HtmlParser` interface
4
5// Initialize list, host, visited, and queue
6let list = htmlParser.getUrls(startUrl)
7let host = startUrl.split("/")[2];
8let visited = new Set([startUrl])
9let queue = [startUrl]
10
11
12// Crawler worker
13const worker=(url)=>{
14 // Check if URL has been visited
15 if (visited.has(url)){
16 return
17 }
18 // Check if URL is same host
19 if(url.indexOf(`http://${host}`)!== 0){
20 return
21 }
22 // Add to visited
23 visited.add(url);
24 let urls = htmlParser.getUrls(url)
25 urls.forEach(worker)
26}
27
28// Iterate over each list item
29for (let i=0; i<list.length;i++){
30 worker(list[i])
31}
32
33// Return final result from visited set
34return [...visited]
35
Please note that JavaScript does not support native multi-threading (although worker threads are available in Node.js). This is a single-threaded solution but it explains the basic idea.
C++ Solution
1
2cpp
3class Solution {
4public:
5 vector<string> crawl(string startUrl, HtmlParser htmlParser) {
6 queue<string> q{{startUrl}};
7 unordered_set<string> seen{{startUrl}};
8 string hostname = getHostname(startUrl);
9
10 vector<thread> threads;
11 mutex mtx;
12 condition_variable cv;
13
14 auto worker = [&]() {
15 while (true) {
16 unique_lock<mutex> lock(mtx);
17 cv.wait_for(lock, 30ms, [&]() { return q.size(); });
18
19 if (q.empty())
20 return;
21
22 auto url = q.front(); q.pop();
23 auto urls = htmlParser.getUrls(url);
24
25 lock.unlock();
26
27 for (const auto& url : urls) {
28 if (url.find(hostname) != string::npos) {
29 lock_guard<mutex> lock(mtx);
30 if (seen.insert(url).second)
31 q.push(url);
32 }
33 }
34
35 lock.lock();
36 cv.notify_all();
37 }
38 };
39
40 for (int i = 0; i < thread::hardware_concurrency(); ++i)
41 threads.emplace_back(worker);
42
43 for (auto& t : threads) t.join();
44
45 return {seen.begin(), seen.end()};
46 }
47private:
48 string getHostname(const string& url) {
49 return url.substr(0, url.find('/', 7));
50 }
51};
These solutions effectively use multithreading to provide a fast and parallelized web crawler. With these implementations, you can efficiently crawl millions of web pages under the same hostname without the need to wait for a single page to finish loading before moving onto the next one.This C++ solution uses a similar approach to the solutions above, however, it adds in condition variables and mutexes to allow for more efficient thread synchronization. The queue data structure is used to store the urls that need to be visited and once a thread is done with its task, it waits (cv.wait_for(...)) for more urls to be added to the queue. If the queue is not empty, it pops a url and continues its task. If all the urls have been visited, the thread will hang indefinitely (if no timeout was specified), therefore a timer (30ms) is added to automatically exit the thread when its inactive for a certain period.
This setup allows for a dynamic number of threads to be created, based on the hardware used (thread::hardware_concurrency()). This results in utilizing the maximum potential of the machine by using the maximum possible threads that the hardware supports. Therefore you will get different number of threads for different machines (high-end server vs home PC).
Every thread runs the same worker function, and to ensure that the shared queue and set (containing the URLs) are not corrupted by multiple threads accessing them simultaneously, mutex locks are used over the operations which modify them.
In conclusion, web crawling is a powerful tool in retrieving data from multiple web pages under a particular host, saving both time and resources by using multi-threaded programming. By implementing it in different languages like Python, JavaScript, Java, C#, C++ and by understanding the differences and limitations of each language's thread handling, developers can effectively use the right tool for the right job. Whether you're trying to scrape websites for data analysis, or to index web pages for a search engine, these solutions will help you to design efficient web crawling algorithms.
Ready to land your dream job?
Unlock your dream job with a 2-minute evaluator for a personalized learning plan!
Start EvaluatorIs the following code DFS or BFS?
1void search(Node root) { 2 if (!root) return; 3 visit(root); 4 root.visited = true; 5 for (Node node in root.adjacent) { 6 if (!node.visited) { 7 search(node); 8 } 9 } 10}
Recommended Readings
Patterns The Shortest Path Algorithm for Coding Interviews The goal of AlgoMonster is to help you get a job in the shortest amount of time possible in a data driven way We compiled datasets of tech interview problems and broke them down by patterns This way we can determine the
Recursion Recursion is one of the most important concepts in computer science Simply speaking recursion is the process of a function calling itself Using a real life analogy imagine a scenario where you invite your friends to lunch https algomonster s3 us east 2 amazonaws com recursion jpg You first
Runtime Overview When learning about algorithms and data structures you'll frequently encounter the term time complexity This concept is fundamental in computer science and offers insights into how long an algorithm takes to complete given a certain input size What is Time Complexity Time complexity represents the amount of time
Got a question?ย Ask the Monster Assistantย anything you don't understand.
Still not clear? ย Submitย the part you don't understand to our editors. Or join ourย Discord and ask the community.
Won't the program get stuck if the depth is more than 64 as all the threads get occupied and you push new callables to the same task queue?