1236. Web Crawler


Problem Description

The task is to implement a web crawler that crawls all URLs that are under the same hostname as the starting URL. A crawler is typically a program that systematically browses the Web or a website information to create an index of data. In this problem, you are given a starting URL and an HtmlParser interface with a getUrls method that, when called, returns all URLs found on a given webpage.

Here are the key requirements for the crawler:

  • The crawler begins from the page specified by startUrl.
  • It should call the HtmlParser.getUrls(url) method to retrieve all URLs from the webpage at the specified URL.
  • The crawler should prevent visiting the same URL more than once to avoid infinite loops and redundant operations.
  • Only URLs with the same hostname as the startUrl should be crawled. The hostname is the portion of the URL that usually includes the domain name and the top-level domain, but, for the purpose of this problem, it excludes subdomains, protocols (like http), and ports.
  • While the question mentions that URLs with or without a trailing slash are considered different, it doesn't play a significant role in the implementation of the solution provided because the crawled URLs are obtained from the getUrls method, which should handle this distinction.

The description implies storing crawled URLs and making sure that new URLs are only crawled if they belong to the same domain and haven't been crawled before.

Intuition

The intuition behind the solution is to perform a depth-first search (DFS) on the URLs, starting from startUrl. Depth-first search is a recursive algorithm that uses the concept of backtracking. It involves exhaustive searches of all the nodes by going ahead, if possible, else by backtracking. This approach perfectly aligns with the nature of web crawling, where we dive deep into one link (node) before moving to adjacent links.

To implement DFS effectively and adhere to the key requirements, the following aspects are essential:

  • A mechanism to determine the hostname from a given URL, so we can compare whether subsequent URLs belong to the same host as the startUrl.
  • A data structure to keep track of the URLs that have already been visited. A set is ideal in this scenario to avoid duplicates efficiently.
  • A recursive function that performs the DFS. It should:
    • Add the current URL to the list of visited URLs.
    • Retrieve all URLs linked from the current URL using the getUrls method.
    • For each retrieved URL, check if the hostname matches the startUrl's hostname and recurse only if the URL hasn't been visited yet.

This process creates a recursive traversal that explores each URL within the same hostname and builds up the set of all visited URLs, which is the final result.

The provided solution encodes this intuition. The host function extracts the hostname from the URL, the dfs function embodies the depth-first traversal logic, and the ans set maintains the record of visited URLs to ensure each URL is only crawled once.

Learn more about Depth-First Search and Breadth-First Search patterns.

Not Sure What to Study? Take the 2-min Quiz to Find Your Missing Piece:

Given an array of 1,000,000 integers that is almost sorted, except for 2 pairs of integers. Which algorithm is fastest for sorting the array?

Solution Approach

The solution is implemented using a classic DFS algorithm. Let's walk through the implementation step by step.

First, we define a helper function named host that parses a given URL to return its hostname. The function assumes that all URLs begin with "http://" (length 7 characters), so it strips this part before splitting the URL by '/' and returning the first part, which is the hostname:

1def host(url):
2    url = url[7:]
3    return url.split('/')[0]

Next, we define a recursive function dfs that performs the depth-first search. This function takes an url as an argument:

  • It first checks if the URL is already in the set named ans, which is used to store crawled URLs. If the URL is already in the set, the function returns immediately to avoid processing the same URL twice.
  • If the URL is not in the set, it's added to ans.
  • Then, for each URL retrieved by the HtmlParser.getUrls(url) call, the function checks whether each of those URLs shares the same hostname as our original startUrl. If it does, the function is recursively called with the new URL.
1def dfs(url):
2    if url in ans:
3        return
4    ans.add(url)
5    for next_url in htmlParser.getUrls(url):
6        if host(url) == host(next_url):
7            dfs(next_url)

Finally, we define the set ans as an empty set, used to store the unique URLs:

1ans = set()

And we start the DFS crawl from the startUrl:

1dfs(startUrl)

After the DFS is complete, ans contains all the unique URLs that share the same hostname as the startUrl and have been crawled. As per the requirements, the function returns a list of these URLs:

1return list(ans)

The beauty of this approach is its simplicity and effectiveness. Using DFS allows us to deep dive into each path of linked URLs while keeping track of visited URLs through the ans set to avoid revisits, which could lead to infinite loops. Filter URLs with the same hostname ensures that we're adhering to the crawling boundaries set by the problem.

This solution effectively uses basic string manipulation for URL parsing, recursive DFS for navigating through links, and a set data structure to manage visited URLs.

Discover Your Strengths and Weaknesses: Take Our 2-Minute Quiz to Tailor Your Study Plan:

Which algorithm should you use to find a node that is close to the root of the tree?

Example Walkthrough

Let's walk through a small example to illustrate how the solution operates. Suppose we have the following web pages and links between them and are provided with the starting URL "http://example.com/a":

  • "http://example.com/a" links to ["http://example.com/b", "http://example.com/c"]
  • "http://example.com/b" links to ["http://example.com/a"] (back-link to a)
  • "http://example.com/c" links to [] (no links)

The HtmlParser.getUrls method simulates the links as follows for each URL:

1def getUrls(url):
2    links = {
3        "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
4        "http://example.com/b": ["http://example.com/a"],
5        "http://example.com/c": []
6    }
7    return links[url]

Applying the solution:

  1. We begin with the dfs function called on the startUrl which is "http://example.com/a".
  2. First, we use the host function to determine the hostname. For "http://example.com/a", the hostname is "example.com".
  3. Since this URL has not been visited yet, we add it to the ans set.
  4. We call HtmlParser.getUrls("http://example.com/a") and receive two URLs: "http://example.com/b" and "http://example.com/c".
  5. For each of these URLs, we check the hostname using the host function and compare it to our start URL's hostname "example.com". They match, so we proceed to call dfs recursively on these URLs.
  6. The dfs call on "http://example.com/b" adds this URL to the ans set and then calls HtmlParser.getUrls on this URL, which only returns "http://example.com/a". However, since "http://example.com/a" is already in the ans set, it doesn't recurse on it to avoid a loop.
  7. Similarly, the dfs call on "http://example.com/c" adds the URL to the ans set. HtmlParser.getUrls("http://example.com/c") returns an empty list, so there are no further URLs to process.
  8. At the end of this process, the ans set contains {"http://example.com/a", "http://example.com/b", "http://example.com/c"}, which are the unique URLs that were crawled.
  9. We return the list form of the ans set, giving us our final result: ["http://example.com/a", "http://example.com/b", "http://example.com/c"].

This walkthrough demonstrates the DFS-based crawling, visiting each unique URL starting from the startUrl and respecting the same hostname constraint without visiting any URL more than once.

Not Sure What to Study? Take the 2-min Quiz:

Which algorithm should you use to find a node that is close to the root of the tree?

Python Solution

1class Solution:
2    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
3        # Retrieves the hostname from a given URL.
4        def get_host_name(url: str) -> str:
5            # Trim the 'http://' part from the URL and return the hostname part.
6            # Example: "http://foo.bar.com/page" becomes "foo.bar.com"
7            return url[7:].split('/')[0]
8
9        # Performs a depth-first search to crawl all URLs that share the same host as startUrl.
10        def dfs(url: str):
11            # If the URL is already in the set, do not process it again.
12            if url in visited:
13                return
14            # Add the current URL to the visited set.
15            visited.add(url)
16            # Get all URLs from the current page using the HtmlParser API.
17            for next_url in htmlParser.getUrls(url):
18                # If the next URL shares the same host as the start URL, continue the DFS from there.
19                if get_host_name(url) == get_host_name(next_url):
20                    dfs(next_url)
21
22        # Initialize a set to keep track of visited URLs.
23        visited = set()
24        # Start crawling from the starting URL.
25        dfs(startUrl)
26        # Return the list of visited URLs.
27        return list(visited)
28

Java Solution

1import java.util.ArrayList;
2import java.util.HashSet;
3import java.util.List;
4import java.util.Set;
5
6// This is the HtmlParser's API interface.
7// You should not implement it, or speculate about its implementation
8interface HtmlParser {
9    List<String> getUrls(String url);
10}
11
12public class Solution {
13  
14    private Set<String> crawledUrls; // To store unique crawled URLs
15
16    // The crawl method initiates the crawling process from a startUrl
17    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
18        crawledUrls = new HashSet<>();
19        depthFirstSearch(startUrl, htmlParser);
20        return new ArrayList<>(crawledUrls); // Convert the set of URLs to a list before returning
21    }
22  
23    // Helper method to perform depth-first search on the web page's links
24    private void depthFirstSearch(String url, HtmlParser htmlParser) {
25        // Check if the url is already visited to avoid loops
26        if (crawledUrls.contains(url)) {
27            return;
28        }
29      
30        // Mark the current URL as visited
31        crawledUrls.add(url);
32      
33        // Iterate through all URLs obtained from the current page
34        for (String nextUrl : htmlParser.getUrls(url)) {
35            // Only continue crawling if the next URL has the same host as the start URL
36            if (extractHostName(nextUrl).equals(extractHostName(url))) {
37                depthFirstSearch(nextUrl, htmlParser); // Recursively visit the next URL
38            }
39        }
40    }
41
42    // Helper method to extract the host name from a given URL
43    private String extractHostName(String url) {
44        // Removing the protocol ("http://") part of the URL to get the hostname with possible path
45        String urlWithoutProtocol = url.substring(7);
46      
47        // Split the hostname and path and return the hostname part
48        return urlWithoutProtocol.split("/")[0];
49    }
50}
51

C++ Solution

1/**
2 * // This is the HtmlParser's API interface.
3 * // You should not implement it, or speculate about its implementation
4 * class HtmlParser {
5 *   public:
6 *     vector<string> getUrls(string url);
7 * };
8 */
9
10class Solution {
11public:
12    vector<string> crawledUrls;             // List of all crawled URLs
13    unordered_set<string> visitedUrls;      // Set to keep track of visited URLs
14
15    // Entry function to start crawling from the given startUrl
16    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
17        dfsCrawl(startUrl, htmlParser);
18        return crawledUrls;
19    }
20
21    // Helper function to perform DFS crawl starting from url using htmlParser
22    void dfsCrawl(string& url, HtmlParser& htmlParser) {
23        // If the URL has already been visited, do nothing
24        if (visitedUrls.count(url)) {
25            return;
26        }
27      
28        // Mark the url as visited
29        visitedUrls.insert(url);
30        // Add the url to the list of crawled URLs
31        crawledUrls.push_back(url);
32      
33        // For each URL found in the current URL's page
34        for (string nextUrl : htmlParser.getUrls(url)) {
35            // Crawl the next URL if it is of the same host
36            if (extractHostName(url) == extractHostName(nextUrl)) {
37                dfsCrawl(nextUrl, htmlParser);
38            }
39        }
40    }
41
42    // Helper function to extract the hostname from a given URL
43    string extractHostName(string url) {
44        int hostNameStartIndex = strlen("http://"); // Start index for host name
45        string hostName; // Placeholder for the extracted host name
46      
47        // Start from the initial index past "http://" and read until '/'
48        for (size_t index = hostNameStartIndex; index < url.size(); ++index) {
49            if (url[index] == '/') {
50                break;
51            }
52            hostName += url[index];
53        }
54      
55        return hostName;
56    }
57};
58

Typescript Solution

1/**
2 * This represents the HtmlParser's API interface.
3 * Detail implementation of `getUrls` method is not given.
4 */
5interface HtmlParser {
6  getUrls(url: string): string[];
7}
8
9// Array to hold all the crawled URLs
10const crawledUrls: string[] = [];
11// Set to keep track of visited URLs
12const visitedUrls: Set<string> = new Set();
13
14/**
15 * Initiates the crawl starting from the given `startUrl` and using the
16 * provided `htmlParser` to find other URLs to crawl.
17 * @param startUrl The starting URL for the crawl process.
18 * @param htmlParser The HtmlParser instance to retrieve URLs on a page.
19 * @returns The list of URLs crawled.
20 */
21function crawl(startUrl: string, htmlParser: HtmlParser): string[] {
22  dfsCrawl(startUrl, htmlParser);
23  return crawledUrls;
24}
25
26/**
27 * Recursively crawls URLs starting from the `url` using Depth-First Search.
28 * It uses the `htmlParser` to get connected URLs and only continues with
29 * those that have the same hostname.
30 * @param url The current URL to crawl.
31 * @param htmlParser HtmlParser instance to get adjacent URLs.
32 */
33function dfsCrawl(url: string, htmlParser: HtmlParser): void {
34  // If the URL has already been visited, stop the crawl for this URL
35  if (visitedUrls.has(url)) {
36    return;
37  }
38
39  // Mark the URL as visited
40  visitedUrls.add(url);
41  // Add the URL to the list of crawled URLs
42  crawledUrls.push(url);
43
44  // Retrieve all URLs found in the current page and crawl them
45  const urlsFromPage = htmlParser.getUrls(url);
46  for (const nextUrl of urlsFromPage) {
47    // Continue the crawl on the next URL if it has the same host as the start URL
48    if (extractHostName(url) === extractHostName(nextUrl)) {
49      dfsCrawl(nextUrl, htmlParser);
50    }
51  }
52}
53
54/**
55 * Extracts the hostname from the given URL string.
56 * @param url The URL from which the hostname is to be extracted.
57 * @returns The hostname as a string.
58 */
59function extractHostName(url: string): string {
60  const hostNameStartIndex = "http://".length; // Start index for hostname
61  let hostName: string = ""; // Initialize extracted hostname
62
63  // Loop to extract the host name portion of the URL
64  for (let index = hostNameStartIndex; index < url.length; index++) {
65    if (url[index] === '/') {
66      break; // Stop at the first '/' after the protocol indicator
67    }
68    hostName += url[index];
69  }
70
71  return hostName;
72}
73
Fast Track Your Learning with Our Quick Skills Quiz:

The three-steps of Depth First Search are:

  1. Identify states;
  2. Draw the state-space tree;
  3. DFS on the state-space tree.

Time and Space Complexity

The given Python function implements a Depth-First Search (DFS) to crawl web pages. Here's the analysis of the time complexity and space complexity:

Time Complexity

The time complexity of the code is governed by the number of URLs that are visited and the number of out-links that each URL contains. If we assume there are N unique URLs and each URL could have at most D out-links we will visit:

  • Each URL is processed once, and for each URL, the getUrls method of HtmlParser could be called. This operation is O(1) for each URL since we are just processing the URL and not actually parsing HTML (parsing in reality would depend on the HTML content size).
  • However, each URL could link out to D other URLs. In the worst case where all URLs point to all other URLs, each edge (link) would effectively be visited once.

Therefore, the time complexity is O(N + N * D), which simplifies to O(ND) as each URL is expanded once and then each out-link is explored.

Space Complexity

The space complexity comprises the memory required for the recursive call stack when performing DFS as well as the memory required to store the set of visited URLs:

  • The recursive DFS could go as deep as N if there is a URL that leads to a long chain of other URLs. This contributes O(N) to the space complexity.
  • The ans set is used to store unique URLs that are visited. In the worst-case scenario where all URLs are accessible from the starting URL, this set contributes O(N).

Combining the two gives us O(2N), which simplifies to O(N) space complexity.

In summary:

  • Time Complexity: O(ND)
  • Space Complexity: O(N)

Learn more about how to find time and space complexity quickly.


Recommended Readings


Got a question? Ask the Teaching Assistant anything you don't understand.

Still not clear? Ask in the Forum,  Discord or Submit the part you don't understand to our editors.


TA 👨‍🏫