
1236. Web Crawler πŸ”’

Problem Description

This problem asks you to implement a web crawler that explores web pages starting from a given URL and collects all URLs that belong to the same hostname.

The web crawler should follow these rules:

  1. Start from a given URL (startUrl) and explore linked pages from there.

  2. Use the HtmlParser interface to get URLs from any webpage by calling HtmlParser.getUrls(url), which returns a list of all URLs found on that page.

  3. Stay within the same hostname - only crawl URLs that have the same hostname as the starting URL. For example, if you start at http://example.org/news, you can crawl http://example.org/sports but not http://another.com/page.

  4. Avoid duplicates - each unique URL should be visited only once.

  5. Return all discovered URLs that belong to the same hostname, in any order.

The hostname is the domain part of the URL. For instance, in http://leetcode.com/problems, the hostname is leetcode.com. The crawler should treat URLs with and without trailing slashes as different (e.g., http://site.com and http://site.com/ are considered different URLs).
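
As a quick illustration, here is a minimal sketch (Python, with a hypothetical extract_hostname helper) of how the hostname rule and the trailing-slash rule play out:

def extract_hostname(url: str) -> str:
    # Strip the "http://" prefix and keep everything before the first '/'
    return url[len("http://"):].split("/")[0]

# Same hostname, so this link is crawlable from http://example.org/news
assert extract_hostname("http://example.org/news") == extract_hostname("http://example.org/sports")

# Different hostname, so this link is skipped
assert extract_hostname("http://example.org/news") != extract_hostname("http://another.com/page")

# Trailing slash matters: the crawler treats these as two distinct URLs
assert len({"http://site.com", "http://site.com/"}) == 2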

The solution uses a depth-first search (DFS) approach:

  • It extracts the hostname from a URL by removing the http:// prefix (7 characters) and splitting by / to get the domain
  • It maintains a set ans to track visited URLs and avoid duplicates
  • For each URL, it gets all linked URLs using the HtmlParser, filters for same hostname, and recursively visits unvisited ones
  • Finally, it returns all collected URLs as a list

Flowchart Walkthrough

First, let's pin down the algorithm using the Flowchart. Here's a step-by-step walkthrough:

Is it a graph?

  • Yes: The web pages and their links form a graph structure. Each URL is a node, and the links between pages are edges. When a page contains links to other pages, those represent directed edges in the graph.

Is it a tree?

  • No: The web structure is not a tree because:
    • Pages can have cycles (Page A can link to Page B, and Page B can link back to Page A)
    • Multiple pages can link to the same page (multiple parents)
    • There's no strict hierarchical parent-child relationship

Is the problem related to directed acyclic graphs?

  • No: The web graph can contain cycles. Websites commonly have circular references where pages link back to each other.

Is the problem related to shortest paths?

  • No: We're not looking for the shortest path between pages. The goal is to explore and collect all reachable URLs within the same hostname.

Does the problem involve connectivity?

  • No: We're not checking if nodes are connected or finding connected components. We're performing an exhaustive exploration within a constrained domain (same hostname).

Does the problem have small constraints?

  • Yes: The problem is constrained to URLs within the same hostname, which typically limits the search space. Also, we need to explore all possible paths from the starting URL.

DFS/backtracking?

  • Yes: DFS is the appropriate choice because:
    • We need to explore all reachable pages systematically
    • We mark visited pages to avoid infinite loops (cycles)
    • We recursively explore each unvisited link before backtracking
    • The traversal naturally follows the link structure depth-first

Conclusion: The flowchart correctly leads us to use a DFS (Depth-First Search) approach for the web crawler problem. DFS allows us to systematically explore all pages within the same hostname while keeping track of visited URLs to avoid revisiting them.

Intuition

Think of web crawling like exploring a maze of interconnected rooms. You start in one room (the starting URL) and can see doors leading to other rooms (links to other pages). Your goal is to visit every room in your building (same hostname) exactly once, while ignoring doors that lead to other buildings (different hostnames).

The key insight is recognizing this as a graph traversal problem. Each webpage is a node, and links between pages are directed edges. Since we need to explore all reachable pages within a domain, we need a systematic way to:

  1. Visit each page
  2. Discover all its links
  3. Follow only the valid links (same hostname)
  4. Remember which pages we've already visited

DFS is a natural fit because it follows a "go as deep as possible" strategy - when you find a link, you immediately follow it, explore everything reachable from there, then backtrack. This is simpler than maintaining a queue (as in BFS) and works well since we don't care about finding the shortest path to pages, just finding all pages.

The critical observation for avoiding infinite loops is that websites often have cycles (Page A β†’ Page B β†’ Page A). Without tracking visited pages, we'd get stuck in these cycles forever. By maintaining a set of visited URLs, we ensure each page is processed exactly once.

The hostname extraction is straightforward: after removing the http:// prefix (7 characters), the hostname is everything before the first /. By comparing hostnames before following links, we stay within our domain boundary.

The recursive DFS structure emerges naturally: for each unvisited URL with the same hostname, we recursively apply the same exploration process. The base case is when we encounter an already-visited URL, at which point we stop to avoid redundant work.

Learn more about Depth-First Search and Breadth-First Search patterns.

Solution Approach

The implementation uses a recursive DFS approach with a helper function to extract hostnames and a set to track visited URLs.

Helper Function for Hostname Extraction:

def host(url):
    url = url[7:]  # Remove "http://" (7 characters)
    return url.split('/')[0]  # Get everything before first '/'

This function extracts the hostname from a URL. For example, http://example.com/page becomes example.com.

Main DFS Function: The core exploration logic is implemented as a nested recursive function:

def dfs(url):
    if url in ans:
        return  # Base case: already visited
    ans.add(url)  # Mark as visited
    for next_url in htmlParser.getUrls(url):
        if host(url) == host(next_url):  # Same hostname check
            dfs(next_url)  # Recursive exploration

Algorithm Steps:

  1. Initialize a set ans to store all visited URLs. Using a set ensures O(1) lookup time for checking if a URL has been visited and automatically handles duplicates.

  2. Start DFS from the starting URL by calling dfs(startUrl).

  3. In each DFS call:

    • Check if the current URL has been visited (base case)
    • If visited, return immediately to avoid cycles
    • If not visited, add it to the ans set
    • Get all linked URLs from the current page using htmlParser.getUrls(url)
    • For each linked URL, check if it has the same hostname
    • If the hostname matches, recursively explore that URL
  4. Return the result by converting the set to a list: return list(ans)
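
Putting these steps together, a compact sketch of the whole method (the same logic as the snippets above, with the set initialization and final list conversion shown in place):

from typing import List

class Solution:
    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
        ans = set()                      # step 1: visited-URL set

        def host(url: str) -> str:
            # Hostname is everything between "http://" and the first '/'
            return url[7:].split('/')[0]

        def dfs(url: str) -> None:
            if url in ans:               # base case: already visited
                return
            ans.add(url)                 # mark as visited
            for next_url in htmlParser.getUrls(url):
                if host(url) == host(next_url):   # same-hostname check
                    dfs(next_url)

        dfs(startUrl)                    # step 2: start DFS from startUrl
        return list(ans)                 # step 4: return the set as a list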

Key Design Decisions:

  • Set vs List for tracking visited URLs: A set provides O(1) membership checking compared to O(n) for a list, making the algorithm more efficient when dealing with many URLs.

  • Recursive DFS vs Iterative: The recursive approach is cleaner and more intuitive for this problem. The call stack naturally handles the backtracking.

  • Hostname comparison: By comparing hostnames before making the recursive call, we ensure we never explore URLs outside our domain, maintaining the problem constraints.

The time complexity is O(N + E) where N is the number of unique URLs within the hostname and E is the total number of edges (links) between them. The space complexity is O(N) for storing the visited URLs plus O(H) for the recursion stack depth, where H is the maximum depth of the URL graph.

Example Walkthrough

Let's walk through a small example to illustrate the solution approach.

Given:

  • startUrl = "http://news.com/world"
  • The HtmlParser returns the following links for each URL:
    • "http://news.com/world" β†’ ["http://news.com/sports", "http://news.com", "http://blog.com/travel"]
    • "http://news.com/sports" β†’ ["http://news.com/world", "http://news.com/tech"]
    • "http://news.com" β†’ ["http://news.com/world"]
    • "http://news.com/tech" β†’ []

Step-by-step execution:

  1. Initialize: Create empty set ans = {}

  2. Start DFS with "http://news.com/world":

    • Not in ans, so add it: ans = {"http://news.com/world"}
    • Get linked URLs: ["http://news.com/sports", "http://news.com", "http://blog.com/travel"]
    • Check each link:
      • "http://news.com/sports": hostname is "news.com" (same as start) β†’ recurse
      • "http://news.com": hostname is "news.com" (same) β†’ recurse later
      • "http://blog.com/travel": hostname is "blog.com" (different) β†’ skip
  3. DFS on "http://news.com/sports":

    • Not in ans, so add it: ans = {"http://news.com/world", "http://news.com/sports"}
    • Get linked URLs: ["http://news.com/world", "http://news.com/tech"]
    • Check each link:
      • "http://news.com/world": already in ans β†’ skip
      • "http://news.com/tech": hostname is "news.com" (same) β†’ recurse
  4. DFS on "http://news.com/tech":

    • Not in ans, so add it: ans = {"http://news.com/world", "http://news.com/sports", "http://news.com/tech"}
    • Get linked URLs: [] (no links)
    • Nothing to explore, return
  5. Back to step 2, continue with "http://news.com":

    • Not in ans, so add it: ans = {"http://news.com/world", "http://news.com/sports", "http://news.com/tech", "http://news.com"}
    • Get linked URLs: ["http://news.com/world"]
    • "http://news.com/world": already in ans β†’ skip
  6. Final result: Convert set to list and return ["http://news.com/world", "http://news.com/sports", "http://news.com/tech", "http://news.com"]

The crawler successfully found all URLs within the news.com domain while avoiding the external blog.com link and preventing infinite loops from circular references.
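
To make the walkthrough concrete, here is a small simulation using a mock parser (MockHtmlParser and its link map are made up for illustration; the real HtmlParser is provided by the judge) together with the Solution class from the next section:

class MockHtmlParser:
    """Stand-in for the judge's HtmlParser, backed by a fixed link map."""
    def __init__(self, links):
        self.links = links

    def getUrls(self, url):
        return self.links.get(url, [])

links = {
    "http://news.com/world":  ["http://news.com/sports", "http://news.com", "http://blog.com/travel"],
    "http://news.com/sports": ["http://news.com/world", "http://news.com/tech"],
    "http://news.com":        ["http://news.com/world"],
    "http://news.com/tech":   [],
}

result = Solution().crawl("http://news.com/world", MockHtmlParser(links))
# The order may differ because a set is used internally, but the contents match the walkthrough
assert sorted(result) == sorted([
    "http://news.com/world", "http://news.com/sports",
    "http://news.com/tech", "http://news.com",
])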

Solution Implementation

1# """
2# This is HtmlParser's API interface.
3# You should not implement it, or speculate about its implementation
4# """
5# class HtmlParser(object):
6#    def getUrls(self, url):
7#        """
8#        :type url: str
9#        :rtype List[str]
10#        """
11
12from typing import List, Set
13
14
15class Solution:
16    def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
17        """
18        Web crawler that crawls all URLs under the same hostname as startUrl.
19      
20        Args:
21            startUrl: The starting URL for crawling
22            htmlParser: Parser object to get URLs from a given page
23          
24        Returns:
25            List of all crawled URLs under the same hostname
26        """
27      
28        def get_hostname(url: str) -> str:
29            """
30            Extract hostname from a URL.
31            Assumes URL format starts with 'http://'
32          
33            Args:
34                url: URL string starting with 'http://'
35              
36            Returns:
37                Hostname portion of the URL
38            """
39            # Remove 'http://' prefix (7 characters)
40            url_without_protocol = url[7:]
41            # Split by '/' and take the first part (hostname)
42            return url_without_protocol.split('/')[0]
43      
44        def depth_first_search(current_url: str) -> None:
45            """
46            Recursively crawl URLs using depth-first search.
47            Only visits URLs with the same hostname as the start URL.
48          
49            Args:
50                current_url: The current URL being processed
51            """
52            # Skip if URL has already been visited
53            if current_url in visited_urls:
54                return
55          
56            # Mark current URL as visited
57            visited_urls.add(current_url)
58          
59            # Get all URLs from the current page
60            for next_url in htmlParser.getUrls(current_url):
61                # Only crawl URLs with the same hostname
62                if get_hostname(current_url) == get_hostname(next_url):
63                    depth_first_search(next_url)
64      
65        # Set to store all visited URLs (prevents duplicate visits)
66        visited_urls: Set[str] = set()
67      
68        # Start crawling from the initial URL
69        depth_first_search(startUrl)
70      
71        # Convert set to list and return
72        return list(visited_urls)
73
Java:

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * interface HtmlParser {
 *     public List<String> getUrls(String url) {}
 * }
 */

class Solution {
    // Set to store all visited URLs from the same hostname
    private Set<String> visitedUrls;

    /**
     * Crawls all URLs under the same hostname starting from the given URL
     * @param startUrl The starting URL for web crawling
     * @param htmlParser The HTML parser to extract URLs from web pages
     * @return List of all URLs under the same hostname
     */
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        visitedUrls = new HashSet<>();
        dfs(startUrl, htmlParser);
        return new ArrayList<>(visitedUrls);
    }

    /**
     * Depth-first search to recursively crawl URLs
     * @param currentUrl The current URL being processed
     * @param htmlParser The HTML parser to extract URLs from web pages
     */
    private void dfs(String currentUrl, HtmlParser htmlParser) {
        // Skip if URL has already been visited
        if (visitedUrls.contains(currentUrl)) {
            return;
        }

        // Mark current URL as visited
        visitedUrls.add(currentUrl);

        // Get all URLs from the current page and process them
        for (String nextUrl : htmlParser.getUrls(currentUrl)) {
            // Only crawl URLs with the same hostname
            if (extractHostname(nextUrl).equals(extractHostname(currentUrl))) {
                dfs(nextUrl, htmlParser);
            }
        }
    }

    /**
     * Extracts the hostname from a URL
     * @param url The URL string (format: http://hostname/path)
     * @return The hostname portion of the URL
     */
    private String extractHostname(String url) {
        // Remove "http://" prefix (7 characters)
        String urlWithoutProtocol = url.substring(7);
        // Split by "/" and return the hostname part
        return urlWithoutProtocol.split("/")[0];
    }
}

C++:

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * class HtmlParser {
 *   public:
 *     vector<string> getUrls(string url);
 * };
 */

class Solution {
public:
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        // Initialize result vector and visited set
        vector<string> result;
        unordered_set<string> visited;

        // Start DFS traversal from the starting URL
        dfs(startUrl, htmlParser, visited, result);

        return result;
    }

private:
    /**
     * Performs depth-first search to crawl all URLs under the same hostname
     * @param url: Current URL to process
     * @param htmlParser: Parser to get URLs from current page
     * @param visited: Set of already visited URLs to avoid duplicates
     * @param result: Vector to store all crawled URLs
     */
    void dfs(const string& url, HtmlParser& htmlParser,
             unordered_set<string>& visited, vector<string>& result) {
        // Skip if URL has already been visited
        if (visited.count(url)) {
            return;
        }

        // Mark current URL as visited and add to result
        visited.insert(url);
        result.push_back(url);

        // Get all URLs from current page and recursively visit those with same hostname
        vector<string> nextUrls = htmlParser.getUrls(url);
        for (const string& nextUrl : nextUrls) {
            // Only crawl URLs with the same hostname
            if (extractHostname(url) == extractHostname(nextUrl)) {
                dfs(nextUrl, htmlParser, visited, result);
            }
        }
    }

    /**
     * Extracts hostname from a URL
     * Assumes URL starts with "http://" (7 characters)
     * @param url: URL string to extract hostname from
     * @return: Hostname portion of the URL
     */
    string extractHostname(const string& url) {
        // Skip "http://" prefix (7 characters)
        const size_t protocolLength = 7;
        string hostname;

        // Extract characters until '/' or end of string
        for (size_t i = protocolLength; i < url.size(); ++i) {
            if (url[i] == '/') {
                break;
            }
            hostname += url[i];
        }

        return hostname;
    }
};

TypeScript:

/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * interface HtmlParser {
 *     getUrls(url: string): string[];
 * }
 */

/**
 * Main function to crawl all URLs under the same hostname starting from a given URL
 * @param startUrl - The starting URL to begin crawling from
 * @param htmlParser - Parser instance to extract URLs from HTML pages
 * @returns Array of all crawled URLs under the same hostname
 */
function crawl(startUrl: string, htmlParser: HtmlParser): string[] {
    // Initialize result array and visited set
    const result: string[] = [];
    const visited: Set<string> = new Set();

    // Start DFS traversal from the starting URL
    dfs(startUrl, htmlParser, visited, result);

    return result;
}

/**
 * Performs depth-first search to crawl all URLs under the same hostname
 * @param url - Current URL to process
 * @param htmlParser - Parser to get URLs from current page
 * @param visited - Set of already visited URLs to avoid duplicates
 * @param result - Array to store all crawled URLs
 */
function dfs(
    url: string,
    htmlParser: HtmlParser,
    visited: Set<string>,
    result: string[]
): void {
    // Skip if URL has already been visited
    if (visited.has(url)) {
        return;
    }

    // Mark current URL as visited and add to result
    visited.add(url);
    result.push(url);

    // Get all URLs from current page and recursively visit those with same hostname
    const nextUrls: string[] = htmlParser.getUrls(url);
    for (const nextUrl of nextUrls) {
        // Only crawl URLs with the same hostname
        if (extractHostname(url) === extractHostname(nextUrl)) {
            dfs(nextUrl, htmlParser, visited, result);
        }
    }
}

/**
 * Extracts hostname from a URL
 * Assumes URL starts with "http://" (7 characters)
 * @param url - URL string to extract hostname from
 * @returns Hostname portion of the URL
 */
function extractHostname(url: string): string {
    // Skip "http://" prefix (7 characters)
    const protocolLength: number = 7;
    let hostname: string = "";

    // Extract characters until '/' or end of string
    for (let i = protocolLength; i < url.length; i++) {
        if (url[i] === '/') {
            break;
        }
        hostname += url[i];
    }

    return hostname;
}


Time and Space Complexity

Time Complexity: O(N + E)

Where N is the total number of unique URLs within the same hostname that can be reached from the startUrl, and E is the total number of edges (links between pages).

  • The DFS traversal visits each unique URL exactly once due to the check if url in ans: return, which takes O(1) time for set lookup
  • For each visited URL, we call htmlParser.getUrls(url) once, which returns a list of URLs
  • We iterate through all returned URLs to check if they belong to the same hostname using the host() function
  • The host() function performs string operations that take O(L) time where L is the length of the URL, but this can be considered O(1) for practical URL lengths
  • In total, we process each URL once and examine each edge once, resulting in O(N + E) time complexity

Space Complexity: O(N + H)

Where N is the number of unique URLs visited and H is the maximum depth of the recursion stack.

  • The ans set stores all unique URLs visited within the same hostname, requiring O(N) space
  • The recursive DFS call stack can go as deep as the longest path in the crawl graph, which in the worst case (a linear chain of URLs) could be O(N), but typically would be much smaller
  • The space used by htmlParser.getUrls() to return lists of URLs is temporary and gets garbage collected after each call
  • Converting the set to a list at the end requires O(N) additional space temporarily
  • Overall space complexity is O(N) for storing the URLs plus O(H) for the recursion depth, giving us O(N + H), which simplifies to O(N) in the worst case

Learn more about how to find time and space complexity quickly.

Common Pitfalls

1. Incorrect Hostname Extraction for Edge Cases

The Pitfall: The current hostname extraction function assumes all URLs start with http:// and uses a fixed offset of 7 characters. This breaks if:

  • URLs use https:// (8 characters)
  • URLs have different protocols or formats
  • The URL format is unexpected

Example failure case:

# Current implementation fails for:
url = "https://example.com/page"  # url[7:] leaves "/example.com/page", so the extracted hostname is ""
url = "http://example.com"        # Works: split('/')[0] still returns "example.com" even with no path

Solution: Use a more robust hostname extraction method:

def get_hostname(url: str) -> str:
    # Handle both http:// and https://
    if url.startswith('http://'):
        url = url[7:]
    elif url.startswith('https://'):
        url = url[8:]
  
    # Split by '/' and take hostname part
    # Handle case where there's no path after hostname
    parts = url.split('/', 1)  # Split only at first '/'
    return parts[0]
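
Alternatively, a sketch using Python's standard urllib.parse module, which handles schemes and ports without manual string slicing:

from urllib.parse import urlparse

def get_hostname(url: str) -> str:
    # netloc is the "host[:port]" part of the URL, regardless of scheme
    return urlparse(url).netloc

# urlparse("http://example.com/page").netloc    -> "example.com"
# urlparse("https://example.com:8080/x").netloc -> "example.com:8080"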

2. Stack Overflow with Deep or Cyclic URL Structures

The Pitfall: The recursive DFS approach can cause stack overflow if the URL graph is very deep or has many levels. Python's default recursion limit is around 1000, which could be exceeded in large websites.

Solution: Implement an iterative version using an explicit stack:

def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
    visited_urls = set()
    stack = [startUrl]
    start_hostname = get_hostname(startUrl)
  
    while stack:
        current_url = stack.pop()
        if current_url in visited_urls:
            continue
          
        visited_urls.add(current_url)
      
        for next_url in htmlParser.getUrls(current_url):
            if get_hostname(next_url) == start_hostname and next_url not in visited_urls:
                stack.append(next_url)
  
    return list(visited_urls)
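
A breadth-first variant is nearly identical: swap the stack for a collections.deque and pop from the left. This sketch (assuming the same get_hostname helper from Pitfall 1) also avoids deep recursion while visiting pages level by level:

from collections import deque

def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
    start_hostname = get_hostname(startUrl)
    visited_urls = {startUrl}
    queue = deque([startUrl])

    while queue:
        current_url = queue.popleft()
        for next_url in htmlParser.getUrls(current_url):
            if get_hostname(next_url) == start_hostname and next_url not in visited_urls:
                visited_urls.add(next_url)  # mark when enqueuing so each URL enters the queue once
                queue.append(next_url)

    return list(visited_urls)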

3. Inefficient Hostname Comparison

The Pitfall: The current code calls get_hostname(current_url) inside the loop for every linked URL, even though the current URL's hostname never changes. This results in redundant parsing.

Solution: Cache the hostname calculation:

def depth_first_search(current_url: str) -> None:
    if current_url in visited_urls:
        return
  
    visited_urls.add(current_url)
    current_hostname = get_hostname(current_url)  # Calculate once
  
    for next_url in htmlParser.getUrls(current_url):
        if current_hostname == get_hostname(next_url):  # Use cached value
            depth_first_search(next_url)

4. Not Handling the Start URL's Hostname Consistently

The Pitfall: The code compares get_hostname(current_url) with get_hostname(next_url) in each recursive call, but this could lead to issues if there are redirects or if the comparison logic changes deeper in the recursion.

Solution: Store the target hostname once at the beginning:

def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
    target_hostname = get_hostname(startUrl)  # Fixed target
    visited_urls = set()
  
    def dfs(current_url: str) -> None:
        if current_url in visited_urls:
            return
        visited_urls.add(current_url)
      
        for next_url in htmlParser.getUrls(current_url):
            # Always compare against the original hostname
            if get_hostname(next_url) == target_hostname:
                dfs(next_url)
  
    dfs(startUrl)
    return list(visited_urls)

This ensures that all URLs are compared against the same reference hostname throughout the crawling process, preventing drift or inconsistencies.
