1236. Web Crawler
Problem Description
This problem asks you to implement a web crawler that explores web pages starting from a given URL and collects all URLs that belong to the same hostname.
The web crawler should follow these rules:
- Start from a given URL (startUrl) and explore linked pages from there.
- Use the HtmlParser interface to get URLs from any webpage by calling HtmlParser.getUrls(url), which returns a list of all URLs found on that page.
- Stay within the same hostname - only crawl URLs that have the same hostname as the starting URL. For example, if you start at http://example.org/news, you can crawl http://example.org/sports but not http://another.com/page.
- Avoid duplicates - each unique URL should be visited only once.
- Return all discovered URLs that belong to the same hostname, in any order.
The hostname is the domain part of the URL. For instance, in http://leetcode.com/problems, the hostname is leetcode.com. The crawler should treat URLs with and without trailing slashes as different (e.g., http://site.com and http://site.com/ are considered different URLs).
The solution uses a depth-first search (DFS) approach:
- It extracts the hostname from a URL by removing the http:// prefix (7 characters) and splitting by / to get the domain
- It maintains a set ans to track visited URLs and avoid duplicates
- For each URL, it gets all linked URLs using the HtmlParser, filters for the same hostname, and recursively visits unvisited ones
- Finally, it returns all collected URLs as a list
Flowchart Walkthrough
First, let's pin down the algorithm using the Flowchart. Here's a step-by-step walkthrough:
Is it a graph?
- Yes: The web pages and their links form a graph structure. Each URL is a node, and the links between pages are edges. When a page contains links to other pages, those represent directed edges in the graph.
Is it a tree?
- No: The web structure is not a tree because:
- Pages can have cycles (Page A can link to Page B, and Page B can link back to Page A)
- Multiple pages can link to the same page (multiple parents)
- There's no strict hierarchical parent-child relationship
Is the problem related to directed acyclic graphs?
- No: The web graph can contain cycles. Websites commonly have circular references where pages link back to each other.
Is the problem related to shortest paths?
- No: We're not looking for the shortest path between pages. The goal is to explore and collect all reachable URLs within the same hostname.
Does the problem involve connectivity?
- No: We're not checking if nodes are connected or finding connected components. We're performing an exhaustive exploration within a constrained domain (same hostname).
Does the problem have small constraints?
- Yes: The problem is constrained to URLs within the same hostname, which typically limits the search space. Also, we need to explore all possible paths from the starting URL.
DFS/backtracking?
- Yes: DFS is the appropriate choice because:
- We need to explore all reachable pages systematically
- We mark visited pages to avoid infinite loops (cycles)
- We recursively explore each unvisited link before backtracking
- The traversal naturally follows the link structure depth-first
Conclusion: The flowchart correctly leads us to use a DFS (Depth-First Search) approach for the web crawler problem. DFS allows us to systematically explore all pages within the same hostname while keeping track of visited URLs to avoid revisiting them.
Intuition
Think of web crawling like exploring a maze of interconnected rooms. You start in one room (the starting URL) and can see doors leading to other rooms (links to other pages). Your goal is to visit every room in your building (same hostname) exactly once, while ignoring doors that lead to other buildings (different hostnames).
The key insight is recognizing this as a graph traversal problem. Each webpage is a node, and links between pages are directed edges. Since we need to explore all reachable pages within a domain, we need a systematic way to:
- Visit each page
- Discover all its links
- Follow only the valid links (same hostname)
- Remember which pages we've already visited
DFS is a natural fit because it follows a "go as deep as possible" strategy - when you find a link, you immediately follow it, explore everything reachable from there, then backtrack. This is simpler than maintaining a queue (as in BFS) and works well since we don't care about finding the shortest path to pages, just finding all pages.
The critical observation for avoiding infinite loops is that websites often have cycles (Page A → Page B → Page A). Without tracking visited pages, we'd get stuck in these cycles forever. By maintaining a set of visited URLs, we ensure each page is processed exactly once.
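For comparison, a queue-based BFS would also work here; the following is a minimal sketch (not the approach used below), assuming the same HtmlParser interface and the hostname rule described in this section:

from collections import deque
from typing import List

def crawl_bfs(startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
    # Hostname rule: the text between "http://" and the next '/' must match the start URL's.
    def host(url: str) -> str:
        return url[7:].split('/')[0]

    visited = {startUrl}            # mark before enqueueing so each URL enters the queue once
    queue = deque([startUrl])
    while queue:
        url = queue.popleft()
        for next_url in htmlParser.getUrls(url):
            if host(next_url) == host(startUrl) and next_url not in visited:
                visited.add(next_url)
                queue.append(next_url)
    return list(visited)

Both traversals visit the same set of pages; only the order of exploration differs.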
The hostname extraction is straightforward: after removing the http:// prefix (7 characters), the hostname is everything before the first /. By comparing hostnames before following links, we stay within our domain boundary.
The recursive DFS structure emerges naturally: for each unvisited URL with the same hostname, we recursively apply the same exploration process. The base case is when we encounter an already-visited URL, at which point we stop to avoid redundant work.
Learn more about Depth-First Search and Breadth-First Search patterns.
Solution Approach
The implementation uses a recursive DFS approach with a helper function to extract hostnames and a set to track visited URLs.
Helper Function for Hostname Extraction:
def host(url):
    url = url[7:]                # Remove "http://" (7 characters)
    return url.split('/')[0]     # Get everything before the first '/'
This function extracts the hostname from a URL. For example, http://example.com/page becomes example.com.
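A few illustrative calls (assuming the host helper above):

print(host("http://example.com/page"))   # example.com
print(host("http://example.com"))        # example.com (no path)
print(host("http://example.com/"))       # example.com (trailing slash)
# Note: http://example.com and http://example.com/ still count as two distinct URLs
# when crawling, even though they share a hostname.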
Main DFS Function: The core exploration logic is implemented as a nested recursive function:
def dfs(url):
    if url in ans:
        return                               # Base case: already visited
    ans.add(url)                             # Mark as visited
    for next_url in htmlParser.getUrls(url):
        if host(url) == host(next_url):      # Same hostname check
            dfs(next_url)                    # Recursive exploration
Algorithm Steps:
- Initialize a set ans to store all visited URLs. Using a set ensures O(1) lookup time for checking if a URL has been visited and automatically handles duplicates.
- Start DFS from the starting URL by calling dfs(startUrl).
- In each DFS call:
  - Check if the current URL has been visited (base case)
  - If visited, return immediately to avoid cycles
  - If not visited, add it to the ans set
  - Get all linked URLs from the current page using htmlParser.getUrls(url)
  - For each linked URL, check if it has the same hostname
  - If the hostname matches, recursively explore that URL
- Return the result by converting the set to a list: return list(ans)
Key Design Decisions:
- Set vs List for tracking visited URLs: A set provides O(1) membership checking compared to O(n) for a list, making the algorithm more efficient when dealing with many URLs.
- Recursive DFS vs Iterative: The recursive approach is cleaner and more intuitive for this problem. The call stack naturally handles the backtracking.
- Hostname comparison: By comparing hostnames before making the recursive call, we ensure we never explore URLs outside our domain, maintaining the problem constraints.
The time complexity is O(N + E) where N is the number of unique URLs within the hostname and E is the total number of edges (links) between them. The space complexity is O(N) for storing the visited URLs plus O(H) for the recursion stack depth, where H is the maximum depth of the URL graph.
Example Walkthrough
Let's walk through a small example to illustrate the solution approach.
Given:
- startUrl = "http://news.com/world"
- The HtmlParser returns the following links for each URL:
  - "http://news.com/world" → ["http://news.com/sports", "http://news.com", "http://blog.com/travel"]
  - "http://news.com/sports" → ["http://news.com/world", "http://news.com/tech"]
  - "http://news.com" → ["http://news.com/world"]
  - "http://news.com/tech" → []
Step-by-step execution:
1. Initialize: Create an empty set ans = {}
2. Start DFS with "http://news.com/world":
   - Not in ans, so add it: ans = {"http://news.com/world"}
   - Get linked URLs: ["http://news.com/sports", "http://news.com", "http://blog.com/travel"]
   - Check each link:
     - "http://news.com/sports": hostname is "news.com" (same as start) → recurse
     - "http://news.com": hostname is "news.com" (same) → recurse later
     - "http://blog.com/travel": hostname is "blog.com" (different) → skip
3. DFS on "http://news.com/sports":
   - Not in ans, so add it: ans = {"http://news.com/world", "http://news.com/sports"}
   - Get linked URLs: ["http://news.com/world", "http://news.com/tech"]
   - Check each link:
     - "http://news.com/world": already in ans → skip
     - "http://news.com/tech": hostname is "news.com" (same) → recurse
4. DFS on "http://news.com/tech":
   - Not in ans, so add it: ans = {"http://news.com/world", "http://news.com/sports", "http://news.com/tech"}
   - Get linked URLs: [] (no links)
   - Nothing to explore, return
5. Back to step 2, continue with "http://news.com":
   - Not in ans, so add it: ans = {"http://news.com/world", "http://news.com/sports", "http://news.com/tech", "http://news.com"}
   - Get linked URLs: ["http://news.com/world"]
   - "http://news.com/world": already in ans → skip
6. Final result: Convert the set to a list and return ["http://news.com/world", "http://news.com/sports", "http://news.com/tech", "http://news.com"]
The crawler successfully found all URLs within the news.com domain while avoiding the external blog.com link and preventing infinite loops from circular references.
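To reproduce this walkthrough locally, one can stub out the parser with a small dictionary-backed class (a hypothetical test helper, not part of the problem's API) and run it against the Solution class from the next section:

from typing import Dict, List

class FakeHtmlParser:
    """Hypothetical stand-in for the judge-provided HtmlParser, for local testing only."""
    def __init__(self, graph: Dict[str, List[str]]):
        self.graph = graph

    def getUrls(self, url: str) -> List[str]:
        return self.graph.get(url, [])

links = {
    "http://news.com/world": ["http://news.com/sports", "http://news.com", "http://blog.com/travel"],
    "http://news.com/sports": ["http://news.com/world", "http://news.com/tech"],
    "http://news.com": ["http://news.com/world"],
    "http://news.com/tech": [],
}

result = Solution().crawl("http://news.com/world", FakeHtmlParser(links))
print(sorted(result))
# ['http://news.com', 'http://news.com/sports', 'http://news.com/tech', 'http://news.com/world']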
Solution Implementation
1# """
2# This is HtmlParser's API interface.
3# You should not implement it, or speculate about its implementation
4# """
5# class HtmlParser(object):
6# def getUrls(self, url):
7# """
8# :type url: str
9# :rtype List[str]
10# """
11
12from typing import List, Set
13
14
15class Solution:
16 def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
17 """
18 Web crawler that crawls all URLs under the same hostname as startUrl.
19
20 Args:
21 startUrl: The starting URL for crawling
22 htmlParser: Parser object to get URLs from a given page
23
24 Returns:
25 List of all crawled URLs under the same hostname
26 """
27
28 def get_hostname(url: str) -> str:
29 """
30 Extract hostname from a URL.
31 Assumes URL format starts with 'http://'
32
33 Args:
34 url: URL string starting with 'http://'
35
36 Returns:
37 Hostname portion of the URL
38 """
39 # Remove 'http://' prefix (7 characters)
40 url_without_protocol = url[7:]
41 # Split by '/' and take the first part (hostname)
42 return url_without_protocol.split('/')[0]
43
44 def depth_first_search(current_url: str) -> None:
45 """
46 Recursively crawl URLs using depth-first search.
47 Only visits URLs with the same hostname as the start URL.
48
49 Args:
50 current_url: The current URL being processed
51 """
52 # Skip if URL has already been visited
53 if current_url in visited_urls:
54 return
55
56 # Mark current URL as visited
57 visited_urls.add(current_url)
58
59 # Get all URLs from the current page
60 for next_url in htmlParser.getUrls(current_url):
61 # Only crawl URLs with the same hostname
62 if get_hostname(current_url) == get_hostname(next_url):
63 depth_first_search(next_url)
64
65 # Set to store all visited URLs (prevents duplicate visits)
66 visited_urls: Set[str] = set()
67
68 # Start crawling from the initial URL
69 depth_first_search(startUrl)
70
71 # Convert set to list and return
72 return list(visited_urls)
73
/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * interface HtmlParser {
 *     public List<String> getUrls(String url) {}
 * }
 */

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Solution {
    // Set to store all visited URLs from the same hostname
    private Set<String> visitedUrls;

    /**
     * Crawls all URLs under the same hostname starting from the given URL
     * @param startUrl The starting URL for web crawling
     * @param htmlParser The HTML parser to extract URLs from web pages
     * @return List of all URLs under the same hostname
     */
    public List<String> crawl(String startUrl, HtmlParser htmlParser) {
        visitedUrls = new HashSet<>();
        dfs(startUrl, htmlParser);
        return new ArrayList<>(visitedUrls);
    }

    /**
     * Depth-first search to recursively crawl URLs
     * @param currentUrl The current URL being processed
     * @param htmlParser The HTML parser to extract URLs from web pages
     */
    private void dfs(String currentUrl, HtmlParser htmlParser) {
        // Skip if URL has already been visited
        if (visitedUrls.contains(currentUrl)) {
            return;
        }

        // Mark current URL as visited
        visitedUrls.add(currentUrl);

        // Get all URLs from the current page and process them
        for (String nextUrl : htmlParser.getUrls(currentUrl)) {
            // Only crawl URLs with the same hostname
            if (extractHostname(nextUrl).equals(extractHostname(currentUrl))) {
                dfs(nextUrl, htmlParser);
            }
        }
    }

    /**
     * Extracts the hostname from a URL
     * @param url The URL string (format: http://hostname/path)
     * @return The hostname portion of the URL
     */
    private String extractHostname(String url) {
        // Remove "http://" prefix (7 characters)
        String urlWithoutProtocol = url.substring(7);
        // Split by "/" and return the hostname part
        return urlWithoutProtocol.split("/")[0];
    }
}
/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * class HtmlParser {
 *   public:
 *     vector<string> getUrls(string url);
 * };
 */

#include <string>
#include <unordered_set>
#include <vector>
using namespace std;

class Solution {
public:
    vector<string> crawl(string startUrl, HtmlParser htmlParser) {
        // Initialize result vector and visited set
        vector<string> result;
        unordered_set<string> visited;

        // Start DFS traversal from the starting URL
        dfs(startUrl, htmlParser, visited, result);

        return result;
    }

private:
    /**
     * Performs depth-first search to crawl all URLs under the same hostname
     * @param url: Current URL to process
     * @param htmlParser: Parser to get URLs from current page
     * @param visited: Set of already visited URLs to avoid duplicates
     * @param result: Vector to store all crawled URLs
     */
    void dfs(const string& url, HtmlParser& htmlParser,
             unordered_set<string>& visited, vector<string>& result) {
        // Skip if URL has already been visited
        if (visited.count(url)) {
            return;
        }

        // Mark current URL as visited and add to result
        visited.insert(url);
        result.push_back(url);

        // Get all URLs from current page and recursively visit those with same hostname
        vector<string> nextUrls = htmlParser.getUrls(url);
        for (const string& nextUrl : nextUrls) {
            // Only crawl URLs with the same hostname
            if (extractHostname(url) == extractHostname(nextUrl)) {
                dfs(nextUrl, htmlParser, visited, result);
            }
        }
    }

    /**
     * Extracts hostname from a URL
     * Assumes URL starts with "http://" (7 characters)
     * @param url: URL string to extract hostname from
     * @return: Hostname portion of the URL
     */
    string extractHostname(const string& url) {
        // Skip "http://" prefix (7 characters)
        const size_t protocolLength = 7;
        string hostname;

        // Extract characters until '/' or end of string
        for (size_t i = protocolLength; i < url.size(); ++i) {
            if (url[i] == '/') {
                break;
            }
            hostname += url[i];
        }

        return hostname;
    }
};
/**
 * // This is the HtmlParser's API interface.
 * // You should not implement it, or speculate about its implementation
 * interface HtmlParser {
 *     getUrls(url: string): string[];
 * }
 */

/**
 * Main function to crawl all URLs under the same hostname starting from a given URL
 * @param startUrl - The starting URL to begin crawling from
 * @param htmlParser - Parser instance to extract URLs from HTML pages
 * @returns Array of all crawled URLs under the same hostname
 */
function crawl(startUrl: string, htmlParser: HtmlParser): string[] {
    // Initialize result array and visited set
    const result: string[] = [];
    const visited: Set<string> = new Set();

    // Start DFS traversal from the starting URL
    dfs(startUrl, htmlParser, visited, result);

    return result;
}

/**
 * Performs depth-first search to crawl all URLs under the same hostname
 * @param url - Current URL to process
 * @param htmlParser - Parser to get URLs from current page
 * @param visited - Set of already visited URLs to avoid duplicates
 * @param result - Array to store all crawled URLs
 */
function dfs(
    url: string,
    htmlParser: HtmlParser,
    visited: Set<string>,
    result: string[]
): void {
    // Skip if URL has already been visited
    if (visited.has(url)) {
        return;
    }

    // Mark current URL as visited and add to result
    visited.add(url);
    result.push(url);

    // Get all URLs from current page and recursively visit those with same hostname
    const nextUrls: string[] = htmlParser.getUrls(url);
    for (const nextUrl of nextUrls) {
        // Only crawl URLs with the same hostname
        if (extractHostname(url) === extractHostname(nextUrl)) {
            dfs(nextUrl, htmlParser, visited, result);
        }
    }
}

/**
 * Extracts hostname from a URL
 * Assumes URL starts with "http://" (7 characters)
 * @param url - URL string to extract hostname from
 * @returns Hostname portion of the URL
 */
function extractHostname(url: string): string {
    // Skip "http://" prefix (7 characters)
    const protocolLength: number = 7;
    let hostname: string = "";

    // Extract characters until '/' or end of string
    for (let i = protocolLength; i < url.length; i++) {
        if (url[i] === '/') {
            break;
        }
        hostname += url[i];
    }

    return hostname;
}
Time and Space Complexity
Time Complexity: O(N + E)
Where N is the total number of unique URLs within the same hostname that can be reached from the startUrl, and E is the total number of edges (links between pages).
- The DFS traversal visits each unique URL exactly once due to the check if url in ans: return, which takes O(1) time for set lookup
- For each visited URL, we call htmlParser.getUrls(url) once, which returns a list of URLs
- We iterate through all returned URLs to check if they belong to the same hostname using the host() function
- The host() function performs string operations that take O(L) time where L is the length of the URL, but this can be considered O(1) for practical URL lengths
- In total, we process each URL once and examine each edge once, resulting in O(N + E) time complexity
Space Complexity: O(N + H)
Where N is the number of unique URLs visited and H is the maximum depth of the recursion stack.
- The ans set stores all unique URLs visited within the same hostname, requiring O(N) space
- The recursive DFS call stack can go as deep as the longest path in the crawl graph, which in the worst case (a linear chain of URLs) could be O(N), but typically would be much smaller
- The space used by htmlParser.getUrls() to return lists of URLs is temporary and gets garbage collected after each call
- Converting the set to a list at the end requires O(N) additional space temporarily
- Overall space complexity is O(N) for storing the URLs plus O(H) for the recursion depth, giving us O(N + H), which simplifies to O(N) in the worst case
Learn more about how to find time and space complexity quickly.
Common Pitfalls
1. Incorrect Hostname Extraction for Edge Cases
The Pitfall:
The current hostname extraction function assumes all URLs start with http:// and uses a fixed offset of 7 characters. This breaks if:
- URLs use https:// (8 characters)
- URLs have different protocols or formats
- The URL format is unexpected
Example failure case:
# Current implementation fails for:
url = "https://example.com/page"  # url[7:] leaves "/example.com/page", so split('/')[0] returns an empty hostname
url = "http://example.com"        # Works even without a trailing path: split('/')[0] still returns "example.com"
Solution: Use a more robust hostname extraction method:
def get_hostname(url: str) -> str:
    # Handle both http:// and https://
    if url.startswith('http://'):
        url = url[7:]
    elif url.startswith('https://'):
        url = url[8:]
    # Split by '/' and take the hostname part
    # Handle the case where there's no path after the hostname
    parts = url.split('/', 1)  # Split only at the first '/'
    return parts[0]
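A few quick checks of the robust helper (assuming the definition above):

assert get_hostname("http://example.com/page") == "example.com"
assert get_hostname("https://example.com/page") == "example.com"  # https now handled
assert get_hostname("http://example.com") == "example.com"        # no trailing path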
2. Stack Overflow with Deep or Cyclic URL Structures
The Pitfall: The recursive DFS approach can cause stack overflow if the URL graph is very deep or has many levels. Python's default recursion limit is around 1000, which could be exceeded in large websites.
Solution: Implement an iterative version using an explicit stack:
def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
    visited_urls = set()
    stack = [startUrl]
    start_hostname = get_hostname(startUrl)
    while stack:
        current_url = stack.pop()
        if current_url in visited_urls:
            continue
        visited_urls.add(current_url)
        for next_url in htmlParser.getUrls(current_url):
            if get_hostname(next_url) == start_hostname and next_url not in visited_urls:
                stack.append(next_url)
    return list(visited_urls)
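If switching to iteration is not desirable, a quick (and less robust) workaround is to raise Python's recursion limit before crawling; the iterative version above remains the safer fix:

import sys
sys.setrecursionlimit(10_000)  # allows deeper recursion, but deep chains can still exhaust the process stack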
3. Inefficient Hostname Comparison
The Pitfall:
The current code calls get_hostname(current_url) inside the loop for every linked URL, even though the current URL's hostname never changes. This results in redundant parsing.
Solution: Cache the hostname calculation:
def depth_first_search(current_url: str) -> None:
    if current_url in visited_urls:
        return
    visited_urls.add(current_url)
    current_hostname = get_hostname(current_url)  # Calculate once
    for next_url in htmlParser.getUrls(current_url):
        if current_hostname == get_hostname(next_url):  # Use cached value
            depth_first_search(next_url)
4. Not Handling the Start URL's Hostname Consistently
The Pitfall:
The code compares get_hostname(current_url) with get_hostname(next_url) in each recursive call, but this could lead to issues if there are redirects or if the comparison logic changes deeper in the recursion.
Solution: Store the target hostname once at the beginning:
def crawl(self, startUrl: str, htmlParser: 'HtmlParser') -> List[str]:
    target_hostname = get_hostname(startUrl)  # Fixed target
    visited_urls = set()

    def dfs(current_url: str) -> None:
        if current_url in visited_urls:
            return
        visited_urls.add(current_url)
        for next_url in htmlParser.getUrls(current_url):
            # Always compare against the original hostname
            if get_hostname(next_url) == target_hostname:
                dfs(next_url)

    dfs(startUrl)
    return list(visited_urls)
This ensures that all URLs are compared against the same reference hostname throughout the crawling process, preventing drift or inconsistencies.
Recommended Readings
- Depth First Search (DFS)
- Breadth First Search (BFS) on Trees