Crawlee for Python: SSRF via sitemap-derived URLs
Overview
- Vulnerability type: Blind SSRF
- Affected components: src/crawlee/_utils/sitemap.py, src/crawlee/_utils/robots.py, src/crawlee/request_loaders/_sitemap_request_loader.py, and all built-in HTTP clients.
- Trigger: an attacker-controlled sitemap or robots.txt containing a URL that points to an internal host (layer 1) or uses a non-http scheme (layer 2).
Two-layer SSRF via sitemap-derived URLs:
1) Cross-host HTTP SSRF
**Base case, affects every HTTP client.** Sitemap entries and robots.txt Sitemap: directives were accepted regardless of the host they pointed to. A sitemap on example.com could push http://internal.corp/admin into the crawler's queue, and the configured HTTP client would dispatch the request.
2) Non-HTTP scheme SSRF
**Escalation, only CurlImpersonateHttpClient.** Nested-sitemap fetching dispatches the URL straight to the HTTP client, bypassing the Request construction step where Pydantic enforces http(s). Combined with the libcurl-backed CurlImpersonateHttpClient, this lets gopher://, file://, dict://, ftp://, etc., through.
Root cause
Crawlee already validates URL schemes through Pydantic's AnyHttpUrl (via validate_http_url in src/crawlee/_utils/urls.py) wherever a crawl target is materialised as a Request: the Request.url field is declared as Annotated[str, BeforeValidator(validate_http_url), Field(frozen=True)]. Anything that becomes a Request is therefore guaranteed to be http(s).
Two parts of the sitemap pipeline sidestepped this property in different ways:
1) Sitemap-derived URLs were enqueued without any host policy
SitemapRequestLoader took every regular sitemap URL entry, wrapped it in Request.from_url (which accepts any valid http(s) URL), and pushed the result into the request queue. RobotsTxtFile.get_sitemaps() returned every Sitemap: directive verbatim. Neither checked the host against the parent sitemap or robots.txt URL, so an attacker controlling that content could push internal-network HTTP URLs into the queue and have them crawled by whichever HTTP client was configured.
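The missing host check can be illustrated with a minimal standard-library sketch. The helper name `same_hostname` and the sample URLs are hypothetical, and this is not Crawlee's implementation; it only shows the kind of comparison against the parent sitemap URL that was absent.

```python
from urllib.parse import urlsplit


def same_hostname(parent_url: str, candidate_url: str) -> bool:
    """Return True only if the candidate is http(s) and shares the parent's hostname.

    Hypothetical helper for illustration; not part of Crawlee.
    """
    parent = urlsplit(parent_url)
    candidate = urlsplit(candidate_url)
    if candidate.scheme not in ('http', 'https'):
        return False
    # urlsplit lower-cases the hostname and strips any port or credentials.
    return candidate.hostname is not None and candidate.hostname == parent.hostname


parent = 'https://example.com/sitemap.xml'
entries = [
    'https://example.com/products/1',
    'http://internal.corp/admin',
]
# Only entries on the parent's host survive the filter.
kept = [url for url in entries if same_hostname(parent, url)]
print(kept)
```

An exact hostname match is the strictest choice; a real deployment might prefer a looser policy (for example, same registrable domain), which is a design decision this sketch does not cover.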
2) Nested sitemap fetching bypassed the Request chokepoint entirely
When _XmlSitemapParser encountered …, or when RobotsTxtFile.parse_sitemaps forwarded Sitemap: directives into the same pipeline, _fetch_and_process_sitemap dispatched the URL directly to the HTTP client:

    async with http_client.stream(
        sitemap_url,
        method='GET',
        headers=SITEMAP_HEADERS,
        proxy_info=proxy_info,
        timeout=timeout,
    ) as response:
        ...

No Request was constructed, so the Pydantic validator never ran. Before the fix, the HTTP clients' own send_request() and stream() methods did not call validate_http_url either, so a non-http(s) scheme could pass straight through to the backend client.
The non-HTTP escalation in layer 2 is specific to CurlImpersonateHttpClient, which is backed by curl-cffi / libcurl and speaks gopher, file, dict, ftp, and other non-HTTP protocols. The other clients shipped with Crawlee (HttpxHttpClient, ImpitHttpClient, PlaywrightHttpClient) reject non-http(s) schemes at their own backend layer, regardless of what Crawlee passes in, so they were only affected by layer 1.
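A scheme guard placed at the client boundary catches callers that bypass Request construction. The following is a minimal sketch under stated assumptions: `require_http_url`, `guarded_stream`, and `fake_backend` are hypothetical stand-ins rather than Crawlee or curl-cffi APIs, and the check uses the standard library instead of Pydantic's AnyHttpUrl.

```python
from collections.abc import Callable
from urllib.parse import urlsplit

ALLOWED_SCHEMES = frozenset({'http', 'https'})


def require_http_url(url: str) -> str:
    """Return the URL unchanged if it is http(s) with a host, else raise ValueError."""
    parts = urlsplit(url)
    if parts.scheme not in ALLOWED_SCHEMES or not parts.hostname:
        raise ValueError(f'refusing non-http(s) URL: {url!r}')
    return url


def guarded_stream(backend: Callable[[str], str], url: str) -> str:
    """Validate first, then hand the URL to the backend (hypothetical wrapper)."""
    return backend(require_http_url(url))


def fake_backend(url: str) -> str:
    """Stand-in for a real HTTP backend; performs no network I/O."""
    return f'dispatched {url}'


candidates = (
    'https://example.com/sitemap.xml',
    'gopher://127.0.0.1:6379/_INFO',
    'file:///etc/passwd',
)
for candidate in candidates:
    try:
        print(guarded_stream(fake_backend, candidate))
    except ValueError as exc:
        # Non-http(s) schemes never reach the backend.
        print(exc)
```

Validating at the sink rather than only at the source means the check still runs when a new code path hands a raw string to the client.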
Vulnerable paths
Layer 1 — cross-host HTTP (all HTTP clients)
- *Source:* an attacker-controlled sitemap that lists internal URLs as regular URL entries or nested-sitemap entries, or an attacker-controlled robots.txt that lists internal URLs under Sitemap:.
- *Sink:* the configured HTTP client issues GET requests against those URLs, either via client.request(url=request.url, …) inside crawl() for regular sitemap URLs, or via client.stream(url, …) inside the nested-sitemap fetch.
Layer 2 — non-HTTP schemes (CurlImpersonateHttpClient only)
- *Source:* a nested-sitemap entry or a robots.txt Sitemap: directive pointing to a non-http(s) URL.
- *Sink:* CurlImpersonateHttpClient.stream(...) hands the URL string verbatim to client.request(url=…, …), which dispatches via libcurl.
Hardening in 1.7.0 was added at both producer and consumer ends — see *Remediation*.
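To make the two sources concrete, the sketch below parses a small sitemap index and labels each nested entry by the layer it would exercise. The sample document, the `triage` function, and the labels are hypothetical and written only for illustration; the parsing uses the standard library, not Crawlee's _XmlSitemapParser.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

# Hypothetical attacker-controlled sitemap index served by example.com.
SAMPLE = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>http://internal.corp/admin</loc></sitemap>
  <sitemap><loc>file:///etc/hostname</loc></sitemap>
</sitemapindex>"""


def triage(parent_url: str, xml_text: str) -> list[tuple[str, str]]:
    """Label every <loc> value relative to the parent sitemap URL."""
    parent_host = urlsplit(parent_url).hostname
    labelled = []
    for loc in ET.fromstring(xml_text).iter(f'{SITEMAP_NS}loc'):
        url = (loc.text or '').strip()
        parts = urlsplit(url)
        if parts.scheme not in ('http', 'https'):
            label = 'layer 2: non-http(s) scheme'
        elif parts.hostname != parent_host:
            label = 'layer 1: cross-host'
        else:
            label = 'same host'
        labelled.append((url, label))
    return labelled


for url, label in triage('https://example.com/sitemap.xml', SAMPLE):
    print(f'{label:30} {url}')
```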
Exploitation preconditions
- The crawler uses sitemap loading: any of SitemapRequestLoader, Sitemap.load/parse_sitemap, discover_valid_sitemaps, or RobotsTxtFile.parse_sitemaps.
- The attacker controls the body of a sitemap or robots.txt that the crawler fetches, typically by being the target site, or by getting a target site to publish a malicious sitemap.
- The crawler's network egress can reach the attacker-chosen destination (e.g., internal services on the same network).
- The targeted endpoint accepts unauthenticated requests. Crawlee does not supply credentials to the forged destination, so authenticated services (IMDSv2 with token, password-protected Redis, protected admin panels) are not reachable through this path.
For layer 2 (non-HTTP), the configured HTTP client must additionally be CurlImpersonateHttpClient.
Impact
Layer 1 — cross-host HTTP (any client)
The crawler can be coerced into issuing GET requests against internal HTTP services on its own network: admin panels, unauthenticated internal APIs, cloud metadata endpoints, etc. Read-back is blind — Crawlee surfaces fetched content only through its local Dataset / KeyValueStore (push_data() etc.) and does not natively forward scraped bodies anywhere external — so direct impact is mostly existence/timing probing and occasional state changes via side-effecting GET endpoints. Read-side leakage of internal content is only exploitable end-to-end if the deployer's own application separately exposes scraped data (for example, a public summariser or aggregator built on top of Crawlee).
Layer 2 — non-HTTP escalation (only CurlImpersonateHttpClient)
Under the affected client, attackers gain the libcurl scheme set:
- gopher:// is the canonical RESP-injection vector: pipeline FLUSHALL, CONFIG SET dir, CONFIG SET dbfilename, SAVE to an unauthenticated Redis on the crawler's network, which is enough to write attacker-controlled bytes to disk and, in the standard escalation, achieve remote code execution on the Redis host.
- file:// allows the crawler to read local files (application secrets, configuration) on the crawler host.
- dict:// and ftp:// permit fingerprinting and limited interaction with text-protocol services.
In both layers, the SSRF is blind in the default configuration. Write-side impact (gopher:// → Redis) and timing-based internal probing do not depend on read-back and remain viable regardless of whether the deployer surfaces scraped content.
Remediation
Both layers are fixed in crawlee==1.7.0. The fix is split across two PRs, applied at the two complementary boundaries of the affected pipeline:
SitemapRequestLoader and RobotsTxtFile.get_sitemaps() now run every nested-sitemap entry, every regular sitemap URL, and every Sitemap: directive through crawlee._utils.urls.filter_url. This applies an EnqueueStrategy (default 'same-hostname') against the parent sitemap / robots.txt URL, so cross-host entries are dropped, and rejects non-http(s) schemes. The strategy is stamped onto the emitted Requests, so BasicCrawler._check_url_after_redirects continues to enforce the policy across redirects.
validate_http_url(url) is now called at the top of send_request() and stream() in ImpitHttpClient, HttpxHttpClient, CurlImpersonateHttpClient, and PlaywrightHttpClient. Non-http(s) schemes raise pydantic.ValidationError before any backend call. crawl()` was already c