# sitemap-scanner
Async crawler for site sitemaps with built-in SEO metadata extraction.
Two composable scanners:

- `SitemapScanner` — discovers and iterates URLs from a site's sitemap. Reads `robots.txt` to find sitemap locations or accepts a `sitemap.xml` URL directly, recursively follows nested sitemap indexes, and supports filters to prune branches or individual URLs.
- `SeoScanner` — fetches HTML pages and extracts the full SEO/social metadata stack: Open Graph, Twitter Cards, classic HTML `<meta>`, JSON-LD (schema.org), canonical, `hreflang`, favicon. Exposes computed `title`/`description`/`image` fields with Google-style priority: `og > twitter > meta > json_ld`.
Both scanners share the `BaseGetRepository[T].iter_get(...)` interface and use `cloudscraper` under the hood for Cloudflare-protected sites.
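Because the interface is shared, helpers can be written once and reused with either scanner. A minimal sketch, assuming `BaseGetRepository` is exported from the package root (the import path is a guess; adjust to wherever the package actually exposes it):

```python
from typing import Sequence, TypeVar

# Assumption: BaseGetRepository is importable from the package root.
from sitemap_scanner import BaseGetRepository

T = TypeVar("T")


async def first_chunk(repo: BaseGetRepository[T], *args, **kwargs) -> Sequence[T]:
    # Works with any scanner that implements iter_get, i.e. SitemapScanner
    # or SeoScanner; returns the first yielded chunk, or an empty list.
    async for chunk in repo.iter_get(*args, **kwargs):
        return chunk
    return []
```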
## Installation

```bash
uv add sitemap-scanner --index CobraPack
```
## Quick start

```python
import asyncio

from pydantic import AnyHttpUrl

from sitemap_scanner import SitemapScanner
from sitemap_scanner.scanners.seo_scanner import SeoScanner


async def main() -> None:
    sitemap = SitemapScanner()
    seo = SeoScanner(concurrency=10)

    async for sitemap_chunk in sitemap.iter_get(
        AnyHttpUrl("https://example.com")
    ):
        urls = [s.loc for s in sitemap_chunk]
        async for results in seo.iter_get(*urls, chunk_size=50):
            for page in results:
                print(page.title, page.image)


asyncio.run(main())
```
## SitemapScanner

```python
async for chunk in scanner.iter_get(
    url,
    chunk_size=1000,
    loc_filter=None,
    sitemap_filter=None,
):
    ...
```
- `url` — site root, sitemap URL, or `robots.txt` URL. Auto-resolves `https://example.com` to `https://example.com/robots.txt`.
- `loc_filter(url, recursion_level, parent_urls) -> bool` — keep/drop individual `<loc>` entries.
- `sitemap_filter(url, recursion_level, parent_urls) -> bool` — keep/drop whole nested sitemap branches (see the sketch below).

Yields `Sequence[SitemapURL]` (`loc`, `lastmod`, `changefreq`, `priority`).
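A filter sketch: the callback signatures follow the list above, but the URL patterns and helper names are illustrative only.

```python
from pydantic import AnyHttpUrl

from sitemap_scanner import SitemapScanner


def skip_blog_branches(url, recursion_level, parent_urls) -> bool:
    # Prune any nested sitemap whose URL mentions "blog" before fetching it.
    return "blog" not in str(url)


def keep_product_pages(url, recursion_level, parent_urls) -> bool:
    # Keep only <loc> entries under /products/ (illustrative pattern).
    return "/products/" in str(url)


async def collect_product_urls() -> list[AnyHttpUrl]:
    scanner = SitemapScanner()
    urls: list[AnyHttpUrl] = []
    async for chunk in scanner.iter_get(
        AnyHttpUrl("https://example.com"),
        loc_filter=keep_product_pages,
        sitemap_filter=skip_blog_branches,
    ):
        urls.extend(s.loc for s in chunk)
    return urls
```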
## SeoScanner

```python
seo = SeoScanner(timeout=10.0, concurrency=10)

async for chunk in seo.iter_get(*urls, chunk_size=1000):
    for page in chunk:
        page.title         # computed: og > twitter > meta > json_ld
        page.description   # computed
        page.image         # computed AnyHttpUrl
        page.open_graph    # OpenGraphData
        page.twitter_card  # TwitterCardData
        page.meta          # HtmlMetaData (canonical, hreflang, favicon, ...)
        page.json_ld       # list[dict] - parsed schema.org objects
```
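The computed fields make it easy to flatten results for export. A sketch, assuming `page.image` can be `None` when no image is found (not stated above):

```python
import json

from pydantic import AnyHttpUrl

from sitemap_scanner.scanners.seo_scanner import SeoScanner


async def dump_metadata(seo: SeoScanner, urls: list[AnyHttpUrl]) -> None:
    # Stream pages and print one JSON record per page, using only the
    # computed top-level fields.
    async for chunk in seo.iter_get(*urls, chunk_size=100):
        for page in chunk:
            record = {
                "title": page.title,
                "description": page.description,
                "image": str(page.image) if page.image else None,
            }
            print(json.dumps(record))
```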
## Schemas

- `PageMetadata` — full snapshot with computed top-level fields.
- `OpenGraphData` — `og:title`, `og:description`, `og:image`, `og:url`, `og:type`, `og:site_name`, `og:locale`.
- `TwitterCardData` — `twitter:card`, `twitter:title`, `twitter:description`, `twitter:image`, `twitter:site`, `twitter:creator`.
- `HtmlMetaData` — `<title>`, `<meta name="description|keywords|author|robots">`, `<html lang>`, `<link rel="canonical|icon">`, `<link rel="alternate" hreflang=...>`.
## Behavior

- All extracted URLs are resolved against the page URL (relative links produce absolute `AnyHttpUrl`).
- Duplicate meta tags: first occurrence wins.
- Malformed JSON-LD blocks are silently skipped.
- `@graph` containers in JSON-LD are flattened into individual objects.
- Fetch failures (network, HTTP errors) propagate as exceptions — wrap the call site if you want to swallow them (see the sketch below).
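One way to swallow per-URL failures: the library's concrete exception types aren't documented here, so this sketch catches broadly and should be narrowed in real code.

```python
from pydantic import AnyHttpUrl

from sitemap_scanner.scanners.seo_scanner import SeoScanner


async def scan_tolerantly(seo: SeoScanner, urls: list[AnyHttpUrl]) -> list:
    pages: list = []
    for url in urls:
        try:
            # One URL per call, so a single failure only skips that URL.
            async for chunk in seo.iter_get(url, chunk_size=1):
                pages.extend(chunk)
        except Exception as exc:  # noqa: BLE001 - narrow this in practice
            print(f"skipped {url}: {exc}")
    return pages
```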
## Tests

```bash
uv run --group test pytest        # unit (default)
uv run --group test pytest -m e2e # network e2e (opt-in)
```

E2E tests auto-skip when no network is available; force skip with `NO_NETWORK=1`.
## Development

```bash
uv run ruff check src tests
uv run mypy src
```