# sitemap-scanner
Async crawler for site sitemaps with built-in SEO metadata extraction.
Two composable scanners:

- `SitemapScanner` — discovers and iterates URLs from a site's sitemap. Reads `robots.txt` to find sitemap locations or accepts a `sitemap.xml` URL directly, recursively follows nested sitemap indexes, and supports filters to prune branches or individual URLs.
- `SeoScanner` — fetches HTML pages and extracts the full SEO/social metadata stack: Open Graph, Twitter Cards, classic HTML `<meta>`, JSON-LD (schema.org), canonical, `hreflang`, favicon. Exposes computed `title`/`description`/`image` fields with Google-style priority: `og > twitter > meta > json_ld`.
Both scanners share the `BaseGetRepository[T].iter_get(...)` interface and use `cloudscraper` under the hood for Cloudflare-protected sites.
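Because the interface is shared, helpers can be written once and reused with either scanner. A minimal sketch, assuming `BaseGetRepository` is exported from the package root (the import path is a guess; adjust to wherever the package actually exposes it):

```python
from typing import Sequence, TypeVar

# Assumption: BaseGetRepository is importable from the package root.
from sitemap_scanner import BaseGetRepository

T = TypeVar("T")


async def first_chunk(repo: BaseGetRepository[T], *args, **kwargs) -> Sequence[T]:
    # Works with any scanner that implements iter_get, i.e. SitemapScanner
    # or SeoScanner; returns the first yielded chunk, or an empty list.
    async for chunk in repo.iter_get(*args, **kwargs):
        return chunk
    return []
```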
## Installation

```bash
uv add sitemap-scanner --index CobraPack
```
## Quick start

```python
import asyncio

from pydantic import AnyHttpUrl

from sitemap_scanner import SitemapScanner
from sitemap_scanner.scanners.seo_scanner import SeoScanner


async def main() -> None:
    sitemap = SitemapScanner()
    seo = SeoScanner(concurrency=10)

    async for sitemap_chunk in sitemap.iter_get(
        AnyHttpUrl("https://example.com")
    ):
        urls = [s.loc for s in sitemap_chunk]
        async for results in seo.iter_get(*urls, chunk_size=50):
            for page in results:
                print(page.title, page.image)


asyncio.run(main())
```
## SitemapScanner

```python
async for chunk in scanner.iter_get(
    url,
    chunk_size=1000,
    loc_filter=None,
    sitemap_filter=None,
):
    ...
```
- `url` — site root, sitemap URL, or `robots.txt` URL. Auto-resolves `https://example.com` to `https://example.com/robots.txt`.
- `loc_filter(url, recursion_level, parent_urls) -> bool` — keep/drop individual `<loc>` entries.
- `sitemap_filter(url, recursion_level, parent_urls) -> bool` — keep/drop whole nested sitemap branches (see the sketch below).

Yields `Sequence[SitemapURL]` (`loc`, `lastmod`, `changefreq`, `priority`).
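A filter sketch: the callback signatures follow the list above, but the URL patterns and helper names are illustrative only.

```python
from pydantic import AnyHttpUrl

from sitemap_scanner import SitemapScanner


def skip_blog_branches(url, recursion_level, parent_urls) -> bool:
    # Prune any nested sitemap whose URL mentions "blog" before fetching it.
    return "blog" not in str(url)


def keep_product_pages(url, recursion_level, parent_urls) -> bool:
    # Keep only <loc> entries under /products/ (illustrative pattern).
    return "/products/" in str(url)


async def collect_product_urls() -> list[AnyHttpUrl]:
    scanner = SitemapScanner()
    urls: list[AnyHttpUrl] = []
    async for chunk in scanner.iter_get(
        AnyHttpUrl("https://example.com"),
        loc_filter=keep_product_pages,
        sitemap_filter=skip_blog_branches,
    ):
        urls.extend(s.loc for s in chunk)
    return urls
```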
## SeoScanner

```python
seo = SeoScanner(timeout=10.0, concurrency=10)

async for chunk in seo.iter_get(*urls, chunk_size=1000):
    for page in chunk:
        page.title         # computed: og > twitter > meta > json_ld
        page.description   # computed
        page.image         # computed AnyHttpUrl
        page.open_graph    # OpenGraphData
        page.twitter_card  # TwitterCardData
        page.meta          # HtmlMetaData (canonical, hreflang, favicon, ...)
        page.json_ld       # list[dict] - parsed schema.org objects
```
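The computed fields make it easy to flatten results for export. A sketch, assuming `page.image` can be `None` when no image is found (not stated above):

```python
import json

from pydantic import AnyHttpUrl

from sitemap_scanner.scanners.seo_scanner import SeoScanner


async def dump_metadata(seo: SeoScanner, urls: list[AnyHttpUrl]) -> None:
    # Stream pages and print one JSON record per page, using only the
    # computed top-level fields.
    async for chunk in seo.iter_get(*urls, chunk_size=100):
        for page in chunk:
            record = {
                "title": page.title,
                "description": page.description,
                "image": str(page.image) if page.image else None,
            }
            print(json.dumps(record))
```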
## Schemas

- `PageMetadata` — full snapshot with computed top-level fields.
- `OpenGraphData` — `og:title`, `og:description`, `og:image`, `og:url`, `og:type`, `og:site_name`, `og:locale`.
- `TwitterCardData` — `twitter:card`, `twitter:title`, `twitter:description`, `twitter:image`, `twitter:site`, `twitter:creator`.
- `HtmlMetaData` — `<title>`, `<meta name="description|keywords|author|robots">`, `<html lang>`, `<link rel="canonical|icon">`, `<link rel="alternate" hreflang=...>`.
## Behavior

- All extracted URLs are resolved against the page URL (relative links produce absolute `AnyHttpUrl`).
- Duplicate meta tags: first occurrence wins.
- Malformed JSON-LD blocks are silently skipped.
- `@graph` containers in JSON-LD are flattened into individual objects.
- Fetch failures (network, HTTP errors) propagate as exceptions — wrap the call site if you want to swallow them (see the sketch below).
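One way to swallow per-URL failures: the library's concrete exception types aren't documented here, so this sketch catches broadly and should be narrowed in real code.

```python
from pydantic import AnyHttpUrl

from sitemap_scanner.scanners.seo_scanner import SeoScanner


async def scan_tolerantly(seo: SeoScanner, urls: list[AnyHttpUrl]) -> list:
    pages: list = []
    for url in urls:
        try:
            # One URL per call, so a single failure only skips that URL.
            async for chunk in seo.iter_get(url, chunk_size=1):
                pages.extend(chunk)
        except Exception as exc:  # noqa: BLE001 - narrow this in practice
            print(f"skipped {url}: {exc}")
    return pages
```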
## Tests

```bash
uv run --group test pytest        # unit (default)
uv run --group test pytest -m e2e # network e2e (opt-in)
```

E2E tests auto-skip when no network is available; force skip with `NO_NETWORK=1`.
## Development

```bash
uv run ruff check src tests
uv run mypy src
```