sitemap-scanner (1.1.0)

Published 2026-04-29 08:55:49 +00:00 by forgejo-bot in CobraPack/sitemap-scanner

Installation

pip install --index-url  sitemap-scanner

About this package

Async sitemap crawler and SEO metadata extractor (Open Graph, Twitter Cards, JSON-LD)

sitemap-scanner

Async crawler for site sitemaps with built-in SEO metadata extraction.

Two composable scanners:

  • SitemapScanner — discovers and iterates URLs from a site's sitemap. Reads robots.txt to find sitemap locations or accepts sitemap.xml directly, recursively follows nested sitemap indexes, supports filters to prune branches or individual URLs.
  • SeoScanner — fetches HTML pages and extracts the full SEO/social metadata stack: Open Graph, Twitter Cards, classic HTML <meta>, JSON-LD (schema.org), canonical, hreflang, favicon. Exposes computed title / description / image fields with Google-style priority: og > twitter > meta > json_ld.

Both scanners share the BaseGetRepository[T].iter_get(...) interface and use cloudscraper under the hood for Cloudflare-protected sites.

Installation

uv add sitemap-scanner --index CobraPack

Quick start

import asyncio

from pydantic import AnyHttpUrl

from sitemap_scanner import SitemapScanner
from sitemap_scanner.scanners.seo_scanner import SeoScanner


async def main() -> None:
    sitemap = SitemapScanner()
    seo = SeoScanner(concurrency=10)

    async for sitemap_chunk in sitemap.iter_get(
        AnyHttpUrl("https://example.com")
    ):
        urls = [s.loc for s in sitemap_chunk]
        async for results in seo.iter_get(*urls, chunk_size=50):
            for page in results:
                print(page.title, page.image)


asyncio.run(main())

SitemapScanner

async for chunk in scanner.iter_get(
    url,
    chunk_size=1000,
    loc_filter=None,
    sitemap_filter=None,
):
    ...
  • url — site root, sitemap URL, or robots.txt. Auto-resolves https://example.com to https://example.com/robots.txt.
  • chunk_size — maximum number of SitemapURL entries yielded per chunk.
  • loc_filter(url, recursion_level, parent_urls) -> bool — keep/drop individual <loc> entries.
  • sitemap_filter(url, recursion_level, parent_urls) -> bool — keep/drop whole nested sitemap branches.

Yields Sequence[SitemapURL] (loc, lastmod, changefreq, priority).
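Filters are plain predicates matching the documented `(url, recursion_level, parent_urls) -> bool` signature. A sketch of two such predicates (the filtering rules themselves are made up for illustration):

```python
from collections.abc import Sequence


def loc_filter(url: str, recursion_level: int, parent_urls: Sequence[str]) -> bool:
    """Keep only blog URLs; drop every other <loc> entry."""
    return "/blog/" in url


def sitemap_filter(url: str, recursion_level: int, parent_urls: Sequence[str]) -> bool:
    """Prune deeply nested branches and image sitemaps entirely."""
    return recursion_level < 3 and "image" not in url


print(loc_filter("https://example.com/blog/post-1", 1, []))              # True
print(sitemap_filter("https://example.com/sitemap-images.xml", 1, []))   # False
```

Returning False from sitemap_filter skips the whole branch, so nested sitemaps under it are never fetched.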

SeoScanner

seo = SeoScanner(timeout=10.0, concurrency=10)

async for chunk in seo.iter_get(*urls, chunk_size=1000):
    for page in chunk:
        page.title          # computed: og > twitter > meta > json_ld
        page.description    # computed
        page.image          # computed AnyHttpUrl
        page.open_graph     # OpenGraphData
        page.twitter_card   # TwitterCardData
        page.meta           # HtmlMetaData (canonical, hreflang, favicon, ...)
        page.json_ld        # list[dict] - parsed schema.org objects

Schemas

  • PageMetadata — full snapshot with computed top-level fields.
  • OpenGraphData — og:title, og:description, og:image, og:url, og:type, og:site_name, og:locale.
  • TwitterCardData — twitter:card, twitter:title, twitter:description, twitter:image, twitter:site, twitter:creator.
  • HtmlMetaData — <title>, <meta name="description|keywords|author|robots">, <html lang>, <link rel="canonical|icon">, <link rel="alternate" hreflang=...>.

Behavior

  • All extracted URLs are resolved against the page URL (relative links produce absolute AnyHttpUrl).
  • Duplicate meta tags: first occurrence wins.
  • Malformed JSON-LD blocks are silently skipped.
  • @graph containers in JSON-LD are flattened into individual objects.
  • Fetch failures (network, HTTP errors) propagate as exceptions — wrap the call site if you want to swallow them.
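The JSON-LD behaviors above (skip malformed blocks, flatten @graph containers) can be sketched in a few lines of pure Python. This is an illustration of the documented behavior under assumed semantics, not the library's implementation; `parse_json_ld_blocks` is a hypothetical helper:

```python
import json


def parse_json_ld_blocks(blocks: list[str]) -> list[dict]:
    """Collect JSON-LD objects: silently skip malformed blocks and
    flatten @graph containers into individual objects."""
    objects: list[dict] = []
    for raw in blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: silently skipped
        if isinstance(data, dict) and "@graph" in data:
            # @graph container: flatten into its member objects
            objects.extend(o for o in data["@graph"] if isinstance(o, dict))
        elif isinstance(data, dict):
            objects.append(data)
    return objects


blocks = [
    '{"@graph": [{"@type": "Article"}, {"@type": "Person"}]}',
    "{not valid json",
    '{"@type": "WebSite"}',
]
print(len(parse_json_ld_blocks(blocks)))  # 3
```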

Tests

uv run --group test pytest              # unit (default)
uv run --group test pytest -m e2e       # network e2e (opt-in)

E2E tests auto-skip when no network is available; force skip with NO_NETWORK=1.

Development

uv run ruff check src tests
uv run mypy src

Requirements

Requires Python: >=3.12