sitemap-scanner (1.1.0)

Published 2026-04-29 08:55:49 +00:00 by forgejo-bot in CobraPack/sitemap-scanner

Installation

pip install --index-url  sitemap-scanner

About this package

Async sitemap crawler and SEO metadata extractor (Open Graph, Twitter Cards, JSON-LD)

sitemap-scanner

Async crawler for site sitemaps with built-in SEO metadata extraction.

Two composable scanners:

  • SitemapScanner — discovers and iterates URLs from a site's sitemap. Reads robots.txt to find sitemap locations or accepts sitemap.xml directly, recursively follows nested sitemap indexes, supports filters to prune branches or individual URLs.
  • SeoScanner — fetches HTML pages and extracts the full SEO/social metadata stack: Open Graph, Twitter Cards, classic HTML <meta>, JSON-LD (schema.org), canonical, hreflang, favicon. Exposes computed title / description / image fields with Google-style priority: og > twitter > meta > json_ld.

Both scanners share the BaseGetRepository[T].iter_get(...) interface and use cloudscraper under the hood for Cloudflare-protected sites.

Installation

uv add sitemap-scanner --index CobraPack

Quick start

import asyncio

from pydantic import AnyHttpUrl

from sitemap_scanner import SitemapScanner
from sitemap_scanner.scanners.seo_scanner import SeoScanner


async def main() -> None:
    sitemap = SitemapScanner()
    seo = SeoScanner(concurrency=10)

    async for sitemap_chunk in sitemap.iter_get(
        AnyHttpUrl("https://example.com")
    ):
        urls = [s.loc for s in sitemap_chunk]
        async for results in seo.iter_get(*urls, chunk_size=50):
            for page in results:
                print(page.title, page.image)


asyncio.run(main())

SitemapScanner

async for chunk in scanner.iter_get(
    url,
    chunk_size=1000,
    loc_filter=None,
    sitemap_filter=None,
):
    ...
  • url — site root, sitemap URL, or robots.txt. Auto-resolves https://example.com to https://example.com/robots.txt.
  • chunk_size — maximum number of SitemapURL entries yielded per chunk.
  • loc_filter(url, recursion_level, parent_urls) -> bool — keep/drop individual <loc> entries.
  • sitemap_filter(url, recursion_level, parent_urls) -> bool — keep/drop whole nested sitemap branches.

Yields Sequence[SitemapURL] (loc, lastmod, changefreq, priority).
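Filters are plain predicates matching the documented `(url, recursion_level, parent_urls) -> bool` signature. A sketch of two such predicates (the filtering rules themselves are made up for illustration):

```python
from collections.abc import Sequence


def loc_filter(url: str, recursion_level: int, parent_urls: Sequence[str]) -> bool:
    """Keep only blog URLs; drop every other <loc> entry."""
    return "/blog/" in url


def sitemap_filter(url: str, recursion_level: int, parent_urls: Sequence[str]) -> bool:
    """Prune deeply nested branches and image sitemaps entirely."""
    return recursion_level < 3 and "image" not in url


print(loc_filter("https://example.com/blog/post-1", 1, []))              # True
print(sitemap_filter("https://example.com/sitemap-images.xml", 1, []))   # False
```

Returning False from sitemap_filter skips the whole branch, so nested sitemaps under it are never fetched.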

SeoScanner

seo = SeoScanner(timeout=10.0, concurrency=10)

async for chunk in seo.iter_get(*urls, chunk_size=1000):
    for page in chunk:
        page.title          # computed: og > twitter > meta > json_ld
        page.description    # computed
        page.image          # computed AnyHttpUrl
        page.open_graph     # OpenGraphData
        page.twitter_card   # TwitterCardData
        page.meta           # HtmlMetaData (canonical, hreflang, favicon, ...)
        page.json_ld        # list[dict] - parsed schema.org objects

Schemas

  • PageMetadata — full snapshot with computed top-level fields.
  • OpenGraphData — og:title, og:description, og:image, og:url, og:type, og:site_name, og:locale.
  • TwitterCardData — twitter:card, twitter:title, twitter:description, twitter:image, twitter:site, twitter:creator.
  • HtmlMetaData — <title>, <meta name="description|keywords|author|robots">, <html lang>, <link rel="canonical|icon">, <link rel="alternate" hreflang=...>.

Behavior

  • All extracted URLs are resolved against the page URL (relative links produce absolute AnyHttpUrl).
  • Duplicate meta tags: first occurrence wins.
  • Malformed JSON-LD blocks are silently skipped.
  • @graph containers in JSON-LD are flattened into individual objects.
  • Fetch failures (network, HTTP errors) propagate as exceptions — wrap the call site if you want to swallow them.
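The JSON-LD behaviors above (skip malformed blocks, flatten @graph containers) can be sketched in a few lines of pure Python. This is an illustration of the documented behavior under assumed semantics, not the library's implementation; `parse_json_ld_blocks` is a hypothetical helper:

```python
import json


def parse_json_ld_blocks(blocks: list[str]) -> list[dict]:
    """Collect JSON-LD objects: silently skip malformed blocks and
    flatten @graph containers into individual objects."""
    objects: list[dict] = []
    for raw in blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed block: silently skipped
        if isinstance(data, dict) and "@graph" in data:
            # @graph container: flatten into its member objects
            objects.extend(o for o in data["@graph"] if isinstance(o, dict))
        elif isinstance(data, dict):
            objects.append(data)
    return objects


blocks = [
    '{"@graph": [{"@type": "Article"}, {"@type": "Person"}]}',
    "{not valid json",
    '{"@type": "WebSite"}',
]
print(len(parse_json_ld_blocks(blocks)))  # 3
```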

Tests

uv run --group test pytest              # unit (default)
uv run --group test pytest -m e2e       # network e2e (opt-in)

E2E tests auto-skip when no network is available; force skip with NO_NETWORK=1.

Development

uv run ruff check src tests
uv run mypy src

Requirements

Requires Python: >=3.12