pdf-data-extractor is an MIT-licensed Python CLI and library (v2.0.0) that turns any PDF into structured data: text, tables, images, links, forms, signatures, metadata, and 23 regex-matched fields such as emails, phone numbers, national IDs, and IBANs. Output is JSON, JSONL, or CSV. It builds on production-proven libraries (PyMuPDF, pypdf, pdfplumber, Tesseract) and adds the parts that hurt in the real world: first-class pattern extraction with checksum validation (Luhn for card numbers, mod-97 for IBANs), size-bounded split-file output so huge results stay openable, and a resumable batched cache so a killed 50,000-page run picks up exactly where it stopped. It runs without a prior install via uvx or pipx, exposes a typed Python API, and offers three engines, ranging from a fast default to an LLM-grade semantic pipeline (IBM Docling). It is free, dependency-light, and designed for developers who just need to get the data out of a PDF, reliably and at scale.
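The checksum validation mentioned above can be illustrated with a minimal sketch. This is not pdf-data-extractor's own code or API: the function names `luhn_valid` and `iban_valid` are hypothetical, and the length bounds are assumptions chosen for illustration. It shows only the two standard algorithms the description names, applied to a candidate string after a regex match.

```python
import re


def luhn_valid(candidate: str) -> bool:
    """Luhn check for a card-number candidate (hypothetical helper)."""
    cleaned = re.sub(r"[\s-]", "", candidate)
    # Assumed bounds: reject anything that is not 12 to 19 ASCII digits.
    if not (cleaned.isascii() and cleaned.isdigit() and 12 <= len(cleaned) <= 19):
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second digit.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def iban_valid(candidate: str) -> bool:
    """Mod-97 check for an IBAN candidate (hypothetical helper)."""
    cleaned = re.sub(r"\s", "", candidate).upper()
    # Country code, two check digits, then 11 to 30 alphanumerics.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", cleaned):
        return False
    # Move the first four characters to the end, then map A=10 ... Z=35.
    rearranged = cleaned[4:] + cleaned[:4]
    numeric = "".join(str(int(char, 36)) for char in rearranged)
    return int(numeric) % 97 == 1


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))
    print(iban_valid("GB82 WEST 1234 5698 7654 32"))
```

Running a checksum after the regex match is what separates a plausible-looking digit run from a structurally valid card number or IBAN. Neither check confirms that the account or card actually exists.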
Category: other · Status: completed · Started: 2026-05 · Through: Present · Client: Open Source
Links: https://github.com/aoneahsan/pdf-data-extractor
Tags: python · pdf · data-extraction · ocr · cli · regex · developer-tools
Website: https://zaions.com
Email: aoneahsan@gmail.com
GitHub: github.com/aoneahsan
LinkedIn: linkedin.com/in/aoneahsan