PDF Data Extractor — other project

pdf-data-extractor is an MIT-licensed Python CLI and library (v2.0.0) that extracts structured data from PDFs: text, tables, images, links, forms, signatures, metadata, and 23 regex-matched field types such as emails, phone numbers, national IDs, and IBANs. Output is written as JSON, JSONL, or CSV.

It builds on production-proven libraries (PyMuPDF, pypdf, pdfplumber, Tesseract) and adds the parts that are hard in practice: first-class pattern extraction with checksum validation (Luhn for card numbers, mod-97 for IBANs), size-bounded split-file output so that very large results stay openable, and a resumable batched cache so that a killed 50,000-page run resumes where it stopped.

It runs without installation via uvx or pipx, exposes a typed Python API, and offers three engines, ranging from a fast default to an LLM-grade semantic pipeline (IBM Docling). It is free, dependency-light, and designed for developers who need to get data out of PDFs reliably and at scale.
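To make the checksum validation mentioned above concrete, here is a minimal sketch of the two standard algorithms: Luhn for card numbers and mod-97 for IBANs. The function names luhn_valid and iban_valid are hypothetical and are not taken from the project's API; the sketch only shows how a regex match can be confirmed or rejected after it is found.

```python
import re


def luhn_valid(number: str) -> bool:
    """Luhn check (hypothetical helper, not the project's API)."""
    digits = [int(c) for c in number if c.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9  # same as summing the two digits of the product
        total += d
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """ISO 13616 mod-97 check (hypothetical helper, not the project's API)."""
    s = re.sub(r"\s+", "", iban).upper()
    # Country code, two check digits, then 11 to 30 alphanumerics.
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", s):
        return False
    # Move the first four characters to the end, then map A..Z to 10..35.
    rearranged = s[4:] + s[:4]
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

A regex alone accepts any string with the right shape; running a check like this afterwards discards matches that merely look like a card number or IBAN. Note that the sketch does not verify per-country IBAN lengths.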

Tech stack

Python
PyMuPDF
pypdf
pdfplumber
Tesseract
Pillow
IBM Docling

Key features

Text, tables, images, forms, signatures & metadata
23 built-in regex patterns (email, IBAN, national IDs)
Checksum-validated matches (Luhn, mod-97)
Three output formats (JSON/JSONL/CSV) with split-file rolling
Resumable batched extraction with atomic cache
Optional OCR for scanned PDFs
Zero-install via uvx/pipx + typed Python API
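
The resumable batched extraction listed above can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the names atomic_write_json and run_batches, the batch file naming, and the extract_page callback are all hypothetical. The idea is that each batch is written to a temporary file and then renamed into place, so a cache file either exists in full or not at all, and a restart skips every batch that already has a file.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write JSON so that `path` is never observed half-written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays
    # on one filesystem.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)  # atomic rename on POSIX and Windows
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise


def run_batches(
    total_pages: int,
    batch_size: int,
    cache_dir: Path,
    extract_page: Callable[[int], str],  # hypothetical per-page extractor
) -> int:
    """Process pages in batches; return how many pages this call processed."""
    processed = 0
    for start in range(0, total_pages, batch_size):
        stop = min(start + batch_size, total_pages)
        out = cache_dir / f"batch_{start:08d}_{stop:08d}.json"
        if out.exists():
            continue  # finished in an earlier run, skip it
        pages = {str(n): extract_page(n) for n in range(start, stop)}
        atomic_write_json(out, {"start": start, "stop": stop, "pages": pages})
        processed += stop - start
    return processed
```

With this layout, a run killed partway through a batch loses at most that one batch of work; calling run_batches again with the same arguments redoes only the batches that have no cache file.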

Category: other · Status: completed · Started: 2026-05 · Through: Present · Client: Open Source

Links: https://github.com/aoneahsan/pdf-data-extractor

Tags: python · pdf · data-extraction · ocr · cli · regex · developer-tools

Contact

Website: https://zaions.com
Email: aoneahsan@gmail.com
GitHub: github.com/aoneahsan
LinkedIn: linkedin.com/in/aoneahsan