PDF Data Extractor — other project

pdf-data-extractor is an MIT-licensed Python CLI and library (v2.0.0) that turns PDFs into structured data: text, tables, images, links, forms, signatures, metadata, and 23 regex-matched field types such as emails, phone numbers, national IDs, and IBANs. Output is written as JSON, JSONL, or CSV. It builds on production-proven libraries (PyMuPDF, pypdf, pdfplumber, Tesseract) and adds the parts that hurt in real-world use: first-class pattern extraction with checksum validation (Luhn for card numbers, mod-97 for IBANs), size-bounded split-file output so that very large results stay openable, and a resumable batched cache so that a killed 50,000-page run picks up exactly where it stopped. It runs without installation via uvx or pipx, exposes a typed Python API, and offers three engines, ranging from a fast default to an LLM-grade semantic pipeline (IBM Docling). It is free, dependency-light, and designed for developers who need to get data out of PDFs reliably and at scale.
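
To make the checksum validation mentioned above concrete, here is a minimal sketch of the two named checks, Luhn and mod-97. The function names luhn_valid and iban_valid are hypothetical and are not taken from pdf-data-extractor's API; the project's own implementation may differ in details such as per-country IBAN length rules.

```python
import re


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn check.

    Hypothetical helper for illustration; non-digit characters such as
    spaces or dashes are ignored.
    """
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


def iban_valid(iban: str) -> bool:
    """Return True if `iban` passes the mod-97 check.

    Hypothetical helper for illustration; it checks the general shape and
    the checksum only, not country-specific lengths.
    """
    compact = re.sub(r"\s+", "", iban).upper()
    if not re.fullmatch(r"[A-Z]{2}\d{2}[A-Z0-9]{11,30}", compact):
        return False
    # Move the country code and check digits to the end.
    rearranged = compact[4:] + compact[:4]
    # Letters map to 10..35 (A=10, ..., Z=35); int(ch, 36) does exactly that.
    numeric = "".join(str(int(ch, 36)) for ch in rearranged)
    return int(numeric) % 97 == 1
```

A regex match would then be kept only if the matching validator returns True, which filters out digit runs that merely look like card numbers or IBANs.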

Tech stack

  • Python
  • PyMuPDF
  • pypdf
  • pdfplumber
  • Tesseract
  • Pillow
  • IBM Docling

Key features

  • Text, tables, images, forms, signatures & metadata
  • 23 built-in regex patterns (email, IBAN, national IDs)
  • Checksum-validated matches (Luhn, mod-97)
  • Three output formats (JSON/JSONL/CSV) with split-file rolling
  • Resumable batched extraction with atomic cache
  • Optional OCR for scanned PDFs
  • Zero-install via uvx/pipx + typed Python API
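
One way resumable batched extraction with an atomic cache can work is sketched below. This is an assumption-laden illustration, not the project's actual code: write_atomic, run_batches, the batch file naming, and the extract_page callback are all hypothetical names introduced here.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def write_atomic(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename it over `path`."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
            handle.flush()
            os.fsync(handle.fileno())
        # os.replace is atomic when source and target share a filesystem,
        # so readers see either no batch file or a complete one.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise


def run_batches(
    page_count: int,
    batch_size: int,
    cache_dir: Path,
    extract_page: Callable[[int], dict],
) -> int:
    """Process pages in batches, skipping batches already in the cache.

    Returns the number of batches computed during this call.
    `extract_page` is a hypothetical per-page extraction callback.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    computed = 0
    for start in range(0, page_count, batch_size):
        stop = min(start + batch_size, page_count)
        batch_file = cache_dir / f"batch_{start:08d}_{stop:08d}.json"
        if batch_file.exists():
            continue  # finished in an earlier run
        pages = {str(n): extract_page(n) for n in range(start, stop)}
        write_atomic(batch_file, {"start": start, "stop": stop, "pages": pages})
        computed += 1
    return computed
```

Because a batch file only appears once it is complete, a run killed mid-batch loses at most that one batch, and calling run_batches again with the same cache directory continues from the first missing batch.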

Category: other · Status: completed · Started: 2026-05 · Through: Present · Client: Open Source

Links: https://github.com/aoneahsan/pdf-data-extractor

Tags: python · pdf · data-extraction · ocr · cli · regex · developer-tools


Contact

Website: https://zaions.com
Email: aoneahsan@gmail.com
GitHub: github.com/aoneahsan
LinkedIn: linkedin.com/in/aoneahsan