PDF Parser Comparison

This project compares different PDF parsing libraries for text extraction accuracy, including support for multipage PDFs.
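The README does not say how accuracy is scored, so the following is only a minimal sketch of one possible approach: comparing a parser's output against a hand-checked reference text with Python's standard `difflib`. The function names `normalize` and `similarity` are hypothetical and are not part of this project.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Collapse all runs of whitespace so layout differences are ignored."""
    return " ".join(text.split())


def similarity(extracted: str, reference: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical after normalization.

    Assumption: character-level similarity is an acceptable proxy for
    extraction accuracy. The project itself may measure it differently.
    """
    matcher = SequenceMatcher(
        None, normalize(extracted), normalize(reference), autojunk=False
    )
    return matcher.ratio()
```

Under this assumption, each parser's output for the same PDF could be scored against the same reference text and the scores compared side by side.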

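Libraries included (one directory per parser):
- pdf-parse-new
- pdfjs
- pdfminer
- pdfplumber
- pdfreader
- pymupdf
- pypdf
- pypdf2
- unpdf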

For Python parsers, run: python <parser_name>/<parser_name>_parser.py <pdf_file> [-j]

For TypeScript/Node.js parsers, run: cd <parser_name> && npm run start -- <pdf_file> [-j]

The -j flag outputs the result in JSON format.

Examples:
python pypdf/pypdf_parser.py sample.pdf -j
cd pdfjs && npm run start -- ../sample.pdf -j
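As a rough illustration of the `<pdf_file> [-j]` interface, a Python parser script could be structured along these lines. This is a sketch, not the project's code: the helper names (`build_arg_parser`, `format_output`, `run`), the injected `extract` callable, and the `{"text": ...}` JSON shape are all assumptions.

```python
import argparse
import json
from typing import Callable, Optional, Sequence


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a CLI of the form: <pdf_file> [-j]."""
    parser = argparse.ArgumentParser(description="Extract text from a PDF.")
    parser.add_argument("pdf_file", help="path to the PDF to parse")
    parser.add_argument("-j", dest="as_json", action="store_true",
                        help="print the result as JSON")
    return parser


def format_output(text: str, as_json: bool) -> str:
    """Return the text as-is, or wrapped in a JSON object when -j is given."""
    if as_json:
        # Assumed schema: a single "text" key. The real parsers may differ.
        return json.dumps({"text": text}, ensure_ascii=False)
    return text


def run(argv: Optional[Sequence[str]], extract: Callable[[str], str]) -> str:
    """Parse arguments, call the library-specific extractor, format the result."""
    args = build_arg_parser().parse_args(argv)
    return format_output(extract(args.pdf_file), args.as_json)
```

For instance, `print(run(["sample.pdf", "-j"], extract=lambda path: "Hello"))` prints `{"text": "Hello"}`. Passing the extractor in as a callable keeps the command-line handling identical across libraries, so only the extraction function would change per parser.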

Ensure you have Python 3.7+ and Node.js 20.15+ installed.

For Python parsers:
- Activate the virtual environment: source .venv/bin/activate
- Install dependencies: pip install -r requirements.txt

For TypeScript/Node.js parsers:
- Run npm install in each parser's directory.

All parsers handle multipage PDFs and concatenate the text from all pages into a single output.
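The page-by-page concatenation described above might look like the following with pypdf. This is a sketch under assumptions: the helper names `join_pages` and `extract_all_pages` are hypothetical, and the newline separator is a guess, since the README does not say how pages are joined.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str], separator: str = "\n") -> str:
    """Concatenate per-page text into one string (separator is an assumption)."""
    return separator.join(page_texts)


def extract_all_pages(pdf_path: str) -> str:
    """Extract text from every page of a PDF and return it as a single string."""
    # Imported here so join_pages stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can yield an empty result for image-only pages;
    # fall back to "" so the join never fails.
    return join_pages((page.extract_text() or "") for page in reader.pages)
```

Keeping the join step separate from the library call means the same helper could be reused by the other Python parsers, each supplying its own per-page text.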
