This project compares PDF parsing libraries on text extraction accuracy, including on multipage PDFs. The parsers compared are:
- PyPDF (Python)
- PyMuPDF (Python)
- PDF.js (TypeScript/Node.js using pdfjs-dist)
- pdfplumber (Python)
- pdfreader (TypeScript/Node.js)
- PyPDF2 (Python)
- pdfminer.six (Python)
- pdf-parse-new (TypeScript/Node.js)
- unpdf (TypeScript/Node.js)
For Python parsers, run: python <parser_name>/<parser_name>_parser.py <pdf_file> [-j]
For TypeScript/Node.js parsers, run: cd <parser_name> && npm run start -- <pdf_file> [-j]
The -j flag outputs the result in JSON format.
Examples:
- python pypdf/pypdf_parser.py sample.pdf -j
- cd pdfjs && npm run start -- ../sample.pdf -j
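The command-line interface described above (a positional PDF path plus an optional -j flag) could be handled along the following lines. This is a minimal sketch, not the code the parsers in this repository necessarily use: the function names build_arg_parser and format_output are hypothetical, and the JSON shape ({"text": ...}) is an assumption, since the actual output schema is not documented here.

```python
import argparse
import json


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a CLI parser that accepts <pdf_file> and an optional -j flag."""
    parser = argparse.ArgumentParser(description="Extract text from a PDF file.")
    parser.add_argument("pdf_file", help="path to the PDF to parse")
    # dest="json" stores the flag as args.json instead of args.j
    parser.add_argument("-j", dest="json", action="store_true",
                        help="print the result as JSON")
    return parser


def format_output(text: str, as_json: bool) -> str:
    """Return the extracted text either as-is or wrapped in a JSON object."""
    if as_json:
        # Assumed shape: a single "text" key. The real parsers may differ.
        return json.dumps({"text": text}, ensure_ascii=False)
    return text
```

A parser script would then call build_arg_parser().parse_args(), extract the text from args.pdf_file, and print format_output(text, args.json).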
- Ensure you have Python 3.7+ and Node.js 20.15+ installed.
- For Python parsers:
  - Activate the virtual environment: source .venv/bin/activate
  - Install dependencies: pip install -r requirements.txt
- For TypeScript/Node.js parsers: run npm install in each parser's directory
All parsers handle multipage PDFs and concatenate the text from all pages into a single output.
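As an illustration of the page-by-page concatenation described above, here is a minimal sketch using pypdf. It is not taken from this repository: the function names join_pages and extract_all_text are hypothetical, and the newline separator between pages is an assumption, since the separator each parser actually uses is not stated here.

```python
from typing import Iterable


def join_pages(page_texts: Iterable[str], separator: str = "\n") -> str:
    """Concatenate per-page text into one string, preserving page order."""
    return separator.join(page_texts)


def extract_all_text(pdf_path: str) -> str:
    """Extract text from every page of a PDF and return it as one string."""
    # Imported lazily so join_pages stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # Fall back to "" so a page with no extractable text cannot break the join.
    return join_pages(page.extract_text() or "" for page in reader.pages)
```

Keeping the join step separate from the library call makes it easy to apply the same concatenation rule to every parser, which matters when comparing their output.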