PDF Parser Comparison

This project compares different PDF parsing libraries for text extraction accuracy, including support for multipage PDFs.

Libraries included:

For Python parsers, run: python <parser_name>/<parser_name>_parser.py <pdf_file> [-j]

For TypeScript/Node.js parsers, run: cd <parser_name> && npm run start -- <pdf_file> [-j]

The -j flag outputs the result in JSON format.

Examples:
- python pypdf/pypdf_parser.py sample.pdf -j
- cd pdfjs && npm run start -- ../sample.pdf -j
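
As a rough illustration of how a Python parser script might handle the -j flag, here is a minimal sketch using only the standard library. The function names (build_arg_parser, format_output) and the {"text": ...} JSON shape are assumptions made for this example; the actual parsers in this project may structure their output differently.

```python
import argparse
import json


def build_arg_parser() -> argparse.ArgumentParser:
    """Build a CLI parser matching the usage: <script> <pdf_file> [-j]."""
    parser = argparse.ArgumentParser(description="Extract text from a PDF file.")
    parser.add_argument("pdf_file", help="path to the PDF to parse")
    parser.add_argument(
        "-j",
        dest="as_json",
        action="store_true",
        help="output the result in JSON format",
    )
    return parser


def format_output(text: str, as_json: bool) -> str:
    """Return the extracted text as-is, or wrapped in a JSON object."""
    if as_json:
        # Assumed output shape: a single object with a "text" key.
        return json.dumps({"text": text}, ensure_ascii=False)
    return text
```

A script could then call build_arg_parser().parse_args() and print format_output(extracted_text, args.as_json), where extracted_text is whatever the chosen library returned.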

Ensure you have Python 3.7+ and Node.js 20.15+ installed.
For Python parsers:
- Activate the virtual environment: source .venv/bin/activate
- Install dependencies: pip install -r requirements.txt
For TypeScript/Node.js parsers:
- Install dependencies: run npm install in the respective parser directory

All parsers handle multipage PDFs and concatenate the text from all pages into a single output.
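
The page concatenation described above could look roughly like the following sketch for the pypdf parser. The helper names (join_pages, extract_text) and the newline separator between pages are assumptions for illustration, not necessarily what this project's scripts do. PdfReader, reader.pages, and page.extract_text() are part of the pypdf API; pypdf is imported inside the function so the joining logic can be used without it installed.

```python
from typing import Iterable, Optional


def join_pages(page_texts: Iterable[Optional[str]], separator: str = "\n") -> str:
    """Concatenate per-page text into one string.

    Pages with no extractable text (None or "") contribute an empty string.
    The newline separator is an assumption made for this sketch.
    """
    return separator.join(text or "" for text in page_texts)


def extract_text(pdf_path: str) -> str:
    """Read every page of the PDF and return the combined text."""
    # Third-party dependency, imported lazily: pip install pypdf
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return join_pages(page.extract_text() for page in reader.pages)
```

Keeping the joining step separate from the library call makes it straightforward to apply the same concatenation rule to every parser being compared, so that differences in output reflect extraction quality rather than formatting.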