Ly-Lynn/Mely-PDF-Miner

MinerU

Project Overview

MinerU is a tool that converts PDF documents into machine-readable formats such as Markdown and JSON, so that their content can be extracted and reused in other formats. It was initially developed during the pre-training phase of InternLM to tackle symbol conversion in scientific literature, and it is intended to support data processing for large-scale models.

The Mely version of MinerU is tailored for converting Vietnamese PDF content into Markdown, as groundwork for future Retrieval-Augmented Generation (RAG) implementations.

Workflow Overview

In testing on various types of Vietnamese PDFs, we found that MinerU performs well on text-based PDFs but poorly on PDFs that require OCR. We therefore focused on enhancing PaddleOCR to better handle Vietnamese text.
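As a rough illustration of the text-based versus OCR split described above, the sketch below decides which path a page might take from the text already extracted from its embedded text layer. It is not taken from MinerU or this repository: the function names `looks_text_based` and `choose_route` and the thresholds `min_chars` and `min_readable_ratio` are hypothetical, and the heuristic is only one plausible way to make the decision.

```python
import unicodedata


def looks_text_based(extracted_text: str,
                     min_chars: int = 50,
                     min_readable_ratio: float = 0.6) -> bool:
    """Hypothetical heuristic: treat a page as text-based when its embedded
    text layer is long enough and consists mostly of letters, digits, and
    whitespace."""
    # NFC composes Vietnamese base letters with their diacritics, so each
    # accented letter is counted as a single alphanumeric character.
    text = unicodedata.normalize("NFC", extracted_text).strip()
    if len(text) < min_chars:
        return False
    readable = sum(ch.isalnum() or ch.isspace() for ch in text)
    return readable / len(text) >= min_readable_ratio


def choose_route(extracted_text: str) -> str:
    """Return "text" for the direct extraction path, "ocr" otherwise."""
    return "text" if looks_text_based(extracted_text) else "ocr"
```

A page with an empty or mostly unreadable text layer would be sent to the OCR path, where the Vietnamese-tuned PaddleOCR model applies. The thresholds are placeholders and would need tuning on real documents.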

The diagram below outlines the process flow for modifying and converting Vietnamese PDF OCR content within MinerU:

Installation Guide

To begin using MinerU, please refer to the detailed installation guide provided in the original repository.
