Releases · opendatalab/MinerU

27 Nov 10:33

myhloli

magic_pdf-0.10.2-released

8afff9a

magic_pdf-0.10.2-released Latest


What's Changed

fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block. by @myhloli in #1082
refactor(txt_spans_extract_v2): optimize span processing and OCR logic by @myhloli in #1086
feat(ocr): filter out low confidence ocr results by @myhloli in #1088
feat(pdf_parse): add OCR score to span data by @myhloli in #1089
fix: test_rag by @icecraft in #1105
perf(image_processing): reduce maximum image size for analysis by @myhloli in #1106
fix: test_tools unittest by @icecraft in #1104
refactor(libs): remove unused imports and functions by @myhloli in #1112
Feat/add s3 read write example by @icecraft in #1117
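
The two OCR entries above (#1088 and #1089) can be pictured with a minimal Python sketch. It assumes each span is a dict with an optional `score` key. The key name, the threshold value, and the function `filter_low_confidence` are illustrative assumptions, not MinerU's actual API.

```python
from typing import Dict, List

# Hypothetical threshold: the release notes do not state the value MinerU uses.
MIN_OCR_SCORE = 0.5


def filter_low_confidence(spans: List[Dict], min_score: float = MIN_OCR_SCORE) -> List[Dict]:
    """Keep spans whose OCR score is at least min_score.

    Spans without a score (for example, text read directly from the PDF)
    are treated as fully trusted and kept.
    """
    return [span for span in spans if span.get("score", 1.0) >= min_score]


spans = [
    {"content": "Introduction", "score": 0.98},
    {"content": "l1|I", "score": 0.21},   # likely OCR noise
    {"content": "native text"},           # no OCR score attached
]
kept = filter_low_confidence(spans)
```

Storing the score on the span, rather than discarding it after filtering, lets later stages apply a stricter threshold of their own.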

Full Changelog: magic_pdf-0.10.1-released...magic_pdf-0.10.2-released

Contributors

myhloli and icecraft


25 Nov 03:41

myhloli

magic_pdf-0.10.1-released

4dcf31b

magic_pdf-0.10.1-released

What's Changed

Fix/demo by @icecraft in #1071
feat(demo): add visualization bbox parameter and refactor parsing process by @myhloli in #1074
demo: batch process demo PDFs by @myhloli in #1075

Full Changelog: magic_pdf-0.10.0-released...magic_pdf-0.10.1-released

Contributors

myhloli and icecraft


22 Nov 09:54

myhloli

magic_pdf-0.10.0-released

158e556

magic_pdf-0.10.0-released

What's Changed

fix: fix issue #715 by @LollipopsAndWine in #971
docs(README): update GPU hardware recommendations and table recognition options by @myhloli in #973
docs: improve GPU support list formatting in README_zh-CN.md by @myhloli in #974
docs: update feature description for table conversion by @myhloli in #975
docs: update readme by @myhloli in #977
update ci by @dt-yy in #986
test(unitest): Restore unit test cases by @myhloli in #998
refactor(tests): extract common test utilities into test_commons.py by @myhloli in #1001
feat(ocr): improve handling of angled text boxes by @myhloli in #1010
refactor(para): improve paragraph splitting logic by @myhloli in #1013
build(setup): add old_linux specific dependencies by @myhloli in #1016
refactor(para): adjust right margin threshold based on block width by @myhloli in #1018
fix: using new data api replace old rw api by @icecraft in #1006
delete unused pipeline file by @liugongjian in #1024
refactor: move some constants or enums defs to config folder by @icecraft in #1027
fix: remove test code by @icecraft in #1036
fix(tools): handle empty language string in common.py by @myhloli in #1045
refactor(ocr_dict_merge): add threshold parameter for line merging by @myhloli in #1046
fix(ocr_mkcontent): improve hyphen handling at line ends by @myhloli in #1047
fix(remove_overlaps_min_spans): optimize overlap detection in OCR span list modification by @myhloli in #1048
feat(ocr): improve text detection and OCR accuracy by @myhloli in #1049
refactor(txt_parse): improve text extraction accuracy with new algorithm by @myhloli in #1050
fix: use concrete class instead of abstract class by @icecraft in #1052
fix(pdf_parse): improve line stop flag detection accuracy by @myhloli in #1053
test: comment out assertions for metascan classify and meta scan tests by @myhloli in #1054
Add test cases to json compressor util by @liugongjian in #1056
refactor(para): improve line stop flag and remove unused debug mode by @myhloli in #1058
fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1060
refactor(model): move page total time logging to custom model analysis by @myhloli in #1061
fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1062
fix(pdf_parse): improve OCR result handling by @myhloli in #1064

New Contributors

@liugongjian made their first contribution in #1024

Full Changelog: magic_pdf-0.9.3-released...magic_pdf-0.10.0-released

Contributors

liugongjian, myhloli, and 3 other contributors


15 Nov 11:27

myhloli

magic_pdf-0.9.3-released

845a3ff

magic_pdf-0.9.3-released

What's Changed

feat(model): add xycut algorithm for block sorting by @myhloli in #898
refactor(pdf_parse): adjust line count threshold for layoutreader by @myhloli in #902
Feat/add en docs by @icecraft in #906
feat: using next_docs by @icecraft in #907
feat(table): integrate RapidTable model for table recognition by @myhloli in #910
fix(gradio-app): add missing file type in upload by @myhloli in #911
refactor(magic_pdf_parse_main): optimize model data handling and JSON output by @myhloli in #912
Modify the test directory by @DTwz in #913
test(table): improve ppTableModel test coverage by @myhloli in #914
feat(table): add RapidOCR support for RapidTable model by @myhloli in #915
Add DocLayout-YOLO hyperlink by @qiangqiang199 in #889
fix: remove classes hierarchy diagram by @icecraft in #919
refactor(model download script) by @myhloli in #922
docs(readme): update table recognition configuration and documentation by @myhloli in #924
docs(README_ja-JP.md): update warning message and remove outdated content by @myhloli in #925
Update para_split_v3.py by @hyastar in #923
Style/docs by @icecraft in #927
docs: rewrite zh_cn docs without translate by @icecraft in #928
fix: typo by @icecraft in #931
fix: fix the download_models.py script path in the Dockerfile by @kimi360 in #938
build(Dockerfile): update model download script and dependencies by @myhloli in #941
fix(ocr_mkcontent): improve handling of single-character content #937 by @myhloli in #943
feat: tune docs by @icecraft in #948
fix(parse_pipeline): Resolve post-processing exceptions caused by partial PDFs due to file corruption or non-standard format by forcing a re-print. by @myhloli in #957
refactor(model): rename and restructure model modules by @myhloli in #964
docs: update docs for 0.9.3 by @myhloli in #965
docs(README): update project references and translations by @myhloli in #967

New Contributors

@DTwz made their first contribution in #913
@qiangqiang199 made their first contribution in #889
@hyastar made their first contribution in #923
@kimi360 made their first contribution in #938

Full Changelog: magic_pdf-0.9.2-released...magic_pdf-0.9.3-released

Contributors

kimi360, myhloli, and 4 other contributors


06 Nov 10:18

myhloli

magic_pdf-0.9.2-released

b25ff7a

magic_pdf-0.9.2-released

What's Changed

fix: add ci repository by @dt-yy in #869
fix(table_model_init): remove unused code by @myhloli in #882
docs(README): update version number and improve documentation formatting by @myhloli in #884

Full Changelog: magic_pdf-0.9.1-released...magic_pdf-0.9.2-released

Contributors

myhloli and dt-yy


06 Nov 04:07

myhloli

magic_pdf-0.9.1-released

069bcfe

magic_pdf-0.9.1-released

What's Changed

Feat/tune docs by @icecraft in #833
fix(ocr_mkcontent): improve content handling for different languages and equation types by @myhloli in #839
feat(list): improve list detection algorithm & fix(list): improve list identification accuracy by @myhloli in #843
docs(tutorial): update magic-pdf command with output directory by @myhloli in #844
feat(para_split_v3): improve list identification with block aspect ratio by @myhloli in #845
fix(dict2md): improve text concatenation logic by @myhloli in #847
Update pdf_extract_kit.py by @CiaranYoung in #853
feat(table): upgrade StructEqTable model and integrate into PDF Extract Kit by @myhloli in #854
feat(model): add HTML minification to StructTableModel by @myhloli in #855
chore: add .gitattributes to configure file linguist attributes by @myhloli in #856
fix(merge_text): add ligature replacement functionality #305 #241 by @myhloli in #857
chore: add CSS and SCSS files to linguist-vendored; update .gitattributes to mark CSS and SCSS files as vendored by @myhloli in #858
docs(README): update Colab demo link by @myhloli in #860
fix(table): improve table image processing by @myhloli in #866
docs(faq): add troubleshooting for illegal instruction error on Linux servers by @myhloli in #867
feat: replace the mineru_demo API documentation with a link by @LollipopsAndWine in #871
test(table): improve HTML validation for table extraction by @myhloli in #874
docs: update arXiv paper link in README files by @myhloli in #875
docs(README): update changelog for v0.9.1 release by @myhloli in #877
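
The ligature fix listed above (#857) addresses a common PDF extraction problem: a single glyph such as "ﬁ" comes out as one code point instead of two letters. A minimal Python sketch of the idea follows. The mapping table and the function name `replace_ligatures` are assumptions for illustration; the release notes do not say which characters MinerU replaces or how.

```python
# Illustrative mapping of common Latin ligature code points to plain letters.
# This table is an assumption, not MinerU's actual replacement list.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb06": "st",
}

# str.maketrans accepts a dict whose keys are single characters
# and whose values are replacement strings of any length.
_LIGATURE_TABLE = str.maketrans(LIGATURES)


def replace_ligatures(text: str) -> str:
    """Expand ligature code points into their component letters."""
    return text.translate(_LIGATURE_TABLE)


sample = "e\ufb03cient work\ufb02ow"
cleaned = replace_ligatures(sample)
```

An explicit table keeps the replacement limited to ligatures and leaves every other character untouched.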

New Contributors

@CiaranYoung made their first contribution in #853

Full Changelog: magic_pdf-0.9.0-released...magic_pdf-0.9.1-released

Contributors

myhloli, icecraft, and 2 other contributors


01 Nov 11:04

myhloli

magic_pdf-0.9.0-released

3a42ebb

magic_pdf-0.9.0-released

What's Changed

Update README_zh-CN.md (#404) by @drunkpig in #409
feat: add dockerfile by @Lincyaw in #189
fix(ocr_mkcontent): improve language detection and content formatting by @myhloli in #458
fix(self_modify): merge detection boxes for optimized text region detection by @myhloli in #448
fix(pdf-extract): adjust box threshold for OCR detection to fix OCR mode losing some lines by @myhloli in #447
feat: rename the file generated by command line tools by @icecraft in #401
fix(ocr_mkcontent): revise table caption output by @myhloli in #397
build(docker): update docker build step by @myhloli in #471
upload an introduction about chemical formula and update readme.md by @GDDGCZ518 in #489
fix: remove the default value of output option in tools/cli.py and to… by @icecraft in #494
feat: add test case by @dt-yy in #499
fixes #492 decrease span threshold for block filling by @myhloli in #500
fix(detect_all_bboxes): remove small overlapping blocks by merging by @myhloli in #501
feat(cli&analyze&pipeline): add start_page and end_page args for pagination by @myhloli in #507
Feat/support rag by @icecraft in #510
feat(gradio): add app by gradio by @myhloli in #512
fix: replace \u0002, \u0003 in common text by @drunkpig in #521
fix(end_page_id): Fix the issue where end_page_id is corrected to len-1 when its input is 0. by @myhloli in #518
fix(para): When an English line ends with a hyphen, do not add a space at the end. by @drunkpig in #523
Release: Release 0.7.1 version, update dev by @dt-yy in #527
Hotfix readme 0.7.1 by @Focusshang in #529
fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 by @papayalove in #542
fix: typo error in markdown by @icecraft in #536
fix(gradio): remove unused imports and simplify pdf display by @myhloli in #534
Feat/support footnote in figure by @icecraft in #532
refactor(pdf_extract_kit): implement singleton pattern for atomic models by @myhloli in #533
feat: mineru_web by @LollipopsAndWine in #555
features@add mineru gpu&web_api by @yanqiangmiffy in #568
docs(models_download): update model download instructions to use python script by @myhloli in #560
fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 by @papayalove in #574
feat(ocr): supports minority languages by @myhloli in #577
refactor(pdf_extract_kit): update model config and weight paths for UniMERNet-0.2.0 by @myhloli in #584
feat(gradio_app): add web app with PDF processing as a project by @myhloli in #579
fix: web_api by @LollipopsAndWine in #580
Release 0.8.0 by @drunkpig in #587
fix: 1. resolve incorrect pair relation of figure and footnote, 2. re… by @icecraft in #603
fix: recover the lang option in tools/cli.py by @icecraft in #604
fix: solve conflicts by @myhloli in #607
fix: remove useless files by @myhloli in #608
feat(gradio_app): add examples accordion to the PDF conversion interface by @myhloli in #597
feat(pipeline): pass language parameter for parsing and markdown conversion by @myhloli in #602
feat(ocr_mkcontent): support drop reason in none_with_reason mode by @myhloli in #630
feat(UNIPipe): change default drop_mode to NONE_WITH_REASON by @myhloli in #631
refactor(pdf_extract): use Image.crop directly with layout detection by @myhloli in #635
fix(pdf-extract): ensure model is set to evaluation mode before processing by @myhloli in #636
fix(pdf_extract_kit):change unimernet base -> small by @myhloli in #639
feat: add test case by @dt-yy in #645
feat: integrate the front-end interface and configure one-click startup by @LollipopsAndWine in #668
feat: delete unused files and update front-end style by @LollipopsAndWine in #669
docs: update project lists in README files to include web_api by @myhloli in #670
feat: add layoutreader to sort blocks by @myhloli in #672
refactor(model): improve timing information and performance by @myhloli in #690
feat: add arXiv paper link to header and adjust PDF parsing logic by @myhloli in #693
perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity by @myhloli in #694
fix: caption or footnote match algorithm by @icecraft in #695
fix: caption|footnote match algorithm by @icecraft in #696
feat(layoutreader): support local model directory and improve model loading by @myhloli in #698
feat(docs): automate model download and configuration by @myhloli in #699
docs: add filename to wget command in model download scripts by @myhloli in #700
docs: update CUDA acceleration guides and README content by @myhloli in #701
Update README_Windows_CUDA_Acceleration_en_US.md by @myhloli in #706
feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support by @myhloli in #716
Update how_to_download_models_zh_cn.md by @myhloli in #717
fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks by @myhloli in #718
feat: manager docs with sphinx by @icecraft in #737
feat(list&index block): detect and merge list and index blocks by @myhloli in #740
refactor(para_split_v3): merge list and index block detection by @myhloli in #743
fix(para_split_v3): refine list block detection in paragraph splitting by @myhloli in #744
update example files by @myhloli in #747
refactor(ocr): Increase the dilation factor in OCR to address the issue of word concatenation. by @myhloli in #753
refactor(para): improve paragraph splitting algorithm by @myhloli in #765
docs: Update the driver requirements on the Ubuntu system. by @myhloli in #766
update: update config json by @myhloli in #769
feat(model): add support for DocLayout-YOLO model by @myhloli in #773
build(setup): add doclayout_yolo dependency by @myhloli in #774
build(docker): add doclayout-yolo dependency by @myhloli in #776
feat: add support for non-PDF file conversion to PDF by @myhloli in #777
Feat/data api by @icecraft in #782
Feat/new table caption match by @icecraft in #784
refactor(parse_core): improve image and table block handling by @myhloli in #785
refactor(ocr): adjust OCR processing parameters by @myhloli in #786
fix: add init to magic_pdf.config by @myhloli in #788
fix: add init to magic_pdf.utils by @myhloli in #789
feat(draw_bbox): update bounding box drawing for tables and images by @myhloli in #791
Add multi_gpu process project by @randydl in #79...

Contributors

myhloli, icecraft, and 9 other contributors


09 Oct 08:58

myhloli

magic_pdf-0.8.1-update-docs

62aa1cb

magic_pdf-0.8.1-update-docs

What's Changed

refactor(docs): update model download instructions and configuration process by @myhloli in #707

Full Changelog: magic_pdf-0.8.1-released...magic_pdf-0.8.1-update-docs

Contributors

myhloli


12 Sep 14:00

myhloli

magic_pdf-0.8.1-released

c95f381

magic_pdf-0.8.1-released

What's Changed

fix:

resolve incorrect pairing of figures and footnotes
resolve incorrect pairing of tables and captions #590 by @icecraft in #599

Full Changelog: magic_pdf-0.8.0-released...magic_pdf-0.8.1-released

Contributors

icecraft


10 Sep 12:20

myhloli

magic_pdf-0.8.0-released

9f352df

magic_pdf-0.8.0-released

What's Changed

feat:

Add RAG API
Integrate RAG into the llama_index project
Update Dockerfile
Fine-grained model singletons, reducing memory usage and speeding up initialization
Add page-range parameters to the CLI and API, allowing custom start and end pages
Support image footnotes
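
The page-range feature above can be sketched in Python as follows. This is a minimal illustration, not MinerU's code: the function `resolve_page_range` is hypothetical, and the convention that `None` or a negative end value means "through the last page" is an assumption. The sketch treats 0 as a valid end index, which is the case fixed in #518. A truthiness test such as `end if end else last` would be one plausible way to mishandle it.

```python
from typing import Optional, Tuple


def resolve_page_range(
    start_page_id: int, end_page_id: Optional[int], page_count: int
) -> Tuple[int, int]:
    """Return an inclusive, zero-based (start, end) range clamped to the document.

    Assumption: None or a negative end_page_id means "through the last page".
    """
    if page_count <= 0:
        raise ValueError("document has no pages")
    last = page_count - 1
    # Compare against None explicitly: 0 is a valid page index and must not
    # be mistaken for "unset".
    if end_page_id is None or end_page_id < 0:
        end = last
    else:
        end = min(end_page_id, last)
    start = max(start_page_id, 0)
    if start > end:
        raise ValueError(f"start page {start} is after end page {end}")
    return start, end


first_page_only = resolve_page_range(0, 0, 10)
to_the_end = resolve_page_range(2, None, 10)
clamped = resolve_page_range(0, 99, 10)
```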

bugfix:

Retain the boundary information of the smaller block when removing it for overlap
Lower the span fill threshold for blocks from 0.6 to 0.3
Fix low-score lines being lost in the OCR detection stage
Merge multiple spans of a single line in the OCR detection stage
Improve segmentation of run-together English words
Fix inaccurate layout boxes
Fix incorrect merging of words split across line breaks
Fix special characters appearing in the final output
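
The line-break fix in the list above concerns how consecutive text lines are joined into a paragraph. A minimal Python sketch of one such rule follows, based on the behavior described in #523: when a line ends with a hyphen, no space is added before the next line. The function `join_lines` and its `drop_hyphen` option are hypothetical, and whether MinerU keeps or removes the hyphen is not stated in these notes.

```python
from typing import List


def join_lines(lines: List[str], drop_hyphen: bool = False) -> str:
    """Join text lines into one paragraph string.

    Lines are separated by a single space, except after a line that ends
    with a hyphen, where the next line is attached directly.
    drop_hyphen (an assumption, off by default) also removes that hyphen.
    """
    cleaned = [line.strip() for line in lines if line.strip()]
    parts: List[str] = []
    for i, text in enumerate(cleaned):
        is_last = i == len(cleaned) - 1
        if text.endswith("-") and not is_last:
            # Word continues on the next line: no separating space.
            parts.append(text[:-1] if drop_hyphen else text)
        elif is_last:
            parts.append(text)
        else:
            parts.append(text + " ")
    return "".join(parts)


kept_hyphen = join_lines(["a well-", "known method"])
dropped_hyphen = join_lines(["implemen-", "tation"], drop_hyphen=True)
plain = join_lines(["plain text", "next line"])
```

Neither setting is right for every word: keeping the hyphen suits compounds such as "well-known", while dropping it suits words that were only split for layout.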

Full Changelog: magic_pdf-0.7.1-released...magic_pdf-0.8.0-released
