Releases: opendatalab/MinerU

magic_pdf-0.10.2-released

27 Nov 10:33
8afff9a

What's Changed

  • fix(pdf_parse): Move the logic for filling text content into spans before the discarded_block recognition to fix the issue of empty text blocks in discarded_block. by @myhloli in #1082
  • refactor(txt_spans_extract_v2): optimize span processing and OCR logic by @myhloli in #1086
  • feat(ocr): filter out low confidence ocr results by @myhloli in #1088
  • feat(pdf_parse): add OCR score to span data by @myhloli in #1089
  • fix: test_rag by @icecraft in #1105
  • perf(image_processing): reduce maximum image size for analysis by @myhloli in #1106
  • fix: test_tools unittest by @icecraft in #1104
  • refactor(libs): remove unused imports and functions by @myhloli in #1112
  • Feat/add s3 read write example by @icecraft in #1117
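
The two OCR entries above (filtering low-confidence OCR results and storing an OCR score on each span) can be pictured with a minimal sketch. It assumes spans are plain dictionaries with an optional `score` key; the function name `filter_low_confidence_spans` and the cutoff `MIN_OCR_SCORE` are hypothetical, since the release notes do not state the actual threshold or data layout.

```python
from typing import Any, Dict, List

# Hypothetical cutoff; the release notes do not give the real value.
MIN_OCR_SCORE = 0.5


def filter_low_confidence_spans(
    spans: List[Dict[str, Any]], min_score: float = MIN_OCR_SCORE
) -> List[Dict[str, Any]]:
    """Keep spans whose OCR score is at least min_score.

    Spans without a "score" key (assumed here to be text that did not
    come from OCR) are kept unchanged.
    """
    return [span for span in spans if span.get("score", 1.0) >= min_score]


spans = [
    {"content": "Introduction", "score": 0.97},
    {"content": "l|1I", "score": 0.21},   # likely noise
    {"content": "embedded text"},          # no OCR score attached
]
kept = filter_low_confidence_spans(spans)
print([span["content"] for span in kept])
```

Keeping the score on the span, rather than discarding it after filtering, would let later stages apply their own stricter cutoffs.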

Full Changelog: magic_pdf-0.10.1-released...magic_pdf-0.10.2-released

magic_pdf-0.10.1-released

25 Nov 03:41
4dcf31b

What's Changed

Full Changelog: magic_pdf-0.10.0-released...magic_pdf-0.10.1-released

magic_pdf-0.10.0-released

22 Nov 09:54
158e556

What's Changed

  • fix: fix issue #715 by @LollipopsAndWine in #971
  • docs(README): update GPU hardware recommendations and table recognition options by @myhloli in #973
  • docs: improve GPU support list formatting in README_zh-CN.md by @myhloli in #974
  • docs: update feature description for table conversion by @myhloli in #975
  • docs: update readme by @myhloli in #977
  • update ci by @dt-yy in #986
  • test(unittest): Restore unit test cases by @myhloli in #998
  • refactor(tests): extract common test utilities into test_commons.py by @myhloli in #1001
  • feat(ocr): improve handling of angled text boxes by @myhloli in #1010
  • refactor(para): improve paragraph splitting logic by @myhloli in #1013
  • build(setup): add old_linux specific dependencies by @myhloli in #1016
  • refactor(para): adjust right margin threshold based on block width by @myhloli in #1018
  • fix: using new data api replace old rw api by @icecraft in #1006
  • delete unused pipeline file by @liugongjian in #1024
  • refactor: move some constants or enums defs to config folder by @icecraft in #1027
  • fix: remove test code by @icecraft in #1036
  • fix(tools): handle empty language string in common.py by @myhloli in #1045
  • refactor(ocr_dict_merge): add threshold parameter for line merging by @myhloli in #1046
  • fix(ocr_mkcontent): improve hyphen handling at line ends by @myhloli in #1047
  • fix(remove_overlaps_min_spans): optimize overlap detection in OCR span list modification by @myhloli in #1048
  • feat(ocr): improve text detection and OCR accuracy by @myhloli in #1049
  • refactor(txt_parse): improve text extraction accuracy with new algorithm by @myhloli in #1050
  • fix: use concrete class instead of abstract class by @icecraft in #1052
  • fix(pdf_parse): improve line stop flag detection accuracy by @myhloli in #1053
  • test: comment out assertions for metascan classify and meta scan tests by @myhloli in #1054
  • Add test cases to json compressor util by @liugongjian in #1056
  • refactor(para): improve line stop flag and remove unused debug mode by @myhloli in #1058
  • fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1060
  • refactor(model): move page total time logging to custom model analysis by @myhloli in #1061
  • fix(table): add null check for OCR result in rapid table prediction by @myhloli in #1062
  • fix(pdf_parse): improve OCR result handling by @myhloli in #1064
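
The hyphen-handling entry above (fix(ocr_mkcontent), #1047) concerns how lines are joined when one ends with a hyphen. As a rough sketch of the general idea only, not the project's actual code: when joining lines into a paragraph, a space is normally inserted between lines, but not after a line that ends with a hyphen. The function name `join_lines` is hypothetical, and keeping the hyphen itself is an assumption of this sketch.

```python
from typing import List


def join_lines(lines: List[str]) -> str:
    """Join text lines into one paragraph.

    A single space separates consecutive lines, except after a line
    ending with "-", where the next line is appended directly.
    The hyphen is kept (an assumption of this sketch).
    """
    paragraph = ""
    for text in (line.strip() for line in lines):
        if not text:
            continue  # skip empty lines
        if paragraph and not paragraph.endswith("-"):
            paragraph += " "
        paragraph += text
    return paragraph


print(join_lines(["This is a well-", "known method that", "works."]))
```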

Full Changelog: magic_pdf-0.9.3-released...magic_pdf-0.10.0-released

magic_pdf-0.9.3-released

15 Nov 11:27
845a3ff

What's Changed

  • feat(model): add xycut algorithm for block sorting by @myhloli in #898
  • refactor(pdf_parse): adjust line count threshold for layoutreader by @myhloli in #902
  • Feat/add en docs by @icecraft in #906
  • feat: using next_docs by @icecraft in #907
  • feat(table): integrate RapidTable model for table recognition by @myhloli in #910
  • fix(gradio-app): add missing file type in upload by @myhloli in #911
  • refactor(magic_pdf_parse_main): optimize model data handling and JSON output by @myhloli in #912
  • Modify the test directory by @DTwz in #913
  • test(table): improve ppTableModel test coverage by @myhloli in #914
  • feat(table): add RapidOCR support for RapidTable model by @myhloli in #915
  • Add DocLayout-YOLO hyperlink by @qiangqiang199 in #889
  • fix: remove classes hierarchy diagram by @icecraft in #919
  • refactor(model download script) by @myhloli in #922
  • docs(readme): update table recognition configuration and documentation by @myhloli in #924
  • docs(README_ja-JP.md): update warning message and remove outdated content by @myhloli in #925
  • Update para_split_v3.py by @hyastar in #923
  • Style/docs by @icecraft in #927
  • docs: rewrite zh_cn docs without translate by @icecraft in #928
  • fix: typo by @icecraft in #931
  • fix: correct the download_models.py script path in the Dockerfile by @kimi360 in #938
  • build(Dockerfile): update model download script and dependencies by @myhloli in #941
  • fix(ocr_mkcontent): improve handling of single-character content #937 by @myhloli in #943
  • feat: tune docs by @icecraft in #948
  • fix(parse_pipeline): Resolve post-processing exceptions caused by partial PDFs due to file corruption or non-standard format by forcing a re-print. by @myhloli in #957
  • refactor(model): rename and restructure model modules by @myhloli in #964
  • docs:update docs for 0.9.3 by @myhloli in #965
  • docs(README): update project references and translations by @myhloli in #967

Full Changelog: magic_pdf-0.9.2-released...magic_pdf-0.9.3-released

magic_pdf-0.9.2-released

06 Nov 10:18
b25ff7a

What's Changed

  • fix: add ci repository by @dt-yy in #869
  • fix(table_model_init): remove unused code by @myhloli in #882
  • docs(README): update version number and improve documentation formatting by @myhloli in #884

Full Changelog: magic_pdf-0.9.1-released...magic_pdf-0.9.2-released

magic_pdf-0.9.1-released

06 Nov 04:07
069bcfe

What's Changed

  • Feat/tune docs by @icecraft in #833
  • fix(ocr_mkcontent): improve content handling for different languages and equation types by @myhloli in #839
  • feat(list): improve list detection algorithm & fix(list): improve list identification accuracy by @myhloli in #843
  • docs(tutorial): update magic-pdf command with output directory by @myhloli in #844
  • feat(para_split_v3): improve list identification with block aspect ratio by @myhloli in #845
  • fix(dict2md): improve text concatenation logic by @myhloli in #847
  • Update pdf_extract_kit.py by @CiaranYoung in #853
  • feat(table): upgrade StructEqTable model and integrate into PDF Extract Kit by @myhloli in #854
  • feat(model): add HTML minification to StructTableModel by @myhloli in #855
  • chore: add .gitattributes to configure file linguist attributes by @myhloli in #856
  • fix(merge_text): add ligature replacement functionality #305 #241 by @myhloli in #857
  • chore: add CSS and SCSS files to linguist-vendored (update .gitattributes to mark CSS and SCSS files as vendored) by @myhloli in #858
  • docs(README): update Colab demo link by @myhloli in #860
  • fix(table): improve table image processing by @myhloli in #866
  • docs(faq): add troubleshooting for illegal instruction error on Linux servers by @myhloli in #867
  • feat: replace the mineru_demo API documentation with a link by @LollipopsAndWine in #871
  • test(table): improve HTML validation for table extraction by @myhloli in #874
  • docs: update arXiv paper link in README files by @myhloli in #875
  • docs(README): update changelog for v0.9.1 release by @myhloli in #877

Full Changelog: magic_pdf-0.9.0-released...magic_pdf-0.9.1-released

magic_pdf-0.9.0-released

01 Nov 11:04
3a42ebb

What's Changed

  • Update README_zh-CN.md (#404) by @drunkpig in #409
  • feat: add dockerfile by @Lincyaw in #189
  • fix(ocr_mkcontent): improve language detection and content formatting by @myhloli in #458
  • fix(self_modify): merge detection boxes for optimized text region detection by @myhloli in #448
  • fix(pdf-extract): adjust box threshold for OCR detection to fix issue about OCR mode lost some line by @myhloli in #447
  • feat: rename the file generated by command line tools by @icecraft in #401
  • fix(ocr_mkcontent): revise table caption output by @myhloli in #397
  • build(docker): update docker build step by @myhloli in #471
  • upload an introduction about chemical formula and update readme.md by @GDDGCZ518 in #489
  • fix: remove the default value of output option in tools/cli.py and to… by @icecraft in #494
  • feat: add test case by @dt-yy in #499
  • fixes #492 decrease span threshold for block filling by @myhloli in #500
  • fix(detect_all_bboxes): remove small overlapping blocks by merging by @myhloli in #501
  • feat(cli&analyze&pipeline): add start_page and end_page args for pagination by @myhloli in #507
  • Feat/support rag by @icecraft in #510
  • feat(gradio): add app by gradio by @myhloli in #512
  • fix: replace \u0002, \u0003 in common text by @drunkpig in #521
  • fix(end_page_id):Fix the issue where end_page_id is corrected to len-1 when its input is 0. by @myhloli in #518
  • fix(para): When an English line ends with a hyphen, do not add a space at the end. by @drunkpig in #523
  • Release: Release 0.7.1 version, update dev by @dt-yy in #527
  • Hotfix readme 0.7.1 by @Focusshang in #529
  • fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 by @papayalove in #542
  • fix: typo error in markdown by @icecraft in #536
  • fix(gradio): remove unused imports and simplify pdf display by @myhloli in #534
  • Feat/support footnote in figure by @icecraft in #532
  • refactor(pdf_extract_kit): implement singleton pattern for atomic models by @myhloli in #533
  • feat: mineru_web by @LollipopsAndWine in #555
  • features@add mineru gpu&web_api by @yanqiangmiffy in #568
  • docs(models_download): update model download instructions to use python script by @myhloli in #560
  • fix: resolve inaccuracy of drawing layout box caused by paragraphs combination #384 by @papayalove in #574
  • feat(ocr): supports minority languages by @myhloli in #577
  • refactor(pdf_extract_kit): update model config and weight paths for UniMERNet-0.2.0 by @myhloli in #584
  • feat(gradio_app): add web app with PDF processing as a project by @myhloli in #579
  • fix: web_api by @LollipopsAndWine in #580
  • Release 0.8.0 by @drunkpig in #587
  • fix: 1. resolve incorrect pair relation of figure and footnote, 2. re… by @icecraft in #603
  • fix: recover the lang option in tools/cli.py by @icecraft in #604
  • fix: solve conflicts by @myhloli in #607
  • fix: remove useless files by @myhloli in #608
  • feat(gradio_app): add examples accordion to the PDF conversion interface by @myhloli in #597
  • feat(pipeline): pass language parameter for parsing and markdown conversion by @myhloli in #602
  • feat(ocr_mkcontent): support drop reason in none_with_reason mode by @myhloli in #630
  • feat(UNIPipe): change default drop_mode to NONE_WITH_REASON by @myhloli in #631
  • refactor(pdf_extract): use Image.crop directly with layout detection by @myhloli in #635
  • fix(pdf-extract): ensure model is set to evaluation mode before processing by @myhloli in #636
  • fix(pdf_extract_kit):change unimernet base -> small by @myhloli in #639
  • feat: add test case by @dt-yy in #645
  • feat: integrate the front-end interface and configure one-click startup by @LollipopsAndWine in #668
  • feat: delete unused files and update the front-end style by @LollipopsAndWine in #669
  • docs: update project lists in README files to include web_api by @myhloli in #670
  • feat:add layoutreader to sort blocks by @myhloli in #672
  • refactor(model): improve timing information and performance by @myhloli in #690
  • feat: add arXiv paper link to header and adjust PDF parsing logic by @myhloli in #693
  • perf(pdf_extract_kit): conditional memory cleanup based on GPU capacity by @myhloli in #694
  • fix: caption or footnote match algorithm by @icecraft in #695
  • fix: caption|footnote match algorithm by @icecraft in #696
  • feat(layoutreader): support local model directory and improve model loading by @myhloli in #698
  • feat(docs): automate model download and configuration by @myhloli in #699
  • docs: add filename to wget command in model download scripts by @myhloli in #700
  • docs: update CUDA acceleration guides and README content by @myhloli in #701
  • Update README_Windows_CUDA_Acceleration_en_US.md by @myhloli in #706
  • feat(pdf_parse_union_core_v2): reintegrate para_split_v3 and add page range support by @myhloli in #716
  • Update how_to_download_models_zh_cn.md by @myhloli in #717
  • fix: Solving the Grouping Anomaly Issue with Multiple Consecutive Non-Text Blocks by @myhloli in #718
  • feat: manage docs with sphinx by @icecraft in #737
  • feat(list&index block): detect and merge list and index blocks by @myhloli in #740
  • refactor(para_split_v3): merge list and index block detection by @myhloli in #743
  • fix(para_split_v3): refine list block detection in paragraph splitting by @myhloli in #744
  • update example files by @myhloli in #747
  • refactor(ocr):Increase the dilation factor in OCR to address the issue of word concatenation. by @myhloli in #753
  • refactor(para): improve paragraph splitting algorithm by @myhloli in #765
  • docs:Update the driver requirements on the Ubuntu system. by @myhloli in #766
  • update:update config json by @myhloli in #769
  • feat(model): add support for DocLayout-YOLO model by @myhloli in #773
  • build(setup): add doclayout_yolo dependency by @myhloli in #774
  • build(docker): add doclayout-yolo dependency by @myhloli in #776
  • feat: add support for non-PDF file conversion to PDF by @myhloli in #777
  • Feat/data api by @icecraft in #782
  • Feat/new table caption match by @icecraft in #784
  • refactor(parse_core): improve image and table block handling by @myhloli in #785
  • refactor(ocr): adjust OCR processing parameters by @myhloli in #786
  • fix: add init to magic_pdf.config by @myhloli in #788
  • fix: add init to magic_pdf.utils by @myhloli in #789
  • feat(draw_bbox): update bounding box drawing for tables and images by @myhloli in #791
  • Add multi_gpu process project by @randydl in #79...

magic_pdf-0.8.1-update-docs

09 Oct 08:58
62aa1cb

What's Changed

  • refactor(docs): update model download instructions and configuration process by @myhloli in #707

Full Changelog: magic_pdf-0.8.1-released...magic_pdf-0.8.1-update-docs

magic_pdf-0.8.1-released

12 Sep 14:00
c95f381

What's Changed

fix:

  • resolve incorrect pairing of figure and footnote
  • resolve incorrect pairing of table and caption #590 by @icecraft in #599

Full Changelog: magic_pdf-0.8.0-released...magic_pdf-0.8.1-released

magic_pdf-0.8.0-released

10 Sep 12:20
9f352df

What's Changed

feat:

  • Add RAG API
  • Integrate RAG into the llama_index project
  • Update Dockerfile
  • Use fine-grained model singletons, reducing memory usage and speeding up initialization
  • Add parsing range parameters to the CLI and API, allowing the start and end pages to be customized
  • Support image footnotes

bugfix:

  • When removing the smaller of two overlapping blocks, retain the boundary information of that block
  • Lower the threshold for filling spans into blocks from 0.6 to 0.3
  • Fix loss of low-score lines in the OCR detection stage
  • Merge multiple spans of a single line in the OCR detection stage
  • Improve segmentation of English words that were run together
  • Fix inaccurate layout boxes
  • Fix incorrect merging of words broken across line breaks
  • Fix certain special characters appearing in the final output
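
The parsing range parameters listed under feat above can be illustrated with a small sketch of how a start/end page pair might be normalized against a document's page count. The function name `resolve_page_range`, the zero-based indexing, and the convention that `None` or a negative end page means "through the last page" are all assumptions of this sketch, not the project's documented behavior.

```python
from typing import Optional, Tuple


def resolve_page_range(
    page_count: int, start_page: int = 0, end_page: Optional[int] = None
) -> Tuple[int, int]:
    """Return an inclusive, zero-based (start, end) page range.

    end_page=None or a negative value means "through the last page"
    (an assumed convention). The check uses `is None` rather than
    truthiness so that end_page=0 is honored as "first page only".
    """
    if page_count <= 0:
        raise ValueError("document has no pages")
    last = page_count - 1
    if end_page is None or end_page < 0:
        end = last
    else:
        end = min(end_page, last)  # clamp to the document length
    start = max(start_page, 0)
    if start > end:
        raise ValueError(f"start page {start} is after end page {end}")
    return start, end


print(resolve_page_range(10))          # whole document
print(resolve_page_range(10, 0, 0))    # first page only
print(resolve_page_range(10, 2, 50))   # end clamped to the last page
```

Testing the end page with a truthiness check such as `end_page or last` would silently turn an explicit `0` into "last page", which is why the sketch compares against `None`.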

Full Changelog: magic_pdf-0.7.1-released...magic_pdf-0.8.0-released