add: custom formatter #104

Draft: wants to merge 20 commits into base: main
3 changes: 2 additions & 1 deletion .env.example
@@ -1,2 +1,3 @@
 LLAMA_CLOUD_API_KEY=llx-1234567890
-OPENAI_API_KEY = sk-1234567890
+OPENAI_API_KEY=sk-1234567890
+MEGAPARSE_API_KEY=MyMegaParseKey
17 changes: 17 additions & 0 deletions .github/workflows/release-please.yml
@@ -48,3 +48,20 @@ jobs:
         run: rye build
       - name: Rye Publish
         run: rye publish --token ${{ secrets.PYPI_API_TOKEN }} --yes
+
+  deploy-sdk:
+    if: needs.release-please.outputs.release_created == 'true'
+    needs: release-please
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install Rye
+        uses: eifinger/setup-rye@v2
+        with:
+          enable-cache: true
+      - name: Rye Sync
+        run: cd megaparse/sdk && rye sync --no-lock
+      - name: Rye Build
+        run: cd megaparse/sdk && rye build
+      - name: Rye Publish
+        run: cd megaparse/sdk && rye publish --token ${{ secrets.PYPI_API_TOKEN }} --yes
1 change: 0 additions & 1 deletion .gitignore
@@ -1,4 +1,3 @@
*.md
/output
/input
.env
2 changes: 1 addition & 1 deletion .release-please-manifest.json
@@ -1,3 +1,3 @@
 {
-  ".": "0.0.33"
+  ".": "0.0.42"
 }
63 changes: 63 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,68 @@
# Changelog

## [0.0.42](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.41...megaparse-v0.0.42) (2024-11-08)


### Features

* **sdk:** new version ([e377cd6](https://github.com/QuivrHQ/MegaParse/commit/e377cd6df98b3ea9265788a4d907b43bde796196))

## [0.0.41](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.40...megaparse-v0.0.41) (2024-11-08)


### Bug Fixes

* add megaparse url env variable ([#118](https://github.com/QuivrHQ/MegaParse/issues/118)) ([132c2eb](https://github.com/QuivrHQ/MegaParse/commit/132c2ebd13177fd116c4e710a4b1c864a9fa04bb))

## [0.0.40](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.39...megaparse-v0.0.40) (2024-11-08)


### Bug Fixes

* sdk version ([#116](https://github.com/QuivrHQ/MegaParse/issues/116)) ([8bfeb4a](https://github.com/QuivrHQ/MegaParse/commit/8bfeb4a52326a5f645d3ed20e113153dc19bf012))

## [0.0.39](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.38...megaparse-v0.0.39) (2024-11-08)


### Bug Fixes

* add_logs ([#114](https://github.com/QuivrHQ/MegaParse/issues/114)) ([63c9236](https://github.com/QuivrHQ/MegaParse/commit/63c9236590016ee4c210174e746e96ff2b654480))

## [0.0.38](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.37...megaparse-v0.0.38) (2024-11-07)


### Bug Fixes

* env roots, imports root ([#112](https://github.com/QuivrHQ/MegaParse/issues/112)) ([a04230d](https://github.com/QuivrHQ/MegaParse/commit/a04230dc2de9e0bb0bde39ab66b2208f80743922))

## [0.0.37](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.36...megaparse-v0.0.37) (2024-11-07)


### Features

* bump megaparse-sdk version to 0.1.1 ([ed3fdfb](https://github.com/QuivrHQ/MegaParse/commit/ed3fdfb10498c95d4f9a510df3a2913e0dfc3c23))

## [0.0.36](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.35...megaparse-v0.0.36) (2024-11-07)


### Features

* **readme:** update ([9d571b7](https://github.com/QuivrHQ/MegaParse/commit/9d571b7c71db610e7a0b08045ad98994ecf71baa))

## [0.0.35](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.34...megaparse-v0.0.35) (2024-11-07)


### Bug Fixes

* unnecessary dep and readme ([#107](https://github.com/QuivrHQ/MegaParse/issues/107)) ([b80aaa3](https://github.com/QuivrHQ/MegaParse/commit/b80aaa3a894b2bd2c7d7f518919c41af5c99219f))

## [0.0.34](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.33...megaparse-v0.0.34) (2024-11-07)


### Features

* megaparse-sdk-cherry ([#105](https://github.com/QuivrHQ/MegaParse/issues/105)) ([ad44aa3](https://github.com/QuivrHQ/MegaParse/commit/ad44aa34999596e156c78f91adab97bce7ceeb0e))

## [0.0.33](https://github.com/QuivrHQ/MegaParse/compare/megaparse-v0.0.32...megaparse-v0.0.33) (2024-11-01)


4 changes: 3 additions & 1 deletion Dockerfile
@@ -30,6 +30,8 @@ RUN apt-get clean && apt-get update && apt-get install -y \
     rm -rf /var/lib/apt/lists/* && apt-get clean

 COPY requirements.lock pyproject.toml README.md ./
+COPY megaparse/sdk/pyproject.toml megaparse/sdk/README.md megaparse/sdk/
+

 RUN PYTHONDONTWRITEBYTECODE=1 pip install --no-cache-dir -r requirements.lock

@@ -39,7 +41,7 @@ RUN playwright install --with-deps && \
     python -c "from unstructured.partition.model_init import initialize; initialize()"


-ENV PYTHONPATH=/app
+ENV PYTHONPATH="/app:/app/megaparse/sdk"

 COPY . .
 EXPOSE 8000
2 changes: 1 addition & 1 deletion README.md
@@ -1,4 +1,4 @@
-# MegaParse - Your Mega Parser for every type of documents
+# MegaParse - Your Parser for every type of documents

 <div align="center">
     <img src="https://raw.githubusercontent.com/QuivrHQ/MegaParse/main/logo.png" alt="Quivr-logo" width="30%" style="border-radius: 50%; padding-bottom: 20px"/>
9 changes: 7 additions & 2 deletions megaparse/__init__.py
@@ -1,3 +1,8 @@
-# from .Converter import MegaParse
+"""My library with optional components."""

-# __all__ = ["MegaParse"]
+__version__ = "0.0.42"
+
+# Import only SDK components by default
+from megaparse import sdk
+
+__all__ = ["sdk"]
50 changes: 29 additions & 21 deletions megaparse/api/app.py
@@ -1,18 +1,19 @@
+import os
 import tempfile
-from fastapi import Depends, FastAPI, UploadFile, File, HTTPException
+
+import httpx
+import psutil
+from fastapi import Depends, FastAPI, File, HTTPException, UploadFile
+from langchain_anthropic import ChatAnthropic
+from langchain_community.document_loaders import PlaywrightURLLoader
+from langchain_openai import ChatOpenAI
+from llama_parse.utils import Language
+
 from megaparse.api.utils.type import HTTPModelNotSupported
 from megaparse.core.megaparse import MegaParse
 from megaparse.core.parser.builder import ParserBuilder
 from megaparse.core.parser.type import ParserConfig, ParserType
 from megaparse.core.parser.unstructured_parser import StrategyEnum, UnstructuredParser
-import psutil
-import os
-from langchain_community.document_loaders import PlaywrightURLLoader
-
-from langchain_openai import ChatOpenAI
-from langchain_anthropic import ChatAnthropic
-from llama_parse.utils import Language
-import httpx

 app = FastAPI()

Expand Down Expand Up @@ -79,15 +80,17 @@ async def parse_file(
language=language,
parsing_instruction=parsing_instruction,
)

parser = parser_builder.build(parser_config)
with tempfile.NamedTemporaryFile(
delete=False, suffix=f".{str(file.filename).split('.')[-1]}"
) as temp_file:
temp_file.write(file.file.read())
megaparse = MegaParse(parser=parser)
result = await megaparse.aload(file_path=temp_file.name)
return {"message": "File parsed successfully", "result": result}
try:
parser = parser_builder.build(parser_config)
with tempfile.NamedTemporaryFile(
delete=False, suffix=f".{str(file.filename).split('.')[-1]}"
) as temp_file:
temp_file.write(file.file.read())
megaparse = MegaParse(parser=parser)
result = await megaparse.aload(file_path=temp_file.name)
return {"message": "File parsed successfully", "result": result}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))


@app.post("/v1/url")
Expand All @@ -106,9 +109,14 @@ async def upload_url(

with tempfile.NamedTemporaryFile(delete=False, suffix="pdf") as temp_file:
temp_file.write(response.content)
megaparse = MegaParse(parser=UnstructuredParser(strategy=StrategyEnum.AUTO))
result = megaparse.load(temp_file.name)
return {"message": "File parsed successfully", "result": result}
try:
megaparse = MegaParse(
parser=UnstructuredParser(strategy=StrategyEnum.AUTO)
)
result = megaparse.load(temp_file.name)
return {"message": "File parsed successfully", "result": result}
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
else:
data = await playwright_loader.aload()
# Now turn the data into a string
Expand Down
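Both handlers in this file write the payload to a `NamedTemporaryFile(delete=False)`, so the file stays on disk after the request, and the URL handler passes `suffix="pdf"` without a leading dot (the standard library appends the suffix verbatim, so the name ends in `pdf` rather than `.pdf`). A minimal sketch of one way to handle both points, assuming a hypothetical helper named `temp_copy` that is not part of this PR:

```python
import os
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator, Optional


@contextmanager
def temp_copy(data: bytes, filename: Optional[str]) -> Iterator[str]:
    """Write data to a temp file that keeps the upload's extension, then delete it.

    Hypothetical helper for illustration; not part of the PR.
    """
    # Path.suffix includes the leading dot (".pdf") or is "" when there is none.
    suffix = Path(filename or "").suffix
    with tempfile.NamedTemporaryFile(delete=False, suffix=suffix) as tmp:
        tmp.write(data)
    try:
        # The file is closed at this point, so a parser can reopen it by name.
        yield tmp.name
    finally:
        # Remove the file whether parsing succeeded or raised.
        os.unlink(tmp.name)
```

Inside `parse_file` this could be used as `with temp_copy(file.file.read(), file.filename) as path:` followed by `result = await megaparse.aload(file_path=path)`, keeping the new `try`/`except` around it.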
25 changes: 0 additions & 25 deletions megaparse/core/checker/format_checker.py

This file was deleted.

40 changes: 40 additions & 0 deletions megaparse/core/example/run_simple_parser.py
@@ -0,0 +1,40 @@
from langchain_openai import ChatOpenAI
from megaparse.core.formatter.table_formatter.vision_md_formatter import (
    VisionMDTableFormatter,
)
from megaparse.core.formatter.unstructured_formatter.markdown_formatter import (
    MarkDownFormatter,
)
from megaparse.core.megaparse import MegaParse
from megaparse.core.parser.unstructured_parser import UnstructuredParser
import os


def main():
    # This is a simple example of how to use the MegaParse class
    # You can use this class to parse any file format supported by the parser
    # and apply any formatter to the parsed document
    # The parsed document can then be saved to a file

    # Create an instance of UnstructuredParser
    parser = UnstructuredParser()

    # Add a table formatter to the parser
    formatter_list = []
    model = ChatOpenAI(model="gpt-4o", api_key=str(os.getenv("OPENAI_API_KEY")))  # type:ignore
    formatter_list.append(VisionMDTableFormatter(model=model))

    # Add a MD formatter to the parser
    formatter_list.append(MarkDownFormatter())

    # Create an instance of MegaParse
    mega_parse = MegaParse(parser=parser, formatters=formatter_list)

    # Load a file
    parsed_document = mega_parse.load("tests/data/MegaFake_report.pdf")

    print(parsed_document)


if __name__ == "__main__":
    main()
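The example wraps the key in `str(os.getenv("OPENAI_API_KEY"))`. When the variable is unset, `os.getenv` returns `None` and `str(None)` is the literal string `"None"`, so a missing key would most likely only show up later, when the API rejects it. A small sketch of failing early instead, assuming a hypothetical helper `require_env` that is not part of the PR:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable or raise a clear error.

    Hypothetical helper for illustration; not part of the PR.
    """
    value = os.getenv(name)
    if not value:
        # Covers both an unset variable and an empty string.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

In the example script the model line could then read `ChatOpenAI(model="gpt-4o", api_key=require_env("OPENAI_API_KEY"))`, though whether the `# type:ignore` is still needed depends on the installed `langchain_openai` version.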
40 changes: 40 additions & 0 deletions megaparse/core/formatter/formatter.py
@@ -0,0 +1,40 @@
from typing import List, Union
from abc import ABC
from langchain_core.language_models.chat_models import BaseChatModel
from unstructured.documents.elements import Element


# TODO: Implement the Formatter class @Chloe
class Formatter(ABC):
    """
    A class used to improve the layout of elements, particularly focusing on converting HTML tables to markdown tables.
    Attributes
    ----------
    model : BaseChatModel
        An instance of a chat model used to process and improve the layout of elements.
    Methods
    -------
    improve_layout(elements: List[Element]) -> List[Element]
        Processes a list of elements, converting HTML tables to markdown tables and improving the overall layout.

    """

    def __init__(self, model: BaseChatModel | None = None):
        self.model = model

    async def format(
        self, elements: Union[List[Element], str], file_path: str | None = None
    ) -> Union[List[Element], str]:
        if isinstance(elements, list):
            return await self.format_elements(elements, file_path)
        return await self.format_string(elements, file_path)

    async def format_elements(
        self, elements: List[Element], file_path: str | None = None
    ) -> Union[List[Element], str]:
        raise NotImplementedError("Subclasses should implement this method")

    async def format_string(
        self, text: str, file_path: str | None = None
    ) -> Union[List[Element], str]:
        raise NotImplementedError("Subclasses should implement this method")
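The base class dispatches on the input type: a list goes to `format_elements`, anything else to `format_string`. A self-contained sketch of how a subclass might plug into that dispatch, assuming a stand-in `Formatter` that mirrors the logic above (plain strings replace `Element` so it runs without `unstructured`) and a hypothetical `UpperCaseFormatter`:

```python
import asyncio
from typing import List, Optional, Union


class Formatter:
    """Stand-in mirroring the dispatch in the PR's base class (str replaces Element)."""

    async def format(
        self, elements: Union[List[str], str], file_path: Optional[str] = None
    ) -> Union[List[str], str]:
        if isinstance(elements, list):
            return await self.format_elements(elements, file_path)
        return await self.format_string(elements, file_path)

    async def format_elements(self, elements, file_path=None):
        raise NotImplementedError("Subclasses should implement this method")

    async def format_string(self, text, file_path=None):
        raise NotImplementedError("Subclasses should implement this method")


class UpperCaseFormatter(Formatter):
    """Hypothetical formatter: joins elements into text, then upper-cases it."""

    async def format_elements(self, elements, file_path=None):
        return await self.format_string("\n".join(elements), file_path)

    async def format_string(self, text, file_path=None):
        return text.upper()


if __name__ == "__main__":
    fmt = UpperCaseFormatter()
    print(asyncio.run(fmt.format(["title", "body"])))
    print(asyncio.run(fmt.format("already a string")))
```

Because neither method is marked `@abstractmethod`, the base class can still be instantiated and only fails once `format` is called; decorating the two methods would move that failure to construction time.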
10 changes: 10 additions & 0 deletions megaparse/core/formatter/table_formatter/__init__.py
@@ -0,0 +1,10 @@
from megaparse.core.formatter.formatter import Formatter
from typing import List
from unstructured.documents.elements import Element


class TableFormatter(Formatter):
    async def format_elements(
        self, elements: List[Element], file_path: str | None = None
    ) -> List[Element]:
        raise NotImplementedError()
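`TableFormatter` leaves `format_elements` unimplemented, and the base class docstring names HTML-to-markdown table conversion as the goal. As a rough, model-free illustration of that conversion (this is not the PR's `VisionMDTableFormatter`, whose implementation is not shown here), the sketch below uses only the standard library. `html_table_to_markdown` is a hypothetical name, and the code assumes a simple, non-nested table whose first row is the header:

```python
from html.parser import HTMLParser
from typing import List


class _TableRows(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self) -> None:
        super().__init__()
        self.rows: List[List[str]] = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
            self._in_cell = False  # a new row closes any unterminated cell
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def html_table_to_markdown(html: str) -> str:
    """Hypothetical helper: render a simple HTML table as a markdown table."""
    parser = _TableRows()
    parser.feed(html)
    parser.close()
    # Trim cell text and escape pipes so they do not break the markdown columns.
    rows = [
        [cell.strip().replace("|", "\\|") for cell in row]
        for row in parser.rows
        if row
    ]
    if not rows:
        return ""
    width = max(len(row) for row in rows)
    rows = [row + [""] * (width - len(row)) for row in rows]  # pad ragged rows
    lines = ["| " + " | ".join(rows[0]) + " |", "|" + " --- |" * width]
    lines += ["| " + " | ".join(row) + " |" for row in rows[1:]]
    return "\n".join(lines)
```

A concrete `TableFormatter` subclass could call a helper like this from `format_elements` for elements that carry HTML table text; how such elements expose their HTML depends on the `unstructured` version and is left out here.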