[MODULE] - Paragraph contains regular/markdown table #346

Open
jhoetter opened this issue Sep 26, 2023 · 2 comments
Labels
cognition enhancement New feature or request

Comments

@jhoetter
Member

Please describe the module you would like to add to bricks
In the context of RAG (Retrieval Augmented Generation):
If a paragraph contains a table, I want to be able to filter for it easily; a table generally means the paragraph is more complex.

Do you already have an implementation?
This is nowhere near perfect. It is just a first heuristic I used previously to detect whether a text likely contains a table without any kind of markdown structure, e.g. a pricing table that has a few headers followed only by prices.

import re


def likely_contains_tabular_data(text):
    # Match integers and decimals such as 12, 3.5 or 1,99
    number_pattern = re.compile(r"\d+(?:[.,]\d+)?")

    # Ignore blank lines so they do not count as rows
    lines = [line for line in text.split("\n") if line.strip()]

    # Heuristic 1: If a single line contains many separate numbers,
    # it might be tabular data. We require at least 3 numbers in the line.
    if any(len(number_pattern.findall(line)) >= 3 for line in lines):
        return True

    # Heuristic 2: Check if there are multiple lines with a similar structure of elements.
    # Lines with at least one number count as candidate rows; requiring
    # at least 3 such rows is an assumed threshold.
    candidate_rows = [line for line in lines if number_pattern.search(line)]
    if len(candidate_rows) >= 3:
        # Number of whitespace-separated elements in each candidate row
        elements_counts = [len(line.split()) for line in candidate_rows]
        avg_elements = sum(elements_counts) / len(elements_counts)

        # If most rows have a number of elements close to the average, consider it tabular
        close_to_avg = sum(
            1 for count in elements_counts if abs(count - avg_elements) <= 1
        )
        if close_to_avg >= len(elements_counts) * 0.7:
            return True

    # No heuristic matched
    return False

Additional context
-

@jhoetter jhoetter added enhancement New feature or request cognition labels Sep 26, 2023
@jhoetter
Member Author

For instance:

This could be your text:

| Fruit       | Color  | Taste     |
|-------------|--------|-----------|
| Apple       | Red    | Sweet     |
| Banana      | Yellow | Sweet     |
| Orange      | Orange | Tangy     |
| Strawberry  | Red    | Sweet     |
| Blueberry   | Blue   | Tart      |

The table above lists some common fruits along with their colors and tastes. For example, apples are red and have a sweet taste, while bananas are yellow and also taste sweet. Oranges are orange in color and have a tangy flavor, while strawberries are red and sweet. Finally, blueberries are blue and have a slightly tart taste.

I want to detect that this text contains a markdown table.
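
One possible way to detect a markdown table like the one above is to look for a header row that is directly followed by a delimiter row (cells made only of dashes, optionally with alignment colons) with the same number of cells. This is a minimal sketch, assuming GitHub-style pipe tables; the names `contains_markdown_table` and `_split_cells` are hypothetical and not part of bricks, and escaped pipes inside cells are not handled.

```python
import re

# A delimiter cell consists of dashes with optional alignment colons,
# e.g. "---", ":---", "---:" or ":---:".
_DELIMITER_CELL = re.compile(r"^:?-+:?$")


def _split_cells(line):
    """Split a pipe-delimited row into stripped cell strings."""
    stripped = line.strip()
    if stripped.startswith("|"):
        stripped = stripped[1:]
    if stripped.endswith("|"):
        stripped = stripped[:-1]
    return [cell.strip() for cell in stripped.split("|")]


def contains_markdown_table(text):
    """Return True if a header row is directly followed by a delimiter row."""
    lines = text.split("\n")
    for header, delimiter in zip(lines, lines[1:]):
        # Both rows need at least one pipe to be table candidates
        if "|" not in header or "|" not in delimiter:
            continue
        header_cells = _split_cells(header)
        delimiter_cells = _split_cells(delimiter)
        # The delimiter row must match the header's cell count
        # and contain only dash cells
        if len(header_cells) == len(delimiter_cells) and all(
            _DELIMITER_CELL.match(cell) for cell in delimiter_cells
        ):
            return True
    return False
```

Checking only the header and delimiter rows keeps the sketch short; the body rows are not validated, so a table with no data rows would still be reported.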

@jhoetter
Member Author

Ideally, it would also find some tables without markdown structure, for instance:

Name    Age    Score
Alice   28     92
Bob     24     87
Carol   32     95
David   29     88
Eve     35     90

This is what I initially did with the function I described above.
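
For tables without markdown structure, like the one above, a complementary check could look at column alignment instead of numbers, so that the header row counts as well. The following is only a sketch under the assumption that columns are separated by a tab or by two or more spaces; `contains_aligned_table`, `min_rows` and `min_columns` are hypothetical names and thresholds, not something that exists in bricks.

```python
import re

# Assumption: columns are separated by tabs or by runs of 2+ spaces.
_COLUMN_GAP = re.compile(r"\t+| {2,}")


def contains_aligned_table(text, min_rows=3, min_columns=2):
    """Return True if `min_rows` consecutive lines split into the same number of columns."""
    run_length = 0      # length of the current run of matching rows
    run_columns = None  # column count shared by the current run
    for line in text.split("\n"):
        stripped = line.strip()
        columns = len(_COLUMN_GAP.split(stripped)) if stripped else 0
        if columns >= min_columns and columns == run_columns:
            # Same column count as the previous row: extend the run
            run_length += 1
        elif columns >= min_columns:
            # Column count changed: start a new run at this row
            run_length, run_columns = 1, columns
        else:
            # Blank line or prose: reset the run
            run_length, run_columns = 0, None
        if run_length >= min_rows:
            return True
    return False
```

Rows whose columns are separated by a single space would not be detected by this sketch, which is the trade-off for not flagging ordinary prose.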
