[MODULE] - Paragraph contains regular/markdown table #346

jhoetter · 2023-09-26T16:22:14Z

Please describe the module you would like to add to bricks
In the context of RAG (Retrieval Augmented Generation):
If a paragraph contains a table, I want to be able to filter for it easily; a table generally means the paragraph is more complex.

Do you already have an implementation?
This is nowhere near perfect; it is just a first heuristic I used previously to detect whether a text likely contains a table that has no markdown structure, e.g. a pricing table that has a few headers followed only by prices.

import re


def likely_contains_tabular_data(text):
    # Match numbers, optionally with one decimal/thousands separator (e.g. 12, 3.50, 1,299)
    number_pattern = re.compile(r"\d+(?:[.,]\d+)?")
    whitespace_pattern = re.compile(r"\s+")

    # Drop blank lines and surrounding whitespace so they do not skew the element counts
    lines = [line.strip() for line in text.split("\n") if line.strip()]

    # Heuristic 1: If we find many numbers in the same line, it might be a row of tabular data.
    # Let's assume we need at least 3 numbers for a line to be considered.
    if any(len(number_pattern.findall(line)) >= 3 for line in lines):
        return True

    # Heuristic 2: Check if there are multiple lines with a similar structure of elements.
    # Candidate rows use a lower threshold than heuristic 1 (at least 2 numbers, an assumed value);
    # with the same threshold, heuristic 1 would already have returned and this branch could never fire.
    lines_with_patterns = [
        line for line in lines if len(number_pattern.findall(line)) >= 2
    ]
    if len(lines_with_patterns) >= 2:
        elements_counts = [
            len(whitespace_pattern.split(line)) for line in lines_with_patterns
        ]
        avg_elements = sum(elements_counts) / len(elements_counts)

        # If most lines have a number of elements close to the average, consider it tabular
        if (
            sum(1 for count in elements_counts if abs(count - avg_elements) <= 1)
            >= len(elements_counts) * 0.7
        ):
            return True

    # No heuristic matched
    return False

Additional context
-

jhoetter · 2023-09-27T12:00:53Z

For instance:

This could be your text:

| Fruit       | Color  | Taste     |
|-------------|--------|-----------|
| Apple       | Red    | Sweet     |
| Banana      | Yellow | Sweet     |
| Orange      | Orange | Tangy     |
| Strawberry  | Red    | Sweet     |
| Blueberry   | Blue   | Tart      |

The table above lists some common fruits along with their colors and tastes. For example, apples are red and have a sweet taste, while bananas are yellow and also taste sweet. Oranges are orange in color and have a tangy flavor, while strawberries are red and sweet. Finally, blueberries are blue and have a slightly tart taste.

I want to detect that this text contains a markdown table.
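One way this could be approached, as a minimal sketch rather than a finished brick: look for a header row that is directly followed by a delimiter row (the `|---|---|` line) with the same number of cells. The names `contains_markdown_table`, `count_cells` and `DELIMITER_ROW` are made up for this example, and the sketch assumes pipe-style tables; escaped pipes inside cells are not handled.

```python
import re

# A delimiter row looks like |---|:---:|---| ; the outer pipes are optional.
DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)*\|?\s*$")


def count_cells(row):
    # Strip the optional leading and trailing pipe, then count the cells.
    return len(row.strip().strip("|").split("|"))


def contains_markdown_table(text):
    """Return True if a header row is directly followed by a delimiter row."""
    lines = text.split("\n")
    for header, delimiter in zip(lines, lines[1:]):
        # Requiring a pipe in both lines keeps a plain "---" rule from matching.
        if "|" not in header or "|" not in delimiter:
            continue
        if not DELIMITER_ROW.match(delimiter):
            continue
        if count_cells(header) == count_cells(delimiter):
            return True
    return False
```

Because it only checks the header and delimiter rows, the sketch also reports tables that have no body rows yet.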

jhoetter · 2023-09-27T12:02:06Z

Ideally, it would also find some tables without markdown structure, for instance:

Name    Age    Score
Alice   28     92
Bob     24     87
Carol   32     95
David   29     88
Eve     35     90

This is what I initially did with the function I described above.
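For tables like the one above that also contain text columns, a purely structural check might complement the number-based heuristic. The following is only a sketch under stated assumptions: columns are assumed to be separated by a tab or by two or more spaces, and a table is assumed to need at least three consecutive rows with the same column count. The name `contains_whitespace_table` and its parameters `min_rows` and `min_columns` are hypothetical.

```python
import re

# Assumption: a tab, or a run of two or more spaces/tabs, separates columns.
COLUMN_GAP = re.compile(r"[ \t]{2,}|\t")


def contains_whitespace_table(text, min_rows=3, min_columns=2):
    """Return True if enough consecutive lines split into the same number of columns."""
    run_length = 0
    previous_columns = None
    for line in text.split("\n"):
        stripped = line.strip()
        columns = len(COLUMN_GAP.split(stripped)) if stripped else 0
        if columns >= min_columns and columns == previous_columns:
            # Same shape as the previous row: extend the current run.
            run_length += 1
        elif columns >= min_columns:
            # A possible first row of a new table.
            run_length = 1
        else:
            # Prose or a blank line breaks the run.
            run_length = 0
        previous_columns = columns
        if run_length >= min_rows:
            return True
    return False
```

Prose that is separated by single spaces splits into one column and never starts a run, so ordinary paragraphs are not flagged by this check.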

jhoetter added the enhancement (New feature or request) and cognition labels · Sep 26, 2023
