Please describe the module you would like to add to bricks
In the context of RAG (Retrieval Augmented Generation):
If a paragraph contains a table, I want to be able to filter for it easily; a table generally means the paragraph is more complex.
Do you already have an implementation?
This is nowhere near perfect; it is just a first heuristic I used previously to detect whether a text likely contains a table that has no markdown structure, e.g. a pricing table that consists of a few headers followed only by prices.
    import re


    def likely_contains_tabular_data(text):
        # Check for sequences of numbers and special symbols,
        # accounting for whitespace or tabs between elements
        number_pattern = re.compile(r"(\d+([.,]\d+)?)+")
        whitespace_pattern = re.compile(r"\s+")

        lines = text.split("\n")
        lines_with_patterns = [
            line for line in lines if len(number_pattern.findall(line)) >= 3
        ]

        # Heuristic 1: If we find many sequences of numbers separated by whitespace in the same line,
        # it might be tabular data. Let's assume we need at least 3 such sequences for a line to be considered.
        if any(len(number_pattern.findall(line)) >= 3 for line in lines):
            return True

        # Heuristic 2: Check if there are multiple lines with a similar structure of elements.
        # If the majority of these lines have a similar number of numerical elements,
        # it might be a sign of tabular data.
        if lines_with_patterns:
            elements_counts = [
                len(whitespace_pattern.split(line)) for line in lines_with_patterns
            ]
            avg_elements = sum(elements_counts) / len(elements_counts)
            # If most lines have a number of elements close to the average, consider it tabular
            if (
                sum(1 for count in elements_counts if abs(count - avg_elements) <= 1)
                >= len(elements_counts) * 0.7
            ):
                return True

        # No heuristic matched
        return False
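To make the filtering use case concrete, here is a rough sketch of how such a check could split paragraphs before retrieval. The names `has_numeric_row` and `partition_paragraphs` and the sample paragraphs are made up for illustration and are not part of bricks. `has_numeric_row` is a condensed stand-in that reproduces only the first heuristic; as far as I can tell this matches the effective behaviour of the function above, because the second heuristic only looks at lines that would already have triggered the first one.

```python
import re

# Same number pattern as in the heuristic above.
NUMBER_PATTERN = re.compile(r"(\d+([.,]\d+)?)+")


def has_numeric_row(text, min_numbers=3):
    """Hypothetical stand-in for heuristic 1: True if any single line
    holds at least `min_numbers` numeric tokens."""
    return any(
        len(NUMBER_PATTERN.findall(line)) >= min_numbers
        for line in text.split("\n")
    )


def partition_paragraphs(paragraphs, detector=has_numeric_row):
    """Split paragraphs into (table_like, plain) lists using `detector`."""
    table_like, plain = [], []
    for paragraph in paragraphs:
        (table_like if detector(paragraph) else plain).append(paragraph)
    return table_like, plain


# Illustrative input: one pricing-style paragraph, one prose paragraph.
paragraphs = [
    "Plan Monthly Yearly Setup\nBasic 9.99 99.00 0\nPro 19.99 199.00 49",
    "We offer two plans for teams of any size.",
]
table_like, plain = partition_paragraphs(paragraphs)
print(len(table_like), len(plain))  # prints: 1 1
```

Passing the detector as a parameter keeps the filter independent of the heuristic, so a stricter check can be swapped in later without touching the partitioning code.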
Additional context
-
| Fruit | Color | Taste |
|-------------|--------|-----------|
| Apple | Red | Sweet |
| Banana | Yellow | Sweet |
| Orange | Orange | Tangy |
| Strawberry | Red | Sweet |
| Blueberry | Blue | Tart |
The table above lists some common fruits along with their colors and tastes. For example, apples are red and have a sweet taste, while bananas are yellow and also taste sweet. Oranges are orange in color and have a tangy flavor, while strawberries are red and sweet. Finally, blueberries are blue and have a slightly tart taste.
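Since this example contains no digits, the number-based heuristic from the issue would not flag it. A possible complement for tables that do have markdown structure is to look for a header row followed by a delimiter row. This is only a sketch: `contains_markdown_table` and `DELIMITER_ROW` are hypothetical names, and the pattern assumes GitHub-style pipe tables with at least two columns.

```python
import re

# A delimiter row such as |------|:----:|------| with at least two columns.
DELIMITER_ROW = re.compile(r"^\s*\|?\s*:?-+:?\s*(\|\s*:?-+:?\s*)+\|?\s*$")


def contains_markdown_table(text):
    """Return True if a line containing a pipe is directly followed
    by a pipe-table delimiter row."""
    lines = text.split("\n")
    for header, delimiter in zip(lines, lines[1:]):
        if "|" in header and DELIMITER_ROW.match(delimiter):
            return True
    return False


sample = "\n".join(
    [
        "| Fruit | Color | Taste |",
        "|-------------|--------|-----------|",
        "| Apple | Red | Sweet |",
        "The table above lists some common fruits.",
    ]
)
print(contains_markdown_table(sample))
```

Combining both checks with a simple `or` would cover the structured and the unstructured case, at the cost of more false positives.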