Please describe the module you would like to add to bricks
When extracting text from a source such as a PDF document, you often end up with paragraphs that contain odd characters. I want to detect them.
Do you already have an implementation?
If so, please share it here. For instance:
import unicodedata


def detect_unusual_characters(text, allowed_ranges=None):
    """
    Detect unusual characters in a given text based on specified Unicode ranges.

    Parameters:
    - text (str): Input string.
    - allowed_ranges (list): List of allowed Unicode blocks as (start, end) tuples.

    Returns:
    - set: Set of unusual characters.
    """
    if allowed_ranges is None:
        allowed_ranges = [
            (0x0020, 0x007F),  # Basic Latin
            (0x00A0, 0x00FF),  # Latin-1 Supplement
            (0x0100, 0x017F),  # Latin Extended-A
            (0x0180, 0x024F),  # Latin Extended-B
            (0x2000, 0x206F),  # General Punctuation
            (0x20A0, 0x20CF),  # Currency Symbols
        ]

    # Allowed control characters
    allowed_controls = {"\n", "\t", "\r"}

    unusual_chars = {
        char for char in text
        if not any(start <= ord(char) <= end for start, end in allowed_ranges)
        and unicodedata.category(char) != "Zs"
        and char not in allowed_controls
    }
    return unusual_chars


def likely_contains_unusual_characters(text, allowed_ranges=None):
    """
    Detect whether a given text contains unusual characters based on specified Unicode ranges.

    Parameters:
    - text (str): Input string.
    - allowed_ranges (list): List of allowed Unicode blocks as (start, end) tuples.

    Returns:
    - bool: True if text contains unusual characters, False otherwise.
    """
    unusual_chars = detect_unusual_characters(text, allowed_ranges)
    return len(unusual_chars) > 0
Additional context
A paragraph that contains such special characters is generally a "lower quality" paragraph for RAG.
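To illustrate how this could be used for RAG, here is a minimal sketch that drops paragraphs containing unusual characters before indexing. It re-uses the same allowed ranges as the implementation above; the names `is_unusual`, `unusual_ratio`, `filter_paragraphs` and the `max_ratio` threshold are assumptions made for this example, not part of bricks.

```python
import unicodedata

# Same allowed ranges and control characters as in the proposed implementation.
ALLOWED_RANGES = [
    (0x0020, 0x007F),  # Basic Latin
    (0x00A0, 0x00FF),  # Latin-1 Supplement
    (0x0100, 0x017F),  # Latin Extended-A
    (0x0180, 0x024F),  # Latin Extended-B
    (0x2000, 0x206F),  # General Punctuation
    (0x20A0, 0x20CF),  # Currency Symbols
]
ALLOWED_CONTROLS = {"\n", "\t", "\r"}


def is_unusual(char):
    """Return True if char is outside the allowed ranges, spaces, and controls."""
    if char in ALLOWED_CONTROLS or unicodedata.category(char) == "Zs":
        return False
    return not any(start <= ord(char) <= end for start, end in ALLOWED_RANGES)


def unusual_ratio(paragraph):
    """Fraction of characters that are unusual (0.0 for an empty paragraph)."""
    if not paragraph:
        return 0.0
    return sum(is_unusual(c) for c in paragraph) / len(paragraph)


def filter_paragraphs(paragraphs, max_ratio=0.0):
    """Hypothetical helper: keep paragraphs with an unusual ratio <= max_ratio."""
    return [p for p in paragraphs if unusual_ratio(p) <= max_ratio]


paragraphs = [
    "Revenue grew by 12 % to €4.2 million.",
    "Rev\ufffdnue gr\ue000w by 12 \u25a1",  # replacement char, private use, box
]
clean = filter_paragraphs(paragraphs)
print(clean)
```

With the default `max_ratio=0.0` this behaves like `likely_contains_unusual_characters`: any single unusual character rejects the paragraph. A small positive threshold might be preferable when occasional symbols are acceptable.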