[MODULE] - Classifier "contains special characters" #345

Open
jhoetter opened this issue Sep 26, 2023 · 0 comments
Labels
cognition enhancement New feature or request

Please describe the module you would like to add to bricks
When extracting text from, e.g., a PDF document, you will likely encounter paragraphs that contain odd characters. I want to detect them.

Do you already have an implementation?
If so, please share it here. For instance:

import unicodedata


def detect_unusual_characters(text, allowed_ranges=None):
    """
    Detect unusual characters in a given text based on specified Unicode ranges.

    Parameters:
    - text (str): Input string.
    - allowed_ranges (list): List of allowed Unicode ranges as inclusive (start, end) code point tuples.

    Returns:
    - set: Set of unusual characters.
    """
    if allowed_ranges is None:
        allowed_ranges = [
            (0x0020, 0x007E),  # Basic Latin, printable part (excludes DEL, U+007F)
            (0x00A0, 0x00FF),  # Latin-1 Supplement
            (0x0100, 0x017F),  # Latin Extended-A
            (0x0180, 0x024F),  # Latin Extended-B
            (0x2000, 0x206F),  # General Punctuation
            (0x20A0, 0x20CF),  # Currency Symbols
        ]

    # Allowed control characters
    allowed_controls = {"\n", "\t", "\r"}

    unusual_chars = {
        char
        for char in text
        if not any(start <= ord(char) <= end for start, end in allowed_ranges)
        and unicodedata.category(char) != "Zs"
        and char not in allowed_controls
    }

    return unusual_chars


def likely_contains_unusual_characters(text, allowed_ranges=None):
    """
    Detect whether a given text contains unusual characters based on specified Unicode ranges.

    Parameters:
    - text (str): Input string.
    - allowed_ranges (list): List of allowed Unicode ranges as inclusive (start, end) code point tuples.

    Returns:
    - bool: True if text contains unusual characters, False otherwise.
    """
    unusual_chars = detect_unusual_characters(text, allowed_ranges)
    return bool(unusual_chars)
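
When tuning the allowed ranges, it can help to see what each flagged character actually is. The following is a minimal sketch, not part of the proposed module: `describe_characters` is a hypothetical helper name, and the sample set stands in for what `detect_unusual_characters` would return for a string such as "ﬁle ■".

```python
import unicodedata


def describe_characters(chars):
    """Return (char, code point, category, name) tuples, sorted by code point.

    Hypothetical helper for inspecting a set of flagged characters.
    """
    return [
        (
            char,
            f"U+{ord(char):04X}",
            unicodedata.category(char),
            unicodedata.name(char, "<unnamed>"),  # fallback for unnamed code points
        )
        for char in sorted(chars)
    ]


# Stand-in for the output of detect_unusual_characters("ﬁle ■"):
# a ligature and a geometric shape, both outside the default ranges.
flagged = {"\ufb01", "\u25a0"}
for row in describe_characters(flagged):
    print(row)
```

Note that the default General Punctuation range (0x2000 to 0x206F) also covers zero-width and directional formatting characters, so those would not be flagged; whether that is desirable depends on the data.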

Additional context
A paragraph that contains special characters is generally a "lower quality" paragraph for retrieval-augmented generation (RAG).
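
A boolean flag marks a paragraph as low quality on a single odd character, which may be too strict for long paragraphs. One possible refinement, sketched here under assumptions, is to score the share of unusual characters instead. The names `is_unusual`, `unusual_ratio`, and `is_low_quality`, the condensed range list, and the `max_ratio=0.05` threshold are all hypothetical and not part of the proposal above.

```python
import unicodedata

# Assumed allow-list, condensed from the ranges proposed in this issue.
ALLOWED_RANGES = [
    (0x0020, 0x007E),  # Basic Latin, printable part
    (0x00A0, 0x024F),  # Latin-1 Supplement through Latin Extended-B
    (0x2000, 0x206F),  # General Punctuation
    (0x20A0, 0x20CF),  # Currency Symbols
]
ALLOWED_CONTROLS = {"\n", "\t", "\r"}


def is_unusual(char):
    """Return True if char is outside the allow-list and is not a space or allowed control."""
    if char in ALLOWED_CONTROLS or unicodedata.category(char) == "Zs":
        return False
    code = ord(char)
    return not any(start <= code <= end for start, end in ALLOWED_RANGES)


def unusual_ratio(text):
    """Return the fraction of characters in text that are unusual (0.0 for empty text)."""
    if not text:
        return 0.0
    return sum(is_unusual(char) for char in text) / len(text)


def is_low_quality(text, max_ratio=0.05):
    """Flag text whose share of unusual characters exceeds max_ratio (arbitrary default)."""
    return unusual_ratio(text) > max_ratio
```

The threshold would need to be calibrated on real extracted paragraphs before being used to filter a RAG corpus.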
