Discover Melusine, a comprehensive email processing library designed to optimize your email workflow. Leverage Melusine's advanced features to achieve:
Melusine facilitates the integration of deep learning frameworks (HuggingFace, PyTorch, TensorFlow, etc.) and deterministic rules (regex, keywords, heuristics) into a full email qualification workflow.
"},{"location":"#why-choose-melusine","title":"Why Choose Melusine ?","text":"Melusine stands out with its combination of features and advantages:
In the following example, an email is divided into two distinct messages separated by a transition pattern. Each message is then tagged line by line. This email segmentation can later be leveraged to enhance the performance of machine learning models.
Message 1
Dear Kim HELLO
Please find the details in the forwarded email. BODY
Best Regards GREETINGS
Jo Kahn SIGNATURE
Transition pattern
Forwarded by jo@maif.fr on Monday december 12th TRANSITION
From: alex@gmail.com TRANSITION
To: jo@maif.fr TRANSITION
Subject: New address TRANSITION
Message 2
Dear Jo HELLO
A new version of Melusine is about to be released. BODY
Feel free to test it and send us feedbacks! BODY
Thank you for your help. THANKS
Cheers GREETINGS
Alex Leblanc SIGNATURE
55 Rue du Faubourg Saint-Honoré SIGNATURE
75008 Paris SIGNATURE
Sent from my iPhone FOOTER
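This segmentation can be reproduced with the dedicated processors. A minimal sketch, assuming the segmenter and content_tagger configuration keys from the demo pipeline and a dataframe df holding raw emails:

from melusine.processors import Segmenter, ContentTagger

# Instantiate the processors from their default configuration keys
# (both appear in the demo pipeline configuration shown in the Configurations tutorial)
segmenter = Segmenter.from_config(config_key="segmenter")
tagger = ContentTagger.from_config(config_key="content_tagger")

# Split each email into messages, then tag every message line by line
df = segmenter.transform(df)
df = tagger.transform(df)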
"},{"location":"#getting-started","title":"Getting Started","text":"Get started with melusine following our (tested!) tutorials:
Getting Started
MelusinePipeline
MelusineTransformers
MelusineRegex
ML models
MelusineDetector
Configurations
Basic Classification
With Melusine, you're well-equipped to transform your email handling, streamlining processes, maximizing efficiency, and enhancing overall productivity.
"},{"location":"advanced/ContentTagger/","title":"Use custom message tags","text":""},{"location":"advanced/CustomDetector/","title":"Use a custom MelusineDetector template","text":""},{"location":"advanced/CustomDetector/#specify-abstract-methods","title":"Specify abstract methods","text":""},{"location":"advanced/CustomDetector/#row-transformations-vs-dataframe-transformations","title":"Row transformations vs dataframe transformations","text":""},{"location":"advanced/ExchangeConnector/","title":"Connect melusine to a Microsoft Exchange Mailbox","text":""},{"location":"advanced/PreTrainedModelsHF/","title":"Use pre-trained models from HuggingFace","text":""},{"location":"contribute/how_to_contribute/","title":"How to contribute","text":"The melusine library is open to contributions from the community. This page describes the process to follow to contribute to the project:
Changes are submitted as pull requests and merged into the master branch of the main repository.
Melusine, an open-source email processing library, was born at MAIF, a French mutual insurance company founded in 1934 and headquartered in Niort, France. MAIF is committed to social and environmental responsibility, operating as a Société à mission (a company with a purpose).
"},{"location":"history/history/#project-motivation","title":"Project Motivation","text":"MAIF handles a vast volume of emails daily, necessitating an efficient processing solution to ensure timely and accurate handling while maximizing customer satisfaction. Automated email processing plays a crucial role in this endeavor, enabling tasks such as:
After successfully implementing and testing the email processing code in production, MAIF made the strategic decision to open-source Melusine. This decision stems from multiple compelling reasons:
The initial release of Melusine in 2019 followed its successful deployment at MAIF. Since then, Melusine has evolved to meet the growing demands of MAIF's business needs. In 2024, based on extensive feedback gained from Melusine's production usage, a complete refactoring effort resulted in the release of Melusine V3. This latest version boasts enhanced modularity, customization options, seamless integration, robust monitoring capabilities, and simplified maintenance.
"},{"location":"philosophy/philosophy/","title":"Code philosophy","text":""},{"location":"philosophy/philosophy/#what-is-a-code-philosophy-and-why-do-i-need-it","title":"What is a code philosophy and why do I need it ?","text":""},{"location":"philosophy/philosophy/#design-patterns","title":"Design patterns","text":""},{"location":"tutorials/00_GettingStarted/","title":"Getting started with Melusine","text":"Let's run emergency detection with melusine :
Email datasets typically contain information about:
The present tutorial only makes use of the body and header data.
   body                              header
0  This is an ëmèrgénçy              Help
1  How is life ?                     Hey !
2  Urgent update about Mr. Annoying  Latest news
3  Please call me now                URGENT
"},{"location":"tutorials/00_GettingStarted/#code","title":"Code","text":"A typical code for a melusine-based application looks like this:
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Load a pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")  # (1)!

# Run the pipeline
df = pipeline.transform(df)
(1) The pipeline is instantiated from the configuration key demo_pipeline. Melusine users will typically define their own pipeline configuration. See more in the Configurations tutorial.
The pipeline creates extra columns in the dataset. Some columns are temporary variables required by detectors (e.g., normalized_body) and some are detection results with direct business value (e.g., emergency_result).
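To see what was added, the new columns can be listed directly. A minimal sketch, assuming the body and header columns of the demo dataset shown above:

# List the columns added by the pipeline run
new_columns = set(df.columns) - {"body", "header"}
print(sorted(new_columns))
# e.g. ['emergency_result', 'normalized_body', ...] (illustrative output)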
Illustration of the pipeline used in the present tutorial:
---
title: Demonstration pipeline
---
flowchart LR
    Input[[Email]] --> A(Cleaner)
    A(Cleaner) --> C(Normalizer)
    C --> F(Emergency\nDetector)
    F --> Output[[Qualified Email]]
Cleaner: cleaning transformations, such as the normalization of line breaks (\r\n -> \n)
Normalizer: text normalization to delete/replace non-ASCII characters (éöà -> eoa)
EmergencyDetector: detection of urgent emails
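The Normalizer can also be tried on its own. A minimal sketch, assuming the config-dict interface shown in the Configurations tutorial below (column names are illustrative):

import pandas as pd

from melusine.processors import Normalizer

# Instantiate a Normalizer from an explicit configuration dict
normalizer = Normalizer.from_config(
    config_dict={
        "input_columns": ["text"],
        "output_columns": ["normalized_text"],
        "form": "NFKD",
        "lowercase": True,
    }
)

df = pd.DataFrame({"text": ["This is an ëmèrgénçy"]})
df = normalizer.transform(df)
# Expected: the "normalized_text" column contains "this is an emergency"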
Info
This demonstration pipeline is kept minimal, but typical pipelines include more complex preprocessing and a variety of detectors. For example, pipelines may contain:
End users typically want to know what led melusine to a specific detection result. The debug mode generates additional explainability info.
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Activate debug mode
df.debug = True

# Load the default pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")

# Run the pipeline
df = pipeline.transform(df)
A new column debug_emergency is created.
Inspecting the debug data gives a lot of info:
text: The effective text considered for detection.
EmergencyRegex: melusine used an EmergencyRegex object to run detection.
match_result: The EmergencyRegex did not match the text.
positive_match_data: The EmergencyRegex positively matched the text pattern "Urgent" (required condition).
negative_match_data: The EmergencyRegex negatively matched the text pattern "Mr. Annoying" (forbidden condition).
BLACKLIST: Detection groups can be defined to easily link a matching pattern to the corresponding regex. DEFAULT is used if no detection group is specified.

# print(df.iloc[2]["debug_emergency"])
{
    'text': 'Latest news\nUrgent update about Mr. Annoying',
    'EmergencyRegex': {
        'match_result': False,
        'negative_match_data': {
            'BLACKLIST': [
                {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
            ]
        },
        'neutral_match_data': {},
        'positive_match_data': {
            'DEFAULT': [
                {'match_text': 'Urgent', 'start': 12, 'stop': 18}
            ]
        }
    }
}
"},{"location":"tutorials/01_MelusinePipeline/","title":"MelusinePipeline","text":"The MelusinePipeline
class is at the core of melusine. It inherits from the sklearn.Pipeline
class and adds extra functionalities such as :
The MelusineDetector component aims at standardizing how detection is performed in a MelusinePipeline.
Tip
Projects running over several years (such as email automation) may accumulate technical debt over time. Standardizing code practices can limit the technical debt and ease the onboarding of new developers.
The MelusineDetector class splits detection into three steps:
pre_detect: Select/combine the inputs needed for detection. Ex: select the text parts tagged as BODY and combine them with the text in the email header.
detect: Use regular expressions, ML models or heuristics to run detection on the input text.
post_detect: Run detection rules such as thresholding, or combine results from multiple models.
The transform method is defined by the base class MelusineDetector and calls the pre_detect/detect/post_detect methods in turn (Template Method pattern).
# Instantiate the detector
detector = MyDetector()

# Run pre_detect, detect and post_detect on input data
data_with_detection = detector.transform(data)
Here is the full code of a MelusineDetector to detect emails related to viruses. The next sections break down the different parts of the code.
class MyVirusDetector(MelusineDetector):
    """
    Detect if the text mentions viruses.
    """

    # Dataframe column names
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column
        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, df, debug_mode=False):
        # Assemble the text columns into a single column
        df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]
        return df

    def detect(self, df, debug_mode=False):
        text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"(virus)"
        negative_regex = r"(corona[ _]virus)"
        # Pandas str.extract method on columns
        df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex).apply(pd.notna)
        df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex).apply(pd.notna)
        return df

    def post_detect(self, df, debug_mode=False):
        # Boolean operation on pandas columns
        df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
        return df
The detector is run on a simple dataframe:
# Illustrative input data
df = pd.DataFrame(
    [
        {"header": "test", "body": "This email mentions a virus"},
        {"header": "test", "body": "This email is clean"},
    ]
)
detector = MyVirusDetector(body_column="body", header_column="header")
df = detector.transform(df)
The output is a dataframe with a new virus_result column.
Tip
Columns that are not declared in the output_columns are dropped automatically.
In the init method, you should call the superclass init and provide the detector name, the input_columns and the output_columns.
Tip
If the init method of the superclass is enough (parameters name, input_columns and output_columns), you may skip the init method entirely when defining your MelusineDetector.
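For reference, the init method of the virus detector shown above follows exactly this pattern:

def __init__(self, body_column: str, header_column: str):
    self.body_column = body_column
    self.header_column = header_column
    super().__init__(
        input_columns=[self.body_column, self.header_column],
        output_columns=[self.OUTPUT_RESULT_COLUMN],
        name="virus",
    )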
The pre_detect method simply combines the header text and the body text (separated by a line break).
def pre_detect(self, df, debug_mode=False):
    # Assemble the text columns into a single column
    df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]
    return df
"},{"location":"tutorials/05a_MelusineDetectors/#detector-detect","title":"Detector detect","text":"The detect
applies two regexes on the selected text: - A positive regex to catch mentions to viruses - A negative regex to avoid false positive detections
def detect(self, df, debug_mode=False):
    text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
    positive_regex = r"(virus)"
    negative_regex = r"(corona[ _]virus)"
    # Pandas str.extract method on columns
    df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex).apply(pd.notna)
    df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex).apply(pd.notna)
    return df
"},{"location":"tutorials/05a_MelusineDetectors/#detector-post_detect","title":"Detector post_detect","text":"The post_detect
combines the regex detection result to determine the final result.
def post_detect(self, df, debug_mode=False):
    # Boolean operation on pandas columns
    df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
    return df
"},{"location":"tutorials/05a_MelusineDetectors/#are-melusinedetectors-mandatory-for-melusine","title":"Are MelusineDetectors mandatory for melusine?","text":"No.
You can use any scikit-learn compatible component in your MelusinePipeline. However, we recommend using the MelusineDetector (and MelusineTransformer) classes to benefit from:
Check out the next tutorial to discover advanced features of the MelusineDetector class.
This tutorial presents the advanced features of the MelusineDetector class:
MelusineDetectors are designed to be easily debugged. For that purpose, the pre_detect/detect/post_detect methods all have a debug_mode argument. The debug mode is activated by setting the debug attribute of a dataframe to True.
import pandas as pd

df = pd.DataFrame({"bla": [1, 2, 3]})
df.debug = True
Warning
Debug mode activation is backend dependent. With a DictBackend, you should use my_dict["debug"] = True.
When debug mode is activated, a column named "DETECTOR_NAME_debug" containing an empty dictionary is automatically created. Populating this debug dict with debug info is then the user's responsibility.
Example of a detector with debug data:
import re

from melusine.base import MelusineDetector


class MyVirusDetector(MelusineDetector):
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column
        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, row, debug_mode=False):
        effective_text = row[self.header_column] + "\n" + row[self.body_column]
        row[self.TMP_DETECTION_INPUT_COLUMN] = effective_text
        if debug_mode:
            row[self.debug_dict_col] = {"detection_input": row[self.TMP_DETECTION_INPUT_COLUMN]}
        return row

    def detect(self, row, debug_mode=False):
        text = row[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"virus"
        negative_regex = r"corona[ _]virus"
        positive_match = re.search(positive_regex, text)
        negative_match = re.search(negative_regex, text)
        row[self.TMP_POSITIVE_REGEX_MATCH] = bool(positive_match)
        row[self.TMP_NEGATIVE_REGEX_MATCH] = bool(negative_match)
        if debug_mode:
            positive_match_text = (
                positive_match.string[positive_match.start() : positive_match.end()] if positive_match else None
            )
            negative_match_text = (
                negative_match.string[negative_match.start() : negative_match.end()] if negative_match else None
            )
            row[self.debug_dict_col].update(
                {
                    "positive_match_data": {"result": bool(positive_match), "match_text": positive_match_text},
                    "negative_match_data": {"result": bool(negative_match), "match_text": negative_match_text},
                }
            )
        return row

    def post_detect(self, row, debug_mode=False):
        if row[self.TMP_POSITIVE_REGEX_MATCH] and not row[self.TMP_NEGATIVE_REGEX_MATCH]:
            row[self.OUTPUT_RESULT_COLUMN] = True
        else:
            row[self.OUTPUT_RESULT_COLUMN] = False
        return row
In the end, an extra column is created containing debug data:
   virus_result  debug_virus
0  True          {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}}
1  False         {'detection_input': '...', 'positive_match_data': {'result': False, 'match_text': None}, 'negative_match_data': {'result': False, 'match_text': None}}
2  True          {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}}
3  False         {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': True, 'match_text': 'corona virus'}}
"},{"location":"tutorials/05b_MelusineDetectorsAdvanced/#row-methods-vs-dataframe-methods","title":"Row methods vs dataframe methods","text":"There are two ways to use the pre-detect/detect/post-detect methods:
Tip
Using row wise methods makes your code backend independent. You may switch from a PandasBackend to a DictBackend at any time. The PandasBackend also supports multiprocessing for row wise methods.
To use row wise methods, you just need to name the first parameter "row". Otherwise, dataframe wise transformations are used.
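A minimal sketch of the naming convention (method bodies and class scaffolding are illustrative; the other transform methods are omitted for brevity):

# Row wise: the first parameter is named "row" => the method is called once per email
class RowWiseDetector(MelusineDetector):
    def detect(self, row, debug_mode=False):
        row["match"] = "virus" in row["text"]
        return row

# Dataframe wise: any other first parameter name => the method is called once per batch
class DataFrameWiseDetector(MelusineDetector):
    def detect(self, df, debug_mode=False):
        df["match"] = df["text"].str.contains("virus")
        return df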
Example of a detector with dataframe wise methods (works with a PandasBackend only):
class MyVirusDetector(MelusineDetector):
    """
    Detect if the text mentions viruses.
    """

    # Dataframe column names
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column
        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, df, debug_mode=False):
        # Assemble the text columns into a single column
        df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]
        return df

    def detect(self, df, debug_mode=False):
        text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"(virus)"
        negative_regex = r"(corona[ _]virus)"
        # Pandas str.extract method on columns
        df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex).apply(pd.notna)
        df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex).apply(pd.notna)
        return df

    def post_detect(self, df, debug_mode=False):
        # Boolean operation on pandas columns
        df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
        return df
"},{"location":"tutorials/05b_MelusineDetectorsAdvanced/#custom-transform-methods","title":"Custom transform methods","text":"If you are not happy with the pre_detect
/detect
/post_detect
transform methods, you:
MelusineDetector
class)In this exemple, the prepare
/run
custom transform methods are used instead of the default pre_detect
/detect
/post_detect
.
from typing import Callable, List

from melusine.base import BaseMelusineDetector  # assumed import path, alongside MelusineDetector


class MyCustomDetector(BaseMelusineDetector):
    @property
    def transform_methods(self) -> List[Callable]:
        return [self.prepare, self.run]

    def prepare(self, row, debug_mode=False):
        return row

    def run(self, row, debug_mode=False):
        row[self.output_columns[0]] = "12345"
        return row
To configure custom transform methods you need to override the transform_methods property.
The transform method will now call prepare and run.
df = pd.DataFrame(
    [
        {"input_col": "test1"},
        {"input_col": "test2"},
    ]
)
detector = MyCustomDetector(input_columns=["input_col"], output_columns=["output_col"], name="custom")
df = detector.transform(df)
We can check that the run method was indeed called.
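A quick check (the expected output is sketched from the run method above, which writes "12345" to the output column):

print(df["output_col"])
# 0    12345
# 1    12345
# Name: output_col, dtype: object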
Melusine components can be instantiated using parameters defined in configurations. The from_config method accepts a config_dict argument
from melusine.processors import Normalizer

normalizer_conf = {
    "input_columns": ["text"],
    "output_columns": ["normalized_text"],
    "form": "NFKD",
    "lowercase": False,
}
normalizer = Normalizer.from_config(config_dict=normalizer_conf)
or a config_key argument.
from melusine.pipeline import MelusinePipeline

pipeline = MelusinePipeline.from_config(config_key="demo_pipeline")
When demo_pipeline is given as argument, parameters are read from the melusine.config object at key demo_pipeline.
"},{"location":"tutorials/06_Configurations/#access-configurations","title":"Access configurations","text":"The melusine configurations can be accessed with the config object.
from melusine import config

print(config["demo_pipeline"])
The configuration of the demo_pipeline can then be easily inspected.
{
    'steps': [
        {'class_name': 'Cleaner', 'config_key': 'body_cleaner', 'module': 'melusine.processors'},
        {'class_name': 'Cleaner', 'config_key': 'header_cleaner', 'module': 'melusine.processors'},
        {'class_name': 'Segmenter', 'config_key': 'segmenter', 'module': 'melusine.processors'},
        {'class_name': 'ContentTagger', 'config_key': 'content_tagger', 'module': 'melusine.processors'},
        {'class_name': 'TextExtractor', 'config_key': 'text_extractor', 'module': 'melusine.processors'},
        {'class_name': 'Normalizer', 'config_key': 'demo_normalizer', 'module': 'melusine.processors'},
        {'class_name': 'EmergencyDetector', 'config_key': 'emergency_detector', 'module': 'melusine.detectors'}
    ]
}
"},{"location":"tutorials/06_Configurations/#modify-configurations","title":"Modify configurations","text":"The simplest way to modify configurations is to create a new directory directly.
from melusine import config

# Get a dict of the existing conf
new_conf = config.dict()

# Add/Modify a config key
new_conf["my_conf_key"] = "my_conf_value"

# Reset Melusine configurations
config.reset(new_conf)
To deliver code in a production environment, using configuration files should be preferred to modifying the configurations on the fly. Melusine lets you specify the path to a folder containing yaml files and loads them (the OmegaConf package is used behind the scenes).
from melusine import config

# Specify the path to a conf folder
conf_path = "path/to/conf/folder"

# Reset Melusine configurations
config.reset(config_path=conf_path)
# >> Using config_path : path/to/conf/folder
When the MELUSINE_CONFIG_DIR environment variable is set, Melusine directly loads the configuration files located at the path specified by the environment variable.
import os

from melusine import config

# Specify the MELUSINE_CONFIG_DIR environment variable
os.environ["MELUSINE_CONFIG_DIR"] = "path/to/conf/folder"

# Reset Melusine configurations
config.reset()
# >> Using config_path from env variable MELUSINE_CONFIG_DIR
# >> Using config_path : path/to/conf/folder
Tip
If the MELUSINE_CONFIG_DIR is set before melusine is imported (e.g., before starting the program), you don't need to call config.reset().
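A minimal sketch of that pattern (the path is illustrative):

# Set the variable before the first melusine import, so no config.reset() call is needed
import os

os.environ["MELUSINE_CONFIG_DIR"] = "path/to/conf/folder"

from melusine import config  # configurations are loaded from MELUSINE_CONFIG_DIR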
Creating your configuration folder from scratch would be cumbersome. It is advised to export the default configurations and then modify just the files you need.
from melusine import config

# Specify the path to a folder (created if it doesn't exist)
conf_path = "path/to/conf/folder"

# Export default configurations to the folder
files_created = config.export_default_config(path=conf_path)
Tip
The export_default_config method returns a list of paths to all the files created.
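For instance, the returned list can be used to review what was exported (file names depend on the melusine release, so no output is shown here):

# Inspect the exported configuration files
for file_path in files_created:
    print(file_path)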
Machine Learning is commonly used to classify data into pre-defined categories.
---
title: Email classification
---
flowchart LR
    Input[[Email]] --> X(((Classifier)))
    X --> A(Car)
    X --> B(Boat)
    X --> C(Housing)
    X --> D(Health)
Typically, to reach high classification performance, models need to be trained on context-specific labeled data. Zero-shot classification uses a pre-trained model and does not require further training on context-specific data.
"},{"location":"tutorials/07_BasicClassification/#tutorial-intro","title":"Tutorial intro","text":"In this tutorial we want to detect insatisfaction in an email dataset. Let's create a basic dataset:
import pandas as pd
from transformers import pipeline

from melusine.base import MelusineDetector


def create_dataset():
    df = pd.DataFrame(
        [
            {
                "header": "Dossier 123456",
                "body": "Merci beaucoup pour votre gentillesse et votre écoute !",
            },
            {
                "header": "Réclamation (Dossier 987654)",
                "body": "Bonjour, je ne suis pas satisfait de cette situation, répondez-moi rapidement svp!",
            },
        ]
    )
    return df
   header                        body
0  Dossier 123456                Merci beaucoup pour votre gentillesse et votre écoute !
1  Réclamation (Dossier 987654)  Bonjour, je ne suis pas satisfait de cette situation, répondez-moi rapidement svp!
"},{"location":"tutorials/07_BasicClassification/#classify-with-zero-shot-classification","title":"Classify with Zero-Shot-Classification","text":"The transformers library makes it really simple to use pre-trained models for zero-shot classification.
model_name_or_path = "cmarkea/distilcamembert-base-nli"
sentences = [
    "Quelle belle journée aujourd'hui",
    "La marée est haute",
    "Ce film est une catastrophe, je suis en colère",
]
classifier = pipeline(task="zero-shot-classification", model=model_name_or_path, tokenizer=model_name_or_path)
result = classifier(
    sequences=sentences, candidate_labels=", ".join(["positif", "négatif"]), hypothesis_template="Ce texte est {}."
)
The classifier returns a score for the "positif" and "négatif" labels for each input text:
[
    {
        'sequence': "Quelle belle journée aujourd'hui",
        'labels': ['positif', 'négatif'],
        'scores': [0.95, 0.05]
    },
    {
        'sequence': 'La marée est haute',
        'labels': ['positif', 'négatif'],
        'scores': [0.76, 0.24]
    },
    {
        'sequence': 'Ce film est une catastrophe, je suis en colère',
        'labels': ['négatif', 'positif'],
        'scores': [0.97, 0.03]
    }
]
"},{"location":"tutorials/07_BasicClassification/#implement-a-dissatisfaction-detector","title":"Implement a Dissatisfaction detector","text":"A full email processing pipeline could contain multiple models. Melusine uses the MelusineDetector template class to standardise how models are integrated into a pipeline.
from typing import List


class DissatisfactionDetector(MelusineDetector):
    """
    Detect if the text expresses dissatisfaction.
    """

    # Dataframe column names
    OUTPUT_RESULT_COLUMN = "dissatisfaction_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_DETECTION_OUTPUT_COLUMN = "detection_output"

    # Model inference parameters
    POSITIVE_LABEL = "positif"
    NEGATIVE_LABEL = "négatif"
    HYPOTHESIS_TEMPLATE = "Ce texte est {}."

    def __init__(self, model_name_or_path: str, text_columns: List[str], threshold: float):
        self.text_columns = text_columns
        self.threshold = threshold
        self.classifier = pipeline(
            task="zero-shot-classification", model=model_name_or_path, tokenizer=model_name_or_path
        )
        super().__init__(input_columns=text_columns, output_columns=[self.OUTPUT_RESULT_COLUMN], name="dissatisfaction")
The pre_detect method assembles the text that we want to use for classification.
def pre_detect(self, row, debug_mode=False):
    # Assemble the text columns into a single text
    effective_text = ""
    for col in self.text_columns:
        effective_text += "\n" + row[col]
    row[self.TMP_DETECTION_INPUT_COLUMN] = effective_text

    # Store the effective detection text in the debug data
    if debug_mode:
        row[self.debug_dict_col] = {"detection_input": row[self.TMP_DETECTION_INPUT_COLUMN]}
    return row
The detect method runs the classification model on the text.
def detect(self, row, debug_mode=False):
    # Run the classifier on the text
    pipeline_result = self.classifier(
        sequences=row[self.TMP_DETECTION_INPUT_COLUMN],
        candidate_labels=", ".join([self.POSITIVE_LABEL, self.NEGATIVE_LABEL]),
        hypothesis_template=self.HYPOTHESIS_TEMPLATE,
    )

    # Format classification result
    result_dict = dict(zip(pipeline_result["labels"], pipeline_result["scores"]))
    row[self.TMP_DETECTION_OUTPUT_COLUMN] = result_dict

    # Store ML results in the debug data
    if debug_mode:
        row[self.debug_dict_col].update(result_dict)
    return row
The post_detect method applies a threshold on the prediction score to determine the detection result.
def post_detect(self, row, debug_mode=False):
    # Compare classification score to the detection threshold
    if row[self.TMP_DETECTION_OUTPUT_COLUMN][self.NEGATIVE_LABEL] > self.threshold:
        row[self.OUTPUT_RESULT_COLUMN] = True
    else:
        row[self.OUTPUT_RESULT_COLUMN] = False
    return row
On top of that, the detector takes care of building debug data to make the result explicable.
"},{"location":"tutorials/07_BasicClassification/#run-detection","title":"Run detection","text":"Putting it all together, we run the detector on the input dataset.
df = create_dataset()
detector = DissatisfactionDetector(
    model_name_or_path="cmarkea/distilcamembert-base-nli",
    text_columns=["header", "body"],
    threshold=0.7,
)
df = detector.transform(df)
As a result, we get a new column dissatisfaction_result with the detection result. We can get detection details by running the detector in debug mode.
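A minimal sketch of such a debug run (assuming a pandas backend; the debug column name follows the debug_<detector name> pattern seen in the previous tutorial):

df = create_dataset()
df.debug = True
df = detector.transform(df)

# Inspect the debug data for the second email (illustrative)
print(df.loc[1, "debug_dissatisfaction"])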