Discover Melusine, a comprehensive email processing library designed to optimize your email workflow. Leverage Melusine's advanced features to achieve:
Melusine facilitates the integration of deep learning frameworks (HuggingFace, PyTorch, TensorFlow, etc.) and deterministic rules (regex, keywords, heuristics) into a full email qualification workflow.
"},{"location":"#why-choose-melusine","title":"Why Choose Melusine ?","text":"Melusine stands out with its combination of features and advantages:
In the following example, an email is divided into two distinct messages separated by a transition pattern. Each message is then tagged line by line. This email segmentation can later be leveraged to enhance the performance of machine learning models.
Message 1
Dear Kim HELLO
Please find the details in the forwarded email. BODY
Best Regards GREETINGS
Jo Kahn SIGNATURE
Transition pattern
Forwarded by jo@maif.fr on Monday december 12th TRANSITION
From: alex@gmail.com TRANSITION
To: jo@maif.fr TRANSITION
Subject: New address TRANSITION
Message 2
Dear Jo HELLO
A new version of Melusine is about to be released. BODY
Feel free to test it and send us feedbacks! BODY
Thank you for your help. THANKS
Cheers GREETINGS
Alex Leblanc SIGNATURE
55 Rue du Faubourg Saint-Honoré SIGNATURE
75008 Paris SIGNATURE
Sent from my iPhone FOOTER
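This segmentation can be reproduced with the dedicated processors. A minimal sketch, assuming the segmenter and content_tagger configuration keys from the demo pipeline and a dataframe df holding raw emails:

from melusine.processors import Segmenter, ContentTagger

# Instantiate the processors from their default configuration keys
# (both appear in the demo pipeline configuration shown in the Configurations tutorial)
segmenter = Segmenter.from_config(config_key="segmenter")
tagger = ContentTagger.from_config(config_key="content_tagger")

# Split each email into messages, then tag every message line by line
df = segmenter.transform(df)
df = tagger.transform(df)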
"},{"location":"#getting-started","title":"Getting Started","text":"Get started with melusine following our (tested!) tutorials:
Getting Started
MelusinePipeline
MelusineTransformers
MelusineRegex
ML models
MelusineDetector
Configurations
Basic Classification
With Melusine, you're well-equipped to transform your email handling, streamlining processes, maximizing efficiency, and enhancing overall productivity.
"},{"location":"advanced/ContentTagger/","title":"Use custom message tags","text":""},{"location":"advanced/CustomDetector/","title":"Use a custom MelusineDetector template","text":""},{"location":"advanced/CustomDetector/#specify-abstract-methods","title":"Specify abstract methods","text":""},{"location":"advanced/CustomDetector/#row-transformations-vs-dataframe-transformations","title":"Row transformations vs dataframe transformations","text":""},{"location":"advanced/ExchangeConnector/","title":"Connect melusine to a Microsoft Exchange Mailbox","text":""},{"location":"advanced/PreTrainedModelsHF/","title":"Use pre-trained models from HuggingFace","text":""},{"location":"contribute/how_to_contribute/","title":"How to contribute","text":"The melusine library is open to contributions from the community. This page describes the process to follow to contribute to the project:
Changes are submitted as pull requests and merged into the master branch of the main repository.
Melusine, an open-source email processing library, was born at MAIF, a French mutual insurance company founded in 1934 and headquartered in Niort, France. MAIF is committed to social and environmental responsibility, operating as a Société à mission (a company with a purpose).
"},{"location":"history/history/#project-motivation","title":"Project Motivation","text":"MAIF handles a vast volume of emails daily, necessitating an efficient processing solution to ensure timely and accurate handling while maximizing customer satisfaction. Automated email processing plays a crucial role in this endeavor, enabling tasks such as:
After successfully implementing and testing the email processing code in production, MAIF made the strategic decision to open-source Melusine. This decision stems from multiple compelling reasons:
The initial release of Melusine in 2019 followed its successful deployment at MAIF. Since then, Melusine has evolved to meet the growing demands of MAIF's business needs. In 2024, based on extensive feedback gained from Melusine's production usage, a complete refactoring effort resulted in the release of Melusine V3. This latest version boasts enhanced modularity, customization options, seamless integration, robust monitoring capabilities, and simplified maintenance.
"},{"location":"philosophy/philosophy/","title":"Code philosophy","text":""},{"location":"philosophy/philosophy/#what-is-a-code-philosophy-and-why-do-i-need-it","title":"What is a code philosophy and why do I need it ?","text":""},{"location":"philosophy/philosophy/#design-patterns","title":"Design patterns","text":""},{"location":"tutorials/00_GettingStarted/","title":"Getting started with Melusine","text":"Let's run emergency detection with melusine :
Email datasets typically contain information about:
The present tutorial only makes use of the body and header data.
   body                              header
0  This is an ëmèrgénçy              Help
1  How is life ?                     Hey !
2  Urgent update about Mr. Annoying  Latest news
3  Please call me now                URGENT
"},{"location":"tutorials/00_GettingStarted/#code","title":"Code","text":"A typical code for a melusine-based application looks like this:
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Load a pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")  # (1)!

# Run the pipeline
df = pipeline.transform(df)
(1) The pipeline is instantiated from the configuration key demo_pipeline. Melusine users will typically define their own pipeline configuration. See more in the Configurations tutorial.
The pipeline creates extra columns in the dataset. Some columns are temporary variables required by detectors (e.g., normalized_body) and some are detection results with direct business value (e.g., emergency_result).
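To see what was added, the new columns can be listed directly. A minimal sketch, assuming the body and header columns of the demo dataset shown above:

# List the columns added by the pipeline run
new_columns = set(df.columns) - {"body", "header"}
print(sorted(new_columns))
# e.g. ['emergency_result', 'normalized_body', ...] (illustrative output)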
Illustration of the pipeline used in the present tutorial:
---
title: Demonstration pipeline
---
flowchart LR
    Input[[Email]] --> A(Cleaner)
    A(Cleaner) --> C(Normalizer)
    C --> F(Emergency\nDetector)
    F --> Output[[Qualified Email]]
Cleaner: cleaning transformations, such as the normalization of line breaks (\r\n -> \n)
Normalizer: text normalization to delete/replace non-ASCII characters (éöà -> eoa)
EmergencyDetector: detection of urgent emails
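The Normalizer can also be tried on its own. A minimal sketch, assuming the config-dict interface shown in the Configurations tutorial below (column names are illustrative):

import pandas as pd

from melusine.processors import Normalizer

# Instantiate a Normalizer from an explicit configuration dict
normalizer = Normalizer.from_config(
    config_dict={
        "input_columns": ["text"],
        "output_columns": ["normalized_text"],
        "form": "NFKD",
        "lowercase": True,
    }
)

df = pd.DataFrame({"text": ["This is an ëmèrgénçy"]})
df = normalizer.transform(df)
# Expected: the "normalized_text" column contains "this is an emergency"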
Info
This demonstration pipeline is kept minimal, but typical pipelines include more complex preprocessing and a variety of detectors. For example, pipelines may contain:
End users typically want to know what led melusine to a specific detection result. The debug mode generates additional explainability info.
from melusine.data import load_email_data
from melusine.pipeline import MelusinePipeline

# Load an email dataset
df = load_email_data()

# Activate debug mode
df.debug = True

# Load the default pipeline
pipeline = MelusinePipeline.from_config("demo_pipeline")

# Run the pipeline
df = pipeline.transform(df)
A new column debug_emergency is created.
Inspecting the debug data gives a lot of info:
text: The effective text considered for detection.
EmergencyRegex: melusine used an EmergencyRegex object to run detection.
match_result: The EmergencyRegex did not match the text.
positive_match_data: The EmergencyRegex positively matched the text pattern "Urgent" (required condition).
negative_match_data: The EmergencyRegex negatively matched the text pattern "Mr. Annoying" (forbidden condition).
BLACKLIST: Detection groups can be defined to easily link a matching pattern to the corresponding regex. DEFAULT is used if no detection group is specified.

# print(df.iloc[2]["debug_emergency"])
{
    'text': 'Latest news\nUrgent update about Mr. Annoying',
    'EmergencyRegex': {
        'match_result': False,
        'negative_match_data': {
            'BLACKLIST': [
                {'match_text': 'Mr. Annoying', 'start': 32, 'stop': 44}
            ]
        },
        'neutral_match_data': {},
        'positive_match_data': {
            'DEFAULT': [
                {'match_text': 'Urgent', 'start': 12, 'stop': 18}
            ]
        }
    }
}
"},{"location":"tutorials/01_MelusinePipeline/","title":"MelusinePipeline","text":"The MelusinePipeline
class is at the core of melusine. It inherits from the sklearn.Pipeline
class and adds extra functionalities such as :
The MelusineDetector component aims at standardizing how detection is performed in a MelusinePipeline.
Tip
Projects running over several years (such as email automation) may accumulate technical debt over time. Standardizing code practices can limit the technical debt and ease the onboarding of new developers.
The MelusineDetector class splits detection into three steps:
pre_detect: Select/combine the inputs needed for detection. Ex: select the text parts tagged as BODY and combine them with the text in the email header.
detect: Use regular expressions, ML models or heuristics to run detection on the input text.
post_detect: Run detection rules such as thresholding, or combine results from multiple models.
The transform method is defined by the base class MelusineDetector and calls the pre_detect/detect/post_detect methods in turn (Template Method pattern).
# Instantiate the detector
detector = MyDetector()

# Run pre_detect, detect and post_detect on input data
data_with_detection = detector.transform(data)
Here is the full code of a MelusineDetector to detect emails related to viruses. The next sections break down the different parts of the code.
class MyVirusDetector(MelusineDetector):
    """
    Detect if the text mentions viruses.
    """

    # Dataframe column names
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column
        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, df, debug_mode=False):
        # Assemble the text columns into a single column
        df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]
        return df

    def detect(self, df, debug_mode=False):
        text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"(virus)"
        negative_regex = r"(corona[ _]virus)"
        # Pandas str.extract method on columns
        df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex).apply(pd.notna)
        df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex).apply(pd.notna)
        return df

    def post_detect(self, df, debug_mode=False):
        # Boolean operation on pandas columns
        df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
        return df
The detector is run on a simple dataframe:
# Illustrative input data
df = pd.DataFrame(
    [
        {"header": "test", "body": "This email mentions a virus"},
        {"header": "test", "body": "This email is clean"},
    ]
)
detector = MyVirusDetector(body_column="body", header_column="header")
df = detector.transform(df)
The output is a dataframe with a new virus_result column.
Tip
Columns that are not declared in the output_columns are dropped automatically.
In the init method, you should call the superclass init and provide the detector name, the input_columns and the output_columns.
Tip
If the init method of the superclass is enough (parameters name, input_columns and output_columns), you may skip the init method entirely when defining your MelusineDetector.
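For reference, the init method of the virus detector shown above follows exactly this pattern:

def __init__(self, body_column: str, header_column: str):
    self.body_column = body_column
    self.header_column = header_column
    super().__init__(
        input_columns=[self.body_column, self.header_column],
        output_columns=[self.OUTPUT_RESULT_COLUMN],
        name="virus",
    )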
The pre_detect method simply combines the header text and the body text (separated by a line break).
def pre_detect(self, df, debug_mode=False):
    # Assemble the text columns into a single column
    df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]
    return df
"},{"location":"tutorials/05a_MelusineDetectors/#detector-detect","title":"Detector detect","text":"The detect
applies two regexes on the selected text: - A positive regex to catch mentions to viruses - A negative regex to avoid false positive detections
def detect(self, df, debug_mode=False):
    text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
    positive_regex = r"(virus)"
    negative_regex = r"(corona[ _]virus)"
    # Pandas str.extract method on columns
    df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex).apply(pd.notna)
    df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex).apply(pd.notna)
    return df
"},{"location":"tutorials/05a_MelusineDetectors/#detector-post_detect","title":"Detector post_detect","text":"The post_detect
combines the regex detection result to determine the final result.
def post_detect(self, df, debug_mode=False):
    # Boolean operation on pandas columns
    df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
    return df
"},{"location":"tutorials/05a_MelusineDetectors/#are-melusinedetectors-mandatory-for-melusine","title":"Are MelusineDetectors mandatory for melusine?","text":"No.
You can use any scikit-learn compatible component in your MelusinePipeline. However, we recommend using the MelusineDetector (and MelusineTransformer) classes to benefit from:
Check out the next tutorial to discover advanced features of the MelusineDetector class.
This tutorial presents the advanced features of the MelusineDetector class:
MelusineDetectors are designed to be easily debugged. For that purpose, the pre_detect/detect/post_detect methods all have a debug_mode argument. The debug mode is activated by setting the debug attribute of a dataframe to True.
import pandas as pd

df = pd.DataFrame({"bla": [1, 2, 3]})
df.debug = True
Warning
Debug mode activation is backend dependent. With a DictBackend, you should use my_dict["debug"] = True.
When debug mode is activated, a column named "DETECTOR_NAME_debug" containing an empty dictionary is automatically created. Populating this debug dict with debug info is then the user's responsibility.
Example of a detector with debug data:
import re

from melusine.base import MelusineDetector


class MyVirusDetector(MelusineDetector):
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column
        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, row, debug_mode=False):
        effective_text = row[self.header_column] + "\n" + row[self.body_column]
        row[self.TMP_DETECTION_INPUT_COLUMN] = effective_text
        if debug_mode:
            row[self.debug_dict_col] = {"detection_input": row[self.TMP_DETECTION_INPUT_COLUMN]}
        return row

    def detect(self, row, debug_mode=False):
        text = row[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"virus"
        negative_regex = r"corona[ _]virus"
        positive_match = re.search(positive_regex, text)
        negative_match = re.search(negative_regex, text)
        row[self.TMP_POSITIVE_REGEX_MATCH] = bool(positive_match)
        row[self.TMP_NEGATIVE_REGEX_MATCH] = bool(negative_match)
        if debug_mode:
            positive_match_text = (
                positive_match.string[positive_match.start() : positive_match.end()] if positive_match else None
            )
            negative_match_text = (
                negative_match.string[negative_match.start() : negative_match.end()] if negative_match else None
            )
            row[self.debug_dict_col].update(
                {
                    "positive_match_data": {"result": bool(positive_match), "match_text": positive_match_text},
                    "negative_match_data": {"result": bool(negative_match), "match_text": negative_match_text},
                }
            )
        return row

    def post_detect(self, row, debug_mode=False):
        if row[self.TMP_POSITIVE_REGEX_MATCH] and not row[self.TMP_NEGATIVE_REGEX_MATCH]:
            row[self.OUTPUT_RESULT_COLUMN] = True
        else:
            row[self.OUTPUT_RESULT_COLUMN] = False
        return row
In the end, an extra column is created containing debug data:
   virus_result  debug_virus
0  True          {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}}
1  False         {'detection_input': '...', 'positive_match_data': {'result': False, 'match_text': None}, 'negative_match_data': {'result': False, 'match_text': None}}
2  True          {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': False, 'match_text': None}}
3  False         {'detection_input': '...', 'positive_match_data': {'result': True, 'match_text': 'virus'}, 'negative_match_data': {'result': True, 'match_text': 'corona virus'}}
"},{"location":"tutorials/05b_MelusineDetectorsAdvanced/#row-methods-vs-dataframe-methods","title":"Row methods vs dataframe methods","text":"There are two ways to use the pre-detect/detect/post-detect methods:
Tip
Using row wise methods makes your code backend independent. You may switch from a PandasBackend to a DictBackend at any time. The PandasBackend also supports multiprocessing for row wise methods.
To use row wise methods, you just need to name the first parameter "row". Otherwise, dataframe wise transformations are used.
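A minimal sketch of the naming convention (method bodies and class scaffolding are illustrative; the other transform methods are omitted for brevity):

# Row wise: the first parameter is named "row" => the method is called once per email
class RowWiseDetector(MelusineDetector):
    def detect(self, row, debug_mode=False):
        row["match"] = "virus" in row["text"]
        return row

# Dataframe wise: any other first parameter name => the method is called once per batch
class DataFrameWiseDetector(MelusineDetector):
    def detect(self, df, debug_mode=False):
        df["match"] = df["text"].str.contains("virus")
        return df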
Example of a detector with dataframe wise methods (works with a PandasBackend only):
class MyVirusDetector(MelusineDetector):
    """
    Detect if the text mentions viruses.
    """

    # Dataframe column names
    OUTPUT_RESULT_COLUMN = "virus_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_POSITIVE_REGEX_MATCH = "positive_regex_match"
    TMP_NEGATIVE_REGEX_MATCH = "negative_regex_match"

    def __init__(self, body_column: str, header_column: str):
        self.body_column = body_column
        self.header_column = header_column
        super().__init__(
            input_columns=[self.body_column, self.header_column],
            output_columns=[self.OUTPUT_RESULT_COLUMN],
            name="virus",
        )

    def pre_detect(self, df, debug_mode=False):
        # Assemble the text columns into a single column
        df[self.TMP_DETECTION_INPUT_COLUMN] = df[self.header_column] + "\n" + df[self.body_column]
        return df

    def detect(self, df, debug_mode=False):
        text_column = df[self.TMP_DETECTION_INPUT_COLUMN]
        positive_regex = r"(virus)"
        negative_regex = r"(corona[ _]virus)"
        # Pandas str.extract method on columns
        df[self.TMP_POSITIVE_REGEX_MATCH] = text_column.str.extract(positive_regex).apply(pd.notna)
        df[self.TMP_NEGATIVE_REGEX_MATCH] = text_column.str.extract(negative_regex).apply(pd.notna)
        return df

    def post_detect(self, df, debug_mode=False):
        # Boolean operation on pandas columns
        df[self.OUTPUT_RESULT_COLUMN] = df[self.TMP_POSITIVE_REGEX_MATCH] & ~df[self.TMP_NEGATIVE_REGEX_MATCH]
        return df
"},{"location":"tutorials/05b_MelusineDetectorsAdvanced/#custom-transform-methods","title":"Custom transform methods","text":"If you are not happy with the pre_detect
/detect
/post_detect
transform methods, you:
MelusineDetector
class)In this exemple, the prepare
/run
custom transform methods are used instead of the default pre_detect
/detect
/post_detect
.
from typing import Callable, List

from melusine.base import BaseMelusineDetector  # assumed import path, alongside MelusineDetector


class MyCustomDetector(BaseMelusineDetector):
    @property
    def transform_methods(self) -> List[Callable]:
        return [self.prepare, self.run]

    def prepare(self, row, debug_mode=False):
        return row

    def run(self, row, debug_mode=False):
        row[self.output_columns[0]] = "12345"
        return row
To configure custom transform methods you need to override the transform_methods property.
The transform method will now call prepare and run.
df = pd.DataFrame(
    [
        {"input_col": "test1"},
        {"input_col": "test2"},
    ]
)
detector = MyCustomDetector(input_columns=["input_col"], output_columns=["output_col"], name="custom")
df = detector.transform(df)
We can check that the run method was indeed called.
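A quick check (the expected output is sketched from the run method above, which writes "12345" to the output column):

print(df["output_col"])
# 0    12345
# 1    12345
# Name: output_col, dtype: object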
Melusine components can be instantiated using parameters defined in configurations. The from_config method accepts a config_dict argument
from melusine.processors import Normalizer

normalizer_conf = {
    "input_columns": ["text"],
    "output_columns": ["normalized_text"],
    "form": "NFKD",
    "lowercase": False,
}
normalizer = Normalizer.from_config(config_dict=normalizer_conf)
or a config_key argument.
from melusine.pipeline import MelusinePipeline

pipeline = MelusinePipeline.from_config(config_key="demo_pipeline")
When demo_pipeline is given as argument, parameters are read from the melusine.config object at key demo_pipeline.
"},{"location":"tutorials/06_Configurations/#access-configurations","title":"Access configurations","text":"The melusine configurations can be accessed with the config object.
from melusine import config

print(config["demo_pipeline"])
The configuration of the demo_pipeline can then be easily inspected.
{
    'steps': [
        {'class_name': 'Cleaner', 'config_key': 'body_cleaner', 'module': 'melusine.processors'},
        {'class_name': 'Cleaner', 'config_key': 'header_cleaner', 'module': 'melusine.processors'},
        {'class_name': 'Segmenter', 'config_key': 'segmenter', 'module': 'melusine.processors'},
        {'class_name': 'ContentTagger', 'config_key': 'content_tagger', 'module': 'melusine.processors'},
        {'class_name': 'TextExtractor', 'config_key': 'text_extractor', 'module': 'melusine.processors'},
        {'class_name': 'Normalizer', 'config_key': 'demo_normalizer', 'module': 'melusine.processors'},
        {'class_name': 'EmergencyDetector', 'config_key': 'emergency_detector', 'module': 'melusine.detectors'}
    ]
}
"},{"location":"tutorials/06_Configurations/#modify-configurations","title":"Modify configurations","text":"The simplest way to modify configurations is to create a new directory directly.
from melusine import config

# Get a dict of the existing conf
new_conf = config.dict()

# Add/Modify a config key
new_conf["my_conf_key"] = "my_conf_value"

# Reset Melusine configurations
config.reset(new_conf)
To deliver code in a production environment, using configuration files should be preferred to modifying the configurations on the fly. Melusine lets you specify the path to a folder containing yaml files and loads them (the OmegaConf package is used behind the scenes).
from melusine import config

# Specify the path to a conf folder
conf_path = "path/to/conf/folder"

# Reset Melusine configurations
config.reset(config_path=conf_path)
# >> Using config_path : path/to/conf/folder
When the MELUSINE_CONFIG_DIR environment variable is set, Melusine directly loads the configuration files located at the path specified by the environment variable.
import os

from melusine import config

# Specify the MELUSINE_CONFIG_DIR environment variable
os.environ["MELUSINE_CONFIG_DIR"] = "path/to/conf/folder"

# Reset Melusine configurations
config.reset()
# >> Using config_path from env variable MELUSINE_CONFIG_DIR
# >> Using config_path : path/to/conf/folder
Tip
If the MELUSINE_CONFIG_DIR is set before melusine is imported (e.g., before starting the program), you don't need to call config.reset().
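A minimal sketch of that pattern (the path is illustrative):

# Set the variable before the first melusine import, so no config.reset() call is needed
import os

os.environ["MELUSINE_CONFIG_DIR"] = "path/to/conf/folder"

from melusine import config  # configurations are loaded from MELUSINE_CONFIG_DIR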
Creating your configuration folder from scratch would be cumbersome. It is advised to export the default configurations and then modify just the files you need.
from melusine import config

# Specify the path to a folder (created if it doesn't exist)
conf_path = "path/to/conf/folder"

# Export default configurations to the folder
files_created = config.export_default_config(path=conf_path)
Tip
The export_default_config method returns a list of paths to all the files created.
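For instance, the returned list can be used to review what was exported (file names depend on the melusine release, so no output is shown here):

# Inspect the exported configuration files
for file_path in files_created:
    print(file_path)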
Machine Learning is commonly used to classify data into pre-defined categories.
---
title: Email classification
---
flowchart LR
    Input[[Email]] --> X(((Classifier)))
    X --> A(Car)
    X --> B(Boat)
    X --> C(Housing)
    X --> D(Health)
Typically, to reach high classification performance, models need to be trained on context-specific labeled data. Zero-shot classification uses a pre-trained model and does not require further training on context-specific data.
"},{"location":"tutorials/07_BasicClassification/#tutorial-intro","title":"Tutorial intro","text":"In this tutorial we want to detect insatisfaction in an email dataset. Let's create a basic dataset:
import pandas as pd
from transformers import pipeline

from melusine.base import MelusineDetector


def create_dataset():
    df = pd.DataFrame(
        [
            {
                "header": "Dossier 123456",
                "body": "Merci beaucoup pour votre gentillesse et votre écoute !",
            },
            {
                "header": "Réclamation (Dossier 987654)",
                "body": "Bonjour, je ne suis pas satisfait de cette situation, répondez-moi rapidement svp!",
            },
        ]
    )
    return df
   header                        body
0  Dossier 123456                Merci beaucoup pour votre gentillesse et votre écoute !
1  Réclamation (Dossier 987654)  Bonjour, je ne suis pas satisfait de cette situation, répondez-moi rapidement svp!
"},{"location":"tutorials/07_BasicClassification/#classify-with-zero-shot-classification","title":"Classify with Zero-Shot-Classification","text":"The transformers library makes it really simple to use pre-trained models for zero-shot classification.
model_name_or_path = "cmarkea/distilcamembert-base-nli"
sentences = [
    "Quelle belle journée aujourd'hui",
    "La marée est haute",
    "Ce film est une catastrophe, je suis en colère",
]
classifier = pipeline(task="zero-shot-classification", model=model_name_or_path, tokenizer=model_name_or_path)
result = classifier(
    sequences=sentences, candidate_labels=", ".join(["positif", "négatif"]), hypothesis_template="Ce texte est {}."
)
The classifier returns a score for the "positif" and "négatif" labels for each input text:
[
    {
        'sequence': "Quelle belle journée aujourd'hui",
        'labels': ['positif', 'négatif'],
        'scores': [0.95, 0.05]
    },
    {
        'sequence': 'La marée est haute',
        'labels': ['positif', 'négatif'],
        'scores': [0.76, 0.24]
    },
    {
        'sequence': 'Ce film est une catastrophe, je suis en colère',
        'labels': ['négatif', 'positif'],
        'scores': [0.97, 0.03]
    }
]
"},{"location":"tutorials/07_BasicClassification/#implement-a-dissatisfaction-detector","title":"Implement a Dissatisfaction detector","text":"A full email processing pipeline could contain multiple models. Melusine uses the MelusineDetector template class to standardise how models are integrated into a pipeline.
from typing import List


class DissatisfactionDetector(MelusineDetector):
    """
    Detect if the text expresses dissatisfaction.
    """

    # Dataframe column names
    OUTPUT_RESULT_COLUMN = "dissatisfaction_result"
    TMP_DETECTION_INPUT_COLUMN = "detection_input"
    TMP_DETECTION_OUTPUT_COLUMN = "detection_output"

    # Model inference parameters
    POSITIVE_LABEL = "positif"
    NEGATIVE_LABEL = "négatif"
    HYPOTHESIS_TEMPLATE = "Ce texte est {}."

    def __init__(self, model_name_or_path: str, text_columns: List[str], threshold: float):
        self.text_columns = text_columns
        self.threshold = threshold
        self.classifier = pipeline(
            task="zero-shot-classification", model=model_name_or_path, tokenizer=model_name_or_path
        )
        super().__init__(input_columns=text_columns, output_columns=[self.OUTPUT_RESULT_COLUMN], name="dissatisfaction")
The pre_detect method assembles the text that we want to use for classification.
def pre_detect(self, row, debug_mode=False):
    # Assemble the text columns into a single text
    effective_text = ""
    for col in self.text_columns:
        effective_text += "\n" + row[col]
    row[self.TMP_DETECTION_INPUT_COLUMN] = effective_text

    # Store the effective detection text in the debug data
    if debug_mode:
        row[self.debug_dict_col] = {"detection_input": row[self.TMP_DETECTION_INPUT_COLUMN]}
    return row
The detect method runs the classification model on the text.
def detect(self, row, debug_mode=False):
    # Run the classifier on the text
    pipeline_result = self.classifier(
        sequences=row[self.TMP_DETECTION_INPUT_COLUMN],
        candidate_labels=", ".join([self.POSITIVE_LABEL, self.NEGATIVE_LABEL]),
        hypothesis_template=self.HYPOTHESIS_TEMPLATE,
    )

    # Format classification result
    result_dict = dict(zip(pipeline_result["labels"], pipeline_result["scores"]))
    row[self.TMP_DETECTION_OUTPUT_COLUMN] = result_dict

    # Store ML results in the debug data
    if debug_mode:
        row[self.debug_dict_col].update(result_dict)
    return row
The post_detect method applies a threshold on the prediction score to determine the detection result.
def post_detect(self, row, debug_mode=False):
    # Compare classification score to the detection threshold
    if row[self.TMP_DETECTION_OUTPUT_COLUMN][self.NEGATIVE_LABEL] > self.threshold:
        row[self.OUTPUT_RESULT_COLUMN] = True
    else:
        row[self.OUTPUT_RESULT_COLUMN] = False
    return row
On top of that, the detector takes care of building debug data to make the result explicable.
"},{"location":"tutorials/07_BasicClassification/#run-detection","title":"Run detection","text":"Putting it all together, we run the detector on the input dataset.
df = create_dataset()
detector = DissatisfactionDetector(
    model_name_or_path="cmarkea/distilcamembert-base-nli",
    text_columns=["header", "body"],
    threshold=0.7,
)
df = detector.transform(df)
As a result, we get a new column dissatisfaction_result with the detection result. We can get detection details by running the detector in debug mode.
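A minimal sketch of such a debug run (assuming a pandas backend; the debug column name follows the debug_<detector name> pattern seen in the previous tutorial):

df = create_dataset()
df.debug = True
df = detector.transform(df)

# Inspect the debug data for the second email (illustrative)
print(df.loc[1, "debug_dissatisfaction"])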