From 76e2a1ec0bbbdbcb3f0d92e4e8b9111176c4721b Mon Sep 17 00:00:00 2001
From: Hugo Perrier
Date: Mon, 18 Dec 2023 16:34:22 +0100
Subject: [PATCH 1/7] :memo: Update Readme

---
 README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.md b/README.md
index 11cd2ba..93c467f 100755
--- a/README.md
+++ b/README.md
@@ -1,6 +1,5 @@
 [![pypi badge](https://img.shields.io/pypi/v/melusine.svg)](https://pypi.python.org/pypi/melusine)
-[![Build Status](https://travis-ci.org/MAIF/melusine.svg?branch=master)](https://travis-ci.org/MAIF/melusine)
-[![documentation badge](https://readthedocs.org/projects/melusine/badge/?version=latest)](https://readthedocs.org/projects/melusine/)
+[![build status](https://github.com/MAIF/melusine/actions/workflows/main/badge.svg)]
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![Generic badge](https://img.shields.io/badge/python-3.7|3.8-blue.svg)](https://shields.io/)

From 6b12b49a415878ec6f8ab213d959440714342954 Mon Sep 17 00:00:00 2001
From: Hugo Perrier
Date: Mon, 18 Dec 2023 16:36:35 +0100
Subject: [PATCH 2/7] :memo: Update Readme

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 93c467f..e45c519 100755
--- a/README.md
+++ b/README.md
@@ -1,5 +1,5 @@
 [![pypi badge](https://img.shields.io/pypi/v/melusine.svg)](https://pypi.python.org/pypi/melusine)
-[![build status](https://github.com/MAIF/melusine/actions/workflows/main/badge.svg)]
+[![build status](https://github.com/MAIF/melusine/actions/workflows/main.yaml/badge.svg)]
 [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![Generic badge](https://img.shields.io/badge/python-3.7|3.8-blue.svg)](https://shields.io/)

From a3e33d9001524d9a063e07d9413fd7ad9f462023 Mon Sep 17 00:00:00 2001
From: Hugo Perrier
Date: Mon, 18 Dec 2023 16:39:55 +0100
Subject: [PATCH 3/7] :memo: Update Readme

---
 README.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/README.md b/README.md
index e45c519..2892d71 100755
--- a/README.md
+++ b/README.md
@@ -1,6 +1,5 @@
 [![pypi badge](https://img.shields.io/pypi/v/melusine.svg)](https://pypi.python.org/pypi/melusine)
-[![build status](https://github.com/MAIF/melusine/actions/workflows/main.yaml/badge.svg)]
-[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
+[![Build & Test](https://github.com/MAIF/melusine/actions/workflows/main.yml/badge.svg?branch=master)](https://github.com/MAIF/melusine/actions/workflows/main.yml)[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
 [![Generic badge](https://img.shields.io/badge/python-3.7|3.8-blue.svg)](https://shields.io/)

 🎉 We just released **Melusine 2.3.6** including uncertainty estimations using TensorFlow Probability.

From 0c03e291b1eacbb6e0b58f95912fc6b776f7ed35 Mon Sep 17 00:00:00 2001
From: Hugo Perrier
Date: Mon, 18 Dec 2023 16:56:37 +0100
Subject: [PATCH 4/7] :memo: Readme for melusine 3.0

---
 README.md | 336 ++++++------------------------------------------------
 1 file changed, 32 insertions(+), 304 deletions(-)

diff --git a/README.md b/README.md
index 2892d71..a2ed982 100755
--- a/README.md
+++ b/README.md
@@ -1,323 +1,51 @@
 [![pypi badge](https://img.shields.io/pypi/v/melusine.svg)](https://pypi.python.org/pypi/melusine)
 [![Build & Test](https://github.com/MAIF/melusine/actions/workflows/main.yml/badge.svg?branch=master)](https://github.com/MAIF/melusine/actions/workflows/main.yml)[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-[![Generic badge](https://img.shields.io/badge/python-3.7|3.8-blue.svg)](https://shields.io/)
+[![Generic badge](https://img.shields.io/badge/python-3.8+-blue.svg)](https://shields.io/)

-🎉 We just released **Melusine 2.3.6** including uncertainty estimations using TensorFlow Probability.
-Checkout this [tutorial](https://github.com/MAIF/melusine/blob/master/tutorial/tutorial15_probabilistic_models.ipynb)
-to learn more. We are grateful to those who contribute and make this library alive!
-All new features can be found in the **full pipeline [tutorial](https://github.com/MAIF/melusine/blob/master/tutorial/tutorial08_full_pipeline_detailed.ipynb)**. 🎉
-
-# Melusine
+🎉 BREAKING : New major version **Melusine 3.0.0** is available 🎉
+Checkout the [documentation](https://maif.github.io/melusine/) and [tutorials](https://maif.github.io/melusine/tutorials/00_GettingStarted/) to get started.


 ![](docs/_static/melusine.png)

 - Free software: Apache Software License 2.0
-- Documentation: [https://melusine.readthedocs.io](https://melusine.readthedocs.io).
-
-# Overview
-
-**Melusine** is a high-level Python library for email classification and feature extraction,
-written in Python and capable of running on top of Scikit-Learn, Tensorflow 2 and Keras.
-Integrated models runs with Tensorflow 2.2.
-It is developed with a focus on emails written in French.
-
-Use **Melusine** if you need a library which:
- * Supports transformers, CNN and RNN models.
- * Runs seamlessly on CPU and GPU.
-
-**Melusine** is compatible with `Python 3.6` (<=2.3.2), `Python 3.7` and `Python 3.8`.
-
-## The Melusine package
-
-This package is designed for the preprocessing, classification and automatic summarization of emails written in french.
-
-
-![](docs/_static/schema_1.png)
-
-**3 main subpackages are offered :**
-
-* ``prepare_email`` : to preprocess and clean emails.
-* ``summarizer`` : to extract keywords from an email.
-* ``models`` : to classify e-mails according to categories pre-defined by the user or compute sentiment score based on sentiment described by the user with seed words.
-
-**2 other subpackages are offered as building blocks :**
-
-* ``nlp_tools`` : to provide classic NLP tools such as tokenizer, phraser and embeddings.
-* ``utils`` : to provide a *TransformerScheduler* class to build your own transformer and integrate into a scikit-learn Pipeline.
-
-**An other subpackage is also provided** to manage, modify or add parameters such as : regular expressions, keywords, stopwords, etc.
-
-* ``config`` : This modules loads a configuration dict which is essential to the Melusine package. By customizing the configurations, users may adapt the text preprocessing to their needs.
-
-**2 other subpackages are offered to provide a dashboard app and ethics guidelines for AI project :**
-
-* ``data`` : contains a classic data loader and provide a *StreamLit application* with exploratory dashboards on input data and models.
-
-* ``ethics_guidelines`` : to provide an Ethics Guide to evaluate AI project, with guidelines and questionnaire. The questionnaire is based on criteria derived in particular from the work of the European Commission and grouped by categories.
-
-## Getting started: 30 seconds to Melusine
-
-### Installation
-
-```
-pip install melusine
-```
-
-To use Melusine in a project
-
-```python
-import melusine
-```
-
-### Input data : Email DataFrame
-
-The basic requirement to use Melusine is to have an input e-mail DataFrame with the following columns:
-
-- *body* : Body of an email (single message or conversation history)
-- *header* : Header/Subject of an email
-- *date* : Reception date of an email
-- *from* : Email address of the sender
-- *to* : Email address of the recipient
-- *label* (optional): Label of the email for a classification task (examples: Business, Spam, Finance or Family)
-
-| body | header | date | from | to | label |
-|:---------------------------|:--------------:|:------------------------------:|:----------------------------:|:-------------------------------------:|:-------:|
-| Thank you.\\nBye,\\nJohn | Re: Your order | jeudi 24 mai 2018 11 h 49 CEST | anonymous.sender@unknown.com | anonymous.recipient@unknown.fr | label_1 |
-
-To import the test dataset:
-
-```python
-from melusine.data.data_loader import load_email_data
-
-df_email = load_email_data()
-```
-
-
-### Pre-processing pipeline
-
-A working pre-processing pipeline is given below:
-
-```python
-from sklearn.pipeline import Pipeline
-from melusine.utils.transformer_scheduler import TransformerScheduler
-from melusine.prepare_email.manage_transfer_reply import check_mail_begin_by_transfer, update_info_for_transfer_mail, add_boolean_transfer, add_boolean_answer
-from melusine.prepare_email.build_historic import build_historic
-from melusine.prepare_email.mail_segmenting import structure_email
-from melusine.prepare_email.body_header_extraction import extract_last_body
-from melusine.prepare_email.cleaning import clean_body
-
-ManageTransferReply = TransformerScheduler(
-functions_scheduler=[
-    (check_mail_begin_by_transfer, None, ['is_begin_by_transfer']),
-    (update_info_for_transfer_mail, None, None),
-    (add_boolean_answer, None, ['is_answer']),
-    (add_boolean_transfer, None, ['is_transfer'])
-])
-
-EmailSegmenting = TransformerScheduler(
-functions_scheduler=[
-    (build_historic, None, ['structured_historic']),
-    (structure_email, None, ['structured_body'])
-])
-
-Cleaning = TransformerScheduler(
-functions_scheduler=[
-    (extract_last_body, None, ['last_body']),
-    (clean_body, None, ['clean_body'])
-])
-
-prepare_data_pipeline = Pipeline([
-    ('ManageTransferReply', ManageTransferReply),
-    ('EmailSegmenting', EmailSegmenting),
-    ('Cleaning', Cleaning),
-])
-
-df_email = prepare_data_pipeline.fit_transform(df_email)
-```
-
-In this example, the pre-processing functions applied are:
-
-- ``check_mail_begin_by_transfer`` : Email is a direct transfer (True/False)
-- ``update_info_for_transfer_mail`` : Update body, header, from, to, date if direct transfer
-- ``add_boolean_answer`` : Email is an answer (True/False)
-- ``add_boolean_transfer`` : Email is transferred (True/False)
-- ``build_historic`` : When email is a conversation, reconstructs the individual message history
-- ``structure_email`` : Splits each messages into parts and tags them (tags: Hello, Body, Greetings, etc)
-
-### Phraser and Tokenizer pipeline
-
-A pipeline to train and apply the phraser end tokenizer is given below:
-
-```python
-from melusine.nlp_tools.phraser import Phraser
-from melusine.nlp_tools.tokenizer import Tokenizer
-
-tokenizer = Tokenizer (input_column='clean_body', output_column="tokens")
-df_email = tokenizer.fit_transform(df_email)
-
-phraser = Phraser(
-    input_column='tokens',
-    output_column='phrased_tokens',
-    threshold=5,
-    min_count=2
-)
-_ = phraser.fit(df_email)
-df_email = phraser.transform(df_email)
-```
-
-### Embeddings training
-
-An example of embedding training is given below:
-
-```python
-from melusine.nlp_tools.embedding import Embedding
-
-embedding = Embedding(
-    tokens_column='tokens',
-    size=300,
-    workers=4,
-    min_count=3
-)
-embedding.train(df_email)
-```
-
-### Metadata pipeline
-
-A pipeline to prepare the metadata is given below:
-
-```python
-from melusine.prepare_email.metadata_engineering import MetaExtension, MetaDate, Dummifier
-
-metadata_pipeline = Pipeline([
-    ('MetaExtension', MetaExtension()),
-    ('MetaDate', MetaDate()),
-    ('Dummifier', Dummifier())
-])
-
-df_meta = metadata_pipeline.fit_transform(df_email)
-```
-
-### Keywords extraction
-
-An example of keywords extraction is given below:
-
-```python
-from melusine.summarizer.keywords_generator import KeywordsGenerator
-
-keywords_generator = KeywordsGenerator()
-df_email = keywords_generator.fit_transform(df_email)
-```
-
-### Classification
-
-The package includes multiple neural network architectures including CNN, RNN, Attentive and pre-trained BERT Networks.
-An example of classification is given below:
-```python
-from sklearn.preprocessing import LabelEncoder
-from melusine.nlp_tools.embedding import Embedding
-from melusine.models.neural_architectures import cnn_model
-from melusine.models.train import NeuralModel
-
-X = df_email.drop(['label'], axis=1)
-y = df_email.label
-
-le = LabelEncoder()
-y = le.fit_transform(y)
-
-pretrained_embedding = embedding
-
-nn_model = NeuralModel(architecture_function=cnn_model,
-                       pretrained_embedding=pretrained_embedding,
-                       text_input_column='clean_body')
-nn_model.fit(X, y, tensorboard_log_dir="./data")
-y_res = nn_model.predict(X)
-```
-
-Training with tensorflow 2 can be monitored using tensorboard :
-![](docs/_static/tensorboard.png)
-
-## Glossary
-
-### Pandas dataframes columns
-
-Because Melusine manipulates pandas dataframes, the naming of the columns is imposed.
-Here is a basic glossary to provide an understanding of each columns manipulated.
-Initial columns of the dataframe:
-
-* **body :** the body of the email. It can be composed of a unique message, a history of messages, a transfer of messages or a combination of history and transfers.
-* **header :** the subject of the email.
-* **date :** the date the email has been sent. It corresponds to the date of the last email message.
-* **from :** the email address of the author of the last email message.
-* **to :** the email address of the recipient of the last email message.
-
-Columns added by Melusine:
-
-* **is_begin_by_transfer :** boolean, indicates if the email is a direct transfer. In that case it is recommended to update the value of the initial columns with the information of the message transferred.
-* **is_answer :** boolean, indicates if the email contains a history of messages
-* **is_transfer :** boolean, indicates if the email is a transfer (in that case it does not have to be a direct transfer).
-* **structured_historic :** list of dictionaries, each dictionary corresponds to a message of the email. The first dictionary corresponds to the last message (the one that has been written) while the last dictionary corresponds to the first message of the history. Each dictionary has two keys :
-    - *meta :* to access the metadata of the message as a string.
-    - *text :* to access the message itself as a string.
-
-* **structured_body :** list of dictionaries, each dictionary corresponds to a message of the email. The first dictionary corresponds to the last message (the one that has been written) while the last dictionary corresponds to the first message of the history. Each dictionary has two keys :
-    - *meta :* to access the metadata of the message as a dictionary. The dictionary has three keys:
-        + *date :* the date of the message.
-        + *from :* the email address of the author of the message.
-        + *to :* the email address of the recipient of the message.
-    - *text :* to access the message itself as a dictionary. The dictionary has two keys:
-        + *header :* the subject of the message.
-        + *structured_text :* the different parts of the message segmented and tagged as a list of dictionaries. Each dictionary has two keys:
-            - *part :* to access the part of the message as a string.
-            - *tags :* to access the tag of the part of the message.
-
-
-* **last_body :** string, corresponds to the part of the last email message that has been tagged as `BODY`.
-* **clean_body :** string, corresponds a cleaned last_body.
-* **clean_header :** string, corresponds to a cleaned header.
-* **clean_text :** string, concatenation of clean_header and clean_body.
-* **tokens :** list of strings, corresponds to a tokenized column, by default clean_text.
-* **keywords :** list of strings, corresponds to the keywords of extracted from the tokens column.
-* **stemmed_tokens :** list of strings, corresponds to a stemmed column, by default stemmed_tokens.
-* **lemma_spacy_sm :** string, corresponds to a lemmatized column.
-* **lemma_lefff :** string, corresponds to a lemmatized column.
-
-### Tags
-
-Each messages of an email are segmented in the **structured_body** columns and each part is assigned a tag:
+- Documentation: [documentation](https://maif.github.io/melusine/).

-* `RE/TR` : any metadata such as date, from, to, etc.
-* `DISCLAIMER` : any disclaimer such as `L'émetteur décline toute responsabilité...`.
-* `GREETINGS` : any greetings such as `Salutations`.
-* `PJ` : any indication of an attached document such as `See attached file...`.
-* `FOOTER` : any footer such as `Provenance : Courrier pour Windows`.
-* `HELLO` : any salutations such as `Bonjour,`.
-* `THANKS` : any thanks such as `Avec mes remerciements`
-* `BODY` : the core of the the message which contains the valuable information.
+## Overview

-### Dashboard App
-Melusine also offered an easy and nice dashboard app with StreamLit.
-The App contains exploratory dashboard on the email dataset and more specific study on discrimination between the dataset
-and a neural model classification.
+Melusine is a high-level library for emails processing that can be used to :

-To run the app, run the following command in your terminal in the melusine/data directory :
+- Categorize emails using AI, regex patterns or both
+- Prioritize urgent emails
+- Extract information
+- And much more !

-```bash
-streamlit run dashboard_app.py
-```
+## Why melusine ?

-![](docs/_static/demo_dashboard.gif)
+The added value of melusine mainly resides in the following aspects:

-### Ethics Guidelines
+- **Off-the-shelf features** : melusine comes with a number of features that can be used straightaway
+  - Segmenting messages in an email conversation
+  - Tagging message parts (Email body, signatures, footers, etc)
+  - Transferred email handling
+- **Execution framework** : users can focus on the email qualification code and save time on the boilerplate code
+  - debug mode
+  - pipeline execution
+  - code parallelization
+  - etc
+- **Integrations** : the modular nature of melusine makes it easy to integrate with a variety of AI frameworks
+  (HuggingFace, Pytorch, Tensorflow, etc)
+- **Production ready** : melusine builds-up on the feedback from several years of running automatic email processing
+in production at MAIF.

-Melusine also contains Ethics Guidelines to evaluate AI project.
-The document and criteria are derived in particular from the work of the European Commission.
+## Use cases

+- Email routing: Make sure emails are sent to the most appropriate destination.
+- Prioritization: Ensure urgent emails are treated first.
+- Summarization: Save time reading summaries instead of long emails.
+- Filtering: Remove undesired emails.
-The pdf file is located in the melusine/ethics_guidelines directory :

+## Getting started

-![](docs/_static/demo_ethics_guide.gif)
+Try one of our (tested!) [tutorials](https://maif.github.io/melusine/tutorials/00_GettingStarted/) to get started.

From da2825884c65c0223d25ab70ba95f1077cbca37b Mon Sep 17 00:00:00 2001
From: Hugo Perrier
Date: Mon, 18 Dec 2023 17:10:29 +0100
Subject: [PATCH 5/7] :memo: Readme for melusine 3.0

---
 README.md | 47 ++++++++++++++++++++++++++++++++++-------------
 1 file changed, 34 insertions(+), 13 deletions(-)

diff --git a/README.md b/README.md
index a2ed982..995b65f 100755
--- a/README.md
+++ b/README.md
@@ -9,17 +9,21 @@ Checkout the [documentation](https://maif.github.io/melusine/) and [tutorials](h
 ![](docs/_static/melusine.png)

 - Free software: Apache Software License 2.0
-- Documentation: [documentation](https://maif.github.io/melusine/).
+- Documentation: [click here](https://maif.github.io/melusine/).
+- Installation: `pip install melusine`.

 ## Overview

-Melusine is a high-level library for emails processing that can be used to :
+Melusine is a high-level library for emails processing that can be used to do:

-- Categorize emails using AI, regex patterns or both
-- Prioritize urgent emails
-- Extract information
-- And much more !
+- Email routing: Make sure emails are sent to the most appropriate destination.
+- Prioritization: Ensure urgent emails are treated first.
+- Summarization: Save time reading summaries instead of long emails.
+- Filtering: Remove undesired emails.
+
+Melusine facilitates the integration of deep learning frameworks (HuggingFace, Pytorch, Tensorflow, etc)
+deterministic rules (regex, keywords, heuristics) into a full email qualification workflow.

 ## Why melusine ?

@@ -39,13 +43,30 @@ The added value of melusine mainly resides in the following aspects:
 - **Production ready** : melusine builds-up on the feedback from several years of running automatic email processing
 in production at MAIF.

-## Use cases
-
-- Email routing: Make sure emails are sent to the most appropriate destination.
-- Prioritization: Ensure urgent emails are treated first.
-- Summarization: Save time reading summaries instead of long emails.
-- Filtering: Remove undesired emails.
-
 ## Getting started

 Try one of our (tested!) [tutorials](https://maif.github.io/melusine/tutorials/00_GettingStarted/) to get started.
+
+## Minimal example
+
+- Load a fake email dataset
+- Instantiate a built-in `MelusinePipeline`
+- Run the qualification pipeline on the emails dataset
+
+``` Python
+    from melusine.data import load_email_data
+    from melusine.pipeline import MelusinePipeline
+
+    # Load an email dataset
+    df = load_email_data()
+
+    # Load a pipeline
+    pipeline = MelusinePipeline.from_config("demo_pipeline")
+
+    # Run the pipeline
+    df = pipeline.transform(df)
+```
+
+The output is a qualified email dataset with columns such as:
+- `messages`: List of individual messages present in each email.
+- `emergency_result`: Flag to identify urgent emails.
\ No newline at end of file

From 250bb098b0119b2c36e0cb3fcd732acfffbf23a6 Mon Sep 17 00:00:00 2001
From: Hugo Perrier
Date: Mon, 18 Dec 2023 17:41:46 +0100
Subject: [PATCH 6/7] :memo: Readme for melusine 3.0

---
 README.md | 48 +++++++++++++++++++++++++++++++-----------------
 1 file changed, 31 insertions(+), 17 deletions(-)

diff --git a/README.md b/README.md
index 995b65f..2c0fcdb 100755
--- a/README.md
+++ b/README.md
@@ -1,26 +1,40 @@
-[![pypi badge](https://img.shields.io/pypi/v/melusine.svg)](https://pypi.python.org/pypi/melusine)
-[![Build & Test](https://github.com/MAIF/melusine/actions/workflows/main.yml/badge.svg?branch=master)](https://github.com/MAIF/melusine/actions/workflows/main.yml)[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
-[![Generic badge](https://img.shields.io/badge/python-3.8+-blue.svg)](https://shields.io/)
-
-🎉 BREAKING : New major version **Melusine 3.0.0** is available 🎉
-Checkout the [documentation](https://maif.github.io/melusine/) and [tutorials](https://maif.github.io/melusine/tutorials/00_GettingStarted/) to get started.
-
-
-![](docs/_static/melusine.png)
-
-- Free software: Apache Software License 2.0
-- Documentation: [click here](https://maif.github.io/melusine/).
-- Installation: `pip install melusine`.
+

+ +Build & Test + + +pypi + + +Test + + +pypi + +

+
+🎉 **BREAKING** : New major version **Melusine 3.0.0** is available 🎉

+ +

+ + + +

+
+
+- **Free software**: Apache Software License 2.0
+- **Documentation**: [maif.github.io/melusine](https://maif.github.io/melusine/)
+- **Installation**: `pip install melusine`
+- **Tutorials**: [Discover melusine](https://maif.github.io/melusine/tutorials/00_GettingStarted/)

 ## Overview

 Melusine is a high-level library for emails processing that can be used to do:

-- Email routing: Make sure emails are sent to the most appropriate destination.
-- Prioritization: Ensure urgent emails are treated first.
-- Summarization: Save time reading summaries instead of long emails.
-- Filtering: Remove undesired emails.
+- **Email routing**: Make sure emails are sent to the most appropriate destination.
+- **Prioritization**: Ensure urgent emails are treated first.
+- **Summarization**: Save time reading summaries instead of long emails.
+- **Filtering**: Remove undesired emails.

 Melusine facilitates the integration of deep learning frameworks (HuggingFace, Pytorch, Tensorflow, etc)
 deterministic rules (regex, keywords, heuristics) into a full email qualification workflow.

From bed5d4cdee719e538935ce0acdba6fff16d9b420 Mon Sep 17 00:00:00 2001
From: Hugo Perrier
Date: Wed, 20 Dec 2023 09:40:37 +0100
Subject: [PATCH 7/7] :memo: README for melusine v3

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index 2c0fcdb..e5b6f99 100755
--- a/README.md
+++ b/README.md
@@ -13,7 +13,7 @@

-🎉 **BREAKING** : New major version **Melusine 3.0.0** is available 🎉

+🎉 **BREAKING** : New major version Melusine 3.0 is available 🎉