docs: 📝 Batch Job - Spellcheck documentation (#1408)
* docs: 📝 Google Batch Job in Robotoff doc
* fix: 🎨 Typos
* docs: 📝 Spellcheck documentation in Robotoff (WIP)
* docs: 📝 Ingredients Spellcheck Robotoff
* docs: 📝 Spellcheck Robotoff
* docs: 📝 Spellcheck Robotoff
* docs: ✨ Spellcheck Robotoff
1 parent 9a09729 · commit 2748324 · 4 changed files with 230 additions and 0 deletions.
# Google Batch Job

Robotoff primarily provides models to Open Food Facts with real-time inference using NVIDIA Triton Inference Server on CPUs.

However, this approach presents two major gaps:

* Challenges in processing large volumes of products during extended job runs
* Limited access to larger computing resources, such as GPUs or multi-CPU setups

To fill these gaps, we integrated a batch job feature into Robotoff, leveraging the capabilities of Google Cloud Platform.
## Architecture

![Robotoff Architecture](../assets/batch_job_robotoff.svg)

The batch job pipeline is structured as follows:
### 1. Launch job

The role of this command is to prepare and launch the job in the cloud. The launch depends on the type of job to perform, such as `ingredients-spellcheck`, so the command takes `job_type` as a parameter.

Depending on the job type, the command is responsible for:

* Generating the Google credentials from the production environment variables,
* Extracting, preparing, and storing the data to process,
* Querying the config file for the job and validating it using [Pydantic](https://docs.pydantic.dev/latest/),
* Launching the Google batch job.

The command is available as the `launch_batch_job` Command in the CLI directory[^launch_batch_job_cli].
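
The last step can be pictured with a minimal sketch using the [Google Batch Python client](https://cloud.google.com/python/docs/reference/batch/latest). Everything in angle brackets, the machine type, and the job id are placeholders; the real values are taken from the validated configuration described in the next section.

```python
from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()

# Container runnable pointing to the image pushed to the registry (placeholder URI).
runnable = batch_v1.Runnable(
    container=batch_v1.Runnable.Container(
        image_uri="<region>-docker.pkg.dev/<project>/<repository>/spellcheck:latest"
    )
)

job = batch_v1.Job(
    task_groups=[
        batch_v1.TaskGroup(
            task_spec=batch_v1.TaskSpec(runnables=[runnable], max_retry_count=1),
            task_count=1,
            parallelism=1,
        )
    ],
    allocation_policy=batch_v1.AllocationPolicy(
        instances=[
            batch_v1.AllocationPolicy.InstancePolicyOrTemplate(
                policy=batch_v1.AllocationPolicy.InstancePolicy(machine_type="g2-standard-8")
            )
        ]
    ),
    logs_policy=batch_v1.LogsPolicy(
        destination=batch_v1.LogsPolicy.Destination.CLOUD_LOGGING
    ),
)

client.create_job(
    parent="projects/<project>/locations/europe-west4",   # placeholder project path
    job=job,
    job_id="ingredients-spellcheck-20240801-120000",      # job type + launch datetime
)
```
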
### 2. Config files

The configuration file defines the resources and setup allocated to the Google batch job. Each batch job requires a unique configuration, stored as a YAML file[^config_files]. It contains:

* The resources location,
* The type and number of resources allocated,
* The maximal run duration,
* The number of tasks in parallel,
* The number of retries, ...

When initiating a job, the configuration is validated using the [Pydantic](https://docs.pydantic.dev/latest/) library. This process serves two purposes:

* Prevents errors that could potentially cause the pipeline to fail,
* Safeguards against the allocation of unnecessarily expensive resources.

For more information about the Google Batch job configuration, check the [official documentation](https://cloud.google.com/batch/docs/reference/rest/v1/projects.locations.jobs).
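
As an illustration only, such a validation could look like the sketch below. The field names and file path are hypothetical; the actual schemas and configs live alongside the job config files[^config_files].

```python
from pathlib import Path

import yaml
from pydantic import BaseModel, Field


class BatchJobConfig(BaseModel):
    """Hypothetical schema mirroring the fields listed above."""

    location: str                                   # e.g. "europe-west4"
    machine_type: str                               # e.g. "g2-standard-8"
    task_count: int = Field(default=1, ge=1)
    parallelism: int = Field(default=1, ge=1)
    max_retry_count: int = Field(default=1, ge=0, le=5)
    max_run_duration: str = "3600s"


# Raises a ValidationError before any cloud resource is allocated.
config = BatchJobConfig(
    **yaml.safe_load(Path("job_configs/ingredients-spellcheck.yaml").read_text())
)
```
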
### 3. The container registry

The container registry represents the core of the batch job. It contains the required dependencies and algorithms.

Docker images are maintained independently of the Robotoff micro-service[^dockerfile]. Each job directory contains the necessary files along with its `Dockerfile`. Once built, the Docker image is pushed manually using the corresponding Make command in the Makefile, such as `deploy-spellcheck`.

The container image needs to be accessible to the batch job once launched. The following can be used as registry:

* Google Artifact Registry within the project `Robotoff`,
* Docker Hub,
* A public GitHub repository, such as [Robotoff](https://github.com/openfoodfacts/robotoff/tree/main).
### 4. Batch job

Once launched, the batch job goes through different stages: SCHEDULED, QUEUED, RUNNING, SUCCEEDED, or FAILED. Each batch job is identified by the job type name combined with the launch `datetime`.

During the run, all logs are stored in the Batch Job logs.

The list of batch jobs is available in the [Robotoff Google Batch Job](https://console.cloud.google.com/batch/jobs?referrer=search&project=robotoff) console.
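
For reference, a job's current stage can also be checked with the same Python client; the job name below is a placeholder:

```python
from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()
job = client.get_job(
    name="projects/<project>/locations/europe-west4/jobs/ingredients-spellcheck-20240801-120000"
)
print(job.status.state)  # e.g. JobStatus.State.RUNNING
```
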
### 5. Storage

If the batch job needs to import or export data, we use a storage service such as Google Cloud Storage as an interface between Robotoff and the job running in the cloud.

If Google Cloud Storage is used, Google credentials are necessary on the Robotoff side. On the batch job side, no additional setup is required, since the job uses the default [service account](https://cloud.google.com/iam/docs/service-account-overview) associated with the project `Robotoff`.
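
A minimal sketch of this exchange with the [google-cloud-storage](https://cloud.google.com/python/docs/reference/storage/latest) client; the bucket and object names are hypothetical:

```python
from google.cloud import storage

client = storage.Client()  # picks up the Google credentials configured in Robotoff

bucket = client.bucket("robotoff-batch")  # hypothetical bucket name
# Robotoff uploads the data to process before launching the job...
bucket.blob("spellcheck/data_to_process.parquet").upload_from_filename("data_to_process.parquet")
# ...and downloads the processed results once the job has finished.
bucket.blob("spellcheck/processed_data.parquet").download_to_filename("processed_data.parquet")
```
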
### 6. Import processed data

Once the job has successfully finished, the Robotoff API endpoint is queried from the job run with an HTTP request.

The role of this endpoint is to load the data processed by the batch job and import the new *predictions* into the Robotoff database. Check this [page](../explanations/predictions.md) to understand the process Robotoff uses to transform raw *Predictions* into *Insights*.

Since this endpoint exists solely to import the batch job results, it is protected from external requests with a `BATCH_JOB_KEY`. This secured endpoint follows [Bearer Authentication](https://swagger.io/docs/specification/authentication/bearer-authentication/). The key is set as an environment variable in Robotoff and is passed to the batch job as an environment variable when a job is launched.

Each job type has its own method for importing, or not, the batch job results.
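
From the batch job side, the call boils down to an authenticated HTTP request, as in the sketch below. The route and query parameter are placeholders, not the actual Robotoff API paths:

```python
import os

import requests

response = requests.post(
    "https://robotoff.openfoodfacts.org/api/v1/batch/import",  # placeholder route
    params={"job_type": "ingredients-spellcheck"},             # hypothetical parameter
    headers={"Authorization": f"Bearer {os.environ['BATCH_JOB_KEY']}"},
)
response.raise_for_status()
```
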
## Roles

To launch a batch job and import its results, the following roles need to be set up:

* **Artifact Registry Editor**: to push Docker images to the project image registry
* **Batch Job Editor**
* **Service Account User**
* **Storage Admin**

For production, it is preferable to create a custom *Service Account* with these roles.
## Additional notes

### Links

Check the official Google Batch Job documentation:

* [Batch Job](https://cloud.google.com/batch/docs/get-started),
* [Google Batch Job Python API](https://cloud.google.com/python/docs/reference/batch/latest),
* [Batch job with Python examples](https://github.com/GoogleCloudPlatform/python-docs-samples/tree/main/batch),
### Trial and error notes

* The Netherlands (europe-west4) has GPUs (A100, L4)
* Add custom storage capacity to host the heavy Docker image (~24GB) by adding `BootDisk`
* 1000 products processed in 1:30 min on a g2-instance-8 (overall batch job: 3:25 min):
    * L4: g2-instance-8 hourly cost: $0.896306 ==> ~$0.05 to process a batch of 1000
    * A100: a2-highgpu-1g: $3.748064
* A100/CUDA doesn't support FP8
* A100 has less availability than L4: need to wait for the batch job (can be long) or switch to a us-east location
* Don't forget to enable the **Batch & Storage APIs** if used without gcloud ([link](https://cloud.google.com/batch/docs/get-started#project-prerequisites))
[^launch_batch_job_cli]: see `./robotoff/cli/main.py`
[^config_files]: see `./robotoff/batch/configs/job_configs`
[^dockerfile]: see `./batch/`
# Ingredients Spellcheck

A key element of the Open Food Facts database is the parsing of the product ingredients. These lists of ingredients either come from contributors' annotations or from OCR-extracted text from packaging pictures.

However, typos in the text or wrong OCR extraction lead to ingredients not being recognized by the Product Opener service. Read more about this process in the [wiki](https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis).

For this reason, the Ingredients Spellcheck was developed to solve this issue and improve the quality of the ingredient parsing.
## TL;DR

Mistral-7B-Base was [fine-tuned](#training-pipeline) on lists of ingredients extracted from the Open Food Facts database. This [dataset](https://huggingface.co/datasets/openfoodfacts/spellcheck-dataset) was synthetically generated using closed-source LLMs (GPT-3.5-Turbo) and manually reviewed with Argilla, an open-source annotation tool.

The current model (v1) shows the best performance against the closed-source LLMs on our [benchmark](https://huggingface.co/datasets/openfoodfacts/spellcheck-benchmark). A custom [evaluation algorithm](#evaluation-algorithm) was created to correctly estimate the Spellcheck performance.

| Model | Correction Precision | Correction Recall | Correction F1 |
|----------|----------|----------|----------|
| GPT-3.5-Turbo | 0.557 | 0.727 | 0.631 |
| GPT-4o | 0.311 | 0.702 | 0.431 |
| Gemini-1.5-flash | 0.544 | 0.596 | 0.569 |
| Claude3-Sonnet-3.5 | 0.178 | **0.810** | 0.292 |
| **Our model** | **0.664** | 0.630 | **0.647** |

The model is integrated into Robotoff for [Batch Inference](batch-job) using Google Batch Job.
## Evaluation algorithm

Our task is very specific: correcting errors in lists of ingredients so that the [Ingredients Parser](https://wiki.openfoodfacts.org/Ingredients_Extraction_and_Analysis) can accurately identify the composition of each product.

However, since the corrections are later added to the database, we need to ensure the model doesn't correct an ingredient by mistake. In other words, we minimize the number of False Positives while maximizing the overall Recall.

Traditional evaluation metrics, such as [ROUGE](https://en.wikipedia.org/wiki/ROUGE_(metric)), [BLEU](https://en.wikipedia.org/wiki/BLEU), or [METEOR](https://en.wikipedia.org/wiki/METEOR), fall short in assessing the quality of the spellcheck process: they don't provide a detailed analysis of how many words were correctly rectified versus those that weren't.

Therefore, we developed an algorithm that takes 3 inputs: the original, the reference, and the predicted list of ingredients.

Example:
```
Original: "Th cat si on the fride,"
Reference: "The cat is on the fridge."
Prediction: "Th big cat is in the fridge."
```

We transform each text into a sequence of tokens and perform a [sequence alignment](https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm) to align identical tokens between the original and the reference, and between the original and the prediction. We assign 1 or 0 to each aligned position depending on whether the token was modified.

By comparing these 2 sequences, we calculate the number of True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN), and therefore the overall Precision and Recall.

```
Orig-Ref:  1  0  0  1  0  1  1  1  1
Orig-Pred: 0  1  0  1  1  1  1  1  1
Meaning:   FN FP TN TP FP TP TP TP TP
```
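
The last step can be reduced to a small sketch that takes the two aligned 0/1 sequences above as given and derives the confusion counts and metrics; the full evaluator in the project repository handles the tokenization and alignment itself:

```python
def correction_metrics(orig_ref: list[int], orig_pred: list[int]) -> dict[str, float]:
    """orig_ref[i] = 1 if the reference modified token i; orig_pred[i] = 1 if the prediction did."""
    tp = sum(r and p for r, p in zip(orig_ref, orig_pred))
    fp = sum(not r and p for r, p in zip(orig_ref, orig_pred))
    fn = sum(r and not p for r, p in zip(orig_ref, orig_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Flags from the example above: Precision ≈ 0.71, Recall ≈ 0.83.
print(correction_metrics([1, 0, 0, 1, 0, 1, 1, 1, 1], [0, 1, 0, 1, 1, 1, 1, 1, 1]))
```
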
Coupled with a benchmark carefully prepared using the [Spellcheck Guidelines](#guidelines), the algorithm is capable of evaluating any solution, from Regular Expression techniques to LLMs.

You'll find more details about the evaluation algorithm[^evaluation-algo] in the project [README](https://github.com/openfoodfacts/openfoodfacts-ai/tree/develop/spellcheck).
## Guidelines

The [Guidelines](https://github.com/openfoodfacts/openfoodfacts-ai/tree/develop/spellcheck#-guidelines) are a set of rules defined to guide and restrict the corrections made by the Spellcheck.

They were also used to create the [benchmark](https://huggingface.co/datasets/openfoodfacts/spellcheck-benchmark) and to generate the [training dataset](https://huggingface.co/datasets/openfoodfacts/spellcheck-dataset) using proprietary LLMs (GPT-3.5-Turbo) for synthetic data generation.
## Model

The model is accessible on [Hugging Face](https://huggingface.co/openfoodfacts/spellcheck-mistral-7b), along with its [demo](https://huggingface.co/spaces/jeremyarancio/ingredients-spellcheck).

A text *instruction* is provided to the model during training and inference; you can find it in the same model repository.
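
For local experimentation, the model can be loaded with the Transformers library, as in the sketch below. The prompt shown is a placeholder; use the instruction template published in the model repository instead.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openfoodfacts/spellcheck-mistral-7b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Placeholder prompt: replace with the instruction template from the model card.
prompt = "Correct the errors in this list of ingredients:\nTomatoes, olive oill, sallt.\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
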
## Training pipeline

The model training consists of a succession of steps, each one requiring different resource allocations, such as cloud GPUs, data validation, and logging. For this reason, we decided to orchestrate the training using [Metaflow](https://metaflow.org/), an orchestrator designed for Data Science and Machine Learning projects.

The training pipeline[^dags] is composed as follows (see the flow skeleton after this list):

* Configurations and hyperparameters are imported into the pipeline from YAML config files[^configs].
* The training job is launched in the cloud using [AWS Sagemaker](https://aws.amazon.com/sagemaker/). The `spellcheck/src/` package, containing the different modules, is imported along with the training script[^training-script]. Once the job is done, the model artifact is stored in an AWS S3 bucket (private). All training details are tracked in the [Comet ML experiment tracker](https://www.comet.com/jeremyarancio/spellcheck/view/WzBvzCs36VdE6MIbytKEI2ePH/experiments).
* The fine-tuned model is then evaluated on the benchmark using the [custom evaluation algorithm](#evaluation-algorithm). [vLLM](https://github.com/vllm-project/vllm) is used to accelerate the evaluation. *Currently, this process is handled manually, but further work is needed to fully integrate it into the pipeline.*
* The predictions against the benchmark, also stored in AWS S3, are sent to Argilla for human evaluation[^argilla-modules] under a unique ID: the *experiment key*.
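
As an illustration of this orchestration, a skeleton flow might look like the sketch below; the step names and bodies are hypothetical, and the actual DAGs are under `scripts/dags`.

```python
from metaflow import FlowSpec, step


class SpellcheckTrainingFlow(FlowSpec):
    """Illustrative skeleton only: each step body is reduced to a comment."""

    @step
    def start(self):
        # Load configurations and hyperparameters from the YAML config files.
        self.next(self.train)

    @step
    def train(self):
        # Launch the SageMaker training job; the model artifact is stored in S3.
        self.next(self.evaluate)

    @step
    def evaluate(self):
        # Evaluate the fine-tuned model on the benchmark (vLLM-accelerated).
        self.next(self.human_evaluation)

    @step
    def human_evaluation(self):
        # Send the benchmark predictions to Argilla under the experiment key.
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    SpellcheckTrainingFlow()
```
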
![Human-evaluation with Argilla](../assets/argilla.png)
*Human-evaluation with Argilla*

The model and dataset versions are handled in the Hugging Face repositories as branches (v1, v2) and commits (v1.1, v1.2). You can easily access any version using the *Datasets* library from Hugging Face:
```python
from datasets import load_dataset

dataset = load_dataset(
    path="openfoodfacts/spellcheck-dataset",
    revision="v8",
    split="train+test",
)
```
## Integration with Batch Job

Once the model is selected, the inference script, along with its dependencies, is containerized in a Docker image[^spellcheck-inference] before being pushed to the image registry[^makefile] (currently Google Artifact Registry). The image is then used within the [batch job pipeline](../references/batch-job.md), defined by the batch job type `ingredients-spellcheck`.
[^evaluation-algo]: see `spellcheck/src/spellcheck/evaluation/evaluator`
[^dags]: see `scripts/dags`
[^configs]: see `spellcheck/config/training`
[^training-script]: see `spellcheck/scripts/training`
[^argilla-modules]: see `spellcheck/src/spellcheck/argilla`
[^spellcheck-inference]: see `robotoff/batch/spellcheck`
[^makefile]: see `robotoff/makefile`