K.B.Dharun Krishna edited this page Mar 12, 2024 · 1 revision

This page documents the official and community datasets featuring tldr-pages.

Official Datasets

We generate datasets in CSV, XML, JSON, and TMX (Translation Memory eXchange) formats using the https://github.com/tldr-pages/tldr-translation-pairs-gen tool. They can be found under its latest release. These artifacts are also available from the sources below:

  • OPUS tldr-pages Dataset (TMX format)

    • OPUS is a public dataset of translated resources on the web. All translations are derived from freely available and openly licensed sources, so the translations themselves are safe to use with minimal restrictions.
    • These datasets are helpful for a variety of applications such as research and machine learning.
    • A notable project that uses the OPUS corpora is LibreTranslate (which is powered by argos-translate).
  • Kaggle Translation Pairs Dataset (CSV format)

    • Kaggle is a data science competition platform and online community of data scientists and machine learning practitioners, owned by Google LLC.
    • It is popular among students and data scientists.
    • This multilingual text dataset contains string pairs that align text across the localized versions of tldr-pages.
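
As a rough illustration of how a TMX file from the sources above might be consumed, here is a minimal Python sketch that extracts source/target string pairs. It assumes only the standard TMX layout (`tu` translation units holding `tuv` variants tagged with `xml:lang`, each with a `seg`). The sample content, the `read_pairs` helper, and the `en`/`fr` language codes are hypothetical and are not taken from the actual dataset, whose codes and contents may differ.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hypothetical sample for illustration only; not real dataset content.
SAMPLE_TMX = """<tmx version="1.4">
  <header srclang="en" datatype="plaintext" segtype="sentence"
          adminlang="en" creationtool="example" creationtoolversion="0"
          o-tmf="none"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>List files one per line</seg></tuv>
      <tuv xml:lang="fr"><seg>Lister les fichiers, un par ligne</seg></tuv>
    </tu>
  </body>
</tmx>
"""


def read_pairs(tmx_text, source_lang, target_lang):
    """Return (source, target) string pairs found in TMX text."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for unit in root.iter("tu"):
        # Map each language code in this translation unit to its segment text.
        segments = {}
        for variant in unit.iter("tuv"):
            seg = variant.find("seg")
            if seg is not None:
                segments[variant.get(XML_LANG)] = "".join(seg.itertext()).strip()
        # Keep the unit only when both requested languages are present.
        if source_lang in segments and target_lang in segments:
            pairs.append((segments[source_lang], segments[target_lang]))
    return pairs


for source, target in read_pairs(SAMPLE_TMX, "en", "fr"):
    print(f"{source} -> {target}")
```

For a downloaded file, the text could be read from disk and passed to the same helper; very large files may be better served by a streaming parser such as `ET.iterparse`.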

Community Datasets

  1. https://www.kaggle.com/datasets/bppuneethpai/tldr-summary-for-man-pages