shamikbose/bigLAM

bigLAM

This repository contains the dataloader scripts used for the BigLAM initiative. The supported dataloaders are:

  • Atypical Animacy: Atypical animacy detection dataset, based on nineteenth-century sentences in English extracted from an open dataset of nineteenth-century books digitized by the British Library.
  • Old Bailey Proceedings: 2,163 transcriptions of the Proceedings and 475 Ordinary's Accounts, marked up in TEI-XML, together with some documentation covering the data structure and variables. Each Proceedings file represents one session of the court (1674-1913), and each Ordinary's Account file represents a single pamphlet (1676-1772).
  • Corpus of Late Modern English Texts v3.1: CLMET3.1 is a principled collection of public domain texts drawn from various online archiving projects. In total, the corpus contains some 34 million words of running text.
  • Lampeter Corpus: The Lampeter Corpus of Early Modern English Tracts is a collection of texts on various subjects published between 1640 and 1740. Each text is associated with a year and one of the following topics: Law, Economy, Religion, Politics, Science, Miscellaneous.
  • Lancaster Newsbooks: This corpus consists of two collections of seventeenth-century English "newsbooks". The first collection (1654_newsbooks) consists of every newsbook published in London in the first half of 1654 that still survives in the Thomason Tracts. The second collection (mercurius_fumigosus) consists of every surviving issue of the highly idiosyncratic newsbook "Mercurius Fumigosus", written by John Crouch between summer 1654 and early autumn 1655.
  • Hansard Speech: This corpus consists of all the speeches made in the House of Commons from May 1979 to July 2020. Each text is associated with a speaker, their political party, and the date, in addition to other metadata.
  • Contentious Contexts Corpus: This dataset contains extracts from historical Dutch newspapers that include keywords considered potentially contentious according to present-day sensibilities. Each instance carries multiple annotations, which makes it possible to quantify agreement between annotators. The dataset can be used to track how words and their meanings have changed over time.
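
The agreement quantification mentioned for the Contentious Contexts Corpus can be illustrated with a minimal sketch. It computes simple pairwise percent agreement per instance and averages it over instances. The label strings, the `toy_annotations` data, and both function names are hypothetical and chosen only for illustration; they are not taken from the dataset or from the loader scripts in this repository, and a chance-corrected measure may be more appropriate in practice.

```python
from itertools import combinations


def pairwise_agreement(labels):
    """Fraction of annotator pairs that gave the same label to one instance."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two annotations per instance")
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_agreement(instances):
    """Average of the per-instance pairwise agreement over all instances."""
    return sum(pairwise_agreement(labels) for labels in instances) / len(instances)


# Hypothetical annotations: one inner list per instance, one label per annotator.
toy_annotations = [
    ["contentious", "contentious", "contentious"],
    ["contentious", "not contentious", "contentious"],
    ["not contentious", "not contentious", "not contentious", "contentious"],
]

overall = mean_agreement(toy_annotations)
```

Percent agreement is used here because it needs no assumptions about the number of annotators per instance; it does not correct for agreement expected by chance.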

About

Dataset loaders for the BigLAM hackathon
