A repo for experiments in "math concept identification" using the TAC corpus and the nLab corpus, along with first thoughts about NLI (natural language inference) for mathematics.
The TAC corpus can be found at https://github.com/ToposInstitute/tac-corpus.
A selection of 436 sentences from the TAC corpus (some are empty), chosen for moderate length (not too long, not too short) and absence of LaTeX, is at https://github.com/ToposInstitute/tac-corpus/blob/main/golden-attempt/examples.txt. It is repeated here for convenience, both as Experiment2.txt in the folder Experiment436 and as the file 436sentences.txt.
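As a rough illustration of the kind of filter described above (moderate length, no LaTeX), here is a minimal sketch. The word-count bounds, the LaTeX pattern, and the function names are assumptions made for this example; the actual criteria used to build the 436-sentence selection are not documented here.

```python
import re

# Hypothetical thresholds: the real bounds used for the selection are not stated.
MIN_WORDS = 5
MAX_WORDS = 40

# Rough LaTeX detector (an assumption): a dollar sign, a backslash command
# such as \alpha, or the delimiters \( and \[.
LATEX_PATTERN = re.compile(r"\$|\\[A-Za-z]+|\\\(|\\\[")


def is_candidate(sentence: str) -> bool:
    """Return True if the sentence has a moderate length and no LaTeX markup."""
    words = sentence.split()
    if not (MIN_WORDS <= len(words) <= MAX_WORDS):
        return False
    return LATEX_PATTERN.search(sentence) is None


def select_sentences(sentences):
    """Keep only the sentences that pass the length and LaTeX checks."""
    return [s for s in sentences if is_candidate(s)]
```

A whitespace word count is a crude measure of size; a character count or a tokenizer would work equally well for this purpose.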
The nLab corpus (from around 2020) is at https://github.com/ToposInstitute/nlab-corpus.
Short guidelines for annotation by mathematicians, already agreed upon:
- Try to treat math concepts as black boxes, as much as possible.
- Use the singular, instead of the plural, for concepts. Use no Capitals for concepts, as much as possible.
- If one has a long span that is a concept, e.g. “enriched accessible categories”, we should also list the sensible subspans like “accessible category”.
A subset of the sentences have no mathematical concepts at all, e.g. "Further applications are given."
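The singular, lowercase, and subspan guidelines above could be sketched in code roughly as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the singularization rule is a naive English heuristic that will fail on irregular plurals (e.g. "topoi"), blanket lowercasing ignores the "as much as possible" caveat (e.g. for proper names), and deciding which subspans are "sensible" remains the annotator's judgment. The code only proposes candidates.

```python
def singularize(word: str) -> str:
    """Naive singularization (an assumption for illustration, not a real lemmatizer)."""
    if word.endswith("ies") and len(word) > 3:
        return word[:-3] + "y"
    # Leave words like "class", "topos", "basis", "locus" untouched.
    if word.endswith("s") and not word.endswith(("ss", "os", "is", "us")):
        return word[:-1]
    return word


def normalize_concept(span: str) -> str:
    """Lowercase the span and put its last word (the head noun) in the singular."""
    words = span.lower().split()
    if not words:
        return ""
    words[-1] = singularize(words[-1])
    return " ".join(words)


def candidate_subspans(concept: str) -> list:
    """List the shorter spans obtained by dropping leading modifiers.

    These are candidates only; the annotator decides which ones are sensible.
    """
    words = concept.split()
    return [" ".join(words[i:]) for i in range(1, len(words))]
```

For example, `normalize_concept("Enriched accessible categories")` gives `"enriched accessible category"`, and `candidate_subspans` applied to that result gives `["accessible category", "category"]`. A sentence such as "Further applications are given." would simply receive an empty list of concepts.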