Includes several different data augmentation techniques for Swedish text.
!git clone https://github.com/mosh98/swe_aug.git
This is built on top of a Swedish word2vec model. Make sure you download it first.
!wget https://www.ida.liu.se/divisions/hcs/nlplab/swectors/swectors-300dim.txt.bz2
!bzip2 -dk /content/swectors-300dim.txt.bz2
!pip install -r reqs.txt
word_vec_path = '/content/swectors-300dim.txt'  # path to the txt vector file
# you can also point this to your own pretrained word2vec (make sure it is a txt file)
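Optionally, you can sanity-check that the vectors were downloaded and decompressed correctly. A minimal sketch, assuming the file is in the standard text word2vec format that gensim can read (gensim is only used for this check, it is not required by the package):

from gensim.models import KeyedVectors

# load the decompressed text vectors (this can take a little while)
vectors = KeyedVectors.load_word2vec_format(word_vec_path, binary=False)
print(vectors.vector_size)           # expected: 300
print(vectors.most_similar("hund"))  # nearest neighbours of "hund" (dog)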
A way to augment data that is easy to understand and use. There are four main components:
- Random Synonym Replacement
- Random Word Replacement
- Random Word Deletion
- Random Word Insertion
from swe_aug import EDA
aug = EDA.Enkel_Data_Augmentation(word_vec_path)
txt = "enter ur desired text. It can be a sentence or a paragraph"
# the alpha_* values set the fraction of words affected by each operation (cf. EDA [2]);
# num_aug is the number of augmented sentences generated per input
augmented_sentences = aug.enkel_augmentation(txt, alpha_sr=0.1,
                                             alpha_ri=0.3, alpha_rs=0.2,
                                             alpha_rd=0.1, num_aug=4)
# returns a list of augmented sentences
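The result is just a Python list (of strings, as the comment above suggests), so you can inspect or save it however you like, e.g.:

# print each augmented variant on its own line
for i, sentence in enumerate(augmented_sentences, start=1):
    print(i, sentence)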
from swe_aug.Other_Techniques import Text_Cropping
frag = Text_Cropping.cropper(percent = 0.25)
list_of_fragmented_sentence = frag.text_fragmeter(txt)
# chops the sentence into four fragments (each roughly 25% of the original)
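A typical use of cropping is to grow a labelled training set by giving every fragment the label of the sentence it came from. A small sketch (the example sentence and label are made up, only frag.text_fragmeter comes from the package):

# hypothetical labelled data; each cropped fragment keeps the original label
dataset = [("jag älskar den här filmen", "positiv")]

augmented = []
for text, label in dataset:
    augmented.append((text, label))             # keep the original sentence
    for fragment in frag.text_fragmeter(text):  # add its cropped fragments
        augmented.append((fragment, label))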
The idea is to replace a word with a word that is close to it in the embedding space and has the same POS tag. [4]
# "NOUN", "VERB", "ADJ", "ADV", "PROPN","CONJ"
#These are the tokens you can perturb! [CASE SENSITIVE!]
from swe_aug.Other_Techniques import Type_SR
aug = Type_SR.type_DA(word_vec_path)
list_of_augs = aug.type_synonym_sr(txt, token_type = "NOUN", n = 2)
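Since the POS tag is just an argument, you can generate variants for several of the supported tags in one go, for example:

# run the type-specific replacement for a few POS tags (case sensitive!)
for pos in ["NOUN", "VERB", "ADJ"]:
    variants = aug.type_synonym_sr(txt, token_type=pos, n=2)
    print(pos, variants)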
[1] Swedish word2vec (Swectors): https://www.ida.liu.se/divisions/hcs/nlplab/swectors/
[2] EDA (Wei and Zou, 2019): https://aclanthology.org/D19-1670/
[3] Text Fragmenter: original to this package
[4] Type-specific synonym replacement: original to this package
@software{Mahamud2022,
  author       = {Mahamud, Mosleh},
  title        = {Swedish Augmentation Packages},
  year         = {2022},
  publisher    = {GitHub},
  journal      = {Not Decided yet},
  howpublished = {\url{https://github.com/mosh98/swe_aug}},
}