📚 Content generator that, given a table of contents, writes the text for you using Machine Learning and search similarity.
We do love technology, but when it comes to writing essays, reports, or compositions we tend to be a little lazy, and our writing skills can be truly outrageous. So we decided to build something that does all that boring stuff for us automatically. No more boring reports for students!
It creates content from an index; that is to say, it generates text. Yeah, we know it sounds crazy, but that's actually what it does. The user writes a tiny index with a short description of each section. We then search a 50-dimensional feature index (dimensionality previously reduced) containing 1M indexed Wikipedia paragraphs, take the nearest paragraph, and use it as the seed to generate the rest of the text with a Deep Learning model. The user can choose the level of randomness used during text generation, and images can be added to the text automatically.
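As an illustration, the lookup step could look roughly like the sketch below. This is not the project's actual code: the file names, the word-vector artifact, and the 50-dimensional reducer are all hypothetical placeholders for things assumed to have been built offline.

```python
import pickle

import nmslib
import numpy as np
from gensim.models import KeyedVectors

# Assumed artifacts produced offline: word vectors, a 50-d reducer,
# the nmslib index, and the raw paragraph texts (all names hypothetical).
word_vectors = KeyedVectors.load("wiki_word2vec.kv")
reducer = pickle.load(open("pca_50d.pkl", "rb"))
index = nmslib.init(method="hnsw", space="cosinesimil")
index.loadIndex("paragraphs.nmslib")
paragraphs = pickle.load(open("paragraphs.pkl", "rb"))


def nearest_paragraph(description: str) -> str:
    """Embed the section description, reduce it to 50 dims, return the closest paragraph."""
    tokens = [t for t in description.lower().split() if t in word_vectors]
    if not tokens:
        return ""
    vec = np.mean([word_vectors[t] for t in tokens], axis=0)
    vec_50d = reducer.transform(vec.reshape(1, -1))[0]
    ids, _ = index.knnQuery(vec_50d, k=1)
    return paragraphs[int(ids[0])]
```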
Finally, the user has the option to export the content as a LaTeX file and a compiled PDF.
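The export step can be as simple as handing the generated content to Pandoc, for example through the pypandoc wrapper. This is just a minimal sketch, assuming Pandoc and a LaTeX engine are installed; `export_pdf` is a hypothetical helper, not part of the project.

```python
import pypandoc


def export_pdf(markdown_text: str, out_path: str = "report.pdf") -> None:
    # Pandoc converts the generated markdown into a PDF via LaTeX.
    pypandoc.convert_text(markdown_text, to="pdf", format="md", outputfile=out_path)


export_pdf("# Introduction\n\nGenerated content goes here.")
```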
The whole project is built in Python, with several libraries that make it possible.
In phase one of the processing pipeline, the application receives the titles and descriptions and uses them to search an nmslib index. This index was built with the well-known Word2Vec model, which gives us feature vectors for a Wikipedia dataset. That dataset was cleaned and formatted to get the best results we could. We initially had a dataset of 5.2M paragraphs, but we decided to reduce it to 1M so it would train faster.
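Building such an index could look roughly like the sketch below. The reduction to 50 dimensions via PCA, the Word2Vec parameters, and the file names are assumptions for illustration; the writeup only states that the features were reduced to 50 dimensions before indexing.

```python
import pickle

import nmslib
import numpy as np
from gensim.models import Word2Vec
from sklearn.decomposition import PCA

# Assumed input: the 1M cleaned Wikipedia paragraphs (file name hypothetical).
paragraphs = pickle.load(open("wiki_1m_paragraphs.pkl", "rb"))
tokenized = [p.lower().split() for p in paragraphs]

# Train Word2Vec on the cleaned paragraphs to get word features.
w2v = Word2Vec(sentences=tokenized, vector_size=300, window=5, min_count=5, workers=4)


def paragraph_vector(tokens):
    """Represent a paragraph as the mean of its word vectors."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)


features = np.vstack([paragraph_vector(t) for t in tokenized])

# Reduce the paragraph features to 50 dimensions before indexing.
features_50d = PCA(n_components=50).fit_transform(features)

# Build an approximate nearest-neighbour index with nmslib (HNSW, cosine similarity).
index = nmslib.init(method="hnsw", space="cosinesimil")
index.addDataPointBatch(features_50d)
index.createIndex({"M": 16, "efConstruction": 200}, print_progress=True)
index.saveIndex("paragraphs.nmslib")
```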
In phase two, we used GPT-2, the recent unsupervised learning model developed by OpenAI. Among other things, it can generate text by predicting, given a context, the word most likely to come next. We took the pre-trained model and fine-tuned it on Wikipedia data in Google Colab (using tools like Pandas for data preparation) to see if we could make it generate better results. We then took the paragraph returned by the index and generated text from it. After that, we used TextBlob to analyze the output; that information was used to pick an image for the text. The image search is done with the Google Custom Search API, and the output LaTeX file is produced with Pandoc.
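A condensed sketch of this step is shown below, using the Hugging Face `transformers` API as a stand-in for the original OpenAI GPT-2 code the project actually fine-tuned in Colab. The user's randomness level maps naturally to the sampling temperature, and TextBlob noun phrases can serve as the query sent to the image search; `generate_section` and `image_query` are hypothetical helpers.

```python
from textblob import TextBlob
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")


def generate_section(context: str, randomness: float = 0.7, max_len: int = 200) -> str:
    """Continue the retrieved Wikipedia paragraph; higher temperature means more random text."""
    inputs = tokenizer(context, return_tensors="pt")
    output = model.generate(
        **inputs,
        do_sample=True,
        temperature=randomness,
        max_length=max_len,
        pad_token_id=tokenizer.eos_token_id,
    )
    return tokenizer.decode(output[0], skip_special_tokens=True)


def image_query(text: str) -> str:
    """Pick a noun phrase from the generated text to feed the image search."""
    phrases = TextBlob(text).noun_phrases
    return phrases[0] if phrases else text.split(".")[0]
```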
We've hosted the whole stack on Google Cloud Platform.
We truly love new technology, and some of us are really interested in deep learning and natural language processing. However, we are far from being experts; actually, we are newbies in those topics, so we had to work through a lot of things to finally accomplish what we wanted.
In particular, we had problems with the size of the datasets we wanted to handle (we started with the idea of using THE WHOLE WIKIPEDIA DATASET), but we did not have the computing power for that. We also wanted to fine-tune the GPT-2 model developed by OpenAI, taking advantage of the pretrained weights, to see whether the word predictions improved.
We are really proud of the final result. Even though it seemed impossible to get decent results and merge everything, in the end we managed to build something and bring together all the work that had been done separately.
We've learned about writing multithreaded code, using Google Colab, building good Flask APIs, unsupervised learning methods, and organizing our work properly (which is really important, guys :D).
We want to build a bigger index of paragraphs and spend more time with GPT-2, in order to get better results and fully understand those nasty and mysterious hyperparameters.
- Python 3.6+
Using virtualenv is recommended for package / runtime isolation.
To run the server, please execute the following from the root directory:
- Set up a virtual environment:

  `virtualenv -p /usr/bin/python3 env`

  `source env/bin/activate`

- Install the dependencies:

  `pip3 install -r requirements.txt`

- Run the Flask server as a Python module:

  `python3 -m src`
MIT © Donework