Researchers Relevance Estimation to Scientific Disciplines for Recommendation System using Natural Language Processing Techniques (APELLA+)
This repository contains the source code and test datasets of the Diploma Thesis "Researchers Relevance Estimation to Scientific Disciplines for Recommendation System using Natural Language Processing Techniques", developed by Vasileios Moschopoulos and Konstantinos Nikiforidis, undergraduate students of the Department of Electrical and Computer Engineering at Aristotle University of Thessaloniki.
The project consists of two main parts. The first is an automated web scraper that collects researchers' scientific publication data (title, abstract, year, etc.) from one of the most popular scientific search engines (Google Scholar, Semantic Scholar, ResearchGate). The second is a pipeline that produces a relevance ranking of university professors (from a register pool) for an open academic position, based on text embedding similarity comparisons. Sentence embeddings are extracted with the pretrained models SciBERT and SPECTER, both based on the BERT architecture and trained on scientific text corpora. These models were further fine-tuned using the SimCSE framework, which gave superior results on the test datasets.
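The ranking step described above can be illustrated with a minimal sketch. It assumes that each professor and the open position have already been reduced to a single embedding vector, and that cosine similarity is the comparison metric; the README only states "text embedding similarity comparisons", so the metric, the function names (`cosine_similarity`, `rank_professors`) and the toy 3-dimensional vectors are all hypothetical and are not taken from the repository code.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention for zero vectors: no similarity
    return dot / (norm_u * norm_v)


def rank_professors(position_embedding, professor_embeddings):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [
        (name, cosine_similarity(position_embedding, embedding))
        for name, embedding in professor_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for real sentence embeddings (assumption:
# real embeddings would come from SciBERT/SPECTER and be much larger).
position = [1.0, 0.0, 1.0]
professors = {
    "prof_a": [1.0, 0.1, 0.9],
    "prof_b": [0.0, 1.0, 0.0],
    "prof_c": [0.5, 0.5, 0.5],
}

ranking = rank_professors(position, professors)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy example the professor whose vector points in nearly the same direction as the position vector is ranked first, and the one orthogonal to it is ranked last.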
The best-performing fine-tuned models, SimCSE_smallD and SimCSE_largeD (batch_size: 40, max_sequence_length: 300), which are based on contrastive learning, can be found here (with PDF report) in the standard Hugging Face model format. The datasets used for model training and fine-tuning are also in the same folder.
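For readers unfamiliar with contrastive learning, the sketch below shows the general shape of the in-batch contrastive (InfoNCE-style) objective that SimCSE is built on: each anchor embedding should be closer to its own positive than to the other positives in the batch. This is an illustration only, written in plain Python on toy vectors. The function name `contrastive_loss` is hypothetical, the temperature of 0.05 is an assumption (it is the default in the SimCSE paper, not a value confirmed by this repository), and how the positive pairs were built for SimCSE_smallD and SimCSE_largeD is not stated here.

```python
import math


def _cosine(u, v):
    """Cosine similarity of two non-zero, equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def contrastive_loss(anchors, positives, temperature=0.05):
    """Mean in-batch contrastive loss.

    anchors[i] is paired with positives[i]; every other positives[j]
    in the batch acts as a negative for anchors[i].
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [_cosine(anchor, p) / temperature for p in positives]
        # log-sum-exp with the max subtracted for numerical stability
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch of two pairs (assumption: real batches would hold 40
# sentence embeddings, matching the batch_size reported above).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])
mismatched = contrastive_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])
print(f"aligned pairs:    {aligned:.6f}")
print(f"mismatched pairs: {mismatched:.6f}")
```

The loss is close to zero when every anchor matches its own positive and large when the pairs are swapped, which is the signal that fine-tuning uses to pull related texts together in embedding space.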
The data in the /csv_files folder on professors' personal info (name, rank, APELLA id, email, etc.) is already publicly available as raw pdf/xlsx files on the official website of the School of Informatics AUTh (https://www.csd.auth.gr/).