Automatic Refactoring Candidate Identification Leveraging Effective Code Representation

The use of machine learning to automate the detection of refactoring candidates is a rapidly evolving research area. The majority of work in this direction uses source code metrics and commit messages to predict refactoring candidates and do not exploit the rich semantics of source code. This paper proposes a new method to predict extract method refactoring candidates. We employ a self-supervised autoencoder to acquire a compact representation of source code generated by a pre-trained large language model. Subsequently, we train a classifier to solve the extract method identification task. We evaluate the performance of the proposed approach against a state-of-the-art baseline model; our proposed approach outperforms the state-of-the-art model by 30% in terms of the F1 score. The proposed work has implications for researchers and practitioners. Software developers may use the proposed automated approach to predict refactoring candidates better. This study will facilitate the development of improved refactoring candidate identification methods that the researchers in the field will use and extend.

Approach Overview

Computational requirements

Software Requirements

Windows/Linux
Python 3.10+
- joblib
- PyDriller
- python-dotenv
- transformers
- tensorflow
- torch
- scikit-learn

Memory and Runtime Requirements

Any CUDA based GPU with a minimum memory of 4GB.

Steps to reproduce

(Note: Provided steps are for linux based system)

Dependencies and Environment Set-Up

Clone this repository to the local folder 'cd' into the folder.

(optional) Setup virtual environment

python -m venv <venv_name>
source <venv_name>/bin/activate

Install all the dependencies
```
pip install -r requirements.txt
```
The code is optimized for GPU (cuda) so running it on a GPU with decent memory should work fine. If there is memory issue, please lower the batch size.*

Extract Method Dataset Creation

The file input.csv contains all the repository names from the baseline for which the dataset is to be created.
To split the file in 5 parts, execute the following script:
```
source csv-splitter.sh
```
Note: The split size can be configured from the script.
We then took 5% of the repositories, from first 2 splits to extract the positive and negative samples.
To generate the positive and negative samples, execute the following python script with needed arguments.
```
python data_creator.py <inpute_file_path> <output_file_path>
```
This will create the .jsonl file with the identified samples.
Pre-generated dataset is present in https://doi.org/10.5281/zenodo.8122619

Deep Learning

First navigate to the deep learning folder using -
```
cd deep-learning
```
Splitting the dataset (.jsonl file) to train and test split:
To split the dataset execute the following python script with needed arguments-
```
python data_creator.py <input_file_path> <output_file_name>
```
It will split the dataset and store it as '.npy' files as 'data/np_arrays/output_file_name.npy'
To train and test the autoencoder model, execute the following script-
```
python autoencoder_pn.py <train_data_file_path> <test_data_file_path>
```
This will store the validated model in './trained_models/AE/models'
To train and test the random forest classifier, execute the following python script with 4 arguments-
```
python classify_rf.py <train_data_file_path> <train_label_file_path> <test_data_file_path> <test_label_file_path>
```
It will generate the encoded embeddings on the fly and the final evaluation metrics will be displayed on the terminal.

To plot the t-SNE plots, execute execute the following script from the cloned root directory-

cd deep-learning/metrics/
python plot.py tsne <embedded_representation_path> <label_array_path>

License

Apache-2.0

Citation

@software{Palit_Extract_method_identification,
author = {Palit, Indranil and Shetty, Gautam and Arif, Hera and Sharma, Tushar},
license = {Apache-2.0},
title = {{Extract method identification}},
url = {https://github.com/SMART-Dal/extract-method-identification}
}

Name		Name	Last commit message	Last commit date
Latest commit History 75 Commits
deep-learning		deep-learning
refactoring_identifier		refactoring_identifier
utils		utils
.gitignore		.gitignore
ArchDiagram.png		ArchDiagram.png
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
csv-splitter.sh		csv-splitter.sh
data-collector-st.sh		data-collector-st.sh
data_collector.py		data_collector.py
input.csv		input.csv
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Automatic Refactoring Candidate Identification Leveraging Effective Code Representation

Computational requirements

Software Requirements

Memory and Runtime Requirements

Steps to reproduce

Dependencies and Environment Set-Up

Extract Method Dataset Creation

Deep Learning

License

Citation

About

Releases

Packages

Contributors 2

Languages

License

SMART-Dal/extract-method-identification

Folders and files

Latest commit

History

Repository files navigation

Automatic Refactoring Candidate Identification Leveraging Effective Code Representation

Computational requirements

Software Requirements

Memory and Runtime Requirements

Steps to reproduce

Dependencies and Environment Set-Up

Extract Method Dataset Creation

Deep Learning

License

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages