MSR 2024 Artifact

About

This repository contains the source code and dataset examples for our MSR'24 paper PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software.

Repository Structure

PeaTMOSS Dataset (§4, §6)
- README.md
- PeaTMOSS.sql
- Examples/
LLM-Pipeline/ (§5)
- README.md
- Accurate_pipeline
- Cheap_pipeline
License-Analysis/ (§7)
- README.md
- analyzeIncompatibilities/
- data/
- figures/
- ghRepos
- projectPairings/
- ptmProjects/

PeaTMOSS Dataset

PeaTMOSS Demos

About

This repository contains a zipped sample of the PeaTMOSS dataset and a script that demonstrates possible interactions with the SQLite database that stores the metadata. The complete PeaTMOSS dataset contains snapshots of Pre-Trained machine learning Model (PTM) repositories and of the downstream open-source GitHub repositories that reuse the PTMs, metadata about the PTMs, the pull requests and issues of the GitHub repositories, and links between the downstream GitHub repositories and the PTMs. The schema of the SQLite database is specified by PeaTMOSS.py and PeaTMOSS.sql. The sample of the database is PeaTMOSS_sample.db. The full database and all captured repository snapshots are available through Globus Share (see the Globus section below).

- Note: When extracting .tar.gz snapshots, include the flag

--strip-components=4

in the tar command, like so:

tar --strip-components=4 -xvzf {name}.tar.gz

Otherwise, the extracted repository will be nested inside 4 extraneous parent directories.
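For users who prefer to extract snapshots from Python rather than from the shell, the same idea can be sketched with the standard-library tarfile module. This is only an illustration, not part of the repository: the function name extract_stripped is hypothetical, and the sketch extracts only regular files and directories (links are skipped), which may differ from what tar itself does.

```python
import tarfile
from pathlib import Path, PurePosixPath


def extract_stripped(archive_path, dest_dir, strip=4):
    """Extract a .tar.gz archive, dropping the first `strip` path components.

    Hypothetical helper that mimics `tar --strip-components=4` for regular
    files and directories only; symbolic and hard links are skipped.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar.getmembers():
            if not (member.isfile() or member.isdir()):
                continue  # skip links and special files in this sketch
            parts = PurePosixPath(member.name).parts
            if len(parts) <= strip:
                continue  # entry is one of the wrapper directories
            # Rewrite the member path so it no longer includes the wrappers.
            member.name = str(PurePosixPath(*parts[strip:]))
            tar.extract(member, path=dest)
            extracted.append(member.name)
    return extracted
```

Calling extract_stripped("{name}.tar.gz", "snapshots") would place the repository directly under snapshots/ instead of four directories deeper.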

The script that we used to generate the figures in §6 of the paper is available in script_plot.py.

Globus

Globus Share

All zipped repositories and the full metadata dataset are available through Globus Share.

If you do not have an account, follow the Globus docs on how to sign up. You may create an account through a partnered organization if you are a part of that organization, or through Google or ORCID accounts.

Globus Connect Personal

To access the metadata dataset using the globus.py script provided in the repository:

1. Download Globus Connect Personal.
2. Create your own private Globus collection on Mac, Windows, or Linux.
3. Once the collection is created, make sure Globus Connect Personal is running before executing globus.py.

NOTE: In some cases, you may run into permission issues on Globus when running the script. If this is the case, you will need to change local_endpoint.endpoint_id, located on line 29, to your private collection's UUID:

local_endpoint_id = local_endpoint.endpoint_id

To locate your private collection's UUID, click on the Globus icon on your taskbar and select "Web: Collection Details". On this page, scroll down to the bottom, where the UUID field for your collection should be visible, and replace local_endpoint.endpoint_id with your collection's UUID expressed as a string. Then, use the activities tab to terminate the existing transfer and rerun globus.py.

Metadata Description

The following model hubs are captured in our database:

- Hugging Face
- PyTorch Hub

The content for each specific model hub is listed in the table below:

Model hub	#PTMs	#Snapshotted Repos	#Discussions (PRs, issues)	#Links	Size of Zipped Snapshots
Hugging Face	281,638	14,296	59,011	30,514	44TB
PyTorch Hub	362	361	52,161	13,823	1.3GB

We also offer two different formats of our datasets. An overview of these two formats can be found in the table below:

Formats	Description	Size
Metadata	It contains only the metadata of the PTM packages and a subset of the GitHub project metadata.	8.32GB (7.12GB + 1.20GB enhanced metadata)
Full	It contains all metadata, adding the PTM package contents in each published version, and the git history of the main branches of the GitHub projects.	48.2TB

Dependencies

The scripts in this project depend on the following software:

- Python 3.11
- SQLAlchemy 2.0 or greater

Package dependencies are listed in environment.yml and managed by Anaconda.

How To Install

To run the scripts in this project, you must install Python 3.11 and SQLAlchemy v2.0 or greater.

These packages can be installed using the Anaconda environment manager:

1. Install the latest version of Anaconda.
2. Run conda env create -f environment.yml to create the Anaconda environment PeaTMOSS.
3. Activate the environment using conda activate PeaTMOSS.

Alternatively, you can install each package manually by following the instructions on its project page.

How to Run

After creating and activating the Anaconda environment, each demo script can be run using python3 script_name.py

Tutorial

This section will explain how to use SQL and SQLAlchemy to interact with the database to answer the research questions outlined in some of the future work we propose.

Using SQL to query the database

One option users have to interact with the metadata dataset is to use plain SQL. The metadata dataset is stored in a SQLite database file called PeaTMOSS.db, which can be found in the Globus Share. This file can be queried with standard SQL, for example from a terminal using sqlite3: https://sqlite.org/cli.html. Single queries can be executed like

$ sqlite3 PeaTMOSS.db '{query statement}'

Alternatively, you can start an SQLite instance by simply executing

$ sqlite3 PeaTMOSS.db

which can be terminated with CTRL + D or .quit. To write query results to a file, use the .output command:

sqlite> .output {filename}.txt

Research Question Example (SQL)

The following example has to do with research question GH2: "What do developers on GitHub discuss related to PTM use, e.g., in issues, and pull requests? What are developers’ sentiments regarding PTM use? Do the people do pull requests of PTMs have the right expertise?"

If someone wants to observe what developers on GitHub are currently discussing related to PTM usage, they can look at discussions in GitHub issues and pull requests. The following SQLite example shows queries that would help accomplish this task.

First, we will create an sqlite3 instance:

$ sqlite3 PeaTMOSS.db

Then, we will create an output file for our issues query, then execute that query:

sqlite> .output issues.txt
sqlite> SELECT id, title FROM github_issue WHERE state = 'OPEN' ORDER BY updated_at DESC LIMIT 100;

The above query selects the id and title fields from the github_issue table and returns the 100 most recently updated issues that are still open.

Next, we will create an output file for our pull requests query, then execute that query:

sqlite> .output pull_requests.txt
sqlite> SELECT id, title FROM github_pull_request WHERE state = 'OPEN' OR state = 'MERGED' ORDER BY updated_at DESC LIMIT 100;

Notice that this query closely resembles the issues query, since we are looking for similar information. It selects the id and title fields from the github_pull_request table and returns the 100 most recently updated pull requests that are either open or merged.

These queries are a starting point for examining current and recent GitHub discussions about PTMs. From here, you may change the LIMIT value to return more or fewer entries, or adjust which fields the queries return. For example, if you want more detailed information, you could select the "body" field in either table.
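The issues query can also be run from Python with the standard-library sqlite3 module instead of the sqlite3 shell. The following is a minimal sketch, not a script shipped with the repository: the helper names recent_open_issues and write_rows are hypothetical, and the table and column names are taken from the query shown earlier in this section.

```python
import sqlite3


def recent_open_issues(conn, limit=100):
    """Return (id, title) rows for the most recently updated open issues."""
    query = (
        "SELECT id, title FROM github_issue "
        "WHERE state = 'OPEN' "
        "ORDER BY updated_at DESC LIMIT ?"
    )
    return conn.execute(query, (limit,)).fetchall()


def write_rows(rows, filename):
    """Write one 'id|title' line per row, similar to the shell's list output."""
    with open(filename, "w", encoding="utf-8") as handle:
        for row_id, title in rows:
            handle.write(f"{row_id}|{title}\n")
```

With a connection opened via sqlite3.connect("PeaTMOSS.db"), calling write_rows(recent_open_issues(conn), "issues.txt") would produce a file comparable to the .output example. Passing the limit as a query parameter makes it easy to request more or fewer entries.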

Using ORMs to query the database

This section describes the demo provided in the repository, PeaTMOSS_demo.py, in more detail. Once again, this method requires the PeaTMOSS.db file, which can be found in the Globus Share. Before running this demo, ensure that the conda environment has been created and activated, or you may run into errors.

The purpose of the demo, as described by the comment at the top of its file, is to demonstrate how one may use SQLAlchemy to address one of the research questions. The question being addressed in the demo is I1: "It can be difficult to interpret model popularity numbers by download rates. To what extent does a PTM’s download rates correlate with the number of GitHub projects that rely on it, or the popularity of the GitHub projects?". The demo addresses this by looking at two main fields: the number of times a model is downloaded from its model hub, and the number of times a model is reused in a GitHub repository. The demo finds the 100 most downloaded models and counts how many times each of those models is reused. Users can take this information and attempt to find a correlation.

Research Question Example (ORM)

PeaTMOSS_demo.py uses PeaTMOSS.py, which describes the structure of the database so that we may interact with it using SQLAlchemy. To begin, you must create an SQLAlchemy engine using the database file:

import sqlalchemy
engine = sqlalchemy.create_engine(f"sqlite:///{path}")

where path is a string that describes the filepath to the database file. Both relative and absolute file paths can be used.

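To find the 100 most downloaded models, we will query the model table:

import sqlalchemy
from sqlalchemy.orm import Session
import PeaTMOSS  # needed for the PeaTMOSS.model_to_reuse_repository references used later
from PeaTMOSS import *

# ID, name (context_id), and download count of the 100 most downloaded models
query_name_downloads = sqlalchemy.select(Model.id, Model.context_id, Model.downloads).order_by(sqlalchemy.desc(Model.downloads)).limit(100)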

and execute the query in a session bound to the engine created above:

session = Session(engine)
models = session.execute(query_name_downloads).all()

For each of these models, we want to know how many times it is being reused. The model_to_reuse_repository table contains fields for model IDs and reuse repository IDs, effectively linking them together. If a model is reused in multiple repositories, its ID will show up multiple times in the model_to_reuse_repository table. To see whether these highly downloaded models are also highly reused, we can query the model_to_reuse_repository table and select only the entries where the model_id field is equal to the current model's ID:

for model in models:
    #...
    query_num_reuses = sqlalchemy.select(PeaTMOSS.model_to_reuse_repository.columns.model_id)\
                                  .where(PeaTMOSS.model_to_reuse_repository.columns.model_id == model.id)

This query selects every row of the model_to_reuse_repository table in which the current model's ID appears. If we execute this query and count the number of elements in the result, we have the number of times that model has been reused:

num_reuses = len(session.execute(query_num_reuses).all())
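As an alternative to issuing one query per model and counting rows in Python, the counts can be computed by the database in a single grouped query. The sketch below uses the standard-library sqlite3 module rather than SQLAlchemy and is not how the demo works. It assumes the ORM class Model maps to a table named model, which is not confirmed by the text above, and the function name reuse_counts_for_top_models is hypothetical.

```python
import sqlite3


def reuse_counts_for_top_models(conn, top_n=100):
    """Return (context_id, downloads, num_reuses) for the most downloaded models.

    Assumes a table named `model` (id, context_id, downloads) and the
    linking table `model_to_reuse_repository` (model_id, ...).
    """
    query = """
        SELECT m.context_id, m.downloads, COUNT(r.model_id) AS num_reuses
        FROM model AS m
        LEFT JOIN model_to_reuse_repository AS r ON r.model_id = m.id
        GROUP BY m.id, m.context_id, m.downloads
        ORDER BY m.downloads DESC
        LIMIT ?
    """
    # LEFT JOIN keeps models with no reuses; COUNT(r.model_id) is 0 for them.
    return conn.execute(query, (top_n,)).fetchall()
```

Letting SQLite do the counting avoids transferring every linking row to Python just to take its length.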

In each iteration of the loop we can store this information in dictionaries, where the keys can be the names of the models:

for model in models:
    highly_downloaded[model.context_id] = model.downloads
    #...
    #...
    reused_rates[model.context_id] = num_reuses

And then at the end, we can simply print the results. From there, users may observe a level of correlation using a method they see fit.

Citing PeaTMOSS

@inproceedings{PeaTMOSS,
  title={PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software},
  author={Wenxin Jiang and Jerin Yasmin and Jason Jones and Nicholas Synovic and Jiashen Kuo and Yuan Tian and George K. Thiruvathukal and James C. Davis},
  booktitle={Proceedings of the 21st Annual Conference on Mining Software Repositories (MSR'24)},
  year={2024}
}
