This repository contains the source code and dataset examples for our MSR'24 paper, "PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software".
- PeaTMOSS Dataset (§4, 6)
  - README.md
  - PeaTMOSS.sql
  - Examples/
- LLM-Pipeline/ (§5)
  - README.md
  - Accurate_pipeline
  - Cheap_pipeline
- License-Analysis/ (§7)
  - README.md
  - analyzeIncompatibilities/
  - data/
  - figures/
  - ghRepos
  - projectPairings/
  - ptmProjects/
This repository contains a zipped sample of the PeaTMOSS dataset, as well as a script that demonstrates possible interactions with the SQLite database used to store the metadata dataset. The complete PeaTMOSS dataset contains snapshots of Pre-Trained machine learning Model (PTM) repositories and the downstream open-source GitHub repositories that reuse the PTMs, metadata about the PTMs, the pull requests and issues of the GitHub repositories, and links between the downstream GitHub repositories and the PTMs. The schema of the SQLite database is specified by PeaTMOSS.py and PeaTMOSS.sql. The sample of the database is PeaTMOSS_sample.db. The full database and all captured repository snapshots are available through Globus Share.
When extracting a zipped repository snapshot, pass the flag --strip-components=4 in the tar statement, like so:
tar --strip-components=4 -xvzf {name}.tar.gz
If you do not do this, you will have 4 extraneous parent directories that encase the repository.
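For scripted extraction, roughly the same effect can be had from Python. The sketch below is an illustration under stated assumptions, not part of the repository: `strip_components` and `extract_stripped` are hypothetical helper names, only the standard-library `tarfile` module is used, and hard-link targets inside the archive are not rewritten.

```python
import tarfile
from pathlib import PurePosixPath


def strip_components(members, n):
    """Yield archive members with their first n path components removed."""
    for member in members:
        parts = PurePosixPath(member.name).parts
        if len(parts) <= n:
            # The entry lies entirely inside the stripped prefix, so skip it,
            # as tar --strip-components does.
            continue
        member.name = str(PurePosixPath(*parts[n:]))
        yield member


def extract_stripped(archive_path, dest, n=4):
    """Extract a .tar.gz snapshot into dest, dropping n parent directories."""
    # Use the safer "data" extraction filter where this Python build offers it.
    kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    with tarfile.open(archive_path, "r:gz") as tar:
        members = strip_components(tar.getmembers(), n)
        tar.extractall(dest, members=members, **kwargs)
```

Calling `extract_stripped("{name}.tar.gz", "snapshots")` would then place the repository directly under `snapshots/` rather than four directories deeper.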
The script that we used to generate the figures in §6 is available in script_plot.py.
All zipped repos and the full metadata dataset are available through Globus Share
If you do not have an account, follow the Globus docs on how to sign up. You may create an account through a partnered organization if you are a part of that organization, or through Google or ORCID accounts.
To access the metadata dataset using the globus.py script provided in the repository:
- Download Globus Connect Personal
- Create your own private Globus collection on Mac, Windows, or Linux
- Once this is created, make sure Globus Connect Personal is running before executing globus.py
NOTE: In some cases, you may run into permission issues on Globus when running the script. If this is the case, you will need to change local_endpoint.endpoint_id, located on line 29, to your private collection's UUID:
local_endpoint_id = local_endpoint.endpoint_id
To locate your private collection's UUID, click on the Globus icon on your taskbar and select "Web: Collection Details". On this page, scroll down to the bottom, where the UUID field for your collection should be visible, and replace the variable with your collection's UUID expressed as a string. Then, use the activities tab to terminate the existing transfer and rerun globus.py.
The following model hubs are captured in our database: Hugging Face and PyTorch Hub. The content for each model hub is listed in the table below:
Model hub | #PTMs | #Snapshotted Repos | #Discussions (PRs, issues) | #Links | Size of Zipped Snapshots |
---|---|---|---|---|---|
Hugging Face | 281,638 | 14,296 | 59,011 | 30,514 | 44TB |
PyTorch Hub | 362 | 361 | 52,161 | 13,823 | 1.3GB |
We also offer two different formats of our datasets. An overview of these two formats can be found in the table below:
Formats | Description | Size |
---|---|---|
Metadata | It contains only the metadata of the PTM packages and a subset of the GitHub project metadata. | 8.32GB (7.12GB + 1.20GB enhanced metadata) |
Full | It contains all metadata, adding the PTM package contents in each published version and the git history of the main branches of the GitHub projects. | 48.2TB |
The scripts in this project depend on Python 3.11 and SQLAlchemy v2.0 or greater. Package dependencies are given in environment.yml and handled by anaconda.
These packages can be installed using the anaconda environment manager:
- Install the latest version of anaconda from here
- Run conda env create -f environment.yml to create the anaconda environment PeaTMOSS
- Activate the environment using conda activate PeaTMOSS
Alternatively, you can navigate to each package's respective page and install it manually.
After installing the anaconda environment, each demo script can be run using python3 script_name.py.
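Before running a demo, it can help to confirm that the active environment meets the stated requirements. The following is a minimal sketch, not a script shipped with the repository: `parse_version` and `check_environment` are hypothetical names, and treating Python 3.11 as a minimum rather than an exact version is an assumption.

```python
import sys
from importlib import metadata


def parse_version(text):
    """Turn a version string such as '2.0.25' into a tuple of its leading integers."""
    numbers = []
    for piece in text.split("."):
        if not piece.isdigit():
            break  # stop at pre-release suffixes such as '0b1'
        numbers.append(int(piece))
    return tuple(numbers)


def check_environment(min_python=(3, 11), min_sqlalchemy=(2, 0)):
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(f"Python {sys.version_info[0]}.{sys.version_info[1]} is too old")
    try:
        installed = metadata.version("SQLAlchemy")
    except metadata.PackageNotFoundError:
        problems.append("SQLAlchemy is not installed")
    else:
        if parse_version(installed) < min_sqlalchemy:
            problems.append(f"SQLAlchemy {installed} is older than 2.0")
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print("Problem:", problem)
```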
This section will explain how to use SQL and SQLAlchemy to interact with the database to answer the research questions outlined in some of the future work we propose.
One option for interacting with the metadata dataset is plain SQL. The metadata dataset is stored in a SQLite database file called PeaTMOSS.db, which can be found in the Globus Share. This file can be queried with standard SQL, for example from a terminal using sqlite3: https://sqlite.org/cli.html. Single queries can be executed like
$ sqlite3 PeaTMOSS.db '{query statement}'
Alternatively, you can start an SQLite instance by simply executing
$ sqlite3 PeaTMOSS.db
which can be terminated with CTRL + D or .quit. To write query output to a file, use the .output command:
sqlite> .output {filename}.txt
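The same query-to-file workflow can also be scripted with Python's built-in `sqlite3` module. This is a sketch rather than a tool from the repository: `query_to_file` is a hypothetical helper, and the `|` separator with empty strings for NULL is chosen to imitate the default list output of the sqlite3 shell.

```python
import sqlite3


def query_to_file(db_path, query, out_path, params=()):
    """Run one query and write each row to out_path as a '|'-separated line."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(query, params).fetchall()
    finally:
        conn.close()
    with open(out_path, "w", encoding="utf-8") as handle:
        for row in rows:
            # NULL values become empty strings, as in the sqlite3 shell.
            line = "|".join("" if value is None else str(value) for value in row)
            handle.write(line + "\n")
    return len(rows)
```

For example, `query_to_file("PeaTMOSS.db", "{query statement}", "{filename}.txt")` writes the result rows to the file and returns how many rows were written.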
The following example has to do with research question GH2: "What do developers on GitHub discuss related to PTM use, e.g., in issues, and pull requests? What are developers’ sentiments regarding PTM use? Do the people do pull requests of PTMs have the right expertise?"
If someone wants to observe what developers on GitHub are currently discussing related to PTM usage, they can look at discussions in GitHub issues and pull requests. The following SQLite example shows queries that would help accomplish this task.
- First, we will create an sqlite3 instance:
$ sqlite3 PeaTMOSS.db
- Then, we will create an output file for our issues query, then execute that query:
sqlite> .output issues.txt
sqlite> SELECT id, title FROM github_issue WHERE state = 'OPEN' ORDER BY updated_at DESC LIMIT 100;
The above query selects the id and title fields from the github_issue table and returns the 100 most recently updated issues that are still open.
- Next, we will create an output file for our pull requests query, then execute that query:
sqlite> .output pull_requests.txt
sqlite> SELECT id, title FROM github_pull_request WHERE state = 'OPEN' OR state = 'MERGED' ORDER BY updated_at DESC LIMIT 100;
Notice that the query is very similar to the issues query, as we are looking for similar information. It selects the id and title fields from the github_pull_request table and returns the 100 most recently updated pull requests that are either open or merged.
Querying this data can assist when beginning to observe current/recent discussions in GitHub about PTMs. From here, you may adjust these queries to include more/less entries by changing the LIMIT value, or you may adjust which fields the queries return. For example, if you want more detailed information you could select the "body" field in either table.
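The adjustments described above (a different LIMIT, extra fields such as body, other states) can be wrapped in one small function. The sketch below uses Python's built-in `sqlite3` module; `recent_discussions` and the two allow-lists are hypothetical, and only the table and column names mentioned in this section (id, title, body, state, updated_at) are assumed to exist.

```python
import sqlite3

# Assumption: only tables and columns named in this section are allowed.
ALLOWED_TABLES = {"github_issue", "github_pull_request"}
ALLOWED_FIELDS = {"id", "title", "body"}


def recent_discussions(conn, table, fields=("id", "title"), states=("OPEN",), limit=100):
    """Return the most recently updated rows of table in the given states."""
    if table not in ALLOWED_TABLES:
        raise ValueError(f"unexpected table: {table}")
    unknown = set(fields) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    # Identifiers are checked against the allow-lists above; values are bound
    # as parameters so they are never spliced into the SQL text.
    placeholders = ", ".join("?" for _ in states)
    sql = (
        f"SELECT {', '.join(fields)} FROM {table} "
        f"WHERE state IN ({placeholders}) "
        "ORDER BY updated_at DESC LIMIT ?"
    )
    return conn.execute(sql, (*states, limit)).fetchall()
```

With a connection opened on PeaTMOSS.db, `recent_discussions(conn, "github_pull_request", fields=("id", "title", "body"), states=("OPEN", "MERGED"), limit=20)` would correspond to the pull request query with the body field added and a smaller limit.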
This section will include more details about the demo provided in the repository, PeaTMOSS_demo.py. Once again, this method requires the PeaTMOSS.db file, which can be found in the Globus Share. Prior to running this demo, ensure that the conda environment has been created and activated, or you may run into errors.
The purpose of the demo, as described by the comment at the top of its file, is to demonstrate how one may use SQLAlchemy to address one of the research questions. The question being addressed in the demo is I1: "It can be difficult to interpret model popularity numbers by download rates. To what extent does a PTM’s download rates correlate with the number of GitHub projects that rely on it, or the popularity of the GitHub projects?". The demo accomplishes this by looking at two main fields: the number of times a model is downloaded from its model hub, and the number of times a model is reused in a GitHub repository. The demo finds the 100 most downloaded models and counts how many times each of those models is reused. Users can take this information and attempt to find a correlation.
PeaTMOSS_demo.py utilizes PeaTMOSS.py, which describes the structure of the database so that we may interact with it using SQLAlchemy. To begin, you must create an SQLAlchemy engine using the database file
import sqlalchemy
engine = sqlalchemy.create_engine(f"sqlite:///{path}")
where path is a string that describes the filepath to the database file. Both relative and absolute file paths can be used.
To find the 100 most downloaded models, we will query the model table
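import sqlalchemy
from sqlalchemy.orm import Session
import PeaTMOSS  # needed for the PeaTMOSS.model_to_reuse_repository references used later
from PeaTMOSS import Model
# Select the ID, name (context_id), and download count of the 100 most downloaded models
query_name_downloads = sqlalchemy.select(Model.id, Model.context_id, Model.downloads).order_by(sqlalchemy.desc(Model.downloads)).limit(100)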
and execute the query inside a session bound to the engine (for example, one opened with Session(engine) as session)
models = session.execute(query_name_downloads).all()
For each of these models, we want to know how many times it is being reused. The model_to_reuse_repository table contains fields for model IDs and reuse repository IDs, effectively linking them together. If a model is reused in multiple repositories, its ID will show up multiple times in the model_to_reuse_repository table. We can therefore check whether these highly downloaded models are also highly reused by querying the model_to_reuse_repository table and selecting only the entries where the model_id field equals the current model's ID:
for model in models:
    #...
    query_num_reuses = sqlalchemy.select(PeaTMOSS.model_to_reuse_repository.columns.model_id)\
        .where(PeaTMOSS.model_to_reuse_repository.columns.model_id == model.id)
This query selects every row of the model_to_reuse_repository table in which the current model's ID appears. If we execute this query and count the number of elements in the result, we have the number of times that model has been reused:
num_reuses = len(session.execute(query_num_reuses).all())
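Running one query per model means 100 separate round trips. As an alternative sketch (not what the demo does), the counting can be pushed into a single grouped SQL query. The example uses Python's built-in `sqlite3` module so that it runs without PeaTMOSS.py; `count_reuses` is a hypothetical helper, and only the model_to_reuse_repository table and its model_id column are taken from the text above.

```python
import sqlite3


def count_reuses(conn, model_ids):
    """Return {model_id: number of rows in model_to_reuse_repository}."""
    counts = {model_id: 0 for model_id in model_ids}
    if not counts:
        return counts
    placeholders = ", ".join("?" for _ in counts)
    sql = (
        "SELECT model_id, COUNT(*) FROM model_to_reuse_repository "
        f"WHERE model_id IN ({placeholders}) GROUP BY model_id"
    )
    # Models with no reuse rows are absent from the result and keep a count of 0.
    for model_id, reuse_count in conn.execute(sql, tuple(counts)):
        counts[model_id] = reuse_count
    return counts
```

The per-model loop is easier to follow, while the grouped query lets the database do the counting and returns only one row per reused model.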
In each iteration of the loop we can store this information in dictionaries, where the keys can be the names of the models:
for model in models:
    highly_downloaded[model.context_id] = model.downloads
    #...
    #...
    reused_rates[model.context_id] = num_reuses
And then at the end, we can simply print the results. From there, users may observe a level of correlation using a method they see fit.
@inproceedings{PeaTMOSS,
title={PeaTMOSS: A Dataset and Initial Analysis of Pre-Trained Models in Open-Source Software},
author={Wenxin Jiang and Jerin Yasmin and Jason Jones and Nicholas Synovic and Jiashen Kuo and Yuan Tian and George K. Thiruvathukal and James C. Davis},
booktitle={Proceedings of the 21st Annual Conference on Mining Software Repositories (MSR'24)},
year={2024}
}