Stupidly simple Python library to track machine learning experiments as well as features in an Excel file
VevestaX is an open source Python package for ML engineers and data scientists. It includes modules for tracking features sourced from data, engineered features, and variables. The output is an Excel file with the following tabs: data sourcing, feature engineering, modelling, performance plots (which track variables such as accuracy across multiple experiments), and EDA plots. The library can be used with Jupyter notebooks, IDEs like Spyder, Colab, Kaggle notebooks, or when running a Python script from the command line. VevestaX is framework agnostic: you can use it with any machine learning or deep learning framework.
- How to Install VevestaX
- How to import VevestaX and create the experiment object
- How to extract features present in input pandas dataframe
- How to extract engineered features
- How to track variables used
- How to track all variables in the code while writing less code
- How to write the features and modelling variables in a given Excel file
- How to commit file, features and parameters to Vevesta
- Snapshots of output excel file
- How to speed up the code
pip install vevestaX
#import the vevesta Library
from vevestaX import vevesta as v
V=v.Experiment()
#read the dataset
import pandas as pd
df=pd.read_csv("salaries.csv")
df.head(2)
#Extract the columns names for features
V.ds=df
# you can also use:
# V.dataSourcing = df
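As described later in this README, the Data Sourcing tab marks each feature with 1 when it was present in a run and 0 when it was absent. The sketch below shows one way such a table could be built from the column names of each run. It is only an illustration using plain Python lists: the function `presence_matrix` and the column names are hypothetical and are not part of vevestaX.

```python
def presence_matrix(runs):
    """Build one 0/1 row per run over the union of all column names seen."""
    # Keep first-seen order so the header stays stable as runs are added.
    all_columns = []
    for columns in runs:
        for name in columns:
            if name not in all_columns:
                all_columns.append(name)
    # 1 if the run contained the column, 0 otherwise.
    rows = [[1 if name in columns else 0 for name in all_columns]
            for columns in runs]
    return all_columns, rows


# Hypothetical column names read in two separate runs.
run_1 = ["Age", "Gender", "Salary"]
run_2 = ["Age", "Salary", "Months_Count"]

header, rows = presence_matrix([run_1, run_2])
print(header)
for row in rows:
    print(row)
```

Each inner list plays the role of one experiment row in the sheet, so a column dropped in a later run shows up as a 0 rather than disappearing from the header.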
Code snippet:
#Extract features engineered
V.fe=df
# you can also use:
# V.featureEngineering = df
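One plausible reading of "engineered features" is the set of columns that exist after feature engineering but were not in the sourced data. The README does not spell out the exact rule the library applies, so treat the following as a sketch under that assumption. The helper `engineered_columns` is hypothetical and works on plain lists of column names.

```python
def engineered_columns(sourced, after_engineering):
    """Return columns present after engineering but absent from the source."""
    sourced_set = set(sourced)
    # Preserve the order in which the new columns appear.
    return [name for name in after_engineering if name not in sourced_set]


# Hypothetical column lists before and after adding a ratio feature.
sourced = ["Age", "Salary"]
after_engineering = ["Age", "Salary", "salary_Ratio1"]

new_features = engineered_columns(sourced, after_engineering)
print(new_features)
```

With a pandas dataframe, the two lists would correspond to `list(df.columns)` taken before and after the engineering step.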
V.start() and V.end() form a code block and can be called multiple times in the code to track the variables used within each block. Any technique, such as XGBoost or a decision tree, can be used within this code block. All variables computed between V.start() and V.end() will be tracked. If V.start() and V.end() are not used, all the variables used in the code will be tracked.
Code snippet:
#Track variables which have been used for modelling
V.start()
# you can also use: V.startModelling()
# All the variables mentioned here will be tracked
epochs=100
seed=3
accuracy = computeAccuracy() #this will be computed variable
recall = computeRecall() #This will be computed variable
loss='rmse'
#end tracking of variables
V.end()
# or, you can also use : V.endModelling()
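To make the idea of a tracked code block concrete, here is a minimal sketch of one way variables could be captured between a start and an end point: take a snapshot of the namespace before and after, then keep the primitive-valued names that are new or changed. This is an illustration only and is not taken from the vevestaX source. The helpers `snapshot` and `tracked_between` are hypothetical.

```python
PRIMITIVES = (int, float, str, bool)


def snapshot(namespace):
    """Copy the primitive-valued, non-private names from a namespace dict."""
    return {name: value for name, value in namespace.items()
            if isinstance(value, PRIMITIVES) and not name.startswith("_")}


def tracked_between(before, after):
    """Names that are new, or whose value changed, between two snapshots."""
    return {name: value for name, value in after.items()
            if name not in before or before[name] != value}


# A dict stands in for the notebook or script namespace.
namespace = {"seed": 3}
before = snapshot(namespace)          # plays the role of the start call

namespace.update({"epochs": 100, "loss": "rmse", "model": object()})

after = snapshot(namespace)           # plays the role of the end call
tracked = tracked_between(before, after)
print(tracked)
```

In this sketch `seed` is left out because it was set before the block and did not change, and `model` is left out because it is not a primitive value.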
You can also omit the V.start() and V.end() function calls entirely. In that case, all variables of primitive data types used in the code are tracked and written to the Excel file by default. Note: on Colab and Kaggle, the V.start() and V.end() feature has not been rolled out yet. Instead, all the variables used in the code are tracked by default.
# Dump the data sourcing, engineered features and tracked variables into an xlsx file
V.dump(techniqueUsed='XGBoost',filename="vevestaDump1.xlsx",message="XGboost with data augmentation was used",version=1)
Alternatively, write the experiment into the default file, vevesta.xlsx. Code snippet:
V.dump(techniqueUsed='XGBoost')
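Each call to the dump function adds one experiment to the output file. The sketch below imitates that append-one-row-per-run behaviour with a CSV file from the standard library, so the idea can be tried without Excel. It is not how vevestaX writes its workbook: the function `append_run`, the `experimentID` column name and the file name are all assumptions made for illustration.

```python
import csv
import os
import tempfile


def append_run(path, values):
    """Append one run as a row and return the experiment ID assigned to it."""
    existing = []
    if os.path.exists(path):
        with open(path, newline="") as handle:
            existing = list(csv.DictReader(handle))

    record = {"experimentID": str(len(existing) + 1)}
    record.update({key: str(value) for key, value in values.items()})
    existing.append(record)

    # The header is the union of all keys seen so far, in first-seen order.
    fieldnames = []
    for row in existing:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(existing)
    return record["experimentID"]


with tempfile.TemporaryDirectory() as folder:
    log_path = os.path.join(folder, "runs.csv")
    first_id = append_run(log_path, {"techniqueUsed": "XGBoost", "epochs": 100})
    second_id = append_run(log_path, {"techniqueUsed": "XGBoost", "seed": 3})
    with open(log_path, newline="") as handle:
        logged = list(csv.DictReader(handle))

print(first_id, second_id)
print(logged)
```

A variable that appears in only some runs simply leaves an empty cell in the other rows, which mirrors how a shared sheet can hold experiments with different sets of tracked variables.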
Vevesta is a next generation knowledge repository/GitHub for data science projects. The tool is free to use. Please create a login on Vevesta. Then go to the Setting section and download the access token. Place this token in the same folder as the Jupyter notebook or Python script. If by chance you face difficulties, please mail [email protected].
You can commit the file (code), features and parameters to Vevesta by using the following command. You will find the project ID for your project on the home page.
Code Snippet:
V.commit(techniqueUsed = "XGBoost", message="increased accuracy", version=1, projectId=1, attachmentFlag=True)
A sample output Excel file has been uploaded to Google Sheets. Its URL is here
After the dump or commit function is called for each run of the code, the features used, the features engineered and the variables used in the experiments get logged into the Excel file. In the experiment below, the commit/dump function is called 6 times, and each time an experiment/code run is written into the Excel sheet.
For example, a code snippet used to track code runs/experiments is shown below:
#import the vevesta Library
from vevestaX import vevesta as v
import pandas as pd
V=v.Experiment()
df = pd.read_csv("wine.csv")
V.ds = df
df["salary_Ratio1"] = df["alchol_content"]/5
V.fe = df
epoch = 1000
accuracy = 90 #this will be a computed variable, may be an output of XGBoost algorithm
recall = 89 #this will be a computed variable, may be an output of XGBoost algorithm
#write this run as one row into the default file, vevesta.xlsx
V.dump(techniqueUsed='XGBoost')
For the above code snippet, each row in the Excel sheet corresponds to an experiment/code run. The Excel sheet will have the following:
- Data Sourcing tab: Marks which features (or columns) in wine.csv were read from the input file. Presence of a feature is marked as 1 and absence as 0.
- Feature Engineering tab: Engineered features such as salary_Ratio1 exist as columns in the Excel sheet. A value of 1 means that the feature was engineered in that particular experiment and 0 means it was absent.
- Modelling tab: This tab tracks all the variables used in the code. For example, if a variable precision was computed in the experiment, then for experiment ID i, precision will be a column whose value is the computed precision. Note: V.start() and V.end() delimit code blocks that you may define, and the code can contain multiple such blocks. The variables in all these code blocks are tracked together. Suppose the code defines 3 code blocks: the first with precision, the second with recall and accuracy, and the third with epoch, seed and number of trees. Then all of these variables, namely precision, recall, accuracy, epoch, seed and number of trees, are tracked as one experiment and dumped in a single row under that experiment's ID. If no code blocks are defined, all the variables are logged in the Excel file.
- Messages tab: Data scientists like to create new files when they change the technique or approach to a problem. So every time you run the code, this tab records the experiment ID along with the name of the file that contained the variables, features and engineered features.
- EDA-correlation: Correlation is calculated on the input data automatically. EDA computation can be skipped by passing True when creating the object, as in v.Experiment(True). The following is the code snippet:
#import the vevesta Library
from vevestaX import vevesta as v
V=v.Experiment(True)
The library performs EDA on the data automatically. To speed up the run by skipping EDA, set the flag speedUp=True as shown in the code snippet.
#import the vevesta Library
from vevestaX import vevesta as v
V = v.Experiment(True)
# or you can also use:
#V=v.Experiment(speedUp = True)
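The EDA step computes correlations on the input data, and a full correlation table needs one coefficient per pair of columns, so the work grows quickly with the number of columns. The following sketch shows the computation in plain Python. It assumes the Pearson coefficient, which this README does not confirm, and the helpers `pearson` and `correlation_matrix` are hypothetical and not part of vevestaX.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


def correlation_matrix(columns):
    """Correlate every column with every other column (and with itself)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names}
            for a in names}


# Hypothetical numeric columns; y rises with x and z falls with x.
data = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
corr = correlation_matrix(data)
print(corr["x"]["y"], corr["x"]["z"])
```

With c columns this loop evaluates c × c coefficients, each over every row, which gives a rough sense of why skipping EDA can shorten runs on wide datasets.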
If you liked the library, please give us a GitHub star and retweet.
For additional features, explore our tool at Vevesta. For comments, suggestions and early access to the tool, reach out at [email protected]
Looking for beta users for the library. Register here
We at Vevesta Labs maintain this library and welcome feature requests. Find a detailed blog post about vevestaX on Medium.