forked from Vevesta/VevestaX

2 Lines of code to track features + machine learning experiments + EDA in a spreadsheet

rjarun8/VevestaX

 
 


VevestaX

A stupidly simple Python library to track machine learning experiments and features in an Excel file

VevestaX is an open source Python package for ML engineers and data scientists. It includes modules for tracking features sourced from data, engineered features, and variables. The output is an Excel file with tabs for data sourcing, feature engineering, modelling, performance plots that track variables (accuracy etc.) over multiple experiments, and EDA plots. The library can be used in Jupyter notebooks, IDEs such as Spyder, Colab, Kaggle notebooks, or when running a Python script from the command line. VevestaX is framework agnostic: you can use it with any machine learning or deep learning framework.

Table of Contents

  1. How to Install VevestaX
  2. How to import VevestaX and create the experiment object
  3. How to extract features present in input pandas dataframe
  4. How to extract engineered features
  5. How to track variables used
  6. How to track all variables in the code while writing less code
  7. How to write the features and modelling variables to a given Excel file
  8. How to commit file, features and parameters to Vevesta
  9. Snapshots of output excel file
  10. How to speed up the code

How to install VevestaX

pip install vevestaX

How to import VevestaX and create the experiment object

#import the vevesta Library
from vevestaX import vevesta as v
V=v.Experiment()

How to extract features present in input pandas dataframe

Code snippet:

#read the dataset
import pandas as pd
df=pd.read_csv("salaries.csv")
df.head(2)

#Extract the columns names for features
V.ds=df
# you can also use:
#   V.dataSourcing = df

How to extract engineered features

Code snippet:

#Extract features engineered
V.fe=df
# you can also use:
#   V.featureEngineering = df
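One way to think about what counts as an "engineered" feature is as a column that exists when `V.fe` is set but did not exist when `V.ds` was set. The sketch below illustrates that idea with plain lists of column names. It is an assumption about the concept, not VevestaX's actual implementation, and `engineered_columns` is a hypothetical helper.

```python
def engineered_columns(sourced, after_engineering):
    """Return columns present after feature engineering but absent from the sourced data.

    Hypothetical helper for illustration; not part of VevestaX.
    """
    sourced_set = set(sourced)
    # Preserve the order in which the columns appear after engineering.
    return [col for col in after_engineering if col not in sourced_set]


sourced = ["age", "salary"]                  # e.g. list(df.columns) at the time of V.ds = df
after = ["age", "salary", "salary_ratio"]    # e.g. list(df.columns) at the time of V.fe = df
print(engineered_columns(sourced, after))    # ['salary_ratio']
```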

How to track variables used

V.start() and V.end() delimit a code block and can be called multiple times in the code to track the variables used within each block. Any technique, such as XGBoost or a decision tree, can be used inside the block. All variables computed between V.start() and V.end() are tracked. If V.start() and V.end() are not used, all the variables used in the code are tracked.

Code snippet:

#Track variables which have been used for modelling
V.start()
# you can also use: V.startModelling()


# All the variables mentioned here will be tracked
epochs=100
seed=3
# computeAccuracy() and computeRecall() are placeholders for your own metric code
accuracy = computeAccuracy() # this will be a computed variable
recall = computeRecall() # this will be a computed variable
loss='rmse'


#end tracking of variables
V.end()
# or, you can also use : V.endModelling()
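To make the start/end idea concrete, here is a minimal sketch of block-based tracking: take a snapshot of the variables when a block starts, then record whatever is new or changed when it ends. This is only an illustration of the concept under that assumption, not how VevestaX is implemented. `BlockTracker` is a hypothetical class, and an explicit dictionary stands in for the script's variables.

```python
class BlockTracker:
    """Hypothetical helper, not part of VevestaX: records variables set inside a block."""

    def __init__(self):
        self.tracked = {}
        self._before = None

    def start(self, namespace):
        # Remember what the namespace looked like when the block began.
        self._before = dict(namespace)

    def end(self, namespace):
        # Keep every name that is new, or whose value changed, since start().
        for name, value in namespace.items():
            if name not in self._before or self._before[name] != value:
                self.tracked[name] = value
        self._before = None


tracker = BlockTracker()
scope = {"seed": 3}          # defined before any block, so it is not tracked

tracker.start(scope)         # first block
scope["epochs"] = 100
scope["loss"] = "rmse"
tracker.end(scope)

tracker.start(scope)         # second block
scope["accuracy"] = 0.9
tracker.end(scope)

# Variables from both blocks end up in one record.
print(tracker.tracked)       # {'epochs': 100, 'loss': 'rmse', 'accuracy': 0.9}
```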

How to track all variables in the code while writing less code

You can eliminate the V.start() and V.end() function calls entirely. By default, all variables of primitive data types used in the code are tracked and written to the Excel file. Note: on Colab and Kaggle, the V.start() and V.end() feature has not been rolled out; all the variables used in the code are tracked by default instead.
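As a rough illustration of what "primitive data type variables" could mean in practice, the sketch below filters a dictionary of variables down to those whose values are `int`, `float`, `str` or `bool`. The exact set of types VevestaX treats as primitive is an assumption here, and `primitive_variables` is a hypothetical helper.

```python
# Assumed set of primitive types; VevestaX may use a different one.
PRIMITIVE_TYPES = (int, float, str, bool)


def primitive_variables(namespace):
    """Return the entries whose values are primitive and whose names are not private."""
    return {
        name: value
        for name, value in namespace.items()
        if isinstance(value, PRIMITIVE_TYPES) and not name.startswith("_")
    }


namespace = {"epochs": 100, "loss": "rmse", "layers": [64, 32], "_tmp": 1, "shuffle": True}
print(primitive_variables(namespace))   # {'epochs': 100, 'loss': 'rmse', 'shuffle': True}
```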

How to write the features and modelling variables to a given Excel file

Code snippet:

# Dump the datasourcing, features engineered and the variables tracked in a xlsx file
V.dump(techniqueUsed='XGBoost',filename="vevestaDump1.xlsx",message="XGboost with data augmentation was used",version=1)

Alternatively, write the experiment to the default file, vevesta.xlsx. Code snippet:

V.dump(techniqueUsed='XGBoost')
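The effect of calling dump once per run is that each run becomes one more row in the same file. The sketch below mimics that behaviour with a CSV file and the standard library only. It is not VevestaX's code: the `append_experiment` helper and the `experimentID` column name are assumptions made for illustration, and CSV is swapped in for Excel to avoid extra dependencies.

```python
import csv
import os
import tempfile


def append_experiment(path, variables):
    """Append one experiment as a new row and return the ID assigned to it."""
    rows = []
    if os.path.exists(path):
        with open(path, newline="") as handle:
            rows = list(csv.DictReader(handle))
    experiment_id = len(rows) + 1
    rows.append({"experimentID": experiment_id, **variables})

    # Union of all column names in first-seen order, so new variables get new columns.
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)

    with open(path, "w", newline="") as handle:
        # restval fills the cells of variables that a given run did not define.
        writer = csv.DictWriter(handle, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return experiment_id


path = os.path.join(tempfile.mkdtemp(), "experiments.csv")
first_id = append_experiment(path, {"epochs": 100, "accuracy": 0.90})
second_id = append_experiment(path, {"epochs": 200, "recall": 0.89})
```

Here the second run adds a `recall` column, and the first run's `recall` cell is left empty.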

How to commit file, features and parameters to Vevesta

Vevesta is a next-generation knowledge repository (a GitHub for data science projects). The tool is free to use. Create a login on Vevesta, then go to the Setting section and download the access token. Place this token in the same folder as the Jupyter notebook or Python script. If by chance you face difficulties, please mail [email protected].

You can commit the file (code), features and parameters to Vevesta with the following command. You will find the project ID for your project on the home page.

Code Snippet:

V.commit(techniqueUsed = "XGBoost", message="increased accuracy", version=1, projectId=1, attachmentFlag=True)

A sample output Excel file has been uploaded to Google Sheets. Its URL is here

Snapshots of output excel file

After you call the dump or commit function in each run of the code, the features used, the features engineered and the variables used in the experiment are logged to the Excel file. In the experiment below, the commit/dump function is called 6 times, and each time an experiment/code run is written to the Excel sheet.

For example, a code snippet used to track code runs/experiments looks like this:

#import pandas and the vevesta Library
import pandas as pd
from vevestaX import vevesta as v
V=v.Experiment()
df = pd.read_csv("wine.csv")
V.ds = df
df["salary_Ratio1"] = df["alchol_content"]/5
V.fe = df
epoch = 1000
accuracy = 90 #this will be a computed variable, may be an output of XGBoost algorithm
recall = 89  #this will be a computed variable, may be an output of XGBoost algorithm
#write this run as one row in the excel file
V.dump(techniqueUsed='XGBoost')

For the above code snippet, each row in the Excel sheet corresponds to an experiment/code run. The Excel sheet will have the following:

  1. Data Sourcing tab: Marks which features (or columns) in wine.csv were read from the input file. Presence of a feature is marked as 1 and absence as 0.
  2. Feature Engineering tab: Engineered features such as salary_Ratio1 exist as columns in the Excel file. A value of 1 means the feature was engineered in that particular experiment, and 0 means it was absent.
  3. Modelling tab: This tab tracks all the variables used in the code. Say the variable precision was computed in the experiment; then, for experiment ID i, precision will be a column whose value is the computed precision. Note: V.start() and V.end() delimit code blocks that you may define, and the code can have multiple such blocks. The variables in all these blocks are tracked together. Suppose the code defines 3 code blocks: the first with precision, the second with recall and accuracy, and the third with epoch, seed and number of trees. Then, for a given experiment ID, all the variables, namely precision, recall, accuracy, epoch, seed and number of trees, are tracked as one experiment and dumped in a single row under that ID. If no code blocks are defined, all the variables are logged in the Excel file.
  4. Messages tab: Data scientists like to create new files when they change the technique or approach to the problem. So every time you run the code, it records the experiment ID along with the name of the file that contained the variables, features and engineered features.
  5. EDA-correlation: Correlation is calculated on the input data automatically. EDA computation can be skipped by passing True when creating the object, v.Experiment(True). The following is the code snippet:
#import the vevesta Library
from vevestaX import vevesta as v
V=v.Experiment(True)
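The 1/0 layout of the Data Sourcing and Feature Engineering tabs described above can be sketched as a presence matrix: one row per experiment, one column per feature seen in any experiment. The helper below is a hypothetical illustration in plain Python, not VevestaX's code, and the feature names are made up.

```python
def presence_matrix(experiments):
    """Map each experiment's feature list to a row of 1/0 flags over all features seen."""
    all_features = []
    for features in experiments:
        for feature in features:
            if feature not in all_features:
                all_features.append(feature)
    # 1 if the feature was present in that experiment, 0 otherwise.
    rows = [[1 if f in features else 0 for f in all_features] for features in experiments]
    return all_features, rows


header, rows = presence_matrix([
    ["alcohol", "ash"],                # experiment 1
    ["alcohol", "ash", "magnesium"],   # experiment 2
    ["alcohol", "magnesium"],          # experiment 3
])
print(header)   # ['alcohol', 'ash', 'magnesium']
for row in rows:
    print(row)  # [1, 1, 0] then [1, 1, 1] then [1, 0, 1]
```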

Screenshots in the original README show, in order: the Sourced Data tab, the Feature Engineering tab, the Modelling tab, the Messages tab, the EDA-correlation tab, and the experiments performance plots.

How to speed up the code

The library does EDA automatically on the data. In order to accelerate compute and skip EDA, set the flag speedUp=True as shown in the code snippet.

#import the vevesta Library
from vevestaX import vevesta as v
V = v.Experiment(True)
# or you can also use:
#   V = v.Experiment(speedUp = True)
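To give a feel for the kind of work the flag avoids, the sketch below computes a pairwise Pearson correlation table in plain Python and skips it entirely when a `speed_up` flag is set. It assumes Pearson correlation and is not VevestaX's implementation; `pearson` and `correlation_table` are hypothetical helpers.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (undefined for constant columns)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_table(columns, speed_up=False):
    """Return {(col_a, col_b): correlation}, or an empty dict when speed_up is set."""
    if speed_up:
        # Skip the quadratic pairwise computation entirely.
        return {}
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}


columns = {"x": [1, 2, 3, 4], "y": [2, 4, 6, 8], "z": [4, 3, 2, 1]}
table = correlation_table(columns)
skipped = correlation_table(columns, speed_up=True)
```

The cost of the full table grows with the square of the number of columns, which is why skipping it can matter on wide datasets.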

If you liked the library, please give us a GitHub star and retweet.

For additional features, explore our tool at Vevesta. For comments, suggestions and early access to the tool, reach out at [email protected]

We are looking for beta users for the library. Register here

We at Vevesta Labs maintain this library and welcome feature requests. Find a detailed blog post on VevestaX on Medium
