This post continues the previous one: How to Version Control your Machine Learning - I. If you have not read it yet, I recommend having a look at it first to understand the terminology better.
For the Ninjas out there in Version Control and Machine Learning, you can go ahead.
By now we already know the importance of Version Control, so let’s go ahead and implement it to see its real use.
Before going ahead, make sure DVC is installed on the system. In this post DVC is installed with the pip command, so first make sure that pip itself is installed. This can be checked using the command:
$ pip -V
Once we are sure that pip is installed, we can go ahead and install DVC with the following command:
$ pip install dvc
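If you prefer to run both checks from Python, here is a minimal sketch. The helper name tool_version is my own (hypothetical), and it only assumes that pip and dvc print a version line when called with -V and --version respectively.

```python
import shutil
import subprocess


def tool_version(name, flag="--version"):
    """Return the first line a command-line tool prints for its version flag.

    Returns None when the tool cannot be found on PATH.
    """
    path = shutil.which(name)
    if path is None:
        return None
    result = subprocess.run([path, flag], capture_output=True, text=True)
    output = (result.stdout or result.stderr).strip()
    return output.splitlines()[0] if output else ""


def report():
    """Print one line per tool: its version line, or a 'not found' message."""
    for tool, flag in (("pip", "-V"), ("dvc", "--version")):
        version = tool_version(tool, flag)
        print(f"{tool}: {version if version is not None else 'not found on PATH'}")
```

Calling report() prints one line for pip and one for dvc. A "not found on PATH" line means the corresponding install step above still has to be done.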
Once we have DVC installed in the system, let’s go ahead and take a real-life case and see how it works.
Example with Steps
For this I am taking code from Numerai, which allows any data scientist to build machine learning models on its data and submit predictions to control the capital in its hedge fund.
Numerai abstracts its financial data, so data scientists do not know what the data represents. The idea is that human biases and overfitting are overcome this way.
They also have a unique way of rewarding contributors with their own cryptocurrency, which they call Numeraire.
In February, Numerai announced Numeraire a cryptographic token to incentivize data scientists around the world to contribute artificial intelligence to our hedge fund (see Forbes, Wired, Smith+Crown). Earlier today, the Numeraire smart contract was deployed to Ethereum, and over 1.2 million tokens were sent to 19,000 data scientists around the world. — Source
I will not talk more about Numerai here, but I will definitely come back to it in my next post, which covers Cryptocurrency and Blockchain technology in detail.
Once signed in, you can download the Numerai data from the Numerai website, where the data is updated every 4 days. So it is possible that by the time you read this post, a new dataset is already available.
Irrespective of which dataset you download, the following steps will be almost the same, with a few modifications:
1. First initialize a git repo and put the downloaded code there.
$ mkdir numerai_code # create a directory for the repo
$ cd numerai_code
$ git init # initialize git
$ git add numerai_code_downloaded # add the downloaded code to the git repo
$ git commit -m 'Numerai code added'
$ git push # works only once a remote has been configured for the repo
2. Install and Initialize DVC repository
$ pip install dvc
$ dvc init
3. The downloaded dataset contains ‘numerai_training_data.csv’ and ‘numerai_tournament_data.csv’, which are used to train the model and to predict the results, respectively.
4. Now it is time to create a prediction model that makes predictions from the available dataset, and to put that file in the same repository (numerai_code) as above. Let’s call it ‘prediction.py’.
For the prediction model, I used an LSTM (Long Short-Term Memory) recurrent neural network in Python with Keras. To know more about the RNN and LSTM architecture and other deep learning terms, please have a look at this post:
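5. Run the Python code within DVC with the following command (dvc run is the syntax of older DVC releases; newer ones define a stage with dvc stage add and execute it with dvc repro):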
$ dvc run python prediction.py
6. The script saves its results in a CSV file under the name you assign, and the saved CSV can be submitted on the Numerai submission page.
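To make the shape of ‘prediction.py’ concrete, here is a minimal skeleton of the read, train, predict and write flow from steps 3 to 6. It is not my LSTM model: a constant baseline (the mean of the training target) stands in for it so that the sketch runs without Keras. The column names id, target and prediction and the output name predictions.csv are assumptions for illustration, not the official Numerai format.

```python
import csv
from statistics import mean


def train_baseline(train_path, target_col="target"):
    """Stand-in for model training: return the mean of the target column."""
    with open(train_path, newline="") as f:
        targets = [float(row[target_col]) for row in csv.DictReader(f)]
    return mean(targets)


def write_predictions(constant, tournament_path, out_path, id_col="id"):
    """Write one prediction row per tournament row; return the row count."""
    rows = 0
    with open(tournament_path, newline="") as src, \
            open(out_path, "w", newline="") as dst:
        writer = csv.writer(dst)
        writer.writerow([id_col, "prediction"])  # assumed header names
        for row in csv.DictReader(src):
            writer.writerow([row[id_col], constant])
            rows += 1
    return rows


def main(train_path="numerai_training_data.csv",
         tournament_path="numerai_tournament_data.csv",
         out_path="predictions.csv"):
    """Train on the training file, then predict for the tournament file."""
    constant = train_baseline(train_path)
    return write_predictions(constant, tournament_path, out_path)
```

In a real script you would call main() at the bottom of the file and replace train_baseline with the actual model. What matters for version control is that the inputs and the output are plain files, since those are what get tracked as dependencies and results.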
NOTE: DVC automatically derives the dependencies between the steps and builds the dependency graph (DAG) transparently to the user. Also, if a file that other steps depend on changes, every step affected by that change will be reproduced.
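The idea behind that note can be illustrated with a toy sketch: fingerprint each file, find the files whose fingerprint changed, and follow the dependency graph to collect every step that has to run again. This is not DVC's actual implementation. The two-step pipeline, the step names and features.csv are hypothetical, and SHA-256 is simply the hash I picked for the example.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Fingerprint file contents (SHA-256 is just the choice for this sketch)."""
    return hashlib.sha256(data).hexdigest()


def changed_files(old_sums, new_sums):
    """Return the paths whose checksum differs between two snapshots."""
    return {path for path, digest in new_sums.items()
            if old_sums.get(path) != digest}


def stale_steps(steps, changed):
    """Return the names of all steps that must be re-run.

    A step is stale when one of its dependencies changed, or when a
    dependency is produced by another stale step.
    """
    dirty = set(changed)
    stale = set()
    progress = True
    while progress:  # repeat until no new stale step is found
        progress = False
        for name, step in steps.items():
            if name not in stale and dirty.intersection(step["deps"]):
                stale.add(name)
                dirty.update(step["outs"])  # its outputs will change too
                progress = True
    return stale


# Hypothetical two-step pipeline built around the files from this post.
PIPELINE = {
    "prepare": {"deps": ["numerai_training_data.csv"],
                "outs": ["features.csv"]},
    "predict": {"deps": ["features.csv", "prediction.py"],
                "outs": ["predictions.csv"]},
}
```

With this pipeline, editing only prediction.py marks just the predict step as stale, while a new numerai_training_data.csv marks both steps, because the output of prepare feeds into predict.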
Not only can DVC streamline your work into a single, reproducible environment, it also makes it easy to share this environment through Git, including the dependencies (DAG). This is an exciting collaboration feature that makes it possible to reproduce the research results on different computers. Moreover, since DVC does not push data files to Git repositories, you can share your data files through cloud storage services like AWS S3 or GCP Storage.
Check this blog post to know more about Data Version Control:
In the next post, I will write more about how the environment and the dependencies can be shared between different scientists working on the same project, and about sharing files over the cloud. Stay tuned!