MLOps and Data: Managing Large ML Datasets with DVC and S3
A quick start guide to version control for machine learning data
As part of a larger effort to test and evaluate different MLOps frameworks, the data science team at Anno.Ai recently tested out DVC to improve integration between our model repos on GitHub and our data and model storage on Amazon S3. In this article, we provide a quick guide to getting set up with DVC and some tips we learned along the way.
What is DVC?
DVC (Data Version Control) is an open-source application for machine learning project version control — think Git for data. In fact, the DVC syntax and workflow patterns are very similar to Git, making it intuitive to incorporate into existing repos.
While you could include small datasets in your existing repos on GitHub, GitLab, or Bitbucket, this approach does not scale to the large datasets of images, videos, and text documents that are often used for training machine learning models.
This is where DVC comes in, enabling sharing and version control of large datasets and model files used by data science teams. DVC also supports pipelines to version control the steps in a typical ML workflow (e.g., data transformation, feature engineering, augmentation, and training).
Our use case involves configuring DVC to manage data between Git and AWS S3, but DVC also supports other cloud-based storage systems including Google Drive, Google Cloud Storage, and Azure Blob Storage.
As prerequisites, you should have Python and Git installed, and have created and activated a virtual environment for the repo you’ll be integrating with DVC. To enable pushing data to S3, also make sure the AWS CLI is installed and properly configured with your credentials.
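A quick way to confirm the prerequisites is to check that each command-line tool is on your PATH. The following is a minimal sketch, not part of DVC: the tool names in REQUIRED_TOOLS are assumptions (for example, your Python executable may be named python3), and it does not check that your AWS credentials are configured.

```python
import shutil

# Command-line tools the prerequisites call for.
# Assumed executable names; adjust them if your system differs.
REQUIRED_TOOLS = ["python", "git", "aws"]


def missing_tools(tools):
    """Return the tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("All prerequisites found on PATH.")
```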
Setting up DVC
After setting up your environment you can install DVC. In this example, since we’re using DVC to manage data between Git and S3, we’ll use pip to install DVC as well as the S3 dependencies. You may also want to add this to your requirements.txt file.
(env) $ pip install "dvc[s3]"
After installing DVC, run:
(env) $ dvc init
This command creates a .dvc directory that holds DVC's configuration and internal tracking files. By default, DVC collects anonymized usage analytics and sends them to the DVC team's servers to help with usage tracking and troubleshooting. To opt out, run:
(env) $ dvc config core.analytics false
Finally, git add and commit your changes.
(env) $ git add .dvc/config
(env) $ git commit -m "initialize dvc"
Setting up the S3 Remote
First, set up your bucket (and subfolders, if desired) in S3. Then configure DVC to point to that remote and commit your configuration changes. The -d flag makes this the default remote, and storage is simply the name we chose for it:
(env) $ dvc remote add -d storage s3://your-bucket/storage
(env) $ git commit .dvc/config -m "set up s3 remote storage"
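To see what these commands recorded, you can inspect .dvc/config. The sketch below assumes the file looks like the EXAMPLE_CONFIG string (the exact layout may differ between DVC versions) and uses Python's standard configparser, which is not the parser DVC itself uses but handles this simple case. The helper name default_remote_url is ours.

```python
import configparser

# Assumed contents of .dvc/config after disabling analytics and adding
# the remote; the exact layout may differ between DVC versions.
EXAMPLE_CONFIG = """\
[core]
    analytics = false
    remote = storage
['remote "storage"']
    url = s3://your-bucket/storage
"""


def default_remote_url(config_text):
    """Return the URL of the default remote named under [core]."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser["core"]["remote"]
    # Remote sections are named like 'remote "storage"', quotes included.
    section = f"'remote \"{name}\"'"
    return parser[section]["url"]


if __name__ == "__main__":
    print(default_remote_url(EXAMPLE_CONFIG))
```

Because this file is committed to Git, anyone who clones the repo picks up the same default remote.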
Pushing Data to S3
Tip: The data that you want to push to S3 via DVC must be local to start. When we first started using DVC, all of our data were already stored in S3 but not completely mirrored locally. First make sure that your data is synced locally and then push your data and models back to S3 via DVC so they can be versioned and tracked. At the time of writing, the DVC team is working on native support for adding files that are already in S3 (or other external storage) directly to the remote.
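One way to confirm that the local mirror is complete before adding it to DVC is to compare a listing of the bucket against the local directory. This is only a sketch: missing_locally is a hypothetical helper, the keys shown are made up, and it assumes you have already obtained the object keys (relative to the prefix you mirrored) from a bucket listing.

```python
from pathlib import Path


def missing_locally(remote_keys, local_root):
    """Return the remote keys that have no matching file under local_root."""
    root = Path(local_root)
    return [key for key in remote_keys if not (root / key).is_file()]


if __name__ == "__main__":
    # Hypothetical object keys, e.g. copied from a listing of the bucket.
    keys = ["images/cat.jpg", "images/dog.jpg", "labels.csv"]
    for key in missing_locally(keys, "data"):
        print("not yet synced:", key)
```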
In this example, all of the data are in a directory called data. Add and push your data directory to the S3 remote you set up with DVC:
(env) $ dvc add data
(env) $ dvc push
When you add files, DVC creates a .dvc file in YAML format with the MD5 hash corresponding to the files or directories being tracked.
├── your-repo
│   ├── .dvc
│   ├── data
│   ├── data.dvc
...
If you open the .dvc file it will look something like this:
outs:
- md5: 821a6198470a18e8f801d382398bb0ac.dir
  path: data
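The md5 value is what lets DVC address data by content rather than by file name. As a rough illustration, the sketch below computes the MD5 digest of a single file and splits a hash into a two-character directory plus file name, which is how DVC typically arranges objects in its cache and remote. Both function names are ours, the surrounding path prefix varies by DVC version, and the .dir hash for a directory is derived from a manifest of per-file hashes, which is not reproduced here.

```python
import hashlib


def file_md5(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, read in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_relpath(md5_hex):
    """Split a hash into a two-character directory and the remaining name."""
    return f"{md5_hex[:2]}/{md5_hex[2:]}"


if __name__ == "__main__":
    # The directory hash from the example .dvc file above.
    print(cache_relpath("821a6198470a18e8f801d382398bb0ac.dir"))
```

Because objects are keyed by hash, a file whose contents have not changed between versions does not need to be uploaded again.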
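Finally, git add your .dvc file, along with the .gitignore that DVC updates to keep the data directory itself out of Git. Then commit and push:
(env) $ git add data.dvc .gitignore
(env) $ git commit -m "update tracked data"
(env) $ git push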
Your repo is now set up so that another team member, or you on another workstation, can clone the project and retrieve the same version of the data.
Pulling Data from S3
After cloning the project, simply run:
(env) $ dvc pull
Voila! The data will be pulled down to your local project.
While DVC fills a niche for versioning and managing ML data, it also offers other features that address additional aspects of the MLOps workflow, including pipelines and experiment tracking and visualization. The DVC team also recently launched a sister project, CML, for CI/CD of ML workflows. In our next post, we’ll step through and evaluate these features as well.
In addition, keep an eye out for our upcoming series where we evaluate DVC alongside other MLOps tools such as Kubeflow and MLflow for hosting ML pipelines and processing jobs, and review end-to-end ML workflow best practices for different environments and use cases.
About the Author
Ashley Antonides is Anno.Ai’s Chief Artificial Intelligence Officer. Ashley has over 15 years of experience leading the research and development of machine learning and computer vision applications in the national security, public health, and environmental sectors. Ashley holds a B.S. in Symbolic Systems from Stanford University and a Ph.D. in Environmental Science, Policy, and Management from the University of California, Berkeley.