MLOps and Data: Managing Large ML Datasets with DVC and S3


Background

What is DVC?


Prerequisites

Setting up DVC

(env) $ pip install "dvc[s3]"
(env) $ dvc init
(env) $ dvc config core.analytics false
(env) $ git add .dvc/config
(env) $ git commit -m "initialize dvc"

Setting up the S3 Remote

(env) $ dvc remote add -d storage s3://your-bucket/storage
(env) $ git commit .dvc/config -m "set up s3 remote storage"
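
For readers unsure how the remote URL breaks down, the sketch below splits `s3://your-bucket/storage` into a bucket name and a key prefix using Python's standard library. The helper name `split_s3_url` is hypothetical and not part of DVC; it only illustrates how such a URL is structured.

```python
from urllib.parse import urlparse


def split_s3_url(url: str) -> tuple[str, str]:
    """Split an s3:// URL into (bucket, key_prefix).

    Hypothetical helper for illustration only; DVC parses the URL itself.
    """
    parsed = urlparse(url)
    if parsed.scheme != "s3":
        raise ValueError(f"expected an s3:// URL, got {url!r}")
    # netloc holds the bucket; path holds the prefix with a leading slash.
    return parsed.netloc, parsed.path.lstrip("/")


bucket, prefix = split_s3_url("s3://your-bucket/storage")
print(bucket, prefix)
```

Here `your-bucket` is the bucket and `storage` is the prefix under which the data would be stored.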

Pushing Data to S3

(env) $ dvc add data
(env) $ dvc push
├── your-repo
│   ├── .dvc
│   ├── data
│   ├── data.dvc
...
outs:
- md5: 821a6198470a18e8f801d382398bb0ac.dir
  path: data
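
The `md5` field in `data.dvc` is a content hash. As a rough illustration of content hashing, the sketch below computes the MD5 of a single file in chunks with `hashlib`. It is a simplification: the `.dir` suffix indicates that the hash describes a whole directory, which DVC derives from a listing of the files inside rather than from one file's bytes, so this function is not expected to reproduce the value shown in `data.dvc`. The name `file_md5` is hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demo on a throwaway file so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.txt"
    sample.write_bytes(b"hello")
    print(file_md5(sample))
```

Because the hash depends only on content, two identical files produce the same digest regardless of their names.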
(env) $ git add data.dvc
(env) $ git commit -m "update tracked data"
(env) $ git push
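
If this add, push, and commit sequence is run often, it can be scripted. The sketch below only assembles the same commands as argument lists and, optionally, runs them with `subprocess`. The function names `update_commands` and `run_all`, the `dry_run` flag, and the default arguments are illustrative assumptions, not part of DVC or Git.

```python
import shlex
import subprocess


def update_commands(data_dir: str = "data",
                    message: str = "update tracked data") -> list[list[str]]:
    """Build the DVC and Git commands for publishing a new data version."""
    return [
        ["dvc", "add", data_dir],           # track the data, write <data_dir>.dvc
        ["dvc", "push"],                    # upload the data to the default remote
        ["git", "add", f"{data_dir}.dvc"],  # stage the small pointer file
        ["git", "commit", "-m", message],   # record the new data version
        ["git", "push"],                    # share the pointer file
    ]


def run_all(commands: list[list[str]], dry_run: bool = True) -> None:
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failure


run_all(update_commands())  # dry run: prints the commands without executing them
```

Defaulting to a dry run is a deliberate choice in this sketch, so the commands can be reviewed before anything is uploaded or committed.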

Pulling Data from S3

(env) $ dvc pull

Learn More


What’s Next

About the Author
