MLOps and Data: Managing Large ML Datasets with DVC and S3

What is DVC?

Setting up DVC

(env) $ pip install "dvc[s3]"
(env) $ dvc init
(env) $ dvc config core.analytics false
(env) $ git add .dvc/config
(env) $ git commit -m "initialize dvc"

Setting up the S3 Remote

(env) $ dvc remote add -d storage s3://your-bucket/storage
(env) $ git commit .dvc/config -m "set up s3 remote storage"

Pushing Data to S3

(env) $ dvc add data
(env) $ dvc push

├── your-repo
│   ├── .dvc
│   ├── data
│   ├── data.dvc
...

outs:
- md5: 821a6198470a18e8f801d382398bb0ac.dir
  path: data

(env) $ git add data.dvc
(env) $ git commit -m "update tracked data"
(env) $ git push
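
The hash stored in data.dvc is what links a Git commit to a specific version of the data in S3; the .dir suffix marks it as the hash of a directory listing rather than of a single file. As a rough sketch, the recorded hash can be read back with a small shell helper. The function name dvc_md5 is hypothetical, and the helper assumes the simple file layout shown above, with one md5 key under outs.

```shell
#!/bin/sh
# Hypothetical helper (not part of DVC): print the md5 value recorded
# for the first output listed in a .dvc file.
# Assumes the simple layout shown above, with an "md5:" key under "outs:".
dvc_md5() {
    # Keep only lines containing "md5:", strip everything but the value,
    # and stop after the first match.
    sed -n 's/^.*md5:[[:space:]]*\([^[:space:]]*\).*$/\1/p' "$1" | head -n 1
}

# Usage (from the repository root, after running `dvc add data`):
#   dvc_md5 data.dvc
```

Comparing this value before and after a change to the data directory is a quick way to confirm that dvc add picked up the new contents before committing data.dvc.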

Pulling Data from S3

(env) $ dvc pull

Learn More

What’s Next

