Datasets
A container for curated data
Datasets provide a container for curated data with lightweight versioning semantics. This page dives into the Gantry Dataset, its API, and how it can be used to make an ML pipeline reproducible.
Motivation
The success of machine learning projects requires a continual learning cycle, where ML teams continuously improve their models based on the latest data. In order to achieve this, teams need to keep updating their training and evaluation datasets, gathering not only fresh data, but the right data. Unlike the datasets found on Kaggle, the production datasets used in real-world applications are constantly changing. This makes it imperative to have a versioning tool that helps teams keep track of changes made to the dataset, and facilitates reproduction of the steps taken to build models. The Gantry Dataset is built to help users manage their dataset iterations, and make it straightforward to do ML the right way.
Concepts
Versioning
Versioning in Gantry Datasets happens at the file level: a change is recorded whenever any file in the dataset is added, modified, or deleted.
Data Model
The data model supporting this versioning is itself very simple. Branches are not currently supported, so no complex merge logic is required: the history is a linear chain of versions, and every operation (writing a new file, updating an existing file, or deleting a file) is recorded at the file level.
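For example, after a few pushes the history is nothing more than a chain of snapshots:

version 1 (initial commit) -> version 2 -> version 3 -> ... -> latest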
Each version has one, and only one, parent, and each process (a local editor, a remote training job, etc.) can pull any version of the dataset. To avoid accidentally overwriting a teammate's change, new versions can only be created on top of the latest version. All push_* operations produce a new version, and if two writes race, the later one fails because it tries to write on top of a version that is no longer the latest.
Dataset Operations
To work with Datasets, set the GANTRY_DATASET_WORKING_DIR environment variable. This is the directory that local datasets will be pulled to.
Keep in mind that your output will look slightly different as you work through this tutorial.
Create Dataset
The Curators documentation works through an example of creating Datasets using Curators. The below example creates a Dataset directly using the Gantry SDK.
import os
import gantry
import gantry.dataset as gdataset
DATASET_NAME = "demo_dataset"
WORKING_DIRECTORY = "..." # Absolute path to your working directory
os.environ["GANTRY_API_KEY"] = "..." # Gantry API key
os.environ["GANTRY_DATASET_WORKING_DIR"] = WORKING_DIRECTORY
gantry.init()
dataset = gdataset.create_dataset(DATASET_NAME)
The create_dataset() function takes an optional app_name parameter which, when specified, uses the application's schema to set the dataset schema.
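For example, to seed the Dataset's schema from an existing application (the application name below is a placeholder for one of your own):

# Create a dataset and copy its schema from an existing Gantry application
dataset_with_schema = gdataset.create_dataset("demo_dataset_with_schema", app_name="my-application")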
List Datasets
>> gdataset.list_datasets()
[{'name': 'demo_dataset_2',
'dataset_id': '03fb2465-d4b3-4bbd-940c-d1bf5b5880fb',
'created_at': 'Tue, 10 Jan 2023 16:46:21 GMT'},
{'name': 'demo_dataset_1',
'dataset_id': 'bfad188a-8a8e-4cc1-887e-9fee1d5dfb3f',
'created_at': 'Tue, 10 Jan 2023 16:42:21 GMT'},
{'name': 'demo_dataset',
'dataset_id': '60afc214-ea57-41de-a1e3-47649b5a5b28',
'created_at': 'Mon, 09 Jan 2023 16:47:48 GMT'}]
Pull dataset
Pulling a Dataset enables analyzing or modifying it locally. The pull method fetches the latest version of the Dataset by default; to pull an earlier version, set the optional version_id parameter. Pulled Datasets are placed in {WORKING_DIRECTORY}/{DATASET_NAME}.
Data can be added to Datasets without pulling them first. See the Append data
section below or check out the push
methods in the Gantry SDK documentation for more information.
# First get the dataset object using get_dataset API
>> dataset = gdataset.get_dataset(DATASET_NAME)
# Pull dataset to local working directory
>> dataset.pull()
{WORKING_DIRECTORY}/{DATASET_NAME} will have the following structure:
- README.md: serves as a normal readme and can be used to document dataset information such as where the data came from and how it should be used.
- dataset_config.yaml: the configuration file used to define data schemas. If a Gantry application name is specified on Dataset creation, the schema is automatically added to this file.
- Two folders, tabular_manifests and artifacts: Gantry only indexes CSV files inside the tabular_manifests folder; all non-CSV files (binary files such as images, audio, and video) should be placed in the artifacts folder. If the Dataset is updated from a Curator, these folders will contain one file per Selector per curation run.
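For example, a freshly pulled Dataset directory looks roughly like this (the files inside the two folders depend on the Dataset):

{WORKING_DIRECTORY}/{DATASET_NAME}/
├── README.md
├── dataset_config.yaml
├── tabular_manifests/
└── artifacts/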
Make and push local changes
To add data to a local Gantry Dataset repo, add a csv file to the {WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/
folder or copy any binary file into the {WORKING_DIRECTORY}/{DATASET_NAME}/artifacts/
folder. The existing Dataset files can also be modified or deleted directly.
import pandas as pd
df = pd.DataFrame(
[
{"name": "Novak", "rank": 1},
{"name": "Rafa", "rank": 2},
{"name": "Roger", "rank": 3}
]
)
df.to_csv(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/demo.csv", index=False)
The get_diff()
command returns the Dataset files that have been added, deleted, or modified.
>> dataset.get_diff()
{'new_files': ['tabular_manifests/demo.csv'],
'modified_files': [],
'deleted_files': []}
To push changes to the repo, call push_version. Every push creates a new version: the operation takes a snapshot of your local repo and uploads it to the Gantry dataset server, and internally Gantry optimizes storage by only uploading the files that changed. Once the version is created, it can be pulled from anywhere using the Gantry SDK. Pushing is only allowed when your local copy is based on the latest version of the Dataset.
>> dataset.push_version("add demo dataframe")
list_versions()
displays the Dataset edit history:
>> dataset.list_versions()
[{'version_id': '09575ee7-0407-44b8-ae88-765a8270b17a',
'dataset': 'demo_dataset',
'message': 'add demo dataframe',
'created_at': 'Wed, 08 Feb 2023 22:17:55 GMT',
'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
'is_latest_version': True},
{'version_id': 'b787034a-798b-4bb3-a726-0e197ddb8aff',
'dataset': 'demo_dataset',
'message': 'initial dataset commit',
'created_at': 'Wed, 08 Feb 2023 22:00:26 GMT',
'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
'is_latest_version': False}]
Time travel
Pull older version
Gantry stores multiple versions of the Dataset to track the evolution of data over time. Each version is a snapshot. Any version of a Dataset can be pulled locally by specifying version_id
. The following is an example of checking out a version:
>> dataset.pull(version_id = 'b787034a-798b-4bb3-a726-0e197ddb8aff')
{'version_id': 'b787034a-798b-4bb3-a726-0e197ddb8aff',
'dataset': 'demo_dataset',
'message': 'initial dataset commit',
'created_at': 'Wed, 08 Feb 2023 22:00:26 GMT',
'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
'is_latest_version': False}
If we look in {WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/ we will see that the newly added CSV file is gone: it does not exist in the dataset version we just pulled.
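A quick way to confirm this (a minimal check, reusing the paths defined earlier):

import os
# demo.csv no longer appears because the pulled snapshot predates that push
print(os.listdir(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests"))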
Rollback
To roll back to an older version of a Dataset, call rollback. This will fail unless your local copy is on the latest version of the dataset, so if you have checked out an older version (or someone has pushed a new change), run pull() before calling rollback():
# Make sure your local dataset is up to date
>> dataset.pull()
# Rollback to a previous version, and pull the target version data to dataset folder
>> dataset.rollback(version_id="b787034a-798b-4bb3-a726-0e197ddb8aff")
>> dataset.list_versions()
[{'version_id': '23bc4d35-0df2-424c-9156-d5ca105eb4c1',
'dataset': 'demo_dataset',
'message': 'Rollback dataset to version: b787034a-798b-4bb3-a726-0e197ddb8aff',
'created_at': 'Thu, 09 Feb 2023 00:36:07 GMT',
'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
'is_latest_version': True},
{'version_id': '09575ee7-0407-44b8-ae88-765a8270b17a',
'dataset': 'demo_dataset',
'message': 'add demo dataframe',
'created_at': 'Wed, 08 Feb 2023 22:17:55 GMT',
'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
'is_latest_version': False},
{'version_id': 'b787034a-798b-4bb3-a726-0e197ddb8aff',
'dataset': 'demo_dataset',
'message': 'initial dataset commit',
'created_at': 'Wed, 08 Feb 2023 22:00:26 GMT',
'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
'is_latest_version': False}]
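The following walk-through shows what happens if you try to push while your local copy is on an older version, and how to recover your local changes with stash() and restore():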
# Let's first pull an outdated dataset version to local repo
>> dataset.pull(version_id="b787034a-798b-4bb3-a726-0e197ddb8aff")
# Now make some local changes in your dataset folder
>> df.to_csv(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/demo_1.csv", index=False)
# Try to push a version and you will see an error
>> dataset.push_version("try to create a version on an old version")
DatasetHeadOutOfDateException Traceback
......
DatasetHeadOutOfDateException: Local HEAD not up to date! Your local version is behind the remote
# The operation fails because you are modifying an older version of your dataset.
# To fix it, stash your local changes. Gantry will cache all the modified
# files in a tmp folder so you will not lose your local changes.
>> dataset.stash()
# Now pull the latest dataset
>> dataset.pull()
# You can reapply the local change on top of the latest dataset
>> dataset.restore()
# When you stash, all new and modified files are copied into a tmp folder;
# restore copies them back. For deleted files, restore will redo the deletion.
# Now you can create a new version.
>> dataset.push_version("your version notes")
Append data
The following APIs can be used to append new data to the remote dataset repo without pulling the whole dataset locally. Note that these functions write directly to the remote and do not modify the local Dataset directory. Run dataset.pull()
to view these changes.
Add a pandas dataframe to the dataset
>> dataset.push_dataframe(df)
Add tabular file to the dataset
# add a new tabular data file to your dataset
>> dataset.push_tabular_file(open("{file_to_be_added}.csv"), "{dataset_file_name}.csv", "version info")
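For example, to append the demo CSV written earlier without pulling the Dataset (the file paths and destination name here are illustrative):

# Push a local CSV straight to the remote dataset's tabular_manifests folder
with open(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/demo.csv") as f:
    dataset.push_tabular_file(f, "demo_remote.csv", "append demo rows without a local pull")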
Use Dataset for training
Datasets have a built-in integration with HuggingFace Datasets. With the correct data schema configured in the dataset_config file, data can be loaded directly into a HuggingFace Dataset to be used for training or evaluation.
Note
To use this feature, all the CSV files inside the
tabular_manifests
folder must have the same columns and the same column data types.
Configure data schema
These steps are not required if the Dataset was generated by a Curator; Curators auto-populate the data schema.
An example dataset_config.yaml file without schema information:
artifacts:
type: folder
value: artifacts # do not modify
dataset_info: README.md
dataset_name: demo_dataset
tabular_files:
type: folder
value: tabular_manifests # do not modify
The following is an example of configuring the data schema. Add two sections to the dataset_config.yaml file:
- features: defines the type of each feature column
- labels: defines the type of each ground truth column

Check out data schema to see all the data types Gantry supports.
Sample CSV and corresponding yaml:
cat {WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/{file_name}.csv
name,rank
Novak,1
Rafa,2
Roger,3
We need to add the data schema for the name and rank columns:
artifacts:
type: folder
value: artifacts # do not modify
dataset_info: README.md
dataset_name: demo_dataset
tabular_files:
type: folder
value: tabular_manifests # do not modify
features:
name: Text
labels:
rank: Integer
Push the Dataset:
>> dataset.push_version("Add data schema to dataset config")
Load Huggingface dataset from Gantry Dataset with configured schema
# load gantry data into huggingface dataset
>> training_df = dataset.get_huggingface_dataset()
>> training_df[0]
{'name': 'Novak',
'rank': 1}
# Load your dataset into pytorch
>> training_df.set_format(type="torch", columns=["name", "rank"])
>> training_df.format
{'type': 'torch',
'format_kwargs': {},
'columns': ['name', 'rank'],
'output_all_columns': False}
All data will be loaded into the training split; support for multiple splits is planned.
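From here, the torch-formatted dataset can be iterated with a standard PyTorch DataLoader (a minimal sketch; the batch size and collation behaviour are assumptions about your training setup, not part of the Gantry SDK):

from torch.utils.data import DataLoader

# Iterate over the torch-formatted Gantry data in batches
loader = DataLoader(training_df, batch_size=2)
for batch in loader:
    # "rank" is collated into a torch tensor; "name" stays a list of strings
    print(batch["name"], batch["rank"])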