Datasets

A container for curated data

Datasets provide a container for curated data with lightweight versioning semantics. This page dives into the Gantry Dataset, its API, and how it can be used to make an ML pipeline reproducible.

Motivation

Successful machine learning projects require a continual learning cycle, in which ML teams continuously improve their models based on the latest data. To achieve this, teams need to keep updating their training and evaluation datasets, gathering not only fresh data but the right data. Unlike the datasets found on Kaggle, the production datasets used in real-world applications change constantly. This makes it imperative to have a versioning tool that helps teams keep track of changes made to a dataset and facilitates reproducing the steps taken to build models. The Gantry Dataset is built to help users manage their dataset iterations and make it straightforward to do ML the right way.

Concepts

Versioning

Versioning in Gantry Datasets happens at the file level: a change is recorded whenever any file in the dataset is added, modified, or deleted.

Data Model

The data model behind this versioning is itself very simple. Branches are not currently supported, so no complex merge logic is required: a dataset is a linear chain of versions, in which every operation (writing a new file, updating an existing file, or deleting a file) is recorded at the file level. You can picture it like this:

[Diagram: a linear chain of dataset versions, each with a single parent]

Essentially, each version has one and only one parent, and each process (e.g. a local editor or a remote training job) can pull any version of the dataset. To avoid accidentally overwriting a teammate's change, new versions can only be created on top of the latest version. All push_* operations produce a new version, and when writes race, the later one fails because it tries to write on top of a version that is no longer the latest.

Dataset Operations

To work with datasets, set the GANTRY_DATASET_WORKING_DIR environment variable. This is the directory into which local datasets will be pulled.

Keep in mind that your output will look slightly different throughout this tutorial.

Create Dataset

The Curators documentation works through an example of creating Datasets using Curators. The example below creates a Dataset directly with the Gantry SDK.

import os
import gantry
import gantry.dataset as gdataset


DATASET_NAME = "demo_dataset"
WORKING_DIRECTORY = "..." # Absolute path to your working directory

os.environ["GANTRY_API_KEY"] = "..." # Gantry API key
os.environ["GANTRY_DATASET_WORKING_DIR"] = WORKING_DIRECTORY

gantry.init()
dataset = gdataset.create_dataset(DATASET_NAME)

The create_dataset() function takes an optional app_name parameter which, when specified, uses that application's schema to set the dataset schema.
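
If your data is already tracked by a Gantry application, the dataset can inherit that application's schema at creation time. A minimal sketch, where the application name is a hypothetical placeholder:

# "my_application" is a hypothetical placeholder; use one of your own Gantry applications
dataset_with_schema = gdataset.create_dataset("demo_dataset_with_schema", app_name="my_application")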

List Datasets

>> gdataset.list_datasets()

[{'name': 'demo_dataset_2',
  'dataset_id': '03fb2465-d4b3-4bbd-940c-d1bf5b5880fb',
  'created_at': 'Tue, 10 Jan 2023 16:46:21 GMT'},
 {'name': 'demo_dataset_1',
  'dataset_id': 'bfad188a-8a8e-4cc1-887e-9fee1d5dfb3f',
  'created_at': 'Tue, 10 Jan 2023 16:42:21 GMT'},
 {'name': 'demo_dataset',
  'dataset_id': '60afc214-ea57-41de-a1e3-47649b5a5b28',
  'created_at': 'Mon, 09 Jan 2023 16:47:48 GMT'}]

Pull dataset

Pulling a Dataset enables analyzing or modifying it. The pull method pulls the latest version of the Dataset by default; to pull an earlier version, set the optional version_id parameter. Pulled Datasets are located at {WORKING_DIRECTORY}/{DATASET_NAME}.

Data can be added to Datasets without pulling them first. See the Append data section below or check out the push methods in the Gantry SDK documentation for more information.

# First get the dataset object using get_dataset API
>> dataset = gdataset.get_dataset(DATASET_NAME)

# Pull dataset to local working directory
>> dataset.pull()

{WORKING_DIRECTORY}/{DATASET_NAME} will have the following structure:

  • README.md: serves as a normal readme and can be used to document dataset information such as the source of the dataset, how it should be used, etc.

  • dataset_config.yaml: the configuration file used to define data schemas. If the Gantry application name is specified on Dataset creation, the schema will be automatically added to the config file.

  • Two folders, tabular_manifests and artifacts: Gantry will only index CSV files inside the tabular_manifests folder. All non-CSV files (binary files such as images, audio, and video) should be placed in the artifacts folder. If this Dataset is updated from a Curator, these folders will contain one file per Selector per curation run.
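
The layout can be confirmed locally with plain Python (not a Gantry API); the exact listing below is an assumption about a freshly created Dataset:

import os

# List the top level of the pulled dataset repo
print(sorted(os.listdir(os.path.join(WORKING_DIRECTORY, DATASET_NAME))))
# Expected to include: README.md, dataset_config.yaml, artifacts, tabular_manifests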

Make and push local changes

To add data to a local Gantry Dataset repo, add a CSV file to the {WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/ folder or copy any binary file into the {WORKING_DIRECTORY}/{DATASET_NAME}/artifacts/ folder. Existing Dataset files can also be modified or deleted directly.

import pandas as pd

df = pd.DataFrame(
  [
    {"name": "Novak", "rank": 1}, 
    {"name": "Rafa", "rank": 2}, 
    {"name": "Roger", "rank": 3}
  ]
)

df.to_csv(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/demo.csv", index=False)

The get_diff() command returns the Dataset files that have been added, deleted, or modified.

>> dataset.get_diff()
{'new_files': ['tabular_manifests/demo.csv'],
 'modified_files': [],
 'deleted_files': []}

To push changes to the repo, call push_version. Every time data is pushed, a new version is created: the operation takes a snapshot of your local repo and uploads it to the Gantry dataset server. Internally, Gantry optimizes storage by uploading only the changed files. Once a version is created, it can be pulled from anywhere using the Gantry SDK. Pushing is only allowed when the latest version of the Dataset is being modified.

>> dataset.push_version("add demo dataframe")

list_versions() displays the Dataset edit history:

>> dataset.list_versions()

[{'version_id': '09575ee7-0407-44b8-ae88-765a8270b17a',
  'dataset': 'demo_dataset',
  'message': 'add demo dataframe',
  'created_at': 'Wed, 08 Feb 2023 22:17:55 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': True},
 {'version_id': 'b787034a-798b-4bb3-a726-0e197ddb8aff',
  'dataset': 'demo_dataset',
  'message': 'initial dataset commit',
  'created_at': 'Wed, 08 Feb 2023 22:00:26 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': False}]

Time travel

Pull older version

Gantry stores multiple versions of the Dataset to track the evolution of data over time. Each version is a snapshot. Any version of a Dataset can be pulled locally by specifying version_id. The following is an example of checking out a version:

>> dataset.pull(version_id='b787034a-798b-4bb3-a726-0e197ddb8aff')

{'version_id': 'b787034a-798b-4bb3-a726-0e197ddb8aff',
 'dataset': 'demo_dataset',
 'message': 'initial dataset commit',
 'created_at': 'Wed, 08 Feb 2023 22:00:26 GMT',
 'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
 'is_latest_version': False}

Looking in {WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/, we will see the newly added CSV file is gone. That's because it doesn't exist in the dataset version we just pulled.
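
This can be confirmed with plain Python (not a Gantry API):

import os

# demo.csv should not appear here, since it was added after this version was created
print(os.listdir(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/"))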

Rollback

To roll back to an older version of a Dataset, call rollback. This will fail unless you are on the latest version of the dataset, so if you have checked out an older version (or someone has pushed a new change), run pull() before calling rollback():

# Make sure your local dataset is up to date
>> dataset.pull()

# Rollback to a previous version, and pull the target version data to dataset folder
>> dataset.rollback(version_id="b787034a-798b-4bb3-a726-0e197ddb8aff")
>> dataset.list_versions()

[{'version_id': '23bc4d35-0df2-424c-9156-d5ca105eb4c1',
  'dataset': 'demo_dataset',
  'message': 'Rollback dataset to version: b787034a-798b-4bb3-a726-0e197ddb8aff',
  'created_at': 'Thu, 09 Feb 2023 00:36:07 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': True},
 {'version_id': '09575ee7-0407-44b8-ae88-765a8270b17a',
  'dataset': 'demo_dataset',
  'message': 'add demo dataframe',
  'created_at': 'Wed, 08 Feb 2023 22:17:55 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': False},
 {'version_id': 'b787034a-798b-4bb3-a726-0e197ddb8aff',
  'dataset': 'demo_dataset',
  'message': 'initial dataset commit',
  'created_at': 'Wed, 08 Feb 2023 22:00:26 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': False}]

What happens if you try to push changes while your local copy is on an older version? The push will fail, and stash() and restore() let you carry your local changes over to the latest version:

# Let's first pull an outdated dataset version to the local repo
>> dataset.pull(version_id="b787034a-798b-4bb3-a726-0e197ddb8aff")

# Now make some local changes in your dataset folder 
>> df.to_csv(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/demo_1.csv", index=False)

# Try to push a version and you will see an error
>> dataset.push_version("try to create a version on an old version")

DatasetHeadOutOfDateException             Traceback
......
DatasetHeadOutOfDateException: Local HEAD not up to date! Your local version is behind the remote

# The operation fails because you are modifying an older version of your dataset.
# To fix it, stash your local changes. Gantry will cache all the modified
# files in a tmp folder so you will not lose your local changes.
>> dataset.stash()

# Now pull the latest dataset
>> dataset.pull()

# You can reapply the local change on top of the latest dataset
>> dataset.restore()

# When you stash, Gantry copies all new and modified files into a tmp folder;
# restore copies them back. For deleted files, restore will redo the deletion.


# Now you can create a new version.
>> dataset.push_version("your version notes")

Append data

The following APIs can be used to append new data to the remote dataset repo without pulling the whole dataset locally. Note that these functions write directly to the remote and do not modify the local Dataset directory. Run dataset.pull() to view these changes.

Add a pandas dataframe to the dataset

>> dataset.push_dataframe(df)

Add tabular file to the dataset

# add a new tabular data file to your dataset
>> dataset.push_tabular_file(open("{file_to_be_added}.csv"), "{dataset_file_name}.csv", "version info")

Use Dataset for training

Datasets have a built-in integration with HuggingFace Datasets. With the correct data schema configured in the dataset_config file, data can be loaded directly into a HuggingFace dataset to be used for training or evaluation.

📘

Note

To use this feature, all the CSV files inside the tabular_manifests folder must have the same columns and the same column data types.
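
A quick local sanity check with plain pandas (not part of the Gantry SDK) can catch mismatches before loading:

import glob
import pandas as pd

# Verify every CSV in tabular_manifests shares the same columns and dtypes
csv_paths = sorted(glob.glob(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/*.csv"))
reference = pd.read_csv(csv_paths[0])
for path in csv_paths[1:]:
    current = pd.read_csv(path)
    assert list(current.columns) == list(reference.columns), f"Column mismatch in {path}"
    assert list(current.dtypes) == list(reference.dtypes), f"Dtype mismatch in {path}"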

Configure data schema

These steps are not required if the Dataset was generated by a Curator; Curators auto-populate the data schema.

An example dataset_config.yaml file without schema information:

artifacts:
  type: folder
  value: artifacts # do not modify
dataset_info: README.md
dataset_name: demo_dataset
tabular_files:
  type: folder
  value: tabular_manifests # do not modify

The following is an example of configuring the data schema.

Add two sections to the dataset_config.yaml file:

  1. features, which defines the type of each feature column
  2. labels, which defines the type of each ground truth column.
    Check out data schema to see all the data types Gantry supports.

Sample CSV and corresponding yaml:

cat {WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/{file_name}.csv

name,rank
Novak,1
Rafa,2
Roger,3

We need to add the data schema for the name and rank columns:

artifacts:
  type: folder
  value: artifacts # do not modify
dataset_info: README.md
dataset_name: demo_dataset
tabular_files:
  type: folder
  value: tabular_manifests # do not modify
features:
  name: Text
labels:
  rank: Integer

Push the Dataset:

>> dataset.push_version("Add data schema to dataset config")

Load a HuggingFace dataset from a Gantry Dataset with the configured schema

# load gantry data into huggingface dataset
>> training_df = dataset.get_huggingface_dataset()

>> training_df[0]
{'name': 'Novak',
 'rank': 1}

# Load your dataset into pytorch
>> training_df.set_format(type="torch", columns=["name", "rank"])
>> training_df.format

{'type': 'torch',
 'format_kwargs': {},
 'columns': ['name', 'rank'],
 'output_all_columns': False}
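
From here, the formatted dataset can be fed to standard PyTorch tooling. A minimal sketch, assuming PyTorch is installed:

from torch.utils.data import DataLoader

# Iterate over the torch-formatted dataset in batches
loader = DataLoader(training_df, batch_size=2)
for batch in loader:
    print(batch["rank"])  # a tensor of ranks per batch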

All data will be loaded into the training split; support for multiple splits is planned.