Datasets

A versioned container for curated data

Successful machine learning projects depend on a continual learning cycle, in which ML teams keep improving their models based on the latest data. To achieve this, teams need to keep updating their training and evaluation datasets, gathering not just fresh data but the right data. Unlike the static datasets found on Kaggle, production datasets in real-world applications change constantly, which makes a versioning tool essential: one that helps teams track changes to a dataset and reproduce the steps taken to build models. Gantry Datasets are built to help users manage their dataset iterations and make it straightforward to do ML the right way.

Use Datasets to:

  1. Understand when your data was collected so you know you're always retraining on the most up-to-date data (datasets are versioned!)
  2. Manage data versions manually or automatically (via Curators)

Data Model

Versioning in Gantry Datasets happens at the file level: a new version is recorded whenever any file in the dataset changes. The data model supporting this is deliberately simple. Branches are not currently supported, so no complex merge logic is required; a dataset's history is a linear chain of versions, and every operation (writing a new file, updating an existing file, or deleting a file) is recorded at the file level.

Each version has exactly one parent, and any process (a local editor, a remote training job, and so on) can pull any version of the dataset. To avoid accidentally overwriting a teammate's change, Gantry only allows creating a new version on top of the latest version. All push_* operations produce a new version, so if two writers push concurrently, the later write fails because it is based on a version that is no longer the latest.
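
To make the conflict rule concrete, here is a toy sketch in plain Python. It is illustrative only, not the Gantry SDK: a linear chain of versions where every push must be based on the latest one.

# Toy model of a linear version chain -- illustrative only, not the Gantry SDK
class VersionChain:
    def __init__(self):
        self.versions = []  # each entry: {"parent": int, "files": dict, "message": str}

    def latest(self):
        return len(self.versions) - 1  # -1 while the chain is empty

    def push(self, base, files, message):
        # Pushes must build on the latest version, so of two concurrent
        # writers the later one fails: its base is no longer the head.
        if base != self.latest():
            raise RuntimeError("Local HEAD not up to date!")
        self.versions.append({"parent": base, "files": dict(files), "message": message})
        return self.latest()

chain = VersionChain()
v0 = chain.push(-1, {"a.csv": "..."}, "initial dataset commit")
v1 = chain.push(v0, {"a.csv": "...", "b.csv": "..."}, "add b.csv")
# A second writer still based on v0 now fails, just like a stale push_version:
# chain.push(v0, {"c.csv": "..."}, "stale write")  # raises RuntimeError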

Dataset Operations in the UI

Datasets can be created and managed visually through Workspaces. From the data panel, you can add data to a new or existing dataset.

Datasets can also be created and managed through the Datasets tab (Completion apps only):

Dataset Operations in the SDK

To work with datasets, set the GANTRY_DATASET_WORKING_DIR environment variable. This is the directory to which local datasets will be pulled.
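
For example, one way to set it from Python before pulling anything (the path here is just a placeholder):

import os

# Must be set before pulling datasets; replace the path with your own
os.environ["GANTRY_DATASET_WORKING_DIR"] = "/tmp/gantry-datasets"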

Create Dataset

The Curators documentation works through an example of creating Datasets using Curators. The example below creates a Dataset directly using the Gantry SDK.

import os
import gantry

GANTRY_API_KEY = os.environ.get("GANTRY_API_KEY")
GANTRY_APP_NAME = "my-app"

# Initialize the SDK, then create the dataset under the application
gantry.init(api_key=GANTRY_API_KEY, send_in_background=False)
application = gantry.get_application(GANTRY_APP_NAME)
dataset = application.create_dataset(name="new-dataset")

Creating the dataset on the application allows the dataset to inherit the application schema.

Uploading a CSV as a Dataset

import os
import gantry

GANTRY_API_KEY = os.environ.get("GANTRY_API_KEY")
GANTRY_APP_NAME = "my-app"

gantry.init(api_key=GANTRY_API_KEY, send_in_background=False)
application = gantry.get_application(GANTRY_APP_NAME)
dataset = application.create_dataset(name="new-dataset")

# Upload the contents of example.csv as a tabular file named example.csv
dataset.push_tabular_file(open("example.csv"), "example.csv")

List Datasets

import os
import gantry

gantry.init(api_key=os.environ.get("GANTRY_API_KEY"), send_in_background=False)
application = gantry.get_application("my-app")
datasets = application.list_datasets()
print(datasets)

[{'name': 'new-dataset', 'dataset_id': '8c83397c-e941-48f9-9667-e61b4fc7015c', 'created_at': '2023-05-24T17:48:27.556864Z'}, {'name': 'eval-data', 'dataset_id': 'e7802ba0-2330-4b24-9f3c-b39ea9783694', 'created_at': '2023-05-24T17:37:54.842528Z'}]

Pull datasets

Pulling a dataset lets you analyze or modify it locally. When a dataset is pulled, it appears in your file tree as a directory. The pull method retrieves the latest version of the dataset by default; to pull an earlier version, set the optional version_id parameter. Pulled datasets are stored at {WORKING_DIRECTORY}/{DATASET_NAME}.

Data can be added to datasets without pulling them first. See the Append data section below or check out the push methods in the Gantry SDK documentation for more information.

# First get the dataset object using the get_dataset API
# (gdataset is the Gantry dataset module, e.g. import gantry.dataset as gdataset)
>> dataset = gdataset.get_dataset(DATASET_NAME)

# Pull the dataset to the local working directory
>> dataset.pull()

{WORKING_DIRECTORY}/{DATASET_NAME} will have the following structure:

  • README.md: serves as a normal readme and can be used to document dataset information such as the source of the data, how it should be used, and so on.

  • dataset_config.yaml: the configuration file used to define data schemas. If the Gantry application name is specified on Dataset creation, the schema will be automatically added to the config file.

  • Two folders, tabular_manifests and artifacts: Gantry only indexes CSV files inside the tabular_manifests folder. All non-CSV files (binary files such as images, audio, and video) should be placed in the artifacts folder. If this Dataset is updated by a Curator, these folders will contain one file per Selector per curation run. A sketch of the resulting layout is shown directly below.
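
Put together, a freshly pulled dataset looks roughly like this (the CSV name is just an example):

{WORKING_DIRECTORY}/{DATASET_NAME}/
├── README.md
├── dataset_config.yaml
├── artifacts/
└── tabular_manifests/
    └── demo.csv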

Make and push local changes

To add data to a local Gantry dataset repo, add a CSV file to the {WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/ folder or copy any binary file into the {WORKING_DIRECTORY}/{DATASET_NAME}/artifacts/ folder. Existing dataset files can also be modified or deleted directly.

import pandas as pd

df = pd.DataFrame(
  [
    {"name": "Novak", "rank": 1}, 
    {"name": "Rafa", "rank": 2}, 
    {"name": "Roger", "rank": 3}
  ]
)

df.to_csv(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/demo.csv", index=False)

The get_diff() command returns the dataset files that have been added, deleted, or modified.

>> dataset.get_diff()
{'new_files': ['tabular_manifests/demo.csv'],
 'modified_files': [],
 'deleted_files': []}

To push changes to the repo, call push_version. Every push creates a new version: the operation takes a snapshot of your local repo and uploads it to the Gantry dataset server, and internally Gantry optimizes storage by uploading only the changed files. Once a version is created, it can be pulled from anywhere using the Gantry SDK. Pushing is only allowed when your local copy is based on the newest version of the Dataset.

>> dataset.push_version("add demo dataframe")

list_versions() displays the dataset edit history:

>> dataset.list_versions()

[{'version_id': '09575ee7-0407-44b8-ae88-765a8270b17a',
  'dataset': 'demo_dataset',
  'message': 'add demo dataframe',
  'created_at': 'Wed, 08 Feb 2023 22:17:55 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': True},
 {'version_id': 'b787034a-798b-4bb3-a726-0e197ddb8aff',
  'dataset': 'demo_dataset',
  'message': 'initial dataset commit',
  'created_at': 'Wed, 08 Feb 2023 22:00:26 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': False}]
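
The version_id values in this listing are what pull() and rollback() take. Since the newest version appears first in the output above, you can, for example, pick out the previous version programmatically (assuming list_versions() returns dictionaries shaped like those shown):

>> versions = dataset.list_versions()
>> previous_version_id = versions[1]["version_id"]  # second entry = one version back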

Time travel

Pull older version

Gantry stores multiple versions of the dataset to track the evolution of data over time. Each version is a snapshot. Any version of a Dataset can be pulled locally by specifying version_id. The following is an example of checking out a version:

>> dataset.pull(version_id='b787034a-798b-4bb3-a726-0e197ddb8aff')

{'version_id': 'b787034a-798b-4bb3-a726-0e197ddb8aff',
 'dataset': 'demo_dataset',
 'message': 'initial dataset commit',
 'created_at': 'Wed, 08 Feb 2023 22:00:26 GMT',
 'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
 'is_latest_version': False}

If you look in {WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/, you will see that the newly added CSV file is gone: it doesn't exist in the dataset version you just pulled.
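
A quick way to confirm this from Python, using the same WORKING_DIRECTORY and DATASET_NAME placeholders as above:

>> import os
>> os.listdir(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests")
# demo.csv will not appear -- it was added in a later version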

Rollback

To roll back to an older version of a Dataset, call rollback. This fails unless your local copy is on the latest version of the dataset; if you have checked out an older version (or someone has pushed a new change since), run pull() before calling rollback():

# Make sure your local dataset is up to date
>> dataset.pull()

# Rollback to a previous version, and pull the target version data to dataset folder
>> dataset.rollback(version_id="b787034a-798b-4bb3-a726-0e197ddb8aff")
>> dataset.list_versions()

[{'version_id': '23bc4d35-0df2-424c-9156-d5ca105eb4c1',
  'dataset': 'demo_dataset',
  'message': 'Rollback dataset to version: b787034a-798b-4bb3-a726-0e197ddb8aff',
  'created_at': 'Thu, 09 Feb 2023 00:36:07 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': True},
 {'version_id': '09575ee7-0407-44b8-ae88-765a8270b17a',
  'dataset': 'demo_dataset',
  'message': 'add demo dataframe',
  'created_at': 'Wed, 08 Feb 2023 22:17:55 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': False},
 {'version_id': 'b787034a-798b-4bb3-a726-0e197ddb8aff',
  'dataset': 'demo_dataset',
  'message': 'initial dataset commit',
  'created_at': 'Wed, 08 Feb 2023 22:00:26 GMT',
  'created_by': 'db459d6d-c83b-496d-b659-e48bca971156',
  'is_latest_version': False}]

Stash and restore local changes

If your local copy falls behind the remote, push_version will fail. The walkthrough below reproduces that failure, then uses stash and restore to recover the local changes:

# Let's first pull an outdated dataset version to the local repo
>> dataset.pull(version_id="b787034a-798b-4bb3-a726-0e197ddb8aff")

# Now make some local changes in your dataset folder 
>> df.to_csv(f"{WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/demo_1.csv", index=False)

# Try to push a version and you will see an error
>> dataset.push_version("try to create a version on an old version")

DatasetHeadOutOfDateException             Traceback
......
DatasetHeadOutOfDateException: Local HEAD not up to date! Your local version is behind the remote

# The operation fails because you are modifying an older version of the dataset.
# To fix it, stash your local changes. Gantry will cache all the modified
# files in a tmp folder so you will not lose your local changes.
>> dataset.stash()

# Now pull the latest dataset
>> dataset.pull()

# You can reapply the local changes on top of the latest dataset
>> dataset.restore()

# stash copies all new and modified files into a tmp folder, and
# restore copies them back. For deleted files, restore redoes the deletion.


# Now you can create a new version.
>> dataset.push_version("your version notes")

Append data

The following APIs can be used to append new data to the remote dataset repo without pulling the whole dataset locally. Note that these functions write directly to the remote and do not modify the local Dataset directory. Run dataset.pull() to view these changes.

Add a pandas dataframe to the dataset

>> dataset.push_dataframe(df)

Add a tabular file to the dataset

# add a new tabular data file to your dataset
>> dataset.push_tabular_file(open("{file_to_be_added}.csv"), "{dataset_file_name}.csv", "version info")

Use Dataset for training

Datasets have a built-in integration with HuggingFace Datasets. With the correct data schema configured in the dataset_config file, data can be loaded directly into a HuggingFace dataset to be used for training or evaluation.

📘 Note: To use this feature, all the CSV files inside the tabular_manifests folder must have the same columns and the same column data types.

Configure data schema

These steps are not required if the dataset was generated by a Curator; Curators auto-populate the data schema.

An example dataset_config.yaml file without schema information:

artifacts:
  type: folder
  value: artifacts # do not modify
dataset_info: README.md
dataset_name: demo_dataset
tabular_files:
  type: folder
  value: tabular_manifests # do not modify

The following example configures the data schema by adding two sections to the dataset_config.yaml file:

  1. features, which defines the type of each feature column
  2. labels, which defines the type of each ground truth column

Check out data schema to see all the data types Gantry supports.

Sample CSV and corresponding yaml:

cat {WORKING_DIRECTORY}/{DATASET_NAME}/tabular_manifests/{file_name}.csv

name,rank
Novak,1
Rafa,2
Roger,3

We need to add the data schema for the name and rank columns:

artifacts:
  type: folder
  value: artifacts # do not modify
dataset_info: README.md
dataset_name: demo_dataset
tabular_files:
  type: folder
  value: tabular_manifests # do not modify
features:
  name: Text
labels:
  rank: Integer

Push the Dataset:

>> dataset.push_version("Add data schema to dataset config")

Load a HuggingFace dataset from a Gantry Dataset with the configured schema

# Load Gantry data into a HuggingFace dataset
>> training_df = dataset.get_huggingface_dataset()

>> training_df[0]
{'name': 'Novak',
 'rank': 1}

# Load your dataset into pytorch
>> training_df.set_format(type="torch", columns=["name", "rank"])
>> training_df.format

{'type': 'torch',
 'format_kwargs': {},
 'columns': ['name', 'rank'],
 'output_all_columns': False}

All data is loaded into a single training split; support for multiple splits is planned.
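
Since get_huggingface_dataset() returns a standard HuggingFace dataset, one way to drive a training loop is through a PyTorch DataLoader. A minimal sketch, assuming torch is installed and the format was set as above:

from torch.utils.data import DataLoader

# Default collation stacks the integer ranks into tensors;
# string columns come through as lists of Python strings
loader = DataLoader(training_df, batch_size=2)
for batch in loader:
    print(batch["name"], batch["rank"])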