Improving model performance

Using curators and datasets to improve model performance.

Curators and datasets help transform production data streams into data artifacts that can be used in downstream workflows such as labeling and retraining.

To assess a machine learning model's performance, Gantry needs data about the model's behavior in the form of inputs, predictions, and any other relevant context. Curators and datasets are designed to make the data that Gantry collects actionable: curators encode insights about model performance as jobs that gather data into datasets, which can then be used to build a better model.

Curators

A curator pairs selection criteria with a run interval; Gantry uses the two to transform a production data stream into a dataset that can be labeled or used directly for retraining. For example, a curator might run once per day and take a uniform random sample of 100 datapoints per interval. The following Gantry SDK call would register such a curator with Gantry:

import gantry
from gantry.curators import UniformCurator

# Authenticate with your Gantry API key
gantry.init(api_key=GANTRY_API_KEY)

# Uniformly sample up to 100 datapoints per run
# from the "my-app" application's data stream
uniform_curator = UniformCurator(
    name="uniform_sample_my_stream",
    application_name="my-app",
    limit=100,
)

# Register the curator with Gantry
uniform_curator.create()

Once a curator is created, and for as long as it is turned on, Gantry takes care of selecting datapoints that meet its selection criteria. On each run, the selected datapoints are associated with a new version of a Gantry dataset. The relationship between curators, curator runs, datasets, and dataset versions is illustrated below:

(Diagram: a curator produces curator runs; each run adds a new version to the curator's dataset.)
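That relationship can also be sketched as plain data structures. This is a minimal illustration only, not the Gantry SDK; all class and field names here are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical types showing the one-to-many relationships:
# a curator produces many runs, and each run appends one
# version to the curator's target dataset.

@dataclass
class DatasetVersion:
    version_id: str           # e.g. a content hash
    datapoint_ids: List[str]  # datapoints selected in this run

@dataclass
class Dataset:
    name: str
    versions: List[DatasetVersion] = field(default_factory=list)

@dataclass
class CuratorRun:
    interval: str             # e.g. "2023-05-01/2023-05-02"
    selected: List[str]       # datapoint ids that met the criteria

@dataclass
class Curator:
    name: str
    dataset: Dataset
    runs: List[CuratorRun] = field(default_factory=list)

    def record_run(self, interval: str, selected: List[str]) -> None:
        # Each run selects datapoints and appends a new dataset version.
        self.runs.append(CuratorRun(interval, selected))
        version_id = f"v{len(self.dataset.versions) + 1}"
        self.dataset.versions.append(DatasetVersion(version_id, selected))
```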

Datasets

Datasets are the output of curators. A dataset is a lightweight container for data that provides a simple versioning API to help make downstream processes reproducible. The datasets API centers on two operations:

  1. push: write operations that write directly to an S3 bucket and create a version identified by a hash. In Git terms, this roughly corresponds to add, commit, and push. The goal is not fine-grained versioning, but simple semantics for iterating on training data.
  2. pull: read operations that pull a version's data to the local host, or whichever host is consuming the dataset.

What’s Next

Explore curators and datasets in more detail: