Improving model performance

Set up a job that collects data for you at regular intervals. Understand how that data changes over time. Automatically run test cases against your model.

Gantry helps you improve model performance by:

  1. Collecting a snapshot of your most recent data at regular intervals for retraining and analysis
  2. Allowing you to fully customize what that snapshot and the interval look like
  3. Enabling you to determine whether your model has improved by running an evaluation with two model versions on the same dataset (completion apps only)

To collect your data automatically, Gantry can be configured to run a job (a Curator) that collects data and puts it into datasets. Datasets are versioned so you can know what was collected when and how it has changed over time.

To automatically evaluate model versions, skip to the Model Evaluations section.

Curators

A Curator (a container for providing selection criteria) and a Trigger (the job schedule specification) define an Automation. A curator transforms a production data stream into a dataset. For example, a curator might run at an interval of one day and select a uniform random sample of 100 datapoints per interval. The following Gantry SDK calls would register such a curator with Gantry:

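import datetime
import os

import gantry
from gantry.automations.automations import Automation
from gantry.automations.curators import UniformCurator
from gantry.automations.triggers import IntervalTrigger

# Placeholders (not part of the SDK): supply your own API key and application name.
GANTRY_API_KEY = os.environ["GANTRY_API_KEY"]
GANTRY_APP_NAME = "my-app"

gantry.init(api_key=GANTRY_API_KEY)
application = gantry.get_application(GANTRY_APP_NAME)

# Selection criteria: a uniform random sample of 100 datapoints per interval.
uniform_curator = UniformCurator(
    name="uniform_sample_my_stream_1",
    application_name=GANTRY_APP_NAME,
    limit=100,
)
# Schedule: run once a day, starting at midnight on 2022-12-31.
interval_trigger = IntervalTrigger(
    start_on=datetime.datetime(2022, 12, 31, 0, 0),
    interval=datetime.timedelta(days=1),
)
# The Automation pairs the trigger (when to run) with the action (what to run).
curator_automation = Automation(
    name="curator-automation",
    trigger=interval_trigger,
    action=uniform_curator,
)
application.add_automation(curator_automation)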
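
To build intuition for what a daily uniform curator does, here is a minimal sketch in plain Python that does not use the Gantry SDK. The function names (interval_windows, uniform_sample, run_curator) and the in-memory stream of timestamped dicts are hypothetical, chosen only for illustration. Gantry's actual selection logic may differ.

```python
import datetime
import random


def interval_windows(start_on, interval, now):
    """Yield (window_start, window_end) for every interval fully elapsed by `now`."""
    window_start = start_on
    while window_start + interval <= now:
        yield window_start, window_start + interval
        window_start += interval


def uniform_sample(datapoints, limit, seed=None):
    """Return up to `limit` datapoints chosen uniformly at random, without replacement."""
    if len(datapoints) <= limit:
        return list(datapoints)
    return random.Random(seed).sample(datapoints, limit)


def run_curator(stream, start_on, interval, now, limit, seed=None):
    """Produce one sample (a stand-in for one dataset version) per elapsed window."""
    versions = []
    for window_start, window_end in interval_windows(start_on, interval, now):
        in_window = [d for d in stream if window_start <= d["timestamp"] < window_end]
        versions.append(uniform_sample(in_window, limit, seed))
    return versions


# Hypothetical production stream: one datapoint every 5 minutes for two days.
start = datetime.datetime(2022, 12, 31, 0, 0)
stream = [
    {"timestamp": start + datetime.timedelta(minutes=5 * i), "id": i}
    for i in range(576)
]
versions = run_curator(
    stream,
    start_on=start,
    interval=datetime.timedelta(days=1),
    now=start + datetime.timedelta(days=2),
    limit=100,
    seed=0,
)
```

Each entry in versions corresponds to one interval, which mirrors how every curator run contributes a new version to a dataset.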

Once a curator is created, Gantry selects datapoints that meet its selection criteria for as long as the curator is turned on (adding a curator turns it on automatically). These datapoints are associated with a version in a Gantry Dataset. The relationship between curators, curator runs, datasets, and dataset versions is illustrated below:

Datasets

There are three ways that datasets are created in Gantry:

  1. As the output of curators
  2. Directly by you, from the SDK
  3. Directly by you, from Workspaces

A dataset is a lightweight container for data that provides a simple versioning API to help make downstream processes reproducible. The datasets API centers on two operations:

  1. push: write operations that write directly to an S3 bucket and create a version identified by a hash. In Git terms, this roughly corresponds to add, commit, and push. The point is not to provide fine-grained versioning, but to provide simple semantics for iterating on training data.
  2. pull: read operations that pull data to your local machine. Datasets look like CSV files named by version.
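
The push and pull semantics above can be sketched with a toy, local stand-in. This is not the Gantry datasets API: the push and pull functions below are hypothetical, a temporary directory takes the place of the S3 bucket, and the use of a truncated SHA-256 digest as the version identifier is an assumption made for illustration.

```python
import csv
import hashlib
import io
import tempfile
from pathlib import Path


def _to_csv(rows, fieldnames):
    """Serialize rows deterministically so identical data hashes identically."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def push(store, rows, fieldnames):
    """Write rows as a CSV file named by a hash of its content; return the version."""
    content = _to_csv(rows, fieldnames).encode("utf-8")
    version = hashlib.sha256(content).hexdigest()[:12]
    (store / f"{version}.csv").write_bytes(content)
    return version


def pull(store, version):
    """Read back the rows stored under `version` (all values come back as strings)."""
    text = (store / f"{version}.csv").read_bytes().decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


# A temporary directory stands in for the remote bucket.
store = Path(tempfile.mkdtemp())
fields = ["input", "output"]
v1 = push(store, [{"input": "hi", "output": "hello"}], fields)
v2 = push(
    store,
    [{"input": "hi", "output": "hello"}, {"input": "bye", "output": "goodbye"}],
    fields,
)
first_version_rows = pull(store, v1)
```

Because the version is derived from the content, pushing identical data twice yields the same version, which is what makes a downstream training run reproducible from a version identifier alone.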

What’s Next

Explore curators and datasets in more detail: