Curators

Auto-populate datasets with filtered data

The curation process leverages the following concepts:

  • selection, which determines what data is collected from a production stream
  • scheduling, which defines how often data is collected

When a Curator job assembles data, it saves that data to a Dataset under a version number associated with the job.

At a high level the Curator looks like this:

[Figure: high-level overview of the Curator]

Concepts

Selectors

Data selectors allow for encoding domain knowledge and insights about a problem to curate training sets.

Selectors take a list of filters and combine them with method and limit criteria that describe how to prioritize the data points that pass the selection conditions set by the filters. Limits allow for encoding a labeling budget directly into job definitions.

A Filter is a boolean function that can be applied to a field of an application. Curation jobs discard any data point for which at least one filter does not evaluate to true. Filtering is an AND operation.
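To make these semantics concrete, here is a minimal plain-Python sketch of how a list of filters, an ordering method, and a limit could interact. It does not use the Gantry SDK: the helpers `equals`, `bounds`, and `select`, the record layout, and the inclusive bounds are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
Predicate = Callable[[Record], bool]


def equals(field: str, value: Any) -> Predicate:
    """Hypothetical stand-in for an equality filter."""
    return lambda record: record[field] == value


def bounds(field: str, lower: float = float("-inf"), upper: float = float("inf")) -> Predicate:
    """Hypothetical stand-in for a bounds filter (inclusive bounds are an assumption)."""
    return lambda record: lower <= record[field] <= upper


def select(records: List[Record], filters: List[Predicate], sort_field: str, limit: int) -> List[Record]:
    # AND: a record survives only if every filter evaluates to true.
    passing = [r for r in records if all(f(r) for f in filters)]
    # Method: prioritize the lowest values of sort_field (ascending order).
    passing.sort(key=lambda r: r[sort_field])
    # Limit: keep at most `limit` records, i.e. the labeling budget.
    return passing[:limit]


records = [
    {"id": 1, "college_degree": True, "income": 9_000, "satisfaction": 0.2},
    {"id": 2, "college_degree": True, "income": 5_000, "satisfaction": 0.1},
    {"id": 3, "college_degree": False, "income": 9_500, "satisfaction": 0.05},
    {"id": 4, "college_degree": True, "income": 8_000, "satisfaction": 0.7},
    {"id": 5, "college_degree": True, "income": 12_000, "satisfaction": 0.4},
]

filters = [equals("college_degree", True), bounds("income", lower=7_500)]
selected = select(records, filters, sort_field="satisfaction", limit=2)
selected_ids = [r["id"] for r in selected]
print(selected_ids)
```

Records 2 and 3 each fail one filter and are discarded even though they have the lowest satisfaction scores; the limit then trims the remaining candidates to the two with the lowest scores.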

The following example defines filters designed to identify loan applicants that we suspect buy too much expensive salad. These filters are then passed to a Selector instructed to prioritize data points with the lowest job satisfaction scores and to select a maximum of 10 data points per execution:

from gantry.curators.selectors import BoundsFilter, EqualityFilter, OrderedSampler, Selector

buys_too_much_sweetgreen = [
    EqualityFilter(field="inputs.college_degree", equals=True),
    BoundsFilter(field="inputs.credit_score", upper=5_000),
    BoundsFilter(field="inputs.job.income", lower=7_500),
]

dissatisfied_sweetgreen_addicts = Selector(
    method=OrderedSampler(
        field="inputs.job.satisfaction",
        sort="ascending",
    ),
    filters=buys_too_much_sweetgreen,
    limit=10,
)

This Selector can be used to specify a Curator. If multiple Selectors are passed to a Curator, any data point that satisfies at least one of them will be passed to the curation job. Selectors are combined with an OR operation.

import datetime
from gantry.curators import Curator

custom_curator = Curator(
    name="most_dissatisfied_sweetgreen_addicts",
    application_name=application_name,  # name of an existing application, defined elsewhere
    selectors=[dissatisfied_sweetgreen_addicts],
    start_on=datetime.datetime(2022, 12, 31, 0, 0),
    curation_interval=datetime.timedelta(days=1),
)
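The OR behavior across Selectors can be pictured as a union of each selector's output. The sketch below is plain Python rather than the Gantry SDK; `combine_selectors`, the two toy selectors, and the choice to keep a record only once when several selectors pick it are assumptions for illustration.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]
SelectorFn = Callable[[List[Record]], List[Record]]


def combine_selectors(records: List[Record], selectors: List[SelectorFn]) -> List[Record]:
    """Union of the selectors' outputs.

    A record chosen by several selectors is kept once (an assumption).
    """
    seen = set()
    combined = []
    for selector in selectors:
        for record in selector(records):
            if record["id"] not in seen:
                seen.add(record["id"])
                combined.append(record)
    return combined


def low_satisfaction(records: List[Record]) -> List[Record]:
    # Toy selector: records with a satisfaction score below 0.3.
    return [r for r in records if r["satisfaction"] < 0.3]


def high_income(records: List[Record]) -> List[Record]:
    # Toy selector: records with an income of at least 10,000.
    return [r for r in records if r["income"] >= 10_000]


records = [
    {"id": 1, "income": 12_000, "satisfaction": 0.2},
    {"id": 2, "income": 4_000, "satisfaction": 0.1},
    {"id": 3, "income": 15_000, "satisfaction": 0.9},
    {"id": 4, "income": 3_000, "satisfaction": 0.8},
]

curated = combine_selectors(records, [low_satisfaction, high_income])
curated_ids = [r["id"] for r in curated]
print(curated_ids)
```

Record 1 satisfies both selectors but appears once, and record 4 satisfies neither and is dropped.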

Scheduling

Scheduling specifies how often a version should be created in the target dataset. Versions are updated by querying the production stream at the end of an interval and appending results to the existing version.

Curation interval

The curation_interval parameter, defined as a timedelta object, determines the time window that each dataset version covers. As soon as an interval ends, a curator runs its query and writes the resulting records to a new version in the target dataset.
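As a rough illustration of how a record's timestamp maps to a version window, the following sketch uses only the standard library. The helper `window_for`, the alignment of windows to the start date, and the half-open `[start, end)` convention are assumptions, not documented Gantry behavior.

```python
import datetime
from typing import Tuple


def window_for(
    timestamp: datetime.datetime,
    start_on: datetime.datetime,
    curation_interval: datetime.timedelta,
) -> Tuple[datetime.datetime, datetime.datetime]:
    """Return the [start, end) window that contains `timestamp`.

    Windows are assumed to be aligned to `start_on`.
    """
    if timestamp < start_on:
        raise ValueError("timestamp precedes start_on")
    # Floor division of two timedeltas yields the number of whole intervals elapsed.
    index = (timestamp - start_on) // curation_interval
    window_start = start_on + index * curation_interval
    return window_start, window_start + curation_interval


start_on = datetime.datetime(2022, 12, 31, 0, 0)
interval = datetime.timedelta(days=1)
window = window_for(datetime.datetime(2023, 1, 2, 15, 30), start_on, interval)
print(window)
```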

Start Date

Specifying a historical start_on with a datetime object will trigger Gantry to backfill every interval between the time specified and now. While the curator is enabled, it will then continue to create versions as intervals pass until it is disabled.
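The backfill described above can be sketched by enumerating every interval that has fully elapsed between the start date and the current time. This is a standard-library illustration only: `completed_windows` is a hypothetical helper, and whether Gantry treats a partially elapsed interval differently is not covered here.

```python
import datetime
from typing import Iterator, Tuple

Window = Tuple[datetime.datetime, datetime.datetime]


def completed_windows(
    start_on: datetime.datetime,
    curation_interval: datetime.timedelta,
    now: datetime.datetime,
) -> Iterator[Window]:
    """Yield each window that has fully elapsed between `start_on` and `now`."""
    if curation_interval <= datetime.timedelta(0):
        raise ValueError("curation_interval must be positive")
    window_start = start_on
    # A window is emitted only once its end has passed.
    while window_start + curation_interval <= now:
        yield window_start, window_start + curation_interval
        window_start += curation_interval


start_on = datetime.datetime(2022, 12, 31)
now = datetime.datetime(2023, 1, 3, 12, 0)
backfill = list(completed_windows(start_on, datetime.timedelta(days=1), now))
for window_start, window_end in backfill:
    print(window_start, "->", window_end)
```

With a one-day interval, a start date of 2022-12-31, and a current time of midday on 2023-01-03, three windows have elapsed; the window that began on 2023-01-03 is still open and would be produced once it ends.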

Examples

Bounded Ranges

Here we reuse an example from the Tutorial. In that example, we wanted to gather predictions served to users of newly created accounts, i.e. we wanted to "bound" the account age field. To do this we used the BoundedRangeCurator, a convenient subclass of the Curator. Gantry provides many such "stock curators" that encapsulate common selection criteria to help streamline the creation of curators. In the case of the BoundedRangeCurator, this looks like:

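import datetime

from gantry.curators import BoundedRangeCurator

# GantryConfig and DataStorageConfig are assumed to be the configuration objects from the Tutorial.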

new_accounts_curator = BoundedRangeCurator(
    name=f"{GantryConfig.GANTRY_APP_NAME}-account-age-curator",
    application_name=GantryConfig.GANTRY_APP_NAME,
    limit=1000,
    bound_field="inputs.account_age_days",
    lower_bound=0,
    upper_bound=5,
    start_on=DataStorageConfig.MIN_DATE,
    curation_interval=datetime.timedelta(days=1)
)

new_accounts_curator.create()

Let's review each parameter:

  • name: names the curator; here we extend our application name
  • application_name: ties the curator to an application
  • limit: the labeling budget; we assume we can afford to label 1000 data points per day
  • bound_field: we want newer accounts, so we bound the account_age_days field
  • lower_bound, upper_bound: in the tutorial we identified 0 to 5 days as the account-age window in which users experience degraded model performance
  • start_on: the data for the tutorial is from 2022, so we specify the start of the window for which we have data
  • curation_interval: one day, so that together with limit each run gathers at most 1000 data points per day

That's all that's required for Gantry to transform your production data streams into versioned datasets that can be used for labeling, retraining, evaluation, or any other downstream process.