Curators
Auto-populate datasets with filtered data
The curation process leverages the following concepts:
- selection, which determines what data is collected from a production stream.
- scheduling, which defines how often data is collected.
When data is assembled by a Curator job, it is saved in a Dataset with a version number associated with that job.
Concepts
Selectors
Data selectors allow for encoding domain knowledge and insights about a problem to curate training sets.
Selectors take a list of filters, and combine the filters with method and limit criteria that describe how to prioritize datapoints that pass the selection conditions provided by the list of filters. Limits allow for encoding a labeling budget directly into job definitions.
A Filter is a boolean function that can be applied to a field of an application. Curation jobs will discard data points that do not evaluate to true for all filters: filtering is an AND operation.
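These AND semantics can be sketched in plain Python (illustrative only, not the Gantry API; the field names mirror the example below):

```python
# Schematic illustration of AND filter semantics: a datapoint survives
# only if every filter predicate evaluates to True.
filters = [
    lambda dp: dp["inputs"]["college_degree"] is True,
    lambda dp: dp["inputs"]["credit_score"] <= 5_000,
]

def passes_all(datapoint, predicates):
    # AND operation across all filters
    return all(f(datapoint) for f in predicates)

kept = passes_all({"inputs": {"college_degree": True, "credit_score": 4_000}}, filters)
dropped = passes_all({"inputs": {"college_degree": True, "credit_score": 9_000}}, filters)
# kept is True, dropped is False
```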
The following example defines filters designed to identify loan applicants that we suspect buy too much expensive salad. These filters are then passed to a Selector instructed to prioritize data points with the lowest job satisfaction scores and to select a maximum of 10 data points per execution:
from gantry.curators.selectors import BoundsFilter, EqualityFilter, OrderedSampler, Selector

buys_too_much_sweetgreen = [
    EqualityFilter(field="inputs.college_degree", equals=True),
    BoundsFilter(field="inputs.credit_score", upper=5_000),
    BoundsFilter(field="inputs.job.income", lower=7_500),
]
dissatisfied_sweetgreen_addicts = Selector(
    method=OrderedSampler(
        field="inputs.job.satisfaction",
        sort="ascending",
    ),
    filters=buys_too_much_sweetgreen,
    limit=10,
)
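What this Selector expresses can be sketched in plain Python (illustrative only; the sampling itself happens inside Gantry):

```python
# Schematic of OrderedSampler + limit: sort surviving datapoints by the
# given field ascending, then keep at most `limit` of them.
datapoints = [
    {"inputs": {"job": {"satisfaction": s}}} for s in (7, 2, 9, 4)
]

prioritized = sorted(
    datapoints,
    key=lambda dp: dp["inputs"]["job"]["satisfaction"],  # ascending sort field
)[:10]  # limit=10 caps the labeling budget per execution
# The most dissatisfied datapoint (satisfaction == 2) comes first.
```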
This Selector can be used to specify a Curator. If multiple Selectors are passed to a Curator, any datapoint that satisfies at least one Selector will be passed to the curation job: Selectors are an OR operation.
import datetime

from gantry.curators import Curator

custom_curator = Curator(
    name="most_dissatisfied_sweetgreen_addicts",
    application_name=application_name,
    selectors=[dissatisfied_sweetgreen_addicts],
    start_on=datetime.datetime(2022, 12, 31, 0, 0),
    curation_interval=datetime.timedelta(days=1),
)
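The OR semantics across Selectors can likewise be sketched in plain Python (illustrative only, not the Gantry API; the two conditions are hypothetical):

```python
# Schematic of OR semantics: a datapoint is curated if it satisfies
# at least one Selector's conditions.
selector_a = lambda dp: dp["score"] < 3        # hypothetical first Selector
selector_b = lambda dp: dp["flagged"] is True  # hypothetical second Selector

def curated(datapoint, selectors):
    # OR operation across Selectors
    return any(sel(datapoint) for sel in selectors)

hit = curated({"score": 5, "flagged": True}, [selector_a, selector_b])
miss = curated({"score": 5, "flagged": False}, [selector_a, selector_b])
# hit is True (selector_b matches), miss is False (neither matches)
```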
Scheduling
Scheduling specifies how often a version should be created in the target dataset. Versions are updated by querying the production stream at the end of an interval and appending results to the existing version.
Curation interval
The curation_interval parameter, defined as a timedelta object, dictates which time window each dataset version corresponds to. As soon as an interval ends, a curator runs its query and writes the resulting records to a new version in the target dataset.
Start Date
Specifying a historical start_on with a datetime object will trigger Gantry to backfill every interval between the time specified and now. While the curator is enabled, it will then continue to create versions as intervals pass, until it is disabled.
Examples
Bounded Ranges
Here we reuse an example from the Tutorial. In that example, we wanted to gather predictions served to users of newly created accounts, i.e. we wanted to "bound" the account age field. To do this we used the BoundedRangeCurator, a convenient subclass of the Curator. Gantry provides many such "stock curators" that encapsulate common selection criteria to help streamline the creation of curators. In the case of the BoundedRangeCurator, this looks like:
from gantry.curators import BoundedRangeCurator

new_accounts_curator = BoundedRangeCurator(
    name=f"{GantryConfig.GANTRY_APP_NAME}-account-age-curator",
    application_name=GantryConfig.GANTRY_APP_NAME,
    limit=1000,
    bound_field="inputs.account_age_days",
    lower_bound=0,
    upper_bound=5,
    start_on=DataStorageConfig.MIN_DATE,
    curation_interval=datetime.timedelta(days=1),
)
new_accounts_curator.create()
Let's review each parameter:
- name: names the curator; we conveniently extend our application name
- application_name: ties the curator to an application
- limit: let's assume we can afford to label 1000 datapoints per day
- bound_field: we want newer accounts, so we specify the account_age_days field for bounding
- lower_bound, upper_bound: in the tutorial we identified 0 to 5 days as the window for users experiencing degraded model performance
- start_on: the data for the tutorial is from 2022, so we specified the start of the window for which we have data
- curation_interval: we want to gather a maximum of 1000 data points per day
That's all that's required for Gantry to seamlessly transform your production data streams into versioned datasets that can be used for labeling, retraining, evaluation, or any other downstream process.