An end-to-end tutorial showing how Gantry helps improve ML-powered products.

This tutorial walks through how to observe and monitor the predictions that power an ML-powered product. The product is the Gantry Grammar Error Corrector. It's a simple example application that uses an off-the-shelf HuggingFace model to provide grammar error corrections to snippets of text. The UI is simply a text box that provides corrections when the user clicks "Submit":


We built this app on top of Gradio. The code is very simple, which makes it well suited to a tutorial.

We use Gantry to gather data about the quality of the corrections, which are this model's predictions. This is done using Python hooks provided by the Gantry SDK. Architecturally, it looks something like this, with the Gantry instrumentation responsible for marshaling the elements of a prediction into a "row":


Every time a user provides input text to be corrected, the model provides suggested corrections. Both the input and the corrections, as well as whether the user accepted those corrections, are sent to Gantry. The user's acceptance, or non-acceptance, of the corrections is referred to as "feedback." In supervised learning this is referred to as "ground truth," but whether it's "truth" that corrects your models or user opinions, this data is critical to improving the value your model provides to users.
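
To make the idea of a "row" concrete, here is a minimal sketch in plain Python. It does not use the Gantry SDK. The `PredictionRow` class, the `make_row` and `join_feedback` helpers, and all field names are hypothetical, chosen only to illustrate how a prediction and the feedback that arrives later might be tied together by a shared key.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PredictionRow:
    """One logged prediction (hypothetical structure, not the Gantry schema)."""
    join_key: str                    # ties later feedback back to this prediction
    input_text: str                  # what the user typed
    correction: str                  # what the model suggested
    accepted: Optional[bool] = None  # feedback; unknown until the user decides


def make_row(input_text: str, correction: str) -> PredictionRow:
    # Generate a unique key at prediction time so feedback can be joined later.
    return PredictionRow(
        join_key=uuid.uuid4().hex,
        input_text=input_text,
        correction=correction,
    )


def join_feedback(rows: Dict[str, PredictionRow], join_key: str, accepted: bool) -> None:
    # Feedback arrives after the prediction; attach it to the matching row.
    rows[join_key].accepted = accepted


rows: Dict[str, PredictionRow] = {}
row = make_row("She go to school.", "She goes to school.")
rows[row.join_key] = row
join_feedback(rows, row.join_key, accepted=True)
```

The point of the sketch is the ordering: the row exists before the feedback does, so the two have to be joined after the fact.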

The balance of this page is devoted to getting up and running with this demo yourself so you can explore Gantry!


Before getting started you need an API key. Follow our tutorial on obtaining one (Getting Your API Key), or just navigate to the settings page and create one.

First, let's get set up with the sample code, which is just a matter of cloning a Git repository:

git clone [email protected]:gantry-ml/gantry-demos.git \
    && cd gantry-demos/grammar-error-corrector

We recommend creating a virtual environment to make sure this tutorial does not interfere with your native or default Python environment:

python -m venv venv
source venv/bin/activate

Now, install the Python dependencies:

pip install -r requirements.txt

The final step is to backfill the data so we have something to investigate:

python backfill.py --load-data --create-views

This will download a historical dataset from a public S3 bucket and load it into Gantry. This gives us a realistic example to work through as an introduction to Gantry.

Now that we are set up, let's jump into the product!

Monitoring and Observability

You can start exploring the data here. As mentioned above we back-populated some historical data to make the demo more interesting, so you'll notice you get dropped into the month of April 2022, where the action takes place.

You should see something like this:


What you can see above is the Gantry data table, which lists the raw predictions, along with any context (metadata tags, such as username) that was provided with those predictions, and any feedback that was joined to those predictions after the fact.

We said we'd identify a performance issue with our model, but first we need a metric to define what performance even means. Earlier we pointed out that users provide "feedback" in the form of an "accepted" bit that indicates whether they accepted our corrections. We can use the percent_true aggregation function to transform that feedback into an "acceptance rate" for our model's predictions:


We now have a metric for measuring performance.
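
As a rough illustration of what a percent-true style aggregation computes, the sketch below reimplements the idea in plain Python. The function is a stand-in written for illustration, not Gantry's implementation, and the sample feedback values are made up. Skipping rows that have no feedback yet is an assumption of this sketch.

```python
from typing import Iterable, Optional


def percent_true(values: Iterable[Optional[bool]]) -> float:
    """Fraction of True values among the rows that have feedback.

    Rows without feedback (None) are skipped; that choice is an
    assumption of this sketch.
    """
    known = [v for v in values if v is not None]
    if not known:
        return float("nan")  # no feedback at all, so the rate is undefined
    return sum(known) / len(known)


# Made-up feedback: three accepted, one rejected, one still unknown.
feedback = [True, True, False, None, True]
acceptance_rate = percent_true(feedback)
```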

One capability Gantry provides is the ability to save "views," which are basically queries on your production data-stream. Gantry allows users to create them programmatically, and the setup script for this demo pre-populated views that are useful for the narrative of this demo. Let's jump into the this-week view, which corresponds to the third week of April, when our story will unfold!
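
Conceptually, a view is a saved filter, and any metric shown for a view is computed only on the rows that filter returns. The plain-Python sketch below illustrates that idea with a time-window filter and a per-day acceptance rate. The records, the window bounds, and the function names are all made up for illustration; this is not how Gantry stores or evaluates views.

```python
from collections import defaultdict
from datetime import datetime

# Made-up records: (timestamp, accepted)
records = [
    (datetime(2022, 4, 10, 9), True),    # outside the window below
    (datetime(2022, 4, 18, 9), True),
    (datetime(2022, 4, 18, 15), True),
    (datetime(2022, 4, 19, 11), False),
    (datetime(2022, 4, 19, 16), True),
    (datetime(2022, 4, 20, 10), False),
]


def daily_acceptance_rate(records, start, end):
    """Acceptance rate per day, using only records with start <= ts < end."""
    buckets = defaultdict(list)
    for ts, accepted in records:
        if start <= ts < end:  # the "view": a filter on the data stream
            buckets[ts.date()].append(accepted)
    # Aggregate on the fly, one value per day, in chronological order.
    return {day: sum(vals) / len(vals) for day, vals in sorted(buckets.items())}


# Illustrative window only; the real view's bounds are defined in the UI.
view_start, view_end = datetime(2022, 4, 18), datetime(2022, 4, 25)
rates = daily_acceptance_rate(records, view_start, view_end)
```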


You can see the query builder where the "view" is defined:


We can look at how our performance metric is doing on this view of our data. That is, the aggregate is computed on the fly using only the data returned by that query. A clear dip can be observed, after which the metric is quite volatile:


This definitely feels like something worth investigating. Let's head back to the raw data and drill in on just the records where the user has rejected our suggestions:


What really stands out here is that the records with non-accepted corrections seem heavily concentrated in "younger" accounts. Put simply, our new users are not loving our product. We discovered this by observing the distribution of account ages and how it changed when we toggled between the accepted and not-accepted categories for feedback:
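
The kind of comparison described here can be sketched in plain Python: bucket account ages, then count the buckets separately for accepted and rejected corrections. The rows, the bucket boundaries, and the function names below are made up for illustration and are not taken from the demo dataset.

```python
from collections import Counter

# Made-up rows: (account_age_days, accepted)
rows = [
    (2, False), (3, False), (5, False), (6, True),
    (40, True), (90, True), (120, False), (200, True),
]


def age_bucket(age_days: int) -> str:
    # Arbitrary bucket boundaries chosen for this sketch.
    if age_days < 7:
        return "0-6 days"
    if age_days < 30:
        return "7-29 days"
    return "30+ days"


def age_histogram(rows, accepted: bool) -> Counter:
    """Count rows per age bucket for one feedback category."""
    return Counter(
        age_bucket(age) for age, was_accepted in rows if was_accepted is accepted
    )


rejected_hist = age_histogram(rows, accepted=False)
accepted_hist = age_histogram(rows, accepted=True)
```

In this toy data the rejected rows pile up in the youngest bucket while the accepted rows sit mostly in the oldest one, which is the shape of the shift described in the text.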


So, a reasonable hypothesis is that our newer users are presenting data that is confusing our model. What if we wanted to gather a training set of data from these new users for retraining purposes? To keep things simple we will skip ahead to doing so, though in a real-world scenario we might do a lot more data analysis to evaluate whether this hypothesis is reasonable.

Continual Learning

In the previous section we used Gantry's monitoring and observability features to formulate a hypothesis for how to create a better model: gather data from newer users. We now turn to Gantry's continual learning features to show how to easily transform that hypothesis into a dataset we can use to train a candidate model.

It all starts with defining a "curator" that grabs the data you want.


Gantry curators are essentially a way to tell Gantry what data you want, and on what schedule to gather it. As a reminder, our hypothesis is that a model trained on data from new users might resolve our newly discovered acceptance rate volatility issue. The goal of the curator is to provide you with a way to describe what data you want from your production data-stream, and when you want it. Gantry acts as your "data butler," dutifully gathering the data into versioned datasets and taking care of all of the messy job management and execution.
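
To illustrate the "what data, on what schedule" idea, here is a toy curator in plain Python. It walks a time range one interval at a time and, for each interval, keeps the records whose chosen field falls inside a bounded range. The `ToyRangeCurator` class, the `account_age_days` field, and the 0 to 7 day bounds are hypothetical; this is a conceptual sketch and not the Gantry curator API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ToyRangeCurator:
    """Hypothetical stand-in for a curator; not the Gantry class."""
    bound_field: str      # which field to filter on
    lower: float          # inclusive lower bound
    upper: float          # inclusive upper bound
    interval: timedelta   # how often to take a snapshot
    limit: int            # maximum records kept per interval

    def select(self, records, start, end):
        """Return one list of selected records per interval in [start, end)."""
        versions = []
        cursor = start
        while cursor < end:
            window_end = min(cursor + self.interval, end)
            in_window = [r for r in records if cursor <= r["timestamp"] < window_end]
            picked = [
                r for r in in_window
                if self.lower <= r[self.bound_field] <= self.upper
            ]
            versions.append(picked[: self.limit])
            cursor = window_end
        return versions


# Made-up production records.
records = [
    {"timestamp": datetime(2022, 4, 1, 8), "account_age_days": 2},
    {"timestamp": datetime(2022, 4, 1, 9), "account_age_days": 300},
    {"timestamp": datetime(2022, 4, 2, 10), "account_age_days": 5},
    {"timestamp": datetime(2022, 4, 2, 11), "account_age_days": 1},
]

curator = ToyRangeCurator(
    bound_field="account_age_days",
    lower=0,
    upper=7,
    interval=timedelta(days=1),
    limit=100,
)
versions = curator.select(records, datetime(2022, 4, 1), datetime(2022, 4, 3))
```

Each element of `versions` plays the role of one dataset version: the subset of that interval's records that matched the selection criteria.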

So, in summary a "curator" is a job definition, and the output of that job is a dataset. We will get to datasets in the next section, but for now let's just show how datasets are created from the notebook in the tutorial repository:

from config import GantryConfig, DataStorageConfig
import gantry
import datetime
from gantry.curators import BoundedRangeCurator

# Initialize a connection with Gantry and check it's live
assert gantry.ping()

new_accounts_curator = BoundedRangeCurator(
    # This is because dataset names must be valid Python identifiers
    name=f"{GantryConfig.GANTRY_APP_NAME}-new-accounts-curator".replace("-", "_"),


Executing the relevant cells in the notebook should produce the following output:

<gantry.curators.stock_curators.BoundedRangeCurator at 0x13c5a9f70>

A few things to note here:

  • we use GantryConfig and DataStorageConfig to grab some configurations, like the earliest timestamp of the data
  • we provide the application name to tie this curator to the associated application

Now that this curator is created, we can validate that it exists and is writing "versions" to the target dataset. We will talk more about datasets in the next section, but they are a container that provides simple data versioning.

Run the following code to grab the dataset into which our curator is busy backfilling data queried from historical segments:

dataset = new_accounts_curator.get_curated_dataset()

For each historical interval, in this case each day, our curator will write a "version" to the downstream dataset, and we can list those versions:


Don't worry if your output looks a bit different. It takes a few minutes to backfill all the segments, of which there are about 30 for the time window that has data in this demo application:

[{'version_id': 'f4251481-2c9e-4bfc-b061-9a45d97d8799',
  'dataset': 'gec_demo_app_new_accounts_curator',
  'message': 'Added Gantry data from model gec-demo-app from start time 2022-04-04T00:00:00 to end time 2022-04-05T00:00:00 - 120 records added from interval of size 193. Commit automatically created by curator.',
  'created_at': 'Wed, 01 Feb 2023 05:13:40 GMT',
  'created_by': '54369935-4749-492e-961d-2fc596d2d51c',
  'is_latest_version': True},
 {'version_id': 'd94387eb-d6b5-4745-9912-ded44538e4f8',
  'dataset': 'gec_demo_app_new_accounts_curator',
  'message': 'Added Gantry data from model gec-demo-app from start time 2022-04-03T00:00:00 to end time 2022-04-04T00:00:00 - 110 records added from interval of size 174. Commit automatically created by curator.',
  'created_at': 'Wed, 01 Feb 2023 05:13:39 GMT',
  'created_by': '54369935-4749-492e-961d-2fc596d2d51c',
  'is_latest_version': False},
 {'version_id': '061aa10e-2f85-4173-b419-286d2ac5a44f',
  'dataset': 'gec_demo_app_new_accounts_curator',
  'message': 'Added Gantry data from model gec-demo-app from start time 2022-04-02T00:00:00 to end time 2022-04-03T00:00:00 - 76 records added from interval of size 120. Commit automatically created by curator.',
  'created_at': 'Wed, 01 Feb 2023 05:13:38 GMT',
  'created_by': '54369935-4749-492e-961d-2fc596d2d51c',
  'is_latest_version': False},
 {'version_id': '62842450-38a4-4c8e-b7b8-af38665c6af9',
  'dataset': 'gec_demo_app_new_accounts_curator',
  'message': 'Added Gantry data from model gec-demo-app from start time 2022-04-01T00:00:00 to end time 2022-04-02T00:00:00 - 51 records added from interval of size 96. Commit automatically created by curator.',
  'created_at': 'Wed, 01 Feb 2023 05:13:38 GMT',
  'created_by': '54369935-4749-492e-961d-2fc596d2d51c',
  'is_latest_version': False},
 {'version_id': '6af2a754-738f-4e1c-907d-180d6266c63b',
  'dataset': 'gec_demo_app_new_accounts_curator',
  'message': 'initial dataset commit',
  'created_at': 'Wed, 01 Feb 2023 05:13:06 GMT',
  'created_by': '54369935-4749-492e-961d-2fc596d2d51c',
  'is_latest_version': False}]
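
If you want a quick sense of how much data has been backfilled so far, the commit messages can be tallied. The sketch below parses the "N records added from interval of size M" pattern seen in the sample output above. The message format is inferred from that output alone and may differ in your environment, and `summarize_versions` is a hypothetical helper, not part of the Gantry SDK. The sample messages are abbreviated copies of the ones shown above.

```python
import re

# Pattern inferred from the sample output; treat it as an assumption.
ADDED_PATTERN = re.compile(r"(\d+) records added from interval of size (\d+)")


def summarize_versions(versions):
    """Return (records_added, records_seen) summed over curator commits."""
    added = seen = 0
    for version in versions:
        match = ADDED_PATTERN.search(version["message"])
        if match is None:
            continue  # e.g. the 'initial dataset commit' entry
        added += int(match.group(1))
        seen += int(match.group(2))
    return added, seen


sample_versions = [
    {"message": "Added Gantry data ... - 120 records added from interval of size 193. ..."},
    {"message": "Added Gantry data ... - 110 records added from interval of size 174. ..."},
    {"message": "Added Gantry data ... - 76 records added from interval of size 120. ..."},
    {"message": "Added Gantry data ... - 51 records added from interval of size 96. ..."},
    {"message": "initial dataset commit"},
]

records_added, records_seen = summarize_versions(sample_versions)
```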

So, curators are a simple way to generate snapshots from your production data that can be queried in a reproducible way. You tell Gantry what you want, and how often you want it, and Gantry takes care of managing and executing that job.

Let's dive into datasets to examine the container we wrote our data to.


So far we have only vaguely described Gantry Datasets, the "lightweight container" that our curator wrote to. This "lightweight container" consists of a very simple API for versioning. Versioning is at the file level, and centers on two operations: push and pull. Curators push data to Gantry Datasets. Users pull data from them to analyze, label, or train on. Users can also make local modifications to datasets, and push them back. All push operations create a new version, and each version is written to underlying S3 storage.
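
The push and pull model can be illustrated with a small in-memory toy. Every push stores a full copy of the files and creates a new version, and a pull returns either the latest version or a specific one. The `ToyVersionedDataset` class and its method names are hypothetical and exist only for this sketch; it is not the Gantry Datasets API and it does not touch S3.

```python
import copy
import uuid


class ToyVersionedDataset:
    """In-memory illustration of push/pull versioning (not the Gantry API)."""

    def __init__(self):
        self._versions = []  # (version_id, message, files), oldest first

    def push(self, files: dict, message: str) -> str:
        # Every push creates a new, immutable version.
        version_id = uuid.uuid4().hex
        self._versions.append((version_id, message, copy.deepcopy(files)))
        return version_id

    def pull(self, version_id=None) -> dict:
        if not self._versions:
            raise LookupError("dataset has no versions yet")
        if version_id is None:
            return copy.deepcopy(self._versions[-1][2])  # latest version
        for vid, _, files in self._versions:
            if vid == version_id:
                return copy.deepcopy(files)
        raise KeyError(version_id)

    def list_versions(self):
        # Newest first, mirroring the ordering of the sample output.
        latest = self._versions[-1][0] if self._versions else None
        return [
            {"version_id": vid, "message": msg, "is_latest_version": vid == latest}
            for vid, msg, _ in reversed(self._versions)
        ]


ds = ToyVersionedDataset()
v1 = ds.push({"data.csv": "text,accepted\n"}, "initial dataset commit")

local = ds.pull()                           # pull the latest version
local["data.csv"] += "She go home.,True\n"  # make a local modification
v2 = ds.push(local, "added one record")     # pushing it back creates a new version
```

Because each version is a separate copy, the first version is still retrievable unchanged after the second push.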

Let's pull the dataset our curator created and start to see what we can do with it:


Again, your local output will be slightly different:

{'version_id': 'e4da77a1-de3e-43d9-95ff-4e1213c36be4',
 'dataset': 'gec_demo_app_new_accounts_curator',
 'message': 'Added Gantry data from model gec-demo-app from start time 2022-04-30T00:00:00 to end time 2022-05-01T00:00:00 - 11 records added from interval of size 1479. Commit automatically created by curator.',
 'created_at': 'Wed, 01 Feb 2023 05:13:59 GMT',
 'created_by': '54369935-4749-492e-961d-2fc596d2d51c',
 'is_latest_version': True}

One of the compelling features of Gantry datasets is that you can load them directly into HuggingFace Datasets, which can in turn be converted into Pandas DataFrame objects. All the type mappings are handled by the integration:

hfds = dataset.get_huggingface_dataset()
df = hfds["train"].to_pandas()

This will produce output looking something like this, depending on how many segments have backfilled:


We can now analyze, train on, or label this dataset, or just edit and push a new version back to Gantry if we happened to want to do some manual data cleaning.


In summary, after setting up a demo application with some dummy data, we:

  • used Gantry's monitoring and observability capabilities to form a hypothesis for why our model is exhibiting diminished performance
  • defined, with a single line of code, selection criteria for a dataset that would evaluate this hypothesis, and then let Gantry take care of backfilling all the historical segments in our time window
  • pulled that dataset straight into popular open source machine learning data loader and data analysis libraries to use as we please.