LLM Tutorial

Use Gantry to create, evaluate, deploy, and monitor the performance of your OpenAI LLM for Completion tasks


Gantry is currently invite-only. Contact us for access.

This tutorial uses a Gantry workflow specific to OpenAI Completion data. If you're using other data, follow the custom model quickstart. Support for other LLM providers and OpenAI Chat is coming soon.

This tutorial introduces Gantry support for OpenAI completion LLMs. By the time you complete this tutorial, you will understand how to:

  • Understand how your model is performing and how that compares to previous versions
  • Use Gantry to create data visualizations, find underperforming cases, and add those cases to your test datasets
  • Send data to Gantry

Create an application

Before we can analyze data, we need to send data to Gantry. This all starts by creating an application (a model and its associated configuration) in which the data will live.

The dashboard lists all applications. There are two ways to create a new application in Gantry: the UI and the SDK.

Navigate to the dashboard and click Create application. Choose the Completion application type and name it my-app.

Doing this takes us to the sandbox for my-app.

Creating an application version

Applications in Gantry are versioned, which allows us to track changes over time. A version for completions is a specific prompt and configuration. If the prompt or configuration is updated, Gantry stores it as a new version. Versioning makes it easy to compare the performance of multiple prompts, instantly deploy and roll back changes, and more.

Versions are created and updated in code or in the sandbox. To create a version for our new application, let's use the sandbox.

Start by adding a prompt that says: "Correct the grammar of the user's input. User input: {{user_input}}". The double brackets tell Gantry that user_input is a variable. It will appear on the lefthand side as a box where we can enter data.
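Under the hood, this templating amounts to simple placeholder substitution. Here is a minimal sketch of the idea (an illustrative stand-in, not Gantry's actual fill_prompt helper from gantry.applications.llm_utils):

```python
import re

def fill_prompt(prompt: str, values: dict) -> str:
    """Replace each {{variable}} placeholder with its value.

    Illustrative stand-in for gantry.applications.llm_utils.fill_prompt.
    """
    return re.sub(
        r"\{\{(\w+)\}\}",
        lambda m: str(values[m.group(1)]),
        prompt,
    )

prompt = "Correct the grammar of the user's input. User input: {{user_input}}"
print(fill_prompt(prompt, {"user_input": "She go to school."}))
# → Correct the grammar of the user's input. User input: She go to school.
```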

Let's try this prompt on a few data points to make sure it's performing as we expect. As you can see in the gif below, we can run this on multiple test inputs at a time.

When we click the Run changes button, Gantry will run the prompt on the inputs and show us the results.

Manage application versions

You'll also notice at the top that this gets saved as a new prompt version. We can see all the prompt versions we've created from the versions tab:

Let's say we're happy with the prompt that we created. Let's deploy it to production. This is done by clicking the three dots on the righthand side and selecting "deploy to prod". A confirmation modal shows us that we did this correctly:

Once this model is tagged as our production model, the Gantry SDK will pull it automatically and use it in our production environment. With this set up correctly, we can change the production model without changing code: just update the model tagged prod in Gantry.
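Conceptually, tagging works like a small registry that maps a tag to a version number, and production code resolves the tag at call time. A toy illustration of that mechanism (plain Python, not the Gantry SDK):

```python
# Toy illustration of tag-based version resolution (not the Gantry SDK).
# Each version pins a prompt; a tag like "prod" points at one version number.
versions = {
    1: {"prompt": "Correct the grammar of the user's input. "
                  "User input: {{user_input}}"},
    2: {"prompt": "Correct the grammar of the user's input. Make sure the "
                  "output language matches the input. User input: {{user_input}}"},
}
tags = {"prod": 1}

def get_version(tag):
    # Resolved at call time, so moving the tag changes behavior
    # without touching application code.
    return versions[tags[tag]]

tags["prod"] = 2          # "deploy to prod": only the tag moves
current = get_version("prod")
```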

Send data to Gantry

Let's configure an application that uses this model and pre-load a bunch of completions to simulate time passing. First, clone the repo and navigate to this demo's folder:

git clone https://github.com/gantry-ml/gantry-demos.git \
    && cd gantry-demos/llm-completion-gec

We also recommend creating a virtual environment:

python -m venv venv
source venv/bin/activate

Next, install the requirements and configure the OpenAI API key. The demo code reads OPENAI_API_KEY from the environment (a .env file also works, via python-dotenv):

pip install -r requirements.txt
export OPENAI_API_KEY="your-openai-api-key"

The folder llm-completion-gec contains two interesting files:

  1. A file that loads the prod version of our application my-app, fills the prompt in with user_input, sends the request to OpenAI, and logs everything to Gantry:
import gantry
import os
from dotenv import load_dotenv
from gantry.applications.llm_utils import fill_prompt
import openai
from openai.util import convert_to_dict

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")


my_llm_app = gantry.get_application("my-app")

version = my_llm_app.get_version("prod")
config = version.config
prompt = config["prompt"]

def generate(user_input_value):
    values = {
        "user_input": user_input_value,
    }
    filled_in_prompt = fill_prompt(prompt, values)
    request = {
        "model": "text-davinci-002",
        "prompt": filled_in_prompt,
    }
    results = openai.Completion.create(**request)

    # Log the request, response, and prompt values to Gantry
    # (call reconstructed; see the demo repo for the exact signature)
    my_llm_app.log(
        api_request=request,
        api_response=convert_to_dict(results),
        request_attributes={"prompt_values": values},
    )

    return results
  2. A file that simulates the passage of time by loading some example inputs into the function above. Let's run it:
python log_examples.py

Analyze data with Workspaces

Navigating back to the application in Gantry, we should see that we've now ingested some data:

Let's explore it further in workspaces by clicking on Go to Workspaces.

One thing that immediately jumps out is that some of our users are using this prompt with an input language other than English. Let's dig into that a bit further. First, let's filter on the French language users by clicking on the bar chart.

Here we notice that sometimes the French inputs get translated to English instead of having their French grammar corrected in French. This looks like a good opportunity to improve our prompts. Before we do that, let's employ some test-driven development practices and add these prompts to a dataset. This way, we can come back to this dataset later when we evaluate this prompt against our newer version. To do that, check the boxes for the rows with a language mismatch and click Add to dataset.

Since we don't have any datasets yet, we'll need to create a new one. Let's call it language-mismatch. If we head over to the datasets tab, we can see it listed.
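The test-driven practice here boils down to saving failing rows as a named dataset you can evaluate against later. A toy sketch of that bookkeeping (plain Python data structures, not the Gantry API; the example rows are made up):

```python
# Toy illustration of "Add to dataset": collect rows where the output
# language does not match the input language into a named dataset.
rows = [
    {"input": "Elle va a l'ecole.", "input_lang": "fr", "output_lang": "en"},
    {"input": "She go to school.",  "input_lang": "en", "output_lang": "en"},
    {"input": "Il mange pomme.",    "input_lang": "fr", "output_lang": "fr"},
]

datasets = {}

def add_to_dataset(name, selected_rows):
    # Mirrors the "Add to dataset" action in the workspace UI.
    datasets.setdefault(name, []).extend(selected_rows)

mismatches = [r for r in rows if r["input_lang"] != r["output_lang"]]
add_to_dataset("language-mismatch", mismatches)
```

Later, an evaluation can replay every row in datasets["language-mismatch"] against a new prompt version to check whether the failures are fixed.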

Let's head back over to the sandbox to create a new prompt to solve this problem: "Correct the grammar of the user's input. Make sure the output language of the grammar correction matches the input. User input: {{user_input}}". Once we click save, we can see that it gets saved as Version 2.

Compare model performance with Evaluations

Now, let's create an evaluation to understand if our Version 2 prompt is outperforming our Version 1 prompt. To do this, navigate to the evaluations tab and click "Get started". We could run the evaluation against the dataset we just created, but for demonstration purposes, let's play around with creating a new dataset with generated data. First configure the comparison versions to be Version 2 and Version 1. Then click on the generate data button:

In this text box, we can describe any dataset we'd like and Gantry will use an LLM to generate it. Let's say "Grammatically incorrect sentences in a language other than English." and click Generate data. Let's save it as a new dataset called generated-non-english-sentences and use it in our evaluation. We also need to update our evaluation criteria to explain how the correctness of the output should be judged. Let's enter 2 criteria:

  1. Result should have correct grammar
  2. Result should match the language of the input

When everything is configured correctly, our evaluation modal should look like this:

Now, let's run the evaluation.

The evaluation is complete when the status changes to green. This might take a few minutes.

Now, click on the completed evaluation to see which model performed better: