This feature is only available to Completion applications.

Evaluate configuration versions with Gantry to:

  1. View a performance report on either the latest production data or data generated by an LLM that meets your specifications
  2. Understand success on a normalized scale of 1 to 5
Example evaluation report

An evaluation is a test suite run on your model against data. It allows you to understand model performance at a glance. On the evaluations page, you can either create a new evaluation or access previous ones.

Create an evaluation

To create an evaluation, navigate to the evaluations page and click New evaluation.

In the popup modal, pick the model versions to compare and the dataset to compare against, then describe the evaluation criteria that you'd like the evaluation to focus on.

Choosing versions

The version(s) chosen here are the version(s) that the test cases are run against. If one version is chosen, you will see the raw results for that version in the evaluation report. If two versions are chosen, you'll see the results of both versions side by side.

Choosing a dataset

Evaluations are run against datasets. A dataset is essentially a list of test cases that your model will be evaluated against. Datasets can be autogenerated, uploaded, or derived from existing data.

Autogenerating a dataset

From the New evaluation modal on the evaluations page, click on the blue stars on the right-hand side.

In the Data generator modal, explain what type of data should be generated for each variable, and click generate data.


Writing good descriptions

Good descriptions for the Data generator describe, in as much detail as possible, what kind of data should appear in the input field. For example, say you have an input field called user_name:

  • A suboptimal description might be "The name of the user". This isn't specific enough - is it the first name, last name or both?
  • A good description could be "The first and last name of the user". The model should be able to produce examples like this.
  • A great description might be "The first and last name of the user, in the form LastName, FirstName. Names can be anything, but are most commonly associated with users in the United States." This gives the model a lot of detail to generate data that is specific to your particular use case.
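To spot-check generated values against a description like the "great" one above, a short script can help. The following is a sketch and not part of Gantry: the regular expression and the function matches_last_first are assumptions about what "LastName, FirstName" means in practice, and the sample values are made up.

```python
import re

# Assumed interpretation of "LastName, FirstName": two name parts separated by
# a comma and a space, each made of letters, apostrophes, hyphens, or spaces.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z' -]*, [A-Za-z][A-Za-z' -]*")


def matches_last_first(value: str) -> bool:
    """Return True if the whole value fits the assumed name format."""
    return NAME_PATTERN.fullmatch(value) is not None


# Illustrative generated values; the second one ignores the requested format.
samples = ["Smith, Jane", "Jane Smith", "O'Neil, Mary-Ann"]
mismatches = [s for s in samples if not matches_last_first(s)]
```

If many values end up in mismatches, that is a hint the description may need more detail before regenerating.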

If you like the generated dataset, you can click Create dataset. Otherwise, edit the instructions and regenerate the data.

Manually curating a dataset

Datasets can also be generated manually from points of interest. To generate a dataset manually, go to the Workspaces page and filter for the data you're interested in. In the data table, select the rows that you'd like to be in your dataset and click Add to dataset. You can either add these records to a new dataset or append them to an existing one.

Uploading a CSV as a dataset

To use an uploaded dataset for evaluations, Gantry has to know how to map the CSV columns to variables specified in a prompt. To do this, name your specified columns with the following format: inputs.prompt_values.<prompt_variable>. For example, if your version has a prompt variable user_input, then Gantry will look for the inputs.prompt_values.user_input column.
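As a minimal sketch of that naming convention, the Python snippet below builds a small CSV whose header follows the inputs.prompt_values.<prompt_variable> format. The helper names prompt_column and build_csv and the sample rows are illustrative assumptions; only the column prefix and the user_input variable come from the text above.

```python
import csv
import io

PREFIX = "inputs.prompt_values."  # column prefix described above


def prompt_column(variable: str) -> str:
    """Return the CSV column name for a prompt variable (hypothetical helper)."""
    return f"{PREFIX}{variable}"


def build_csv(rows: list[dict[str, str]]) -> str:
    """Serialize test cases to CSV text, renaming each key to its prefixed column."""
    variables = list(rows[0].keys())
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([prompt_column(v) for v in variables])
    for row in rows:
        writer.writerow([row[v] for v in variables])
    return buffer.getvalue()


# Illustrative test cases; the values are made up for this example.
rows = [
    {"user_input": "How do I reset my password?"},
    {"user_input": "Where can I download my invoice?"},
]
csv_text = build_csv(rows)
```

Writing csv_text to a file gives a CSV with one header row and one row per test case, ready to upload.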

Determining evaluation criteria

When you run an evaluation, Gantry generates outputs for the chosen versions and then evaluates them using the specified evaluation criteria. These criteria can be either:

  1. Free-form text. Gantry will run your description of what the output should be through an LLM to determine the score. Because an LLM interprets these criteria, it helps to be as specific as possible about how you want them to work.
  2. A Python function. This can be specified via a custom projection.

When evaluating the output, Gantry will rate it in one of two ways:

  1. A score between 1 and 5.
  2. True, False, or Unsure.
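To make the two rating styles concrete, here is a hedged sketch of what Python criteria could look like. The function names, signatures, and thresholds are hypothetical and are not Gantry's custom projection API; refer to the custom projection documentation for how such functions are actually registered. Using None to stand for Unsure is also an assumption.

```python
from typing import Optional


def brevity_score(output: str, max_words: int = 50) -> int:
    """Rate brevity on a 1-5 scale (hypothetical criterion).

    Outputs within max_words score 5. Outputs over the limit start at 4
    and lose one more point per additional 50% over, down to 1.
    """
    words = len(output.split())
    if words <= max_words:
        return 5
    overage = (words - max_words) / max_words  # fraction over the limit
    return max(1, 4 - int(overage / 0.5))


def mentions_refund(output: str) -> Optional[bool]:
    """Return True, False, or None for Unsure (hypothetical criterion)."""
    text = output.strip().lower()
    if not text:
        return None  # nothing to judge, so report Unsure
    return "refund" in text
```

Deterministic checks like these suit criteria that can be stated precisely; fuzzier criteria are a better fit for free-form text.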

Gantry provides some default evaluation functions that can be used to evaluate your model.

If none of the built-in functions meet your needs, you can customize the criteria to your liking.

Understand the evaluation report

The evaluation report is a high-level summary of how the model configuration(s) performed against the dataset, based on the criteria. From the bottom of the evaluation report page, you can choose to deploy your configuration.

The first part of the evaluation report details the performance breakdown. This is the highest-level summary Gantry provides of which model performed better. The bar chart below it shows how many examples from each model received each rating. The chart under that shows the examples individually. To understand why an example received a certain rating, hover over the result.
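The per-rating counts behind a bar chart like this can be sketched in a few lines of Python. The example assumes a hypothetical list of (version, rating) pairs with made-up values; it does not reflect any Gantry export format.

```python
from collections import Counter

# Hypothetical results: (version name, rating on the 1-5 scale).
results = [
    ("v1", 5), ("v1", 3), ("v1", 5), ("v1", 2),
    ("v2", 4), ("v2", 4), ("v2", 5), ("v2", 1),
]


def rating_counts(pairs):
    """Count how many examples each version received at each rating."""
    counts = {}
    for version, rating in pairs:
        counts.setdefault(version, Counter())[rating] += 1
    return counts


counts = rating_counts(results)
for version in sorted(counts):
    # A Counter returns 0 for ratings that never occurred.
    row = ", ".join(f"{r}: {counts[version][r]}" for r in range(1, 6))
    print(f"{version} -> {row}")
```

Comparing the two rows side by side mirrors what the bar chart conveys: which version collects more of the high ratings.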