Evaluate Models

Evaluate configuration versions with Gantry to:

  1. View an LLM-generated performance report on either the latest production data or data that meets your specifications
  2. Understand success on a normalized scale of 1 to 5

An evaluation is a test suite run of your model against data. It allows you to understand model performance at a glance.

On the evaluations page, you can either create a new evaluation or access previous ones.

Create an evaluation

To create an evaluation, navigate to the evaluations page and click New evaluation.

In the popup modal, pick the model versions to compare and the dataset to run them against, then describe the evaluation criteria that you'd like the evaluation to focus on.

Choosing versions

The version or versions chosen here are the ones that the test cases are run against. If one version is chosen, you will see the raw results for that version in the evaluation report. If two versions are chosen, you'll see their results side by side.

Choosing a dataset

Evaluations are run against datasets. These datasets can either be autogenerated or derived from existing data. A dataset is essentially a list of test cases that your model will be evaluated against.
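
To make the idea of a dataset as a list of test cases concrete, here is a minimal Python sketch. The `EvalCase` class, the `user_name` and `question` fields, and the prompt template are all hypothetical names chosen for illustration; they are not part of Gantry's API.

```python
from dataclasses import dataclass


@dataclass
class EvalCase:
    """One test case: a value for each input variable (hypothetical structure)."""
    inputs: dict[str, str]


# A dataset is an ordered collection of test cases.
dataset = [
    EvalCase(inputs={"user_name": "Smith, Jane", "question": "How do I reset my password?"}),
    EvalCase(inputs={"user_name": "Garcia, Luis", "question": "Where can I download my invoice?"}),
]

# Each test case fills the variables of a prompt template (also hypothetical).
template = "Answer the question from {user_name}: {question}"
prompts = [template.format(**case.inputs) for case in dataset]
```

Every version under evaluation would receive the same filled-in prompts, which is what makes results comparable across versions.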

Autogenerating a dataset

From the New evaluation modal on the evaluations page, click the blue stars on the right-hand side.

In the Data generator modal, explain what type of data should be generated for each variable, and click Generate data.


Writing good descriptions

Good descriptions for the Data generator describe, in as much detail as possible, what kind of data should appear in the input field. For example, say you have an input field called user_name:

  • A suboptimal description might be "The name of the user". This isn't specific enough - is it the first name, last name or both?
  • A good description could be "The first and last name of the user". The model should be able to produce examples like this.
  • A great description might be "The first and last name of the user, in the form LastName, FirstName. Names can be anything, but are most commonly associated with users in the United States." This gives the model a lot of detail to generate data that is specific to your particular use case.
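
One benefit of a precise description such as "LastName, FirstName" is that the generated data becomes easy to spot-check. The following is a minimal sketch of such a check, assuming you have copied the generated `user_name` values into a Python list; the function names and the pattern are illustrative and not part of Gantry.

```python
import re

# Hypothetical pattern for the "LastName, FirstName" form:
# some text without commas, then a comma and a single space, then more text without commas.
NAME_PATTERN = re.compile(r"[^\s,][^,]*, [^\s,][^,]*")


def matches_name_format(value: str) -> bool:
    """Return True if the whole value looks like 'LastName, FirstName'."""
    return NAME_PATTERN.fullmatch(value) is not None


def find_mismatches(values: list[str]) -> list[str]:
    """Return the generated values that do not follow the requested form."""
    return [value for value in values if not matches_name_format(value)]


generated = ["Smith, Jane", "Jane Smith", "Garcia, Luis"]
mismatches = find_mismatches(generated)
```

If `mismatches` is not empty, that is a hint that the description could be made more specific before regenerating.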

If you like the generated dataset, you can click Create dataset. Otherwise, edit the instructions and regenerate the data.

Manually curating a dataset

Datasets can also be generated manually from points of interest. To generate a dataset manually, go to the Analytics page and filter for the data you're interested in. In the data table, select the rows that you'd like to be in your dataset and click Add to dataset. You can either add these records to a new dataset or append them to an existing one.

Determining evaluation criteria

When you run an evaluation, Gantry generates outputs for the chosen versions, and then evaluates them using an evaluation model. Under the hood, the evaluation model is itself an LLM. It looks at your prompt template, the input fields, the generated output, and your evaluation criteria to come up with a score between 1 and 5. Since your evaluation criteria are interpreted by an LLM, it helps to be as specific as possible about what you want them to measure.
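
To illustrate why specific criteria matter, here is a rough sketch of how an LLM-based evaluator could be given the four pieces of context described above and how a 1 to 5 score could be read from its reply. This is not Gantry's actual implementation; the prompt wording, function names, and parsing rule are assumptions made for illustration, and the call to the LLM itself is left out.

```python
import re


def build_evaluation_prompt(prompt_template: str, inputs: dict[str, str],
                            output: str, criteria: str) -> str:
    """Assemble the context an evaluation model would see (hypothetical wording)."""
    input_lines = "\n".join(f"- {name}: {value}" for name, value in inputs.items())
    return (
        "You are grading the output of a language model.\n\n"
        f"Prompt template:\n{prompt_template}\n\n"
        f"Input fields:\n{input_lines}\n\n"
        f"Generated output:\n{output}\n\n"
        f"Evaluation criteria:\n{criteria}\n\n"
        "Reply with a single integer score from 1 (worst) to 5 (best)."
    )


def parse_score(reply: str) -> int:
    """Extract the first standalone digit from 1 to 5 in the evaluator's reply."""
    match = re.search(r"\b([1-5])\b", reply)
    if match is None:
        raise ValueError(f"no score between 1 and 5 found in: {reply!r}")
    return int(match.group(1))
```

Because the criteria are pasted into the evaluator's context as plain text, a vague criterion such as "be good" leaves the score open to interpretation, while "address the user by last name and answer in under three sentences" gives the evaluator something it can check.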

Understand the evaluation report

The evaluation report is a high-level summary of how the model(s) did against the dataset, based on the criteria. From the bottom of the evaluation report page, you can choose to Deploy your model.
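
As a rough illustration of what a side-by-side summary of two versions amounts to, the sketch below averages per-test-case scores for each version. The version names and scores are made-up sample values, and the report itself may summarize results differently.

```python
from statistics import mean

# Hypothetical per-test-case scores (1 to 5) for two versions on the same dataset.
scores = {
    "version_a": [4, 5, 3, 4],
    "version_b": [5, 5, 4, 4],
}


def summarize(scores_by_version: dict[str, list[int]]) -> dict[str, float]:
    """Return the mean score for each version, rounded to two decimals."""
    return {
        version: round(mean(values), 2)
        for version, values in scores_by_version.items()
    }


summary = summarize(scores)
for version, average in summary.items():
    print(f"{version}: {average:.2f}")
```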