This feature is only available to Completion applications
- View a performance report on either the latest production data or data generated by an LLM that meets your specifications
- Understand success on a normalized scale of 1 to 5
An evaluation is a test suite run on your model against data. It allows you to understand model performance at a glance. On the evaluations page, you can either create a new evaluation or access previous ones.
To create an evaluation, navigate to the evaluations page and click New evaluation.
In the popup modal, pick the model versions to compare, the dataset to compare against, and describe the evaluation criteria that you'd like the evaluation to focus on.
The version(s) chosen here are the version(s) that the test cases are run against. If one version is chosen, you will see the raw results for that version in the evaluation report. If two versions are chosen, you'll be able to see the results of those versions side by side.
When evaluations are run, they are run against datasets. These datasets can be autogenerated, uploaded, or derived from existing data. A dataset is essentially a list of test cases that your model will be evaluated against.
To autogenerate a dataset, open the New Evaluation modal on the evaluations page and click the blue stars on the right-hand side. In the Data generator modal, explain what type of data should be generated for each variable, and click Generate data.
Writing good descriptions
Good descriptions for the Data generator describe, in as much detail as possible, what kind of data should appear in the input field. For example, say you have an input field called
- A suboptimal description might be "The name of the user". This isn't specific enough - is it the first name, last name or both?
- A good description could be "The first and last name of the user". The model should be able to produce examples like this.
- A great description might be "The first and last name of the user, in the form LastName, FirstName. Names can be anything, but are most commonly associated with users in the United States." This gives the model a lot of detail to generate data that is specific to your particular use case.
If you like the generated dataset, click Create dataset. Otherwise, edit the instructions and regenerate the data.
Datasets can also be generated manually from points of interest. To generate a dataset manually, go to the Workspaces page and filter for the data you're interested in. In the data table, select the rows that you'd like to be in your dataset and click Add to dataset. You can either add these records to a new dataset or append them to an existing one.
To use an uploaded dataset for evaluations, Gantry has to know how to map the CSV columns to variables specified in a prompt. To do this, name your specified columns with the following format:
inputs.prompt_values.<prompt_variable>. For example, if your version has a prompt variable user_input, then Gantry will look for the column inputs.prompt_values.user_input.
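As a sketch of preparing such a file, the following uses Python's standard csv module to write a dataset whose column name follows the required mapping format (the prompt variable here is user_input, as in the example above; the file name and row contents are illustrative):

```python
import csv

# Column name must follow the mapping format:
#   inputs.prompt_values.<prompt_variable>
# For a prompt variable named "user_input":
COLUMN = "inputs.prompt_values.user_input"

rows = [
    {COLUMN: "How do I reset my password?"},
    {COLUMN: "What are your support hours?"},
]

# Write a CSV whose header Gantry can map back to the prompt variable.
with open("eval_dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=[COLUMN])
    writer.writeheader()
    writer.writerows(rows)
```

Each row then becomes one test case, with its cell value substituted for the prompt variable when the evaluation runs.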
When you run an evaluation, Gantry generates outputs for the chosen versions, and then evaluates them using specified evaluation criteria. These criteria could be either:
- Free-form text. Gantry will run your descriptions of what the output should be through an LLM to determine the score. In this case, since your evaluation criteria are being interpreted by an LLM, it helps to be as specific as possible about how you want them to work.
- A Python function. This can be specified via a custom projection.
When evaluating the output, Gantry will rate it in one of two ways:
- A score between 1 and 5.
- True, False, or Unsure.
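Gantry's custom-projection interface isn't shown here, but conceptually a Python criterion is a function that maps a model output to one of the two rating schemes above. A hypothetical sketch (the function names and the surface-cue heuristics are illustrative, not part of Gantry's API):

```python
def rate_politeness(output: str) -> int:
    """Hypothetical 1-5 criterion based on crude surface cues."""
    score = 3  # neutral baseline
    lowered = output.lower()
    # Reward polite phrasing.
    if any(w in lowered for w in ("please", "thank you", "happy to help")):
        score += 1
    # Penalize a blunt refusal with no apology.
    if any(w in lowered for w in ("no,", "can't", "impossible")) and "sorry" not in lowered:
        score -= 1
    return max(1, min(5, score))

def is_on_topic(output: str, keyword: str):
    """Hypothetical True/False/Unsure criterion."""
    if not output.strip():
        return "Unsure"  # nothing to judge
    return keyword.lower() in output.lower()
```

A real projection would follow Gantry's own function signature; the point is only that the return value lands on one of the two supported rating scales.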
Gantry provides some default evaluation functions that can be used to evaluate your model:
If none of the built in functions meet your needs, you can customize the criteria to your liking:
The evaluation report is a high-level summary of how the model configuration(s) did against the dataset, based on the criteria. From the bottom of the evaluation report page, you can choose to deploy your configuration.
The first part of the evaluation report details the performance breakdown. At a glance, this is the highest-level summary Gantry provides of which model performed better. The bar chart below it shows how many examples from each model received each rating. The chart under that shows the examples individually. To understand why an example got a certain rating, hover over the result.