Evaluate Application

Understand model performance

In the last step, we created an Application that uses an LLM to correct the grammar of user input. In this step, we'll evaluate how well it works.
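As a quick refresher, the Application wraps a prompt template along these lines. This is only a sketch: the exact wording and the user_input field name are assumptions, not something Gantry prescribes.

```python
# Illustrative sketch of the grammar-correction prompt template from the
# previous step. The wording and the "user_input" field name are assumptions.
PROMPT_TEMPLATE = (
    "Rewrite the following text so that it is grammatically correct, "
    "changing as little of the original wording as possible:\n\n"
    "{{user_input}}"
)
```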

Evaluating LLM-based applications can be tricky. Gantry will help us answer three main questions:

  • What data should we use to evaluate?
  • How should we quantify performance on that data?
  • What level of performance constitutes "good enough" to deploy the application to production?

Running an evaluation

Using the left-hand menu, let's navigate to the Evaluations tab and click "Get started".

The evaluation creation modal needs some information:

  • Version(s) to evaluate. We can evaluate a single version or compare performance across multiple versions
  • Dataset to use. The data used to measure the performance of the selected version(s)
  • Criteria. The parameters on which Gantry should judge success

We don't have a dataset yet, so we'll generate one using the "Generate new dataset" button.

Create model test cases

Gantry can use LLMs to create a dataset so that we don't have to create one from scratch. Clicking "Generate new dataset" takes us to the data generator. Here, we can write a description for each of the input fields in our prompt template, press "Generate data" and watch the magic happen.

πŸ“˜

Writing good descriptions

Good descriptions for the data generator describe, in as much detail as possible, what kind of data should appear in the input field. For example, say we have an input field called user_name:

  • A suboptimal description might be "The name of the user". This isn't specific enough: is it the first name, the last name, or both?
  • A good description could be "The first and last name of the user". The model should be able to produce examples like this.
  • A great description might be "The first and last name of the user, in the form LastName, FirstName. Names can be anything, but are most commonly associated with users in the United States." This gives the model a lot of detail to generate data that is specific to your particular use case.

Write a description like "grammatically incorrect sentences" and click "Generate data". Feel free to be more descriptive!
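To make the idea concrete, here is a sketch of a more detailed description and the kind of rows the generator might return for it. Both the description wording and the example sentences are purely illustrative; your generated data will differ.

```python
# A more detailed description for our input field (illustrative wording).
description = (
    "Grammatically incorrect English sentences of one or two clauses, "
    "featuring common mistakes such as subject-verb disagreement, wrong "
    "verb tense, or misused homophones."
)

# Hypothetical examples of what the generator might produce for it.
generated_examples = [
    {"user_input": "Her and me goes to the store yesterday."},
    {"user_input": "The team are playing good, but they was tired."},
    {"user_input": "If I would have knew, I never would of went."},
]
```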

When the results are to our liking, we'll give the dataset a name like eval-data and click "Create dataset".

Creating evaluation criteria

Now that we have an initial evaluation set, we will configure our evaluation criteria. Evaluation criteria tell Gantry how to measure the model's performance on the specified evaluation data.

πŸ“˜

Understanding evaluation criteria

When we run an evaluation, Gantry generates outputs for the chosen versions and then evaluates them using an evaluation model. Under the hood, the evaluation model is itself an LLM. It looks at our prompt template, the input fields, the generated output, and our evaluation criteria to come up with a score between 1 and 5.

Since our evaluation criteria are being interpreted by an LLM, it helps to be as specific as possible about how we want them to work.
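We never call the evaluation model ourselves, but conceptually an LLM judge works roughly like the sketch below. Everything in it, including the judge prompt wording, the call_llm helper, and the response parsing, is assumed for illustration; it is not Gantry's actual implementation.

```python
# Conceptual sketch of LLM-as-judge scoring; NOT Gantry's internal code.
# `call_llm` is a hypothetical helper that sends a prompt to some LLM and
# returns its text response.

JUDGE_PROMPT = """You are grading the output of a language model.

Prompt template: {template}
Input fields: {inputs}
Generated output: {output}
Evaluation criteria: {criteria}

On a scale of 1 to 5, how well does the output satisfy the criteria?
Reply with a single integer on the first line and a one-sentence
explanation on the second line."""


def score_output(template, inputs, output, criteria, call_llm):
    """Ask a judge LLM for a 1-5 score plus an explanation."""
    response = call_llm(JUDGE_PROMPT.format(
        template=template, inputs=inputs, output=output, criteria=criteria))
    score_line, _, explanation = response.partition("\n")
    return int(score_line.strip()), explanation.strip()
```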

Click "Edit" to customize the criteria.

Replace the existing criteria with a single one, "Results should be grammatically correct", then click "Save".

Running the evaluation

Back on the evaluation page, click "Run evaluation" to kick off our eval.

Reading the evaluation report

Back on the Evaluations page, give our new report a few seconds to run. When the status changes to Completed, click it to view the report.

Evaluation overview

The top section provides an overview of the report, including a description of the version(s) evaluated and the criteria used.

We can click "Show more" to view the full details about the version.

Evaluation results

Scroll down to view the evaluation results. We will see a breakdown of the scores for each of the examples in our evaluation set.

If we want to understand why the evaluation model gave an output a given score, we can click on the score to see the model's explanation.

Deployment

If we like what we see, we can deploy the version directly to production from the report. Deployment is discussed further in the next section.

