Easily find failure cases, underperforming cohorts, and novel user behaviors. Add these interesting cases to an evaluation dataset to progressively validate your prompts on the latest production data.
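The curation loop above can be sketched in plain Python. The log records and the `is_failure` heuristic below are hypothetical stand-ins, not Gantry's API:

```python
# A minimal sketch of curating an evaluation dataset from production logs.
# The records and failure heuristic are illustrative assumptions.

production_logs = [
    {"input": "Translate 'hello' to French", "output": "bonjour", "feedback": 1},
    {"input": "Summarize this contract", "output": "I can't help with that", "feedback": -1},
    {"input": "Write a haiku about rain", "output": "", "feedback": 0},
]

def is_failure(record):
    """Flag empty outputs or negative user feedback as interesting cases."""
    return record["output"] == "" or record["feedback"] < 0

# Add the interesting cases to an evaluation dataset for future prompt validation.
eval_dataset = [r for r in production_logs if is_failure(r)]
print(len(eval_dataset))  # 2 failure cases captured
```

In practice the heuristic would be replaced by whatever cohort or feedback filter surfaced the underperforming slice.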
Collaborate with your team on prompt development and version control your LLM-powered features.
Quantify the performance of your prompts. Test your prompt on evaluation data gathered from production or generated automatically using an LLM. Specify criteria and success metrics, then compare performance between versions in an evaluation report.
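A version-comparison report reduces to running each prompt version over the same evaluation set and computing the success metric. The prompts, criterion, and stand-in model below are illustrative assumptions, not Gantry's evaluation API:

```python
# Sketch: score two prompt versions against one evaluation set and compare.
# `run_prompt` is a fake model call; v2 stands in for an improved prompt.

eval_set = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "3 * 3", "expected": "9"},
]

def run_prompt(version, item):
    """Hypothetical LLM call; v1 only handles addition, v2 handles both."""
    if version == "v1":
        return "4" if "+" in item["input"] else "?"
    return str(eval(item["input"]))  # trusted arithmetic strings only

def success_rate(version):
    """Success criterion: exact match against the expected answer."""
    hits = sum(run_prompt(version, it) == it["expected"] for it in eval_set)
    return hits / len(eval_set)

report = {v: success_rate(v) for v in ("v1", "v2")}
print(report)  # {'v1': 0.5, 'v2': 1.0}
```

An exact-match criterion is the simplest case; model-graded or regex criteria slot into `success_rate` the same way.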
Keep your production app up to date without redeploying. Tag your production and test-environment prompts, and set up your backend code to always pull the latest tagged version from Gantry.
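The tag-resolution pattern can be sketched locally. The registry dict below stands in for a hosted prompt store; the names are hypothetical, not Gantry's API:

```python
# Sketch of tag-based prompt resolution: the backend looks up the prompt by
# tag at request time, so moving the tag ships a new prompt without a deploy.
# `prompt_registry` and `tags` are illustrative stand-ins for a hosted store.

prompt_registry = {
    "summarize:v1": "Summarize the following text:",
    "summarize:v2": "Summarize the following text in three bullet points:",
}
tags = {"production": "summarize:v1", "test": "summarize:v2"}

def get_prompt(env):
    """Resolve the tag at call time, never at import time."""
    return prompt_registry[tags[env]]

assert get_prompt("production") == prompt_registry["summarize:v1"]
tags["production"] = "summarize:v2"  # promote v2: no redeploy needed
assert "bullet points" in get_prompt("production")
```

The key design choice is that `get_prompt` resolves the tag on every call, so a tag update takes effect on the next request.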
Once your app is live, configure alerts on the metrics you choose to detect performance degradations, data drift, and data quality issues. Send production data (inputs, outputs, predictions, and implicit and explicit user feedback) back to Gantry to make sense of the full picture.
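The feedback loop can be sketched as logging each request and checking a chosen metric against a threshold. The record shape, metric, and threshold below are illustrative assumptions, not a Gantry configuration:

```python
# Sketch of a production feedback loop: log each request's input, output, and
# user feedback, then alert when the chosen metric degrades past a threshold.

records = []

def log_event(inp, out, feedback):
    """Send one production record back for monitoring (1 = thumbs up)."""
    records.append({"input": inp, "output": out, "feedback": feedback})

def positive_feedback_rate():
    return sum(r["feedback"] for r in records) / len(records)

def check_alert(threshold=0.5):
    """Fire when the positive-feedback rate drops below the threshold."""
    return positive_feedback_rate() < threshold

log_event("q1", "a1", 1)
log_event("q2", "a2", 0)
log_event("q3", "a3", 0)
print(check_alert())  # True: rate ~0.33 is below the 0.5 threshold
```

Drift and data-quality checks follow the same shape: compute a statistic over recent records and compare it to a baseline.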
Gantry is currently invite-only. Contact us to get access.