LLM evaluation & red teaming

promptfoo

Open-source and commercial platform for testing prompts, agents, RAG systems, and AI security behavior.

Overview

Category: LLM evaluation & red teaming

Best for: AI teams that need repeatable evaluations, regression tests, and red-team probes before shipping LLM apps.

Example use case: Create an eval suite that compares model answers before and after changing a support chatbot prompt.
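The before/after comparison in this use case can be sketched in a few lines of Python. This is a generic illustration of the regression-eval idea, not promptfoo's own API or configuration format. The prompt templates, the test cases, the `EvalCase` type, and the `fake_model` stand-in are all hypothetical; a real suite would call an actual model and use richer checks than substring matching.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EvalCase:
    question: str
    must_contain: List[str]  # substrings the answer is expected to include


def fake_model(prompt: str) -> str:
    """Hypothetical, deterministic stand-in for a real LLM call."""
    text = prompt.lower()
    parts = []
    if "refund" in text:
        parts.append("Refunds are available within 30 days.")
    if "be polite" in text:
        parts.append("Thank you for contacting support.")
    return " ".join(parts) or "I am not sure."


def run_suite(template: str, cases: List[EvalCase],
              model: Callable[[str], str]) -> Dict[str, bool]:
    """Run every case against one prompt template; map question -> pass/fail."""
    results = {}
    for case in cases:
        answer = model(template.format(question=case.question))
        results[case.question] = all(s in answer for s in case.must_contain)
    return results


def diff_results(before: Dict[str, bool],
                 after: Dict[str, bool]) -> Dict[str, List[str]]:
    """List cases that broke (regressions) or started passing (fixes)."""
    return {
        "regressions": [q for q in before if before[q] and not after[q]],
        "fixes": [q for q in before if not before[q] and after[q]],
    }


# Hypothetical prompt versions and test cases for a support chatbot.
OLD_PROMPT = "You are a support bot. Answer: {question}"
NEW_PROMPT = "You are a support bot. Be polite. Answer: {question}"
CASES = [
    EvalCase("How do I get a refund?", ["30 days"]),
    EvalCase("Where is my order?", ["Thank you"]),
]

if __name__ == "__main__":
    before = run_suite(OLD_PROMPT, CASES, fake_model)
    after = run_suite(NEW_PROMPT, CASES, fake_model)
    print(diff_results(before, after))
```

Keeping the cases fixed and changing only the prompt template is what makes the comparison repeatable: any difference between the two runs can be attributed to the prompt change rather than to a different set of questions.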

Pricing model: Open-source Community version plus paid/enterprise plans; red-team probes and hosted features may have usage limits.

Free plan / trial assessment: The open-source version is free to use, but larger-scale red teaming, hosted features, and enterprise controls are limited without a paid plan.

Limitations: Requires test-case design and engineering integration; eval results are only as good as the suite.

Compared with ChatGPT/Claude: Better for this task, because it provides repeatable automated evals rather than one-off chat judgments.

Alternatives: LangSmith, Braintrust, OpenAI Evals, TruLens