LLM evaluation & red teaming

promptfoo

Open-source and commercial platform for testing prompts, agents, RAG systems, and AI security behavior.

Overview

Category: LLM evaluation & red teaming

Best for

AI teams that need repeatable evaluations, regression tests, and red-team probes before shipping LLM apps.

Use cases

  • Test prompt changes
  • Evaluate RAG quality
  • Run AI red-team/security checks

Common example

Create an eval suite that compares model answers before and after changing a support chatbot prompt.
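The before/after comparison described above can be sketched in plain Python. This is only an illustration of the idea, not promptfoo's own configuration format or API: the prompts, the test cases, the `fake_model` stand-in, and the helper names `run_suite` and `regressions` are all hypothetical, and a real suite would call an actual model provider instead.

```python
from typing import Callable, Dict, List

# Hypothetical test cases: a user question plus phrases the answer must contain.
TEST_CASES: List[Dict] = [
    {"question": "How do I reset my password?", "must_contain": ["reset link"]},
    {"question": "Can I get a refund?", "must_contain": ["30 days"]},
]

# Hypothetical prompt versions: before and after the change under test.
PROMPT_BEFORE = "You are a support agent. Answer: {question}"
PROMPT_AFTER = "You are a concise, friendly support agent. Answer: {question}"


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline (assumption)."""
    if "password" in prompt:
        return "We will email you a reset link."
    if "concise" in prompt:
        return "Refunds are available within 30 days."
    return "Refunds are available."


def run_suite(prompt_template: str, model: Callable[[str], str]) -> List[bool]:
    """Run every test case against one prompt version; return pass/fail per case."""
    results = []
    for case in TEST_CASES:
        answer = model(prompt_template.format(question=case["question"])).lower()
        results.append(all(p.lower() in answer for p in case["must_contain"]))
    return results


def regressions(before: List[bool], after: List[bool]) -> List[str]:
    """List the questions that passed with the old prompt but fail with the new one."""
    return [
        case["question"]
        for case, b, a in zip(TEST_CASES, before, after)
        if b and not a
    ]


if __name__ == "__main__":
    before = run_suite(PROMPT_BEFORE, fake_model)
    after = run_suite(PROMPT_AFTER, fake_model)
    for case, b, a in zip(TEST_CASES, before, after):
        print(f"{case['question']}: before={'pass' if b else 'fail'}, after={'pass' if a else 'fail'}")
    print("Regressions:", regressions(before, after))
```

Because the same cases run against both prompt versions, any question that flips from pass to fail is flagged as a regression before the new prompt ships.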

Pricing and free plan

Pricing model: Open-source Community version plus paid/enterprise plans; red-team probes and hosted features may have usage limits.

Free plan / trial assessment: The open-source version is free to use, but larger-scale red teaming, hosted features, and enterprise controls are limited without a paid plan.

Limitations

Requires test-case design and engineering integration; eval results are only as good as the suite.

ChatGPT / Claude comparison

Better than ChatGPT/Claude for this task: it provides repeatable automated evals rather than one-off chat judgments.

Alternatives

LangSmith, Braintrust, OpenAI Evals, TruLens