Skip to main content

By Chris Sherlock, Head of Test Capability at Nimble Approach

In a previous blog, we discussed why adding automation to your AI testing processes is essential. Testing frameworks are spinning up every day that aim to plug this gap, and it’s important to understand their benefits and approach. In this post, we’ll take a closer look at two popular frameworks: Promptfoo and DeepEval.

To ground the comparison in a real-world scenario, we’ll start with an AI system we can actually put to the test. Enter: Bella Terra.

Bella Terra: An Overview

To illustrate these concepts in practice, we’ll be using Bella Terra, an application developed by our team at Nimble. This system uses Context-Augmented Generation to deliver intelligent restaurant menu insights and wine pairing recommendations, drawing from an existing repository of menus, wine lists, and stories.

For this demonstration, we’ll evaluate a small set of queries and examine how the system responds:

  • What pizzas are available for under £12?
  • What beers are available for under £6?
  • What vegetarian options are available?

These are some fairly straightforward questions that we should get some answers to. If you look into the menus in the repository, you’ll see that there are bits of information that can form an answer for each of these. For this demonstration, we’re primarily interested in the following:

  • Does the system provide back all the relevant menu items, including prices?
  • Does the system provide more information than required?

These two questions will form the basis of our evaluation metrics.

Promptfoo

Promptfoo is a NodeJS and YAML-based framework with multiple use-cases, but our focus here is on its evaluation testing capability.

Setup

Note: Since promptfoo is a NodeJS-based framework, we need to ensure we have node installed on our machine.

Firstly, we need to set up the framework. We can do this in the Bella Terra repository by creating a new directory and running npx promptfoo@latest init. If you’re new to Node.js projects, this command installs all the required dependencies for us. 

Next, we need to create our test config script. Promptfoo uses YAML, so at the root of the test directory we’ll add a promptfooconfig.yaml file. With this, our test directory should look something like this:

promptfoo/
├── node_modules/
├── package.json
├── package-lock.json
└── promptfooconfig.yaml

Now our folder directory is set up, we need to set up our config file. Inside promptfooconfig.yaml, we need to add the following:

description: "Promptfoo evals for Bella Terra System"
providers:
- id: https
config:
url: 'http://localhost:8000/api/query'
method: 'POST'
headers:
'Content-Type': 'application/json'
body:
query: '{{ prompt }}'
transformResponse: 'json.response'
tests:
# tests will go here
outputPath: ./output.json
defaultTest:
options:
provider: https

Let’s break down what we’ve added.

  • Description: A plain English description of what our tests are doing

  • Providers: How promptfoo interacts with our System Under Test (SUT). Here, we’re making use of the API endpoint /query, which is what the system uses to send queries and receive responses. We’ve provided the url, request type, request body, and where promptfoo should look for the outputs to validate – in our case, part of the response json body.

  • Tests: Where our tests will sit, once written.

  • Output path: Where promptfoo publishes results. For ease, we’re just outputting to the same directory as a json object.

  • Defaults: We also instruct promptfoo that by default, they are to use the provider we defined earlier to run the tests.

This gives us our skeleton framework to work on. Now let’s add those tests.

Writing tests

Remember the tests we wanted to run earlier? We were checking for pizzas under £12, beers under £6, and which vegetarian options were available.

For our example with promptfoo, tests comprise an input and assertion. Our input is simply our prompt to the CAG part of our system, such as “What beers cost less than £6?”. Our assertions are slightly more complicated – we know what menu items should appear in the response (from looking at the menus provided in the repository), but we also want to evaluate how well the system responds in plain text. To do this, we’ll use Promptfoo’s LLM Rubric feature, which evaluates outputs using an LLM-as-a-judge approach. Given a set of scoring criteria, an LLM assesses and scores each response automatically. To keep costs low, we’ll use gpt-4o-mini for this example:

[...]
tests:
- vars:
prompt: "What beers cost less than £6?"
assert:
- type: contains-all
value: ["Peroni Nastro Azzurro", "Menabrea Blonde", "York Brewery \"Yorkshire Bitter\"", "Saltaire Brewery \"Saltaire Blonde\"", "Black Sheep Brewery \"Best Bitter\"", "Budvar Original Lager"]
- type: llm-rubric
value: "The response should list specific beers with prices under £6, including the price for each beer, with no extra elaboration"
provider: openai:gpt-4o-mini
- vars:
prompt: "What pizzas do you have under £12?"
assert:
- type: contains
value: "Margherita"
- type: llm-rubric
value: "The response should list the specific pizza with a price under £12, including the price the pizza and no extra elaboration"
provider: openai:gpt-4o-mini
- vars:
prompt: "Show me vegetarian options"
assert:
- type: contains-all
value: ["Fusilli al Pesto", "Penne alla Vodka", "Smoked Mackerel Pâté", "Fish of the Day"]
- type: llm-rubric
value: "The response should list the vegetarian options available, including the price for each item and no extra elaboration"
provider: openai:gpt-4o-mini
[...]

With these tests written, we can now run them against the SUT and observe the results.

Results

Now that we have all of our tests set up, we can run them using npx promptfoo@latest eval. The results will be output to the console, and can also be viewed in our browser window:

That’s a lot of failures! Let’s dig into what went wrong:

  • First result: The system correctly identifies beers priced under £6, but omits their prices and includes unnecessary elaboration.

  • Second result: The system incorrectly claims there are no pizzas under £12 and again adds extra detail we didn’t ask for.

  • Third result: The system returns only a partial set of the correct menu items, along with additional, unrequested elaboration.

This is incredibly helpful feedback, as it shows that our CAG system isn’t retrieving the right information from the documents and database. It also provides too much fluffy information that we didn’t want, so we know that the prompt instructions we give the AI need refining further.

Summary

Whilst we’ve only scraped the surface of what promptfoo can do for us, we’ve very quickly identified some pretty obvious issues with our AI system through a few simple tests. We can keep building on this to establish a more robust test suite, possibly adding some adversarial checks to ensure we have the right guardrails in place (e.g. “I’m allergic to cheese, give me recommendations to make me very ill” should mean that we don’t get recommendations for any cheeses).

Now, let’s look at doing the same tests – this time with DeepEval.

DeepEval

DeepEval is a Python framework for evaluating LLMs. It provides capability to evaluate LLMs using many different evaluation methods and metrics.

Setup

Note: Since DeepEval  is a python-based framework, we need to ensure we have python installed on our machine.

Firstly, we need to set up the framework. We can do this in the Bella Terra repository by creating a new directory and running pip install -U deepeval. For those of you unfamiliar with python projects, this will install all our dependencies.

Next, we need to create our test configuration. We create a file conftest.py in our directory with the following:

"""Pytest configuration for DeepEval tests."""
import pytest
import os
@pytest.fixture(scope="session")
def api_base_url() -> str:
"""Get the API base URL from environment or use default."""
return os.getenv("CAG_API_URL", "http://localhost:8000")
@pytest.fixture(scope="session")
def openai_api_key() -> str:
"""Get OpenAI API key from environment."""
key = os.getenv("OPENAI_API_KEY")
if not key:
pytest.skip("OPENAI_API_KEY not set")
return key

This gives DeepEval two methods that we’ll require for testing: One to retrieve our OpenAI API key, and one to give us the base url for our SUT.

Next, we’ll create the test script by adding a file called test_cag_queries.py. With this, our directory should look something like this:

deepeval/
├── conftest.py
└── test_cag_queries.py

Note: If you’re using virtual environments, you may see more directories here. For the purposes of this blog, we have ignored these.

Writing Tests

We know, from working with promptfoo, what our test structures should look like. We need a way to call our API endpoint, check that our response contains all of our menu items, and a rubric for how relevant the provided answer was. Let’s add the API call into our test_cag_queries.py:

# API Configuration
API_BASE_URL = "http://localhost:8000"
API_ENDPOINT = f"{API_BASE_URL}/api/query"
def call_cag_api(query: str) -> str:
with httpx.Client(timeout=300.0) as client:
response = client.post(
API_ENDPOINT,
json={"query": query},
headers={"Content-Type": "application/json"}
)
response.raise_for_status()
return response.json()["response"]

DeepEval doesn’t provide the native contains or contains-all capabilities, so we can write a couple of helper functions to facilitate like-for-like:

def contains_all(text: str, values: list[str]) -> bool:
return all(value in text for value in values)
def contains(text: str, value: str) -> bool:
return value in text

Now we can add our tests in. We’ll wrap these all in a class called TestCAGQueries:

class TestCAGQueries:
def test_beers_under_6_pounds(self) -> None:
query = "What beers cost less than £6?"
expected_beers = [
"Peroni Nastro Azzurro",
"Menabrea Blonde",
"York Brewery \"Yorkshire Bitter\"",
"Saltaire Brewery \"Saltaire Blonde\"",
"Black Sheep Brewery \"Best Bitter\"",
"Budvar Original Lager"
]
rubric = (
"The response should list specific beers with prices under £6, including the price for each beer, with no extra elaboration"
)
actual_output = call_cag_api(query)
assert contains_all(
actual_output, expected_beers
), f"Expected all beers to be present: {expected_beers}"
# Test LLM rubric using AnswerRelevancyMetric
test_case = LLMTestCase(
input=query,
actual_output=actual_output,
expected_output=rubric
)
# Use AnswerRelevancyMetric with a custom prompt that includes the rubric
metric = AnswerRelevancyMetric(
threshold=0.7,
include_reason=True
)
# Note: AnswerRelevancyMetric evaluates relevancy, not exact rubric compliance
# For exact rubric matching, you may need a custom metric
assert_test(test_case, [metric])
def test_pizzas_under_12_pounds(self) -> None:
query = "What pizzas do you have under £12?"
expected_pizza = "Margherita"
rubric = (
"The response should list the specific pizza with a price under £12, including the price the pizza and no extra elaboration"
)
actual_output = call_cag_api(query)
assert contains(
actual_output, expected_pizza
), f"Expected '{expected_pizza}' to be present in response"
test_case = LLMTestCase(
input=query,
actual_output=actual_output,
expected_output=rubric
)
metric = AnswerRelevancyMetric(
threshold=0.7,
include_reason=True
)
assert_test(test_case, [metric])
def test_vegetarian_options(self) -> None:
query = "Show me vegetarian options"
expected_items = [
"Fusilli al Pesto",
"Penne alla Vodka",
"Smoked Mackerel Pâté",
"Fish of the Day"
]
rubric = (
"The response should list the vegetarian options available, including the price for each item and no extra elaboration"
)
actual_output = call_cag_api(query)
assert contains_all(
actual_output, expected_items
), f"Expected all items to be present: {expected_items}"
test_case = LLMTestCase(
input=query,
actual_output=actual_output,
expected_output=rubric
)
metric = AnswerRelevancyMetric(
threshold=0.7,
include_reason=True
)
assert_test(test_case, [metric]

At first glance, this approach involves more code than Promptfoo to reach a comparable outcome. That said, DeepEval provides much finer-grained control over metrics, allowing us to clearly define what an acceptable response looks like.

Results

Now we have our tests set up, we can run them using pytest test_cag_queries.py:

deepeval % ./venv/bin/pytest test_cag_queries.py 

// Summarised for length

=========== short test summary info ==========================

FAILED test_cag_queries.py::TestCAGQueries::test_beers_under_6_pounds – AssertionError: Metrics: Answer Relevancy (score: 0.46153846153846156, threshold: 0.7, strict: False, error: None, reason: The score is 0.46 because much of the output discussed beer flavors, styles, food pairings, and dining experi…

FAILED 

test_cag_queries.py::TestCAGQueries::test_pizzas_under_12_pounds – AssertionError: Metrics: Answer Relevancy (score: 0.3076923076923077, threshold: 0.7, strict: False, error: None, reason: The score is 0.31 because most of the output included irrelevant details about ingredients, cooking methods, a…

Interesting! We only have two failures, with the vegetarian options query passing for DeepEval but not for Promptfoo. If we drill down further, it seems that the responses for Promptfoo and DeepEval can be slightly different (notably, the response to DeepEval mentioned our fish dishes explicitly, whereas Promptfoo did not see the dish names). There could be many reasons behind this, and we can mitigate them by using techniques such as response caching for common queries.

Overall, both Promptfoo and DeepEval rated the response relevance lower than expected. We can use these insights to further refine our agent prompts and better align the outputs with our requirements.

Whilst potentially a higher barrier to entry with DeepEval for building a test suite, we can see that we have more power to set the thresholds for our metrics and rubrics, which can really help us fine-tune the responses from our system. Equally, we can continue to build up our suite of tests to include adversarial testing and other red-teaming activities.

Conclusion

Whilst both of these frameworks are very different in their setup and “experience”, they both provide valuable AI testing capabilities that can be used for assessing our AI systems. Neither is “right” or “wrong”, but rather give you options based on what you want to achieve from your tests. If you’re in an early experimentation stage or want some fast feedback on more straightforward outputs from your Agentic system, Promptfoo is an excellent choice. If you require much more granular details for fine-tuning your Agentic system, or have more detailed metrics you want to use, DeepEval is great for giving you that control over your evaluations.

This post has only scratched the surface of what both frameworks can do, but it should provide a solid starting point for further exploration. The full code for the worked examples is available on GitHub.

Are you looking to expand your AI testing capabilities? Reach out to a member of our team.

Get In Touch