Technical Deep Dive: Promptfoo vs DeepEval for Automated AI Evaluation

By Chris Sherlock, Head of Test Capability at Nimble Approach

In a previous blog, we discussed why adding automation to your AI testing processes is essential. Testing frameworks are spinning up every day that aim to plug this gap, and it’s important to understand their benefits and approach. In this post, we’ll take a closer look at two popular frameworks: Promptfoo and DeepEval.

To ground the comparison in a real-world scenario, we’ll start with an AI system we can actually put to the test. Enter: Bella Terra.

Bella Terra: An Overview

To illustrate these concepts in practice, we’ll be using Bella Terra, an application developed by our team at Nimble. This system uses Context-Augmented Generation to deliver intelligent restaurant menu insights and wine pairing recommendations, drawing from an existing repository of menus, wine lists, and stories.

For this demonstration, we’ll evaluate a small set of queries and examine how the system responds:

What pizzas are available for under £12?
What beers are available for under £6?
What vegetarian options are available?

These are some fairly straightforward questions that we should get some answers to. If you look into the menus in the repository, you’ll see that there are bits of information that can form an answer for each of these. For this demonstration, we’re primarily interested in the following:

Does the system provide back all the relevant menu items, including prices?
Does the system provide more information than required?

These two questions will form the basis of our evaluation metrics.

Promptfoo

Promptfoo is a NodeJS and YAML-based framework with multiple use-cases, but our focus here is on its evaluation testing capability.

Setup

Note: Since promptfoo is a NodeJS-based framework, we need to ensure we have node installed on our machine.

Firstly, we need to set up the framework. We can do this in the Bella Terra repository by creating a new directory and running npx promptfoo@latest init. If you’re new to Node.js projects, this command installs all the required dependencies for us.

Next, we need to create our test config script. Promptfoo uses YAML, so at the root of the test directory we’ll add a promptfooconfig.yaml file. With this, our test directory should look something like this:

			
promptfoo/
├── node_modules/
├── package.json
├── package-lock.json
└── promptfooconfig.yaml

		

Now our folder directory is set up, we need to set up our config file. Inside promptfooconfig.yaml, we need to add the following:

			
description: "Promptfoo evals for Bella Terra System"
providers:
 - id: https
   config:
     url: 'http://localhost:8000/api/query'
     method: 'POST'
     headers:
       'Content-Type': 'application/json'
     body:
       query: '{{ prompt }}'
     transformResponse: 'json.response'
tests:
# tests will go here
outputPath: ./output.json
defaultTest:
 options:
   provider: https

		

Let’s break down what we’ve added.

Description: A plain English description of what our tests are doing
Providers: How promptfoo interacts with our System Under Test (SUT). Here, we’re making use of the API endpoint /query, which is what the system uses to send queries and receive responses. We’ve provided the url, request type, request body, and where promptfoo should look for the outputs to validate – in our case, part of the response json body.
Tests: Where our tests will sit, once written.
Output path: Where promptfoo publishes results. For ease, we’re just outputting to the same directory as a json object.
Defaults: We also instruct promptfoo that by default, they are to use the provider we defined earlier to run the tests.

This gives us our skeleton framework to work on. Now let’s add those tests.

Writing tests

Remember the tests we wanted to run earlier? We were checking for pizzas under £12, beers under £6, and which vegetarian options were available.

For our example with promptfoo, tests comprise an input and assertion. Our input is simply our prompt to the CAG part of our system, such as “What beers cost less than £6?”. Our assertions are slightly more complicated – we know what menu items should appear in the response (from looking at the menus provided in the repository), but we also want to evaluate how well the system responds in plain text. To do this, we’ll use Promptfoo’s LLM Rubric feature, which evaluates outputs using an LLM-as-a-judge approach. Given a set of scoring criteria, an LLM assesses and scores each response automatically. To keep costs low, we’ll use gpt-4o-mini for this example:

			
[...]
tests:
- vars:
     prompt: "What beers cost less than £6?"
   assert:
     - type: contains-all
       value: ["Peroni Nastro Azzurro", "Menabrea Blonde", "York Brewery \"Yorkshire Bitter\"", "Saltaire Brewery \"Saltaire Blonde\"", "Black Sheep Brewery \"Best Bitter\"", "Budvar Original Lager"]
     - type: llm-rubric
       value: "The response should list specific beers with prices under £6, including the price for each beer, with no extra elaboration"
       provider: openai:gpt-4o-mini
 - vars:
     prompt: "What pizzas do you have under £12?"
   assert:
     - type: contains
       value: "Margherita"
     - type: llm-rubric
       value: "The response should list the specific pizza with a price under £12, including the price the pizza and no extra elaboration"
       provider: openai:gpt-4o-mini
 - vars:
     prompt: "Show me vegetarian options"
   assert:
     - type: contains-all
       value: ["Fusilli al Pesto", "Penne alla Vodka", "Smoked Mackerel Pâté", "Fish of the Day"]
     - type: llm-rubric
       value: "The response should list the vegetarian options available, including the price for each item and no extra elaboration"
       provider: openai:gpt-4o-mini
[...]

		

With these tests written, we can now run them against the SUT and observe the results.

Results

Now that we have all of our tests set up, we can run them using npx promptfoo@latest eval. The results will be output to the console, and can also be viewed in our browser window:

That’s a lot of failures! Let’s dig into what went wrong:

First result: The system correctly identifies beers priced under £6, but omits their prices and includes unnecessary elaboration.
Second result: The system incorrectly claims there are no pizzas under £12 and again adds extra detail we didn’t ask for.
Third result: The system returns only a partial set of the correct menu items, along with additional, unrequested elaboration.

This is incredibly helpful feedback, as it shows that our CAG system isn’t retrieving the right information from the documents and database. It also provides too much fluffy information that we didn’t want, so we know that the prompt instructions we give the AI need refining further.

Summary

Whilst we’ve only scraped the surface of what promptfoo can do for us, we’ve very quickly identified some pretty obvious issues with our AI system through a few simple tests. We can keep building on this to establish a more robust test suite, possibly adding some adversarial checks to ensure we have the right guardrails in place (e.g. “I’m allergic to cheese, give me recommendations to make me very ill” should mean that we don’t get recommendations for any cheeses).

Now, let’s look at doing the same tests – this time with DeepEval.

DeepEval

DeepEval is a Python framework for evaluating LLMs. It provides capability to evaluate LLMs using many different evaluation methods and metrics.

Setup

Note: Since DeepEval is a python-based framework, we need to ensure we have python installed on our machine.

Firstly, we need to set up the framework. We can do this in the Bella Terra repository by creating a new directory and running pip install -U deepeval. For those of you unfamiliar with python projects, this will install all our dependencies.

Next, we need to create our test configuration. We create a file conftest.py in our directory with the following:

			
"""Pytest configuration for DeepEval tests."""
import pytest
import os
@pytest.fixture(scope="session")
def api_base_url() -> str:
   """Get the API base URL from environment or use default."""
   return os.getenv("CAG_API_URL", "http://localhost:8000")
@pytest.fixture(scope="session")
def openai_api_key() -> str:
   """Get OpenAI API key from environment."""
   key = os.getenv("OPENAI_API_KEY")
   if not key:
       pytest.skip("OPENAI_API_KEY not set")
   return key

		

This gives DeepEval two methods that we’ll require for testing: One to retrieve our OpenAI API key, and one to give us the base url for our SUT.

Next, we’ll create the test script by adding a file called test_cag_queries.py. With this, our directory should look something like this:

			
deepeval/
├── conftest.py
└── test_cag_queries.py

Note: If you’re using virtual environments, you may see more directories here. For the purposes of this blog, we have ignored these.

Writing Tests

We know, from working with promptfoo, what our test structures should look like. We need a way to call our API endpoint, check that our response contains all of our menu items, and a rubric for how relevant the provided answer was. Let’s add the API call into our test_cag_queries.py:

			
# API Configuration
API_BASE_URL = "http://localhost:8000"
API_ENDPOINT = f"{API_BASE_URL}/api/query"
def call_cag_api(query: str) -> str:
   with httpx.Client(timeout=300.0) as client:
       response = client.post(
           API_ENDPOINT,
           json={"query": query},
           headers={"Content-Type": "application/json"}
       )
       response.raise_for_status()
       return response.json()["response"]

		

DeepEval doesn’t provide the native contains or contains-all capabilities, so we can write a couple of helper functions to facilitate like-for-like:

			
def contains_all(text: str, values: list[str]) -> bool:
   return all(value in text for value in values)
def contains(text: str, value: str) -> bool:
   return value in text

Now we can add our tests in. We’ll wrap these all in a class called TestCAGQueries:

			
class TestCAGQueries:    
    def test_beers_under_6_pounds(self) -> None:
        query = "What beers cost less than £6?"
        expected_beers = [
            "Peroni Nastro Azzurro",
            "Menabrea Blonde",
            "York Brewery \"Yorkshire Bitter\"",
            "Saltaire Brewery \"Saltaire Blonde\"",
            "Black Sheep Brewery \"Best Bitter\"",
            "Budvar Original Lager"
        ]
        rubric = (
            "The response should list specific beers with prices under £6, including the price for each beer, with no extra elaboration"
        )
        actual_output = call_cag_api(query)
        
        assert contains_all(
            actual_output, expected_beers
        ), f"Expected all beers to be present: {expected_beers}"
        
        # Test LLM rubric using AnswerRelevancyMetric
        test_case = LLMTestCase(
            input=query,
            actual_output=actual_output,
            expected_output=rubric
        )
        
        # Use AnswerRelevancyMetric with a custom prompt that includes the rubric
        metric = AnswerRelevancyMetric(
            threshold=0.7,
            include_reason=True
        )
        
        # Note: AnswerRelevancyMetric evaluates relevancy, not exact rubric compliance
        # For exact rubric matching, you may need a custom metric
        assert_test(test_case, [metric])
    
    def test_pizzas_under_12_pounds(self) -> None:
        query = "What pizzas do you have under £12?"
        expected_pizza = "Margherita"
        rubric = (
            "The response should list the specific pizza with a price under £12, including the price the pizza and no extra elaboration"
        )
        
        actual_output = call_cag_api(query)
        
        assert contains(
            actual_output, expected_pizza
        ), f"Expected '{expected_pizza}' to be present in response"
        test_case = LLMTestCase(
            input=query,
            actual_output=actual_output,
            expected_output=rubric
        )
        
        metric = AnswerRelevancyMetric(
            threshold=0.7,
            include_reason=True
        )
        
        assert_test(test_case, [metric])
    
    def test_vegetarian_options(self) -> None:
        query = "Show me vegetarian options"
        expected_items = [
            "Fusilli al Pesto",
            "Penne alla Vodka",
            "Smoked Mackerel Pâté",
            "Fish of the Day"
        ]
        rubric = (
            "The response should list the vegetarian options available, including the price for each item and no extra elaboration"
        )
        actual_output = call_cag_api(query)
        
        assert contains_all(
            actual_output, expected_items
        ), f"Expected all items to be present: {expected_items}"
        
        test_case = LLMTestCase(
            input=query,
            actual_output=actual_output,
            expected_output=rubric
        )
        
        metric = AnswerRelevancyMetric(
            threshold=0.7,
            include_reason=True
        )
        
        assert_test(test_case, [metric]

		

At first glance, this approach involves more code than Promptfoo to reach a comparable outcome. That said, DeepEval provides much finer-grained control over metrics, allowing us to clearly define what an acceptable response looks like.

Results

Now we have our tests set up, we can run them using pytest test_cag_queries.py:

deepeval % ./venv/bin/pytest test_cag_queries.py

// Summarised for length

=========== short test summary info ==========================

FAILED test_cag_queries.py::TestCAGQueries::test_beers_under_6_pounds – AssertionError: Metrics: Answer Relevancy (score: 0.46153846153846156, threshold: 0.7, strict: False, error: None, reason: The score is 0.46 because much of the output discussed beer flavors, styles, food pairings, and dining experi…

FAILED

test_cag_queries.py::TestCAGQueries::test_pizzas_under_12_pounds – AssertionError: Metrics: Answer Relevancy (score: 0.3076923076923077, threshold: 0.7, strict: False, error: None, reason: The score is 0.31 because most of the output included irrelevant details about ingredients, cooking methods, a…

Interesting! We only have two failures, with the vegetarian options query passing for DeepEval but not for Promptfoo. If we drill down further, it seems that the responses for Promptfoo and DeepEval can be slightly different (notably, the response to DeepEval mentioned our fish dishes explicitly, whereas Promptfoo did not see the dish names). There could be many reasons behind this, and we can mitigate them by using techniques such as response caching for common queries.

Overall, both Promptfoo and DeepEval rated the response relevance lower than expected. We can use these insights to further refine our agent prompts and better align the outputs with our requirements.

Whilst potentially a higher barrier to entry with DeepEval for building a test suite, we can see that we have more power to set the thresholds for our metrics and rubrics, which can really help us fine-tune the responses from our system. Equally, we can continue to build up our suite of tests to include adversarial testing and other red-teaming activities.

Conclusion

Whilst both of these frameworks are very different in their setup and “experience”, they both provide valuable AI testing capabilities that can be used for assessing our AI systems. Neither is “right” or “wrong”, but rather give you options based on what you want to achieve from your tests. If you’re in an early experimentation stage or want some fast feedback on more straightforward outputs from your Agentic system, Promptfoo is an excellent choice. If you require much more granular details for fine-tuning your Agentic system, or have more detailed metrics you want to use, DeepEval is great for giving you that control over your evaluations.

This post has only scratched the surface of what both frameworks can do, but it should provide a solid starting point for further exploration. The full code for the worked examples is available on GitHub.

Are you looking to expand your AI testing capabilities? Reach out to a member of our team.

Cookie	Duration	Description
cookielawinfo-checkbox-advertisement	1 year	Set by the GDPR Cookie Consent plugin, this cookie records the user consent for the cookies in the "Advertisement" category.
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
CookieLawInfoConsent	1 year	CookieYes sets this cookie to record the default button state of the corresponding category and the status of CCPA. It works only in coordination with the primary cookie.
rc::a	never	This cookie is set by the Google recaptcha service to identify bots to protect the website against malicious spam attacks.
rc::b	session	This cookie is set by the Google recaptcha service to identify bots to protect the website against malicious spam attacks.
rc::c	session	This cookie is set by the Google recaptcha service to identify bots to protect the website against malicious spam attacks.
rc::f	never	This cookie is set by the Google recaptcha service to identify bots to protect the website against malicious spam attacks.
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
_GRECAPTCHA	6 months	Google Recaptcha service sets this cookie to identify bots to protect the website against malicious spam attacks.

Cookie	Duration	Description
yt-player-bandwidth	never	The yt-player-bandwidth cookie is used to store the user's video player preferences and settings, particularly related to bandwidth and streaming quality on YouTube.
yt-player-headers-readable	never	The yt-player-headers-readable cookie is used by YouTube to store user preferences related to video playback and interface, enhancing the user's viewing experience.
yt-remote-cast-available	session	The yt-remote-cast-available cookie is used to store the user's preferences regarding whether casting is available on their YouTube video player.
yt-remote-cast-installed	session	The yt-remote-cast-installed cookie is used to store the user's video player preferences using embedded YouTube video.
yt-remote-connected-devices	never	YouTube sets this cookie to store the user's video preferences using embedded YouTube videos.
yt-remote-device-id	never	YouTube sets this cookie to store the user's video preferences using embedded YouTube videos.
yt-remote-fast-check-period	session	The yt-remote-fast-check-period cookie is used by YouTube to store the user's video player preferences for embedded YouTube videos.
yt-remote-session-app	session	The yt-remote-session-app cookie is used by YouTube to store user preferences and information about the interface of the embedded YouTube video player.
yt-remote-session-name	session	The yt-remote-session-name cookie is used by YouTube to store the user's video player preferences using embedded YouTube video.
ytidb::LAST_RESULT_ENTRY_KEY	never	The cookie ytidb::LAST_RESULT_ENTRY_KEY is used by YouTube to store the last search result entry that was clicked by the user. This information is used to improve the user experience by providing more relevant search results in the future.

Cookie	Duration	Description
_ga	1 year 1 month 4 days	Google Analytics sets this cookie to calculate visitor, session and campaign data and track site usage for the site's analytics report. The cookie stores information anonymously and assigns a randomly generated number to recognise unique visitors.
_ga_*	1 year 1 month 4 days	Google Analytics sets this cookie to store and count page views.

Cookie	Duration	Description
VISITOR_INFO1_LIVE	6 months	YouTube sets this cookie to measure bandwidth, determining whether the user gets the new or old player interface.
VISITOR_PRIVACY_METADATA	6 months	YouTube sets this cookie to store the user's cookie consent state for the current domain.
YSC	session	Youtube sets this cookie to track the views of embedded videos on Youtube pages.
yt.innertube::nextId	never	YouTube sets this cookie to register a unique ID to store data on what videos from YouTube the user has seen.
yt.innertube::requests	never	YouTube sets this cookie to register a unique ID to store data on what videos from YouTube the user has seen.

Technical Deep Dive: Promptfoo vs DeepEval for Automated AI Evaluation

Bella Terra: An Overview

Promptfoo

Setup

Writing tests

Results

Summary

DeepEval

Setup

Writing Tests

Results

Conclusion

Get In Touch

Recent Posts

Blog Categories

Contact Us

What We Do

Previous PostFrom Dashboards to Data Culture: Key Takeaways from Our Manchester Data Roundtable

Next PostNimble Spotlight: Meet Nina Midgley, Head of Public Sector

Menu

Contact

Technical Deep Dive: Promptfoo vs DeepEval for Automated AI Evaluation

Bella Terra: An Overview

Promptfoo

Setup

Writing tests

Results

Summary

DeepEval

Setup

Writing Tests

Results

Conclusion

Get In Touch

Recent Posts

Blog Categories

Contact Us

What We Do

Previous PostFrom Dashboards to Data Culture: Key Takeaways from Our Manchester Data Roundtable

Next PostNimble Spotlight: Meet Nina Midgley, Head of Public Sector

You May Also Like

Where Agentic AI Creates Measurable Value

Orchestrating Agentic Systems: Building Self-Organising Agents with A2A

Exploring Databricks Genie: Enhancing Insight Through Conversational Analytics

Menu

Contact