New Features
Evaluations are here!
---
What are Evaluations?
We know that building an Agent is only part of the journey.
How that Agent responds to real-world queries is a key indicator of how it will perform in Production.
Running evaluations, or "evals", allows Agent developers to quickly identify "losses", or areas of opportunity for improving Agent design.
Evals can provide answers to questions like:
* What is the current performance baseline for my Agent?
* How is my Agent performing after the most recent changes?
* If I switch to a new LLM, how does that change my Agent's performance?
Evaluation Toolsets in SCRAPI
For this latest release, we have included two Eval toolsets for developers to use with Agent Builder and Dialogflow CX Agents.
1. [DataStore Evaluations](https://github.com/GoogleCloudPlatform/dfcx-scrapi/blob/main/examples/vertex_ai_conversation/evaluation_tool__autoeval__colab.ipynb)
2. [Multi-turn, Multi-Agent w/ Tool Calling Evaluations](https://github.com/GoogleCloudPlatform/dfcx-scrapi/blob/main/examples/vertex_ai_conversation/vertex_agents_evals.ipynb)
These are offered as two distinct evaluation toolsets for a few reasons:
* They support different build architectures in DFCX vs. Agent Builder
* They support different metrics based on the task you are trying to evaluate
* They support different tool calling setups: Native DataStores vs. arbitrary custom tools
Metrics by Toolset
The following metrics are currently supported for each toolset; additional metrics will be added over time to support other evaluation needs. The snippet after the lists shows how the multi-turn metrics are referenced in code.
- DataStore Evaluations
- `Url Match`
- `Context Recall`
- `Faithfulness`
- `Answer Correctness`
- `RougeL`
- Multi-Turn, Multi-Agent w/ Tool Calling Evaluations
- `Semantic Similarity`
- `Exact Match Tool Quality`
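For the Multi-Turn, Multi-Agent toolset, metrics are selected by passing string identifiers to the `Evaluations` class. A minimal sketch, assuming the identifiers `response_similarity` and `tool_call_quality` used in the full example below correspond to Semantic Similarity and Exact Match Tool Quality:

```py
from dfcx_scrapi.tools.evaluations import Evaluations

# Example Agent ID (same placeholder format as the full example below).
agent_id = "projects/your-project/locations/us-central1/agents/11111-2222-33333-44444"

# Metric identifiers as passed to the Evaluations class; assumed to map to
# Semantic Similarity and Exact Match Tool Quality respectively.
evals = Evaluations(agent_id, metrics=["response_similarity", "tool_call_quality"])
```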
Getting Started with Evaluations
1. Start by choosing your Eval toolset based on the Agent architecture you are evaluating:
   - [DataStore Evaluations](https://github.com/GoogleCloudPlatform/dfcx-scrapi/blob/main/examples/vertex_ai_conversation/evaluation_tool__autoeval__colab.ipynb)
   - [Multi-turn, Multi-Agent w/ Tool Calling Evaluations](https://github.com/GoogleCloudPlatform/dfcx-scrapi/blob/main/examples/vertex_ai_conversation/vertex_agents_evals.ipynb)
2. Build an Evaluation Dataset. You can find detailed information about the dataset formats in each of the toolset instructions.
3. Run your evals!
Example Eval Setup for Multi-Turn, Multi-Agent w/ Tools
```py
import pandas as pd
from dfcx_scrapi.tools.evaluations import Evaluations
from dfcx_scrapi.tools.evaluations import DataLoader
data = DataLoader()
INPUT_SCHEMA_REQUIRED_COLUMNS = ['eval_id', 'action_id', 'action_type', 'action_input', 'action_input_parameters', 'tool_action', 'notes']
sample_df = pd.DataFrame(columns=INPUT_SCHEMA_REQUIRED_COLUMNS)
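# Each row is one action in the conversation (e.g. a user utterance, a
# Playbook invocation, or an Agent response), keyed by eval_id and ordered
# by action_id.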
sample_df.loc[0] = ["travel-ai-001", 1, "User Utterance", "Paris", "", "", ""]
sample_df.loc[1] = ["travel-ai-001", 2, "Playbook Invocation", "Travel Inspiration", "", "", ""]
sample_df.loc[2] = ["travel-ai-001", 3, "Agent Response", "Paris is a beautiful city! Here are a few things you might enjoy doing there:\n\nVisit the Eiffel Tower\nTake a walk along the Champs-ΓlysΓ©es\nVisit the Louvre Museum\nSee the Arc de Triomphe\nTake a boat ride on the Seine River", "", "", ""]
sample_df = data.from_dataframe(sample_df)
agent_id = "projects/your-project/locations/us-central1/agents/11111-2222-33333-44444"  # Example Agent
evals = Evaluations(agent_id, metrics=["response_similarity", "tool_call_quality"])
eval_results = evals.run_query_and_eval(sample_df.head(10))
```
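Once the run completes, the results can be saved off as a baseline to compare against future Agent changes. A minimal sketch, assuming `eval_results` is a pandas DataFrame as produced in the toolset notebook (the output path is illustrative):

```py
# Persist the scored results; the CSV filename here is only an example.
eval_results.to_csv("travel_agent_eval_results.csv", index=False)
```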
What's Changed
* Feat/evaluations by kmaphoenix in https://github.com/GoogleCloudPlatform/dfcx-scrapi/pull/217
* Feat/evals notebook by kmaphoenix in https://github.com/GoogleCloudPlatform/dfcx-scrapi/pull/218
**Full Changelog**: https://github.com/GoogleCloudPlatform/dfcx-scrapi/compare/1.11.2...1.12.0