Evaluation¶
class
versionhq.task.evaluate.
A Pydantic class to store conditions and results of the evaluation.
Variables¶
Variable | Data Type | Default | Nullable | Description |
---|---|---|---|---|
items | List[InstanceOf[EvaluationItem]] | list() | - | Stores evaluation items. |
eval_by | Any | None | True | Stores the agent that evaluated the output. |
Property¶
Property | Returns | Description |
---|---|---|
aggregate_score | float | Calculates the weighted average of the evaluation scores for the task output. |
suggestion_summary | str | Returns a summary of the suggestions. |
EvaluationItem¶
class
versionhq.task.evaluate.
Variables¶
Variable | Data Type | Default | Nullable | Description |
---|---|---|---|---|
criteria | str | None | False | Stores the evaluation criteria given by the client. |
suggestion | str | None | True | Stores the evaluator agent's suggestion for improvement. |
score | float | None | True | Stores the score on a 0 to 1 scale. |
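Because `suggestion` and `score` are both nullable, code that reads evaluation items should guard against `None`. The following is a minimal sketch of that pattern, not versionhq code: `ItemStandIn`, `needs_attention`, `collect_suggestions`, the `0.6` threshold, and the sample values (including the `clarity` criterion) are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ItemStandIn:
    """Hypothetical stand-in mirroring the three documented fields."""
    criteria: str
    suggestion: Optional[str] = None
    score: Optional[float] = None


def needs_attention(items: list[ItemStandIn], threshold: float = 0.6) -> list[str]:
    """Return criteria whose score is missing or below the (assumed) threshold."""
    return [
        item.criteria
        for item in items
        if item.score is None or item.score < threshold
    ]


def collect_suggestions(items: list[ItemStandIn]) -> list[str]:
    """Return the non-empty suggestions, each prefixed with its criteria."""
    return [f"{item.criteria}: {item.suggestion}" for item in items if item.suggestion]


# Illustrative values only; they are not real evaluation results.
items = [
    ItemStandIn(criteria="uniqueness", suggestion="Add a less common example.", score=0.56),
    ItemStandIn(criteria="audience fit", score=0.70),
    ItemStandIn(criteria="clarity"),  # neither scored nor commented on
]

print(needs_attention(items))
print(collect_suggestions(items))
```

With real results, the same guards would presumably apply to the entries of `res.evaluation.items`, assuming they expose the fields listed in the table above.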
Usage¶
Evaluator agents evaluate the task output against the given criteria and store the results in the TaskOutput object.
import versionhq as vhq
from pydantic import BaseModel

class CustomOutput(BaseModel):
    test1: str
    test2: list[str]

task = vhq.Task(
    description="Research a topic to teach a kid aged 6 about math.",
    response_schema=CustomOutput,
    should_evaluate=True,  # triggers evaluation
    eval_criteria=["uniqueness", "audience fit"],
)

res = task.execute()

assert isinstance(res.evaluation, vhq.Evaluation)
# at least one item was evaluated against one of the requested criteria
assert [item for item in res.evaluation.items if item.criteria in ("uniqueness", "audience fit")]
assert res.evaluation.aggregate_score is not None
assert res.evaluation.suggestion_summary is not None
An Evaluation object provides scores for the given criteria. For example, it might indicate a uniqueness score of 0.56, an audience fit score of 0.70, and an aggregate score of 0.63.
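The figures in this example are consistent with an equally weighted average: (0.56 + 0.70) / 2 = 0.63. The sketch below works through that arithmetic. It is an illustration under stated assumptions rather than the library's implementation: `ScoredItem`, its `weight` field, and `weighted_average` are hypothetical names, and the way versionhq actually assigns weights is not described on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredItem:
    """Hypothetical stand-in for an evaluation item (not the versionhq class)."""
    criteria: str
    score: Optional[float] = None
    weight: float = 1.0  # assumed field; equal weights by default


def weighted_average(items: list[ScoredItem]) -> Optional[float]:
    """Weighted mean of the scores, skipping items that have no score."""
    scored = [item for item in items if item.score is not None]
    total_weight = sum(item.weight for item in scored)
    if total_weight == 0:
        return None  # nothing to average
    return sum(item.score * item.weight for item in scored) / total_weight


items = [
    ScoredItem(criteria="uniqueness", score=0.56),
    ScoredItem(criteria="audience fit", score=0.70),
]

aggregate = weighted_average(items)
print(round(aggregate, 2))
```

Giving one criterion a larger `weight` pulls the aggregate toward that criterion's score, which is the practical difference between a weighted average and a plain mean.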