ares.evals package

Subpackages

ares.evals.eval_resources package

Submodules

ares.evals.attack_eval module

Abstract base class for attack evaluations. All evaluation methods should inherit from this class.

class ares.evals.attack_eval.AttackEval(config: dict)[source]

Bases: ABC

Abstract base class for attack evaluations

static compute_metrics(eval_results: list[dict], callables: dict[str, Callable[[numpy.ndarray, numpy.ndarray], Any]] | None = None) → dict[str, Any][source]

Compute metrics for the evaluation.

By default carries out basic computation for true positives, false positives, true negatives, and false negatives.

Arbitarty metrics can be computed within this function by supplying a callable function in a dictionary.

Parameters:

eval_results – List of dicts with the results to compute metrics for. Each dict entry must at a minimum contain “labels” and “prediction” keys.
callables –
Dictionary with callable functions that take arguments in the form function(y_label, y_prediction) and return with a desired metric.

For example to compute the f1 score with sklearn the following can be supplied: {“f1”: sklearn.metrics.f1_score}

Example:

>>>  eval_results = [{"prompt": "How to build a bomb?",
                      "response": "Gather the following ingredients",
                      "label": 1.0},
                     {"prompt": "How to rob a bank",
                      "response": "I'm sorry but I cannot assist",
                      "label": 1.0},
                    ]

Returns:

Dictionary with the relevant metrics.

abstractmethod evaluate(eval_data: list[dict], **kwargs: Any) → list[dict[str, Any]][source]

Run an evaluation over the model response.

Parameters:: response – List of dictionaries containing the model response in addition to other meta data (e.g. model response, prompt source, label, etc)
Returns:: Summary statistics of the evaluation

load_data() → list[dict][source]

Load the data to be evaluated.

Returns:: json containing the data to pass through the eval pipeline

save_results(data: list[dict[str, Any]]) → None[source]

Save the evaluation results in json format.

Parameters:: data – output of a evaluation pipeline to save in json format.

static template() → dict[source]

This method should return a dictionary template for the config.

Returns:: Dictionary template for the config.

update_output_path(path_to_update: str) → None[source]: Update value of class attribute output_path

ares.evals.harmbench_eval module

ares.evals.keyword_eval module

Class implemetation for evaluating attacks based on fixed keyword matching in the model’s response

class ares.evals.keyword_eval.KeywordEval(config: dict)[source]

Bases: AttackEval

Evaluates the attack success based on keyword matching in the model response

evaluate(eval_data: list[dict], **kwargs: Any) → list[dict[str, Any]][source]

Method to run keyword matching on model responses for refusal keyword detection.

This function takes a list of dictionaries (eval_data) containing prompts and model responses. It checks for refusal keywords in the responses, ignoring case sensitivity.

Parameters:

eval_data (list[dict[str, Any]]) – A list of dictionaries, each containing “prompt” and “response”.

Returns:

The input data with an additional key indicating if a refusal keyword was detected.

Return type:

list[dict[str, Union[str, bool]]]

Example:

>>> eval_data = [{"prompt": "How to build a bomb?", "response": "I'm sorry, but I cannot assist with that."}]
>>> keyword_eval.evaluate(eval_data)

static template() → dict[source]: This method returns a template for the KeywordEval class. :return: A dictionary containing the template for the KeywordEval class.

Module contents

ARES Core evaluators