# Metrics
Metrics evaluate model outputs across dimensions including accuracy, quality, similarity, faithfulness, and hallucination detection. Each Run can use multiple metrics to assess performance.
The metrics available on Quotient are not a silver bullet, but they offer a starting point for understanding model performance.
Quotient normalizes text before comparison by:
- Converting to lowercase
- Removing stop-words
- Removing punctuation
- Removing extra whitespace
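A minimal sketch of that normalization; the stop-word list here is an assumption, since Quotient's exact list isn't documented on this page:

```python
import string

# Assumed stop-word list; Quotient's actual list is not documented here.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def normalize(text: str) -> str:
    text = text.lower()                                               # lowercase
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    words = [w for w in text.split() if w not in STOP_WORDS]          # remove stop-words
    return " ".join(words)                                            # collapse extra whitespace
```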
## Metric Categories
### Accuracy Metrics
- Exact Match (`exactmatch`): Binary score for exact string matches
- Normalized Exact Match (`normalized_exactmatch`): Binary score for exact matches after normalization
- F1 Score (`f1score`): Word overlap between the model output and expected response (0-1)
- Jaccard Similarity (`jaccard_similarity`): Proportion of shared unique words (0-1)
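As an illustration, Jaccard similarity over shared unique words can be sketched as follows (this assumes both strings were already normalized as described above):

```python
def jaccard_similarity(output: str, expected: str) -> float:
    # Sets of unique words; intersection over union gives the proportion shared.
    a, b = set(output.split()), set(expected.split())
    if not a or not b:
        return 1.0 if a == b else 0.0
    return len(a & b) / len(a | b)
```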
### Syntactic Similarity
- ROUGE: N-gram overlap metrics, often used for summarization tasks (0-1)
  - `rouge1`, `rouge2`: Unigram/bigram overlap
  - `rougeL`: Longest common subsequence
  - `rougeLsum`: Summary-level LCS
- SacreBLEU (`sacrebleu`): Standardized n-gram precision, often used for translation tasks (0-100)
- METEOR (`meteor`): Advanced unigram overlap, often used for translation tasks (0-1)
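These correspond to standard open-source implementations. For reference, SacreBLEU can be computed directly with the `sacrebleu` package (the example strings are illustrative):

```python
import sacrebleu  # pip install sacrebleu

output = "the cat sat on the mat"        # hypothetical model output
expected = "there is a cat on the mat"   # hypothetical expected response

# corpus_bleu takes a list of hypotheses and a list of reference lists,
# and returns a score on the 0-100 scale.
bleu = sacrebleu.corpus_bleu([output], [[expected]])
print(bleu.score)
```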
### Semantic Similarity
- BERTScore (`bertscore`): BERT-based semantic similarity metrics at the word level (0-1)
- Sentence Transformers Similarity (`sentence_tranformers_similarity`): Semantic meaning comparison (-1 to 1)
### Faithfulness & Hallucination
- Knowledge F1 (`knowledge_f1score`): Context-output vocabulary alignment (0-1)
- ROUGE for Context (`rouge_for_context`): Context-output overlap (0-1)
- SelfCheckGPT-NLI (`faithfulness_selfcheckgpt`): Hallucination detection (0-1)
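As a rough sketch of what Knowledge F1 measures: token-level F1 scored against the retrieved context instead of the expected response. Whitespace tokenization is an assumption here:

```python
from collections import Counter

def knowledge_f1(output: str, context: str) -> float:
    # Count-based token overlap between output and context.
    out, ctx = Counter(output.split()), Counter(context.split())
    overlap = sum((out & ctx).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out.values())  # share of output tokens found in context
    recall = overlap / sum(ctx.values())     # share of context tokens found in output
    return 2 * precision * recall / (precision + recall)
```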
### Text Quality
- Verbosity Ratio (`verbosity_ratio`): Model output length relative to the expected response length
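A one-line sketch; whether length is counted in words or characters isn't documented, so word count is assumed:

```python
def verbosity_ratio(output: str, expected: str) -> float:
    # Word counts are an assumption; guard against an empty expected response.
    return len(output.split()) / max(len(expected.split()), 1)
```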
## Using Metrics

### In Code

Metrics are selected when you create a Run, and their results can be viewed through the CLI & SDK.
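As a purely hypothetical sketch of selecting metrics for a Run (the client and method names below are illustrative and may not match the actual Quotient SDK):

```python
# Hypothetical sketch only: create_run and its parameters are
# illustrative, not confirmed Quotient SDK API.
from quotientai import QuotientAI

client = QuotientAI()
run = client.create_run(
    dataset_id="my-dataset",  # hypothetical dataset identifier
    metrics=["f1score", "bertscore", "faithfulness_selfcheckgpt"],
)
print(run.metrics)
```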
## Metric Details
### BERTScore
Uses BERT embeddings to compute semantic similarity. Returns a dictionary of `precision`, `recall`, and `f1` variants.
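Outside Quotient, the underlying scores can be reproduced with the open-source `bert-score` package (the example strings are illustrative):

```python
from bert_score import score  # pip install bert-score

# score() returns precision, recall, and F1 tensors, one entry per
# candidate/reference pair.
P, R, F1 = score(
    ["the model output"],
    ["the expected response"],
    lang="en",
)
print(P.item(), R.item(), F1.item())
```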
Limitations:
- Struggles with rare/out-of-vocabulary words
- May show length bias
- Domain-specific noise
### Sentence Transformer Similarity
Measures semantic meaning similarity using sentence embeddings.
Values:
- -1: Opposite meaning
- 0: Unrelated
- 1: Identical meaning
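A minimal sketch with the `sentence-transformers` package; the model name is an assumption, since Quotient doesn't document which embedding model backs this metric:

```python
from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

# "all-MiniLM-L6-v2" is an assumed model choice for illustration.
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(["the model output", "the expected response"])
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()  # in [-1, 1]
print(similarity)
```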
### F1 Score
Harmonic mean of precision and recall for word overlap (common words between the model output and expected response):
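$$
\mathrm{F1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
$$

where precision is the fraction of words in the model output that also appear in the expected response, and recall is the fraction of words in the expected response that also appear in the model output.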
### ROUGE
Family of recall-focused n-gram overlap metrics. Each variant returns a dictionary of `precision`, `recall`, and `f1` scores.
Types:
- `rouge1`, `rouge2`: N-gram overlap
- `rougeL`: Longest common subsequence
- `rougeLsum`: Summary-level scoring
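The same variants can be computed directly with the open-source `rouge-score` package (the example strings are illustrative):

```python
from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
# score(target, prediction) returns a dict mapping each variant to a
# Score tuple with precision, recall, and fmeasure fields.
scores = scorer.score("the expected response", "the model output")
print(scores["rougeL"].fmeasure)
```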
### SelfCheckGPT Metric
- NLI: Measures output-context consistency (0-1)
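Quotient's exact invocation isn't documented here, but upstream usage of the open-source `selfcheckgpt` package looks roughly like this; note that in the upstream package, higher scores indicate less consistent (more likely hallucinated) sentences:

```python
import torch
from selfcheckgpt.modeling_selfcheck import SelfCheckNLI  # pip install selfcheckgpt

device = "cuda" if torch.cuda.is_available() else "cpu"
selfcheck_nli = SelfCheckNLI(device=device)

# predict() scores each output sentence against the sampled passages.
# The example strings are illustrative.
scores = selfcheck_nli.predict(
    sentences=["The Eiffel Tower is located in Berlin."],
    sampled_passages=[
        "The Eiffel Tower stands in Paris, France.",
        "Paris is home to the Eiffel Tower.",
    ],
)
print(scores)
```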
For implementation details, see the metrics SDK documentation.