Metrics

Metrics evaluate model outputs across dimensions including accuracy, quality, similarity, faithfulness, and hallucination detection. Each Run can use multiple metrics to assess performance.

The metrics available on Quotient are not a silver bullet, but they are a useful starting point for understanding model performance.

Quotient normalizes text before comparison by:

  • Converting to lowercase
  • Removing stop-words
  • Removing punctuation
  • Removing extra whitespace
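
A minimal Python sketch of these normalization steps might look like the following (the stop-word list shown is illustrative; Quotient's actual list is not documented on this page):

import string

# Illustrative stop-word set; Quotient's actual list may differ.
STOP_WORDS = {"a", "an", "the", "is", "are", "was", "of", "to", "and", "in", "on"}

def normalize(text: str) -> str:
    """Lowercase, strip punctuation and stop-words, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

print(normalize("The answer is:  Paris, France."))  # -> "answer paris france"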

Metric Categories

Accuracy Metrics

  • Exact Match (exactmatch): Binary score for exact string matches
  • Normalized Exact Match (normalized_exactmatch): Binary score for matches after normalization
  • F1 Score (f1score): Word overlap between the model output and expected response (0-1)
  • Jaccard Similarity (jaccard_similarity): Proportion of shared unique words (0-1); see the sketch after this list
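
For intuition, here is a rough sketch of Jaccard similarity and normalized exact match built on the normalize helper sketched above (not Quotient's implementation):

def jaccard_similarity(output: str, expected: str) -> float:
    """Proportion of shared unique words: |A ∩ B| / |A ∪ B|."""
    a, b = set(normalize(output).split()), set(normalize(expected).split())
    return len(a & b) / len(a | b) if (a or b) else 1.0

def normalized_exact_match(output: str, expected: str) -> int:
    """1 if the strings are identical after normalization, else 0."""
    return int(normalize(output) == normalize(expected))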

Syntactic Similarity

  • ROUGE: N-gram overlap metrics, often used for summarization tasks (0-1)
    • rouge1, rouge2: Unigram/bigram overlap
    • rougeL: Longest common subsequence
    • rougeLsum: Summary-level LCS
  • SacreBLEU (sacrebleu): Standardized n-gram precision, often used for translation tasks (0-100); see the sketch after this list
  • METEOR (meteor): Advanced unigram overlap, often used for translation tasks (0-1)
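
To reproduce a comparable BLEU score locally, the open-source sacrebleu package can be used as follows (its configuration may differ from Quotient's metric):

import sacrebleu  # pip install sacrebleu

model_output = "the cat sat on the mat"
expected = "the cat is sitting on the mat"

# sentence_bleu takes one hypothesis and a list of reference strings;
# the resulting score is on a 0-100 scale.
bleu = sacrebleu.sentence_bleu(model_output, [expected])
print(bleu.score)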

Semantic Similarity

  • BERTScore (bertscore): BERT-based semantic similarity scores at the word level (0-1)
  • Sentence Transformers Similarity (sentence_tranformers_similarity): Semantic meaning comparison (-1 to 1)

Faithfulness & Hallucination

  • Knowledge F1 (knowledge_f1score): Context-output vocabulary alignment (0-1); see the sketch after this list
  • ROUGE for Context (rouge_for_context): Context-output overlap (0-1)
  • SelfCheckGPT-NLI (faithfulness_selfcheckgpt): Hallucination detection (0-1)
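
As a rough illustration, Knowledge F1 can be thought of as the token-level F1 computation described under "F1 Score" below, but with the retrieved context in place of the expected response (a sketch, not Quotient's exact implementation):

from collections import Counter

def knowledge_f1(output: str, context: str) -> float:
    """Token-level F1 between the model output and its retrieved context."""
    # In practice, text would first be normalized as described above.
    out_tokens = Counter(output.lower().split())
    ctx_tokens = Counter(context.lower().split())
    overlap = sum((out_tokens & ctx_tokens).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_tokens.values())
    recall = overlap / sum(ctx_tokens.values())
    return 2 * precision * recall / (precision + recall)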

Text Quality

  • Verbosity Ratio (verbosity_ratio): Model output length relative to the expected response; see the sketch after this list
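
A sketch of the ratio, assuming length is measured in words (the unit Quotient uses is not specified on this page):

def verbosity_ratio(output: str, expected: str) -> float:
    """Word-count ratio of model output to expected response; 1.0 means the
    same length, values above 1 mean the output is longer than expected."""
    return len(output.split()) / max(len(expected.split()), 1)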

Using Metrics

In Code

run = quotient.evaluate(
    prompt=prompt,
    dataset=dataset,
    model=model,
    metrics=[
        'bertscore',
        'exactmatch',
        'verbosity_ratio'
    ]
)

Viewing Metrics in the CLI & SDK

# CLI
quotient list metrics
# SDK
quotient.metrics.list()

Metric Details

BERTScore

Uses BERT embeddings to compute semantic similarity. Returns a dictionary of precision, recall, and f1 variants.

Limitations:

  • Struggles with rare/out-of-vocabulary words
  • May show length bias
  • Can be noisy on domain-specific text
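
To compute a comparable score outside Quotient, the open-source bert-score package can be used (the underlying model Quotient runs may differ):

from bert_score import score  # pip install bert-score

candidates = ["The capital of France is Paris."]
references = ["Paris is the capital city of France."]

# Returns per-example precision, recall, and F1 tensors, each in [0, 1].
P, R, F1 = score(candidates, references, lang="en")
print(F1[0].item())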

Sentence Transformer Similarity

Measures semantic meaning similarity using sentence embeddings.

Values:

  • -1: Opposite meaning
  • 0: Unrelated
  • 1: Identical meaning
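
A minimal sketch using the open-source sentence-transformers package (the model name below is illustrative, not necessarily the one Quotient runs):

from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

emb_a = model.encode("The weather is lovely today.", convert_to_tensor=True)
emb_b = model.encode("It is a beautiful, sunny day.", convert_to_tensor=True)

# Cosine similarity in [-1, 1]; values near 1 indicate near-identical meaning.
print(util.cos_sim(emb_a, emb_b).item())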

F1 Score

Harmonic mean of precision and recall for word overlap (common words between the model output and expected response):

precision = |intersection| / |output|
recall = |intersection| / |expected|
F1 = 2 * (precision * recall) / (precision + recall)
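
In Python, a straightforward reading of these formulas (using token counts for the intersection; whether Quotient counts repeated tokens or unique words is not specified here) is:

from collections import Counter

def f1_score(output: str, expected: str) -> float:
    out_tokens = Counter(output.split())
    exp_tokens = Counter(expected.split())
    overlap = sum((out_tokens & exp_tokens).values())   # |intersection|
    if overlap == 0:
        return 0.0
    precision = overlap / sum(out_tokens.values())      # |intersection| / |output|
    recall = overlap / sum(exp_tokens.values())         # |intersection| / |expected|
    return 2 * precision * recall / (precision + recall)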

ROUGE

Family of recall-focused n-gram overlap metrics. Each variant returns a dictionary of precision, recall, and f1 scores.

Types:

  • rouge1, rouge2: Unigram and bigram overlap
  • rougeL: Longest common subsequence
  • rougeLsum: Summary-level scoring
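
The open-source rouge-score package produces scores in the same shape and can serve as a local reference point (its tokenization and stemming settings may differ from Quotient's configuration):

from rouge_score import rouge_scorer  # pip install rouge-score

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
scores = scorer.score(
    target="the cat is sitting on the mat",   # expected response
    prediction="the cat sat on the mat",      # model output
)
# Each entry is a Score(precision, recall, fmeasure) named tuple.
print(scores["rougeL"].fmeasure)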

SelfCheckGPT Metric

  • NLI: Measures output-context consistency (0-1); see the sketch below
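
The general idea (not Quotient's exact pipeline) is to run a natural language inference model with the context as the premise and the model output as the hypothesis, and treat a high contradiction probability as a hallucination signal. A rough sketch with an off-the-shelf MNLI model:

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "microsoft/deberta-large-mnli"  # illustrative NLI model choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

context = "The Eiffel Tower is in Paris and was completed in 1889."
output_sentence = "The Eiffel Tower was completed in 1925."

# Premise = retrieved context, hypothesis = model output sentence.
inputs = tokenizer(context, output_sentence, return_tensors="pt", truncation=True)
with torch.no_grad():
    probs = model(**inputs).logits.softmax(dim=-1)[0]

# Labels come from the model config; a high CONTRADICTION probability
# suggests the output is not supported by the context.
print({model.config.id2label[i]: round(p.item(), 3) for i, p in enumerate(probs)})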

For implementation details, see the metrics SDK documentation.