
Generative AI Evaluation Methods

· 3 min read
Hinny Tsang
Data Scientist @ Pollock Asset Management

The evaluation of generative AI models is a complex task that requires a combination of qualitative and quantitative methods. Here are the points I summarized from Google's Machine Learning Operations with Vertex AI tutorial.

In general, the evaluation methods can be categorized into several types:

  1. Binary Evaluation

    • Classify the model's output into one of two classes.
    • e.g. positive or negative sentiment analysis, spam detection, and appropriate content detection.
  2. Category Evaluation

    • Classify the model's output into one of several categories.
    • e.g. topic classification, sentiment analysis with a neutral class, and product rating.
  3. Ranking Evaluation

    • Rank the relative quality of different outputs.
    • e.g. ranking search results, ranking product recommendations, and ranking news articles.
  4. Numerical Evaluation

    • Assign a quantitative score to the model's output (a minimal sketch follows this list).
    • e.g. BLEU, ROUGE, METEOR, perplexity, and F1 score.
  5. Text Evaluation

    • Human evaluation of the generated text.
  6. Multi-task Evaluation

    • Mixture of the above methods.
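
To make the binary and numerical cases concrete, here is a minimal sketch that scores a toy spam-detection run with scikit-learn. The labels are made up for illustration, and scikit-learn is simply my choice here, not something the tutorial prescribes.

```python
# A minimal sketch: scoring a toy binary spam-detection run with scikit-learn.
# The gold labels and predictions below are made up for illustration.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = spam, 0 = not spam (hypothetical gold labels)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # hypothetical model predictions

print(f"accuracy: {accuracy_score(y_true, y_pred):.2f}")  # binary evaluation, 0.75 here
print(f"f1:       {f1_score(y_true, y_pred):.2f}")        # numerical score for the positive class
```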

The choice of evaluation method depends on the specific task and the goals of the evaluation. Below are some common evaluation dimensions, with example metrics for each:

  1. Lexical Similarity

    • Measure the similarity between the model's output and a reference text, based on word overlap, word order, or semantic similarity.
    • e.g. BLEU focuses on n-gram overlap, ROUGE focuses on recall, and METEOR focuses on both precision and recall (see the sketch after this list).
  2. Linguistic Quality

    • Evaluate the fluency, coherence, and grammatical correctness of the generated text.
    • e.g. Perplexity, BLEURT.
  3. Task-Specific Evaluation

    • Evaluate the performance of the model on specific tasks, such as summarization, translation, or question answering.
    • e.g. BLEU for translation, and ROUGE for summarization.
  4. Safety and Fairness

    • Evaluate the model's output for safety and fairness, including bias detection and harmful content detection.
    • e.g. toxicity detection, hate speech detection, and bias detection, often complemented by human evaluation.
  5. Groundedness

    • Evaluate the factual accuracy of the model's output.
    • e.g. fact-checking tools, knowledge-base integration, and human evaluation.
  6. User-Centric Evaluation

    • Focus on the user experience and satisfaction with the model's output.
    • e.g. user surveys.
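
As referenced in the lexical-similarity item above, here is a minimal sketch of computing BLEU and ROUGE. It assumes the Hugging Face `evaluate` library (plus the `rouge_score` package), which is my choice for illustration, not something the tutorial prescribes; the candidate and reference sentences are made up.

```python
# A minimal sketch of lexical-similarity scoring, assuming the Hugging Face
# `evaluate` library (pip install evaluate rouge_score). The sentences are
# made up for illustration.
import evaluate

prediction = "the cat sat on the mat"
reference = "a cat was sitting on the mat"

bleu = evaluate.load("bleu")    # n-gram overlap (precision-oriented)
rouge = evaluate.load("rouge")  # overlap with a recall focus, common for summarization

print(bleu.compute(predictions=[prediction], references=[[reference]]))
print(rouge.compute(predictions=[prediction], references=[reference]))
```

On a single short pair like this, BLEU is often near zero because higher-order n-grams rarely match; in practice these scores are averaged over a whole evaluation corpus.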

Moreover, there are some common evaluation paradigms:

  1. Pointwise Evaluation:

    • Evaluating model behavior in production.
    • Measuring the absolute performance of a single model.
    • Identifying behaviors to prioritize for tuning.
    • Establishing a baseline for model performance.
  2. Pairwise Evaluation:

    • Side-by-side comparison of outputs from two models to decide which is preferred (a win-rate sketch follows this list).
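
In practice, a pairwise evaluation usually boils down to a win rate over per-example preferences. Here is a minimal sketch with made-up judge labels:

```python
# A minimal sketch of the pairwise paradigm: turn per-example preferences
# between model A and model B into a win rate. The labels are made up.
preferences = ["A", "B", "A", "A", "tie", "B", "A"]  # judge's choice per prompt

wins_a = preferences.count("A")
wins_b = preferences.count("B")
ties = preferences.count("tie")

# Ties are commonly split evenly between the two models.
win_rate_a = (wins_a + 0.5 * ties) / len(preferences)
print(f"Model A win rate: {win_rate_a:.2f} ({wins_a} wins, {wins_b} losses, {ties} ties)")
```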

Evaluation methods are not only computation based; they can also be model based, using an LLM as a judge to evaluate model output (e.g. Google's automatic side-by-side evaluation, AutoSxS). In summary, the choice of evaluation method depends on the specific task and the goals of the evaluation. I may talk more about evaluation metrics in the future.
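
To close, here is a rough sketch of the model-based, LLM-as-a-judge approach in the pairwise setting, loosely in the spirit of AutoSxS. It assumes the Vertex AI Python SDK (`google-cloud-aiplatform`) and a Gemini model; the prompt wording, model name, project placeholder, and answer parsing are my own illustration, not Google's actual AutoSxS implementation.

```python
# A sketch of LLM-as-a-judge pairwise evaluation, assuming the Vertex AI
# Python SDK and a Gemini model. Prompt, model name, and parsing are
# illustrative only, not Google's AutoSxS implementation.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")  # placeholder project
judge = GenerativeModel("gemini-1.5-pro")

JUDGE_PROMPT = """You are comparing two answers to the same question.
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}
Which answer is better? Reply with exactly one word: A, B, or tie."""

def pairwise_judgement(question: str, answer_a: str, answer_b: str) -> str:
    prompt = JUDGE_PROMPT.format(question=question, answer_a=answer_a, answer_b=answer_b)
    reply = judge.generate_content(prompt).text.strip().upper()
    return reply if reply in {"A", "B", "TIE"} else "tie"  # fall back on unparseable replies

print(pairwise_judgement(
    "What causes tides?",
    "Tides are mainly caused by the Moon's gravity.",
    "Tides happen because of wind.",
))
```

Run over an evaluation set, the per-example judgements can then be aggregated into a win rate, as in the earlier pairwise sketch.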