Evaluations - Open WebUI

Evaluations in Open WebUI combine user feedback collection with model comparison. Users can rate model responses and compare models side by side in arena mode; administrators can view performance leaderboards with Elo ratings.

Overview

The evaluation system provides:

User feedback collection on model responses
Model comparison arena for head-to-head evaluations
Elo-based leaderboard ranking system
Topic-specific model performance analysis
Feedback export and management
Historical performance tracking

Evaluations help identify which models perform best for different use cases and provide valuable feedback for model improvement.

Feedback System

Creating Feedback

Users can submit feedback on model responses:

import requests

# Placeholder IDs for illustration; use the chat and message being rated
chat_id = "your-chat-id"
message_id = "your-message-id"

url = "http://localhost:8080/api/evaluations/feedback"
payload = {
    "type": "rating",
    "data": {
        "model_id": "gpt-4",
        "rating": "1",  # "1" for positive, "-1" for negative (string values)
        "sibling_model_ids": ["claude-3", "llama-2"],  # For comparisons
        "tags": ["coding", "python"]  # Topic tags
    },
    "meta": {
        "chat_id": chat_id,
        "message_id": message_id
    }
}

response = requests.post(url, json=payload)
response.raise_for_status()  # Fail early on HTTP errors
feedback = response.json()

Feedback Structure

Feedback entries contain:

type: Type of feedback (e.g., “rating”, “comment”)
data: Feedback details including:
- model_id: The evaluated model
- rating: “1” (positive) or “-1” (negative)
- sibling_model_ids: Other models in the comparison
- tags: Topic/category tags for context
meta: Additional metadata (chat, message, timestamp)

Viewing User Feedback

Retrieve feedback submitted by the current user:

import requests

response = requests.get("http://localhost:8080/api/evaluations/feedbacks/user")
feedbacks = response.json()

for feedback in feedbacks:
    print(f"Model: {feedback['data']['model_id']}")
    print(f"Rating: {feedback['data']['rating']}")
    print(f"Date: {feedback['created_at']}")

Managing Feedback

Get Specific Feedback

import requests

feedback_id = "your-feedback-id"  # Placeholder: ID of the feedback entry to fetch

response = requests.get(
    f"http://localhost:8080/api/evaluations/feedback/{feedback_id}"
)
feedback = response.json()

Update Feedback

Users can update their own feedback:

import requests

feedback_id = "your-feedback-id"  # Placeholder: ID of the feedback entry to update

url = f"http://localhost:8080/api/evaluations/feedback/{feedback_id}"
payload = {
    "type": "rating",
    "data": {
        "model_id": "gpt-4",
        "rating": "-1",  # Changed rating
        "tags": ["coding", "python", "debugging"]
    }
}

response = requests.post(url, json=payload)

Delete Feedback

import requests

feedback_id = "your-feedback-id"  # Placeholder: ID of the feedback entry to delete

response = requests.delete(
    f"http://localhost:8080/api/evaluations/feedback/{feedback_id}"
)

Evaluation Arena

The arena mode allows administrators to configure blind model comparisons:

Configure Arena Models

Set up models for arena evaluation:

import requests

url = "http://localhost:8080/api/evaluations/config"
payload = {
    "ENABLE_EVALUATION_ARENA_MODELS": True,
    "EVALUATION_ARENA_MODELS": [
        {
            "id": "model-a",
            "name": "Model A",
            "info": {"base_model": "gpt-4"}
        },
        {
            "id": "model-b",
            "name": "Model B",
            "info": {"base_model": "claude-3"}
        }
    ]
}

response = requests.post(url, json=payload)

Get Arena Configuration

import requests

response = requests.get("http://localhost:8080/api/evaluations/config")
config = response.json()

print(f"Arena enabled: {config['ENABLE_EVALUATION_ARENA_MODELS']}")
for model in config['EVALUATION_ARENA_MODELS']:
    print(f"Arena model: {model['name']}")

Arena mode presents models to users without revealing their identities, reducing bias in comparisons.

Leaderboard System

Administrators can view model performance rankings based on user feedback:

Get Model Leaderboard

Retrieve Elo-based model rankings:

import requests

response = requests.get("http://localhost:8080/api/evaluations/leaderboard")
leaderboard = response.json()

for entry in leaderboard["entries"]:
    print(f"{entry['model_id']}: {entry['rating']} Elo")
    print(f"  Wins: {entry['won']}, Losses: {entry['lost']}")
    print(f"  Top tags: {entry['top_tags']}")

Topic-Specific Rankings

Filter leaderboard by topic using semantic search:

import requests

response = requests.get(
    "http://localhost:8080/api/evaluations/leaderboard",
    params={"query": "coding"}  # Filter by topic
)

leaderboard = response.json()
# Returns models ranked for coding tasks based on tag similarity

How Elo Ratings Work

The leaderboard uses an Elo rating system:

Starting Rating: All models begin at 1000 Elo
Rating Updates: When users compare models:
- Winner gains points, loser loses points
- Amount depends on rating difference (upsets = bigger change)
Formula: new_rating = old_rating + K * (actual - expected)
- K = 32 (controls rating volatility)
- actual = 1 for a win, 0 for a loss
- expected = probability of winning based on current ratings
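
The update rule above can be sketched in a few lines of Python. This is an illustration rather than Open WebUI's own implementation: the function names are hypothetical, and the logistic curve with a scale of 400 is the conventional Elo choice, assumed here because the text does not state how the expected score is computed.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B (conventional Elo curve, scale 400 assumed)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(winner: float, loser: float, k: float = 32.0) -> Tuple[float, float]:
    """Return (new_winner_rating, new_loser_rating) after one comparison."""
    # Winner: actual = 1, so the gain is K * (1 - expected).
    delta = k * (1.0 - expected_score(winner, loser))
    # The loser's change is the mirror image: K * (0 - (1 - expected)).
    return winner + delta, loser - delta


# Two new models at 1000 Elo: expected = 0.5, so 16 points change hands.
print(update_elo(1000.0, 1000.0))  # (1016.0, 984.0)
```

When a 1000-rated model beats a 1200-rated one, the same function moves about 24 points instead of 16, which is the "upsets = bigger change" behavior described above.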

Topic-Based Re-Ranking

When querying with a topic (e.g., “coding”):

System computes semantic similarity between query and feedback tags
Similarity scores weight the Elo calculation
Feedback about “coding” contributes more to rankings
Unrelated feedback contributes less

This provides topic-specific leaderboards without separate data.
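
One way to picture the weighting is to scale each rating update by its similarity score. The sketch below is an assumption about how such weighting could work, not a description of Open WebUI's internals: the `weighted_elo` name, the tuple layout, and the choice to multiply K by the weight are all hypothetical, and the similarity values are supplied by hand rather than computed from embeddings.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weighted_elo(
    comparisons: Iterable[Tuple[str, str, float]],
    k: float = 32.0,
    start: float = 1000.0,
) -> Dict[str, float]:
    """Replay (winner, loser, weight) comparisons, scaling each update by weight.

    weight is assumed to be a similarity score in [0, 1] between the query
    and the feedback's tags; a weight of 0 leaves the ratings unchanged.
    """
    ratings: Dict[str, float] = defaultdict(lambda: start)
    for winner, loser, weight in comparisons:
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * weight * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return dict(ratings)


# Query "coding": model-a wins on closely related feedback (weight 1.0),
# model-b wins on mostly unrelated feedback (weight 0.1).
ratings = weighted_elo([
    ("model-a", "model-b", 1.0),
    ("model-b", "model-a", 0.1),
])
print(sorted(ratings, key=ratings.get, reverse=True))  # ['model-a', 'model-b']
```

With equal weights the two results would nearly cancel out; under this weighting the on-topic win dominates, so model-a ranks first for the query.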

Model Performance History

Track a model’s performance over time:

import requests

model_id = "gpt-4"  # Placeholder: ID of the model to inspect

response = requests.get(
    f"http://localhost:8080/api/evaluations/leaderboard/{model_id}/history",
    params={"days": 30}
)

history = response.json()
for entry in history["history"]:
    print(f"Date: {entry['date']}")
    print(f"Wins: {entry['wins']}, Losses: {entry['losses']}")
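
To turn the per-day entries into a single number, the history can be reduced to a win rate. This is a small sketch that assumes each entry carries the `wins` and `losses` fields printed above; the `win_rate` helper is hypothetical and not part of the API.

```python
from typing import Any, Iterable, Mapping, Optional


def win_rate(history: Iterable[Mapping[str, Any]]) -> Optional[float]:
    """Overall win rate across history entries, or None if no comparisons exist."""
    entries = list(history)  # Materialize once so generators are handled too
    wins = sum(entry.get("wins", 0) for entry in entries)
    losses = sum(entry.get("losses", 0) for entry in entries)
    total = wins + losses
    return wins / total if total else None


sample = [
    {"date": "2024-01-01", "wins": 3, "losses": 1},
    {"date": "2024-01-02", "wins": 1, "losses": 3},
]
print(win_rate(sample))  # 0.5
```

With the response from the request above, `win_rate(history["history"])` would give the rate for the selected period.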

Admin Feedback Management

Administrators have additional feedback management capabilities:

List All Feedback

Get paginated feedback with sorting:

import requests

response = requests.get(
    "http://localhost:8080/api/evaluations/feedbacks/list",
    params={
        "order_by": "created_at",
        "direction": "desc",
        "page": 1
    }
)

result = response.json()
feedbacks = result["items"]
total = result["total"]
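
To collect every page, the request can be repeated until the number of collected items reaches `total`. The sketch below assumes the `items`/`total` response shape shown above and 1-based page numbers; `fetch_all` and the `fetch_page` callable are hypothetical names, and the page source is injected so the loop can be tried without a running server.

```python
from typing import Any, Callable, Dict, List


def fetch_all(fetch_page: Callable[[int], Dict[str, Any]]) -> List[Any]:
    """Collect items from consecutive pages until `total` items are gathered."""
    items: List[Any] = []
    page = 1
    while True:
        result = fetch_page(page)
        batch = result.get("items", [])
        items.extend(batch)
        # Stop on an empty page or once everything has been collected.
        # A missing "total" is treated as 0, which stops after the first page.
        if not batch or len(items) >= result.get("total", 0):
            break
        page += 1
    return items


# Stand-in for the API: five entries served two per page.
_data = ["f1", "f2", "f3", "f4", "f5"]


def fake_page(page: int) -> Dict[str, Any]:
    start = (page - 1) * 2
    return {"items": _data[start:start + 2], "total": len(_data)}


print(fetch_all(fake_page))  # ['f1', 'f2', 'f3', 'f4', 'f5']
```

Against a live instance, `fetch_page` could be a function that calls the list endpoint with `params={"page": page}` and returns `response.json()`.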

Export All Feedback

Export feedback data for analysis:

import requests

response = requests.get(
    "http://localhost:8080/api/evaluations/feedbacks/all/export"
)

feedback_data = response.json()
# Full feedback data with all fields

Get Feedback IDs Only

Retrieve just the IDs for bulk operations:

import requests

response = requests.get(
    "http://localhost:8080/api/evaluations/feedbacks/all/ids"
)

feedback_ids = response.json()

Bulk Delete Feedback

Delete all feedback (admin only):

import requests

response = requests.delete(
    "http://localhost:8080/api/evaluations/feedbacks/all"
)

success = response.json()

Bulk delete operations are permanent and cannot be undone. Export feedback data before deletion.

Use Cases

A/B Model Testing

Compare two models for a specific use case:

import requests

# Configure arena with two models
payload = {
    "ENABLE_EVALUATION_ARENA_MODELS": True,
    "EVALUATION_ARENA_MODELS": [
        {"id": "model-a", "name": "Model A"},
        {"id": "model-b", "name": "Model B"}
    ]
}
requests.post("http://localhost:8080/api/evaluations/config", json=payload)

# Collect user feedback in blind mode
# After sufficient data, check the leaderboard:
response = requests.get(
    "http://localhost:8080/api/evaluations/leaderboard",
    params={"query": "customer-support"}  # Topic filter
)

# Analyze which model performs better for customer support
leaderboard = response.json()

Topic Performance Analysis

Identify which models excel at different topics:

import requests

topics = ["coding", "writing", "analysis", "translation"]

for topic in topics:
    response = requests.get(
        "http://localhost:8080/api/evaluations/leaderboard",
        params={"query": topic}
    )

    entries = response.json()["entries"]
    if not entries:
        # No comparison feedback yet, so there is nothing to rank
        print(f"No ranking data for {topic}")
        continue

    top_model = entries[0]
    print(f"Best for {topic}: {top_model['model_id']}")
    print(f"  Rating: {top_model['rating']}")
    print(f"  Record: {top_model['won']}-{top_model['lost']}")

Feedback Collection Workflow

Systematic feedback collection:

import requests

# 1. User completes a task with a model, for example
#    "Help me write a Python function". The IDs below are placeholders
#    for the chat and message produced by that conversation.
chat_id = "your-chat-id"
message_id = "your-message-id"

# 2. Collect structured feedback
feedback = {
    "type": "rating",
    "data": {
        "model_id": "gpt-4",
        "rating": "1",  # Positive
        "tags": ["coding", "python", "helpful"],
        "comment": "Great explanation and working code"
    },
    "meta": {
        "chat_id": chat_id,
        "message_id": message_id,
        "task_type": "code_generation"
    }
}

requests.post("http://localhost:8080/api/evaluations/feedback", json=feedback)

# 3. Periodically review aggregate feedback
response = requests.get(
    "http://localhost:8080/api/evaluations/leaderboard",
    params={"query": "coding"}
)

Best Practices

Tag Consistently: Use standardized tags for better topic analysis
Collect Context: Include relevant metadata (task type, user role, etc.)
Regular Reviews: Check leaderboard trends periodically
Blind Comparisons: Use arena mode to reduce bias
Sufficient Data: Collect enough feedback before making decisions
Export Backups: Regularly export feedback for analysis and backup
Topic-Specific: Use query filtering to find best models per use case
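
Consistent tagging is easier to enforce in code than by convention. The helper below is one possible normalization scheme (lowercase, hyphen-separated, de-duplicated); the function names and the scheme itself are illustrative assumptions, not something Open WebUI applies on its own.

```python
import re
from typing import Iterable, List


def normalize_tag(tag: str) -> str:
    """Lowercase, trim, and turn runs of whitespace or underscores into hyphens."""
    return re.sub(r"[\s_]+", "-", tag.strip().lower())


def normalize_tags(tags: Iterable[str]) -> List[str]:
    """Normalize tags and drop blanks and duplicates, keeping first-seen order."""
    result: List[str] = []
    for tag in tags:
        norm = normalize_tag(tag)
        if norm and norm not in result:
            result.append(norm)
    return result


print(normalize_tags(["Python", " python ", "Customer Support", "customer_support"]))
# ['python', 'customer-support']
```

Running tags through a helper like this before submitting feedback keeps "Customer Support" and "customer_support" from being treated as separate topics.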

Troubleshooting

Invalid Rating Value

Rating must be “1” (positive) or “-1” (negative):

# Correct
{"rating": "1"}  # String value

# Incorrect
{"rating": 1}    # Integer not supported
{"rating": "5"}  # Only 1 and -1 allowed
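
A client-side check can catch these mistakes before the request is sent. This is a minimal sketch assuming the two string values above are the only accepted ratings; `normalize_rating` is a hypothetical helper, not part of the API.

```python
VALID_RATINGS = {"1", "-1"}


def normalize_rating(value: object) -> str:
    """Return the rating as the string "1" or "-1", or raise ValueError."""
    text = str(value).strip()
    if text not in VALID_RATINGS:
        raise ValueError(f'rating must be "1" or "-1", got {value!r}')
    return text


print(normalize_rating(1))     # 1   (integer converted to the string "1")
print(normalize_rating("-1"))  # -1
```

Calling `normalize_rating("5")` raises `ValueError`, so an out-of-range value fails locally rather than in the API call.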

Empty Leaderboard

Leaderboard requires feedback with comparison data:

# Ensure feedback includes sibling_model_ids
{
    "data": {
        "model_id": "gpt-4",
        "rating": "1",
        "sibling_model_ids": ["claude-3"]  # Required for rankings
    }
}
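
To see how much of the existing feedback can contribute to rankings, the exported entries can be checked for comparison data. The sketch assumes each entry follows the `data` structure shown above; the helper names are hypothetical.

```python
from typing import Any, Iterable, Mapping, Tuple


def has_comparison_data(entry: Mapping[str, Any]) -> bool:
    """True if the entry names a model and at least one sibling model."""
    data = entry.get("data") or {}
    return bool(data.get("model_id")) and bool(data.get("sibling_model_ids"))


def count_rankable(feedbacks: Iterable[Mapping[str, Any]]) -> Tuple[int, int]:
    """Return (entries with comparison data, entries without)."""
    entries = list(feedbacks)
    rankable = sum(1 for entry in entries if has_comparison_data(entry))
    return rankable, len(entries) - rankable


sample = [
    {"data": {"model_id": "gpt-4", "rating": "1", "sibling_model_ids": ["claude-3"]}},
    {"data": {"model_id": "gpt-4", "rating": "1"}},
    {"data": {"model_id": "gpt-4", "rating": "-1", "sibling_model_ids": []}},
]
print(count_rankable(sample))  # (1, 2)
```

If the first number is zero for the exported feedback, an empty leaderboard is the expected result.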

Missing Topic Results

Topic-based filtering requires:

Embedding model configured (AUXILIARY_EMBEDDING_MODEL)
Feedback with topic tags
Sufficient tag diversity

If the embedding model fails to load, topic-based filtering falls back to unweighted Elo rankings.

Configuration

Environment Variables

# Set embedding model for topic analysis
AUXILIARY_EMBEDDING_MODEL=TaylorAI/bge-micro-v2

# The default model is lightweight and works well for tag similarity

Required Permissions

Users: Can create, view, update, and delete their own feedback
Admins: Full access to all feedback, leaderboard, and configuration

Next Steps

Learn about Model Management for adding models
Explore RBAC for permission management
Configure Observability for performance monitoring
