Evaluations in Open WebUI provide a feedback and model comparison system. Users can rate model responses and compare models side-by-side in arena mode, and administrators can view performance leaderboards with Elo ratings.

Overview

The evaluation system provides:
  • User feedback collection on model responses
  • Model comparison arena for head-to-head evaluations
  • Elo-based leaderboard ranking system
  • Topic-specific model performance analysis
  • Feedback export and management
  • Historical performance tracking
Evaluations help identify which models perform best for different use cases and provide valuable feedback for model improvement.

Feedback System

Creating Feedback

Users can submit feedback on model responses:
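import requests

# Placeholder IDs: replace with the chat and message being rated
chat_id = "your-chat-id"
message_id = "your-message-id"

url = "http://localhost:8080/api/evaluations/feedback"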
payload = {
    "type": "rating",
    "data": {
        "model_id": "gpt-4",
        "rating": "1",  # 1 for positive, -1 for negative
        "sibling_model_ids": ["claude-3", "llama-2"],  # For comparisons
        "tags": ["coding", "python"]  # Topic tags
    },
    "meta": {
        "chat_id": chat_id,
        "message_id": message_id
    }
}

response = requests.post(url, json=payload)
feedback = response.json()

Feedback Structure

Feedback entries contain:
  • type: Type of feedback (e.g., “rating”, “comment”)
  • data: Feedback details including:
    • model_id: The evaluated model
    • rating: “1” (positive) or “-1” (negative)
    • sibling_model_ids: Other models in the comparison
    • tags: Topic/category tags for context
  • meta: Additional metadata (chat, message, timestamp)
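
The structure above can be assembled with a small helper so that every submission has the same shape. This is a minimal sketch: `build_feedback_payload` and the sample IDs are hypothetical names used for illustration, not part of Open WebUI.

```python
from typing import Any, Optional


def build_feedback_payload(
    model_id: str,
    positive: bool,
    chat_id: str,
    message_id: str,
    sibling_model_ids: Optional[list[str]] = None,
    tags: Optional[list[str]] = None,
) -> dict[str, Any]:
    """Assemble a rating payload with the type/data/meta fields listed above."""
    data: dict[str, Any] = {
        "model_id": model_id,
        # Ratings are sent as the strings "1" or "-1"
        "rating": "1" if positive else "-1",
    }
    # Optional fields are only included when they carry values
    if sibling_model_ids:
        data["sibling_model_ids"] = list(sibling_model_ids)
    if tags:
        data["tags"] = list(tags)
    return {
        "type": "rating",
        "data": data,
        "meta": {"chat_id": chat_id, "message_id": message_id},
    }


# Sample values are made up for illustration
payload = build_feedback_payload(
    "gpt-4", True, "chat-123", "msg-456",
    sibling_model_ids=["claude-3"], tags=["coding"],
)
```

The resulting dictionary can be passed as the `json` argument of the `requests.post` call shown under Creating Feedback.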

Viewing User Feedback

Retrieve feedback submitted by the current user:
import requests

response = requests.get("http://localhost:8080/api/evaluations/feedbacks/user")
feedbacks = response.json()

for feedback in feedbacks:
    print(f"Model: {feedback['data']['model_id']}")
    print(f"Rating: {feedback['data']['rating']}")
    print(f"Date: {feedback['created_at']}")

Managing Feedback

Get Specific Feedback

import requests

feedback_id = "your-feedback-id"  # placeholder: ID of an existing feedback entry

response = requests.get(
    f"http://localhost:8080/api/evaluations/feedback/{feedback_id}"
)

feedback = response.json()

Update Feedback

Users can update their own feedback:
import requests

feedback_id = "your-feedback-id"  # placeholder: ID of the feedback entry to update

url = f"http://localhost:8080/api/evaluations/feedback/{feedback_id}"
payload = {
    "type": "rating",
    "data": {
        "model_id": "gpt-4",
        "rating": "-1",  # Changed rating
        "tags": ["coding", "python", "debugging"]
    }
}

response = requests.post(url, json=payload)

Delete Feedback

import requests

feedback_id = "your-feedback-id"  # placeholder: ID of the feedback entry to delete

response = requests.delete(
    f"http://localhost:8080/api/evaluations/feedback/{feedback_id}"
)

Evaluation Arena

The arena mode allows administrators to configure blind model comparisons:

Configure Arena Models

Set up models for arena evaluation:
import requests

url = "http://localhost:8080/api/evaluations/config"
payload = {
    "ENABLE_EVALUATION_ARENA_MODELS": True,
    "EVALUATION_ARENA_MODELS": [
        {
            "id": "model-a",
            "name": "Model A",
            "info": {"base_model": "gpt-4"}
        },
        {
            "id": "model-b",
            "name": "Model B",
            "info": {"base_model": "claude-3"}
        }
    ]
}

response = requests.post(url, json=payload)

Get Arena Configuration

import requests

response = requests.get("http://localhost:8080/api/evaluations/config")
config = response.json()

print(f"Arena enabled: {config['ENABLE_EVALUATION_ARENA_MODELS']}")
for model in config['EVALUATION_ARENA_MODELS']:
    print(f"Arena model: {model['name']}")
Arena mode presents models to users without revealing their identities, reducing bias in comparisons.

Leaderboard System

Administrators can view model performance rankings based on user feedback:

Get Model Leaderboard

Retrieve Elo-based model rankings:
import requests

response = requests.get("http://localhost:8080/api/evaluations/leaderboard")
leaderboard = response.json()

for entry in leaderboard["entries"]:
    print(f"{entry['model_id']}: {entry['rating']} Elo")
    print(f"  Wins: {entry['won']}, Losses: {entry['lost']}")
    print(f"  Top tags: {entry['top_tags']}")

Topic-Specific Rankings

Filter the leaderboard by topic using semantic search:
import requests

response = requests.get(
    "http://localhost:8080/api/evaluations/leaderboard",
    params={"query": "coding"}  # Filter by topic
)

leaderboard = response.json()
# Returns models ranked for coding tasks based on tag similarity

How Elo Ratings Work

The leaderboard uses an Elo rating system:
  1. Starting Rating: All models begin at 1000 Elo
  2. Rating Updates: When users compare models:
    • Winner gains points, loser loses points
    • The size of the change depends on the rating difference: an upset (a lower-rated model beating a higher-rated one) moves both ratings more
  3. Formula: new_rating = old_rating + K * (actual - expected)
    • K=32 (controls rating volatility)
    • Expected = probability of winning based on current ratings
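
The update rule above can be sketched in a few lines. The starting rating of 1000 and K=32 come from the list above; the 400-point logistic scale in `expected_score` is the conventional Elo choice and is an assumption here, since this page does not state which scale Open WebUI uses. The function names are illustrative.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the conventional Elo logistic curve."""
    # The 400-point scale is the usual Elo convention (assumed, not confirmed here)
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(winner: float, loser: float, k: float = 32.0) -> tuple[float, float]:
    """Apply new_rating = old_rating + K * (actual - expected) to both sides."""
    exp_win = expected_score(winner, loser)
    exp_lose = expected_score(loser, winner)
    new_winner = winner + k * (1.0 - exp_win)   # actual score for a win is 1
    new_loser = loser + k * (0.0 - exp_lose)    # actual score for a loss is 0
    return new_winner, new_loser


# Two models at the starting rating: the expected score is 0.5 for each
even = update_elo(1000, 1000)

# A 1000-rated model beats a 1200-rated one: an upset
upset = update_elo(1000, 1200)
```

With equal ratings the winner gains 16 points and the loser drops 16. In the upset case the winner gains about 24 points, which illustrates why surprising results shift the rankings more.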
When querying with a topic (e.g., “coding”):
  1. System computes semantic similarity between query and feedback tags
  2. Similarity scores weight the Elo calculation
  3. Feedback about “coding” contributes more to rankings
  4. Unrelated feedback contributes less
This yields topic-specific leaderboards from a single pool of feedback, with no separate dataset per topic.
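
One way to picture the weighting is to scale each Elo update by the similarity between the query and the feedback's tags. The sketch below is an assumption about how such weighting could work, not a description of Open WebUI's actual implementation: the exact weighting scheme is not specified on this page, the function names are hypothetical, and the two-dimensional vectors are hand-made stand-ins for real embeddings.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_update(
    winner: float, loser: float, similarity: float, k: float = 32.0
) -> tuple[float, float]:
    """Elo update whose step size is scaled by topic similarity (assumed scheme)."""
    weight = max(0.0, similarity)  # ignore negative similarity
    expected = 1.0 / (1.0 + 10 ** ((loser - winner) / 400))
    delta = k * weight * (1.0 - expected)
    return winner + delta, loser - delta


# Toy vectors standing in for embeddings of the query and of feedback tags
query_vec = [1.0, 0.0]      # e.g. the query "coding"
coding_tags = [1.0, 0.0]    # feedback tagged with a closely related topic
cooking_tags = [0.0, 1.0]   # feedback tagged with an unrelated topic

on_topic = weighted_update(1000, 1000, cosine_similarity(query_vec, coding_tags))
off_topic = weighted_update(1000, 1000, cosine_similarity(query_vec, cooking_tags))
```

In this toy setup the on-topic comparison moves the ratings by the full 16 points, while the unrelated comparison leaves them unchanged.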

Model Performance History

Track a model’s performance over time:
import requests

model_id = "gpt-4"  # placeholder: ID of the model to inspect

response = requests.get(
    f"http://localhost:8080/api/evaluations/leaderboard/{model_id}/history",
    params={"days": 30}
)

history = response.json()
for entry in history["history"]:
    print(f"Date: {entry['date']}")
    print(f"Wins: {entry['wins']}, Losses: {entry['losses']}")
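
Raw win and loss counts are easier to compare across days as a win rate. The following is a small sketch that assumes each history entry carries the `date`, `wins`, and `losses` fields used above; the sample entries and the date format are made up for illustration.

```python
from typing import Optional


def daily_win_rates(entries: list[dict]) -> list[tuple[str, Optional[float]]]:
    """Return (date, win_rate) pairs; the rate is None on days with no comparisons."""
    rates = []
    for entry in entries:
        total = entry["wins"] + entry["losses"]
        rate = entry["wins"] / total if total else None
        rates.append((entry["date"], rate))
    return rates


# Made-up sample in the shape of history["history"]
sample_history = [
    {"date": "2024-01-01", "wins": 3, "losses": 1},
    {"date": "2024-01-02", "wins": 0, "losses": 0},
    {"date": "2024-01-03", "wins": 2, "losses": 2},
]

rates = daily_win_rates(sample_history)
```

Days without any comparisons are reported as `None` rather than 0, so an idle day is not mistaken for a day of losses.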

Admin Feedback Management

Administrators have additional feedback management capabilities:

List All Feedback

Get paginated feedback with sorting:
import requests

response = requests.get(
    "http://localhost:8080/api/evaluations/feedbacks/list",
    params={
        "order_by": "created_at",
        "direction": "desc",
        "page": 1
    }
)

result = response.json()
feedbacks = result["items"]
total = result["total"]
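
To walk through every page, the `items` and `total` fields can drive a simple loop. This sketch assumes pages are numbered from 1 and that the server decides the page size; `iter_all_feedback` and `fake_fetch_page` are hypothetical names, and the fake fetcher stands in for the `requests.get` call above so the example runs without a server.

```python
from typing import Callable, Iterator


def iter_all_feedback(fetch_page: Callable[[int], dict]) -> Iterator[dict]:
    """Yield feedback entries page by page until `total` items have been seen."""
    page = 1
    seen = 0
    while True:
        result = fetch_page(page)
        items = result["items"]
        if not items:
            break  # defensive stop if the server returns an empty page
        yield from items
        seen += len(items)
        if seen >= result["total"]:
            break
        page += 1


def fake_fetch_page(page: int) -> dict:
    """Stand-in for the HTTP call: serves 5 fake entries, 2 per page."""
    data = [{"id": f"fb-{i}"} for i in range(5)]
    start = (page - 1) * 2
    return {"items": data[start:start + 2], "total": len(data)}


all_ids = [item["id"] for item in iter_all_feedback(fake_fetch_page)]
```

In real use, `fetch_page` would wrap the request shown above and return `response.json()` for the given page number.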

Export All Feedback

Export feedback data for analysis:
import requests

response = requests.get(
    "http://localhost:8080/api/evaluations/feedbacks/all/export"
)

feedback_data = response.json()
# Full feedback data with all fields

Get Feedback IDs Only

Retrieve just the IDs for bulk operations:
import requests

response = requests.get(
    "http://localhost:8080/api/evaluations/feedbacks/all/ids"
)

feedback_ids = response.json()

Bulk Delete Feedback

Delete all feedback (admin only):
import requests

response = requests.delete(
    "http://localhost:8080/api/evaluations/feedbacks/all"
)

success = response.json()
Bulk delete operations are permanent and cannot be undone. Export feedback data before deletion.
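
A backup step before a bulk delete might look like the sketch below. It assumes the export endpoint returns a JSON list of feedback entries; `save_feedback_backup` is a hypothetical helper, and the sample data and temporary directory are used only so the example runs on its own.

```python
import json
import tempfile
from pathlib import Path


def save_feedback_backup(feedback_data: list, path: Path) -> int:
    """Write exported feedback to disk and verify it reads back unchanged."""
    path.write_text(json.dumps(feedback_data, indent=2), encoding="utf-8")
    restored = json.loads(path.read_text(encoding="utf-8"))
    if restored != feedback_data:
        raise RuntimeError("backup verification failed")
    return len(restored)


# Made-up entries standing in for the export response
sample_export = [
    {"id": "fb-1", "type": "rating", "data": {"model_id": "gpt-4", "rating": "1"}},
    {"id": "fb-2", "type": "rating", "data": {"model_id": "claude-3", "rating": "-1"}},
]

with tempfile.TemporaryDirectory() as tmp:
    backup_path = Path(tmp) / "feedback_backup.json"
    saved_count = save_feedback_backup(sample_export, backup_path)
```

Issuing the delete request only after the helper returns successfully keeps a verified copy of the data on disk.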

Use Cases

Compare two models for a specific use case:
import requests

# Configure arena with two models
payload = {
    "ENABLE_EVALUATION_ARENA_MODELS": True,
    "EVALUATION_ARENA_MODELS": [
        {"id": "model-a", "name": "Model A"},
        {"id": "model-b", "name": "Model B"}
    ]
}
requests.post("http://localhost:8080/api/evaluations/config", json=payload)

# Collect user feedback in blind mode
# After sufficient data, check leaderboard:
response = requests.get(
    "http://localhost:8080/api/evaluations/leaderboard",
    params={"query": "customer-support"}  # Topic filter
)

# Analyze which model performs better for customer support
Identify which models excel at different topics:
import requests

topics = ["coding", "writing", "analysis", "translation"]

for topic in topics:
    response = requests.get(
        "http://localhost:8080/api/evaluations/leaderboard",
        params={"query": topic}
    )
    
    leaderboard = response.json()
    top_model = leaderboard["entries"][0]
    
    print(f"Best for {topic}: {top_model['model_id']}")
    print(f"  Rating: {top_model['rating']}")
    print(f"  Record: {top_model['won']}-{top_model['lost']}")
Systematic feedback collection:
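import requests

# 1. User completes a task with a model
# chat_with_model is a placeholder for your own chat client; chat_id and
# message_id below identify the resulting conversation and response
chat_response = chat_with_model("Help me write a Python function")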

# 2. Collect structured feedback
feedback = {
    "type": "rating",
    "data": {
        "model_id": "gpt-4",
        "rating": "1",  # Positive
        "tags": ["coding", "python", "helpful"],
        "comment": "Great explanation and working code"
    },
    "meta": {
        "chat_id": chat_id,
        "message_id": message_id,
        "task_type": "code_generation"
    }
}

requests.post("http://localhost:8080/api/evaluations/feedback", json=feedback)

# 3. Periodically review aggregate feedback
response = requests.get(
    "http://localhost:8080/api/evaluations/leaderboard",
    params={"query": "coding"}
)

Best Practices

  1. Tag Consistently: Use standardized tags for better topic analysis
  2. Collect Context: Include relevant metadata (task type, user role, etc.)
  3. Regular Reviews: Check leaderboard trends periodically
  4. Blind Comparisons: Use arena mode to reduce bias
  5. Sufficient Data: Collect enough feedback before making decisions
  6. Export Backups: Regularly export feedback for analysis and backup
  7. Topic-Specific: Use query filtering to find best models per use case

Troubleshooting

Invalid Rating Value

Rating must be “1” (positive) or “-1” (negative):
# Correct
{"rating": "1"}  # String value

# Incorrect
{"rating": 1}    # Integer not supported
{"rating": "5"}  # Only 1 and -1 allowed
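
Client code can guard against these mistakes before sending a request. This is a minimal sketch; `normalize_rating` is a hypothetical helper that converts the integers 1 and -1 to the expected strings and rejects everything else.

```python
def normalize_rating(value) -> str:
    """Return "1" or "-1" as a string, or raise ValueError for anything else."""
    text = str(value).strip()
    if text not in {"1", "-1"}:
        raise ValueError(f'rating must be "1" or "-1", got {value!r}')
    return text


from_int = normalize_rating(1)       # integer input is converted to a string
from_str = normalize_rating("-1")    # valid strings pass through unchanged
```

Calling `normalize_rating("5")` raises `ValueError`, so an invalid rating fails locally instead of reaching the API.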

Empty Leaderboard

The leaderboard requires feedback that includes comparison data:
# Ensure feedback includes sibling_model_ids
{
    "data": {
        "model_id": "gpt-4",
        "rating": "1",
        "sibling_model_ids": ["claude-3"]  # Required for rankings
    }
}
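
To check whether existing feedback can contribute to rankings, the entries can be split by whether they carry comparison data. The helper name `split_by_comparison_data` and the sample entries are illustrative assumptions.

```python
def split_by_comparison_data(feedbacks: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate entries with sibling_model_ids from those without."""
    rankable, unrankable = [], []
    for fb in feedbacks:
        # Treat a missing key, None, and an empty list the same way
        siblings = fb.get("data", {}).get("sibling_model_ids") or []
        (rankable if siblings else unrankable).append(fb)
    return rankable, unrankable


# Made-up entries for illustration
sample = [
    {"data": {"model_id": "gpt-4", "rating": "1", "sibling_model_ids": ["claude-3"]}},
    {"data": {"model_id": "gpt-4", "rating": "1"}},
    {"data": {"model_id": "llama-2", "rating": "-1", "sibling_model_ids": []}},
]

rankable, unrankable = split_by_comparison_data(sample)
```

If `rankable` comes back empty for your exported feedback, that would explain an empty leaderboard.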

Missing Topic Results

Topic-based filtering requires:
  • Embedding model configured (AUXILIARY_EMBEDDING_MODEL)
  • Feedback with topic tags
  • Sufficient tag diversity
If the embedding model fails to load, topic-based filtering falls back to unweighted Elo rankings.

Configuration

Environment Variables

# Set embedding model for topic analysis
AUXILIARY_EMBEDDING_MODEL=TaylorAI/bge-micro-v2

# The default model is lightweight and works well for tag similarity

Required Permissions

  • Users: Can create, view, update, and delete their own feedback
  • Admins: Full access to all feedback, leaderboard, and configuration
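
The examples on this page omit credentials for brevity. Because access depends on the caller's role, requests generally need to identify the user. The sketch below assumes bearer-token authentication with an API key or session token; the header scheme is an assumption to verify against your deployment, and `auth_headers` is a hypothetical helper.

```python
def auth_headers(token: str) -> dict[str, str]:
    """Build an Authorization header for a bearer token (assumed auth scheme)."""
    if not token:
        raise ValueError("token must not be empty")
    return {"Authorization": f"Bearer {token}"}


# Placeholder token for illustration only
headers = auth_headers("your-api-key")
```

The result can be passed to any call shown above, for example `requests.get(url, headers=headers)`.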
