Evaluations in Open WebUI provide a feedback and model comparison system. Users can rate model responses and compare models side by side in arena mode, while administrators can view performance leaderboards with Elo ratings.
Overview
The evaluation system provides:
User feedback collection on model responses
Model comparison arena for head-to-head evaluations
Elo-based leaderboard ranking system
Topic-specific model performance analysis
Feedback export and management
Historical performance tracking
Evaluations help identify which models perform best for different use cases and provide valuable feedback for model improvement.
Feedback System
Creating Feedback
Users can submit feedback on model responses:
import requests
url = "http://localhost:8080/api/evaluations/feedback"
# Placeholder IDs: replace with the chat and message being rated
chat_id = "your-chat-id"
message_id = "your-message-id"
payload = {
    "type": "rating",
    "data": {
        "model_id": "gpt-4",
        "rating": "1",  # "1" for positive, "-1" for negative
        "sibling_model_ids": ["claude-3", "llama-2"],  # For comparisons
        "tags": ["coding", "python"],  # Topic tags
    },
    "meta": {
        "chat_id": chat_id,
        "message_id": message_id,
    },
}
response = requests.post(url, json=payload)
feedback = response.json()
Feedback Structure
Feedback entries contain:
type: Type of feedback (e.g., "rating", "comment")
data: Feedback details, including:
    model_id: The evaluated model
    rating: "1" (positive) or "-1" (negative)
    sibling_model_ids: Other models in the comparison
    tags: Topic/category tags for context
meta: Additional metadata (chat, message, timestamp)
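The structure above can be assembled by a small helper that rejects invalid ratings before anything is sent. This is a minimal sketch, not part of Open WebUI: the function name `build_feedback_payload` and its defaults are assumptions made for illustration, and only the field names come from the list above.

```python
from typing import Any, Optional

VALID_RATINGS = {"1", "-1"}  # the only rating strings described above


def build_feedback_payload(
    model_id: str,
    rating: str,
    sibling_model_ids: Optional[list[str]] = None,
    tags: Optional[list[str]] = None,
    meta: Optional[dict[str, Any]] = None,
) -> dict[str, Any]:
    """Assemble a feedback body using the type/data/meta layout (hypothetical helper)."""
    if rating not in VALID_RATINGS:
        raise ValueError(f'rating must be "1" or "-1", got {rating!r}')
    return {
        "type": "rating",
        "data": {
            "model_id": model_id,
            "rating": rating,
            "sibling_model_ids": list(sibling_model_ids or []),
            "tags": list(tags or []),
        },
        "meta": dict(meta or {}),
    }


payload = build_feedback_payload("gpt-4", "1", ["claude-3"], ["coding"])
print(payload["data"]["rating"])
```

The returned dictionary can be passed as the `json` argument of the `requests.post` call shown in Creating Feedback.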
Viewing User Feedback
Retrieve feedback submitted by the current user:
import requests
response = requests.get("http://localhost:8080/api/evaluations/feedbacks/user")
feedbacks = response.json()
for feedback in feedbacks:
    print(f"Model: {feedback['data']['model_id']}")
    print(f"Rating: {feedback['data']['rating']}")
    print(f"Date: {feedback['created_at']}")
Managing Feedback
Get Specific Feedback
import requests
feedback_id = "your-feedback-id"  # Placeholder: ID of an existing feedback entry
response = requests.get(
    f"http://localhost:8080/api/evaluations/feedback/{feedback_id}"
)
feedback = response.json()
Update Feedback
Users can update their own feedback:
import requests
feedback_id = "your-feedback-id"  # Placeholder: ID of the entry to update
url = f"http://localhost:8080/api/evaluations/feedback/{feedback_id}"
payload = {
    "type": "rating",
    "data": {
        "model_id": "gpt-4",
        "rating": "-1",  # Changed rating
        "tags": ["coding", "python", "debugging"],
    },
}
response = requests.post(url, json=payload)
Delete Feedback
import requests
feedback_id = "your-feedback-id"  # Placeholder: ID of the entry to delete
response = requests.delete(
    f"http://localhost:8080/api/evaluations/feedback/{feedback_id}"
)
Evaluation Arena
Arena mode lets administrators configure blind model comparisons.
Set up models for arena evaluation:
import requests
url = "http://localhost:8080/api/evaluations/config"
payload = {
    "ENABLE_EVALUATION_ARENA_MODELS": True,
    "EVALUATION_ARENA_MODELS": [
        {
            "id": "model-a",
            "name": "Model A",
            "info": {"base_model": "gpt-4"},
        },
        {
            "id": "model-b",
            "name": "Model B",
            "info": {"base_model": "claude-3"},
        },
    ],
}
response = requests.post(url, json=payload)
Get Arena Configuration
import requests
response = requests.get("http://localhost:8080/api/evaluations/config")
config = response.json()
print(f"Arena enabled: {config['ENABLE_EVALUATION_ARENA_MODELS']}")
for model in config["EVALUATION_ARENA_MODELS"]:
    print(f"Arena model: {model['name']}")
Arena mode presents models to users without revealing their identities, reducing bias in comparisons.
Leaderboard System
Administrators can view model performance rankings based on user feedback:
Get Model Leaderboard
Retrieve Elo-based model rankings:
import requests
response = requests.get("http://localhost:8080/api/evaluations/leaderboard")
leaderboard = response.json()
for entry in leaderboard["entries"]:
    print(f"{entry['model_id']}: {entry['rating']} Elo")
    print(f"  Wins: {entry['won']}, Losses: {entry['lost']}")
    print(f"  Top tags: {entry['top_tags']}")
Topic-Specific Rankings
Filter leaderboard by topic using semantic search:
import requests
response = requests.get(
    "http://localhost:8080/api/evaluations/leaderboard",
    params={"query": "coding"},  # Filter by topic
)
leaderboard = response.json()
# Returns models ranked for coding tasks based on tag similarity
How Elo Ratings Work
The leaderboard uses an Elo rating system:
Starting Rating: All models begin at 1000 Elo
Rating Updates: When users compare models:
    The winner gains points and the loser loses points
    The amount depends on the rating difference (an upset produces a bigger change)
Formula: new_rating = old_rating + K * (actual - expected)
    K = 32 (controls rating volatility)
    actual = 1 for a win, 0 for a loss
    expected = probability of winning based on current ratings
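The update rule above can be sketched in a few lines. The source gives only K and the update formula; the 400-point logistic curve used for `expected_score` is the conventional Elo definition and is an assumption here, as are the function names.

```python
K = 32  # volatility constant from the description above
START_RATING = 1000.0  # rating every model begins with


def expected_score(rating_a: float, rating_b: float) -> float:
    """Conventional Elo win probability for A against B (assumed 400-point scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_elo(winner: float, loser: float, k: float = K) -> tuple[float, float]:
    """Apply new_rating = old_rating + K * (actual - expected) to both sides."""
    new_winner = winner + k * (1 - expected_score(winner, loser))
    new_loser = loser + k * (0 - expected_score(loser, winner))
    return new_winner, new_loser


print(update_elo(START_RATING, START_RATING))  # equal ratings: (1016.0, 984.0)
```

With equal ratings each side's expected score is 0.5, so the winner gains 16 points. When a lower-rated model beats a higher-rated one, its expected score is below 0.5 and the gain is larger, which is the "upset" effect described above.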
When querying with a topic (e.g., “coding”):
System computes semantic similarity between query and feedback tags
Similarity scores weight the Elo calculation
Feedback about “coding” contributes more to rankings
Unrelated feedback contributes less
This yields topic-specific leaderboards from a single feedback dataset, with no separate per-topic data to maintain.
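One plausible reading of "similarity scores weight the Elo calculation" is that each comparison's rating change is scaled by the similarity between the query and that feedback's tags. The sketch below illustrates that idea only; the scaling scheme, the function names, and the two-dimensional toy vectors are all assumptions, not the actual Open WebUI implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def weighted_elo_update(
    winner: float, loser: float, weight: float, k: float = 32.0
) -> tuple[float, float]:
    """Elo update whose size is scaled by a similarity weight (assumed scheme)."""
    expected_win = 1.0 / (1.0 + 10 ** ((loser - winner) / 400))
    delta = k * weight * (1.0 - expected_win)
    return winner + delta, loser - delta


# Toy embeddings, invented for illustration only.
query_vec = [1.0, 0.0]       # stands in for the query "coding"
coding_tag_vec = [0.9, 0.1]  # feedback tagged with something close to the query
cooking_tag_vec = [0.0, 1.0]  # feedback tagged with something unrelated

for name, vec in [("coding", coding_tag_vec), ("cooking", cooking_tag_vec)]:
    weight = max(0.0, cosine_similarity(query_vec, vec))
    print(name, weighted_elo_update(1000.0, 1000.0, weight))
```

Under this scheme the unrelated comparison has a weight of 0 and leaves both ratings unchanged, while the closely related one moves them by nearly the full unweighted amount.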
Model Performance History
Track a model’s performance over time:
import requests
model_id = "gpt-4"  # Placeholder: ID of the model to inspect
response = requests.get(
    f"http://localhost:8080/api/evaluations/leaderboard/{model_id}/history",
    params={"days": 30},
)
history = response.json()
for entry in history["history"]:
    print(f"Date: {entry['date']}")
    print(f"Wins: {entry['wins']}, Losses: {entry['losses']}")
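Raw win and loss counts are easier to compare across days as a win rate. The sketch below assumes each history entry carries the `date`, `wins`, and `losses` fields used in the example above; the helper name and the sample data are made up for illustration.

```python
from typing import Optional


def daily_win_rates(history: list[dict]) -> list[tuple[str, Optional[float]]]:
    """Return (date, win rate) pairs; the rate is None for days with no comparisons."""
    rates = []
    for entry in history:
        games = entry["wins"] + entry["losses"]
        rates.append((entry["date"], entry["wins"] / games if games else None))
    return rates


sample = [
    {"date": "2024-01-01", "wins": 3, "losses": 1},
    {"date": "2024-01-02", "wins": 0, "losses": 0},
]
print(daily_win_rates(sample))  # [('2024-01-01', 0.75), ('2024-01-02', None)]
```

Days without any comparisons are reported as None rather than 0, so that a quiet day is not mistaken for a losing one.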
Admin Feedback Management
Administrators have additional feedback management capabilities:
List All Feedback
Get paginated feedback with sorting:
import requests
response = requests.get(
    "http://localhost:8080/api/evaluations/feedbacks/list",
    params={
        "order_by": "created_at",
        "direction": "desc",
        "page": 1,
    },
)
result = response.json()
feedbacks = result["items"]
total = result["total"]
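Because the list endpoint is paginated, collecting every entry means requesting successive pages. The sketch below assumes the response shape shown above (`items` and `total`) and that pages are numbered from 1; the page size is not stated in this guide, so the loop stops on an empty page or once `total` items have been gathered. `collect_all_feedback` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for the HTTP call so the example runs offline.

```python
from typing import Any, Callable


def collect_all_feedback(fetch_page: Callable[[int], dict[str, Any]]) -> list[Any]:
    """Request pages 1, 2, ... until `total` items are gathered or a page is empty."""
    items: list[Any] = []
    page = 1
    while True:
        result = fetch_page(page)
        batch = result.get("items", [])
        items.extend(batch)
        if not batch or len(items) >= result.get("total", 0):
            return items
        page += 1


def fake_fetch(page: int) -> dict[str, Any]:
    """Offline stand-in for the list endpoint: 5 made-up entries, 2 per page."""
    data = [{"id": f"fb-{i}"} for i in range(5)]
    start = (page - 1) * 2
    return {"items": data[start:start + 2], "total": len(data)}


print(len(collect_all_feedback(fake_fetch)))  # 5
```

In practice, `fetch_page` would wrap the `requests.get` call from the example above, passing the page number in `params` and returning `response.json()`.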
Export All Feedback
Export feedback data for analysis:
import requests
response = requests.get(
    "http://localhost:8080/api/evaluations/feedbacks/all/export"
)
feedback_data = response.json()
# Full feedback data with all fields
Get Feedback IDs Only
Retrieve just the IDs for bulk operations:
import requests
response = requests.get(
    "http://localhost:8080/api/evaluations/feedbacks/all/ids"
)
feedback_ids = response.json()
Bulk Delete Feedback
Delete all feedback (admin only):
import requests
response = requests.delete(
    "http://localhost:8080/api/evaluations/feedbacks/all"
)
success = response.json()
Bulk delete operations are permanent and cannot be undone. Export feedback data before deletion.
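One way to follow this advice is to write the exported records to a timestamped file and confirm the file is readable before calling the delete endpoint. The sketch below covers only the local backup step and assumes the export returns a JSON list of feedback records; the helper names and file naming scheme are illustrative, not part of Open WebUI.

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def save_feedback_backup(records: list, directory: str) -> Path:
    """Write records to a UTC-timestamped JSON file and return its path."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    path = Path(directory) / f"feedback-backup-{stamp}.json"
    path.write_text(json.dumps(records, indent=2), encoding="utf-8")
    return path


def backup_is_complete(path: Path, expected_count: int) -> bool:
    """Re-read the file and check that it holds the expected number of records."""
    return len(json.loads(path.read_text(encoding="utf-8"))) == expected_count


with tempfile.TemporaryDirectory() as tmp:
    records = [{"id": "fb-1"}, {"id": "fb-2"}]  # made-up sample data
    backup = save_feedback_backup(records, tmp)
    print(backup_is_complete(backup, len(records)))  # True
```

A deletion script could call the bulk delete endpoint only when `backup_is_complete` returns True for the data fetched from the export endpoint.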
Use Cases
Compare two models for a specific use case:
import requests
# Configure arena with two models
payload = {
    "ENABLE_EVALUATION_ARENA_MODELS": True,
    "EVALUATION_ARENA_MODELS": [
        {"id": "model-a", "name": "Model A"},
        {"id": "model-b", "name": "Model B"},
    ],
}
requests.post("http://localhost:8080/api/evaluations/config", json=payload)
# Collect user feedback in blind mode
# After sufficient data, check leaderboard:
response = requests.get(
    "http://localhost:8080/api/evaluations/leaderboard",
    params={"query": "customer-support"},  # Topic filter
)
# Analyze which model performs better for customer support
Feedback Collection Workflow
Systematic feedback collection:
import requests
# 1. User completes a task with a model
# chat_with_model is a placeholder for your own chat client call, not an Open WebUI function
chat_response = chat_with_model("Help me write a Python function")
# Placeholder IDs: replace with the chat and message produced in step 1
chat_id = "your-chat-id"
message_id = "your-message-id"
# 2. Collect structured feedback
feedback = {
    "type": "rating",
    "data": {
        "model_id": "gpt-4",
        "rating": "1",  # Positive
        "tags": ["coding", "python", "helpful"],
        "comment": "Great explanation and working code",
    },
    "meta": {
        "chat_id": chat_id,
        "message_id": message_id,
        "task_type": "code_generation",
    },
}
requests.post("http://localhost:8080/api/evaluations/feedback", json=feedback)
# 3. Periodically review aggregate feedback
response = requests.get(
    "http://localhost:8080/api/evaluations/leaderboard",
    params={"query": "coding"},
)
Best Practices
Tag Consistently: Use standardized tags for better topic analysis
Collect Context: Include relevant metadata (task type, user role, etc.)
Regular Reviews: Check leaderboard trends periodically
Blind Comparisons: Use arena mode to reduce bias
Sufficient Data: Collect enough feedback before making decisions
Export Backups: Regularly export feedback for analysis and backup
Topic-Specific: Use query filtering to find the best models for each use case
Troubleshooting
Invalid Rating Value
Rating must be “1” (positive) or “-1” (negative):
# Correct
{"rating": "1"}  # String value
# Incorrect
{"rating": 1}  # Integer not supported
{"rating": "5"}  # Only "1" and "-1" allowed
Empty Leaderboard
Leaderboard requires feedback with comparison data:
# Ensure feedback includes sibling_model_ids
{
    "data": {
        "model_id": "gpt-4",
        "rating": "1",
        "sibling_model_ids": ["claude-3"],  # Required for rankings
    }
}
Missing Topic Results
Topic-based filtering requires:
Embedding model configured (AUXILIARY_EMBEDDING_MODEL)
Feedback with topic tags
Sufficient tag diversity
If the embedding model fails to load, topic-based filtering falls back to unweighted Elo rankings.
Configuration
Environment Variables
# Set embedding model for topic analysis
AUXILIARY_EMBEDDING_MODEL=TaylorAI/bge-micro-v2
# The default model is lightweight and works well for tag similarity
Required Permissions
Users: Can create, view, update, and delete their own feedback
Admins: Full access to all feedback, leaderboard, and configuration