Judex
9 parameters, 12 scenarios for Turkish LLMs.
EvalOps Framework: an evaluation platform that compares models side by side using 9 parameters and 12 scenario sets. Define your own scenarios, run them continuously, share the results.
9 Evaluation Parameters
- P01 Helpfulness & Task Completion
  Understanding the user's real need and completing the task fully.
- P02 Instruction & Format Following
  Adherence to complex multi-step instructions and requested output formats.
- P03 Truthfulness & Factual Accuracy
  Factual responses, resistance to hallucination, and consistency with references.
- P04 Groundedness & Context Fidelity
  Staying grounded in the given source or context without drifting.
- P05 Safety, Compliance & Risk Awareness
  Detecting harmful, manipulative, or unlawful content with clear risk awareness.
- P06 Bias, Fairness & Inclusivity
  Analysis of demographic, cultural, and social bias; inclusive language.
- P07 Reasoning & Problem Solving Quality
  Multi-step inference, symbolic logic, and complex problem solving.
- P08 Clarity, Tone & Communication
  Expression quality, tone, structure, and audience fit.
- P09 Robustness, Consistency & Recoverability
  Stability under prompt variations and ability to recover from errors.
12 Evaluation Scenarios
- S01 General Knowledge & Q&A (Knowledge)
- S02 Technical Explanation & Expert Content (Technical)
- S03 Educational & Instructional Content (Education)
- S04 Health & Sensitive Advice (Critical)
- S05 Legal & Official Information (Critical)
- S06 Finance & Decision Support (Critical)
- S07 Creative Content Generation (Creative)
- S08 Harmful Content & Safety Boundary (Safety)
- S09 Social Topics & Bias (Ethics)
- S10 Multilingual & Cross-cultural Use (Language)
- S11 Prompt Variation & Consistency (Robustness)
- S12 Justification & Explainability (Explainability)
| Model | Instruction | Truthfulness | Safety | Overall |
|---|---|---|---|---|
| model-tr-large | 86 | 92 | 78 | 85 |
| model-x-7b | 74 | 81 | 70 | 75 |
| model-y-70b | 91 | 88 | 85 | 88 |
* Sample demo data; real results are shown in the Judex panel.
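
The page does not say how the Overall column is derived. In the sample rows above it happens to match the unweighted mean of the three displayed scores, rounded to the nearest integer. The sketch below assumes that aggregation purely for illustration. The names `DEMO_SCORES`, `overall_score`, and `rank_models` are hypothetical and are not part of any Judex API.

```python
# Sample demo rows copied from the table above (not real results).
DEMO_SCORES = {
    "model-tr-large": {"Instruction": 86, "Truthfulness": 92, "Safety": 78},
    "model-x-7b": {"Instruction": 74, "Truthfulness": 81, "Safety": 70},
    "model-y-70b": {"Instruction": 91, "Truthfulness": 88, "Safety": 85},
}


def overall_score(scores: dict[str, int]) -> int:
    """Assumed aggregation: unweighted mean, rounded to the nearest integer."""
    values = list(scores.values())
    return round(sum(values) / len(values))


def rank_models(table: dict[str, dict[str, int]]) -> list[tuple[str, int]]:
    """Return (model, overall) pairs sorted from highest to lowest overall."""
    pairs = [(model, overall_score(scores)) for model, scores in table.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, overall in rank_models(DEMO_SCORES):
        print(f"{model}: {overall}")
```

A real deployment might weight parameters differently, for example giving Safety more weight in the scenarios tagged Critical, so treat the equal weighting here as a placeholder.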