
Judex

9 parameters, 12 scenarios for Turkish LLMs.

EvalOps Framework: an evaluation platform that compares models side by side across 9 evaluation parameters and 12 scenario sets. Define your own scenarios, run them continuously, and share the results.


EvalOps Framework

9 Evaluation Parameters

demo · v0.2

A01 Helpfulness & Task Completion

Understanding the user's real need and completing the task fully.

A02 Instruction & Format Following

Adherence to complex multi-step instructions and requested output formats.

A03 Truthfulness & Factual Accuracy

Factual responses, resistance to hallucination, and consistency with references.

A04 Groundedness & Context Fidelity

Staying grounded in the given source or context without drifting.

A05 Safety, Compliance & Risk Awareness

Detecting harmful, manipulative, or unlawful content with clear risk awareness.

A06 Bias, Fairness & Inclusivity

Analysis of demographic, cultural, and social bias; inclusive language.

A07 Reasoning & Problem Solving Quality

Multi-step inference, symbolic logic, and complex problem solving.

A08 Clarity, Tone & Communication

Expression quality, tone, structure, and audience fit.

A09 Robustness, Consistency & Recoverability

Stability under prompt variations and ability to recover from errors.

Judex Scenario Set

12 Evaluation Scenarios

ID	Scenario	Category
S01	General Knowledge & Q&A	Knowledge
S02	Technical Explanation & Expert Content	Technical
S03	Educational & Instructional Content	Education
S04	Health & Sensitive Advice	Critical
S05	Legal & Official Information	Critical
S06	Finance & Decision Support	Critical
S07	Creative Content Generation	Creative
S08	Harmful Content & Safety Boundary	Safety
S09	Social Topics & Bias	Ethics
S10	Multilingual & Cross-cultural Use	Language
S11	Prompt Variation & Consistency	Robustness
S12	Justification & Explainability	Explainability
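
The twelve scenarios listed above can be held as plain data and grouped by their category tag. The following is a minimal sketch, not Judex's actual schema: the `Scenario` record, its field names, and the `by_category` helper are hypothetical and exist only for illustration. The IDs, names, and categories are copied from the list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record; field names are illustrative, not Judex's schema."""
    id: str
    name: str
    category: str


# IDs, names, and categories copied from the scenario list above.
SCENARIOS = [
    Scenario("S01", "General Knowledge & Q&A", "Knowledge"),
    Scenario("S02", "Technical Explanation & Expert Content", "Technical"),
    Scenario("S03", "Educational & Instructional Content", "Education"),
    Scenario("S04", "Health & Sensitive Advice", "Critical"),
    Scenario("S05", "Legal & Official Information", "Critical"),
    Scenario("S06", "Finance & Decision Support", "Critical"),
    Scenario("S07", "Creative Content Generation", "Creative"),
    Scenario("S08", "Harmful Content & Safety Boundary", "Safety"),
    Scenario("S09", "Social Topics & Bias", "Ethics"),
    Scenario("S10", "Multilingual & Cross-cultural Use", "Language"),
    Scenario("S11", "Prompt Variation & Consistency", "Robustness"),
    Scenario("S12", "Justification & Explainability", "Explainability"),
]


def by_category(scenarios):
    """Group scenario IDs by category, preserving list order."""
    groups = defaultdict(list)
    for scenario in scenarios:
        groups[scenario.category].append(scenario.id)
    return dict(groups)


if __name__ == "__main__":
    for category, ids in by_category(SCENARIOS).items():
        print(f"{category}: {', '.join(ids)}")
```

Grouping this way makes it easy to see that three scenarios (S04, S05, S06) share the Critical tag, while every other category has a single scenario.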

Sample comparison

Model	Instruction	Truthfulness	Safety	Overall
model-tr-large	86	92	78	85
model-x-7b	74	81	70	75
model-y-70b	91	88	85	88

* Sample demo data; real results are available in the Judex panel.
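
In the sample rows, the Overall value equals the unweighted mean of the three displayed scores, rounded to the nearest integer. The sketch below reproduces that arithmetic and ranks the models. It assumes this simple mean purely because it matches the demo figures; how Judex actually aggregates across all 9 parameters is not stated here, and the `overall` and `rank` helpers are hypothetical names.

```python
# Scores copied from the sample comparison table (demo data only).
SAMPLE_SCORES = {
    "model-tr-large": {"Instruction": 86, "Truthfulness": 92, "Safety": 78},
    "model-x-7b": {"Instruction": 74, "Truthfulness": 81, "Safety": 70},
    "model-y-70b": {"Instruction": 91, "Truthfulness": 88, "Safety": 85},
}


def overall(scores):
    """Assumed aggregation: unweighted mean, rounded to the nearest integer."""
    values = list(scores.values())
    return round(sum(values) / len(values))


def rank(table):
    """Return (model, overall) pairs sorted from highest to lowest overall."""
    pairs = [(model, overall(scores)) for model, scores in table.items()]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for model, score in rank(SAMPLE_SCORES):
        print(f"{model}\t{score}")
```

With the demo data this yields 88 for model-y-70b, 85 for model-tr-large, and 75 for model-x-7b, matching the Overall column. A weighted mean would be a natural variation if some parameters, such as Safety, should count more.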