LLM Turkey

Judex

9 parameters, 12 scenarios for Turkish LLMs.

EvalOps Framework: an evaluation platform that compares models side by side using 9 parameters and 12 scenario sets. Define your own scenarios, run them continuously, share the results.

EvalOps Framework

9 Evaluation Parameters

  • P01 Helpfulness & Task Completion

    Understanding the user's real need and completing the task fully.

  • P02 Instruction & Format Following

    Adherence to complex multi-step instructions and requested output formats.

  • P03 Truthfulness & Factual Accuracy

    Factual responses, resistance to hallucination, and consistency with references.

  • P04 Groundedness & Context Fidelity

    Staying grounded in the given source or context without drifting.

  • P05 Safety, Compliance & Risk Awareness

    Detecting harmful, manipulative, or unlawful content with clear risk awareness.

  • P06 Bias, Fairness & Inclusivity

    Analysis of demographic, cultural, and social bias; inclusive language.

  • P07 Reasoning & Problem Solving Quality

    Multi-step inference, symbolic logic, and complex problem solving.

  • P08 Clarity, Tone & Communication

    Expression quality, tone, structure, and audience fit.

  • P09 Robustness, Consistency & Recoverability

    Stability under prompt variations and ability to recover from errors.

Judex Scenario Set

12 Evaluation Scenarios

  1. S01 General Knowledge & Q&A (Knowledge)
  2. S02 Technical Explanation & Expert Content (Technical)
  3. S03 Educational & Instructional Content (Education)
  4. S04 Health & Sensitive Advice (Critical)
  5. S05 Legal & Official Information (Critical)
  6. S06 Finance & Decision Support (Critical)
  7. S07 Creative Content Generation (Creative)
  8. S08 Harmful Content & Safety Boundary (Safety)
  9. S09 Social Topics & Bias (Ethics)
  10. S10 Multilingual & Cross-cultural Use (Language)
  11. S11 Prompt Variation & Consistency (Robustness)
  12. S12 Justification & Explainability (Explainability)
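
One way to picture the framework is as a grid in which each of the 12 scenarios is scored on each of the 9 parameters. The Python sketch below is a hypothetical illustration, not Judex's actual data model: the `Scenario` class and the `evaluation_grid` and `scenarios_by_category` helpers are assumed names, and scoring every scenario on every parameter is itself an assumption. The IDs, titles, and category tags are taken from the lists on this page.

```python
from dataclasses import dataclass
from itertools import product

# Parameter IDs and names as listed on this page.
PARAMETERS = {
    "P01": "Helpfulness & Task Completion",
    "P02": "Instruction & Format Following",
    "P03": "Truthfulness & Factual Accuracy",
    "P04": "Groundedness & Context Fidelity",
    "P05": "Safety, Compliance & Risk Awareness",
    "P06": "Bias, Fairness & Inclusivity",
    "P07": "Reasoning & Problem Solving Quality",
    "P08": "Clarity, Tone & Communication",
    "P09": "Robustness, Consistency & Recoverability",
}


@dataclass(frozen=True)
class Scenario:
    """Hypothetical record for one scenario: ID, title, category tag."""
    id: str
    title: str
    category: str


SCENARIOS = [
    Scenario("S01", "General Knowledge & Q&A", "Knowledge"),
    Scenario("S02", "Technical Explanation & Expert Content", "Technical"),
    Scenario("S03", "Educational & Instructional Content", "Education"),
    Scenario("S04", "Health & Sensitive Advice", "Critical"),
    Scenario("S05", "Legal & Official Information", "Critical"),
    Scenario("S06", "Finance & Decision Support", "Critical"),
    Scenario("S07", "Creative Content Generation", "Creative"),
    Scenario("S08", "Harmful Content & Safety Boundary", "Safety"),
    Scenario("S09", "Social Topics & Bias", "Ethics"),
    Scenario("S10", "Multilingual & Cross-cultural Use", "Language"),
    Scenario("S11", "Prompt Variation & Consistency", "Robustness"),
    Scenario("S12", "Justification & Explainability", "Explainability"),
]


def evaluation_grid():
    """Every (scenario ID, parameter ID) pair.

    Assumes each scenario is scored on each parameter.
    """
    return [(s.id, p) for s, p in product(SCENARIOS, PARAMETERS)]


def scenarios_by_category(category):
    """IDs of the scenarios carrying the given category tag."""
    return [s.id for s in SCENARIOS if s.category == category]
```

Under that assumption the grid has 9 × 12 = 108 cells per model, and `scenarios_by_category("Critical")` picks out the three scenarios tagged Critical (health, legal, finance).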
Sample comparison
Model           Instruction  Truthfulness  Safety  Overall
model-tr-large  86           92            78      85
model-x-7b      74           81            70      75
model-y-70b     91           88            85      88

* Sample demo data · real results in the Judex panel
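
In the sample rows, Overall matches the rounded mean of the three displayed scores: for model-tr-large, (86 + 92 + 78) / 3 ≈ 85.3, shown as 85. The page does not say how Overall is actually computed, and it may well draw on all 9 parameters, so the Python sketch below only illustrates that observation. `overall_score` and `rank_models` are hypothetical helper names.

```python
# Sample demo rows from the comparison table (not real results).
SAMPLE = {
    "model-tr-large": {"Instruction": 86, "Truthfulness": 92, "Safety": 78},
    "model-x-7b": {"Instruction": 74, "Truthfulness": 81, "Safety": 70},
    "model-y-70b": {"Instruction": 91, "Truthfulness": 88, "Safety": 85},
}


def overall_score(scores):
    """Unweighted mean of the given scores, rounded to an integer.

    Assumed formula: it reproduces the demo table but is not documented.
    """
    return round(sum(scores.values()) / len(scores))


def rank_models(table):
    """Return (model, overall) pairs sorted from best to worst."""
    ranked = [(name, overall_score(scores)) for name, scores in table.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for name, overall in rank_models(SAMPLE):
        print(f"{name:<16}{overall}")
```

A weighted variant would be a natural extension if some parameters, such as Safety, should count more in Critical scenarios; the weights would need to come from the platform itself.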