Career Level

L6 — AI Eval Architect


February 2026

“You don’t evaluate models. You design how organizations evaluate models.”

Welcome to the other side of the methodology gate. At L6, you’re no longer doing eval — you’re designing eval systems. When a healthcare company needs to evaluate its diagnostic AI, you don’t run the tests. You design the testing methodology, define what “good” means in that domain, build the framework, and hand it to L3-L5 evaluators to execute. Then you audit the results and refine the methodology.

This is eval as a discipline, not a task. You think about evaluation the way an architect thinks about buildings — not which bricks to lay, but how the structure needs to work, what loads it needs to bear, and what happens when conditions change. Your frameworks survive model updates, team turnover, and domain shifts because they’re built on sound methodology, not personal knowledge.

The jump from L5 to L6 is the hardest in the track. L5 is execution excellence. L6 is methodology design. The difference: an L5 can build a great eval framework for semiconductor AI. An L6 can look at semiconductor, healthcare, and fintech and design eval methodology that adapts across all three — because they understand the underlying principles, not just the domain-specific patterns.


What You Do

  • Design eval methodology across domains — healthcare, semiconductor, fintech, legal. Each has different quality criteria, but the evaluation principles are transferable. You see the principles.
  • Build reusable eval frameworks — frameworks that L3-L5 evaluators can deploy in new domains without you. Your methodology is the product, not your presence.
  • Define quality standards — what does “good” mean for a given AI application? You answer this question rigorously. Not vibes. Measurable criteria with defensible thresholds (see the sketch after this list).
  • Audit and calibrate — review eval work across teams. Ensure consistency of standards. Catch methodological errors before they produce misleading results.
  • Stakeholder alignment — work with product leaders, engineers, and domain experts to agree on what evaluation should measure. Different stakeholders care about different things. You unify them.
  • Methodology documentation — your frameworks are documented well enough that someone who’s never met you can implement them correctly.
  • Innovation — push the boundaries of how AI evaluation is done. New metrics, new approaches, new tools. The field is young. Help define it.
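To make “measurable criteria with defensible thresholds” concrete, here is a minimal sketch in Python of what a machine-readable quality standard can look like. The metric names, numbers, and the `QualityCriterion` / `meets_standard` helpers are illustrative assumptions, not a prescribed format:

```python
from dataclasses import dataclass

# Hypothetical sketch: each criterion pairs a measurable metric with a
# defensible threshold, so "good" is something an L3-L5 evaluator can check.

@dataclass
class QualityCriterion:
    metric: str       # what is measured, e.g. "urgent_case_recall"
    threshold: float  # minimum acceptable value
    rationale: str    # why this threshold is defensible in the domain

def meets_standard(scores: dict[str, float], standard: list[QualityCriterion]) -> list[str]:
    """Return the metrics that fall below their agreed threshold."""
    return [c.metric for c in standard if scores.get(c.metric, 0.0) < c.threshold]

# Example: a made-up standard for a clinical triage assistant.
standard = [
    QualityCriterion("urgent_case_recall", 0.98, "Missed urgent cases carry patient-safety risk"),
    QualityCriterion("citation_accuracy", 0.95, "Clinicians must be able to verify every claim"),
]
print(meets_standard({"urgent_case_recall": 0.99, "citation_accuracy": 0.91}, standard))
# -> ['citation_accuracy']
```

The point is not the data structure. It is that every criterion carries a threshold and the rationale a domain expert signed off on, so the standard survives handoff.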

AI Skills Required

  • AI-assisted methodology design — use AI to analyze evaluation approaches across industries, identify patterns, and generate framework drafts for expert refinement
  • Advanced LLM-as-judge architecture — design multi-stage judge pipelines with calibration, bias correction, and domain adaptation (a sketch follows this list)
  • AI-powered framework validation — use AI to stress-test eval frameworks. Generate adversarial scenarios that expose methodology weaknesses.
  • Cross-domain pattern recognition — use AI to identify evaluation principles that transfer across domains and those that don’t
  • Automated methodology auditing — build AI tools that check eval execution against methodology specifications
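As an illustration of the multi-stage judge idea, here is a minimal sketch. `call_judge_model` is a placeholder for whatever LLM client your stack uses, and the anchor-set calibration is a deliberately simple offset correction; a production pipeline would add bias checks and domain adaptation on top:

```python
from statistics import mean

def call_judge_model(prompt: str) -> float:
    """Placeholder for a real LLM client; assumed to return a parsed 0-1 score."""
    raise NotImplementedError("wire in your own judge model here")

def judge(response: str, rubric: str) -> float:
    """Stage 1: cheap pass/fail screen for hard rubric violations.
    Stage 2: graded scoring, only for responses that pass the screen."""
    violates = call_judge_model(
        f"Rubric:\n{rubric}\n\nResponse:\n{response}\n\nDoes this break any hard rule? Answer 1 for yes, 0 for no."
    )
    if violates >= 0.5:
        return 0.0
    return call_judge_model(
        f"Rubric:\n{rubric}\n\nResponse:\n{response}\n\nScore the response from 0 to 1 against the rubric."
    )

def calibrate(raw: list[float], judge_anchor: list[float], human_anchor: list[float]) -> list[float]:
    """Stage 3: shift judge scores by the mean judge-vs-human gap on a small
    human-labeled anchor set, clipped back into [0, 1]."""
    offset = mean(human_anchor) - mean(judge_anchor)
    return [min(1.0, max(0.0, s + offset)) for s in raw]
```

The design choice that matters: calibration lives in the pipeline, not in the evaluator’s head. When the judge model changes, you re-run the anchor set rather than re-training people.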

Self-Evaluation Checklist

  • I’ve designed eval methodology for 3+ different domains
  • My frameworks are used by teams I don’t personally supervise
  • I can take a domain I’ve never worked in and design an eval framework within a week
  • Domain experts trust my quality standards — they agree with my definition of “good”
  • My methodology documentation is complete enough for independent implementation
  • I’ve caught methodological errors in other teams’ eval work that would have produced misleading results
  • I think about evaluation as a discipline with principles, not just a collection of domain-specific techniques
  • I’ve innovated — created a new metric, approach, or tool that other evaluators now use
  • Engineers and product leaders actively seek my input on what to measure and how

Training Curriculum

Months 1-8: Cross-Domain Methodology

  • Domain Immersion Rotations — deep dives into 2-3 domains you haven’t evaluated before. For each: understand the domain, identify quality criteria, design an eval framework, validate with domain experts.
  • Methodology Abstraction — after each domain rotation, extract the transferable principles. What’s universal about good evaluation? What’s domain-specific?
  • Framework Design Patterns — study evaluation frameworks across industries and academia. Build a pattern library of reusable methodology components.
  • Stakeholder Alignment Practice — run methodology alignment sessions with simulated stakeholders who have conflicting priorities. Product wants speed. Engineering wants coverage. Compliance wants thoroughness. Align them.

Months 9-16: Framework Engineering

  • Reusable Framework Development — build 2+ eval frameworks designed for reuse. Document them. Test them by having L3-L5 evaluators deploy them without your guidance.
  • Quality Standard Design — practice defining measurable quality criteria for ambiguous domains. When “good” is subjective, how do you make it rigorous?
  • Methodology Validation — develop techniques for validating that your eval methodology actually measures what it claims to measure. Construct validity. Content validity. Predictive validity (see the sketch after this list).
  • Innovation Projects — identify a gap in current eval methodology and build something to fill it. New metric? New evaluation approach? New calibration technique?
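For predictive validity specifically, here is a minimal sketch of one such check. The numbers are illustrative; in practice the pairs would come from your eval runs and production outcome logs:

```python
import numpy as np

# Offline framework scores for five model versions, and the downstream
# outcome those scores are supposed to predict (illustrative numbers).
eval_scores  = np.array([0.62, 0.71, 0.78, 0.83, 0.90])
prod_outcome = np.array([0.41, 0.47, 0.52, 0.58, 0.63])

r = np.corrcoef(eval_scores, prod_outcome)[0, 1]
print(f"Pearson r between offline eval score and production outcome: {r:.2f}")

# A weak or negative correlation is a methodology finding, not a model finding:
# the framework may be measuring something other than what stakeholders care about.
```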

Months 17-24: Organizational Impact

  • Multi-Team Methodology — design eval standards that work across multiple teams and projects. Handle the coordination challenges.
  • Regulatory Landscape — deep understanding of how regulations affect eval methodology. NIST AI RMF, EU AI Act, FDA guidance for healthcare AI, domain-specific standards.
  • Industry Engagement — present your methodology work externally. Conference talks, blog posts, or white papers. Start building your reputation beyond Worca.
  • L7 Preparation — study company-level eval strategy. How does evaluation methodology connect to business strategy, risk management, and compliance?

Ranking Standard

Metric | Threshold | How It’s Measured
Domains covered | 3+ domains with designed eval methodology | Portfolio review
Framework reuse | Frameworks deployed by teams you don’t supervise | Adoption tracking
Quality standard adoption | Standards accepted by domain experts | Expert validation
Methodology documentation | Implementable without author’s guidance | Independent deployment test
Innovation | 1+ novel evaluation approach adopted by others | Peer recognition
Stakeholder alignment | Methodology agreed upon by product, engineering, and compliance | Stakeholder feedback

Promotion to L7

Requirements

  • Minimum 24 months at L6
  • Pass L7 qualification assessment:
    • Company-level eval strategy — present how you would design the complete evaluation strategy for a company deploying AI across multiple products. Panel evaluates strategic thinking, regulatory awareness, and practical feasibility.
    • Methodology defense under pressure — present your evaluation methodology to a panel including domain experts, statisticians, and a regulatory specialist. Defend it against rigorous challenges.
    • Framework portfolio — present 3+ eval frameworks you designed for different domains. Panel evaluates transferability, rigor, and real-world effectiveness.
    • Organizational impact — demonstrate that your methodology has been adopted beyond your immediate team and produced measurable results.
  • Demonstrated regulatory awareness — understanding of how compliance requirements shape eval methodology
  • Industry recognition — external validation of your methodology work (talks, publications, client references)

What the Panel Looks For

  • Strategic depth — can they connect eval methodology to business risk, regulatory compliance, and organizational AI strategy?
  • Regulatory fluency — do they understand how compliance shapes evaluation? Can they design methodology that satisfies auditors?
  • Cross-domain proof — have they actually designed methodology for multiple domains, or just adapted one framework?
  • Influence — do people outside their team seek their methodology guidance?
  • Vision — do they see where AI evaluation is going, not just where it is?

Mentorship at This Level

  • You receive: monthly check-ins with an L8+ mentor or Worca leadership, focused on strategic thinking, the regulatory landscape, and industry positioning.
  • You give: 3 mentee slots (L1-L4). Active mentorship with a focus on methodology thinking, not just eval execution.
  • Referral cut: 5% of mentee’s monthly rate for 12 months after placement.
  • Panel duty: You serve on evaluation panels for L1-L5 promotions. Your methodology standards define the track.

What Unlocks at L7

  • Company-level strategy — you design how entire organizations evaluate AI
  • Regulatory and compliance — safety testing, audit frameworks, alignment evaluation
  • Industry visibility — your methodology work becomes part of your professional identity
  • Leadership trust — executives seek your judgment on AI quality and risk
  • The path toward defining industry standards