“You don’t just run the tests. You write the tests.”
You’ve proven you can follow guidelines accurately and work with LLM APIs. Now you start creating, not just executing. At L2, you write test cases, design simple evaluation prompts, and flag issues with enough evidence that an L3+ engineer can act on your findings without asking follow-up questions.
This is the discount billing tier. Clients know they’re getting someone in development, but you’re productive enough to deliver real value. Your test cases catch real problems. Your annotation work is reliable enough that it doesn’t need constant review. You’re halfway between trainee and independent contributor.
The shift from L1 to L2 is about initiative. At L1, you do what you’re told. At L2, you start seeing things nobody asked you to look for — and you document them well enough that people listen.
What You Do
- Write test cases — given a model capability, design test inputs that probe edge cases, failure modes, and boundary conditions.
- Run eval pipelines — execute full evaluation workflows independently. Troubleshoot when scripts fail. Interpret results without hand-holding.
- Flag issues with evidence — your bug reports include the prompt, the output, the expected behavior, why it matters, and how to reproduce it.
- Basic prompt evaluation — assess whether prompts are well-structured, test them against models, and report on quality.
- Annotation at scale — handle larger annotation projects with minimal oversight. Your labels are trusted.
- Data quality checks — review training data and eval datasets for errors, biases, and coverage gaps.
- Write simple eval scripts — modify existing scripts and write small new ones for specific test scenarios.
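The evidence bar described above ("prompt, output, expected behavior, why it matters, how to reproduce") can be made concrete in code. A minimal sketch, not tied to any internal tooling — the class and field names like `IssueReport` and `repro_steps` are illustrative:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class IssueReport:
    """One model issue, with everything a reviewer needs to act without follow-up."""
    prompt: str             # exact input sent to the model
    output: str             # what the model actually returned
    expected: str           # what a correct response looks like
    why_it_matters: str     # impact: who is affected, and how badly
    repro_steps: list[str] = field(default_factory=list)  # model, settings, commands

    def is_actionable(self) -> bool:
        # An actionable report has every field filled in, including repro steps.
        return all([self.prompt, self.output, self.expected,
                    self.why_it_matters, self.repro_steps])

report = IssueReport(
    prompt="List the capitals of the Nordic countries.",
    output="Oslo, Stockholm, Copenhagen.",
    expected="Oslo, Stockholm, Copenhagen, Helsinki, Reykjavik.",
    why_it_matters="Factual omission in a geography task; fails completeness checks.",
    repro_steps=["model=example-model-v1", "temperature=0", "send prompt verbatim"],
)
print(json.dumps(asdict(report), indent=2))
```

The payoff of a fixed structure is the `is_actionable` check: a report that would trigger a follow-up question never gets filed in the first place.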
AI Skills Required
- LLM-as-judge basics — understand how to use one model to evaluate another model’s output. Know the limitations.
- Test case design with AI — use AI to brainstorm edge cases and failure scenarios, then refine with human judgment.
- Eval script modification — extend existing Python eval scripts with new metrics, new test cases, and new output formats.
- Prompt analysis — evaluate prompt quality. Is it clear? Does it constrain the model appropriately? Does it leak the answer?
- AI-assisted data analysis — use AI to help analyze eval results, identify patterns in failures, and generate visualizations.
Self-Evaluation Checklist
- I’ve written 50+ test cases that found real issues in production models
- My bug reports are actionable without follow-up questions — prompt, output, expected, actual, reproduction steps
- I can run an eval pipeline end-to-end and troubleshoot failures independently
- I’ve modified existing eval scripts to add new test cases or metrics
- My annotation work requires minimal review — error rate under 5%
- I’ve identified at least one data quality issue that wasn’t part of my assigned task
- I can explain what common eval metrics mean and when to use each one
- I write Python scripts that other people can read and run
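The checklist item on eval metrics is testable on paper. A minimal sketch of accuracy, precision, and recall over binary pass/fail judgments — the labels are invented for illustration, and `True` here means "failure flagged":

```python
def metrics(predicted: list[bool], actual: list[bool]) -> dict[str, float]:
    """Accuracy, precision, and recall for binary failure-detection labels.

    - Accuracy: how often predicted matches actual.
    - Precision: of the cases flagged as failures, how many really failed.
    - Recall: of the real failures, how many were flagged.
    """
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    correct = sum(p == a for p, a in zip(predicted, actual))
    return {
        "accuracy": correct / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Invented labels: 4 eval items, True = failure.
pred = [True, False, True, False]
gold = [True, False, False, False]
print(metrics(pred, gold))  # accuracy 0.75, precision 0.5, recall 1.0
```

"When to use each one" is the real test: high recall matters when missed failures are expensive; high precision matters when every flagged issue costs an engineer's time to triage.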
Training Curriculum
Months 1-3: Test Case Craft
- Test Case Design — structured exercises in writing test cases for different model capabilities. Fact accuracy, instruction following, safety, format compliance, reasoning.
- Failure Mode Taxonomy — formalize your understanding of how models fail. Categorize failures. Learn which types are most common in which domains.
- Eval Script Development — go from modifying scripts to writing small ones. Python testing frameworks, data loading, metric computation, result formatting.
- Evidence-Based Reporting — practice writing issue reports that meet the bar: specific, reproducible, prioritized, actionable.
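The capabilities named above (format compliance, instruction following, and so on) lend themselves to test cases with machine-checkable conditions. A minimal sketch, assuming a stub in place of a real model call — `run_suite`, `stub_model`, and the checks themselves are illustrative:

```python
# Each test case pairs an input with a machine-checkable condition on the
# output, tagged by the capability it probes.
test_cases = [
    {
        "capability": "format compliance",
        "input": "Reply with valid JSON containing a single key 'answer'.",
        "check": lambda out: out.strip().startswith("{") and '"answer"' in out,
    },
    {
        "capability": "instruction following",
        "input": "Answer in exactly one word: what color is the sky?",
        "check": lambda out: len(out.split()) == 1,
    },
]

def run_suite(model_fn, cases) -> list[dict]:
    """Run every case through model_fn and record pass/fail per capability."""
    results = []
    for case in cases:
        output = model_fn(case["input"])
        results.append({
            "capability": case["capability"],
            "passed": bool(case["check"](output)),
        })
    return results

# Stub model for demonstration; a real run would call an actual model.
def stub_model(prompt: str) -> str:
    return '{"answer": "blue"}' if "JSON" in prompt else "Blue"

print(run_suite(stub_model, test_cases))
```

The discipline the structure enforces is the point: a test case without a decidable pass condition is an opinion, not a test.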
Months 4-6: Independence
- Solo Eval Runs — own an eval pipeline for a real client project. Run it, interpret results, write the summary report. Reviewed by L3+ but not hand-held.
- Prompt Evaluation — evaluate prompt quality systematically. Build a checklist. Test prompts against multiple models. Report on robustness.
- Data Quality Auditing — review training datasets and eval suites for issues. Coverage gaps, label errors, distribution biases.
- Cross-Domain Exposure — run eval tasks across different domains (healthcare, semiconductor, fintech) to understand how evaluation criteria change with context.
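The data quality auditing module above looks for coverage gaps, label errors, and distribution biases. A minimal sketch of the mechanical first pass — duplicate inputs, empty fields, label skew — with an illustrative dataset and an illustrative skew threshold:

```python
from collections import Counter

def audit(dataset: list[dict]) -> dict:
    """Flag basic data quality issues: duplicate inputs, empty fields,
    and label skew. Rows are {"input": str, "label": str}."""
    inputs = [row["input"] for row in dataset]
    labels = Counter(row["label"] for row in dataset)
    duplicates = [text for text, n in Counter(inputs).items() if n > 1]
    empties = sum(1 for row in dataset if not row["input"].strip())
    majority_share = max(labels.values()) / len(dataset)
    return {
        "duplicates": duplicates,
        "empty_inputs": empties,
        "label_counts": dict(labels),
        "skewed": majority_share > 0.9,  # illustrative threshold
    }

sample = [
    {"input": "2+2?", "label": "pass"},
    {"input": "2+2?", "label": "pass"},   # duplicate row
    {"input": "capital of France?", "label": "fail"},
]
print(audit(sample))
```

A script like this catches the cheap problems automatically; the judgment calls — whether a label is wrong, whether a topic is underrepresented — are what the human review adds on top.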
Ranking Standard
| Metric | Threshold | How It’s Measured |
|---|---|---|
| Test cases written | 50+ that found real issues | Test case log |
| Bug report quality | Actionable without follow-up | Review by L3+ |
| Eval pipeline operation | Can run and troubleshoot independently | Mentor observation |
| Annotation error rate | Under 5% | Spot-check audits |
| Script contributions | 3+ scripts written or significantly modified | Code review |
| Proactive issue discovery | 2+ issues found outside assigned scope | Issue log |
Promotion to L3
Requirements
- Minimum 6 months at L2
- Pass L3 qualification assessment:
  - Test case design challenge — given a model and a domain, design a comprehensive test suite. Panel evaluates coverage, creativity, and rigor.
  - Eval pipeline exercise — build a small eval pipeline from scratch. Load data, run inference, compute metrics, generate a report.
  - Root cause analysis — given a set of model failures, categorize them and propose what changes might fix each category.
  - Statistical literacy — explain p-values, confidence intervals, sample size requirements, and inter-rater reliability to a non-technical audience.
- Client feedback from 1+ placements
- Demonstrated initiative — evidence of finding issues nobody asked you to find
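The statistical-literacy requirement names inter-rater reliability. A minimal sketch of Cohen's kappa, the standard two-rater agreement measure; the rating lists are invented for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: observed agreement between two raters, corrected for
    the agreement expected by chance. 1.0 = perfect, 0.0 = chance-level."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum((counts_a[label] / n) * (counts_b[label] / n)
                   for label in counts_a.keys() | counts_b.keys())
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

This is the kind of explanation the panel wants in plain words: raw agreement of 83% sounds high, but once chance agreement is subtracted, these two raters agree at kappa 0.67 — decent, not excellent.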
What the Panel Looks For
- Test design instinct — do they think about edge cases naturally? Do their test cases catch things that matter?
- Independence — can they own an eval workflow without supervision?
- Statistical foundation — do they understand when results are meaningful and when they’re noise?
- Communication — can they explain findings clearly to engineers and stakeholders?
Mentorship at This Level
- You receive: L3+ mentor, bi-weekly check-ins. Focus on test case design, statistical literacy, and developing independent evaluation judgment.
- You give: Help L1 trainees with annotation questions and script basics. Informal, but it builds your teaching skills.
- Exposure: Participate in eval framework design discussions. Start understanding how L3+ engineers think about evaluation architecture.
What Unlocks at L3
- Standard billing rate — full productivity, full rate
- Eval framework design — you don’t just run frameworks, you build them
- Benchmark creation — design evaluation suites from scratch
- Red-teaming — systematically probe models for failure modes
- The default Worca placement level — this is where clients expect you to be