“You don’t just run the tests. You write the tests.”
You’ve proven you can follow guidelines accurately and work with LLM APIs. Now you start creating, not just executing. At L2, you write test cases, design simple evaluation prompts, and flag issues with enough evidence that an L3+ engineer can act on your findings without asking follow-up questions.
This is the discount billing tier. Clients know they’re getting someone in development, but you’re productive enough to deliver real value. Your test cases catch real problems. Your annotation work is reliable enough that it doesn’t need constant review. You’re halfway between trainee and independent contributor.
The shift from L1 to L2 is about initiative. At L1, you do what you’re told. At L2, you start seeing things nobody asked you to look for — and you document them well enough that people listen.
What You Do
- Write test cases — given a model capability, design test inputs that probe edge cases, failure modes, and boundary conditions.
- Run eval pipelines — execute full evaluation workflows independently. Troubleshoot when scripts fail. Interpret results without hand-holding.
- Flag issues with evidence — your bug reports include the prompt, the output, the expected behavior, why it matters, and how to reproduce it.
- Basic prompt evaluation — assess whether prompts are well-structured, test them against models, and report on quality.
- Annotation at scale — handle larger annotation projects with minimal oversight. Your labels are trusted.
- Data quality checks — review training data and eval datasets for errors, biases, and coverage gaps.
- Write simple eval scripts — modify existing scripts and write small new ones for specific test scenarios.
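The evidence bar described above ("prompt, output, expected behavior, why it matters, how to reproduce") can be made concrete in code. A minimal sketch, not tied to any internal tooling — the class and field names like `IssueReport` and `repro_steps` are illustrative:

```python
from dataclasses import dataclass, field, asdict
import json

@dataclass
class IssueReport:
    """One model issue, with everything a reviewer needs to act without follow-up."""
    prompt: str             # exact input sent to the model
    output: str             # what the model actually returned
    expected: str           # what a correct response looks like
    why_it_matters: str     # impact: who is affected, and how badly
    repro_steps: list[str] = field(default_factory=list)  # model, settings, commands

    def is_actionable(self) -> bool:
        # An actionable report has every field filled in, including repro steps.
        return all([self.prompt, self.output, self.expected,
                    self.why_it_matters, self.repro_steps])

report = IssueReport(
    prompt="List the capitals of the Nordic countries.",
    output="Oslo, Stockholm, Copenhagen.",
    expected="Oslo, Stockholm, Copenhagen, Helsinki, Reykjavik.",
    why_it_matters="Factual omission in a geography task; fails completeness checks.",
    repro_steps=["model=example-model-v1", "temperature=0", "send prompt verbatim"],
)
print(json.dumps(asdict(report), indent=2))
```

The payoff of a fixed structure is the `is_actionable` check: a report that would trigger a follow-up question never gets filed in the first place.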
AI Skills Required
- LLM-as-judge basics — understand how to use one model to evaluate another model’s output. Know the limitations.
- Test case design with AI — use AI to brainstorm edge cases and failure scenarios, then refine with human judgment.
- Eval script modification — extend existing Python eval scripts with new metrics, new test cases, and new output formats.
- Prompt analysis — evaluate prompt quality. Is it clear? Does it constrain the model appropriately? Does it leak the answer?
- AI-assisted data analysis — use AI to help analyze eval results, identify patterns in failures, and generate visualizations.
Self-Evaluation Checklist
- I’ve written 50+ test cases that found real issues in production models
- My bug reports are actionable without follow-up questions — prompt, output, expected, actual, reproduction steps
- I can run an eval pipeline end-to-end and troubleshoot failures independently
- I’ve modified existing eval scripts to add new test cases or metrics
- My annotation work requires minimal review — error rate under 5%
- I’ve identified at least one data quality issue that wasn’t part of my assigned task
- I can explain what common eval metrics mean and when to use each one
- I write Python scripts that other people can read and run
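The checklist item on eval metrics is testable on paper. A minimal sketch of accuracy, precision, and recall over binary pass/fail judgments — the labels are invented for illustration, and `True` here means "failure flagged":

```python
def metrics(predicted: list[bool], actual: list[bool]) -> dict[str, float]:
    """Accuracy, precision, and recall for binary failure-detection labels.

    - Accuracy: how often predicted matches actual.
    - Precision: of the cases flagged as failures, how many really failed.
    - Recall: of the real failures, how many were flagged.
    """
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum(not p and a for p, a in zip(predicted, actual))
    correct = sum(p == a for p, a in zip(predicted, actual))
    return {
        "accuracy": correct / len(actual),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Invented labels: 4 eval items, True = failure.
pred = [True, False, True, False]
gold = [True, False, False, False]
print(metrics(pred, gold))  # accuracy 0.75, precision 0.5, recall 1.0
```

"When to use each one" is the real test: high recall matters when missed failures are expensive; high precision matters when every flagged issue costs an engineer's time to triage.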
Training Curriculum
Months 1-3: Test Case Craft
- Test Case Design — structured exercises in writing test cases for different model capabilities. Fact accuracy, instruction following, safety, format compliance, reasoning.
- Failure Mode Taxonomy — formalize your understanding of how models fail. Categorize failures. Learn which types are most common in which domains.
- Eval Script Development — go from modifying scripts to writing small ones. Python testing frameworks, data loading, metric computation, result formatting.
- Evidence-Based Reporting — practice writing issue reports that meet the bar: specific, reproducible, prioritized, actionable.
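The capabilities named above (format compliance, instruction following, and so on) lend themselves to test cases with machine-checkable conditions. A minimal sketch, assuming a stub in place of a real model call — `run_suite`, `stub_model`, and the checks themselves are illustrative:

```python
# Each test case pairs an input with a machine-checkable condition on the
# output, tagged by the capability it probes.
test_cases = [
    {
        "capability": "format compliance",
        "input": "Reply with valid JSON containing a single key 'answer'.",
        "check": lambda out: out.strip().startswith("{") and '"answer"' in out,
    },
    {
        "capability": "instruction following",
        "input": "Answer in exactly one word: what color is the sky?",
        "check": lambda out: len(out.split()) == 1,
    },
]

def run_suite(model_fn, cases) -> list[dict]:
    """Run every case through model_fn and record pass/fail per capability."""
    results = []
    for case in cases:
        output = model_fn(case["input"])
        results.append({
            "capability": case["capability"],
            "passed": bool(case["check"](output)),
        })
    return results

# Stub model for demonstration; a real run would call an actual model.
def stub_model(prompt: str) -> str:
    return '{"answer": "blue"}' if "JSON" in prompt else "Blue"

print(run_suite(stub_model, test_cases))
```

The discipline the structure enforces is the point: a test case without a decidable pass condition is an opinion, not a test.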
Months 4-6: Independence
- Solo Eval Runs — own an eval pipeline for a real client project. Run it, interpret results, write the summary report. Reviewed by L3+ but not hand-held.
- Prompt Evaluation — evaluate prompt quality systematically. Build a checklist. Test prompts against multiple models. Report on robustness.
- Data Quality Auditing — review training datasets and eval suites for issues. Coverage gaps, label errors, distribution biases.
- Cross-Domain Exposure — run eval tasks across different domains (healthcare, semiconductor, fintech) to understand how evaluation criteria change with context.
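The data quality auditing module above looks for coverage gaps, label errors, and distribution biases. A minimal sketch of the mechanical first pass — duplicate inputs, empty fields, label skew — with an illustrative dataset and an illustrative skew threshold:

```python
from collections import Counter

def audit(dataset: list[dict]) -> dict:
    """Flag basic data quality issues: duplicate inputs, empty fields,
    and label skew. Rows are {"input": str, "label": str}."""
    inputs = [row["input"] for row in dataset]
    labels = Counter(row["label"] for row in dataset)
    duplicates = [text for text, n in Counter(inputs).items() if n > 1]
    empties = sum(1 for row in dataset if not row["input"].strip())
    majority_share = max(labels.values()) / len(dataset)
    return {
        "duplicates": duplicates,
        "empty_inputs": empties,
        "label_counts": dict(labels),
        "skewed": majority_share > 0.9,  # illustrative threshold
    }

sample = [
    {"input": "2+2?", "label": "pass"},
    {"input": "2+2?", "label": "pass"},   # duplicate row
    {"input": "capital of France?", "label": "fail"},
]
print(audit(sample))
```

A script like this catches the cheap problems automatically; the judgment calls — whether a label is wrong, whether a topic is underrepresented — are what the human review adds on top.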
Ranking Standard
| Metric | Threshold | How It’s Measured |
|---|---|---|
| Test cases written | 50+ that found real issues | Test case log |
| Bug report quality | Actionable without follow-up | Review by L3+ |
| Eval pipeline operation | Can run and troubleshoot independently | Mentor observation |
| Annotation error rate | Under 5% | Spot-check audits |
| Script contributions | 3+ scripts written or significantly modified | Code review |
| Proactive issue discovery | 2+ issues found outside assigned scope | Issue log |
Promotion to L3
Requirements
- Minimum 6 months at L2
- Pass L3 qualification assessment:
  - Test case design challenge — given a model and a domain, design a comprehensive test suite. Panel evaluates coverage, creativity, and rigor.
  - Eval pipeline exercise — build a small eval pipeline from scratch. Load data, run inference, compute metrics, generate a report.
  - Root cause analysis — given a set of model failures, categorize them and propose what changes might fix each category.
  - Statistical literacy — explain p-values, confidence intervals, sample size requirements, and inter-rater reliability to a non-technical audience.
- Client feedback from 1+ placements
- Demonstrated initiative — evidence of finding issues nobody asked you to find
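The statistical-literacy requirement names inter-rater reliability. A minimal sketch of Cohen's kappa, the standard two-rater agreement measure; the rating lists are invented for illustration:

```python
from collections import Counter

def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: observed agreement between two raters, corrected for
    the agreement expected by chance. 1.0 = perfect, 0.0 = chance-level."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    # Chance agreement: probability both raters pick the same label at random,
    # given each rater's own label frequencies.
    expected = sum((counts_a[label] / n) * (counts_b[label] / n)
                   for label in counts_a.keys() | counts_b.keys())
    return (observed - expected) / (1 - expected)

a = ["pass", "pass", "fail", "pass", "fail", "pass"]
b = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(a, b), 3))  # → 0.667
```

This is the kind of explanation the panel wants in plain words: raw agreement of 83% sounds high, but once chance agreement is subtracted, these two raters agree at kappa 0.67 — decent, not excellent.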
What the Panel Looks For
- Test design instinct — do they think about edge cases naturally? Do their test cases catch things that matter?
- Independence — can they own an eval workflow without supervision?
- Statistical foundation — do they understand when results are meaningful and when they’re noise?
- Communication — can they explain findings clearly to engineers and stakeholders?
Mentorship at This Level
- You receive: L3+ mentor, bi-weekly check-ins. Focus on test case design, statistical literacy, and developing independent evaluation judgment.
- You give: Help L1 trainees with annotation questions and script basics. Informal, but it builds your teaching skills.
- Exposure: Participate in eval framework design discussions. Start understanding how L3+ engineers think about evaluation architecture.
What Unlocks at L3
- Standard billing rate — full productivity, full rate
- Eval framework design — you don’t just run frameworks, you build them
- Benchmark creation — design evaluation suites from scratch
- Red-teaming — systematically probe models for failure modes
- The default Worca placement level — this is where clients expect you to be