Contract or Full-Time

AI Eval Engineer

Remote or Hybrid · US Hours Required

February 2026

Python · LLM APIs · Claude Code · Statistical Analysis · Prompt Engineering · CI/CD

Why This Role Exists

AI systems are probabilistic, not deterministic. Traditional QA asks “does this pass or fail?” AI evaluation asks “why did the model give this answer — and can we trust it?” As companies move AI from demos to production, someone needs to build the systems that answer that question rigorously and at scale.

This is a new discipline. There’s no established playbook. The startups we work with need engineers who can design evaluation frameworks from scratch — benchmarks, red-teaming pipelines, safety testing, regression suites — and iterate as models and requirements change.

What You’d Work On

  • LLM evaluation pipelines — accuracy, hallucination detection, safety, and trustworthiness scoring
  • Benchmark design — create and maintain domain-specific evaluation suites that catch real failures
  • Red-teaming and adversarial testing — systematically probe models for failure modes and edge cases
  • Automated regression testing — catch quality regressions when models are updated or fine-tuned
  • Human-in-the-loop evaluation workflows — design rubrics, manage annotation, measure inter-rater reliability
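To make the regression-testing bullet concrete, here is a minimal, illustrative sketch of the kind of automated gate this role would build. Every name in it (`EvalCase`, `score`, `regression_gate`, the keyword rubric) is hypothetical, not a real framework; in practice the scorer would be a judge model or task-specific metric rather than keyword matching.

```python
# Illustrative sketch only: a tiny regression gate for LLM outputs.
# All names and the keyword rubric are hypothetical, not a real framework.
from dataclasses import dataclass

@dataclass
class EvalCase:
    prompt: str
    output: str                 # model response under test
    required_facts: list[str]   # rubric: facts the answer must mention

def score(case: EvalCase) -> float:
    """Fraction of required facts present in the output (crude keyword rubric)."""
    hits = sum(1 for fact in case.required_facts
               if fact.lower() in case.output.lower())
    return hits / len(case.required_facts)

def regression_gate(cases: list[EvalCase], threshold: float = 0.9) -> bool:
    """Fail the build if the mean rubric score drops below the threshold."""
    mean = sum(score(c) for c in cases) / len(cases)
    return mean >= threshold

cases = [
    EvalCase("Capital of France?", "Paris is the capital of France.", ["paris"]),
    EvalCase("Boiling point of water?", "100 °C at sea level.", ["100"]),
]
print(regression_gate(cases))  # True: both answers satisfy the rubric
```

The point of a gate like this is that it runs in CI on every model update or fine-tune, turning a probabilistic system into something a deploy pipeline can still say yes or no to.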

Who We’re Looking For

  • Strong Python — testing frameworks, data analysis, scripting. You build evaluation systems, not just run them.
  • Thinks in probabilities — you understand that AI outputs aren’t pass/fail. You can reason about confidence, uncertainty, and when a model’s answer is “good enough.”
  • LLM/ML evaluation experience — you’ve built or maintained evaluation pipelines for real AI products
  • Statistical rigor — you understand metrics, significance, sampling, and inter-rater reliability
  • Prompt engineering depth — you can design evaluation prompts, judge models, and scoring rubrics
  • Comfortable with ambiguity — evaluation criteria are often subjective and evolving. You define the standard, not just enforce it.
  • AI-first workflow — you use Claude Code, Cursor, or similar to move fast
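As one concrete example of the statistical rigor above: when human annotators grade model outputs, agreement between raters should be measured against chance. A common choice is Cohen's kappa; this is a self-contained sketch for two raters with pass/fail labels (the rater data is invented for illustration).

```python
# Illustrative sketch: Cohen's kappa for two annotators' pass/fail labels.
def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Agreement between two raters, corrected for chance agreement."""
    assert len(a) == len(b) and a
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    labels = set(a) | set(b)
    p_e = sum((a.count(l) / n) * (b.count(l) / n)         # agreement expected
              for l in labels)                            # by chance alone
    return (p_o - p_e) / (1 - p_e)

rater1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
rater2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(rater1, rater2), 2))  # 0.67: substantial agreement
```

A raw agreement rate of 5/6 here shrinks to a kappa of about 0.67 once chance is accounted for, which is exactly the kind of distinction this role is expected to reason about.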

What Worca Offers

  • Interesting work — directly with founding teams at AI startups
  • Flexibility — hourly, part-time, or full-time
  • USD compensation — competitive, benchmarked to the role
  • Continuity — Worca manages employment and matches you to your next engagement

Engagement Structure

  • Type: Contract (hourly or monthly) or full-time
  • Trial: 2-4 week trial project
  • Timezone: APAC-based, US hours overlap required
  • Location: Remote — Philippines, Taiwan, Singapore, or broader APAC

How to Apply

Send your resume and a brief note on evaluation work you’ve done to careers@worca.io.