Position: PhD Rater
Type: Part-Time
Compensation: $50–$100/hour
Location: Remote
Commitment: 30+ hours/week (primarily weekdays)
Role Responsibilities
- Design challenging, real-world STEM benchmark problems in domains such as data science, machine learning, finance, and software engineering.
- Implement tasks within an agentic development environment using Python.
- Create reproducible problem setups with clear specifications and executable tests.
- Evaluate and analyze AI model behavior, including reasoning traces and agent workflows.
- Diagnose reasoning failures, logic gaps, and problem-solving limitations in AI systems.
- Contribute to improving benchmark quality and evaluation frameworks for frontier AI models.
Requirements
- Currently enrolled in a PhD program, or a recent PhD graduate.
- Deep expertise in data science, machine learning, finance, and/or Python-based software development.
- Strong research background in advanced STEM topics.
- Ability to commit reliably for 30+ hours per week.
- Demonstrated technical output, such as high-quality open-source contributions or published research.
- Ability to analyze agent behavior traces and diagnose failures beyond surface-level errors.
Application Process
- Upload resume
- Interview
- Submit form