RL Environment Reviewer
Preference Model
Location: San Francisco, CA, USA
Employment Type: Full-time
Location Type: On-site
Department: Engineering
About us
Preference Model is building the next generation of training data to power the future of AI. Today's models are powerful but fall short of their potential across diverse use cases because so many of the tasks we want to use these models for are out of distribution. Preference Model creates RL environments where models encounter research and engineering problems, iterate, and learn from realistic feedback loops.
Our founding team previously worked on Anthropic's data team, building the data infrastructure, tokenizers, and datasets behind Claude. We are partnering with leading AI labs to push AI closer to its transformative potential. We are backed by a16z.
About the role
Every RL environment we ship needs to survive a model that is actively trying to game it. A task with a weak grader or an exploitable reward signal is worse than no task at all: it teaches the model to hack rather than reason. We need someone whose full-time job is finding those holes before the model does.
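As a concrete illustration (a simplified sketch of our own, not one of our production graders), here is the kind of weak reward signal this role exists to catch, alongside a tightened version a reviewer would push for:

# Simplified illustration (hypothetical grader, not from our task catalog):
# a substring check that a model can satisfy without doing the task.

def weak_grader(stdout: str) -> float:
    # Full reward for any output containing the expected answer, so a model
    # that learns to print "42" regardless of the input gets a perfect score.
    return 1.0 if "42" in stdout else 0.0

def tightened_grader(stdout: str) -> float:
    # Exact match on the final output line closes the substring loophole;
    # a reviewer would still probe for hardcoding and test-case leakage.
    lines = stdout.strip().splitlines()
    return 1.0 if lines and lines[-1].strip() == "42" else 0.0

# The exploit: reward without reasoning.
assert weak_grader("the answer is 426") == 1.0       # false positive
assert tightened_grader("the answer is 426") == 0.0  # loophole closed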
We've learned that domain knowledge alone doesn't make a good reviewer. The people who are best at this have spent time thinking adversarially: designing problems that are hard to game, breaking other people's problems, or researching reward hacking directly.
What you will do
Review RL environments and training tasks for correctness, robustness, and resistance to reward hacking
Identify ways a model could exploit graders, game evaluation criteria, or shortcut past the intended reasoning
Work directly with environment authors to tighten graders, fix reward signals, and redesign tasks that don't hold up
Develop and maintain review standards and checklists as we scale from hundreds to thousands of tasks per month
Advise on grader design during environment planning, before tasks are built, not after
What we are looking for
You think like an attacker. You've spent real time designing problems that are hard to game, or breaking problems other people thought were solid. You have enough ML knowledge to understand what a model might try, and enough engineering sense to evaluate whether a grader actually tests what it says it tests.
Must have:
Track record of adversarial or constructive problem design: competitive programming problem authoring (ICPC, Codeforces, etc.), CTF challenge design, or similar
Familiarity with RL, reward hacking, and specification gaming (you've read Amodei et al., Krakovna's list, or similar work, and you've thought about it beyond the surface level)
Strong Python reading skills
Ability to articulate clearly in writing why a task is broken and what needs to change
Any of these would make you stand out:
Published research on reward hacking, specification gaming, RLHF robustness, or AI safety
Background in security engineering, penetration testing, or red-teaming (with enough ML context to apply that mindset to RL environments)
Experience authoring or reviewing problems for competitive programming contests
Experience building automated evaluation systems, with a sense of where they break
Experience with LLM evaluation, benchmarking, or alignment research
What we offer
Competitive cash and equity compensation (>90th percentile)
Ownership and autonomy in a fast moving startup environment
Opportunity to work with top machine learning engineers
Health, vision, and dental benefits
401(k) match
Visa sponsorship & relocation support available
We value diverse perspectives and experiences. If you're excited about this role but don't check every box, we still encourage you to apply.
