Low-Level / Kernels Engineer
San Francisco, CA, USA
About XOR
XOR is a platform that helps world-class companies pushing the frontier of AI hire exceptional ML, RL, and AI engineering talent.
About Our Client
Our client is a well-funded AI startup working on next-generation training systems for large language models. The team is small, technical, and moving fast, with a strong focus on hands-on engineering over process.
About the Role
We’re looking for experienced engineers for a Low-Level / Kernels team that builds training tasks at the lowest layers of the stack - GPU and accelerator kernels, vector ISAs, codec and crypto primitives, FPGA work, and more. These are domains where current frontier models are weakest: niche paradigms, hardware underrepresented in training data, and open benchmarks where models lag. The role blends research and engineering - you'll develop novel approaches and realize them in code, owning tasks end-to-end: choosing the domain, designing the problems, building the scoring and infrastructure, and hardening each task against shortcuts.
What You'll Do
- Design and build low-level / kernel-focused training tasks that target a specified model and difficulty distribution
- Choose which tasks are worth building: tasks that target niche or genuinely hard domains, exercise real hardware features (tiling, streaming, async copy, vector ISAs), use interesting hardware or simulators (FPGAs, novel accelerators, gem5), are grounded in benchmarks where models lag, have a recognized reference to measure against (cuBLAS, FFTW, OpenSSL, etc.), and scale into many diverse tasks from a single design
- Build correctness and performance scoring that's deterministic and can't be gamed - the objective is clear, and the only way to hit it is to actually write the kernel
What We're Looking For
- Strong low-level / systems engineering: fluent in C / C++ / CUDA (or an equivalent kernel language), comfortable dropping to assembly when it matters
- Strong, engineering-quality Python across prior work - production code, automation and deployment scripts, data analysis and plotting (not notebook-only)
- Hardware-aware coding: writing with the silicon in mind, considering memory hierarchy, occupancy, data movement, parallelism, and latency vs. throughput
- Kernel development experience: writing kernels and optimizing them iteratively against a profiler
- An adversarial mindset: turning fuzzy goals into robust, ungameable scoring, and asking 'how would a model cheat this?'
- Hands-on work with LLMs
- Ownership and autonomy: building, debugging, and shipping end-to-end with minimal supervision
Nice to Have
- A shipped kernel that approached state of the art, and the ability to explain the remaining gap
- Depth in a niche hardware target or ISA: FPGA/HLS, RISC-V Vector, DSPs, SIMD/AVX, TPUs
- Depth in an adjacent discipline: HPC/heterogeneous clusters, hardware design (RTL/HDL, HLS), compilers and kernel toolchains (MLIR/LLVM, Mojo, Triton, gem5), or formal verification (Lean, Coq, SMT)
- A habit of reading performance and architecture papers and turning them into running code
- Open-source contributions others rely on
- Strong competitive-programming background (ideally in a low-level language)
- Experience building evaluation infrastructure or agent harnesses
