DEVLOG: Fine-tuning Evo2 with RLVR for Regulatory DNA Design

Apr 24, 2026

This is a planning devlog for a project I’ve been thinking about for a while — using reinforcement learning with verifiable rewards (RLVR) to fine-tune Evo2 for targeted regulatory DNA design.

The Problem

Designing regulatory DNA sequences that activate in a specific cell type is hard. Current generative models — including Evo2, which I’ve worked with before — can produce plausible sequences, but they’re not steered toward any particular functional objective. You get diversity, not specificity.

The standard approach is to train a supervised model on MPRA (Massively Parallel Reporter Assay) data and hope it generalizes. But MPRA datasets now contain 150M+ labeled sequences with measured regulatory activity across cell types — enough signal to train a reward model. And once you have a reward model, you can do RL.
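To make the reward-model idea concrete, here's a toy sketch: a linear regression over dinucleotide counts, trained by plain SGD. This is a stand-in only — a real MPRA reward model would be a neural network over far richer features — but it shows the shape of the thing: sequences in, measured activity as the regression target, a scalar score out.

```python
def kmer_counts(seq, k=2):
    # Featurize a DNA sequence as normalized k-mer (here, dimer) counts.
    kmers = [a + b for a in "ACGT" for b in "ACGT"]
    idx = {km: i for i, km in enumerate(kmers)}
    v = [0.0] * len(kmers)
    for i in range(len(seq) - k + 1):
        v[idx[seq[i:i + k]]] += 1.0
    n = max(len(seq) - k + 1, 1)
    return [x / n for x in v]

def train_reward_model(seqs, activities, lr=0.5, epochs=200):
    # Least-squares regression via SGD: a toy stand-in for a reward
    # model fit to measured MPRA activities.
    X = [kmer_counts(s) for s in seqs]
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(X, activities):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def score(w, b, seq):
    # Scalar reward for a candidate sequence.
    x = kmer_counts(seq)
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

The point is the interface, not the model class: `score` is the only thing the RL loop ever sees.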

The Idea

The core loop is straightforward:

  1. Evo2 generates candidate regulatory sequences
  2. A reward model trained on MPRA data scores each sequence for actual regulatory activity in a target cell type
  3. RLVR updates Evo2 toward sequences that score high
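The loop above, shrunk to something runnable: a categorical policy over bases stands in for Evo2, a hard-coded scorer stands in for the MPRA reward model, and REINFORCE with a mean-reward baseline stands in for the RLVR update. Everything here is a toy assumption — the real loop swaps in model sampling and reward-model scoring at the same two points.

```python
import math
import random

BASES = "ACGT"

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def toy_reward(seq):
    # Stand-in for the MPRA-trained reward model: fraction of 'A'.
    return seq.count("A") / len(seq)

def sample_batch(logits, n, length, rng):
    # Stand-in for Evo2 generation: i.i.d. bases from one categorical.
    probs = softmax(logits)
    return ["".join(rng.choices(BASES, weights=probs, k=length)) for _ in range(n)]

def reinforce_step(logits, batch, lr=0.5):
    # REINFORCE with a mean-reward baseline on shared per-base logits.
    rewards = [toy_reward(s) for s in batch]
    baseline = sum(rewards) / len(rewards)
    probs = softmax(logits)
    grad = [0.0] * 4
    total = 0
    for seq, r in zip(batch, rewards):
        adv = r - baseline
        for base in seq:
            b = BASES.index(base)
            for k in range(4):
                # grad of log softmax at the chosen base: onehot - probs
                grad[k] += adv * ((1.0 if k == b else 0.0) - probs[k])
            total += 1
    return [l + lr * g / total for l, g in zip(logits, grad)]

rng = random.Random(0)
logits = [0.0] * 4
for _ in range(500):
    logits = reinforce_step(logits, sample_batch(logits, 16, 20, rng))
```

After a few hundred steps the policy shifts its mass toward whatever the reward function favors — which is exactly the failure mode to watch for too: reward hacking against the MPRA model rather than real activity.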

The reward is grounded in real experimental measurements, not another model’s predictions. That’s what makes it “verifiable” in the RLVR sense — similar to how math RL uses a checker rather than a judge.

Top candidates from the RL loop get validated against Borzoi as a secondary oracle (a sequence-to-activity model trained on ENCODE data, Nature Genetics 2024).
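The validation step is just a two-gate filter. A minimal sketch, assuming both models emit scalar scores on comparable scales (the thresholds below are placeholders, not calibrated values):

```python
def filter_candidates(candidates, reward_thresh, oracle_thresh):
    # candidates: list of (sequence, reward_model_score, oracle_score).
    # Keep only sequences where the MPRA reward model and the secondary
    # oracle (e.g. Borzoi) both call the sequence active — disagreement
    # between the two is a cheap reward-hacking detector.
    return [seq for seq, r, o in candidates
            if r >= reward_thresh and o >= oracle_thresh]
```

Sequences that score high on the reward model but low on the oracle are arguably the most informative output of the whole pipeline: they mark where the RL policy found holes in the reward model.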

This is inspired by the recent RL for Crystal Relaxation work — same philosophy: use a physically grounded reward instead of a learned proxy.

Why This Hasn’t Been Done

I looked through the literature and nobody has combined Evo2-scale generation (~7B parameters, trained on 2.7M genomes) with MPRA-grounded reward models under RLVR. The gap makes sense: until recently, neither the generator nor the labeled data existed at this scale, and the RLVR tooling is newer still.

The pieces are all available now. The RLVR framework from Prime Intellect’s verifiers library maps cleanly onto this — you just swap out the math verifier for an MPRA-trained reward model.
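I haven't pinned down the exact verifiers API yet, so the sketch below only shows the conceptual swap: both rewards are plain callables of the same shape, and everything else in the RL harness stays put. `MPRARewardModel`, its `score` method, and the `K562` cell type are hypothetical placeholders, not real library code.

```python
class MPRARewardModel:
    """Hypothetical wrapper around a model trained on MPRA measurements."""

    def __init__(self, target_cell_type):
        self.target = target_cell_type

    def score(self, sequence):
        # Placeholder: a real implementation would run the trained model.
        # Here, a deterministic toy score clipped to [0, 1].
        return min(1.0, 4.0 * sequence.count("GC") / max(len(sequence), 1))

def math_verifier_reward(completion, answer, **kwargs):
    # The kind of binary checker used in math RLVR: exact-match, 0 or 1.
    return 1.0 if completion.strip() == answer.strip() else 0.0

def mpra_reward(completion, **kwargs):
    # Drop-in replacement: same callable shape, but a continuous reward
    # grounded in experimental measurements instead of string equality.
    rm = MPRARewardModel(target_cell_type="K562")
    return rm.score(completion)
```

One real difference the sketch glosses over: the math checker is binary and exact, while the MPRA reward is continuous and noisy, which changes how much you can trust high scores at the tail.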

Prior Work I’m Building On

I’ve already shipped one application on top of Evo1, so I have a practical sense of where generative DNA models fail on functional sequence design. That experience is what made this problem obvious to me.

Compute Plan

Rough estimate: 100–150 H100 hours total.

Starting with Evo2-7B. If the reward signal is strong, scaling up to the larger 40B checkpoint is the obvious next step.

Next Steps

Will post updates as I make progress.