DEVLOG: Fine-tuning Evo2 with RLVR for Regulatory DNA Design

Apr 24, 2026

This is a planning devlog for a project I’ve been thinking about for a while — using reinforcement learning with verifiable rewards (RLVR) to fine-tune Evo2 for targeted regulatory DNA design.

The Problem

Designing regulatory DNA sequences that activate in a specific cell type is hard. Current generative models — including Evo2, which I’ve worked with before — can produce plausible sequences, but they’re not steered toward any particular functional objective. You get diversity, not specificity.

The standard approach is to train a supervised model on MPRA (Massively Parallel Reporter Assay) data and hope it generalizes. But MPRA datasets now contain 150M+ labeled sequences with measured regulatory activity across cell types — enough signal to train a reward model. And once you have a reward model, you can do RL.
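To make the reward-model idea concrete, here's a toy sketch: a linear regression over dinucleotide counts, trained by plain SGD. This is a stand-in only — a real MPRA reward model would be a neural network over far richer features — but it shows the shape of the thing: sequences in, measured activity as the regression target, a scalar score out.

```python
def kmer_counts(seq, k=2):
    # Featurize a DNA sequence as normalized k-mer (here, dimer) counts.
    kmers = [a + b for a in "ACGT" for b in "ACGT"]
    idx = {km: i for i, km in enumerate(kmers)}
    v = [0.0] * len(kmers)
    for i in range(len(seq) - k + 1):
        v[idx[seq[i:i + k]]] += 1.0
    n = max(len(seq) - k + 1, 1)
    return [x / n for x in v]

def train_reward_model(seqs, activities, lr=0.5, epochs=200):
    # Least-squares regression via SGD: a toy stand-in for a reward
    # model fit to measured MPRA activities.
    X = [kmer_counts(s) for s in seqs]
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(X, activities):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b

def score(w, b, seq):
    # Scalar reward for a candidate sequence.
    x = kmer_counts(seq)
    return sum(wi * xi for wi, xi in zip(w, x)) + b
```

The point is the interface, not the model class: `score` is the only thing the RL loop ever sees.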

The Idea

The core loop is straightforward:

  1. Evo2 generates candidate regulatory sequences
  2. A reward model trained on MPRA data scores each sequence for actual regulatory activity in a target cell type
  3. RLVR updates Evo2 toward sequences that score high
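The loop above, shrunk to something runnable: a categorical policy over bases stands in for Evo2, a hard-coded scorer stands in for the MPRA reward model, and REINFORCE with a mean-reward baseline stands in for the RLVR update. Everything here is a toy assumption — the real loop swaps in model sampling and reward-model scoring at the same two points.

```python
import math
import random

BASES = "ACGT"

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def toy_reward(seq):
    # Stand-in for the MPRA-trained reward model: fraction of 'A'.
    return seq.count("A") / len(seq)

def sample_batch(logits, n, length, rng):
    # Stand-in for Evo2 generation: i.i.d. bases from one categorical.
    probs = softmax(logits)
    return ["".join(rng.choices(BASES, weights=probs, k=length)) for _ in range(n)]

def reinforce_step(logits, batch, lr=0.5):
    # REINFORCE with a mean-reward baseline on shared per-base logits.
    rewards = [toy_reward(s) for s in batch]
    baseline = sum(rewards) / len(rewards)
    probs = softmax(logits)
    grad = [0.0] * 4
    total = 0
    for seq, r in zip(batch, rewards):
        adv = r - baseline
        for base in seq:
            b = BASES.index(base)
            for k in range(4):
                # grad of log softmax at the chosen base: onehot - probs
                grad[k] += adv * ((1.0 if k == b else 0.0) - probs[k])
            total += 1
    return [l + lr * g / total for l, g in zip(logits, grad)]

rng = random.Random(0)
logits = [0.0] * 4
for _ in range(500):
    logits = reinforce_step(logits, sample_batch(logits, 16, 20, rng))
```

After a few hundred steps the policy shifts its mass toward whatever the reward function favors — which is exactly the failure mode to watch for too: reward hacking against the MPRA model rather than real activity.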

The reward is grounded in real experimental measurements, not another model’s predictions. That’s what makes it “verifiable” in the RLVR sense — similar to how math RL uses a checker rather than a judge.

Top candidates from the RL loop get validated against Borzoi as a secondary oracle (a sequence-to-activity model trained on ENCODE data, Nature Genetics 2024).
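The validation step is just a two-gate filter. A minimal sketch, assuming both models emit scalar scores on comparable scales (the thresholds below are placeholders, not calibrated values):

```python
def filter_candidates(candidates, reward_thresh, oracle_thresh):
    # candidates: list of (sequence, reward_model_score, oracle_score).
    # Keep only sequences where the MPRA reward model and the secondary
    # oracle (e.g. Borzoi) both call the sequence active — disagreement
    # between the two is a cheap reward-hacking detector.
    return [seq for seq, r, o in candidates
            if r >= reward_thresh and o >= oracle_thresh]
```

Sequences that score high on the reward model but low on the oracle are arguably the most informative output of the whole pipeline: they mark where the RL policy found holes in the reward model.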

This is inspired by the recent RL for Crystal Relaxation work — same philosophy: use a physically grounded reward instead of a learned proxy.

Why This Hasn’t Been Done

I looked through the literature and nobody has combined Evo2-scale generation (~7B parameters, trained on 2.7M genomes) with MPRA-grounded reward models under RLVR. The gap makes sense: until recently, neither the generator nor the labeled data existed at this scale, and the RLVR tooling is newer still.

The pieces are all available now. The RLVR framework from Prime Intellect’s verifiers library maps cleanly onto this — you just swap out the math verifier for an MPRA-trained reward model.
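I haven't pinned down the exact verifiers API yet, so the sketch below only shows the conceptual swap: both rewards are plain callables of the same shape, and everything else in the RL harness stays put. `MPRARewardModel`, its `score` method, and the `K562` cell type are hypothetical placeholders, not real library code.

```python
class MPRARewardModel:
    """Hypothetical wrapper around a model trained on MPRA measurements."""

    def __init__(self, target_cell_type):
        self.target = target_cell_type

    def score(self, sequence):
        # Placeholder: a real implementation would run the trained model.
        # Here, a deterministic toy score clipped to [0, 1].
        return min(1.0, 4.0 * sequence.count("GC") / max(len(sequence), 1))

def math_verifier_reward(completion, answer, **kwargs):
    # The kind of binary checker used in math RLVR: exact-match, 0 or 1.
    return 1.0 if completion.strip() == answer.strip() else 0.0

def mpra_reward(completion, **kwargs):
    # Drop-in replacement: same callable shape, but a continuous reward
    # grounded in experimental measurements instead of string equality.
    rm = MPRARewardModel(target_cell_type="K562")
    return rm.score(completion)
```

One real difference the sketch glosses over: the math checker is binary and exact, while the MPRA reward is continuous and noisy, which changes how much you can trust high scores at the tail.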

Prior Work I’m Building On

I’ve already shipped one application on top of Evo1, so I have a practical sense of where generative DNA models fail on functional sequence design. That experience is what made this problem obvious to me.

Compute Plan

Rough estimate: 100–150 H100 hours total.

Starting with Evo2-7B. If the reward signal is strong, scaling up to the larger 40B checkpoint is the obvious next step.

Next Steps

Will post updates as I make progress.