Google DeepMind Releases DiffusionGemma Text Model
The new open model generates up to 256 tokens in parallel and runs at over 1,000 tokens per second on an H100.
Ars Technica: Google DeepMind released DiffusionGemma, a new member of the Gemma 4 open model family. The model produces an entire block of text in parallel rather than generating output one token at a time. 8 billion activated during inference.
It should fit in 18GB of RAM. The model generates up to 256 tokens in parallel. In testing, DiffusionGemma produces around 700 tokens per second on an RTX 5090 and over 1,000 tokens per second on a single Nvidia H100.
That output is about four times that of similarly sized autoregressive Gemma models. Google tuned DiffusionGemma to solve Sudoku puzzles. The model was optimized with Nvidia for quantized RTX GPUs, the H100, and the DGX Spark platform.
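The throughput figures quoted above can be turned into rough per-block latencies. The sketch below is back-of-the-envelope arithmetic only: it assumes the "about four times" comparison refers to the H100 figure, treats "over 1,000" and "around 700" as exact values, and assumes steady throughput. The constant and function names are illustrative and not taken from any Google tooling.

```python
# Figures quoted in the article; all are approximate ("over", "around", "about").
H100_TOKENS_PER_SEC = 1000.0
RTX_5090_TOKENS_PER_SEC = 700.0
SPEEDUP_VS_AUTOREGRESSIVE = 4.0
BLOCK_TOKENS = 256  # maximum tokens generated in parallel


def seconds_for(tokens: int, tokens_per_sec: float) -> float:
    """Time to emit `tokens` at a steady throughput."""
    return tokens / tokens_per_sec


# Assumption: the 4x comparison is against the H100 number.
implied_autoregressive_tps = H100_TOKENS_PER_SEC / SPEEDUP_VS_AUTOREGRESSIVE

print(f"implied autoregressive throughput: {implied_autoregressive_tps:.0f} tokens/s")
print(f"256 tokens on H100:      {seconds_for(BLOCK_TOKENS, H100_TOKENS_PER_SEC):.3f} s")
print(f"256 tokens on RTX 5090:  {seconds_for(BLOCK_TOKENS, RTX_5090_TOKENS_PER_SEC):.3f} s")
print(f"256 tokens, autoregressive baseline: "
      f"{seconds_for(BLOCK_TOKENS, implied_autoregressive_tps):.3f} s")
```

Under these assumptions, a full 256-token block takes roughly a quarter of a second on the H100, against about one second for the implied autoregressive baseline.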
0 license. Ars Technica reported that Google stresses DiffusionGemma is experimental. Google has experimented with diffusion for text in cloud-based Gemini models.
Text diffusion has a higher error rate than autoregressive generation. Diffusion models also waste resources when the desired output is only a few tokens long. Google recently began implementing Multi-Token Prediction (MTP) drafters.
Diffusion is even faster than the MTP versions of Gemma.
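The point about short outputs wasting resources can be illustrated with a toy cost model. The sketch below assumes, purely for illustration, that every block of up to 256 positions costs the same to generate no matter how many of those positions the answer actually needs. That assumption is not from the article and may not match how DiffusionGemma schedules work; `block_utilization` is a hypothetical helper.

```python
import math

BLOCK_TOKENS = 256  # parallel block size quoted in the article


def block_utilization(wanted_tokens: int, block_tokens: int = BLOCK_TOKENS) -> float:
    """Fraction of computed positions that end up in the answer.

    Assumption (not from the article): each block has a fixed cost,
    regardless of how many of its positions are actually used.
    """
    if wanted_tokens <= 0:
        raise ValueError("wanted_tokens must be positive")
    blocks = math.ceil(wanted_tokens / block_tokens)
    return wanted_tokens / (blocks * block_tokens)


for n in (8, 64, 256, 300):
    print(f"{n:>3} tokens wanted -> {block_utilization(n):.1%} of computed positions used")
```

In this model, an 8-token reply uses only about 3% of a block, while a 256-token reply uses all of it, which is one way to read the claim that diffusion is a poor fit for very short outputs.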


