I have to admit that I find the existence of fountain codes genuinely shocking. The idea that you can take an arbitrary file, turn it into a stream of fungible blobs, receive those blobs in literally any order, and have each new one help you reconstruct the original already seems pretty impressive. But then you learn that the total overhead for all of this is under 5%, and that the receiver often needs just two extra symbols beyond the bare minimum to decode with near-certainty. That seems both magical and frankly improbable.
To see why this matters, think about how we normally move data around. TCP is fundamentally a conversation: "I sent packet 4." "I didn’t get packet 4." "Okay, resending packet 4." "Got it." That works fine for loading a webpage, but it falls apart when latency is high (try a 40-minute round trip to Mars) or when you’re broadcasting to a million receivers at once over lossy cellular. TCP requires a feedback loop. The sender has to know exactly what the receiver is missing. Scale that to a million receivers, each losing different packets, all sending retransmission requests at once. That’s a feedback implosion. The sender drowns.
RaptorQ does something completely different. You turn your file into a mathematical liquid and just spray packets at the receiver. The receiver is basically just a bucket. It doesn’t matter which drops land in it, and it doesn’t matter if half the spray blows away in the wind. As soon as the bucket has roughly K drops (not any particular drops, just enough of them), the receiver reconstructs the original data.
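The bucket idea can be sketched in a few dozen lines. What follows is a toy fountain over GF(2), not RaptorQ's actual construction: each "drop" is the XOR of a random subset of the K source blocks, and the receiver decodes by Gaussian elimination once any K linearly independent drops have landed, in any order. (Real LT/Raptor codes use a carefully tuned degree distribution instead of dense random masks, precisely so they can avoid full Gaussian elimination; the function names here are my own.)

```python
import random

def xor_into(dst, src):
    for j in range(len(dst)):
        dst[j] ^= src[j]

def encode_symbol(blocks, rng):
    """One 'drop': (mask, payload) where payload = XOR of the masked blocks."""
    k = len(blocks)
    mask = 0
    while mask == 0:                       # dense random mask; real LT codes
        mask = rng.getrandbits(k)          # draw from a degree distribution
    out = bytearray(len(blocks[0]))
    for i in range(k):
        if (mask >> i) & 1:
            xor_into(out, blocks[i])
    return mask, out

def decode(symbols, k):
    """Gauss-Jordan over GF(2); returns the k source blocks, or None if the
    received symbols do not yet span the whole space."""
    pivot = [None] * k                     # pivot[i]: row whose lowest bit is i
    for mask, payload in symbols:
        payload = bytearray(payload)       # work on a copy
        for i in range(k):                 # eliminate against known pivots
            if (mask >> i) & 1 and pivot[i] is not None:
                pmask, ppay = pivot[i]
                mask ^= pmask
                xor_into(payload, ppay)
        if mask:                           # independent row: keep it
            pivot[(mask & -mask).bit_length() - 1] = (mask, payload)
    if any(p is None for p in pivot):
        return None                        # the bucket isn't full yet
    for i in reversed(range(k)):           # back-substitute to isolate blocks
        mask, payload = pivot[i]
        for j in range(i + 1, k):
            if (mask >> j) & 1:
                jmask, jpay = pivot[j]
                mask ^= jmask
                xor_into(payload, jpay)
        pivot[i] = (mask, payload)
    return [bytes(p) for _, p in pivot]

# Spray drops until the bucket fills; which drops arrive is irrelevant.
rng = random.Random(6330)
data = (b"turn a file into a mathematical liquid and spray " * 2)[:64]
K = 8
blocks = [data[i * 8:(i + 1) * 8] for i in range(K)]
received, out = [], None
while out is None:
    received.append(encode_symbol(blocks, rng))
    out = decode(received, K)
assert b"".join(out) == data
print(f"decoded from {len(received)} symbols (K = {K})")
```

Note how the decoder never asks for anything specific: it just keeps collecting drops until the rank reaches K.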
How Good Is It, Really?
This is all codified in RaptorQ (RFC 6330). The RFC actually has a SHALL-level decoder requirement: if you receive encoding symbols whose IDs are chosen uniformly at random, the average decode failure rate must be at most 1 in 100 when receiving K symbols, 1 in 10,000 at K + 1, and 1 in 1,000,000 at K + 2. The receiver almost never needs more than K + 2 symbols to decode perfectly.
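Those three data points fit a simple geometric pattern: receiving K + o symbols fails with probability about 10^(-2(o+1)). Treating that pattern as a model (an extrapolation on my part; the RFC only pins down o = 0, 1, 2), the expected reception overhead works out to roughly a hundredth of a symbol:

```python
# Model (extrapolated from RFC 6330's three data points, not a guarantee):
# receiving K + o random symbols fails with probability 10^(-2*(o+1)).
def p_fail(o):
    return 10.0 ** (-2 * (o + 1))

for o in range(3):
    print(f"K + {o} symbols: failure rate {p_fail(o):.0e}")

# E[extra symbols] = sum over o >= 0 of P(still undecoded with o extras),
# a geometric series: 1/100 + 1/10_000 + ... = 1/99.
expected_extra = sum(p_fail(o) for o in range(50))
print(f"expected extra symbols beyond K: {expected_extra:.4f}")  # ~0.0101
```

In other words, under this model the average receiver needs about 0.01 symbols beyond K, which is why "+2" already buys near-certainty.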
But "+2 symbols" is only the reception overhead—the extra packets the receiver must collect. The full picture includes the precode’s internal expansion from source symbols to intermediate symbols. That ~2.5% structural redundancy is what makes the "+2 symbols" trick possible. Combined with the LT layer, total system overhead is under 5%—still remarkably small.
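To make that accounting concrete, here is the split for a hypothetical K = 1000, using the article's own ballpark figures (the exact precode expansion in RFC 6330 varies with K):

```python
K = 1000                             # source symbols (hypothetical size)
precode_overhead = 0.025             # ~2.5% structural expansion (ballpark)
intermediate = round(K * (1 + precode_overhead))  # symbols the encoder derives
received = K + 2                     # what the receiver typically collects
print(f"{intermediate} intermediate symbols at the encoder")
print(f"{received} received symbols -> "
      f"reception overhead {(received - K) / K:.1%}")
```

The structural cost is paid once at the encoder; what travels over the wire only needs to exceed K by a couple of symbols.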