Most people think “blockchain” when they hear distributed ledger technology, but the core problem is far more general: how do thousands of mutually distrusting machines maintain an identical, tamper-evident history and current state without a central coordinator?
At the lowest level, every participating node stores a sequence of blocks containing ordered transactions plus the resulting state root — a Merkle root committing to the entire world state at that height. The innovation is not the append-only log itself; logs have existed forever. The trick is making sure every honest node ends up with the same log and the same deterministic state derived from it.
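The tamper-evidence of a state root can be sketched in a few lines. This is a deliberately simplified binary Merkle tree over key-value state (real chains use tries such as Ethereum's Merkle-Patricia trie or vector commitments); the point is only that any change to any value produces a different root.

```python
# Sketch: a "state root" as a Merkle root over key-value state.
# Simplified binary tree for illustration, not a production trie.
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def state_root(state: dict[str, int]) -> bytes:
    # Hash each (key, value) pair into a leaf; sort for determinism.
    leaves = [h(f"{k}={v}".encode()) for k, v in sorted(state.items())]
    while len(leaves) > 1:
        if len(leaves) % 2:                  # duplicate last leaf if odd
            leaves.append(leaves[-1])
        leaves = [h(leaves[i] + leaves[i + 1])
                  for i in range(0, len(leaves), 2)]
    return leaves[0]

balances = {"alice": 50, "bob": 30, "carol": 20}
root = state_root(balances)
tampered = state_root({**balances, "bob": 31})
assert root != tampered   # any single-value change yields a different root
```

Two honest nodes that executed the same transactions in the same order will compute byte-identical roots, which is what makes divergence detectable.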
Monolithic Replication: Everyone Runs Everything
In first- and second-generation ledgers (Bitcoin, Ethereum pre-Danksharding, Solana’s current design), replication is brutally simple: full nodes download and execute every transaction from genesis. This gives extremely strong security guarantees — any node can independently verify the entire history — but it scales poorly.
A Bitcoin full node today stores ~650 GB of blocks and ~6 GB of UTXO set. An Ethereum archive node exceeds 14 TB because it keeps historical states. The replication factor is effectively equal to the number of full nodes (tens of thousands for Bitcoin, several thousand for Ethereum mainnet). Economic incentives (mining rewards, staking yields) keep enough independent operators running this heavy software.
The Move to Modular Replication
Starting in 2021–2022, the industry largely accepted that monolithic replication cannot support global-scale throughput. Projects like Celestia, Ethereum's Danksharding roadmap, Avail (incubated at Polygon), and NEAR DA all converge on the same insight: consensus and data availability can be separated from execution.
In a modular stack:
– The consensus + data availability (DA) layer only orders transactions and guarantees that block data remains retrievable for a bounded retention window (typically a few weeks).
– Execution environments (EVM rollups, SVM rollups, custom VMs) process transactions off-chain and post compact fraud or validity proofs plus state diffs to the DA layer.
– Full nodes on the DA layer still download everything, but there are far fewer of them.
– Rollup full nodes only download data relevant to their own chain.
This reduces the replication burden by orders of magnitude for most participants while preserving cryptographic guarantees.
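The division of labor above can be made concrete with a toy model. Everything here is illustrative (the class and method names are invented, not any real client's API): the DA layer orders opaque blobs by namespace and never interprets them, while a rollup node replays only its own namespace to derive state.

```python
# Toy model of a modular stack: the DA layer orders and stores blobs
# by namespace; execution happens entirely inside the rollup node.
# All names are illustrative, not a real client's API.
import hashlib

class DALayer:
    """Orders blobs; never interprets their contents."""
    def __init__(self):
        self.blocks: list[list[tuple[bytes, bytes]]] = []

    def publish(self, namespace: bytes, blob: bytes) -> None:
        if not self.blocks:
            self.blocks.append([])
        self.blocks[-1].append((namespace, blob))

    def blobs_for(self, namespace: bytes):
        for block in self.blocks:
            for ns, blob in block:
                if ns == namespace:
                    yield blob

class RollupNode:
    """Replays only its own namespace's blobs to derive state."""
    def __init__(self, da: DALayer, namespace: bytes):
        self.da, self.ns = da, namespace
        self.state: dict[str, int] = {}

    def sync(self) -> bytes:
        for blob in self.da.blobs_for(self.ns):
            key, delta = blob.decode().split(":")
            self.state[key] = self.state.get(key, 0) + int(delta)
        # Deterministic commitment to the derived state.
        return hashlib.sha256(repr(sorted(self.state.items())).encode()).digest()

da = DALayer()
da.publish(b"rollup-A", b"alice:50")
da.publish(b"rollup-B", b"bob:99")     # ignored by rollup A's node
da.publish(b"rollup-A", b"alice:-20")
node = RollupNode(da, b"rollup-A")
node.sync()
assert node.state == {"alice": 30}
```

Note that the DA layer never executes anything: two rollup-A nodes that replay the same ordered blobs reach the same state, which is exactly the property the modular split preserves.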
Data Availability Sampling (DAS)
The breakthrough that makes modular DA practical is data availability sampling. Proposed in work by Mustafa Al-Bassam, Alberto Sonnino, and Vitalik Buterin, and building on Al-Bassam's LazyLedger design (which became Celestia), DAS lets light clients verify data availability without downloading the full block.
Here’s how it works in practice:
- The block producer erasure-codes the block using Reed-Solomon codes, extending the data into additional chunks (2× per dimension, 4× total in the 2D scheme).
- Nodes broadcast these chunks over a P2P gossip network.
- Light clients repeatedly request small random chunks from many peers.
- Reconstruction is possible whenever fewer than ~25% of the extended chunks are missing, so an adversary who wants to hide data must withhold at least a quarter of all chunks. Each uniform random sample then has at least a one-in-four chance of hitting withheld data, and a few dozen samples make undetected withholding statistically negligible.
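The statistics behind sampling are simple enough to compute directly. Under 2D coding at 4× extension, the adversary's best case is withholding exactly the ~25% minimum, so the probability that s independent uniform samples all miss is at most 0.75^s:

```python
import math

def detection_probability(samples: int, withheld_fraction: float = 0.25) -> float:
    """P(at least one sample hits withheld data) with uniform sampling.

    With 2D Reed-Solomon at 4x extension, making a block unrecoverable
    requires withholding at least ~25% of chunks, so 0.25 is the
    adversary's best case."""
    return 1 - (1 - withheld_fraction) ** samples

def samples_for_confidence(confidence: float, withheld_fraction: float = 0.25) -> int:
    """Smallest sample count giving at least `confidence` detection probability."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - withheld_fraction))

print(round(detection_probability(16), 3))   # ~0.99 with just 16 samples
print(samples_for_confidence(0.9999))        # 33 samples for 99.99% confidence
```

This is why light-client bandwidth stays tiny: confidence grows exponentially in the number of samples, independent of block size.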
Celestia mainnet (launched October 2023) already supports blocks in the multi-MiB range using 2D Reed-Solomon encoding and DAS, with governance able to raise the limit further. Light clients need only modest sustained bandwidth (on the order of 1–2 Mbps) to keep up, which makes them smartphone-friendly.
State Synchronization Mechanics
Even with data availability solved, nodes still need to reach identical state. Three distinct synchronization strategies dominate today:
**Fast sync / snapshot sync**
Most Ethereum and EVM-compatible nodes no longer execute from genesis. They download a recent trusted state root (via manual checkpoint or honest-majority assumption), then fetch Merkle proofs for accounts and storage slots they care about. This drops sync time from weeks to hours.
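The proof-fetching step can be sketched with a simplified binary Merkle tree (Ethereum actually uses a hexary Merkle-Patricia trie, but the verification logic is analogous): given a trusted state root, a node verifies an individual account with a logarithmic-size path of sibling hashes instead of the whole state.

```python
# Sketch: verifying one account against a trusted snapshot state root.
# Simplified binary Merkle tree; Ethereum uses a Merkle-Patricia trie.
import hashlib

def h(b: bytes) -> bytes:
    return hashlib.sha256(b).digest()

def build_levels(leaves: list[bytes]) -> list[list[bytes]]:
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels

def prove(levels, index):
    """Collect sibling hashes from leaf to root."""
    proof = []
    for level in levels[:-1]:
        padded = level + [level[-1]] if len(level) % 2 else level
        proof.append(padded[index ^ 1])
        index //= 2
    return proof

def verify(leaf, index, proof, root):
    node = leaf
    for sib in proof:
        node = h(node + sib) if index % 2 == 0 else h(sib + node)
        index //= 2
    return node == root

accounts = [b"alice:50", b"bob:30", b"carol:20", b"dave:1"]
leaves = [h(a) for a in accounts]
levels = build_levels(leaves)
root = levels[-1][0]                     # the trusted snapshot state root
proof = prove(levels, 1)                 # prove bob's balance
assert verify(leaves[1], 1, proof, root)
assert not verify(h(b"bob:31"), 1, proof, root)
```

The proof is O(log n) hashes regardless of total state size, which is what lets a freshly synced node serve queries without replaying history.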
**Parallel block propagation**
Solana’s Turbine splits blocks into shreds and fans them out tree-style across the validator set, so no single node has to upload the full block to everyone. A multi-megabyte block can reach a few thousand validators in a few hundred milliseconds.
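The latency win comes from tree depth: with fanout f, reaching N nodes takes only ceil(log_f N) hops. A rough model (the fanout and per-hop latency below are illustrative assumptions, not Solana's actual parameters):

```python
import math

def propagation_hops(validators: int, fanout: int) -> int:
    """Depth of a fanout tree needed to reach every validator."""
    return math.ceil(math.log(validators, fanout))

def estimated_latency_ms(validators: int, fanout: int, per_hop_ms: int = 100) -> int:
    """Naive latency model: hops * per-hop network delay (assumed)."""
    return propagation_hops(validators, fanout) * per_hop_ms

# Illustrative figures: ~2,000 validators, a Turbine-like fanout of 200.
print(propagation_hops(2000, 200))       # 2 hops
print(estimated_latency_ms(2000, 200))   # ~200 ms under the 100 ms/hop assumption
```

Note the logarithmic scaling: even a thousand-fold larger validator set adds only one or two hops.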
**State expiry and re-execution**
Optimistic rollups like Arbitrum and Optimism only need their data to remain available through the fraud-proof challenge window (typically around seven days). After it closes, old data can be pruned. If someone needs historical proofs later, they re-execute from archived data, trading storage for occasional recomputation.
Real-World Example 1 — Ethereum Blobs and the Danksharding Roadmap
Proto-Danksharding (EIP-4844, live since the Dencun upgrade in March 2024) introduced “blob-carrying transactions” as the first step toward full Danksharding. Instead of writing rollup data into permanent state, transactions attach 128 KB blobs (initially up to six per block) that consensus nodes must keep available for roughly 18 days (4096 epochs). Validators attest to having seen the blobs, and KZG commitments let anyone verify inclusion.
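The ~18-day figure is not arbitrary; it follows directly from the consensus-spec constants (4096 epochs of 32 twelve-second slots):

```python
# Back-of-envelope: blob retention window from beacon-chain constants.
EPOCHS = 4096            # MIN_EPOCHS_FOR_BLOB_SIDECARS_REQUESTS
SLOTS_PER_EPOCH = 32
SECONDS_PER_SLOT = 12

retention_days = EPOCHS * SLOTS_PER_EPOCH * SECONDS_PER_SLOT / 86_400
print(f"{retention_days:.1f} days")   # → 18.2 days
```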
After the window, blobs are pruned, but their KZG commitments remain on-chain forever, so anyone who archived a blob can still prove what it contained. This is the first step toward unbundling Ethereum’s monolithic replication model.
Real-World Example 2 — Celestia + Rollups
Celestia went live in 2023 and has raised its block-size limit several times since. Rollups like Dymension and Saga publish their transaction data, state roots, and fraud proofs to Celestia. A Dymension rollup full node downloads only its own chain’s data (on the order of hundreds of megabytes) instead of Celestia’s entire multi-terabyte history. Light clients verify everything via DAS and Namespaced Merkle Trees (NMTs), which let them prove a rollup’s data was included under its specific namespace.
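The core NMT idea is small: every node in the tree is tagged with the minimum and maximum namespace of its subtree, and those tags are hashed into the parent. A simplified sketch (this is an illustration of the data structure, not Celestia's exact hash layout, and it assumes a power-of-two number of leaves sorted by namespace):

```python
# Sketch of a Namespaced Merkle Tree: each node carries (min_ns, max_ns)
# for its subtree, so the root commits to where each namespace lives.
import hashlib

def nmt_hash(min_ns: bytes, max_ns: bytes, payload: bytes) -> bytes:
    return hashlib.sha256(min_ns + max_ns + payload).digest()

class Node:
    def __init__(self, min_ns: bytes, max_ns: bytes, digest: bytes):
        self.min_ns, self.max_ns, self.digest = min_ns, max_ns, digest

def leaf(ns: bytes, data: bytes) -> Node:
    return Node(ns, ns, nmt_hash(ns, ns, data))

def parent(l: Node, r: Node) -> Node:
    assert l.max_ns <= r.min_ns, "leaves must be sorted by namespace"
    return Node(l.min_ns, r.max_ns,
                nmt_hash(l.min_ns, r.max_ns,
                         l.min_ns + l.max_ns + l.digest +
                         r.min_ns + r.max_ns + r.digest))

def build(nodes: list[Node]) -> Node:
    assert len(nodes) & (len(nodes) - 1) == 0, "sketch assumes power-of-two leaves"
    while len(nodes) > 1:
        nodes = [parent(nodes[i], nodes[i + 1]) for i in range(0, len(nodes), 2)]
    return nodes[0]

# One DA block carrying data from several rollups, sorted by namespace.
leaves = [leaf(b"A", b"rollup-A tx batch 1"),
          leaf(b"A", b"rollup-A tx batch 2"),
          leaf(b"B", b"rollup-B tx batch 1"),
          leaf(b"C", b"rollup-C tx batch 1")]
root = build(list(leaves))
print(root.min_ns, root.max_ns)   # the root commits to the namespace range A..C
tampered = build([leaf(b"A", b"forged"), *leaves[1:]])
assert tampered.digest != root.digest
```

Because every internal node commits to its subtree's namespace range, ordinary Merkle paths double as completeness proofs: a light client can tell not only that rollup-A data was included, but that none of it was silently omitted.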
Cross-Chain State Replication
When assets or messages move between ledgers, some subset of state must be replicated across chains. Three dominant patterns have emerged:
– Lock-and-mint bridges and messaging layers (e.g., Wormhole, LayerZero): an external verifier set observes the source chain and attests to lock or message events before assets are minted or messages delivered on the destination.
– Light-client relayers (IBC, NEAR’s Rainbow Bridge): full nodes of chain A run a light client of chain B inside their VM and vice versa, verifying headers and Merkle proofs natively; Polygon’s AggLayer targets a similar trust profile by aggregating validity proofs instead.
– Shared sequencing + asynchronous state roots (Espresso, Astria): multiple rollups agree on ordering through a common service, reducing the need for traditional bridges.
Each approach replicates only the minimal necessary state — typically Merkle paths from a header to a specific event — rather than entire chains.
Trade-offs No One Talks About Publicly
Full archival replication gives the strongest censorship resistance and decentralization, but almost no retail user runs an archive node anymore. Most “decentralization” metrics now count staked validators or light clients, not independent state replicators.
Data availability layers reduce hardware requirements dramatically, yet introduce a new attack surface: data withholding. If block data is withheld, fraud proofs cannot be constructed and honest light clients must reject the block, stalling the chain. That’s why DAS confidence thresholds are set extremely high (often 99.99% statistical certainty).
Closing Thought
Distributed ledger technology has evolved from “everyone stores and executes everything” to a spectrum of replication strategies tailored to security budgets and use cases. Monolithic designs still offer the simplest mental model and strongest guarantees. Modular separation with data availability sampling scales to global throughput without sacrificing verifiable correctness.
The next five years will likely settle on a handful of high-capacity DA layers (Celestia, Ethereum blobs, Avail, EigenDA) underneath hundreds or thousands of specialized execution environments. State will be replicated selectively, synchronized probabilistically, and verified cryptographically — a far more nuanced picture than the original “immutable shared database” narrative.
That nuance is what actually makes the system work at scale.