
RDMA, HPC, and Compute-Storage Separation: Why Everyone Is Talking About It

As a backend engineer working primarily in payments, distributed systems, and data infrastructure, I recently had a discussion with a friend about deploying stateful services such as MySQL on Kubernetes. During the conversation, a term kept appearing that sounded almost magical:

RDMA (Remote Direct Memory Access).

At first glance, RDMA seems like another networking optimization. After digging deeper, I realized it represents a fundamentally different way of thinking about communication between machines.

This article summarizes my learning journey.


What Is RDMA?

RDMA stands for Remote Direct Memory Access.

Traditionally, when one server sends data to another, the operating system and TCP/IP stack are heavily involved:

Application

Kernel

TCP/IP Stack

NIC
══════ Network ══════
NIC

TCP/IP Stack

Kernel

Application

With RDMA, a machine can directly read or write memory on another machine:

Application

RDMA NIC
══════ Network ══════
RDMA NIC

Remote Memory

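For one-sided operations (RDMA reads and writes), the remote CPU does not take part in the data transfer at all. It only has to register the memory region ahead of time.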

This dramatically reduces:

  • Context switching
  • Kernel overhead
  • Data copies
  • CPU utilization
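Here is a toy Python model of that idea. It is only a sketch: `ToyNic`, `register_memory`, `rdma_read`, and `rdma_write` are names I made up for illustration, not the real verbs API, and the "network" is just a function call. What it tries to show is that the reader needs only a key and an offset, and that no application code on the remote side runs to serve the request.

```python
class ToyNic:
    """Toy stand-in for an RDMA-capable NIC (hypothetical, not a real API)."""

    def __init__(self) -> None:
        self._regions = {}  # access key -> registered buffer

    def register_memory(self, buffer: bytearray) -> int:
        """Expose a buffer for remote access and return its access key."""
        key = len(self._regions) + 1
        self._regions[key] = buffer
        return key

    def serve_read(self, key: int, offset: int, length: int) -> bytes:
        # Handled entirely by the "NIC": no application callback is involved.
        return bytes(self._regions[key][offset:offset + length])

    def serve_write(self, key: int, offset: int, data: bytes) -> None:
        region = self._regions[key]
        if offset + len(data) > len(region):
            raise ValueError("write outside registered region")
        region[offset:offset + len(data)] = data


def rdma_read(remote_nic: ToyNic, key: int, offset: int, length: int) -> bytes:
    """One-sided read: the caller names remote memory, not a remote function."""
    return remote_nic.serve_read(key, offset, length)


def rdma_write(remote_nic: ToyNic, key: int, offset: int, data: bytes) -> None:
    """One-sided write into the remote registered buffer."""
    remote_nic.serve_write(key, offset, data)


# The remote side registers memory once, then goes back to its own work.
server_nic = ToyNic()
server_memory = bytearray(b"hello world")
key = server_nic.register_memory(server_memory)

# The client reads and writes the remote buffer directly.
print(rdma_read(server_nic, key, 0, 5))   # b'hello'
rdma_write(server_nic, key, 6, b"RDMA!")
print(server_memory)                      # bytearray(b'hello RDMA!')
```

In real hardware the registration step also pins the memory and hands the NIC its address translation, which this toy skips entirely.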

Why Is RDMA Fast?

The biggest benefit is not necessarily bandwidth.

The real advantages are:

  • Kernel bypass
  • Zero-copy communication
  • Hardware offloading to the network card

Instead of:

User Space
 ↔ Kernel Space
 ↔ TCP/IP Stack
 ↔ Interrupt
 ↔ Context Switch

RDMA allows:

NIC DMA Engine

Remote Memory

This reduces latency from tens of microseconds to only a few microseconds in modern datacenters.

Typical numbers:

Technology     Approximate Latency
Local RAM      ~100 ns
NVMe SSD       ~100 μs
TCP Network    20–100 μs
RDMA           1–5 μs

The exact numbers depend on hardware and workload, but the order of magnitude difference is real.
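To get a feel for what these figures mean, I find it useful to turn them into "how many dependent operations per second can one thread issue". The sketch below uses the approximate latencies quoted above. For TCP and RDMA I picked a single point inside each range (50 μs and 3 μs), which is my own assumption, not a measurement.

```python
# Approximate latencies from the figures above, in nanoseconds.
# The TCP and RDMA values are single points chosen from the quoted ranges.
LATENCY_NS = {
    "Local RAM": 100,
    "RDMA": 3_000,
    "TCP Network": 50_000,
    "NVMe SSD": 100_000,
}


def dependent_ops_per_second(latency_ns: float) -> float:
    """Operations per second when each one must wait for the previous one."""
    return 1e9 / latency_ns


for name, latency_ns in LATENCY_NS.items():
    rate = dependent_ops_per_second(latency_ns)
    print(f"{name:<12} {latency_ns / 1000:>7.1f} us  {rate:>12,.0f} ops/s")
```

With these inputs a single thread waiting on RDMA can issue roughly an order of magnitude more back-to-back operations than one waiting on TCP, which is the gap the table is pointing at.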


The Most Important Benefit: CPU Savings

Initially I thought RDMA was mainly about latency.

In reality, many production systems adopt RDMA because it dramatically reduces CPU consumption.

Instead of host CPUs spending cycles on:

Packet Processing
TCP/IP Handling
Interrupts
Buffer Copies

the RDMA-capable NIC performs much of this work.

For large distributed systems, reducing network-related CPU overhead can be more valuable than shaving a few microseconds off latency.
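A rough way to reason about this is to estimate how many cores are kept busy just moving bytes. The sketch below is plain arithmetic. The cycles-per-byte values are placeholders I invented to show the shape of the calculation, so replace them with numbers measured on your own hardware.

```python
def cores_for_network(throughput_gbps: float, cycles_per_byte: float, core_ghz: float) -> float:
    """Estimate how many CPU cores are fully busy handling network traffic."""
    bytes_per_second = throughput_gbps * 1e9 / 8
    cycles_per_second = bytes_per_second * cycles_per_byte
    return cycles_per_second / (core_ghz * 1e9)


# Placeholder inputs (assumptions, not benchmarks).
LINK_GBPS = 100
CORE_GHZ = 3.0
KERNEL_TCP_CYCLES_PER_BYTE = 1.0  # hypothetical cost of the kernel TCP path
OFFLOADED_CYCLES_PER_BYTE = 0.1   # hypothetical cost when the NIC does the work

for label, cost in [("kernel TCP", KERNEL_TCP_CYCLES_PER_BYTE),
                    ("NIC offload", OFFLOADED_CYCLES_PER_BYTE)]:
    cores = cores_for_network(LINK_GBPS, cost, CORE_GHZ)
    print(f"{label:<12} ~{cores:.2f} cores busy at {LINK_GBPS} Gbit/s")
```

Whatever the real per-byte costs turn out to be, the cores saved scale linearly with link speed, which is why the argument gets stronger as links get faster.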


Why RDMA Appears in Kubernetes Discussions

Kubernetes itself does not require RDMA.

The topic usually comes up when people build architectures that separate compute from storage.

Traditional deployment:

MySQL
+
Local SSD

Everything runs on the same machine.

Modern architecture:

MySQL Pod

Distributed Storage

Multiple Storage Nodes

Now storage access travels through the network.

Network latency suddenly becomes part of the database hot path.

This is where RDMA becomes attractive.

Storage systems and protocols that can run over RDMA include:

  • Ceph
  • NVMe over Fabrics (NVMe-oF)
  • BeeGFS
  • Lustre
  • Other distributed storage systems

The goal is simple:

Make remote storage feel as close as possible to local storage.
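A small latency budget makes that goal concrete. The sketch below assumes a remote read costs the media latency plus one network cost, and reuses the approximate figures from the latency table (about 100 μs for NVMe, with 50 μs for TCP and 3 μs for RDMA picked from the quoted ranges). It is a simplification: real storage stacks add queueing, replication, and software overhead on top.

```python
NVME_MEDIA_US = 100.0  # approximate NVMe SSD latency from the table

# Single points picked from the quoted ranges (assumption).
NETWORK_US = {"TCP": 50.0, "RDMA": 3.0}


def remote_read_us(media_us: float, network_us: float) -> float:
    """Latency of one remote read: storage media plus network cost."""
    return media_us + network_us


def network_overhead_pct(media_us: float, network_us: float) -> float:
    """How much slower the remote read is than a local one, in percent."""
    return 100.0 * network_us / media_us


for transport, network_us in NETWORK_US.items():
    total = remote_read_us(NVME_MEDIA_US, network_us)
    pct = network_overhead_pct(NVME_MEDIA_US, network_us)
    print(f"{transport:<5} remote read ~{total:.0f} us (+{pct:.0f}% over local NVMe)")
```

Under these assumptions the network adds about half again to every read over TCP, but only a few percent over RDMA, which is what "feels local" means in practice.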


HPC: The Original Home of RDMA

Another term that came up during the discussion was HPC.

HPC stands for:

High Performance Computing

Before cloud computing and AI became popular, HPC clusters were already solving problems such as:

  • Weather prediction
  • Aircraft simulation
  • Computational fluid dynamics
  • Genomics
  • Drug discovery
  • Scientific research

Unlike traditional distributed systems, where each node processes independent requests:

Node A -> Transaction A
Node B -> Transaction B
Node C -> Transaction C

HPC systems typically have thousands of nodes collaborating on a single computation:

1000 Nodes

One Massive Job

Communication becomes the bottleneck.

This is why HPC adopted RDMA decades ago.


AI Has Made RDMA Popular Again

Today, AI training clusters are effectively modern HPC systems.

Imagine:

8000 GPUs

training a large language model.

Every training step requires exchanging gradients across thousands of GPUs.

Without RDMA:

GPU

CPU

TCP

CPU

GPU

With RDMA:

GPU

NIC
══════ Network ══════
NIC

GPU

Technologies such as:

  • InfiniBand
  • RoCE
  • GPUDirect RDMA

exist specifically to optimize this communication.
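To see why per-message latency matters so much here, consider a deliberately naive cost model for a ring all-reduce, one common way to sum gradients across workers. Each of the N workers performs 2(N-1) steps and sends 1/N of the gradient buffer per step. The function name, the flat single ring, and all the input numbers below are my own assumptions for illustration; real collective libraries use more sophisticated topologies.

```python
def ring_allreduce_seconds(workers: int, gradient_bytes: float,
                           bandwidth_bytes_per_s: float, step_latency_s: float) -> float:
    """Simplified cost of a ring all-reduce: steps * (latency + transfer time)."""
    steps = 2 * (workers - 1)
    chunk_bytes = gradient_bytes / workers
    return steps * (step_latency_s + chunk_bytes / bandwidth_bytes_per_s)


# Hypothetical inputs: 1 GB of gradients, 50 GB/s links.
GRADIENT_BYTES = 1e9
BANDWIDTH = 50e9

for workers in (8, 8000):
    for label, latency_s in (("TCP-like, 50 us", 50e-6), ("RDMA-like, 3 us", 3e-6)):
        seconds = ring_allreduce_seconds(workers, GRADIENT_BYTES, BANDWIDTH, latency_s)
        print(f"{workers:>5} workers, {label:<16} ~{seconds * 1000:.1f} ms per all-reduce")
```

In this model the transfer time per step shrinks as workers are added while the latency term does not, so at large scale the per-step latency dominates the total.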

Many AI engineers are now rediscovering concepts that have existed in the HPC world for decades.


Common RDMA Technologies

InfiniBand

The traditional HPC solution.

Characteristics:

  • Lowest latency
  • Highest performance
  • Most expensive

Widely used in supercomputers and large AI clusters.


RoCE

RDMA over Converged Ethernet.

Characteristics:

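  • Runs on Ethernet (RoCE v2 is carried over UDP/IP and is routable), though it usually needs a network tuned for lossless or near-lossless delivery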
  • More enterprise-friendly
  • Increasingly common in datacenters

Many organizations choose RoCE because it combines RDMA capabilities with existing Ethernet infrastructure.


iWARP

RDMA over TCP.

Historically important, but less common today.


Does MySQL Automatically Benefit from RDMA?

Not necessarily.

Many MySQL workloads are limited by:

  • Poor indexing
  • SQL design
  • Lock contention
  • Storage engine behavior
  • Buffer pool efficiency

rather than network performance.

If MySQL is running on:

MySQL
+
Local NVMe

RDMA will probably provide little value.

RDMA becomes interesting when:

MySQL

Remote Storage

Distributed Storage Cluster

or when using distributed databases with heavy cross-node communication.


A Useful Mental Model

The most helpful way to think about RDMA is:

Traditional distributed systems:

Call API

Serialize

TCP

Deserialize

RDMA systems:

Read Remote Memory
Write Remote Memory

Instead of treating the network as message passing, RDMA starts treating the network as an extension of memory.
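The contrast can be sketched in a few lines of Python. Nothing here touches a network: the `bytearray` stands in for a registered remote buffer, and the record layout and function names are hypothetical. The point is only the difference in shape between "build a message, parse a message" and "read a fixed layout at a known offset".

```python
import json
import struct

# Message-passing style: every access builds and parses a whole message.
def encode_message(account_id: int, balance_cents: int) -> bytes:
    return json.dumps({"account_id": account_id, "balance_cents": balance_cents}).encode()


def decode_message(payload: bytes) -> dict:
    return json.loads(payload.decode())


# Memory style: both sides agree on a fixed binary layout up front.
RECORD = struct.Struct("<QQ")  # account_id, balance_cents as little-endian uint64


def write_record(memory: bytearray, slot: int, account_id: int, balance_cents: int) -> None:
    """Store a record at a slot; the offset is computed, not negotiated."""
    RECORD.pack_into(memory, slot * RECORD.size, account_id, balance_cents)


def read_record(memory: bytearray, slot: int) -> tuple:
    """Load a record straight from its offset, with no parsing step."""
    return RECORD.unpack_from(memory, slot * RECORD.size)


# Stand-in for a remote registered buffer with room for four records.
memory = bytearray(RECORD.size * 4)
write_record(memory, 2, account_id=42, balance_cents=1999)

print(decode_message(encode_message(42, 1999)))
print(read_record(memory, 2))
```

The price of the memory style is that the layout becomes a contract between machines, so changing it is much harder than adding a field to a JSON message.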

That mental shift explains why RDMA is so important in modern storage systems, AI clusters, and HPC environments.


My Takeaway

As someone coming from a payments and distributed backend background, I don’t think RDMA should be the first topic engineers learn.

The learning order that currently makes the most sense to me is:

  1. Database internals (InnoDB)
  2. Kafka internals
  3. Distributed consensus (Raft)
  4. Spark/Flink
  5. Distributed storage systems (Ceph)
  6. RDMA
  7. InfiniBand internals

Once you’ve experienced the pain of moving data across machines, RDMA stops looking like black magic and starts looking like a very practical engineering solution.

And perhaps that’s the most interesting realization:

RDMA is not about making the network faster. It’s about making remote resources feel local.