ClickHouse on EBS: the storage trap nobody draws on the architecture diagram

Someone runs DROP PARTITION. Three months of cold data, gone. The directory shrinks in real time. system.parts confirms it. That part is fine. It’s everything underneath that’s the problem.

ClickHouse is, by database standards, the well-behaved one here. Postgres, MongoDB, MySQL – they hold onto deleted space inside the engine. You delete rows, the engine reuses that space internally for new writes, and the file on disk doesn’t shrink. Giving space back means rebuilding the table.

ClickHouse doesn’t do that. Parts are marked inactive, hang around for old_parts_lifetime (480 seconds by default) so a crash doesn’t lose data, then they’re unlinked. Filesystem releases the blocks. Done.
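To watch that window from the inside, here is a minimal sketch. The first query lists inactive parts that are still on disk; the second shows the grace period they are waiting out. Both read standard system tables and assume nothing about your schema.

```sql
-- Inactive parts still on disk, waiting out old_parts_lifetime.
SELECT
  database,
  table,
  count() AS inactive_parts,
  formatReadableSize(sum(bytes_on_disk)) AS inactive_size,
  min(remove_time) AS oldest_inactive_since  -- when the oldest one became inactive
FROM system.parts
WHERE NOT active
GROUP BY database, table
ORDER BY sum(bytes_on_disk) DESC;

-- The grace period itself, in seconds (changed = 1 means it was overridden).
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name = 'old_parts_lifetime';
```

If the first query returns a lot of bytes long after a drop, something is holding those parts. That is worth knowing before you blame EBS.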

What doesn’t release is the EBS volume underneath. AWS is explicit: you can grow a volume, but you can’t shrink it. The only path to a smaller volume is creating a new one and migrating data across.

So you have a database that returns space cleanly, sitting on a storage layer that doesn’t. Logical size shrinks. Provisioned size doesn’t. The gap widens.

This is true of any database on EBS. The thing that makes ClickHouse different is what comes next.

ClickHouse needs free space to free space

Most databases need some headroom. ClickHouse needs a lot, and it needs it in a specific way.

Background merges aren’t optional. They’re the mechanism that turns a stream of small parts into a manageable number of large ones. A merge reads the source parts, writes a new combined part, then unlinks the sources. Both versions exist on disk during the merge. Default max_bytes_to_merge_at_max_space_in_pool is 150 GiB – a single merge can reserve 150 GiB of free space just to start. If it’s not there, the merge doesn’t start.

There’s a fallback. When free space drops, ClickHouse switches to max_bytes_to_merge_at_min_space_in_pool – default 1 MiB. Big merges stop. Small ones keep going. Part count grows. Query latency degrades.
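A quick way to see where a node sits between those two limits, as a sketch: read the two merge settings, then read free space per disk as ClickHouse itself sees it. The used_pct column is a convenience added here, not something ClickHouse reports.

```sql
-- The two merge-size limits discussed above (values are in bytes).
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name IN (
  'max_bytes_to_merge_at_max_space_in_pool',
  'max_bytes_to_merge_at_min_space_in_pool'
);

-- Free space per disk. unreserved_space is free space minus what running
-- merges and inserts have already reserved, so it is the number merges compete for.
SELECT
  name,
  path,
  formatReadableSize(total_space) AS total,
  formatReadableSize(free_space) AS free,
  formatReadableSize(unreserved_space) AS unreserved,
  round(100 * (1 - free_space / total_space), 1) AS used_pct
FROM system.disks;
```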

Then this:

Code: 243. DB::Exception: Not enough space for merging parts.

The volume is at 95%. Big merges have been stalled for a while – you may not have noticed, because small ones are still running and the metric most people watch is the per-partition part count (MaxPartCountForPartition in system.asynchronous_metrics), not the merge size distribution. Someone runs DROP PARTITION to relieve pressure. Parts unlink. df shows free space. But part count is stuck high – small unmerged parts accumulated while big merges were stalled. Inserts hit parts_to_delay_insert, then parts_to_throw_insert. Queries slow down because they’re reading from too many parts.
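To see how close a table is to those insert limits, a sketch along these lines compares active part counts per partition with the two thresholds. The LIMIT of 20 is arbitrary.

```sql
-- Partitions with the most active parts; these hit the insert limits first.
SELECT
  database,
  table,
  partition,
  count() AS active_parts,
  formatReadableSize(sum(bytes_on_disk)) AS partition_size
FROM system.parts
WHERE active
GROUP BY database, table, partition
ORDER BY active_parts DESC
LIMIT 20;

-- The thresholds those counts are measured against.
SELECT name, value, changed
FROM system.merge_tree_settings
WHERE name IN ('parts_to_delay_insert', 'parts_to_throw_insert');
```

These are server-wide defaults. A table that overrides them in its own SETTINGS clause will behave differently from what the second query suggests.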

Natural reaction is OPTIMIZE TABLE ... FINAL. Force the consolidation. Except OPTIMIZE FINAL asks for the biggest merge of all: every part in the partition rewritten into one, which needs free space on the order of their combined size. That free space isn’t there, because the volume is still 95% full.
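Before forcing anything, it is worth a rough check of whether the merge could fit at all. The sketch below compares each partition's active size with the unreserved space on its disk. This is a heuristic of ours, not ClickHouse's actual reservation logic, which may ask for more than the raw size; treat a "fits" as a maybe and a "does not fit" as a no.

```sql
-- Rough pre-flight for OPTIMIZE ... FINAL: does the partition fit in free space?
SELECT
  p.database,
  p.table,
  p.partition,
  p.disk_name,
  count() AS active_parts,
  formatReadableSize(sum(p.bytes_on_disk)) AS partition_size,
  formatReadableSize(any(d.unreserved_space)) AS disk_unreserved,
  sum(p.bytes_on_disk) < any(d.unreserved_space) AS fits_by_rough_estimate
FROM system.parts AS p
INNER JOIN system.disks AS d ON p.disk_name = d.name
WHERE p.active
GROUP BY p.database, p.table, p.partition, p.disk_name
HAVING active_parts > 1          -- a single part has nothing to merge with
ORDER BY sum(p.bytes_on_disk) DESC
LIMIT 20;
```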

The database is technically healthy. The disk is technically not full. The operation that would actually fix it can’t run.

The escape hatches exist. Grow the volume via Elastic Volumes – multi-hour operation on a multi-TB volume, capped at four modifications per volume per 24-hour rolling window since AWS lifted the cooldown in January. Drop more partitions. Lower max_bytes_to_merge_at_max_space_in_pool and accept smaller merges. Move data to another disk with MOVE PARTITION ... TO DISK. All of these work. All of them are things you’re doing under pressure, with degraded performance, while Slack fills up.
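The ClickHouse-side hatches look roughly like this. Everything named here is hypothetical: the table db.events, the partition IDs, the disk 'cold', and the 50 GiB figure. The disk must already belong to the table's storage policy for the move to work.

```sql
-- 1. Cap merge size for one table so merges fit the space you actually have.
--    53687091200 bytes = 50 GiB (illustrative value, not a recommendation).
ALTER TABLE db.events
  MODIFY SETTING max_bytes_to_merge_at_max_space_in_pool = 53687091200;

-- 2. Move a cold partition off the hot volume to another disk in the policy.
ALTER TABLE db.events MOVE PARTITION '202403' TO DISK 'cold';

-- 3. Drop a partition you no longer need. Space comes back after old_parts_lifetime.
ALTER TABLE db.events DROP PARTITION '202312';
```

Remember to revert the setting afterwards (ALTER TABLE ... RESET SETTING), or the table keeps merging small forever.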

This is the trap. Three independent constraints – ClickHouse needs working space, EBS only grows, merges stall before they fail – interacting at the wrong moment. No single component is misbehaving, which is why the architecture diagram doesn’t show it.

Why this is structurally different from the Postgres/Mongo version

The shrink problem in Postgres or MongoDB is mostly economic. Storage is wasted, the bill is too high. The system itself is fine.

In ClickHouse, the shrink problem is also a capacity problem. Free space is volatile – it expands and contracts as merges run, TTLs fire, partitions drop. Provision conservatively to absorb the volatility, you’re paying for headroom you rarely use. Provision tightly, you risk the deadlock above.

The question isn’t “how do we shrink the volume to save money.” It’s “how do we make our storage layer match the way ClickHouse actually uses disk.” Different question, different answer.

What teams actually do

Provision for the worst case. The default. Pick a volume that handles the largest backfill you can imagine, plus 30-40% for merges. Walk away. Works fine until you have 50TB+ per node across a fleet.

Tier to S3. ClickHouse storage policies move older parts to S3 as they age. Caps how much hot data sits on EBS. Forward-looking only – does nothing about the volume you have right now.
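A sketch of what that looks like from SQL, assuming a storage policy with an S3-backed volume already exists in the server config. The volume name 'cold', the table db.events, the column event_date, and the 90-day cutoff are all hypothetical.

```sql
-- Which policies, volumes, and disks this server knows about.
SELECT policy_name, volume_name, disks
FROM system.storage_policies;

-- Age parts out to the cold volume after 90 days.
ALTER TABLE db.events
  MODIFY TTL event_date + INTERVAL 90 DAY TO VOLUME 'cold';

-- Check where the table's active data lives now.
SELECT
  disk_name,
  count() AS parts,
  formatReadableSize(sum(bytes_on_disk)) AS size
FROM system.parts
WHERE active AND database = 'db' AND table = 'events'
GROUP BY disk_name;
```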

Migrate to a smaller volume. The supported AWS path. FREEZE partitions, populate a fresh smaller volume via replicated ATTACH, bring the new node into the cluster, retire the old one. (FREEZE-and-restore-from-snapshot is the other path, but EBS won’t let you provision a snapshot-restored volume smaller than the source, so it gets fiddly.) Multi-day project per cluster. Most teams do it once, bundled into something else they were doing anyway. They don’t do it twice.
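The SQL half of that migration, as a sketch for a single partition. The table db.events, partition '202401', and backup name are hypothetical. The file copy between nodes happens outside ClickHouse and is only described in a comment here.

```sql
-- On the old node: hard-link the partition's parts under shadow/migrate_202401/.
ALTER TABLE db.events FREEZE PARTITION '202401' WITH NAME 'migrate_202401';

-- Outside SQL: copy the frozen part directories into the table's detached/
-- directory on the new node's smaller volume.

-- On the new node: attach what was copied into detached/.
ALTER TABLE db.events ATTACH PARTITION '202401';

-- Sanity check before retiring anything: part and row counts should match the source.
SELECT partition, count() AS parts, sum(rows) AS total_rows
FROM system.parts
WHERE active AND database = 'db' AND table = 'events' AND partition = '202401'
GROUP BY partition;
```

FREEZE uses hard links, so the snapshot costs almost no extra space at first. It starts costing real space once merges replace the originals, which is one more reason not to leave shadow/ lying around.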

Build the automation. A few teams have built internal tooling to rotate replicas through smaller volumes on a schedule. Right instinct. Honest version is that it takes longer than expected – a FREEZE that exhausts inode count, a Keeper session timeout mid-ATTACH, a replica falling behind during cutover, an EBS modification that takes six hours on a multi-TB volume. And the tooling has to be maintained as ClickHouse versions and EBS behaviour drift. Every ClickHouse-on-EBS team is building roughly the same thing. None of them are publishing it.

OPTIMIZE FINAL and hope. Doesn’t shrink the volume. Won’t run when you need it. Move on.

Where Datafy fits

Disclosure: we make a tool that addresses this.

Datafy sits below ClickHouse at the block layer. Presents a stable virtual volume to the filesystem, manages the underlying EBS allocation automatically – grows when consumption rises, shrinks when it falls. The MergeTree engine doesn’t know anything has changed. Rule-based and deterministic; you set the thresholds. Can be removed at any time, leaving you on standard EBS with no Datafy dependency.

What changes for the reader is the shape of the problem. You still think about merges, headroom, and partition strategy. You stop thinking about whether your fleet’s provisioning policy is paying for capacity that’s only used during backfills, and whether the next backfill walks you into the deadlock above.

Not the only way to solve this. Migration works. Tiering works. Building it in-house works if you have the team. The argument for buying is the usual one – the problem is common, the stakes are real, and the cost of getting it wrong shows up at 2am.

What to do on Monday

Measure the gap first.

SELECT
  formatReadableSize(sum(bytes_on_disk)) AS active_size,
  count() AS active_parts
FROM system.parts
WHERE active;

Compare three numbers: what system.parts reports as active, what df reports on the data volume, what AWS bills you for. The first gap (active vs. df) is inactive parts and merge working space – solvable with patience. The second gap (df vs. EBS) is structural waste. That’s the number that matters.
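Two of the three numbers can come from one query, sketched here. The column names and the percentage maths are ours. It assumes the disk holds only ClickHouse data and that the filesystem spans the whole EBS volume. The provisioned size itself isn't visible from inside ClickHouse, so check it against the AWS console.

```sql
-- Per local disk: filesystem usage vs. what active parts account for.
-- Remote (S3-type) disks report meaningless totals; ignore those rows.
SELECT
  d.name AS disk,
  formatReadableSize(d.total_space) AS fs_total,
  formatReadableSize(d.total_space - d.free_space) AS fs_used,
  formatReadableSize(p.active_bytes) AS active_parts_size,
  -- First gap: used on disk but not active parts (inactive parts, merge temp files).
  round(100 * (toInt64(d.total_space - d.free_space) - toInt64(p.active_bytes))
        / d.total_space, 1) AS first_gap_pct,
  -- Approximates the second gap: space on the volume that is simply free.
  round(100 * d.free_space / d.total_space, 1) AS free_pct
FROM system.disks AS d
LEFT JOIN
(
  SELECT disk_name, sum(bytes_on_disk) AS active_bytes
  FROM system.parts
  WHERE active
  GROUP BY disk_name
) AS p ON p.disk_name = d.name;
```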

While you’re there, check system.merges and your part-count metric. Big merges that have been “running” for hours without progress mean you’re closer to the deadlock than you think.
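A sketch of that check: running merges sorted by how long they have been going, plus the worst per-partition part count the server currently sees.

```sql
-- Running merges, longest first. High elapsed_s with low progress_pct is the warning sign.
SELECT
  database,
  table,
  round(elapsed) AS elapsed_s,
  round(100 * progress, 1) AS progress_pct,
  num_parts,
  formatReadableSize(total_size_bytes_compressed) AS merge_size,
  result_part_name
FROM system.merges
ORDER BY elapsed DESC;

-- Highest active part count in any single partition on this server.
SELECT metric, value
FROM system.asynchronous_metrics
WHERE metric = 'MaxPartCountForPartition';
```

An empty result from the first query on a busy table is not good news either. It can mean merges aren't being scheduled at all.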

Then act on the size of the second gap:

  • Under 15%: leave it alone.

  • 15-30%: bundle a migration into your next version upgrade.

  • Over 30%: pick one – manual migration, in-house automation, or something underneath ClickHouse that handles the volatility.

Whatever you pick, test it on a restored snapshot first. The mistakes on this kind of work don’t show up during the operation. They show up three days later, when a query that used to be fast isn’t, and you’re trying to remember what you changed.
