# Disk performance analysis report
_Exported on 2/10/2026 at 16:37:25 GMT+1 from Cursor (2.4.31)_

---

**User**

this is the report of "iostat -x". what can you tell me about my disks?

avg-cpu: %user %nice %system %iowait %steal %idle
8.78 0.00 26.35 54.39 0.00 10.47
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 594.00 2376.00 0.00 0.00 0.20 4.00 411.00 1644.00 0.00 0.00 0.30 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 5.40
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 90.00 408.00 0.00 0.00 0.39 4.53 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.90
dm-10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 201.00 4872.00 0.00 0.00 0.28 24.24 134.00 556.00 0.00 0.00 0.42 4.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.11 2.80
dm-4 201.00 4872.00 0.00 0.00 0.28 24.24 134.00 556.00 0.00 0.00 0.42 4.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.11 2.80
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 84.00 344.00 0.00 0.00 0.31 4.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 1.00
dm-7 201.00 4872.00 0.00 0.00 0.31 24.24 46.00 196.00 0.00 0.00 0.46 4.26 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.08 2.90
dm-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-9 0.00 0.00 0.00 0.00 0.00 0.00 4.00 16.00 0.00 0.00 2.25 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.60
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 574.00 7248.00 221.00 27.80 0.21 12.63 254.00 2608.00 381.00 60.00 0.24 10.27 0.00 0.00 0.00 0.00 0.00 0.00 3.00 2.00 0.19 7.30
sdc 164.00 8044.00 0.00 0.00 36.77 49.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.03 74.20
sdd 134.00 6696.00 0.00 0.00 42.81 49.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.74 69.20
sde 152.00 7852.00 0.00 0.00 38.44 51.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.84 67.10
sdf 140.00 6992.00 0.00 0.00 42.84 49.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.00 75.30

---

**Cursor**

Here's my analysis of your `iostat -x` output:

## CPU Overview -- I/O Bottleneck

The most alarming number is **%iowait at 54.39%**. Your CPU is spending over half its time idle, waiting on disk I/O. Combined with only 10.47% true idle time, your system is severely **I/O bound**.
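
To confirm this is a sustained condition rather than a momentary spike, it is worth sampling over an interval; a minimal sketch (the interval and sample count are arbitrary choices):

```bash
# Extended per-device stats every 2 seconds, 5 samples (the first sample is the average since boot)
iostat -x 2 5

# System-wide view: the "wa" column is iowait, "b" is processes blocked on I/O
vmstat 2 5
```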
## Physical Disks

### `sdb` -- Healthy (likely SSD)

- Handling both reads (574 r/s, ~7 MB/s) and writes (254 w/s, ~2.6 MB/s)
- Excellent latencies: **0.21 ms** read, **0.24 ms** write
- Good merge ratios (27.8% read, 60% write -- the I/O scheduler is coalescing requests well)
- Only **7.3% utilized** -- this disk has plenty of headroom
- This is almost certainly an **SSD** given the sub-millisecond latencies

### `sdc`, `sdd`, `sde`, `sdf` -- Saturated (likely HDDs in a RAID)

| Disk | r/s | rkB/s | r_await (ms) | aqu-sz | %util |
|------|-----|-------|--------------|--------|-------|
| sdc | 164 | 8,044 | **36.77** | 6.03 | **74.2%** |
| sdd | 134 | 6,696 | **42.81** | 5.74 | **69.2%** |
| sde | 152 | 7,852 | **38.44** | 5.84 | **67.1%** |
| sdf | 140 | 6,992 | **42.84** | 6.00 | **75.3%** |

These four disks share a very similar pattern:

- **Read-only** workload (zero writes)
- **High latencies** in the 37-43 ms range -- typical of spinning HDDs under load
- **High utilization** (67-75%) -- approaching saturation
- **Deep queues** (aqu-sz ~6) -- requests are piling up waiting to be served
- Similar request sizes (~50 KB) suggest they are part of a **RAID array** (likely RAID 5/6/10)

### `sda` -- Completely idle

## Device Mapper (dm-*) Devices

- **dm-0**: Main active LV on `sdb`, small 4 KB I/Os, healthy latencies (<0.3 ms), low utilization (5.4%)
- **dm-3, dm-4**: Identical stats (mirrors or the same LV exposed twice), reading ~4.8 MB/s at 24 KB request size -- these map to the HDD array
- **dm-7**: Also reads from the HDD array, similar pattern
- **dm-1, dm-6, dm-9**: Light write activity, healthy
- **dm-2, dm-8, dm-10**: Completely idle

## Key Takeaways

1. **Your bottleneck is the HDD array (sdc/sdd/sde/sdf).** They are heavily loaded with reads, running at 67-75% utilization with 37-43 ms latencies and deep queues. This is what's causing the 54% iowait.

2. **Your SSD (sdb) is fine** -- low utilization, sub-ms latencies, plenty of capacity left.

3. **Recommendations:**
   - If possible, move the read-heavy workload (whatever is hitting sdc-sdf) to SSD storage
   - Check what process is generating all those reads (`iotop -o` or `pidstat -d` can help identify it -- see the sketch after this list)
   - If this is a RAID array, verify its health -- degraded RAID can cause elevated latencies
   - Consider adding a read cache (e.g., `bcache`, `lvmcache`) in front of the HDD array using SSD
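
A minimal sketch of that process-level check (`iotop` comes from the iotop package, `pidstat` from sysstat):

```bash
# Only show processes/threads actually doing I/O; batch mode, 3 samples (run as root)
iotop -o -b -n 3

# Per-process disk read/write rates every 2 seconds, 5 samples
pidstat -d 2 5
```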

---

**User**

this is a Proxmox Backup server. the HDD are in Raid5. probably the server is running a GC job. How can i try to lessen the load or improve the performance? i don't think a cache SSD will do the job. i need something better

---

**Cursor**

Great context -- PBS garbage collection is notoriously I/O-heavy because it walks the entire chunk store to find unreferenced chunks. With RAID5 on spinning disks, that's a worst-case scenario (tons of random reads across all drives). Here's what you can do, from most impactful to least:

---

## 1. ZFS Special VDEV (Best Bang for Buck)

If you're running ZFS (which is the default/recommended for PBS datastores), this is **the** answer. A **special vdev** stores all **metadata and small blocks** on SSD. Since PBS GC is overwhelmingly a metadata-walking operation, this moves the bottleneck off the HDDs entirely.

```bash
# Add a special vdev (mirror recommended for redundancy)
zpool add <your-pool> special mirror /dev/sdX /dev/sdY

# Set small block allocation to special vdev (e.g., blocks <= 64K)
zfs set special_small_blocks=64K <your-pool>/<datastore>
```

Why this is better than L2ARC/bcache:

- **Guaranteed** metadata lives on SSD (not a probabilistic cache)
- Small PBS chunks (typically 64K-4MB, but metadata entries are tiny) get stored on SSD too
- GC performance improves **dramatically** -- often 10x or more
- Survives reboots (it's persistent, not a cache)

You only need a relatively small SSD -- metadata is typically 1-5% of total pool size. A mirrored pair of 200-500 GB NVMe/SSDs would cover most datastores.

---

## 2. Throttle GC I/O Impact

While you work on the hardware side, immediately reduce GC's impact on the system:

### Schedule GC during low-activity windows

In the PBS web UI: **Datastore -> Garbage Collection -> Schedule**, set it to run at night or whenever backup traffic is lowest.
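
The same schedule can also be set from the shell; a hedged sketch (the datastore name is a placeholder -- confirm the exact option with `proxmox-backup-manager datastore update --help` on your PBS version):

```bash
# Run garbage collection on Saturdays at 02:00 (systemd calendar event syntax)
proxmox-backup-manager datastore update <datastore-name> --gc-schedule 'sat 02:00'
```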

### Limit GC I/O priority with ionice

Find the running PBS worker processes and lower their I/O priority:

```bash
# Put every proxmox-backup process (including the GC worker) into the idle I/O class
pgrep -f "proxmox-backup" | xargs -I{} ionice -c3 -p {}
```

`-c3` sets the "idle" scheduling class -- those processes only get I/O time when nothing else needs the disk. Note that this matches all `proxmox-backup` processes, not just the GC task, so concurrent backups are deprioritized too.

For a persistent solution, create a systemd drop-in:

```bash
mkdir -p /etc/systemd/system/proxmox-backup-proxy.service.d
cat > /etc/systemd/system/proxmox-backup-proxy.service.d/io-limit.conf << 'EOF'
[Service]
IOSchedulingClass=idle
IOSchedulingPriority=7
EOF
systemctl daemon-reload
systemctl restart proxmox-backup-proxy
```
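
Keep in mind the drop-in covers the whole proxy service, so backup and restore traffic is deprioritized along with GC. To check that the override is active, a quick sketch:

```bash
# Show the effective I/O scheduling settings systemd applied to the proxy
systemctl show proxmox-backup-proxy.service -p IOSchedulingClass -p IOSchedulingPriority
```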

### Tune chunk verification schedule

If you have aggressive verify jobs, reduce their frequency (Datastore -> Verify Jobs in the web UI, or `/etc/proxmox-backup/verification.cfg`) -- they compound the I/O load on top of GC.

---

## 3. ZFS Tuning (If Applicable)

```bash
# Increase ARC size to cache more metadata in RAM
# Check the current ARC size limit (bytes)
cat /proc/spl/kstat/zfs/arcstats | grep c_max

# Set ARC max to e.g. 16 GB (value in bytes) in /etc/modprobe.d/zfs.conf
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf

# Apply the new limit immediately without a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# If the root filesystem is also ZFS, refresh the initramfs so the value is picked up at boot
update-initramfs -u

# Read I/O aggregation gap limit, in bytes (32 KiB shown -- this is the ZFS default; only raise it if testing shows a benefit)
echo "options zfs zfs_vdev_read_gap_limit=32768" >> /etc/modprobe.d/zfs.conf
```

Also ensure `ashift=12` was used when the pool was created (should be the default for modern drives).
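
To check this on an existing pool (a quick sketch; `<pool-name>` is a placeholder):

```bash
# Pool-level ashift property (0 means it was auto-detected at creation)
zpool get ashift <pool-name>

# Per-vdev ashift as recorded in the pool configuration
zdb -C <pool-name> | grep ashift
```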

---

## 4. Consider RAID Level Change (Longer Term)

RAID5 with 4 disks gives you:

- 3 disks of usable space
- Random reads spread across all four disks
- But **parity overhead** on writes and the classic **write hole** problem

**RAID10 (mirror + stripe)** would give you:

- 2 disks of usable space (you lose capacity)
- Roughly **2x the random write IOPS** of RAID5 for the same disk count (no parity read-modify-write)
- And better, more consistent read latency under load, since either side of each mirror can serve a read

If you can tolerate the capacity loss, converting to RAID10 helps.
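
For reference, here is what the ZFS equivalent (a pool of two striped mirrors) would look like -- a hedged sketch only: the pool and device names are placeholders, and it requires destroying and recreating the pool after the data has been copied elsewhere:

```bash
# DANGER: destroys the existing pool -- only run after everything has been migrated off
zpool destroy tank

# Recreate the four HDDs as two mirrored pairs (striped mirrors, the ZFS analogue of RAID10)
zpool create -o ashift=12 tank \
  mirror /dev/disk/by-id/<hdd1> /dev/disk/by-id/<hdd2> \
  mirror /dev/disk/by-id/<hdd3> /dev/disk/by-id/<hdd4>
```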

---

## TL;DR Priority List

| Priority | Action | Impact | Effort |
|----------|--------|--------|--------|
| 1 | **ZFS special vdev** (mirrored SSD) | Huge | Medium |
| 2 | ionice GC to idle class | Medium | Low |
| 3 | Schedule GC off-peak | Medium | Low |
| 4 | Increase ZFS ARC (RAM) | Medium | Low |
| 5 | RAID5 -> RAID10 migration | Large | High |

The **special vdev** is the single best investment you can make. Two mid-range NVMe SSDs in a mirror will transform GC performance without replacing your entire HDD array or losing storage capacity.

---

**User**

let's analyze deeper the ZFS vdev thing. PBS is a LXC container with a mount point which is part of my ZFS pool. how can i achieve the metadata mirror on SSD? i need special care in term of redundancy?

---

**Cursor**

Great question -- the special vdev is powerful but has a critical "gotcha" around redundancy. Let me walk through your specific setup.

## Your Architecture

```
PVE Host
├── ZFS Pool (raidz1: sdc, sdd, sde, sdf)   <-- HDD array
│   └── dataset (e.g., tank/pbs-data)
│       └── mounted into LXC container as bind mount
└── LXC Container (PBS)
    └── /mnt/datastore/...   (sees the ZFS dataset)
```

The key insight: **the special vdev is a pool-level operation done on the PVE host**. The LXC container doesn't know or care -- it just transparently benefits from faster metadata access.
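
If you want to confirm which dataset actually backs the container's datastore mount, you can trace it from the host side; a small sketch (`<CTID>` is a placeholder for the container ID):

```bash
# Show the container's root and extra mount points as configured in PVE
pct config <CTID> | grep -E '^(rootfs|mp[0-9]+):'

# List datasets and mountpoints to match the mount point source to the pool
zfs list -o name,mountpoint
```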

---

## Redundancy: This Is Critical

**If you lose the special vdev, you lose the ENTIRE pool.** Not just metadata -- everything. The pool becomes unimportable.

This is because metadata on the special vdev is **not duplicated** back to the main raidz1 vdev. It's the authoritative copy. This makes the special vdev fundamentally different from L2ARC or SLOG, where losing the device only costs you cached or in-flight data, not the pool.

### Redundancy rules:

| Pool main vdev | Special vdev minimum | Recommendation |
|---|---|---|
| raidz1 (single parity) | **mirror** (minimum) | mirror of 2 SSDs |
| raidz2 (double parity) | **mirror** or 3-way mirror | 3-way mirror ideally |
| mirror | **mirror** | mirror of 2 SSDs |

Since your pool is **raidz1** (tolerates 1 disk failure), a **mirrored pair of SSDs** gives you matching fault tolerance: you can lose 1 SSD and the pool keeps running.

---

## Step-by-Step Implementation

### 1. Identify your pool

On the **PVE host**:

```bash
# See your pool layout
zpool status

# Note the pool name (e.g., "tank", "rpool", "pbs-pool", etc.)
# and confirm it's the raidz1 with sdc/sdd/sde/sdf
```

### 2. Install and identify SSDs

Install two SSDs (same model/size recommended). Identify them by persistent path:

```bash
ls -la /dev/disk/by-id/ | grep -v part
```

**Always use `/dev/disk/by-id/` paths**, never raw `/dev/sdX` -- device letters can change on reboot.

### 3. Add the mirrored special vdev

```bash
# DRY RUN first -- the -n flag only simulates the change
zpool add -n <pool-name> special mirror \
  /dev/disk/by-id/<ssd1-id> \
  /dev/disk/by-id/<ssd2-id>

# If the dry run looks good, do it for real
zpool add <pool-name> special mirror \
  /dev/disk/by-id/<ssd1-id> \
  /dev/disk/by-id/<ssd2-id>
```

**WARNING:** Treat this operation as **permanent**. OpenZFS device removal (`zpool remove`) cannot evacuate a top-level vdev from a pool that contains raidz data vdevs -- which is exactly your layout -- so once added, the special mirror stays.

### 4. Configure small block allocation

```bash
# Find the dataset that's mounted into the PBS container
zfs list

# Set the small block threshold
zfs set special_small_blocks=64K <pool-name>/<dataset-name>
```

What this does:

- **All metadata** -> always goes to the special vdev (this is automatic, no setting needed)
- **Data blocks <= 64K** -> also go to the special vdev
- **Data blocks > 64K** -> stay on the HDD raidz1

For PBS, the metadata (chunk indexes, manifests, namespace info) is what GC hammers. The actual backup chunks are larger and stay on HDDs -- which is fine since reads of large chunks are what HDDs do well.

### 5. Verify

```bash
zpool status
```

You should now see something like:

```
  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
        special
          mirror-1                  ONLINE       0     0     0
            /dev/disk/by-id/ssd1    ONLINE       0     0     0
            /dev/disk/by-id/ssd2    ONLINE       0     0     0
```
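
Once data starts flowing, you can watch how much is actually landing on the special mirror -- per-vdev allocation is visible with:

```bash
# ALLOC/FREE broken down per vdev -- the special mirror appears as its own line
zpool list -v <pool-name>
```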

---

## What Happens Over Time

- **New writes**: metadata + small blocks immediately go to the SSDs
- **Existing data**: metadata and small blocks written before the special vdev was added stay on the HDDs -- ZFS does not migrate them retroactively, and a scrub does not move them either (it only reads and verifies)
- Existing data only moves as it is rewritten -- e.g., through new backups, or by rewriting the dataset with `zfs send`/`zfs receive` (see the sketch below)
- GC performance therefore improves immediately for new data and progressively as old data gets rewritten
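
If you would rather force the migration than wait, the usual approach is to rewrite the dataset; a heavily hedged sketch (the dataset names are placeholders, it needs enough free space for a second full copy, and the datastore should be idle while you swap names):

```bash
# Copy the datastore dataset; everything written into the new dataset
# follows the special_small_blocks policy and lands its metadata on the SSDs
zfs snapshot tank/pbs-data@migrate
zfs send tank/pbs-data@migrate | zfs receive tank/pbs-data-new

# After verifying the copy, swap the names (keep the old dataset until you are sure)
zfs rename tank/pbs-data tank/pbs-data-old
zfs rename tank/pbs-data-new tank/pbs-data
```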

---

## SSD Sizing

Rule of thumb for PBS:

| Datastore size | Recommended SSD size (each) |
|---|---|
| < 5 TB | 120-250 GB |
| 5-20 TB | 250-500 GB |
| 20-50 TB | 500 GB - 1 TB |
| > 50 TB | 1 TB+ |

Metadata is typically 3-5% of pool size. With `special_small_blocks=64K`, add some buffer. It's better to oversize than undersize -- if the special vdev fills up, ZFS falls back to the main vdev (no data loss, just reduced benefit).
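
If you want a number instead of a rule of thumb, zdb can report block statistics for the existing pool, which community guides often use to size special vdevs; a hedged sketch (it walks the whole pool, so it is slow and adds read load of its own -- check `man zdb` for the exact options on your version):

```bash
# Block statistics with a size histogram; look at the metadata and small-block totals
zdb -Lbbbs <pool-name>
```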

---

## Summary Checklist

- [ ] Use **two SSDs in mirror** -- never a single SSD
- [ ] Use `/dev/disk/by-id/` paths
- [ ] Do a `zpool add -n` dry run first
- [ ] Set `special_small_blocks=64K` on the PBS dataset
- [ ] Consider enterprise SSDs with power-loss protection (data integrity matters for metadata)
- [ ] Remember that existing metadata only migrates as data is rewritten (new backups or a `zfs send`/`zfs receive` rewrite)
- [ ] No changes needed inside the LXC container -- it's transparent