
Disk performance analysis report

Exported on 2/10/2026 at 16:37:25 GMT+1 from Cursor (2.4.31)


User

This is the report of "iostat -x". What can you tell me about my disks?

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.78    0.00   26.35   54.39    0.00   10.47

Device     r/s    rkB/s  rrqm/s  %rrqm  r_await  rareq-sz     w/s    wkB/s  wrqm/s  %wrqm  w_await  wareq-sz   d/s  dkB/s  drqm/s  %drqm  d_await  dareq-sz   f/s  f_await  aqu-sz  %util
dm-0    594.00  2376.00    0.00   0.00     0.20      4.00  411.00  1644.00    0.00   0.00     0.30      4.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.24   5.40
dm-1      0.00     0.00    0.00   0.00     0.00      0.00   90.00   408.00    0.00   0.00     0.39      4.53  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.04   0.90
dm-10     0.00     0.00    0.00   0.00     0.00      0.00    0.00     0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.00
dm-2      0.00     0.00    0.00   0.00     0.00      0.00    0.00     0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.00
dm-3    201.00  4872.00    0.00   0.00     0.28     24.24  134.00   556.00    0.00   0.00     0.42      4.15  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.11   2.80
dm-4    201.00  4872.00    0.00   0.00     0.28     24.24  134.00   556.00    0.00   0.00     0.42      4.15  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.11   2.80
dm-6      0.00     0.00    0.00   0.00     0.00      0.00   84.00   344.00    0.00   0.00     0.31      4.10  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.03   1.00
dm-7    201.00  4872.00    0.00   0.00     0.31     24.24   46.00   196.00    0.00   0.00     0.46      4.26  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.08   2.90
dm-8      0.00     0.00    0.00   0.00     0.00      0.00    0.00     0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.00
dm-9      0.00     0.00    0.00   0.00     0.00      0.00    4.00    16.00    0.00   0.00     2.25      4.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.01   0.60
sda       0.00     0.00    0.00   0.00     0.00      0.00    0.00     0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    0.00   0.00
sdb     574.00  7248.00  221.00  27.80     0.21     12.63  254.00  2608.00  381.00  60.00     0.24     10.27  0.00   0.00    0.00   0.00     0.00      0.00  3.00     2.00    0.19   7.30
sdc     164.00  8044.00    0.00   0.00    36.77     49.05    0.00     0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    6.03  74.20
sdd     134.00  6696.00    0.00   0.00    42.81     49.97    0.00     0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    5.74  69.20
sde     152.00  7852.00    0.00   0.00    38.44     51.66    0.00     0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    5.84  67.10
sdf     140.00  6992.00    0.00   0.00    42.84     49.94    0.00     0.00    0.00   0.00     0.00      0.00  0.00   0.00    0.00   0.00     0.00      0.00  0.00     0.00    6.00  75.30


Cursor

Here's my analysis of your iostat -x output:

CPU Overview -- I/O Bottleneck

The most alarming number is %iowait at 54.39%. Your CPU is spending over half its time idle, waiting on disk I/O. Combined with only 10.47% true idle time, your system is severely I/O bound.
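
To see whether this is a sustained condition or just a momentary spike, it's worth sampling over an interval rather than relying on a single snapshot, for example:

# Per-disk stats every 5 seconds (the first sample is the since-boot average)
iostat -x 5

# The "wa" column here is the same iowait percentage over time
vmstat 5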

Physical Disks

sdb -- Healthy (likely SSD)

  • Handling both reads (574 r/s, ~7 MB/s) and writes (254 w/s, ~2.5 MB/s)
  • Excellent latencies: 0.21 ms read, 0.24 ms write
  • Good merge ratios (27.8% read, 60% write -- the I/O scheduler is coalescing requests well)
  • Only 7.3% utilized -- this disk has plenty of headroom
  • This is almost certainly an SSD given the sub-millisecond latencies

sdc, sdd, sde, sdf -- Saturated (likely HDDs in a RAID)

| Disk | r/s | rkB/s | r_await (ms) | aqu-sz | %util |
|------|-----|-------|--------------|--------|-------|
| sdc  | 164 | 8,044 | 36.77        | 6.03   | 74.2% |
| sdd  | 134 | 6,696 | 42.81        | 5.74   | 69.2% |
| sde  | 152 | 7,852 | 38.44        | 5.84   | 67.1% |
| sdf  | 140 | 6,992 | 42.84        | 6.00   | 75.3% |

These four disks share a very similar pattern:

  • Read-only workload (zero writes)
  • High latencies in the 37-43 ms range -- typical of spinning HDDs under load
  • High utilization (67-75%) -- approaching saturation
  • Deep queues (aqu-sz ~6) -- requests are piling up waiting to be served
  • Similar request sizes (~50 KB) suggest they are part of a RAID array (likely RAID 5/6/10)

sda -- Completely idle

Device Mapper (dm-*) Devices

  • dm-0: Main active LV on sdb, small 4 KB I/Os, healthy latencies (<0.3 ms), low utilization (5.4%)
  • dm-3, dm-4: Identical stats (mirrors or the same LV exposed twice), reading ~4.8 MB/s at 24 KB request size -- these map to the HDD array
  • dm-7: Also reads from the HDD array, similar pattern
  • dm-1, dm-6, dm-9: Light write activity, healthy
  • dm-2, dm-8, dm-10: Completely idle
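
If you want to confirm which LVs or filesystems those dm-* names correspond to:

# Show each device-mapper device, its dm-N kernel name, size and mount point
lsblk -o NAME,KNAME,TYPE,SIZE,MOUNTPOINT

# Or list the mapper targets directly
dmsetup ls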

Key Takeaways

  1. Your bottleneck is the HDD array (sdc/sdd/sde/sdf). They are heavily loaded with reads, running at 67-75% utilization with 37-43 ms latencies and deep queues. This is what's causing the 54% iowait.

  2. Your SSD (sdb) is fine -- low utilization, sub-ms latencies, plenty of capacity left.

  3. Recommendations:

    • If possible, move the read-heavy workload (whatever is hitting sdc-sdf) to SSD storage
    • Check what process is generating all those reads (iotop -o or pidstat -d can help identify it; see the example after this list)
    • If this is a RAID array, verify its health -- degraded RAID can cause elevated latencies
    • Consider adding a read cache (e.g., bcache, lvmcache) in front of the HDD array using SSD
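
For example (iotop comes from the iotop package, pidstat from sysstat):

# Only processes currently doing I/O, accumulated totals, one line per process
iotop -oPa

# Per-process disk throughput every 5 seconds
pidstat -d 5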

User

This is a Proxmox Backup Server. The HDDs are in RAID5. Probably the server is running a GC job. How can I lessen the load or improve the performance? I don't think a cache SSD will do the job; I need something better.


Cursor

Great context -- PBS garbage collection is notoriously I/O-heavy because it walks the entire chunk store to find unreferenced chunks. With RAID5 on spinning disks, that's a worst-case scenario (tons of random reads across all drives). Here's what you can do, from most impactful to least:


1. ZFS Special VDEV (Best Bang for Buck)

If you're running ZFS (which is the default/recommended for PBS datastores), this is the answer. A special vdev stores all metadata and small blocks on SSD. Since PBS GC is overwhelmingly a metadata-walking operation, this moves the bottleneck off the HDDs entirely.

# Add a special vdev (mirror recommended for redundancy)
zpool add <your-pool> special mirror /dev/sdX /dev/sdY

# Set small block allocation to special vdev (e.g., blocks <= 64K)
zfs set special_small_blocks=64K <your-pool>/<datastore>

Why this is better than L2ARC/bcache:

  • Guaranteed metadata lives on SSD (not a probabilistic cache)
  • Data blocks at or below the small-block threshold (indexes, manifests and other tiny PBS files) get stored on SSD too, while the large backup chunks stay on the HDDs
  • GC performance improves dramatically -- often 10x or more
  • Survives reboots (it's persistent, not a cache)

You only need a relatively small SSD -- metadata is typically 1-5% of total pool size. A mirrored pair of 200-500 GB NVMe/SSDs would cover most datastores.


2. Throttle GC I/O Impact

While you work on the hardware side, immediately reduce GC's impact on the system:

Schedule GC during low-activity windows

In the PBS web UI: Datastore -> Garbage Collection -> Schedule, set it to run at night or whenever backup traffic is lowest.
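
If you prefer the CLI, the GC schedule is a datastore property (calendar-event syntax), so it should also be settable from the PBS shell along these lines (the datastore name here is illustrative):

# Run GC at 02:30 every night
proxmox-backup-manager datastore update mystore --gc-schedule "02:30"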

Limit GC I/O priority with ionice

Find the GC process and lower its I/O priority:

# Find the GC worker process
pgrep -f "proxmox-backup" | xargs -I{} ionice -c3 -p {}

-c3 sets it to "idle" class -- it only gets I/O when nothing else needs the disk.
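
You can check whether the class took effect; also keep in mind that I/O scheduling classes are only fully honored by the BFQ (or legacy CFQ) scheduler, so with none or mq-deadline the effect is limited:

# Print the current I/O scheduling class of each PBS process
pgrep -f "proxmox-backup" | xargs -I{} ionice -p {}

# See which scheduler the HDDs are using (sdc as an example)
cat /sys/block/sdc/queue/scheduler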

For a persistent solution, create a systemd drop-in:

mkdir -p /etc/systemd/system/proxmox-backup-proxy.service.d
cat > /etc/systemd/system/proxmox-backup-proxy.service.d/io-limit.conf << 'EOF'
[Service]
IOSchedulingClass=idle
IOSchedulingPriority=7
EOF
systemctl daemon-reload
systemctl restart proxmox-backup-proxy
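
To verify that systemd picked up the override:

# Show the effective I/O scheduling settings for the service
systemctl show proxmox-backup-proxy -p IOSchedulingClass -p IOSchedulingPriority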

Tune chunk verification schedule

In /etc/proxmox-backup/datastore.cfg, reduce verification frequency if you have aggressive verify jobs -- they compound the I/O load on top of GC.


3. ZFS Tuning (If Applicable)

# Increase ARC size to cache more metadata in RAM
# Check the current ARC maximum (bytes)
cat /proc/spl/kstat/zfs/arcstats | grep c_max

# Set ARC max to e.g. 16 GB (value in bytes) in /etc/modprobe.d/zfs.conf
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf

# Apply it immediately without a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max

# zfs_vdev_read_gap_limit controls how far apart two reads can be and still
# be merged into one I/O (value in bytes; 32768, i.e. 32K, is the default)
echo "options zfs zfs_vdev_read_gap_limit=32768" >> /etc/modprobe.d/zfs.conf

Also ensure ashift=12 was used when the pool was created (should be default for modern drives).
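
A quick way to check the ashift of the existing pool:

# Dump the cached pool config; each vdev reports its ashift
zdb -C <pool-name> | grep ashift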


4. Consider RAID Level Change (Longer Term)

RAID5 with 4 disks gives you:

  • 3 disks of usable space
  • Random read IOPS roughly equal to 4 individual disks (each disk serves reads)
  • But parity calculation overhead and the write hole problem

RAID10 (striped mirrors) would give you:

  • 2 disks of usable space (you lose capacity)
  • Better random read behavior, since either side of each mirror can serve a read
  • And much better write performance (no parity read-modify-write penalty)

If you can tolerate the capacity loss, converting to RAID10 helps.


TL;DR Priority List

| Priority | Action | Impact | Effort |
|----------|--------|--------|--------|
| 1 | ZFS special vdev (mirrored SSD) | Huge | Medium |
| 2 | ionice GC to idle class | Medium | Low |
| 3 | Schedule GC off-peak | Medium | Low |
| 4 | Increase ZFS ARC (RAM) | Medium | Low |
| 5 | RAID5 -> RAID10 migration | Large | High |

The special vdev is the single best investment you can make. Two mid-range NVMe SSDs in a mirror will transform GC performance without replacing your entire HDD array or losing storage capacity.


User

Let's analyze the ZFS special vdev idea in more depth. PBS runs in an LXC container with a mount point that is part of my ZFS pool. How can I achieve the metadata mirror on SSD? Do I need to take special care in terms of redundancy?


Cursor

Great question -- the special vdev is powerful but has a critical "gotcha" around redundancy. Let me walk through your specific setup.

Your Architecture

PVE Host
├── ZFS Pool (raidz1: sdc, sdd, sde, sdf)  <-- HDD array
│   └── dataset (e.g., tank/pbs-data)
│       └── mounted into LXC container as bind mount
└── LXC Container (PBS)
    └── /mnt/datastore/... (sees the ZFS dataset)

The key insight: adding a special vdev is a pool-level operation done on the PVE host. The LXC container doesn't know or care -- it just transparently benefits from faster metadata access.
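
If you are not sure which dataset backs the container's mount point, you can check from the host (the container ID 100 is just an example):

# Show the container's root and mount point entries
pct config 100 | grep -E '^rootfs|^mp'

# Match the reported path/volume to a ZFS dataset
zfs list -o name,mountpoint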


Redundancy: This Is Critical

If you lose the special vdev, you lose the ENTIRE pool. Not just metadata -- everything. The pool becomes unimportable.

This is because metadata on the special vdev is not duplicated back to the main raidz1 vdev. It's the authoritative copy. This makes the special vdev fundamentally different from L2ARC (where loss just means losing a cache) or SLOG (whose loss costs at most the last few seconds of in-flight sync writes, not the pool).

Redundancy rules:

| Pool main vdev | Special vdev minimum | Recommendation |
|----------------|----------------------|----------------|
| raidz1 (single parity) | mirror | mirror of 2 SSDs |
| raidz2 (double parity) | mirror or 3-way mirror | 3-way mirror ideally |
| mirror | mirror | mirror of 2 SSDs |

Since your pool is raidz1 (tolerates 1 disk failure), a mirrored pair of SSDs gives you matching fault tolerance: you can lose 1 SSD and the pool keeps running.


Step-by-Step Implementation

1. Identify your pool

On the PVE host:

# See your pool layout
zpool status

# Note the pool name (e.g., "tank", "rpool", "pbs-pool", etc.)
# and confirm it's the raidz1 with sdc/sdd/sde/sdf

2. Install and identify SSDs

Install two SSDs (same model/size recommended). Identify them by persistent path:

ls -la /dev/disk/by-id/ | grep -v part

Always use /dev/disk/by-id/ paths, never raw /dev/sdX -- device letters can change on reboot.

3. Add the mirrored special vdev

# DRY RUN first -- add -n flag to simulate
zpool add -n <pool-name> special mirror \
  /dev/disk/by-id/<ssd1-id> \
  /dev/disk/by-id/<ssd2-id>

# If the dry run looks good, do it for real
zpool add <pool-name> special mirror \
  /dev/disk/by-id/<ssd1-id> \
  /dev/disk/by-id/<ssd2-id>

WARNING: Treat this as permanent. Removing a special vdev relies on ZFS top-level device removal (zpool remove), which is not supported on pools that contain raidz vdevs -- so with your raidz1 pool the special vdev cannot be removed once added.

4. Configure small block allocation

# Find the dataset that's mounted into the PBS container
zfs list

# Set the small block threshold
zfs set special_small_blocks=64K <pool-name>/<dataset-name>

What this does:

  • All metadata -> always goes to the special vdev (this is automatic, no setting needed)
  • Data blocks <= 64K -> also go to the special vdev
  • Data blocks > 64K -> stay on the HDD raidz1

For PBS, the metadata (chunk indexes, manifests, namespace info) is what GC hammers. The actual backup chunks are larger and stay on HDDs -- which is fine since sequential reads of large chunks are what HDDs do well.

5. Verify

zpool status

You should now see something like:

  pool: tank
 state: ONLINE
config:

        NAME                        STATE     READ WRITE CKSUM
        tank                        ONLINE       0     0     0
          raidz1-0                  ONLINE       0     0     0
            sdc                     ONLINE       0     0     0
            sdd                     ONLINE       0     0     0
            sde                     ONLINE       0     0     0
            sdf                     ONLINE       0     0     0
        special
          mirror-1                  ONLINE       0     0     0
            /dev/disk/by-id/ssd1    ONLINE       0     0     0
            /dev/disk/by-id/ssd2    ONLINE       0     0     0
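
To watch how much space actually lands on the special mirror over time:

zpool list -v <pool-name>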

What Happens Over Time

  • New writes: metadata + small blocks immediately go to SSDs
  • Existing data: ZFS does not move it retroactively -- metadata that is already on the raidz1 stays there until the data is rewritten (a scrub only reads and verifies blocks, it does not relocate them)
  • To migrate existing data you have to rewrite it, for example by copying the dataset with zfs send/receive (see the sketch after this list) or simply letting chunks get rewritten over time
  • GC performance therefore improves immediately for new data and only gradually for old data
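
A minimal sketch of the copy approach, assuming the dataset is called tank/pbs-data, you have enough free space, and PBS/the container is stopped during the switch (all names are illustrative):

# Copy the dataset so every block -- and its metadata -- is written fresh
zfs snapshot tank/pbs-data@migrate
zfs send tank/pbs-data@migrate | zfs receive tank/pbs-data-new

# After verifying the copy, swap the datasets and keep the old one as a fallback
zfs rename tank/pbs-data tank/pbs-data-old
zfs rename tank/pbs-data-new tank/pbs-data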

SSD Sizing

Rule of thumb for PBS:

| Datastore size | Recommended SSD size (each) |
|----------------|-----------------------------|
| < 5 TB | 120-250 GB |
| 5-20 TB | 250-500 GB |
| 20-50 TB | 500 GB - 1 TB |
| > 50 TB | 1 TB+ |

Metadata is typically 3-5% of pool size. With special_small_blocks=64K, add some buffer. It's better to oversize than undersize -- if the special vdev fills up, ZFS falls back to the main vdev (no data loss, just reduced benefit).
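
If you'd rather measure than estimate, zdb can break down the pool's current allocations by block type; note that it walks the whole pool and can take a long time on a large datastore:

# Per-type block statistics; the metadata categories give a sizing baseline
zdb -bb <pool-name>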


Summary Checklist

  • Use two SSDs in mirror -- never a single SSD
  • Use /dev/disk/by-id/ paths
  • Do a zpool add -n dry run first
  • Set special_small_blocks=64K on the PBS dataset
  • Consider enterprise SSDs with power-loss protection (data integrity matters for metadata)
  • Remember that existing metadata stays on the HDDs until the data is rewritten -- only new writes land on the special vdev automatically
  • No changes needed inside the LXC container -- it's transparent