diff --git a/README.md b/README.md
index 0de1414..c09604f 100644
--- a/README.md
+++ b/README.md
@@ -2,6 +2,10 @@
 The homelab is split across several hosts configured as LXC containers on a Proxmox machine
+## Docs
+
+- Improve PBS performance: [link](doc/disk_performance_analysis_report.md)
+
 ## Troubleshooting

 ### Immich

diff --git a/doc/disk_performance_analysis_report.md b/doc/disk_performance_analysis_report.md
new file mode 100644
index 0000000..41691d5
--- /dev/null
+++ b/doc/disk_performance_analysis_report.md
@@ -0,0 +1,370 @@
# Disk performance analysis report

_Exported on 2/10/2026 at 16:37:25 GMT+1 from Cursor (2.4.31)_

---

**User**

this is the report of "iostat -x". what can you tell me about my disks?

```
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.78    0.00   26.35   54.39    0.00   10.47

Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 594.00 2376.00 0.00 0.00 0.20 4.00 411.00 1644.00 0.00 0.00 0.30 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 5.40
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 90.00 408.00 0.00 0.00 0.39 4.53 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.90
dm-10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 201.00 4872.00 0.00 0.00 0.28 24.24 134.00 556.00 0.00 0.00 0.42 4.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.11 2.80
dm-4 201.00 4872.00 0.00 0.00 0.28 24.24 134.00 556.00 0.00 0.00 0.42 4.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.11 2.80
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 84.00 344.00 0.00 0.00 0.31 4.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 1.00
dm-7 201.00 4872.00 0.00 0.00 0.31 24.24 46.00 196.00 0.00 0.00 0.46 4.26 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.08 2.90
dm-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-9 0.00 0.00 0.00 0.00 0.00 0.00 4.00 16.00 0.00 0.00 2.25 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.60
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 574.00 7248.00 221.00 27.80 0.21 12.63 254.00 2608.00 381.00 60.00 0.24 10.27 0.00 0.00 0.00 0.00 0.00 0.00 3.00 2.00 0.19 7.30
sdc 164.00 8044.00 0.00 0.00 36.77 49.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.03 74.20
sdd 134.00 6696.00 0.00 0.00 42.81 49.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.74 69.20
sde 152.00 7852.00 0.00 0.00 38.44 51.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.84 67.10
sdf 140.00 6992.00 0.00 0.00 42.84 49.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.00 75.30
```

---

**Cursor**

Here's my analysis of your `iostat -x` output:

## CPU Overview -- I/O Bottleneck

The most alarming number is **%iowait at 54.39%**. Your CPU is spending over half its time idle while waiting on outstanding disk I/O. Combined with only 10.47% true idle time, your system is severely **I/O bound**.
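To confirm the picture before changing anything, a couple of quick read-only checks are useful (a sketch -- it assumes the `sysstat` and `iotop` packages are installed):

```bash
# Three 5-second samples: confirms the iowait is sustained, not a spike
iostat -x 5 3

# Per-process disk activity -- shows which processes are issuing the reads
pidstat -d 5 3

# Same idea in batch mode, limited to processes actually doing I/O
iotop -o -b -n 3
```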
## Physical Disks

### `sdb` -- Healthy (likely SSD)
- Handling both reads (574 r/s, ~7 MB/s) and writes (254 w/s, ~2.5 MB/s)
- Excellent latencies: **0.21 ms** read, **0.24 ms** write
- Good merge ratios (27.8% read, 60% write -- the I/O scheduler is coalescing requests well)
- Only **7.3% utilized** -- this disk has plenty of headroom
- This is almost certainly an **SSD** given the sub-millisecond latencies

### `sdc`, `sdd`, `sde`, `sdf` -- Saturated (likely HDDs in a RAID)
| Disk | r/s | rkB/s | r_await (ms) | aqu-sz | %util |
|------|-----|-------|-------------|--------|-------|
| sdc | 164 | 8,044 | **36.77** | 6.03 | **74.2%** |
| sdd | 134 | 6,696 | **42.81** | 5.74 | **69.2%** |
| sde | 152 | 7,852 | **38.44** | 5.84 | **67.1%** |
| sdf | 140 | 6,992 | **42.84** | 6.00 | **75.3%** |

These four disks share a very similar pattern:
- **Read-only** workload (zero writes)
- **High latencies** in the 37-43 ms range -- typical of spinning HDDs under load
- **High utilization** (67-75%) -- approaching saturation
- **Deep queues** (aqu-sz ~6) -- requests are piling up waiting to be served
- Similar request sizes (~50 KB) suggest they are part of a **RAID array** (likely RAID 5/6/10)

### `sda` -- Completely idle

## Device Mapper (dm-*) Devices

- **dm-0**: Main active LV on `sdb`, small 4 KB I/Os, healthy latencies (<0.3 ms), low utilization (5.4%)
- **dm-3, dm-4**: Identical stats (likely stacked device-mapper layers -- e.g., an LV on top of a crypt or thin layer -- so the same I/O is counted twice), reading ~4.8 MB/s at 24 KB request size -- these map to the HDD array
- **dm-7**: Also reads from the HDD array, similar pattern
- **dm-1, dm-6, dm-9**: Light write activity, healthy
- **dm-2, dm-8, dm-10**: Completely idle

## Key Takeaways

1. **Your bottleneck is the HDD array (sdc/sdd/sde/sdf).** They are heavily loaded with reads, running at 67-75% utilization with 37-43 ms latencies and deep queues. This is what's causing the 54% iowait.

2. **Your SSD (sdb) is fine** -- low utilization, sub-ms latencies, plenty of capacity left.

3. **Recommendations:**
   - If possible, move the read-heavy workload (whatever is hitting sdc-sdf) to SSD storage
   - Check what process is generating all those reads (`iotop -o` or `pidstat -d` can help identify it)
   - If this is a RAID array, verify its health -- degraded RAID can cause elevated latencies
   - Consider adding a read cache (e.g., `bcache`, `lvmcache`) in front of the HDD array using SSD

---

**User**

this is a Proxmox Backup server. the HDD are in Raid5. probably the server is running a GC job. How can i try to lessen the load or improve the performance? i don't think a cache SSD will do the job. i need something better

---

**Cursor**

Great context -- PBS garbage collection is notoriously I/O-heavy because it walks the entire chunk store to find unreferenced chunks. With RAID5 on spinning disks, that's a worst-case scenario (tons of random reads across all drives). Here's what you can do, from most impactful to least:

---

## 1. ZFS Special VDEV (Best Bang for Buck)

If you're running ZFS (which is the default/recommended for PBS datastores), this is **the** answer. A **special vdev** stores all **metadata and small blocks** on SSD. Since PBS GC is overwhelmingly a metadata-walking operation, this moves the bottleneck off the HDDs entirely.
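If you're unsure whether the datastore actually sits on ZFS, two read-only commands on the PVE host settle it (a quick sketch; nothing here modifies the pool):

```bash
# Pool layout: confirms the raidz1 and lists its member disks
zpool status

# Datasets and mountpoints: the PBS datastore path should show up here
zfs list -o name,mountpoint,used
```

Assuming it is ZFS, adding the special vdev looks like this (`<pool>`, `/dev/sdX`, and `/dev/sdY` are placeholders):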
```bash
# Add a special vdev (mirror recommended for redundancy)
zpool add <pool> special mirror /dev/sdX /dev/sdY

# Route small data blocks (e.g., blocks <= 64K) to the special vdev as well;
# substitute your pool/dataset names for the placeholders
zfs set special_small_blocks=64K <pool>/<dataset>
```

Why this is better than L2ARC/bcache:
- **Guaranteed** metadata lives on SSD (not a probabilistic cache)
- Data blocks at or below the threshold land on SSD too (PBS chunks are typically 64K-4MB, so most chunk data stays on the HDDs -- it's the tiny metadata entries that matter for GC)
- GC performance improves **dramatically** -- often 10x or more
- Survives reboots (it's persistent, not a cache)

You only need a relatively small SSD -- metadata is typically 1-5% of total pool size. A mirrored pair of 200-500 GB NVMe/SSDs would cover most datastores.

---

## 2. Throttle GC I/O Impact

While you work on the hardware side, immediately reduce GC's impact on the system:

### Schedule GC during low-activity windows

In the PBS web UI: **Datastore -> Garbage Collection -> Schedule**, set it to run at night or whenever backup traffic is lowest (a CLI sketch for this is included at the end of this answer).

### Limit GC I/O priority with ionice

Lower the I/O priority of the PBS processes (the GC worker runs inside them):

```bash
# Set every proxmox-backup process (including the GC worker) to the idle I/O class
pgrep -f "proxmox-backup" | xargs -I{} ionice -c3 -p {}
```

`-c3` sets it to the "idle" class -- it only gets I/O when nothing else needs the disk.

For a persistent solution, create a systemd drop-in (note that this deprioritizes **all** PBS I/O, backups and restores included, not just GC):

```bash
mkdir -p /etc/systemd/system/proxmox-backup-proxy.service.d
cat > /etc/systemd/system/proxmox-backup-proxy.service.d/io-limit.conf << 'EOF'
[Service]
IOSchedulingClass=idle
EOF
systemctl daemon-reload
systemctl restart proxmox-backup-proxy
```

(The idle class ignores the numeric `IOSchedulingPriority`, so no priority setting is needed.)

### Tune chunk verification schedule

If you run aggressive verify jobs, space them out (**Datastore -> Verify Jobs** in the web UI) -- they compound the I/O load on top of GC.

---

## 3. ZFS Tuning (If Applicable)

```bash
# Check the current ARC size limit
grep c_max /proc/spl/kstat/zfs/arcstats

# Raise the ARC max so more metadata stays cached in RAM,
# e.g. 16 GiB (value in bytes) in /etc/modprobe.d/zfs.conf
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf

# Apply immediately without waiting for a reboot
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
```

Leave vdev aggregation tunables such as `zfs_vdev_read_gap_limit` at their defaults (32768 bytes for the read gap) unless benchmarks on your own workload justify a change. Also ensure `ashift=12` was used when the pool was created (should be the default for modern drives).

---

## 4. Consider RAID Level Change (Longer Term)

RAID5/raidz1 with 4 disks gives you:
- 3 disks of usable space
- Poor random read IOPS under ZFS: each block is striped across all members, so a raidz1 vdev delivers roughly the random-read IOPS of a *single* disk
- Parity overhead on every write (and, with classic md RAID5, the write-hole problem)

**RAID10 (striped mirrors)** would give you:
- 2 disks of usable space (you lose capacity)
- Roughly 4 disks' worth of random read IOPS for the same disk count (either side of each mirror can serve reads)
- Much better write performance (no parity calculation)

If you can tolerate the capacity loss, converting to striped mirrors helps.

---

## TL;DR Priority List

| Priority | Action | Impact | Effort |
|----------|--------|--------|--------|
| 1 | **ZFS special vdev** (mirrored SSD) | Huge | Medium |
| 2 | ionice GC to idle class | Medium | Low |
| 3 | Schedule GC off-peak | Medium | Low |
| 4 | Increase ZFS ARC (RAM) | Medium | Low |
| 5 | RAID5 -> RAID10 migration | Large | High |

The **special vdev** is the single best investment you can make. Two mid-range NVMe SSDs in a mirror will transform GC performance without replacing your entire HDD array or losing storage capacity.
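If you prefer the CLI for the GC scheduling change from section 2, a sketch -- the datastore name `store1` is hypothetical, substitute your own:

```bash
# Show the current configuration of the datastore, including gc-schedule
proxmox-backup-manager datastore show store1

# Move GC into a quiet window, e.g. 02:00 every night
# (schedules use systemd calendar event syntax)
proxmox-backup-manager datastore update store1 --gc-schedule '02:00'
```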
+ +--- + +**User** + +let's analyze deeper the ZFS vdev thing. PBS is a LXC container with a mount point which is part of my ZFS pool. how can i achieve the metadata mirror on SSD? i need special care in term of redundancy? + +--- + +**Cursor** + +Great question -- the special vdev is powerful but has a critical "gotcha" around redundancy. Let me walk through your specific setup. + +## Your Architecture + +``` +PVE Host +├── ZFS Pool (raidz1: sdc, sdd, sde, sdf) <-- HDD array +│ └── dataset (e.g., tank/pbs-data) +│ └── mounted into LXC container as bind mount +└── LXC Container (PBS) + └── /mnt/datastore/... (sees the ZFS dataset) +``` + +The key insight: **the special vdev is a pool-level operation done on the PVE host**. The LXC container doesn't know or care -- it just transparently benefits from faster metadata access. + +--- + +## Redundancy: This Is Critical + +**If you lose the special vdev, you lose the ENTIRE pool.** Not just metadata -- everything. The pool becomes unimportable. + +This is because metadata on the special vdev is **not duplicated** back to the main raidz1 vdev. It's the authoritative copy. This makes the special vdev fundamentally different from L2ARC or SLOG (where loss just means losing a cache). + +### Redundancy rules: + +| Pool main vdev | Special vdev minimum | Recommendation | +|---|---|---| +| raidz1 (single parity) | **mirror** (minimum) | mirror of 2 SSDs | +| raidz2 (double parity) | **mirror** or 3-way mirror | 3-way mirror ideally | +| mirror | **mirror** | mirror of 2 SSDs | + +Since your pool is **raidz1** (tolerates 1 disk failure), a **mirrored pair of SSDs** gives you matching fault tolerance: you can lose 1 SSD and the pool keeps running. + +--- + +## Step-by-Step Implementation + +### 1. Identify your pool + +On the **PVE host**: + +```bash +# See your pool layout +zpool status + +# Note the pool name (e.g., "tank", "rpool", "pbs-pool", etc.) +# and confirm it's the raidz1 with sdc/sdd/sde/sdf +``` + +### 2. Install and identify SSDs + +Install two SSDs (same model/size recommended). Identify them by persistent path: + +```bash +ls -la /dev/disk/by-id/ | grep -v part +``` + +**Always use `/dev/disk/by-id/` paths**, never raw `/dev/sdX` -- device letters can change on reboot. + +### 3. Add the mirrored special vdev + +```bash +# DRY RUN first -- add -n flag to simulate +zpool add -n special mirror \ + /dev/disk/by-id/ \ + /dev/disk/by-id/ + +# If the dry run looks good, do it for real +zpool add special mirror \ + /dev/disk/by-id/ \ + /dev/disk/by-id/ +``` + +**WARNING:** This operation is **irreversible** on OpenZFS 2.1 (PVE 7.x). On OpenZFS 2.2+ (PVE 8.x), `zpool remove` of special vdevs is technically possible but still considered risky. Treat it as permanent. + +### 4. Configure small block allocation + +```bash +# Find the dataset that's mounted into the PBS container +zfs list + +# Set the small block threshold +zfs set special_small_blocks=64K / +``` + +What this does: +- **All metadata** -> always goes to the special vdev (this is automatic, no setting needed) +- **Data blocks <= 64K** -> also go to the special vdev +- **Data blocks > 64K** -> stay on the HDD raidz1 + +For PBS, the metadata (chunk indexes, manifests, namespace info) is what GC hammers. The actual backup chunks are larger and stay on HDDs -- which is fine since sequential reads of large chunks are what HDDs do well. + +### 5. 
### 5. Verify

```bash
zpool status
```

You should now see something like:

```
  pool: tank
 state: ONLINE
config:

        NAME                          STATE     READ WRITE CKSUM
        tank                          ONLINE       0     0     0
          raidz1-0                    ONLINE       0     0     0
            sdc                       ONLINE       0     0     0
            sdd                       ONLINE       0     0     0
            sde                       ONLINE       0     0     0
            sdf                       ONLINE       0     0     0
        special
          mirror-1                    ONLINE       0     0     0
            /dev/disk/by-id/ssd1      ONLINE       0     0     0
            /dev/disk/by-id/ssd2      ONLINE       0     0     0
```

---

## What Happens Over Time

- **New writes**: metadata + small blocks immediately go to the SSDs
- **Existing data**: old metadata does **not** migrate on its own -- a block moves to the special vdev only when it is rewritten (a scrub verifies data but does not relocate it)
- To migrate existing data you have to rewrite it, e.g. with a `zfs send | zfs recv` into a fresh dataset, or by simply letting PBS churn chunks naturally over time
- GC performance therefore improves immediately for newly written chunks and progressively for older data

---

## SSD Sizing

Rule of thumb for PBS:

| Datastore size | Recommended SSD size (each) |
|---|---|
| < 5 TB | 120-250 GB |
| 5-20 TB | 250-500 GB |
| 20-50 TB | 500 GB - 1 TB |
| > 50 TB | 1 TB+ |

Metadata is typically 1-5% of pool size. With `special_small_blocks=64K`, add some buffer. It's better to oversize than undersize -- if the special vdev fills up, ZFS falls back to the main vdev (no data loss, just reduced benefit).

---

## Summary Checklist

- [ ] Use **two SSDs in a mirror** -- never a single SSD
- [ ] Use `/dev/disk/by-id/` paths
- [ ] Do a `zpool add -n` dry run first
- [ ] Set `special_small_blocks=64K` on the PBS dataset
- [ ] Consider enterprise SSDs with power-loss protection (data integrity matters for metadata)
- [ ] Remember existing data migrates only when rewritten -- use `zfs send | zfs recv` if you need the benefit for old data now
- [ ] No changes needed inside the LXC container -- it's transparent
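---

## Confirming It Works

Once the next GC run starts, per-vdev statistics show whether the special vdev is absorbing the metadata traffic. A read-only check (`<pool>` is a placeholder for your pool name):

```bash
# Per-vdev I/O, sampled every 5 seconds: during GC the 'special' mirror should
# take most of the small reads while the raidz1 HDDs stay comparatively quiet
zpool iostat -v <pool> 5
```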