Add docs for future work

pazpi 2026-02-10 20:56:35 +01:00
parent 58a8ff88d5
commit e641c5b9b4
2 changed files with 374 additions and 0 deletions


@@ -2,6 +2,10 @@
The homelab is split across several hosts configured as LXC containers on a Proxmox machine
## Docs
- Improve PBS performance: [link](doc/disk_performance_analysis_report.md)
## Troubleshooting
### Immich


@@ -0,0 +1,370 @@
# Disk performance analysis report
_Exported on 2/10/2026 at 16:37:25 GMT+1 from Cursor (2.4.31)_
---
**User**
This is the report of `iostat -x`. What can you tell me about my disks?
```
avg-cpu: %user %nice %system %iowait %steal %idle
8.78 0.00 26.35 54.39 0.00 10.47
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
dm-0 594.00 2376.00 0.00 0.00 0.20 4.00 411.00 1644.00 0.00 0.00 0.30 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.24 5.40
dm-1 0.00 0.00 0.00 0.00 0.00 0.00 90.00 408.00 0.00 0.00 0.39 4.53 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.04 0.90
dm-10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-2 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-3 201.00 4872.00 0.00 0.00 0.28 24.24 134.00 556.00 0.00 0.00 0.42 4.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.11 2.80
dm-4 201.00 4872.00 0.00 0.00 0.28 24.24 134.00 556.00 0.00 0.00 0.42 4.15 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.11 2.80
dm-6 0.00 0.00 0.00 0.00 0.00 0.00 84.00 344.00 0.00 0.00 0.31 4.10 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.03 1.00
dm-7 201.00 4872.00 0.00 0.00 0.31 24.24 46.00 196.00 0.00 0.00 0.46 4.26 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.08 2.90
dm-8 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
dm-9 0.00 0.00 0.00 0.00 0.00 0.00 4.00 16.00 0.00 0.00 2.25 4.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.01 0.60
sda 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
sdb 574.00 7248.00 221.00 27.80 0.21 12.63 254.00 2608.00 381.00 60.00 0.24 10.27 0.00 0.00 0.00 0.00 0.00 0.00 3.00 2.00 0.19 7.30
sdc 164.00 8044.00 0.00 0.00 36.77 49.05 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.03 74.20
sdd 134.00 6696.00 0.00 0.00 42.81 49.97 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.74 69.20
sde 152.00 7852.00 0.00 0.00 38.44 51.66 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.84 67.10
sdf 140.00 6992.00 0.00 0.00 42.84 49.94 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 6.00 75.30
```
---
**Cursor**
Here's my analysis of your `iostat -x` output:
## CPU Overview -- I/O Bottleneck
The most alarming number is **%iowait at 54.39%**. Your CPU is spending over half its time idle, waiting on disk I/O. Combined with only 10.47% true idle time, your system is severely **I/O bound**.
## Physical Disks
### `sdb` -- Healthy (likely SSD)
- Handling both reads (574 r/s, ~7 MB/s) and writes (254 w/s, ~2.5 MB/s)
- Excellent latencies: **0.21 ms** read, **0.24 ms** write
- Good merge ratios (27.8% read, 60% write -- the I/O scheduler is coalescing requests well)
- Only **7.3% utilized** -- this disk has plenty of headroom
- This is almost certainly an **SSD** given the sub-millisecond latencies
### `sdc`, `sdd`, `sde`, `sdf` -- Saturated (likely HDDs in a RAID)
| Disk | r/s | rkB/s | r_await (ms) | aqu-sz | %util |
|------|-----|-------|-------------|--------|-------|
| sdc | 164 | 8,044 | **36.77** | 6.03 | **74.2%** |
| sdd | 134 | 6,696 | **42.81** | 5.74 | **69.2%** |
| sde | 152 | 7,852 | **38.44** | 5.84 | **67.1%** |
| sdf | 140 | 6,992 | **42.84** | 6.00 | **75.3%** |
These four disks share a very similar pattern:
- **Read-only** workload (zero writes)
- **High latencies** in the 37-43 ms range -- typical of spinning HDDs under load
- **High utilization** (67-75%) -- approaching saturation
- **Deep queues** (aqu-sz ~6) -- requests are piling up waiting to be served
- Similar request sizes (~50 KB) suggest they are part of a **RAID array** (likely RAID 5/6/10)
### `sda` -- Completely idle
## Device Mapper (dm-*) Devices
- **dm-0**: Main active LV on `sdb`, small 4 KB I/Os, healthy latencies (<0.3 ms), low utilization (5.4%)
- **dm-3, dm-4**: Identical stats (mirrors or the same LV exposed twice), reading ~4.8 MB/s at 24 KB request size -- these map to the HDD array
- **dm-7**: Also reads from the HDD array, similar pattern
- **dm-1, dm-6, dm-9**: Light write activity, healthy
- **dm-2, dm-8, dm-10**: Completely idle
## Key Takeaways
1. **Your bottleneck is the HDD array (sdc/sdd/sde/sdf).** They are heavily loaded with reads, running at 67-75% utilization with 37-43 ms latencies and deep queues. This is what's causing the 54% iowait.
2. **Your SSD (sdb) is fine** -- low utilization, sub-ms latencies, plenty of capacity left.
3. **Recommendations:**
- If possible, move the read-heavy workload (whatever is hitting sdc-sdf) to SSD storage
- Check what process is generating all those reads (`iotop -o` or `pidstat -d` can help identify it -- see the example below)
- If this is a RAID array, verify its health -- degraded RAID can cause elevated latencies
- Consider adding a read cache (e.g., `bcache`, `lvmcache`) in front of the HDD array using SSD
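For example, a quick way to find the heavy readers, assuming the `iotop` and `sysstat` packages are installed:
```bash
# Show only processes currently doing I/O, with per-process accumulated totals
iotop -oPa
# Or sample per-process disk throughput every 5 seconds
pidstat -d 5
```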
---
**User**
This is a Proxmox Backup Server. The HDDs are in RAID5; the server is probably running a GC job. How can I lessen the load or improve the performance? I don't think a cache SSD will do the job -- I need something better.
---
**Cursor**
Great context -- PBS garbage collection is notoriously I/O-heavy because it walks the entire chunk store to find unreferenced chunks. With RAID5 on spinning disks, that's a worst-case scenario (tons of random reads across all drives). Here's what you can do, from most impactful to least:
---
## 1. ZFS Special VDEV (Best Bang for Buck)
If you're running ZFS (which is the default/recommended for PBS datastores), this is **the** answer. A **special vdev** stores all **metadata and small blocks** on SSD. Since PBS GC is overwhelmingly a metadata-walking operation, this moves the bottleneck off the HDDs entirely.
```bash
# Add a special vdev (mirror recommended for redundancy)
zpool add <your-pool> special mirror /dev/sdX /dev/sdY
# Set small block allocation to special vdev (e.g., blocks <= 64K)
zfs set special_small_blocks=64K <your-pool>/<datastore>
```
Why this is better than L2ARC/bcache:
- **Guaranteed** metadata lives on SSD (not a probabilistic cache)
- Small PBS chunks (typically 64K-4MB, but metadata entries are tiny) get stored on SSD too
- GC performance improves **dramatically** -- often 10x or more
- Survives reboots (it's persistent, not a cache)
You only need a relatively small SSD -- metadata is typically 1-5% of total pool size. A mirrored pair of 200-500 GB NVMe/SSDs would cover most datastores.
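As a rough worked example with hypothetical drive sizes (the actual capacities aren't stated above): four 4 TB HDDs in raidz1 give roughly 12 TB of usable space, and 1-5% of that is about 120-600 GB of metadata -- which is why a mirrored pair of 200-500 GB SSDs is usually enough.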
---
## 2. Throttle GC I/O Impact
While you work on the hardware side, immediately reduce GC's impact on the system:
### Schedule GC during low-activity windows
In the PBS web UI (**Datastore -> your datastore -> Prune & GC**), set the garbage collection schedule to run at night or whenever backup traffic is lowest.
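The same schedule can be set from the shell; a minimal sketch, assuming a datastore named `backup` (substitute your own name and calendar event):
```bash
# Run garbage collection at 02:30 every night
proxmox-backup-manager datastore update backup --gc-schedule '02:30'
```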
### Limit GC I/O priority with ionice
Find the GC process and lower its I/O priority:
```bash
# Put every proxmox-backup process (including the GC worker) in the idle I/O class
pgrep -f "proxmox-backup" | xargs -I{} ionice -c3 -p {}
```
`-c3` sets it to "idle" class -- it only gets I/O when nothing else needs the disk.
For a persistent solution, create a systemd drop-in (note that this puts all `proxmox-backup-proxy` I/O -- backups and restores included, not just GC -- in the idle class):
```bash
mkdir -p /etc/systemd/system/proxmox-backup-proxy.service.d
cat > /etc/systemd/system/proxmox-backup-proxy.service.d/io-limit.conf << 'EOF'
[Service]
IOSchedulingClass=idle
IOSchedulingPriority=7
EOF
systemctl daemon-reload
systemctl restart proxmox-backup-proxy
```
### Tune chunk verification schedule
Verify jobs also walk the chunk store, so they compound the I/O load on top of GC. If you have aggressive verify jobs (configured in the web UI or in `/etc/proxmox-backup/verification.cfg`), reduce their frequency or move them out of the GC window.
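If you prefer the CLI, a hedged sketch (the job ID `v-daily` is a placeholder; check the list output for your real IDs):
```bash
# List configured verify jobs and their schedules
proxmox-backup-manager verify-job list
# Example: move a job to a weekly, off-peak slot
proxmox-backup-manager verify-job update v-daily --schedule 'sun 03:00'
```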
---
## 3. ZFS Tuning (If Applicable)
```bash
# Increase the ARC size so more metadata stays cached in RAM
# Check the current ARC maximum
cat /proc/spl/kstat/zfs/arcstats | grep c_max
# Set ARC max to e.g. 16 GiB (value in bytes) persistently
echo "options zfs zfs_arc_max=17179869184" > /etc/modprobe.d/zfs.conf
# Module options only apply on the next module load; run "update-initramfs -u"
# and reboot, or set it immediately at runtime:
echo 17179869184 > /sys/module/zfs/parameters/zfs_arc_max
# zfs_vdev_read_gap_limit (in bytes) controls how far apart two reads can be and
# still be merged into one physical I/O; the default is already 32768 (32 KiB),
# so only change it deliberately -- a value like "32" would effectively disable merging
echo "options zfs zfs_vdev_read_gap_limit=32768" >> /etc/modprobe.d/zfs.conf
```
Also ensure `ashift=12` was used when the pool was created (should be default for modern drives).
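To check what the existing pool actually uses (read-only; `zdb` ships with ZFS):
```bash
# Dump the cached pool configuration and pick out the ashift chosen for each vdev
zdb -C <pool-name> | grep ashift
```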
---
## 4. Consider RAID Level Change (Longer Term)
RAID5 with 4 disks gives you:
- 3 disks of usable space
- Random read IOPS roughly equal to 4 individual disks (each disk serves reads)
- But **parity calculation overhead** and the **write hole** problem
**RAID10 (mirror + stripe)** would give you:
- 2 disks of usable space (you lose capacity)
- But **2x the random read IOPS** compared to RAID5 for the same disk count
- And much better write performance (no parity calculation)
If you can tolerate the capacity loss, converting to RAID10 helps.
---
## TL;DR Priority List
| Priority | Action | Impact | Effort |
|----------|--------|--------|--------|
| 1 | **ZFS special vdev** (mirrored SSD) | Huge | Medium |
| 2 | ionice GC to idle class | Medium | Low |
| 3 | Schedule GC off-peak | Medium | Low |
| 4 | Increase ZFS ARC (RAM) | Medium | Low |
| 5 | RAID5 -> RAID10 migration | Large | High |
The **special vdev** is the single best investment you can make. Two mid-range NVMe SSDs in a mirror will transform GC performance without replacing your entire HDD array or losing storage capacity.
---
**User**
Let's analyze the ZFS special vdev option more deeply. PBS is an LXC container with a mount point that is part of my ZFS pool. How can I achieve the metadata mirror on SSD? Do I need special care in terms of redundancy?
---
**Cursor**
Great question -- the special vdev is powerful but has a critical "gotcha" around redundancy. Let me walk through your specific setup.
## Your Architecture
```
PVE Host
├── ZFS Pool (raidz1: sdc, sdd, sde, sdf) <-- HDD array
│ └── dataset (e.g., tank/pbs-data)
│ └── mounted into LXC container as bind mount
└── LXC Container (PBS)
└── /mnt/datastore/... (sees the ZFS dataset)
```
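To see which host dataset actually backs the container's mount point, you can check the container config on the PVE host (a sketch; `101` is a hypothetical container ID):
```bash
# List the container's root filesystem and mount points with their host paths
pct config 101 | grep -E '^(rootfs|mp)'
```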
The key insight: **the special vdev is a pool-level operation done on the PVE host**. The LXC container doesn't know or care -- it just transparently benefits from faster metadata access.
---
## Redundancy: This Is Critical
**If you lose the special vdev, you lose the ENTIRE pool.** Not just metadata -- everything. The pool becomes unimportable.
This is because metadata on the special vdev is **not duplicated** back to the main raidz1 vdev. It's the authoritative copy. This makes the special vdev fundamentally different from L2ARC or SLOG (where loss just means losing a cache).
### Redundancy rules:
| Pool main vdev | Special vdev minimum | Recommendation |
|---|---|---|
| raidz1 (single parity) | **mirror** (minimum) | mirror of 2 SSDs |
| raidz2 (double parity) | **mirror** or 3-way mirror | 3-way mirror ideally |
| mirror | **mirror** | mirror of 2 SSDs |
Since your pool is **raidz1** (tolerates 1 disk failure), a **mirrored pair of SSDs** gives you matching fault tolerance: you can lose 1 SSD and the pool keeps running.
---
## Step-by-Step Implementation
### 1. Identify your pool
On the **PVE host**:
```bash
# See your pool layout
zpool status
# Note the pool name (e.g., "tank", "rpool", "pbs-pool", etc.)
# and confirm it's the raidz1 with sdc/sdd/sde/sdf
```
### 2. Install and identify SSDs
Install two SSDs (same model/size recommended). Identify them by persistent path:
```bash
ls -la /dev/disk/by-id/ | grep -v part
```
**Always use `/dev/disk/by-id/` paths**, never raw `/dev/sdX` -- device letters can change on reboot.
### 3. Add the mirrored special vdev
```bash
# DRY RUN first -- add -n flag to simulate
zpool add -n <pool-name> special mirror \
/dev/disk/by-id/<ssd1-id> \
/dev/disk/by-id/<ssd2-id>
# If the dry run looks good, do it for real
zpool add <pool-name> special mirror \
/dev/disk/by-id/<ssd1-id> \
/dev/disk/by-id/<ssd2-id>
```
**WARNING:** Treat this operation as **permanent**. `zpool remove` can only evacuate a top-level vdev (special vdevs included) from a pool whose data vdevs are all mirrors; because this pool has a raidz1 vdev, the special vdev cannot be removed once added.
### 4. Configure small block allocation
```bash
# Find the dataset that's mounted into the PBS container
zfs list
# Set the small block threshold
zfs set special_small_blocks=64K <pool-name>/<dataset-name>
```
What this does:
- **All metadata** -> always goes to the special vdev (this is automatic, no setting needed)
- **Data blocks <= 64K** -> also go to the special vdev
- **Data blocks > 64K** -> stay on the HDD raidz1
For PBS, the metadata (chunk indexes, manifests, namespace info) is what GC hammers. The actual backup chunks are larger and stay on HDDs -- which is fine since sequential reads of large chunks are what HDDs do well.
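One sanity check worth doing: `special_small_blocks` must stay below the dataset's `recordsize` (128K by default), otherwise every data block counts as "small" and the backup chunks themselves start filling the SSDs:
```bash
# Both are per-dataset properties; keep special_small_blocks < recordsize
zfs get recordsize,special_small_blocks <pool-name>/<dataset-name>
```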
### 5. Verify
```bash
zpool status
```
You should now see something like:
```
pool: tank
state: ONLINE
config:
NAME STATE READ WRITE CKSUM
tank ONLINE 0 0 0
raidz1-0 ONLINE 0 0 0
sdc ONLINE 0 0 0
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
special
mirror-1 ONLINE 0 0 0
/dev/disk/by-id/ssd1 ONLINE 0 0 0
/dev/disk/by-id/ssd2 ONLINE 0 0 0
```
---
## What Happens Over Time
- **New writes**: metadata + small blocks immediately go to the SSDs
- **Existing data**: old metadata stays on the HDDs until the blocks are rewritten -- ZFS does not rebalance in place, and a scrub only reads and verifies data, it does not move it
- To migrate existing data you have to rewrite it, e.g. with `zfs send | zfs receive` into a new dataset (see the sketch below), or simply let PBS's normal chunk churn and prune/GC cycles rewrite it over time
- GC performance therefore improves immediately for new backups and progressively for old data
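A minimal sketch of such a rewrite, assuming hypothetical dataset names `tank/pbs-data` and `tank/pbs-data-new` and enough free space for a second copy:
```bash
# Snapshot the current dataset and copy it into a new one; the receive writes
# fresh blocks, so their metadata is allocated on the special vdev
zfs snapshot tank/pbs-data@migrate
zfs send tank/pbs-data@migrate | zfs receive tank/pbs-data-new
# Afterwards: stop PBS, point the container's mount point at the new dataset,
# and destroy the old dataset once everything checks out
```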
---
## SSD Sizing
Rule of thumb for PBS:
| Datastore size | Recommended SSD size (each) |
|---|---|
| < 5 TB | 120-250 GB |
| 5-20 TB | 250-500 GB |
| 20-50 TB | 500 GB - 1 TB |
| > 50 TB | 1 TB+ |
Metadata is typically 3-5% of pool size. With `special_small_blocks=64K`, add some buffer. It's better to oversize than undersize -- if the special vdev fills up, ZFS falls back to the main vdev (no data loss, just reduced benefit).
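If you want a measured number instead of a rule of thumb, `zdb` can report how much space metadata currently occupies (output format varies between OpenZFS versions, so treat this as a sketch):
```bash
# -b collects block statistics, -L skips leak detection so it finishes sooner
zdb -Lbbbs <pool-name>
# In the block-statistics table at the end, everything except the
# "ZFS plain file" data blocks approximates what would live on a special vdev
```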
---
## Summary Checklist
- [ ] Use **two SSDs in mirror** -- never a single SSD
- [ ] Use `/dev/disk/by-id/` paths
- [ ] Do a `zpool add -n` dry run first
- [ ] Set `special_small_blocks=64K` on the PBS dataset
- [ ] Consider enterprise SSDs with power-loss protection (data integrity matters for metadata)
- [ ] Remember that existing metadata only moves to the SSDs as data is rewritten -- the benefit grows over time
- [ ] No changes needed inside the LXC container -- it's transparent