BTRFS Defrag

I got more and more warnings about long read and write times on my main file store. It is a VM with two physical “spinning-disk” hard drives using BTRFS, with snapshots as a bonehead “oops” protection (actual backups are uploaded to Amazon). I figured I would try defragmenting them:

btrfs fi defrag -rv /srv/storage/Media/

I did this for my main stores and now these wait times are shorter. I think you can see from the graph to the left approximately when I did the defragmenting:

Notice that my backup host samba01 (to the right) is very even at 1 ms because it receives no read or write requests, being a read-only backup of samba03 (yes, I have a different VM on a different physical machine with different hard drives as a read-only backup of samba03, in addition to offsite backups). Only if samba03 goes down will Keepalived move the virtual IP address to samba01, at which point I would expect it to exhibit more of the behavior currently seen on samba03.

Moral of the story: you might need to defragment your BTRFS file systems from time to time.
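
If you want a rough before-and-after measure, filefrag gives per-file extent counts on btrfs too; something like the following lists the worst offenders (extent counts on compressed files can be misleading, so treat it as a hint rather than gospel):

find /srv/storage/Media -type f -exec filefrag {} + | sort -t: -k2 -rn | head    # most-fragmented files first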

Linux I/O troubleshooting

I was trying to find the source of some weird iowait on my main workstation and came across https://haydenjames.io/what-is-iowait-and-linux-performance/, which recommended the following:

  • iostat – try it with the -xm 2 options for extended statistics, in megabytes and in two-second intervals.
  • iotop – top-like I/O monitor. Try it with the -oPa options to show the accumulated I/O of active processes only.
  • ps – use auxf, then under the “STAT” column “D” usually indicates disk iowait.
  • strace – view the actual operations issued by a process. Read the strace man page.
  • lsof – aft

Not bad. The output of iostat -x is very wide, so iostat -xsm 2 can be useful; unlike iostat -sm 2 it still shows iowait.
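
Strung together, a quick first pass looks something like this (just the suggestions above; the awk column number assumes the standard ps aux layout where STAT is the eighth field):

iostat -xsm 2                # extended but short per-device stats, in MB, every 2 seconds
iotop -oPa                   # accumulated I/O of active processes only
ps auxf | awk '$8 ~ /^D/'    # processes in uninterruptible sleep, usually disk iowait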

iotop -oPa was great and showed that the btrfs-transaction kernel thread was the big writer.

With apt install inotify-tools I got some nice inotify-based tools (as the name would suggest). It turns out that no files were actually being written to the disks seeing the iowait, so I’ll demo it on another folder:

root@amd:~# inotifywatch -r /home/cjp/.cache/
Establishing watches...
Finished establishing watches, now collecting statistics.
^Ctotal  modify  close_write  open  create  delete  filename
53     33      5            5     5       5       /home/cjp/.cache/google-chrome/Default/Cache/Cache_Data/
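
The same package also includes inotifywait, which prints events as they happen instead of collecting counts; handy for catching a sporadic writer in the act (same example path as above):

inotifywait -m -r -e modify,create,delete,close_write /home/cjp/.cache/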

It’s probably just another btrfs write-amplification-thing-a-majig. This showed the same thing:

lsof +D /path/to/btrfsmount

I’ve seen this before, just not with zero writes to the filesystem… A 100 KB write turning into 500 KB I’ve seen. But 0 B turning into 500 KB? That’s new.

Memo to self: be careful about putting Btrfs on SSDs.

Ceph quorum shenanigans

God I love the word “shenanigans”. I can’t tell you how many files on my work laptop include that word.

Anyway… So I moved some VMs around to upgrade pve2 in order to get rid of an annoying issue with UEFI where the console won’t work. But I overloaded pve1, which promptly went and died. Well, rebooted… And then mon.pve1 couldn’t join the cluster again. I figured I could take that opportunity to upgrade Ceph, and the whole pve1 server actually.

No dice. Okay, delete mon.pve1 and add it again? No dice. Proxmox’s Ceph tools don’t like things being wonky. After (looking at the time) roughly an hour and a half of deleting and re-adding mon.pve1 several times, I finally got it to work by also running “ceph mon add pve1 192.168.1.21” on one of the quorate nodes.

That might seem obvious but can you find that in the guide for adding and removing mons? https://docs.ceph.com/en/latest/rados/operations/add-or-rm-mons/

I can’t. Also, why is it ceph mon add pve1 192.168.1.21 and not ceph mon add mon.pve1 192.168.1.21? The mon. prefix is included everywhere else. But this is why I run Ceph at home: to learn this stuff when it’s only me getting annoyed. My bosses are running Ceph in production, the poor dears. Obviously not on my rinky-dink setup crammed into a cupboard, but still… Nerve-wracking stuff.
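
For future me, the sequence that finally worked was roughly the following. The pveceph mon subcommands are the Proxmox 6 wrappers as far as I can tell; older releases spelled them destroymon/createmon.

pveceph mon destroy pve1            # remove the broken monitor, run from a working node
pveceph mon create                  # recreate the monitor, run on pve1 itself
ceph mon add pve1 192.168.1.21      # the missing piece, run on one of the quorate nodes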

Well, I guess I’d better get to upgrading pve2 then… Slightly behind schedule.

LV thin volume data usage

I noticed that a thin volume had weird data usage according to LVM: 94% of the data space was in use, but inside the VM it said 25%. I learned of a command called fstrim, which walks a filesystem and sends “discard” commands to the underlying storage to tell it which blocks aren’t actually in use. It didn’t work. I’m not sure whether the remount was sufficient, or that MariaDB was shut down, or maybe that I removed the noatime setting for the FS, but after some tinkering it worked:

root@cluster1:~# fstrim -av
/var/lib/mysql: 0 B (0 bytes) trimmed on /dev/sdb
/boot: 0 B (0 bytes) trimmed on /dev/sda2
/: 74.8 MiB (78364672 bytes) trimmed on /dev/mapper/ubuntu--vg-ubuntu--lv
root@cluster1:~# fsck
fsck         fsck.btrfs   fsck.cramfs  fsck.ext2    fsck.ext3    fsck.ext4    fsck.fat     fsck.minix   fsck.msdos   fsck.vfat    fsck.xfs
root@cluster1:~# fsck.ext4 /var/lib/mysql/
e2fsck 1.45.5 (07-Jan-2020)
fsck.ext4: Is a directory while trying to open /var/lib/mysql/

The superblock could not be read or does not describe a valid ext2/ext3/ext4
filesystem.  If the device is valid and it really contains an ext2/ext3/ext4
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
    e2fsck -b 8193 <device>
 or
    e2fsck -b 32768 <device>

root@cluster1:~# fsck.ext4 /dev/sdb
e2fsck 1.45.5 (07-Jan-2020)
/dev/sdb is mounted.
e2fsck: Cannot continue, aborting.

root@cluster1:~# systemctl stop mysql
root@cluster1:~# umount /var/lib/mysql
root@cluster1:~# fsck.ext4 /dev/sdb
e2fsck 1.45.5 (07-Jan-2020)
/dev/sdb: clean, 481/1966080 files, 1872010/7864320 blocks
root@cluster1:~# vim /etc/fstab
root@cluster1:~# mount /var/lib/mysql/
root@cluster1:~# df -h
Filesystem                         Size  Used Avail Use% Mounted on
udev                               415M     0  415M   0% /dev
tmpfs                              170M  1.1M  169M   1% /run
/dev/mapper/ubuntu--vg-ubuntu--lv   20G  7.4G   12G  40% /
tmpfs                              459M     0  459M   0% /dev/shm
tmpfs                              5.0M     0  5.0M   0% /run/lock
tmpfs                              459M     0  459M   0% /sys/fs/cgroup
/dev/loop0                          56M   56M     0 100% /snap/core18/1997
/dev/sda2                          976M  199M  711M  22% /boot
/dev/loop5                          71M   71M     0 100% /snap/lxd/19647
/dev/loop4                          68M   68M     0 100% /snap/lxd/20326
/dev/loop7                          56M   56M     0 100% /snap/core18/2066
/dev/loop1                          33M   33M     0 100% /snap/snapd/12057
/dev/loop6                          33M   33M     0 100% /snap/snapd/12159
tmpfs                              102M     0  102M   0% /run/user/1000
/dev/sdb                            30G  6.6G   22G  24% /var/lib/mysql
root@cluster1:~# fstrim -v /var/lib/mysql
/var/lib/mysql: 22.9 GiB (24544501760 bytes) trimmed
root@cluster1:~# systemctl restart mysql

But still no change in lvdisplay:

root@pve1:~# lvdisplay pve/vm-109-disk-1
  --- Logical volume ---
  LV Path                /dev/pve/vm-109-disk-1
  LV Name                vm-109-disk-1
  VG Name                pve
  LV UUID                ATDHHH-DXGG-KqtR-sLA6-XeUW-xa51-HLXxId
  LV Write Access        read/write
  LV Creation host, time pve1, 2020-09-10 20:25:44 +0200
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                30.00 GiB
  Mapped size            93.87%
  Current LE             7680
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:11

Damn it… This took a while. Apparently the big issue was that the VM definition in Proxmox had “discard=0” configured for that hard drive. So I had to change that, and then fstrim -av worked a lot better. It ran way slower, of course, since it actually ended up doing work.
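
The change can be made in the GUI or with qm set. A sketch of the latter; the storage name and bus/slot below are guesses based on the LV name, so check the VM config first:

qm config 109 | grep vm-109-disk-1                      # find the actual bus/slot and storage name
qm set 109 --scsi0 local-lvm:vm-109-disk-1,discard=on   # re-declare the disk with discard enabled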

root@cluster1:~# fstrim -av
/var/lib/mysql: 22.9 GiB (24514723840 bytes) trimmed on /dev/sdb
/boot: 676.1 MiB (708972544 bytes) trimmed on /dev/sda2
/: 11.9 GiB (12730437632 bytes) trimmed on /dev/mapper/ubuntu--vg-ubuntu--lv

And result:

root@pve1:~# lvdisplay pve/vm-109-disk-1
  --- Logical volume ---
  LV Path                /dev/pve/vm-109-disk-1
  LV Name                vm-109-disk-1
  VG Name                pve
  LV UUID                ATDHHH-DXGG-KqtR-sLA6-XeUW-xa51-HLXxId
  LV Write Access        read/write
  LV Creation host, time pve1, 2020-09-10 20:25:44 +0200
  LV Pool name           data
  LV Status              available
  # open                 1
  LV Size                30.00 GiB
  Mapped size            23.40%
  Current LE             7680
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:11

*§#% Ceph

So I updated my Proxmox servers a few weeks ago. One worked fine, but on two of them Ceph couldn’t start the OSDs for two drives. The processes segfaulted as they read through each disk, then started over. I had to blank the drives and resync them. Well, that’s not kosher! I was going to upgrade Proxmox anyway, so maybe it would be better with Ceph Nautilus, which comes with Proxmox 6.2. Does it bollocks!
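
For reference, blanking and re-adding an OSD goes roughly like this. The OSD id and device are placeholders, and pveceph osd create is the Proxmox 6 spelling (older releases used pveceph createosd):

ceph osd out osd.2                         # stop new data landing on it
systemctl stop ceph-osd@2                  # stop the OSD daemon
ceph osd purge 2 --yes-i-really-mean-it    # remove it from the CRUSH map and OSD list
ceph-volume lvm zap /dev/sdX --destroy     # wipe the drive
pveceph osd create /dev/sdX                # bring it back in and let it backfill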

I upgrade one server at a time: one server is blanked and installed with Proxmox 6.2, a new Ceph cluster is created with replica size 1, and I then rsync the data over from the old cluster. I had run this setup for 48 hours when a sudden reboot of the server left Ceph unable to read from two hard drives containing the CephFS. This time there was at least an error logged, not just a segfault:

2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:07:02.911572 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726262 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726295 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729917 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729969 currently queued for pg
2020-09-08 21:38:00.252 7fd75494f700 -1 osd.2 122 get_health_metrics reporting 6 slow ops, oldest is osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+read+known_if_redirected+full_force e119)
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+read+known_if_redirected+full_force e119) initiated 2020-09-08 21:06:50.724397 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:07:02.911572 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726262 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:21:50.726295 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729917 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 0 log_channel(cluster) log [WRN] : slow request osd_op(mds.0.364:7 2.14 2.844f3494 (undecoded) ondisk+retry+read+known_if_redirected+full_force e122) initiated 2020-09-08 21:36:50.729969 currently queued for pg
2020-09-08 21:38:01.256 7fd75494f700 -1 osd.2 122 get_health_metrics reporting 6 slow ops, oldest is osd_op(mds.0.364:4 2.6 2.4b2c82a6 (undecoded) ondisk+read+known_if_redirected+full_force e119)

Well, you get the idea. So that’s #¤§* it! It’s not that I can’t work around these issues, blank drives and bring them back into the cluster, but I have zero tolerance for buggy file systems and database engines. I run the old Luminous release, for which there has been ample time to iron out the kinks, and OSDs can’t start. I switch over to Nautilus: same but different.

I’ll run a toy Ceph cluster in a couple of VMs, but for actual workloads? How about no. This means standard logical volumes for the VMs and btrfs for my 2 TB of files. I’ll go back to my old btrfs script with the “send snapshot” functionality activated. I’ve run it without the send-snapshot part for a while as a DB backup solution (a plan that turned out less than ideal, which I should have predicted for such a high-IOPS application).
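
The snapshot-and-send pattern the script is built around is roughly this; the snapshot paths and the backup host are placeholders, not the script itself:

btrfs subvolume snapshot -r /srv/storage/Media /srv/storage/.snapshots/Media-$(date +%F)                 # read-only snapshot
btrfs send /srv/storage/.snapshots/Media-$(date +%F) | ssh backuphost btrfs receive /srv/backup/Media/   # ship it off-machine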

Memo to self: if running Ceph in production, set up a staging cluster that can be upgraded and beaten about before rolling out the same release to the real cluster.

GlusterFS

I first set up my new server arrangement with GlusterFS, which is easier to get up and running than Ceph and has great disaster-recovery properties, since it stores data in regular file systems on the hard drives. So if GlusterFS stops working with the error message PC LOAD LETTER, I can just copy all the files back from the filesystem used on each drive.

Buuut… GlusterFS is slow as tar. I tried to put a MySQL database on a virtual disk stored on GlusterFS and it did not go well: iowait was absurdly high. That made me set up a Percona XtraDB Cluster on each Proxmox node’s local storage instead, which worked well at first… but that’s another story. For metadata operations there’s a setting, readdir-ahead (https://staged-gluster-docs.readthedocs.io/en/release3.7.0beta1/Features/readdir-ahead/), which helps a bit, but it’s still slow.
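
Turning it on is a one-liner; the volume name here is a placeholder for whatever your volume is called:

gluster volume set myvol performance.readdir-ahead on    # enable readdir-ahead on the volume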

Apparently GlusterFS is okay for people streaming big files, like video editors. At the very least it should be good for archival purposes, though the need for clustered, expand-online storage for archives is probably not that common.