While working for a client recently, we discovered a running ZFS server that was not being monitored appropriately.
Now you might be looking at this and wondering WTF? I was, and it took me a bit to figure out everything going on and why those failover disks didn’t kick into action.
Looking at the image above, you can see that the zpool is composed of two RAIDZ2 arrays, both of which are having problems, but that the zpool has a number of spares available. When ZFS decides it has a degraded disk and has spare disks available, it automatically pulls in one of the spares and puts it into a mirror with the degraded disk. If you keep that in mind and look again at the image above, you can see that raidz2-0 has two degraded disks and one that is faulted. But you can also see that all three of those have been properly put into a mirror with a spare disk, so the data is safe.
Then, apparently, two disks on the raidz2-1 side became faulted. This is where it gets serious, because there were only three valid spare disks to begin with and all three have now been allocated to handle the degraded disks in raidz2-0. “But wait,” you say, “I see there are three more spare disks available.” This is true, but there is a small problem with these disks… they are Advanced Format, or 4K sector, disks. The other disks that make up the pool use 512 byte sectors. ZFS will not replace a 512 byte sector disk with a 4K sector disk. Something about geometry and maths.
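For the curious, here is my rough understanding of the "geometry and maths" as a tiny Python sketch. It does not talk to ZFS at all. It only models the rule that a vdev writes blocks of 2 to the power of its `ashift` property, and that a disk whose sectors are bigger than that block cannot stand in. The function names are made up for illustration, and the values (`ashift` of 9 for the 512 byte pool, 4096 for the spares) are my assumptions about this pool, not something read off the server.

```python
# Sketch only: models the sector-size rule, it does not query ZFS.
# Function names are hypothetical; ashift=9 is an assumption for a 512 byte pool.


def min_block_bytes(ashift: int) -> int:
    """Smallest block a vdev writes: 2 ** ashift bytes."""
    return 1 << ashift


def spare_fits(vdev_ashift: int, spare_sector_bytes: int) -> bool:
    """A spare is usable only if its sector is no larger than the vdev's block."""
    return spare_sector_bytes <= min_block_bytes(vdev_ashift)


if __name__ == "__main__":
    print(spare_fits(9, 512))    # 512 byte spare into a 512 byte vdev
    print(spare_fits(9, 4096))   # 4K spare into a 512 byte vdev
    print(spare_fits(12, 512))   # 512 byte spare into a 4K vdev
```

The takeaway is that the mismatch only bites in one direction: a smaller-sector disk can serve a larger-block vdev, but not the other way around, which is exactly the corner we were in.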
So there we were, on the brink of losing 30 terabytes of irrecoverable and extremely important data. We spent the next week nail-biting while we Spinrite’d old 3TB drives and shoved them into this thing as soon as they were available. Eventually we did succeed in getting this thing back into a non-critical state. We were lucky to be able to do so, and have since implemented multiple layers of controls and backup strategies to help prevent and, if necessary, recover from this type of incident.
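To give a flavour of what the most basic layer of monitoring could look like, here is a minimal Python sketch. This is not the tooling we actually deployed, just an illustration. It assumes the `zpool` binary is on the PATH and relies on `zpool list -H -o name,health`, which prints one pool per line with the name and health separated by a tab. The function names `unhealthy_pools` and `check` are made up for this example.

```python
import subprocess


def unhealthy_pools(listing: str) -> list[tuple[str, str]]:
    """Return (pool, health) pairs for every pool that is not ONLINE.

    `listing` is expected to look like the output of
    `zpool list -H -o name,health`: one pool per line, two columns.
    """
    problems = []
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        name, health = fields[0], fields[1]
        if health != "ONLINE":
            problems.append((name, health))
    return problems


def check() -> int:
    """Run zpool, print any unhealthy pools, return a cron-friendly exit code."""
    result = subprocess.run(
        ["zpool", "list", "-H", "-o", "name,health"],
        capture_output=True,
        text=True,
        check=True,
    )
    bad = unhealthy_pools(result.stdout)
    for name, health in bad:
        print(f"ZFS pool {name} is {health}")
    return 1 if bad else 0


if __name__ == "__main__":
    raise SystemExit(check())
```

Run from cron with mail on output, or wrapped as a check in whatever monitoring system you already have, even something this small would have flagged the pool as DEGRADED long before the second array started losing disks.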
TL;DR: monitoring is important, m’kay.
Originally published at https://smartaleksolutions.com on January 5, 2016.