While I was scanning the logs today, two of the scariest lines I've seen in a long time went flying past:
fileserver: 2009-Nov-25 17:29:08 ufs: [ID 879645 kern.notice] NOTICE: /usr: unexpected free inode 1147, run fsck(1M) -o f
fileserver: 2009-Nov-25 17:29:13 ufs: [ID 879645 kern.notice] NOTICE: /usr: unexpected free inode 1145, run fsck(1M) -o f
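With only two notices, reading the inode numbers off by eye is fine. If there were more of them, a small filter along these lines could collect the numbers into the comma-separated form that ncheck's `-i` option takes. This is only a sketch: the function name `extract_inodes` is made up for illustration, and it assumes the notices look exactly like the two above.

```sh
#!/bin/sh
# Hypothetical helper (not part of any Sun tool): read syslog lines on stdin,
# keep only the "unexpected free inode N" notices, and print the inode
# numbers as a sorted, de-duplicated, comma-separated list.
extract_inodes() {
    sed -n 's/.*unexpected free inode \([0-9][0-9]*\),.*/\1/p' |
        sort -n -u |
        paste -s -d, -
}

# Feed it the two notices from the log.
extract_inodes <<'EOF'
fileserver: 2009-Nov-25 17:29:08 ufs: [ID 879645 kern.notice] NOTICE: /usr: unexpected free inode 1147, run fsck(1M) -o f
fileserver: 2009-Nov-25 17:29:13 ufs: [ID 879645 kern.notice] NOTICE: /usr: unexpected free inode 1145, run fsck(1M) -o f
EOF
```

For the two lines above this prints `1145,1147`.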
Now, /usr is important for any system, but a primary file server isn't something we can just take down for an hour to run fsck a couple of times. Especially the day before a long weekend. So, how to fix this? Sun helpfully recommends rebooting into single-user mode from alternate media (actually, `boot net -s` from the OpenBoot PROM, but good luck getting that to happen on a thumper), so the Sun docs are out.
Or are they? Squirreled away in the See Also section of the fsck(1M) man page is clri(1M). This little utility should clear out the data with zeroes and release the inode back into the free list. This is (almost) exactly right, as it will just delete the offending files. But...there are important things in /usr, so what are those inodes supposed to be?
There's a utility for that. ncheck(1M) will look at the disk and generate a list of pathnames from inode numbers or inode numbers from pathnames. I had already generated a list of suspects from the errors `find` threw when run on /usr, but it's good to have confirmation.
fileserver# ncheck -i 1145,1147 /dev/md/rdsk/d20
/dev/md/rdsk/d20:
1145	/bin/hd
1147	/bin/hdadm
`ls -i` confirms that these paths do indeed map to these inode numbers, but ncheck is more complete in that it shows all the hard links to each file. So, now that I know the names of what I'm missing...what am I missing? These are symlinks into /opt/SUNWhd/hd/bin for the thumper and thor hard drive utilities hd and hdadm. The SUNWhd package installs them, so whenever triage is done, it's probably best if they are put back. Thankfully, they're just symlinks.
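For anyone who wants to see the inode-to-path relationship without touching a raw device, here is a rough sketch of the same cross-check done from the mounted side with `ls -i` and `find -inum`. The function name `paths_for_inode_of` and the scratch file names are invented for the example, and it assumes `mktemp -d` is available. Unlike ncheck, this walks the directory tree, so it only finds names that are still reachable.

```sh
#!/bin/sh
# Hypothetical helper: list every path under a directory that shares an
# inode with the given file (that is, the file itself plus its hard links).
paths_for_inode_of() {
    dir=$1
    file=$2
    # ls -i prints "<inode> <name>"; keep only the number.
    inum=$(ls -i "$file" | awk '{print $1}')
    # -xdev keeps find on one file system, since inode numbers are only
    # unique within a single file system.
    find "$dir" -xdev -inum "$inum" -print | sort
}

# Scratch demonstration: one file, one extra hard link, one unrelated file.
scratch=$(mktemp -d)
echo data > "$scratch/original"
ln "$scratch/original" "$scratch/hardlink"
echo other > "$scratch/unrelated"

# Prints the hardlink and original paths, but not the unrelated file.
paths_for_inode_of "$scratch" "$scratch/original"
rm -rf "$scratch"
```

Running it against a throwaway directory like this is a safe way to get a feel for the mapping before pointing anything at a production /usr.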
So, I don't need the data. The only problem with clri is that it won't muck with the directory listing for the file. That's not so much a failing of clri as a bonus (do one thing: clear a file), but if the directory isn't cleaned up, then fsck will fail just as hard. It may even try to undo all my hard work, and we can't have that. In a fit of insanity, I tried pulling up the directory in ed to see if I could do anything with it (it's just a file like everything else, right?), but that way lies even more madness. There must be a tool somewhere to do this (and be less prone to human error).
Back to the man pages, and I ran across fsdb(1M). It looks fairly useful, as it lets you debug the file system. If this is anything like gdb (and it slightly resembles it), this includes displaying various values, as well as editing them. But, the man page is very vague and doesn't give any indication of how to actually use it.
Enter the google. fsdb is a wrapper around the file system specific debugger, so the actual relevant man page is fsdb_ufs(1M). The syntax is based on something called adb, which looks like a horrible thing to program in. And I've written brainfuck before (editor's note: don't ever write anything in brainfuck).
So, the next step is to grab a test box and learn how to use this debugging tool. I hear tales on the intarblag that there is a zfs version, too. But for that, I could just have snapshots instead of a tarball in case I explode the real file system. And given some of the things you can do ("fill an area of disk with pattern p"), I fully expect to blow away a few test filesystems learning it.