Stories, essays, learning, and other considerations
By Jehan

There and back again: a voyage to inode hell

A few weeks ago, I was preparing a demo for a client that involved several machines, a small datacenter, and some fairly complicated remote operations. To easily access all those machines, I had set up a machine running Ubuntu Server to act as a “control center”. I had tmux running with four ssh sessions, each connected to one of the strategic machines. The Ubuntu server was also used to run benchmarks, transfer files, and so on. In particular, I was running jmeter on it to simulate traffic hitting a load balancer. A little while after one of those lengthy runs, I received a surprising (at least to me) error message.

Trusting error messages

The message occurred while I was transferring a small text file from one machine to another via the Ubuntu server, using scp:

scp: file.txt: no space left on device

The message is quite clear, but I knew I had plenty of space left on my server. So I didn’t trust it, and started looking for the problem somewhere else: what’s wrong with my command? Did I misspell something? Am I hitting the wrong machine? Is there an ssh key problem? I put scp in verbose/debug mode and kept digging in that direction. After half an hour of fiddling with every possible option, I finally yielded to scp and… decided to trust the error message: maybe, just maybe, I actually did not have enough space left on the device.

Checking that was easy: I connected to the machine and ran a simple:

df -h

The thing is, df showed me exactly what I expected: a very comfortable amount of space left on my device. At that point, I was back to my initial assessment: scp was going crazy. Which was remarkably stupid, given that scp is probably on any top-ten list of the most used Linux tools and thus scrutinized down to the last semicolon: in other words, a reliable tool one should probably trust. But hey, it was late and the frustration had been building up for a while: I wasn’t thinking with a clear mind anymore.

So, in order to prove my hypothesis (i.e. that scp was the culprit), I tried running some commands that would certainly work (sketched after the list):

  • copying a file with cp: it failed.
  • mkdir: failed as well.
  • touch: another fail.
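
For the record, the sanity checks looked roughly like this (the file and directory names are made up for illustration); every single one failed with the same “No space left on device” error:

# Hypothetical reproductions of the checks; each one failed with ENOSPC.
cp file.txt /tmp/file-copy.txt
mkdir /tmp/just-a-test
touch /tmp/just-a-test-file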

When even touch told me I didn’t have any space left, I finally got it: something was wrong with my file system, not with that wrongfully accused scp I had been so eager to blame.

Space is not just about storage

So I contacted everybody’s best friend in this kind of situation: my search engine. After some searching on the subject, I stumbled upon a thread mentioning inode usage: a critical aspect of unix filesystems I had entirely overlooked until then. In short, an inode is simply the data structure in which the system stores the necessary metadata about your files and folders. This means that you need an inode for every file and folder of your filesystem, however small they may be. And every filesystem has a finite number of inodes, typically fixed when it is created.
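
To make that a bit more concrete, you can look at what a single inode stores with stat (the file name here is just an example):

stat file.txt
# Among other things, this prints the inode number, size, permissions,
# owner and timestamps: everything about the file except its name and
# its actual content.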

So, here I am, doubtfully checking that usage:

df -i

Of course, I was pretty sure that it couldn’t be the issue in my case. We’ve all gone down that line of reasoning: “I never made that many files! A mistake? I never make that kind of mistake!”. Yeah. Right. By now, you’ve guessed the result:

Filesystem     Inodes   IUsed IFree IUse% Mounted on
/dev/sda6     6111232 6111232     0  100% /

Yep, every single inode of my system was used. Ouch. How on earth could that happen to me? I had been an amateur sysadmin for a few years, and I had never worried about inodes before! I guess I should have.

How did that happen to me?

You have probably seen a pattern emerge here: denial after denial, I had refused to accept what was right in front of my eyes until the proof was shown to me in the clearest possible way. So once again, I started blaming anyone or anything but myself:

It must be a bug. I can’t have that many files. I’m not the kind of guy to have rogue processes filling up folders!

Just to be sure, I nevertheless tried counting inode usage in the root folders of my filesystem, using a nifty command I found somewhere. Just in case.

for i in /*; do echo $i; find $i | wc -l; done

The concept is easy: if you see a directory where the command takes an unusually long time to complete, you’ve probably found your culprit. In my case, it stalled in /home/. Then in /home/myusername/. Then in /home/myusername/jmeter/.
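
On a healthy filesystem, a slight variation on the same idea gives numbers that are easier to compare, printing the count next to each directory and sorting the result (this is just a sketch of the same approach, not the command I actually used):

for i in /*; do printf '%8d %s\n' "$(find "$i" 2>/dev/null | wc -l)" "$i"; done | sort -n

Note that this version only prints once every directory has been scanned, so the original trick of watching where the loop stalls remains more useful when one folder is pathologically large.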

I knew jmeter did not have any subfolders: I had found my culprit.
How did it happen, though? The explanation was dead simple, and the culprit yours truly. I had erroneously configured jmeter to download and store a result file for each request it ran. And my jmeter had had 500 threads running requests continuously for several hours. To this day, I have no idea how many files were in there: from the number of inodes I freed, there were probably around 5 million of them.

Solving the problem: patience, patience, patience.

Now I knew what the problem was. I also knew I was working remotely, on a Saturday night: if I were to reboot the machine, I would certainly lose it forever. So I tried listing the contents of the folder (ls): the command hung. I tried deleting the folder (rm -r): that command hung as well. In both cases, nothing had happened after 45 minutes. And even after hitting Ctrl + C, I never managed to stop either of them: they had to be killed brutally (kill -9).
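
As a side note, part of what makes ls choke on a directory like this is that, by default, it reads every entry and sorts the whole list before printing anything. If you only need a peek inside a monster directory, skipping the sort is usually much cheaper; this is a generic trick, not something I tried that night:

ls -f jmeter/ | head
# -f disables sorting (and implies -a), sparing ls from sorting millions
# of names; piping to head keeps the output manageable.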

How do you delete the files in a folder whose contents you can’t even list? Again, my search engine of choice was there to help, and gave me the magical command I needed:

find jmeter/ -exec rm -f {} \;

In short, the command deletes the files it finds in the dreaded folder, one by one (if you run it, don’t worry about the message stating rm: cannot remove `jmeter/': Is a directory: that’s just find handing the directory itself to rm, which refuses to delete it).
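
If you ever land in the same situation, a couple of variants of this command should be noticeably faster, since they avoid spawning one rm process per file; I haven’t timed them against a folder of that size, so take this as a sketch:

# Batch many file names into each rm invocation instead of one per file:
find jmeter/ -type f -exec rm -f {} +

# Or, with GNU find, skip rm entirely and let find unlink the files:
find jmeter/ -type f -delete

The -type f filter also means find no longer hands the jmeter/ directory itself to rm, so the harmless error message above disappears.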

In my case, at first it didn’t seem to work at all. A bit desperate about the whole thing, I made the wise choice of letting it run while I went on to do something else. When I came back to it, a miracle had happened: I had a few thousand free inodes! As it turns out, the command is slow, but it works. And it doesn’t hang.

When I say slow, I mean really slow. After five days, the inode usage had fallen to 65%. And it was still too early to ls the directory.

df -i
Filesystem     Inodes   IUsed   IFree IUse% Mounted on
/dev/sda6     6111232 3940214 2171018   65% /

Epilogue

The command took 8 days of continuous running to complete. My inode usage had gone down to approximately 10%. What did I learn from it? Two things:

  1. Try trusting the tool before trusting yourself.
  2. Don’t rush things. I was prudent enough to try solving the problem without resorting to drastic actions like rebooting the server. It was the best decision to make. If I had rebooted, I would have lost it entirely.