There and back again: a voyage to inode hell
A few weeks ago, I was preparing a demo for a client that involved several machines, a small datacenter and some complicated remote operations. In order to easily access all those machines, I had set up a machine running Ubuntu Server to serve as a “control center”. I had a tmux session running with four ssh sessions, each connected to one of the strategic machines. The Ubuntu server was also used to run benchmarks, transfer files, and so on. In particular, I was running jmeter on it to simulate traffic hitting a load balancer. A little while after one of those lengthy runs, I received a surprising (at least to me) error message.
Trusting error messages
The message occurred while I was transferring a small text file from one machine to another via the Ubuntu server, using scp. It boiled down to:
    No space left on device
The message is quite obvious, but I knew I had plenty of space left on my server. So I didn’t trust it, and started looking for the problem elsewhere: what’s wrong with my command? Did I misspell something? Am I hitting the wrong machine? Is there an ssh key problem? I put scp in verbose/debug mode and kept searching for another explanation. After half an hour of fiddling with every possible option, I finally gave up and… decided to trust the error message: maybe, just maybe, I actually did not have enough space left on the device.
Checking that was easy: I connected to the machine and ran a simple:
    df -h
The thing is, df showed me what I expected: a very comfortable amount of space left on my device. At that point, I was back to my initial assessment: scp is going crazy. Which is a remarkably stupid conclusion, given that scp is probably in the top ten of most used Linux tools and thus scrutinized down to the last semicolon: in other words, a reliable tool one should probably trust. But hey, it was late and the frustration had been building up for a while: I wasn’t thinking with a clear mind anymore.
So, in order to prove my hypothesis (i.e. scp is the culprit), I tried to run some commands that would certainly work:
- copying a file with cp: it failed.
- mkdir: failed as well.
- touch: another fail.
When I reached the point where a touch told me I didn’t have any space left, I finally got it: something was wrong with my filesystem, not with that wrongfully accused scp I had been so eager to blame.
Space is not just about storage
So I contacted everybody’s best friend in this kind of situation: my search engine. After some searching on the subject, I stumbled upon a topic mentioning inode usage: a critical aspect of Unix filesystems I had entirely overlooked until then. In short, an inode is simply the data structure in which the system stores the necessary metadata about your files and folders. This means that you need an inode for every file and folder of your filesystem, however small they may be. And there’s a finite number of inodes on every filesystem, usually fixed when the filesystem is created.
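If you have never looked at them, inodes are easy to see in action. A quick sketch (the file and device names here are just examples, and tune2fs only makes sense for ext2/3/4 filesystems):

    # every file has an inode number; ls -i prints it
    ls -i /etc/hostname
    # stat shows that inode along with the metadata stored in it
    stat /etc/hostname
    # on ext filesystems, the total number of inodes is fixed at mkfs time
    sudo tune2fs -l /dev/sda1 | grep -i 'inode count'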
So, here I am doubtfully checking that usage, using:
    df -i
Of course, I was pretty sure that it couldn’t be the issue in my case. We’ve all gone through that kind of reasoning: “I never made that many files! A mistake? I never make that kind of mistake!”. Yeah. Right. By now, you’ve guessed the result.
Yep, every single inode of my system was used. Ouch. How on earth could that happen to me? I had been an amateur sysadmin for a few years, and I had never worried about inodes before! I guess I should have.
How did that happen to me?
You have probably seen a pattern emerge here: denial after denial, I had refused to accept what was right in front of my eyes until the proof was irrefutably shown to me in the clearest way possible. So once again, I started blaming anyone or anything but me:
It must be a bug. I can’t have that many files. I’m not the kind of guy to have rogue processes filling up folders!
Just to be sure, I nevertheless tried counting inode usage in the top-level folders of my filesystem, using a nifty command I had found somewhere (something along these lines). Just in case.
    for i in /*; do echo $i; find $i | wc -l; done
The concept is easy: if you see a directory where the command takes an unusually long time to complete, you’ve probably found your culprit. In my case, it blocked in /home/. Then in /home/myusername/. Then in /home/myusername/jmeter/.
I knew jmeter did not have any subfolders: I had found my culprit.
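For the record, recent versions of GNU coreutils make this kind of hunt more direct: du can count inodes instead of disk blocks. Something along these lines should do, assuming du 8.22 or newer:

    # inode count per top-level directory, biggest offenders last
    sudo du --inodes -x -d 1 / 2>/dev/null | sort -n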
How did it happen, though? The explanation was dead simple, and the culprit was yours truly. I had erroneously configured jmeter to download and store a result file for each request it was running. And my jmeter had had 500 threads running requests continuously for several hours. To this day, I have no idea how many files were in there: judging from the number of inodes I freed, probably around 5 million of them.
Solving the problem: patience, patience, patience.
Now I knew what the problem was. And I knew I was working remotely, on a Saturday night. If I were to reboot the machine, I would certainly lose it forever. So I tried listing the contents of the folder (ls): the command hung. I tried deleting the folder (rm -r): the command also hung. In both cases, nothing had happened after 45 minutes. And even after hitting Ctrl + C, I never managed to stop either of them: they had to be killed brutally (kill -9).
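In hindsight, the hang is not surprising: by default, ls reads the whole directory and sorts it before printing a single line, which is hopeless with millions of entries. If you only need a peek, or a rough count, turning the sort off helps enormously. A sketch, using the same jmeter folder:

    # -f disables sorting (and implies -a), so entries stream out immediately
    ls -f jmeter/ | head
    # a rough count of entries, without ever sorting them
    ls -f jmeter/ | wc -l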
How do you delete the files in a folder whose contents you can’t even list? Again, my search engine of choice was there to help, and gave me the magical command I needed, which looked something like this:
    find jmeter/ -exec rm {} \;
In short, the command deletes the files it finds in the dreaded folder, one by one (if you run it, don’t worry about the message stating rm: cannot remove `jmeter/': Is a directory).
In my case, at first it didn’t seem to work. A bit desperate about the whole thing, I made the wise choice to let it run while I went on to do something else. When I came back, a miracle had happened: I had a few thousand free inodes! As it turns out, the command is slow, but it works. And it doesn’t hang.
When I say slow, I mean really slow. After five days, the inode usage had fallen to 65%. And it was still too early to ls the directory.
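In hindsight, a large part of the slowness comes from spawning a separate rm process for every single file. With a GNU find, the built-in -delete action (or batching through xargs) does the same cleanup with far fewer processes; a sketch of what I would try today rather than what I actually ran:

    # delete regular files only, without spawning an rm per file
    find jmeter/ -type f -delete
    # or batch many files into each rm invocation
    find jmeter/ -type f -print0 | xargs -0 rm -f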
Epilogue
The command took 8 days of continuous running to complete. My inode usage had gone down to approximately 10%. What did I learn from it? Two things:
- Try trusting the tool before trusting yourself.
- Don’t rush things. I was prudent enough to try solving the problem without resorting to drastic actions like rebooting the server. It was the best decision to make: if I had rebooted, I would have lost the machine entirely.