Monday, March 15, 2010

which process is using the most IO?

Discovering which process is using the most IO on a Linux server is something everyone runs into sooner or later.  If you have a somewhat modern variant of Linux, running a kernel of at least version 2.6.20, then you are in luck, as the kernel keeps per-process IO statistics that you can read out of /proc.

The easiest way to monitor the hungriest IO process is with the wonderful monitoring tool 'dstat', written by Dag Wieers, maintainer of the Dag RPM archive.  You can grab dstat from http://dag.wieers.com/home-made/dstat/

Once you have dstat installed, run 'dstat --top-io', or 'dstat --bw --top-io' if you have a white background terminal (the --bw flag picks colours readable on white).  Dstat will then print a line every second (the default interval) showing the process doing the most IO.  Here's an example on my home server, showing the small load created by Winamp scanning my Samba share for new mp3s:
root@calcifer:~# dstat --bw --top-io
----most-expensive----
     i/o process      
init [3]    623k 7154B
smbd        223k  200k
smbd        235k  205k
smbd        238k  210k
smbd        232k  204k

The great thing about dstat is that its plugins, like the one displaying IO above, are all written in Python.  As Python is fairly easy to read if you're familiar with any programming language, we can figure out just where in the system the IO stats are stored.

The plugins for dstat are stored in /usr/share/dstat on my system.  Viewing the source for 'dstat_top_io.py'  gives us the following information:
http://svn.rpmforge.net/svn/trunk/tools/dstat/plugins/dstat_top_io.py
  • Line 16 checks for the existence of /proc/self/io
  • Line 22 loops over a list of current processes returned from the function 'proc_pidlist'
  • Line 31 grabs the process name from /proc/PID/stat
  • Lines 34 through 44 grab the IO stats from /proc/PID/io
  • Lines 49 through 61 get the highest IO process and display it
From that we can figure out that /proc has a wealth of information for us to browse.  If we can't install dstat (maybe we're diagnosing a locked-down system that cannot be changed), we can still write a small script of our own to grab the IO stats for the currently running processes.
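As a starting point, here's a rough Python sketch of that idea (my own quick script, not part of dstat): it walks /proc and prints the read_bytes and write_bytes counters from /proc/PID/io for every process it's allowed to read.  Run it as root, since /proc/PID/io is normally only readable by the process owner:

#!/usr/bin/env python
# Rough sketch, not part of dstat: print the read_bytes and write_bytes
# counters from /proc/PID/io for every process we are allowed to read.
import os

for pid in sorted(int(d) for d in os.listdir('/proc') if d.isdigit()):
    try:
        # The process name is the field in parentheses in /proc/PID/stat
        with open('/proc/%d/stat' % pid) as f:
            name = f.read().split('(', 1)[1].rsplit(')', 1)[0]
        io = {}
        with open('/proc/%d/io' % pid) as f:
            for line in f:
                key, value = line.split(':')
                io[key] = int(value)
        print('%6d %-16s read_bytes: %12d  write_bytes: %12d'
              % (pid, name, io['read_bytes'], io['write_bytes']))
    except (IOError, OSError):
        continue  # process exited while we were looking, or permission denied

Keep in mind those counters are running totals since each process started, so on their own they only tell you who has done a lot of IO, not who is doing it right now.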

An Example

Say we have a Linux server that does multiple jobs; it stores home directories and also serves as a database server, DNS host, backup server, and terminal server.  We've received a complaint that the server seems slow today, so we start to look at the problem.  The load on the server is definitely high, but we're not sure which process is really overloading the system.

First we need to confirm that we're looking for the right thing; tools like 'iostat' and 'sar' can show us per-device IO stats (which will be covered in a different post!).  Once we have per-device IO stats, we can try to figure out which processes are using the busy device ('lsof' would help here).  In this example, our guess is that Samba is thrashing a user directory volume shared to lots of Windows users.

So from that, we think we should be monitoring the smbd processes on the system.  Using 'ps -C smbd -o pid --no-heading' we get our list of smbd processes.  We could take that list and look at each process manually, but it's much more fun (for a given value of fun) to automate this a little more.  We can pipe the output of ps into xargs and then grab the read bytes from /proc/PID/io for each process.  With a little xargs trickery, we get a somewhat readable list of processes and the amount of data they have read:
root@calcifer:~# ps -C smbd -o pid --no-heading | xargs -I {} bash -c 'echo -n "{} -> "; cat /proc/{}/io | grep read_bytes' 
2426 -> read_bytes: 91269668864
2465 -> read_bytes: 0
30935 -> read_bytes: 378146816
31215 -> read_bytes: 132063232
31367 -> read_bytes: 94208

As you can see, the information presented is still raw and needs interpretation: read_bytes is a running total since each process started, not a rate.  One way to deal with that is to run the command above under 'watch' and see which processes' read (or write) counters are growing the fastest.  You could duplicate the functionality of the dstat top-io plugin in perl or a bash script (although I'd prefer the easy route and use dstat!).
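For the curious, here's a rough sketch of that idea in Python (the same language the dstat plugins are written in), rather than perl or bash: take two snapshots of read_bytes a second apart and report the process with the biggest difference.  The one-second interval and the focus on reads rather than writes are just my assumptions; adjust to taste.

#!/usr/bin/env python
# Rough sketch of the dstat top-io idea: sample read_bytes for every process,
# wait a bit, sample again, and report the process that read the most bytes
# in between.  Run as root so /proc/PID/io is readable for all processes.
import os, time

INTERVAL = 1  # seconds between samples

def read_counters():
    counters = {}
    for d in os.listdir('/proc'):
        if not d.isdigit():
            continue
        try:
            with open('/proc/%s/io' % d) as f:
                for line in f:
                    if line.startswith('read_bytes:'):
                        counters[int(d)] = int(line.split(':')[1])
        except (IOError, OSError):
            pass  # process went away, or permission denied
    return counters

def process_name(pid):
    try:
        with open('/proc/%d/stat' % pid) as f:
            return f.read().split('(', 1)[1].rsplit(')', 1)[0]
    except (IOError, OSError):
        return '?'

while True:
    before = read_counters()
    time.sleep(INTERVAL)
    after = read_counters()
    # Processes that appeared after the first sample are counted from zero
    top_pid = max(after, key=lambda p: after[p] - before.get(p, 0))
    print('%s (pid %d) read %d bytes in the last %d second(s)'
          % (process_name(top_pid), top_pid,
             after[top_pid] - before.get(top_pid, 0), INTERVAL))

Ctrl-C stops it.  It's nowhere near as pretty as dstat, but it shows there's no magic involved: everything comes straight out of /proc.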

From the output above, repeated over a few minutes, we decide that process 30935 is the culprit; its read_bytes value keeps increasing well beyond what we think is acceptable.  From there we can run 'lsof -p 30935' to get a list of the files it has open.

From here we have to use our imagination; perhaps a user is trying to store their mp3 collection in their home directory, or perhaps a department share is being used for very large Photoshop files.

I hope this post has been informative.  It's my first, so please comment if you feel I need to improve in some areas!
