The easiest way to monitor the hungriest IO process is via the wonderful monitoring tool 'dstat' by the maintainer of the Dag RPM archive. You can grab dstat from http://dag.wieers.com/home-made/dstat/
Once you have dstat installed, run 'dstat --top-io' (or 'dstat --bw --top-io' if you have a white-background terminal). Dstat will then start displaying, one line per sample, the process with the most IO. Here's an example on my home server, showing the small load created by Winamp scanning my Samba share for new mp3s:
root@calcifer:~# dstat --bw --top-io
----most-expensive----
     i/o process
init [3]     623k 7154B
smbd         223k  200k
smbd         235k  205k
smbd         238k  210k
smbd         232k  204k
The great thing about dstat is that its plugins, like the one displaying IO above, are all written in Python. As Python is fairly easy to read if you're familiar with any programming language, we can figure out just where in the system the IO stats are stored.
The plugins for dstat are stored in /usr/share/dstat on my system. Viewing the source for 'dstat_top_io.py' gives us the following information:
http://svn.rpmforge.net/svn/trunk/tools/dstat/plugins/dstat_top_io.py
- Line 16 checks for the existence of /proc/self/io
- Line 22 loops over a list of current processes returned from the function 'proc_pidlist'
- Line 31 grabs the process name from /proc/PID/stat
- Lines 34 through 44 grab the IO stats from /proc/PID/io
- Lines 49 through 61 get the highest IO process and display it
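The steps above are easy to reproduce ourselves. Here is a rough sketch (my own code, not the plugin's) that walks /proc the same way: list the numeric PID directories, pull the process name out of /proc/PID/stat, and parse the counters in /proc/PID/io:

```python
import os
import re


def parse_io_stats(text):
    """Parse the 'key: value' pairs from a /proc/PID/io file into a dict of ints."""
    stats = {}
    for line in text.splitlines():
        key, _, value = line.partition(':')
        if value:
            stats[key.strip()] = int(value.strip())
    return stats


def pid_list():
    """Return every numeric entry in /proc, i.e. the current PIDs."""
    return [int(d) for d in os.listdir('/proc') if d.isdigit()]


def process_name(pid):
    """Extract the command name (the parenthesised second field) from /proc/PID/stat."""
    with open('/proc/%d/stat' % pid) as f:
        return re.search(r'\((.*)\)', f.read()).group(1)


if __name__ == '__main__':
    for pid in pid_list():
        try:
            with open('/proc/%d/io' % pid) as f:
                stats = parse_io_stats(f.read())
        except IOError:  # process exited, or we lack permission to read its io file
            continue
        print('%5d %-15s read=%d write=%d' % (
            pid, process_name(pid), stats['read_bytes'], stats['write_bytes']))
```

Note that /proc/PID/io is only readable for your own processes unless you're root, which is why the script silently skips permission errors.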
An Example
Suppose we have a Linux server that handles multiple tasks: it stores home directories and also serves as a database server, DNS host, backup server, and terminal server. We've received a complaint that the server seems slow today, so we start to investigate. The load on the server is definitely high, but we're not sure which process is actually overloading the system.
First we need to confirm that we're looking in the right place; tools like 'iostat' and 'sar' can show us per-device IO stats (which will be covered in a different post!). Once we have per-device IO stats, we can try to figure out which processes are using that device ('lsof' helps here). Perhaps our guess is that Samba is thrashing a user-directory volume shared to lots of Windows users.
So from that, we think we should be monitoring the smbd processes on the system. Using 'ps -C smbd -o pid --no-heading' we get our list of smbd processes. We could inspect each one manually, but it's much more fun (for a given value of fun) to automate this a little. We can pipe the output of ps into xargs and read the read_bytes counter from /proc/PID/io. With a little xargs trickery, we get a somewhat readable list of processes and the amount of data each has read:
root@calcifer:~# ps -C smbd -o pid --no-heading | xargs -I {} bash -c 'echo -n "{} -> "; cat /proc/{}/io | grep read_bytes'
2426 -> read_bytes: 91269668864
2465 -> read_bytes: 0
30935 -> read_bytes: 378146816
31215 -> read_bytes: 132063232
31367 -> read_bytes: 94208
As you can see, the information presented is still raw and needs interpretation: these counters are cumulative since process start, so what matters is how fast they grow, not their absolute values. One way to see that is to run the command above under 'watch' and observe which processes are reading (or writing) the most bytes over time. You could duplicate the functionality of the dstat top-io plugin in Perl or a bash script (although I'd prefer the easy route and use dstat!).
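To make that delta-over-time idea concrete, here's a small sketch (again my own code, not dstat's plugin) that samples read_bytes for a set of PIDs every few seconds and reports which one read the most in the last interval:

```python
import sys
import time


def read_bytes(pid):
    """Return the cumulative read_bytes counter for a PID, or None if the
    process has exited or we lack permission to read its io file."""
    try:
        with open('/proc/%d/io' % pid) as f:
            for line in f:
                if line.startswith('read_bytes:'):
                    return int(line.split(':')[1])
    except IOError:
        return None
    return None


def snapshot(pids):
    """Sample read_bytes for each PID, dropping processes that have gone away."""
    sample = {}
    for pid in pids:
        rb = read_bytes(pid)
        if rb is not None:
            sample[pid] = rb
    return sample


def top_reader(prev, curr):
    """Given two {pid: read_bytes} snapshots, return the (pid, delta) pair with
    the largest increase, or None if no PID appears in both snapshots."""
    deltas = {pid: curr[pid] - prev[pid] for pid in curr if pid in prev}
    if not deltas:
        return None
    pid = max(deltas, key=deltas.get)
    return pid, deltas[pid]


if __name__ == '__main__':
    # PIDs on the command line, e.g. from: ps -C smbd -o pid --no-heading
    pids = [int(p) for p in sys.argv[1:]]
    prev = snapshot(pids)
    while True:
        time.sleep(5)
        curr = snapshot(pids)
        result = top_reader(prev, curr)
        if result:
            print('pid %d read %d bytes in the last 5s' % result)
        prev = curr
```

This is essentially what the dstat plugin does, just restricted to a fixed list of PIDs instead of scanning all of /proc.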
Repeating the command above over a few minutes, we conclude that process 30935 is the culprit: its read_bytes value keeps climbing faster than we'd expect. From there we can run 'lsof -p 30935' to get a list of the files it has open.
From here we have to use our imagination; perhaps a user is trying to store their mp3 collection in their home directory, or perhaps a department share is being used for very large Photoshop files.
I hope this post has been informative. It's my first, so please comment if you feel I need to improve in some areas!