Source: http://devo.ps/blog/2013/03/06/troubleshooting-5minutes-on-a-yet-unknown-box.html
First 5 Minutes Troubleshooting A Server
Back when our team was dealing with operations, optimization and scalability at our previous company, we had our fair share of troubleshooting poorly performing applications and infrastructures of various sizes, often large (think CNN or the World Bank). Tight deadlines, "exotic" technical stacks and a lack of information usually made for memorable experiences. The cause of the issues was rarely obvious; here are a few things we usually got started with. By the end of these first few minutes you should have a much better picture of the box, and you may even have spotted the actual root cause.

Get some context

Don't rush onto the servers just yet: you need to figure out how much is already known about the server and the specifics of the issue. You don't want to waste your time (trouble)shooting in the dark. A few "must haves" help here; the most convenient sources of information are, unfortunately, also the ones usually painfully absent. Tough luck: make a note to get this corrected, and move on.

Start by checking who is on the box. It's not critical, but you'd rather not be troubleshooting a platform others are playing with; one cook in the kitchen is enough. The shell history is always a good thing to look at, combined with the knowledge of who was on the box earlier on. Be responsible, by all means: being admin shouldn't allow you to break anyone's privacy. A quick mental note for later: you may want to update the HISTTIMEFORMAT environment variable.

Then identify the running services and whether they're expected to be running or not, and look at the various listening ports. I tend to prefer running the netstat queries separately (TCP, UDP, Unix sockets), mainly because I don't like looking at all the services at the same time. You can always match the PID of a process with the output of ps aux. We usually prefer to have more or less specialized boxes, with a low number of services running on each of them; if you see three dozen listening ports, you should probably make a mental note to investigate further and see what can be cleaned up or reorganized.

Next, look at the machine itself: CPU and RAM usage should answer a few questions, and since there are still a lot of bare-metal servers out there, identifying the hardware helps too. The IO-related commands below are very useful for analyzing the overall performance of your backend; have a look at Linux TCP tuning for some more pointers on how to tune your network stack. Finally, the application logs: there is a lot to analyze there, but it's unlikely you'll have time to be exhaustive at first, so focus on the obvious ones (the Application logs section below gives an example for a LAMP stack).

Who's there?
$ w
$ last
What was previously done?
$ history
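A minimal sketch of the fix mentioned just below: setting HISTTIMEFORMAT so that future history entries carry a timestamp (add it to your shell profile to make it stick; bash is assumed here).

```shell
# Timestamp each new history entry with date and time (bash).
# Only affects entries recorded from now on; persist it via ~/.bashrc.
export HISTTIMEFORMAT='%F %T '
```

In an interactive shell, `history` will then show something like `2013-03-06 09:15:01` in front of each command.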
You may want to update the HISTTIMEFORMAT environment variable to keep track of the time those commands were run. Nothing is more frustrating than investigating an outdated list of commands…

What is running?
$ pstree -a
$ ps aux
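If something looks off in the process list, a quick way to rank processes by resource usage; a sketch assuming procps-ng's ps with GNU-style --sort (BusyBox ps won't take these flags):

```shell
# Top 5 processes by resident memory, then by CPU (GNU/procps ps assumed).
ps aux --sort=-%mem | head -n 6
ps aux --sort=-%cpu | head -n 6
```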
ps aux tends to be pretty verbose, while pstree -a gives you a nice condensed view of what is running and who called what.

Listening services
$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp
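Once netstat gives you a PID, /proc lets you inspect the process without any extra tooling; a sketch using the current shell's PID as a stand-in for whatever PID netstat reported:

```shell
# Inspect a PID from netstat's last column via /proc (Linux).
pid=$$   # stand-in PID; substitute the one netstat reported
cat /proc/$pid/comm                      # short process name
tr '\0' ' ' < /proc/$pid/cmdline; echo   # full command line (NUL-separated args)
```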
netstat -nalp will do too, though. Even then, I'd omit the numeric option (IPs are more readable, IMHO). Matching the PID of a process with the output of ps aux can be quite useful, especially when you end up with 2 or 3 Java or Erlang processes running concurrently.

CPU and RAM
$ free -m
$ uptime
$ top
$ htop
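The same questions can be answered without top by reading /proc directly; a minimal sketch assuming Linux paths (MemAvailable needs kernel 3.14 or later):

```shell
# Load average vs. core count, plus available memory (Linux /proc).
cores=$(nproc)
read one five fifteen rest < /proc/loadavg
echo "load: $one (1m) $five (5m) $fifteen (15m) on $cores core(s)"
awk '/^MemAvailable/ { printf "MemAvailable: %d MiB\n", $2 / 1024 }' /proc/meminfo
```

As a rule of thumb, a 1-minute load persistently above the core count deserves a closer look.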
Hardware
$ lspci
$ dmidecode
$ ethtool
IO Performances
$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio
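If dstat isn't installed, a poor man's version of its --top-io view can be scraped from /proc; a sketch (you need rights to read other users' stats, and the counters are cumulative since process start, not a live rate):

```shell
# Rank processes by cumulative read+write bytes (rough dstat --top-io substitute).
for p in /proc/[0-9]*; do
  [ -r "$p/io" ] || continue
  rb=$(awk '/^read_bytes/ {print $2}' "$p/io")
  wb=$(awk '/^write_bytes/ {print $2}' "$p/io")
  echo "$(( ${rb:-0} + ${wb:-0} )) $(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -n 5
```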
dstat is my all-time favorite. What is using the IO? Is MySQL sucking up the resources? Is it your PHP processes?

Mount points and filesystems
$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D / /* beware not to kill your box */
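A quick check that often pays off: is any filesystem (nearly) full? A sketch using POSIX df output; the 90% threshold is an arbitrary choice, adjust to taste:

```shell
# Print filesystems that are more than 90% full (POSIX df -P columns).
df -P | awk 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 > 90) print $6 " at " use "%" }'
```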
Kernel, interrupts and network usage
$ sysctl -a | grep ...
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack /* may take some time on busy servers */
$ netstat
$ ss -s
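To answer the conntrack question concretely, compare the live entry count against the configured limit; a sketch, noting that the file locations vary with kernel version (nf_conntrack_* is the modern naming, ip_conntrack_* the legacy one):

```shell
# Current conntrack entries vs. the configured ceiling (Linux netfilter).
for f in /proc/sys/net/netfilter/nf_conntrack_count \
         /proc/sys/net/netfilter/nf_conntrack_max; do
  [ -r "$f" ] && printf '%s = %s\n' "${f##*/}" "$(cat "$f")"
done
ss -s 2>/dev/null || true   # per-state socket summary, if iproute2 is present
```

A count sitting close to the max means new connections are about to be dropped.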
Is conntrack_max set to a high enough number to handle your traffic? How many connections are sitting in the various TCP states (TIME_WAIT, …)? netstat can be a bit slow to display all the existing connections; you may want to use ss instead to get a summary.

System logs and kernel messages
$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth
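Rather than paging through everything, grep for the usual suspects first; a sketch (log file names vary by distribution, and dmesg may require privileges):

```shell
# Kernel ring buffer: OOM kills, segfaults, hardware errors.
dmesg 2>/dev/null | grep -iE 'oom|out of memory|segfault|error' | tail -n 20
# Syslog, wherever this distro puts it.
for log in /var/log/messages /var/log/syslog; do
  [ -r "$log" ] && grep -iE 'oom|out of memory|error' "$log" | tail -n 10
done
: # grep finding nothing here is good news, not a failure
```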
Cronjobs
$ ls /etc/cron* + cat
$ for user in $(cut -f1 -d: /etc/passwd); do crontab -l -u "$user"; done
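A variant of the per-user loop above that silences the noise from users without a crontab and tags each line with its owner (run it as root to see everyone's):

```shell
# Dump every user's crontab, prefixed with the username.
cut -d: -f1 /etc/passwd | while read -r user; do
  crontab -l -u "$user" 2>/dev/null | sed "s/^/$user: /"
done
```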
Application logs
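As a concrete starting point for the web tier, a status-code histogram of an access log; a sketch assuming the combined log format (where the status is field 9) and a hypothetical log path, so adjust both to your setup:

```shell
# Status-code histogram for an access log (path and format are assumptions).
log=/var/log/nginx/access.log   # hypothetical path; adjust to your vhost config
if [ -r "$log" ]; then
  awk '{ codes[$9]++ } END { for (c in codes) print codes[c], c }' "$log" | sort -rn
fi
```

A sudden pile of 5xx codes at the top of that list tells you where to dig next.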
In the case of a LAMP stack, for example:

Apache and/or Nginx: look for 5xx errors, and look for possible limit_zone errors.
MySQL: check mysql.log for traces of corrupted tables or an InnoDB repair process in progress; look at the slow logs and determine whether there are disk, index or query issues.
Varnish: with varnishlog and varnishstat, check your hit/miss ratio. Are you missing some rules in your config that let end-users hit your backend instead?

Conclusion

After these first 5 minutes (give or take 10) you should have a better understanding of the box. You may even have found the actual root cause; if not, you should be in a good place to start digging further, with the knowledge that you've covered the obvious.