Having to go to the server room to reset servers is the biggest headache for admins managing a cluster of Linux servers at a remote site. Either you can ping the server but cannot SSH to it, or you cannot even ping it. There are various reasons a Linux server may crash or become unreachable over SSH. The two most common in my experience are a badly behaving process that uses up almost all physical memory and swap, and a kernel panic. In this post, I describe several techniques I have learned for dealing with these kinds of failures so that I go to the server room less often.
Force Linux to reboot even if you cannot start a shell via SSH
If the server is too busy, creating a shell via SSH may fail even though sshd is alive. Sometimes you get lucky and can still execute commands remotely by passing them directly to ssh. In that case, you may try the magic SysRq key to force Linux to restart. Note that the b trigger reboots immediately, without syncing or unmounting the disks.
ssh root@server_home 'echo 1 > /proc/sys/kernel/sysrq; echo b > /proc/sysrq-trigger'
Reference: Force Linux to reboot.
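A gentler variant that is sometimes used is to trigger s (sync) and u (remount read-only) before b, so that less data is lost. The following is only a sketch of that idea: the helper name sysrq_reboot_command and the 2 second default delay are made up for illustration, and the sleep calls may themselves fail on a machine that can no longer start new processes.

```shell
#!/bin/sh
# Hypothetical helper (name and default delay are assumptions for illustration).
# Prints a command string that enables SysRq and then triggers:
#   s = sync dirty data to disk, u = remount filesystems read-only, b = reboot.
sysrq_reboot_command() {
    delay=${1:-2}   # seconds to wait between steps
    echo "echo 1 > /proc/sys/kernel/sysrq;" \
         "echo s > /proc/sysrq-trigger; sleep $delay;" \
         "echo u > /proc/sysrq-trigger; sleep $delay;" \
         "echo b > /proc/sysrq-trigger"
}

# Show what would be sent; nothing is executed on any server here.
sysrq_reboot_command
```

To actually use it, the printed string could be passed to ssh in the same way as the command above, for example ssh root@server_home "$(sysrq_reboot_command)".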
If the server disappears from the network after this command, it may be rebooting. Wait a while and it may come back.
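Instead of checking by hand, the waiting can be scripted. Below is a minimal sketch, assuming the server answers ICMP pings once it is back up. The function name wait_for_host, the PROBE override and the default timings are hypothetical, and the ping flags are the Linux iputils ones (-c for count, -W for the reply timeout in seconds).

```shell
#!/bin/sh
# Hypothetical helper: poll a host until it answers a probe or a deadline passes.
# PROBE may be set to another command that takes the host as its last argument;
# by default a single ping with a 2 second reply timeout is used.
wait_for_host() {
    host=$1
    timeout=${2:-300}    # give up after this many seconds
    interval=${3:-5}     # pause between probes, in seconds
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        # PROBE is left unquoted on purpose so that its flags are word-split.
        if ${PROBE:-ping -c 1 -W 2} "$host" >/dev/null 2>&1; then
            echo "$host is back after about ${elapsed}s"
            return 0
        fi
        sleep "$interval"
        elapsed=$((elapsed + interval))
    done
    echo "$host did not answer within ${timeout}s" >&2
    return 1
}
```

For example, wait_for_host server_home 600 10 would probe every 10 seconds for up to 10 minutes. The elapsed time only counts the pauses between probes, so it is a rough figure.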
Make Linux reboot automatically after a kernel panic
Sometimes you are unlucky and there is a kernel panic. Almost everything, including the network, stops working and you cannot connect to the server any more. That is bad, but it need not be too bad if we did some homework beforehand by configuring Linux to reboot itself after a kernel panic.
Linux has a nice feature that reboots the machine after a timeout when a kernel panic happens. Usually, it is disabled. As lazy system admins, we can turn it on by setting the kernel.panic kernel parameter.
For a running system:
# echo 20 >/proc/sys/kernel/panic
Here, 20 is the number of seconds before the kernel reboots. 0 means this feature is disabled.
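To confirm what the running kernel is currently set to, the same file can be read back. Here is a small sketch; the function name show_panic_timeout is hypothetical, and the optional file argument exists only so the logic can be tried on a copy. It also reports negative values, which the kernel documentation describes as "reboot immediately".

```shell
#!/bin/sh
# Hypothetical helper: describe the current kernel.panic setting.
# Reads /proc/sys/kernel/panic unless another file is given as $1.
show_panic_timeout() {
    file=${1:-/proc/sys/kernel/panic}
    value=$(cat "$file") || return 1
    case $value in
        0)  echo "panic auto-reboot: disabled" ;;
        -*) echo "panic auto-reboot: immediate" ;;
        *)  echo "panic auto-reboot: after ${value}s" ;;
    esac
}
```

Running sysctl kernel.panic prints the same raw value.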
To make the configuration persistent, you have at least 2 choices:
- add the kernel parameter panic=20 to your bootloader (grub or grub2) configuration.
- add kernel.panic = 20 to /etc/sysctl.conf.
I prefer the second method, which writes the configuration to /etc/sysctl.conf.
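The sysctl.conf method can be scripted so that it is safe to run more than once. This is a minimal sketch, not a definitive implementation: the function name set_panic_timeout is hypothetical, and it assumes the setting lives on a single uncommented kernel.panic line.

```shell
#!/bin/sh
# Hypothetical helper: set "kernel.panic = <seconds>" in a sysctl config file.
# Replaces an existing kernel.panic line, or appends one if there is none.
# Other keys such as kernel.panic_on_oops are left untouched.
set_panic_timeout() {
    seconds=$1
    conf=${2:-/etc/sysctl.conf}
    if grep -Eq '^[[:space:]]*kernel\.panic[[:space:]]*=' "$conf" 2>/dev/null; then
        tmp=$(mktemp) || return 1
        # Rewrite through a temporary file, then copy the content back so the
        # original file keeps its owner and permissions.
        sed -E "s/^[[:space:]]*kernel\.panic[[:space:]]*=.*/kernel.panic = ${seconds}/" \
            "$conf" > "$tmp" && cat "$tmp" > "$conf"
        rm -f "$tmp"
    else
        echo "kernel.panic = ${seconds}" >> "$conf"
    fi
}
```

Run as root, set_panic_timeout 20 followed by sysctl -p would write the setting and load it into the running kernel.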
For more details, please check How to make Linux automatically reboot after a kernel panic.
Auto reboot is good. It is even better if the server also notifies the admins after a reboot. The technique discussed in How to email admins automatically after Linux server starts makes the server send email notifications after reboots.
It makes use of @reboot cron jobs and mailx by adding an entry like
@reboot date | mailx -S smtp=smtp://smtp.example.com -s "`hostname` started" -r email@example.com firstname.lastname@example.org
For sending emails, you may refer to either Sending Email Using mailx in Linux Through Internal SMTP or Sending Email from mailx Command in Linux Using Gmail’s SMTP.
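The @reboot entry above can be installed with crontab -e, or scripted. Below is a minimal sketch of the scripted route, assuming the entry should be appended only if an identical line is not already present. The function name add_cron_entry is hypothetical. It only transforms text from standard input and does not touch any crontab by itself.

```shell
#!/bin/sh
# Hypothetical helper: read an existing crontab on stdin and print it back
# with the given entry appended, unless an identical line already exists.
add_cron_entry() {
    entry=$1
    existing=$(cat)
    if [ -n "$existing" ]; then
        printf '%s\n' "$existing"
    fi
    # -F: fixed string, -x: match the whole line, -q: quiet
    if ! printf '%s\n' "$existing" | grep -Fxq -- "$entry"; then
        printf '%s\n' "$entry"
    fi
}
```

With the entry stored in a variable such as ENTRY, it could then be installed with crontab -l 2>/dev/null | add_cron_entry "$ENTRY" | crontab -. Running it a second time leaves the crontab unchanged.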