Linux is our bread and butter. The CLI is where we live. We live and die by our LAMP stack.
We’ve all had Linux server failures, and I want to remind our readers that when Linux fails, it isn’t irrecoverable. There are many reasons a server might fail, but as long as you follow a few very simple rules, you will be able to get back up and running. All it takes is a little extra time, and some problem-solving skills.
If you’re reading this, then we can assume you already have the latter of the two.
Most Linux blogs and forums will tell you how to repair specific problems; this post is more about how to think when fixing those problems.
These rules are from my own personal experience, and may not reflect your own environment. Take it with a grain of salt, and hopefully you will find something here that can apply to your setup.
Rule #1 – Backup
This alone could be Rules 1, 2, and 3. Your backup is your savior. If you don’t back up regularly, you’re in for a world of hurt when the proverbial hits the fan. Take it from someone who knows. You don’t want to be caught with your pants down at crunch time.
If you have time to troubleshoot an issue, then you had time to set up regular backups in the beginning. Please, for the love of all that is holy – backup.
Setting up a backup is a simple process, and you can use any method you think is appropriate. With rsync you can back up your data to a remote server, or a small Bash script can have you remotely backing up your data over FTP in no time. Don’t forget that you can also use cron to handle the scheduling.
One thing to remember: don’t keep your backup on the same machine that is running live. If you do, and the machine fails, it will take the backups with it.
If you’re lazy, do a complete machine backup. This is easiest in a virtualized environment, where restoring a whole machine can take just a couple of clicks.
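As a sketch of the idea, here is a minimal backup script. The paths and the script name in the cron line are only examples – adapt them to your own setup:

```shell
#!/bin/sh
# Minimal backup sketch: archive a source directory into a
# date-stamped tarball under a backup directory.
# In production, point the destination at remote storage,
# not the live machine's own disk.
backup() {
    src=$1    # directory to back up, e.g. /var/www
    dest=$2   # backup target, e.g. a mounted remote share
    stamp=$(date +%Y-%m-%d)
    mkdir -p "$dest"
    tar -czf "$dest/backup-$stamp.tar.gz" \
        -C "$(dirname "$src")" "$(basename "$src")"
}
```

A crontab line such as `0 2 * * * /usr/local/bin/backup.sh` (the script path here is hypothetical) would then run it nightly at 2 a.m.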
Rule #2 – Reboot
I cannot tell you how often this has gotten me out of a severe jam.
When all else fails, reboot your machine. It’s surprising how often this will successfully restart the services that have failed. A lot of services fail after being patched or upgraded. Usually, this can be attributed to config files being altered in the new build, or leftover cache files.
The installation script for most applications or services will usually take care of this, and restart the service with the new configs. On occasion the script doesn’t, and a quick reboot is needed to make the switch.
Also, it gives the machine a chance to refresh any resident data in the RAM, which can sometimes allow a service to restart successfully.
Rule #3 – Read the Linux logs
If your server has fallen over, and you don’t know where to start – The Logs.
Always the logs. This is where you should always start your troubleshooting process after getting to your command line. You will be like Sherlock Holmes, and the Logs are where all of your clues are held.
The clues will not always be obvious, but they are there. Buried in the Logs you can ferret out some very interesting information about what could potentially be wrong with your machine.
The Logs are laid out in a very logical fashion, and you should immediately be able to narrow down where to look. If you have an issue with Apache, you’re not going to need to look at the postfix log – yet.
Assuming that Apache is faulting, here’s a neat little trick. Run this in your terminal:
tail -f /var/log/apache2/error.log
It will give you a live scrolling display of the Apache error log.
Now, all you need to do is replicate the error. You will be able to see the output in real time. This way, you can start to narrow down the issues, and get to fixin’!
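If the fault already happened and you can’t replicate it live, grep can dig the relevant lines back out of the log. A small sketch – the log path and the "error" pattern are just examples:

```shell
#!/bin/sh
# Sketch: pull the most recent lines mentioning "error" out of
# a log file, e.g. recent_errors /var/log/apache2/error.log
recent_errors() {
    grep -i 'error' "$1" | tail -n 20
}
```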
Rule #4 – Google is your friend
I’ve been blasted for this in the past, but the reality is that there are too many applications and potential error messages for a single person to remember every error and what it means.
While Rule #3 may help us to find out what the error is, it doesn’t tell us what the error means.
The chances of someone else having had the exact error you have are extremely high. If not the exact error, then something very similar.
Using Google can really help us find out if an error is relevant, or just system log noise.
Search for the exact phrase of the error, and remember to strip any custom paths or hex addresses, as they are usually only relevant to your system and will pollute your search results.
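That stripping step can even be automated. A rough sed sketch – the two patterns here are naive and only a starting point:

```shell
#!/bin/sh
# Sketch: scrub machine-specific detail from an error message
# before googling it - absolute paths and hex addresses.
scrub_error() {
    echo "$1" \
        | sed -e 's|/[A-Za-z0-9._/-]*||g' \
              -e 's/0x[0-9a-fA-F]*//g'
}
```

For example, `scrub_error 'segfault at 0x7f3a in /usr/lib/libfoo.so'` leaves just the generic words worth searching for.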
Make sure that the sources you are getting results from are good quality. Sources like AskUbuntu and StackOverflow are usually good. Tech blogs can also be helpful. Avoid forums, unless you get no other results, or they are relevant to your issue (for example Apache forums for an issue with Apache).
Rule #5 – Reinstall
Before you freak out, I don’t mean reinstall your OS. What I’m pointing to is a quick reinstall of your application. This isn’t always a simple task, but if you’re using prebuilt binaries, a simple reinstall with your favorite package manager should be enough.
For example, on Ubuntu – quickly backup your Apache conf files, and then run:
sudo apt install --reinstall apache2
This doesn’t remove Apache; it re-downloads the apache2 package and reinstalls its files in place, leaving the supporting packages and, in most cases, your modified config files untouched.
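The “quickly backup your Apache conf files” step above can be as simple as copying the config directory aside first. A sketch – /etc/apache2 is the Debian/Ubuntu default, so adjust for your distro:

```shell
#!/bin/sh
# Sketch: copy a config directory aside before reinstalling,
# e.g. backup_conf /etc/apache2
backup_conf() {
    cp -a "$1" "$1.bak-$(date +%Y%m%d)"
}
```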
Rule #6 – Configs
To begin with, back up the default configs. These can save you in a pinch.
If you’re not careful, config files can trip up even the most seasoned of server administrators. They are notorious for demanding exact syntax.
Go back and check your config files to make certain they are correct. Sometimes, it can be a misplaced character, or a space in the wrong place.
Occasionally, there are harder things to spot, like EOL (End Of Line) markers. For example, /etc/fstab requires a trailing EOL for it to be valid.
If all else fails, restore the default configs, and work step by step until you find the failing line.
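For the /etc/fstab case specifically, a missing trailing newline is easy to test for. A small sketch:

```shell
#!/bin/sh
# Sketch: succeed if a file's last character is a newline,
# e.g. ends_with_newline /etc/fstab
ends_with_newline() {
    [ "$(tail -c 1 "$1" | wc -l)" -eq 1 ]
}
```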
Rule #7 – Restore
The final refuge. If all else fails, restore your machine. Sometimes, this can be the only option left to you. Don’t be afraid to admit defeat, and bite the bullet.
Not all errors can be solved in a timely manner. On a live server, you may have very tight time constraints.
Restore the machine and monitor the logs closely. Hopefully, you will see it if it happens again.