Troubleshooting

Overview


While you've read about many troubleshooting scenarios throughout this book, it's the troubleshooting part of the Red Hat exams that I believe causes the most "fear and loathing" among Red Hat certification candidates.

Troubleshooting is a mindset based on experience and a systematic way of thinking. Troubleshooting strategies for the Red Hat exams start with the simplest problems, which you can check quickly, and then move on to more complex ones.

Red Hat has done excellent work addressing some problems that formerly led to unbootable systems. For example, flaws in the /etc/fstab file used to lead to an unbootable system. Now most users would hardly know the difference if this file is missing.

The most important troubleshooting tool is the linux rescue environment, which can bypass boot problems, from a missing GRUB boot loader to a missing kernel. In most cases, the first installation CD, booted into the linux rescue environment, can detect and mount even damaged installations of RHEL.

This topic focuses on the Troubleshooting and System Maintenance section of the RHCT and RHCE exams, as defined in the Inside the Exam sidebar. It further focuses on troubleshooting skills, as they evoke more concern than regular system maintenance.

This chapter includes a number of exercises for which you'll need the help of a partner. When you start an exercise, let your partner have your computer and wait until your system begins to reboot. This chapter includes enough exercises to allow you and your partner to take turns working with the system.


Troubleshooting and System Maintenance

  1. Define the question.

    Understand what happened. Note the error messages you see. If possible, analyze log files for other messages. If you've read this book and run the labs, you may recognize the problem and cause immediately.

  2. Gather information and resources.

    Analyze your system. This may require that you check the relevant configuration files to make sure that appropriate services are running and that security or other characteristics of your system are working as they should. If you have experience, you'll often recognize the problem and cause when you see something wrong in these areas.

  3. Form a hypothesis.

    If you're still not sure what's wrong, make your best guess. Remember that time is severely limited during the Red Hat exams, so if you can afford it, consider skipping a problem. (To qualify for either the RHCT or RHCE, you're required to solve all RHCT-level Troubleshooting and System Maintenance issues.)

  4. Perform experiments and collect data.

    Before performing any experiments, back up anything you might change. For example, if you think the problem is with your Samba configuration file, back up your /etc/samba/smb.conf file, in case your hypothesis makes things worse.

  5. Analyze data.

    This is essentially identical to step 1. If what you do doesn't solve the problem, you'll need to analyze what went wrong, using error messages and log files as appropriate.

  6. Interpret data and draw conclusions that serve as a starting point for new hypotheses.

    In many cases, you'll want to restore what you changed from the backup you made in step 4, repeat steps 2 through 4, and try again.

  7. Publish the results.

    Once you've solved the problem, you'll want to make sure the problem remains solved after rebooting your system. For example, if you've addressed a Samba problem, you'll want to "publish" by making sure the Samba daemon starts the next time your Linux system boots.
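The backup and restore in steps 4 and 6 can be sketched as a pair of small shell helpers. This is only a minimal sketch, not a prescribed exam procedure; the function names backup_config and restore_config and the .bak suffix are assumptions made for illustration.

```shell
# backup_config FILE: save a copy of FILE as FILE.bak before experimenting.
# cp -p preserves the mode and timestamps (and, as root, the ownership).
backup_config() {
    cp -p "$1" "$1.bak"
}

# restore_config FILE: put the saved copy back if the experiment made things worse.
restore_config() {
    cp -p "$1.bak" "$1"
}

# Example use, as root, on the Samba file mentioned in step 4:
#   backup_config /etc/samba/smb.conf
#   ...edit the file and test your hypothesis...
#   restore_config /etc/samba/smb.conf
# For step 7 on RHEL 5, "chkconfig smb on" sets the Samba service to start at boot.
```

Keeping the backup next to the original makes it easy to find under time pressure; just remember to remove stray .bak files from directories where a service reads every file it finds.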

Two places where you are likely to make errors that result in an unbootable system are the boot loader and init configuration files, /boot/grub/grub.conf and /etc/inittab. For example, identifying the wrong partition as the root partition (/) can lead to a kernel panic. Other configuration errors in /boot/grub/grub.conf can also cause a kernel panic when you boot Linux. Whenever you make changes to these files, the only way to fully test them is to reboot Linux.

The following scenarios and solutions list some problems that you may encounter during the boot process, along with possible solutions. The list is far from comprehensive. The solutions that I've listed work on my computer, as I've configured it. There may be (and often is) more than one possible cause, so these solutions may not work for you on your computer or on the Red Hat exams. To know what else to try, use your experience.

To get the equivalent of more experience, try additional scenarios (remember: never do these things on a production computer). Once you're familiar with the linux rescue environment, test these scenarios. These scenarios worked as shown when I tested them on RHEL 5. However, they lead to different errors on RHEL 4 and RHEL 3.

For the first scenario shown, change the name of the grub.conf file so it can't be loaded. Reboot and see what it does on your system. Use the linux rescue environment to boot into RHEL and use the noted solution to fix your system.

For the second scenario shown, overwrite the MBR; on a SATA/SCSI drive, you can do so with the following command (substitute hda for sda if your system uses an IDE/PATA drive):

# dd if=/dev/zero of=/dev/sda bs=446 count=1
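The block size of 446 matters: in a standard MBR, the first 446 bytes hold the boot loader code, the next 64 bytes hold the partition table, and the last 2 bytes hold the boot signature. Zeroing only 446 bytes therefore removes GRUB's first stage but leaves the partition table intact. Before trying this on a practice system, it may be prudent to save the whole first sector. The following is a sketch only; the function names save_mbr and wipe_boot_code are hypothetical, and conv=notrunc is there so that the same commands can be rehearsed safely on a disk image file, where dd would otherwise truncate the file.

```shell
# save_mbr DISK FILE: copy the first 512-byte sector
# (boot code, partition table, and signature) to FILE.
save_mbr() {
    dd if="$1" of="$2" bs=512 count=1 2>/dev/null
}

# wipe_boot_code DISK: zero only the first 446 bytes (the boot loader code).
# conv=notrunc keeps the rest of a regular file; it is harmless on a device.
wipe_boot_code() {
    dd if=/dev/zero of="$1" bs=446 count=1 conv=notrunc 2>/dev/null
}

# Example use, as root, on a practice system only:
#   save_mbr /dev/sda /root/mbr.backup
#   wipe_boot_code /dev/sda
```

Restoring the saved sector is the same dd command with if= and of= swapped, although for the exam scenario the intended fix is to reinstall GRUB.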

The third scenario is misleading; it's what happened when I overwrote my /bin/mount with /sbin/mount.nfs and rebooted.

The fourth scenario is what happened when I overwrote my /sbin/init command.

The fifth scenario is based on a missing /etc/inittab; I suspect it's much more likely that you'll see some major error in that file, such as a key command that has been commented out.

The sixth scenario is what happened when I set the default runlevel to 3 and commented out the lines with the mingetty directives in /etc/inittab.
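When tracing problems of this kind, two facts in /etc/inittab are worth checking first: the default runlevel (the id:N:initdefault: entry) and whether any uncommented mingetty lines apply to that runlevel. The helpers below are a rough sketch of that check; the function names default_runlevel and active_gettys are made up for illustration, and the parsing assumes the usual colon-separated id:runlevels:action:process layout.

```shell
# default_runlevel FILE: print the runlevel from the uncommented initdefault entry.
default_runlevel() {
    awk -F: '!/^[ \t]*#/ && $3 == "initdefault" { print $2; exit }' "$1"
}

# active_gettys FILE RUNLEVEL: count uncommented mingetty entries
# whose runlevel field (for example 2345) includes RUNLEVEL.
active_gettys() {
    awk -F: -v rl="$2" '
        !/^[ \t]*#/ && $4 ~ /mingetty/ && index($2, rl) { n++ }
        END { print n + 0 }' "$1"
}

# Example use (from the rescue environment, the file is typically
# under /mnt/sysimage/etc/inittab):
#   rl=$(default_runlevel /etc/inittab)
#   active_gettys /etc/inittab "$rl"
```

A count of 0 for the default runlevel suggests that nothing is starting a text console there.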



The seventh scenario is based on a typo in the root directive in /boot/grub/grub.conf.

Sometimes, you may run into a problem with the default runlevel. But you're not stuck. There are two ways to boot into different runlevels. You can boot directly from the GRUB configuration menu, or you can boot into the linux rescue environment from the first RHEL installation CD.

SCENARIO & SOLUTION

When you boot, you see a grub> prompt.

You may have a problem that prevents the boot loader from reading the GRUB configuration file, grub.conf. The file may be missing or corrupt. For hints on creating a new grub.conf, see menu.lst in the /usr/share/doc/grub-versionnum directory.

When you boot your computer, you see a message such as "Missing operating system" or "Operating System Not Found."

Your master boot record (MBR) has been erased, and you'll need to reload GRUB on the MBR using grub-install. (It's possible that everything has been erased, which I believe is beyond the scope of this part of the exam.)

During the boot process, you see the "Could not start the X server (graphical environment) due to some internal error" message.

You could have problems with a full or unmounted /tmp or /home directory. If these directories are not mounted, the mount command may be corrupt. In that case, you'll need to reload it from the mount RPM.

You see an "exec of init (/sbin/init) failed!!!" error.

Your init command may be corrupt. Try reloading it from the SysVinit RPM.

You see the "INIT: No inittab file found" message.

This is straightforward: there is something wrong with your /etc/inittab file. RHEL 5 prompts you to "Enter runlevel"; as of this writing, if /etc/inittab is missing, enter s to see a bash prompt.

You see a message

You may not have anything starting a text or GUI console in the active runlevel; trace it starting with /etc/inittab.

You see a message. Take careful note of the last file cited in the message.

RHEL has encountered some problems when reading the grub.conf configuration file. Start the linux rescue environment and check this file as well as the referenced files in the /boot directory.
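For the last scenario, checking the "referenced files in the /boot directory" can be partly automated. The sketch below reads the kernel and initrd lines from a grub.conf file and reports whether each named file exists. The function name check_grub_files is hypothetical, and the sketch assumes that /boot is a separate partition, so that paths in grub.conf (such as /vmlinuz-...) are relative to the /boot directory.

```shell
# check_grub_files CONF BOOTDIR: verify that every file named on an
# uncommented kernel or initrd line in CONF exists under BOOTDIR.
# Returns 0 if all files are present, 1 otherwise.
check_grub_files() {
    rc=0
    for f in $(awk '$1 == "kernel" || $1 == "initrd" { print $2 }' "$1"); do
        if [ -f "$2$f" ]; then
            echo "ok: $f"
        else
            echo "MISSING: $f"
            rc=1
        fi
    done
    return $rc
}

# Example use from the linux rescue environment, where the installed
# system is typically mounted under /mnt/sysimage:
#   check_grub_files /mnt/sysimage/boot/grub/grub.conf /mnt/sysimage/boot
```

A MISSING line usually points to a typo in grub.conf or to a kernel or initial RAM disk file that has been removed or renamed.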

Booting Into Different Runlevels

In brief, you can boot into the runlevel of your choice from the GRUB configuration menu. This is one of the RHCT Troubleshooting and System Maintenance skills and also an essential skill for all Linux administrators.

Table 16-1: Linux Runlevels

Runlevel        Description
0               Halts the system
1               Activates SELinux; runs /etc/rc.sysinit, which checks and mounts filesystems; executes all scripts in the /etc/rc1.d directory
s or single     Single-user mode; activates SELinux; runs /etc/rc.sysinit, which checks and mounts filesystems
emergency       Emergency boot mode; activates SELinux; mounts only the root (/) filesystem
init=/bin/sh    Emergency boot mode; mounts only the root (/) filesystem
2               Multiuser mode with some networking; does not include some NFS functions, the automounter, or CUPS
3               Multiuser mode with networking; boots into a text login console
4               Generally unused; however, the defaults support near-identical settings to runlevel 3
5               Multiuser mode with the X Window; boots into an X-based login screen
6               Reboots the system

The Red Hat Exam Prep guide states that "RHCTs should be able to boot systems into different run levels for troubleshooting and system maintenance." This is straightforward; at the boot loader prompt, you can start Linux at a different runlevel. This may be useful for two purposes. If your default runlevel in /etc/inittab is 5, your system normally boots into the GUI. If you're having problems booting into the GUI, you can start RHEL into the standard text mode, runlevel 3.

One other option to help rescue a damaged Linux system is single-user mode, which is appropriate if your system can find at least the root filesystem (/). Your system may not have problems finding its root partition and starting the boot process, but it may encounter problems such as damaged configuration files or an inability to boot into one of the higher runlevels. When you boot into single-user mode, options are similar to those of the standard linux rescue environment described later in this topic.

To boot into a different runlevel, first assume that you're using the default RHEL boot loader, GRUB. In that case, press (lowercase) p to enter the GRUB password if required. Type (lowercase) a to modify the kernel arguments. When you see a line similar to

grub append> ro root=LABEL=/ rhgb quiet

add one of the following options (single, init=/bin/sh, emergency, or 1) to the end of that line:

grub append> ro root=LABEL=/ single
grub append> ro root=LABEL=/ init=/bin/sh
grub append> ro root=LABEL=/ emergency
grub append> ro root=LABEL=/ 1

You can use the same technique to boot into another runlevel. For example, to boot from the GRUB boot loader into runlevel 3, navigate to where you can modify the kernel arguments, and add the runlevel number to the end of the line:

grub append> ro root=LABEL=/ 3
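After the system comes up, you may want to confirm which option the kernel actually received. On Linux, the arguments passed at boot are readable from /proc/cmdline, and on RHEL 5 the runlevel command reports the previous and current runlevels. The helper below is a small sketch that picks the runlevel-style option out of a kernel argument string; the function name requested_runlevel is an assumption for illustration, and it recognizes only the options listed in Table 16-1.

```shell
# requested_runlevel "ARGS": print the first runlevel-style option found
# in a kernel argument string; return 1 if none was given.
requested_runlevel() {
    # $1 is deliberately unquoted so that it splits into separate arguments.
    for arg in $1; do
        case $arg in
            [0-6sS]|single|emergency|init=*)
                echo "$arg"
                return 0
                ;;
        esac
    done
    return 1
}

# Example use on a running system:
#   requested_runlevel "$(cat /proc/cmdline)"
#   runlevel
```

If the function prints nothing, no runlevel was given at the boot prompt and init falls back to the default in /etc/inittab.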



On the Job

The terms boot loader and bootloader are used interchangeably. In this book, I've normally used the term boot loader, as that seems to be the direction of the Red Hat documentation. However, the term bootloader is still common even in Red Hat documentation.

When you boot into runlevel 1, no password is required to access the system. As you'll see later in this topic, running your system in this runlevel is somewhat similar to running a system booted in rescue mode. Many of the commands and utilities you normally use are unavailable. You may have to mount additional drives or partitions and specify the full pathname when running some commands.

When you have corrected the problem, you can reboot the system. Alternatively, you can type the exit command to boot into the default runlevel as defined in /etc/inittab, probably runlevel 3 or 5.


On the Job

In runlevel 1, any user can change the root password. You do not want people rebooting your computer to go into this runlevel to change your root password. Therefore, it's important to keep your server in a secure location. You can also password-protect GRUB or even the BIOS menu to keep anyone with physical access to your computer from booting it in single-user mode.
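With the legacy GRUB boot loader used on RHEL 5, password protection is set with a password directive in grub.conf, usually in the form password --md5 followed by a hash generated with the grub-md5-crypt command. As a quick audit, the sketch below checks whether a grub.conf file contains an uncommented password line. The function name grub_has_password is hypothetical, and the check only confirms that the directive is present, not that the hash is valid.

```shell
# grub_has_password CONF: succeed if CONF contains an uncommented line
# whose first word is the "password" directive.
grub_has_password() {
    grep -Eq '^[[:space:]]*password[[:space:]=]' "$1"
}

# Example use, as root:
#   if grub_has_password /boot/grub/grub.conf; then
#       echo "GRUB menu editing is password-protected"
#   else
#       echo "No GRUB password set"
#   fi
```

A password in the global section of grub.conf is what prompts for p before the kernel arguments can be edited, as described earlier in this section.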