1 OS, 2 Servers, 5... days?

At work, we're switching a number of our LAMP stack applications to be hosted on Ubuntu Server.  Because the LTS editions are more stable, we generally run those, so we're currently on Lucid Lynx (10.04).  This time, we're moving our Drupal CMS hosting from RHEL 5.4 to Ubuntu Server on two new servers that will be configured for high availability.  It turned out to take 5 days to do what would normally be done in half a day.

We're installing on a pair of IBM x3650 M2 servers, which use the LSI MegaRAID SAS controller that IBM brands as the M1015.  Because of the driver version in the Lucid CD kernel, the installation CD does not recognize the RAID controller (and thus the drives connected to it) [Bug 546091].  As a result, we had to get "creative" in order to install Ubuntu.

Attempt 1

On the first attempt, I attached an external USB drive (an HDD in an enclosure), thinking I would install to it, update the kernel, and reboot to gain access to the internal drives.  I could then drop to single-user mode, use dd to copy the USB drive onto the internal drive, and reboot from the internal drive.  FAILED because we could not get the x3650 to boot from the USB-attached hard drive.  (Some people on the IBM forums have reported this before.)

Attempts 2 & 3

Next, I attempted to boot from a LiveCD so that I could chroot, update, and then dd the image.  Knowing that the 10.04 LiveCD was useless, I grabbed the 10.10 install CD.  FAILED because the 10.10 install CD's kernel panics with an "unable to mount root" error when looking for the optical media (I intend to file a bug report on this).  I then downloaded an 11.04/Natty daily build and tried to boot from it to perform the same steps.  Much like the 10.10 install CD, this FAILED because of a kernel panic when mounting the root filesystem.

Attempts 4 & 5

Since RHEL 6 had just been released, I decided to try its install CD.  The RHEL 6 CD booted and could see the internal drives.  I used dd to copy the image over, then chrooted in and upgraded, but upon reboot, the UEFI in the system would not see a bootable disk.  It turns out that when a Fibre Channel HBA is installed and the OS is not installed in an EFI-compatible manner (MBR vs GPT), the HBA attempts to take over the boot process.  FAILED due to UEFI settings.

After finding a document on IBM's website that describes how to block the UEFI from loading the HBA first, I was able to boot from the internal drives, but then the system hung at an initramfs prompt.  The initramfs version of udev was looking for a libselinux.so, which it could not find.  This is probably because RHEL uses SELinux and Ubuntu does not.  Despite the initramfs having been created within a chroot, it seems to have been "tainted" by the external RHEL environment.  I'm not sure if it looks at something in /proc or /dev (which were mounted from the RHEL system), if the update-initramfs process breaks out of the chroot, or if the kernel has a role in this, but I will be investigating this further when I have time.  Long story short: FAILED because of an unusable initrd.
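
For anyone who wants to check the MBR-vs-GPT question on their own disks, here is a rough sketch (not something I ran during this install).  It reads the on-disk signatures directly: a GPT header begins with the text "EFI PART" at the start of the second sector, and an MBR ends with the bytes 0x55 0xAA at offset 510.  The function name partition_scheme is made up for illustration, and the sketch assumes 512-byte logical sectors.

```shell
#!/bin/sh
# Sketch: report whether a disk (or disk image) carries a GPT or only an MBR.
# Assumes 512-byte logical sectors; partition_scheme is a hypothetical helper name.

partition_scheme() {
    disk=$1
    # A GPT header starts at LBA 1 (byte 512) with the ASCII signature "EFI PART".
    gpt_sig=$(dd if="$disk" bs=1 skip=512 count=8 2>/dev/null | tr -d '\0')
    # An MBR ends with the boot signature bytes 0x55 0xAA at offset 510.
    mbr_sig=$(dd if="$disk" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')

    # Check GPT first: GPT disks also carry a protective MBR with the same 0x55AA signature.
    if [ "$gpt_sig" = "EFI PART" ]; then
        echo gpt
    elif [ "$mbr_sig" = "55aa" ]; then
        echo mbr
    else
        echo unknown
    fi
}

# Example (the device name is an assumption): partition_scheme /dev/sda
```

Reading a raw device this way needs root.  A result of "mbr" would match the situation described above, where the install was not laid out in an EFI-compatible manner.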

Attempt 6

Some information on IBM's website led us to believe that with a newer version of the UEFI, we could boot from HDDs in external enclosures, which would let us go back to Attempt 1 for a solution.  After updating the UEFI (which also requires updating the Integrated Management Module, or IMM), I attempted to boot from the external drive again.  FAILED because the UEFI still will not boot from that drive.

Attempt 7

The same information also led us to believe that there is a boot-time distinction between flash drives and external hard drives in enclosures.  What this difference is, I don't know, but I'd like to find out.  So we inserted a 4GB USB flash drive, booted from the 10.04 CD again, installed to the flash drive rather than the external drive, and attempted to boot from it.  This succeeded, so I upgraded to the new kernel from -proposed, which has support for the RAID controller.  After a reboot, I was able to see the RAID array.  I dropped into single-user mode and tried to remount the root filesystem read-only, and was greeted with "device busy."  Thanks to lsof, I was able to determine that rsyslog was keeping the log files open, so a quick "stop rsyslog" later, I had the filesystem mounted read-only.  With that done, I was able to dd the filesystem over to the internal drives and reboot the server, waiting to see if it would boot.  SUCCESS! The server booted!
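
The dd step can be sketched roughly as follows.  This is an illustration rather than the exact commands I typed: clone_disk is a made-up helper name, the device names in the final comment are assumptions, and bs=1M, conv=fsync and cmp -n all assume the GNU tools that ship with Ubuntu.  It copies the source onto the target and then compares the two, so a bad copy shows up before the reboot rather than after.

```shell
#!/bin/sh
# Sketch: copy one disk (or image file) onto another with dd, then verify the copy.
# clone_disk is a hypothetical helper; bs=1M and conv=fsync assume GNU dd.

clone_disk() {
    src=$1
    dst=$2
    # conv=fsync makes dd flush the data to the destination before it exits.
    dd if="$src" of="$dst" bs=1M conv=fsync 2>/dev/null || return 1
    # The destination may be larger than the source, so compare only the
    # number of bytes the source holds (cmp -n is a GNU extension).
    size=$(wc -c < "$src" | tr -d ' ')
    cmp -n "$size" "$src" "$dst"
}

# Example (device names are assumptions, and this overwrites the target):
#   clone_disk /dev/sdb /dev/sda
```

The source filesystem has to be quiet while this runs, which is why the root filesystem was remounted read-only first (for example with mount -o remount,ro /).  Copying a filesystem that is still being written to can leave the copy inconsistent.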


I still need to do our normal configuration on the server and repeat the process from Attempt 7 on the second server, but at least we have a workaround (kludge) now.  Then it will be on to getting HA and other software set up, in hopes of a Drupal migration during the winter holidays.  (Our original target had been next Wednesday, but this process has thrown a significant wrench in the works.)  I'd like to thank my directors Tom and Gary for giving me the chance to bounce ideas off of them and talk through it; sometimes things become very clear when you try to explain them to someone else.