Man, this has been a real pain. Whoever said technology makes life easier should really be held accountable. hehehe. I’d like to try and break this down into layman’s terms, but it’s rather difficult.
In my machine there were three drives.
* Boot / Root drive (sata)
* /opt = archive & application file store (eide)
* /share = MP3s/application backups (eide)
I originally thought that my boot drive was eide and that the drive that failed was sata. I was wrong, but I had already purchased two more sata drives that I could mirror with my adaptec card. This caused my first headache, as I would have to move my boot partitions off the original sata drive and onto the new ones. Since this server runs linux, it is possible to do this with a disk copy utility called ‘dd’. That, combined with an application called resize2fs, would allow me to copy my partitions over to the new drives and then resize the filesystems to take advantage of the new space. Sounds good, right? It did go well, until I realized that I had a fake-raid controller that would munge the drives when I went to mirror them. At this juncture I was left with a choice: either keep the original boot drive, throw one of my two new drives in the machine for more storage but no future protection, and call it a day, or buy a real sata raid card (3Ware) and do the job the correct way. As I was weighing the decision and looking over the mayhem inside the opened server, I noticed another possible issue.
Heat. The new sata drives were really hot, I mean egg-cooking hot. In retrospect I should have grabbed my laser thermometer and measured it, but I digress. Let’s do the full monty, I reasoned in my crazy state of mind. I ordered a HW raid controller and a PCI-slot exhaust fan card. I picked up a 5.25″ bay drive cooler for my original sata drive, and an attachable 3.25″ fan that goes right on the base of a new sata drive.
All of this material came in yesterday. I told everyone connected to our in-house system of the impending downtime and went to work at 4pm. This should have been rather easy. Take out all the eide remnants, build the machine with dual sata controllers, rig all the new fans and power together, and I should have been looking at a grub boot screen. Of course not.
Welp, when things don’t work, you have to troubleshoot. You can go at it one of two ways: strip it all down and build it back up until it works again, or work in reverse and remove things until the problem is understood. I generally choose to work in reverse, on the belief that the problem should be rather high in the architecture. This is a flaw of mine, since Murphy teaches me each time that it’s something much farther down. No matter what the BIOS was telling me, it would not hand over booting to either of my sata controllers.
The only way the drives would boot was if I added my dying 80GB drive back into the pool. It had a master boot record on it, swap was its first partition (undamaged), and it would bounce the boot straight to the sata drive. I think to myself, fine, fine.. I’ll just port my data off my known-good eide drive so I don’t have to use the weirdly defunct drive that it chooses to work with now. I move my 120GB of data over to the new array, fdisk/repartition the 170GB eide drive to have a nice little boot partition, activate it, reboot, and nothing. Try a few more configurations, double-check my work, nothing. *sigh*
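For anyone wanting to poke at what the BIOS actually sees on a drive like that, here is a minimal sketch in Python that picks apart the first 512-byte sector: the boot code area, the four partition entries with their active flags, and the 0x55AA signature at the end. The function names and the device path in the comment are my own placeholders, not anything from the machine above. One thing it can show is whether a freshly repartitioned drive has an active partition but an empty boot code area. Whether that was my actual problem I can't say for sure.

```python
import struct

SECTOR_SIZE = 512
PARTITION_TABLE_OFFSET = 446   # boot code occupies bytes 0..445
ENTRY_SIZE = 16                # four 16-byte partition entries follow
BOOT_SIGNATURE = b"\x55\xaa"   # last two bytes of a valid MBR


def inspect_mbr(sector: bytes) -> dict:
    """Summarize a classic (non-GPT) master boot record."""
    if len(sector) < SECTOR_SIZE:
        raise ValueError("need a full 512-byte sector")

    partitions = []
    for index in range(4):
        start = PARTITION_TABLE_OFFSET + index * ENTRY_SIZE
        entry = sector[start:start + ENTRY_SIZE]
        part_type = entry[4]
        if part_type == 0:          # type 0 means the slot is unused
            continue
        # Bytes 8..15: starting LBA and sector count, little-endian.
        first_lba, sectors = struct.unpack_from("<II", entry, 8)
        partitions.append({
            "index": index,
            "active": entry[0] == 0x80,   # 0x80 marks the bootable slot
            "type": part_type,
            "first_lba": first_lba,
            "sectors": sectors,
        })

    return {
        "has_signature": sector[510:512] == BOOT_SIGNATURE,
        # Rough heuristic: an all-zero boot area has no boot loader in it.
        "has_boot_code": any(sector[:PARTITION_TABLE_OFFSET]),
        "partitions": partitions,
    }


def read_first_sector(device_path: str) -> bytes:
    """Read sector 0 of a disk; the path is hypothetical, e.g. /dev/hda."""
    with open(device_path, "rb") as disk:
        return disk.read(SECTOR_SIZE)
```

Reading a raw disk device usually needs root, and the sketch only reads, it never writes anything back.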
There are probably better ways to solve this, but the time was around 11pm and I was really seeing cross-eyed after working all day and coming home to work all night. I replaced the large clean eide drive with the dirty/decaying one, blew away the partition that was corrupt and just kept its little bootable swap slice alive. The drive serves no purpose but to transfer boot status to the real sata boot drive, and everything is hunky-dory. I close the case up and get boot errors. hah. Closing the damn case caused an sata cable to come undone and a memory stick to become dislodged. I was about to tear something apart. Kept my cool though, rewired and re-seated everything, and it was happy, thank goodness. My heart couldn’t take any more machine mischief.
In the end I’ve got a bad drive acting as a boot slave, 5 new fans, probably 300 CFM more airflow, 20 more decibels, and a case which is staying around 29.3°C ambient. I don’t particularly like the configuration, but I don’t quite know how to proceed at this time either. I didn’t ask for this problem, and I sure didn’t like bleeding a lot of cash to fix it properly. What’s a technologist supposed to do?
I might tinker some more in the coming days.. I’m just not sure at the moment. Here’s to fewer problems, and long uptimes!
-a
ps.. That really wasn’t a good layman’s spiel, was it? Maybe this should have been said instead: Computer go boom boom, fixing it no startie, pour money into internet, fixie no startie again, percussive maintenance with bat, kludge metal into slot, workie workie!