Archive for category Troubleshooting

Another raid5 scare, and how to recover an apparently trashed array

This morning after waking up to lots of thunder and lightning, I got a text message saying my raid5 array had failed. Only this time, 2 of the 3 drives were missing. Since both of those drives were actually mounted via a vblade share (on a different physical machine), I assumed that the other server had freaked out during a power surge. I quickly rebooted the machine to bring back the vblade shares, but then the trouble started.

At some point, the array was “started” but had two faulty drives. I tried --remove and --add to remove and re-add the “faulty” drives. This had the effect of bringing the array back “online” with all the drives as spares. I removed the drives again, and tried the trick I used last time:

mdadm --assemble -f /dev/md0 /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2

However, this also didn’t work. It showed the array with /dev/sda2 and /dev/etherd/e4.2 as spares, and e4.1 was nowhere to be seen. At this point I was a little more than worried that I had done something to trash the array. That’s when a google search led me to this handy command:

mdadm -E /dev/sda2

This prints out the superblock information that is present on the hard drive. This told me that the e4.2 drive had not been damaged, since I was able to see information there. Also, the UUIDs on all three drives still matched. However, the bottom section of the report differed on all the drives.
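Comparing the reports by eye gets tedious with more than a couple of drives. Here is a minimal Python sketch of that comparison, assuming each report has been saved as text and that the fields appear as "Name : value" lines. The function names and the shortened sample reports are made up for illustration; they are not real output from my array.

```python
def parse_examine(report):
    """Turn the 'Name : value' lines of an `mdadm -E` report into a dict."""
    fields = {}
    for line in report.splitlines():
        name, sep, value = line.partition(" : ")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def differing_fields(reports):
    """Return the field names whose values are not identical on every device."""
    parsed = {dev: parse_examine(text) for dev, text in reports.items()}
    names = set()
    for fields in parsed.values():
        names.update(fields)
    return sorted(
        name for name in names
        if len({fields.get(name) for fields in parsed.values()}) > 1
    )


# Hypothetical, heavily shortened reports (not real output from my array).
SAMPLE = {
    "/dev/sda2": "  UUID : aaaa:bbbb\n  Events : 120\n",
    "/dev/etherd/e4.1": "  UUID : aaaa:bbbb\n  Events : 118\n",
}
print(differing_fields(SAMPLE))
```

If UUID shows up in the result, the drives no longer agree on which array they belong to, which is a much worse sign than a difference in the bookkeeping fields.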

A few google searches later, and I came across this:

mdadm --create --assume-clean --level=5 --raid-devices=3 /dev/md0 /dev/sda2 /dev/etherd/e4.2 /dev/etherd/e4.1

The --assume-clean flag tells mdadm to skip the initial resync, so it does not rewrite the data area of the drives. What I didn’t realize was that --create still writes new superblocks, which resets the UUIDs. That command brought the array back online, at least according to /proc/mdstat, but when I tried to mount it, mount couldn’t figure out the filesystem.

That’s when I realized that the *order* in which you specify the drives to the --create command actually matters. I re-ran the command like this:

mdadm --create --assume-clean --level=5 --raid-devices=3 /dev/md0 /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2

The array came back online, and I was able to mount it!
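Why the order matters: RAID 5 places data and parity chunks on the members by position, so the same disks listed in a different order read back as scrambled data. Below is a toy Python model of that, assuming the left-symmetric layout (mdadm's default for raid5, reported as "algorithm 2" in /proc/mdstat) and using small integers in place of real chunks. The function names are made up for illustration.

```python
def write_raid5(chunks, n):
    """Lay data chunks onto n members using left-symmetric parity rotation."""
    per_stripe = n - 1
    if len(chunks) % per_stripe:
        raise ValueError("toy model needs whole stripes")
    devices = [[] for _ in range(n)]
    for stripe, start in enumerate(range(0, len(chunks), per_stripe)):
        parity_dev = (n - 1) - (stripe % n)
        parity = 0
        for d, chunk in enumerate(chunks[start:start + per_stripe]):
            # Data chunks start on the member after the parity member and wrap.
            devices[(parity_dev + 1 + d) % n].append(chunk)
            parity ^= chunk
        devices[parity_dev].append(parity)
    return devices


def read_raid5(devices):
    """Read the data chunks back, trusting the given member order."""
    n = len(devices)
    out = []
    for stripe in range(len(devices[0])):
        parity_dev = (n - 1) - (stripe % n)
        for d in range(n - 1):
            out.append(devices[(parity_dev + 1 + d) % n][stripe])
    return out


data = list(range(1, 13))
members = write_raid5(data, 3)
right_order = read_raid5(members)
# Same three members, but the last two swapped, like my first --create attempt.
wrong_order = read_raid5([members[0], members[2], members[1]])
print(right_order == data, wrong_order == data)
```

With the members in the original order the data reads back intact; with two of them swapped, parity chunks land where data is expected, which fits with mount not recognizing the filesystem.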

So while RAID 5 protects against a single hard drive failing, it does not protect against me running stupid commands on the array. I’m going to have to start backing up my raid arrays onto other drives…

Final Cut: “Selection contains no media”

I encountered this error while trying to export a video from Final Cut using Compressor. Obviously my sequence did in fact have clips in it, but I didn’t see why this was occurring at first.

Somehow, the “in” point had been set to the last frame of the main timeline, so essentially I was trying to export 0 frames. The solution was to clear the in and out points by right clicking on the timeline of the sequence.

Stating the obvious: IE is a pain in the butt

Thank you Brandon K! You just saved me 6 hours apparently:

from http://php.net/header

Brandon K [ brandonkirsch uses gmail ]
26-Apr-2007 01:04
I just lost six hours of my life trying to use the following method to send a PDF file via PHP to Internet Explorer 6:

header('Content-type: application/pdf');
header('Content-Disposition: attachment; filename="downloaded.pdf"');
readfile('original.pdf');

When using SSL, Internet Explorer will prompt with the Open / Save dialog, but then says “The file is currently unavailable or cannot be found. Please try again later.” After much searching I became aware of the following MSKB Article titled “Internet Explorer file downloads over SSL do not work with the cache control headers” (KBID: 323308)

PHP.INI by default uses a setting: session.cache_limiter = nocache which modifies Content-Cache and Pragma headers to include “nocache” options. You can eliminate the IE error by changing “nocache” to “public” or “private” in PHP.INI — This will change the Content-Cache header as well as completely remove the Pragma header. If you cannot or do not want to modify PHP.INI for a site-wide fix, you can send the following two headers to overwrite defaults:

header('Cache-Control: maxage=3600'); //Adjust maxage appropriately
header('Pragma: public');

You will still need to set the content headers as listed above for this to work. Please note this problem ONLY effects Internet Explorer, while Firefox does not exhibit this flawed behavior.
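For anyone not on PHP, here is the same set of headers as a small Python sketch. Two things in it are my own assumptions rather than part of the comment above: the helper name is made up, and the directive is spelled max-age with a hyphen, which is the form the HTTP spec defines for the Cache-Control header.

```python
def pdf_download_headers(filename, max_age=3600):
    """Build response headers for a PDF download, avoiding no-cache settings."""
    return [
        ("Content-Type", "application/pdf"),
        ("Content-Disposition", 'attachment; filename="%s"' % filename),
        # Mark the response as cacheable instead of sending no-cache headers.
        ("Cache-Control", "max-age=%d" % max_age),
        ("Pragma", "public"),
    ]


for name, value in pdf_download_headers("downloaded.pdf"):
    print("%s: %s" % (name, value))
```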

Why can’t IE just play nice?

photoshop freezes on startup

This has happened once before. One day, photoshop decides it doesn’t want to start up. During the loading screen, it just stops on “Reading text global resources.”

Luckily, after a short google search, I found this:

http://photoshopnews.com/2005/01/07/photoshop-freezes-when-you-start-it-cs-on-mac-win/

Here’s the fix:

You need to delete the “New Doc Sizes.psp” file. It is located at:
C:\Documents and Settings\[username]\Application Data\Adobe\Photoshop\9.0\Adobe Photoshop CS2 Settings

Fin.

2/4 failed drives in a raid5 array??

Apparently raid arrays don’t like it when you kill power to the machine. We’ve been doing remodeling in the house, and I’ve been turning on and off breakers. Apparently I forgot which breaker the servers were on, and turned it off accidentally a few times.

One of the drives apparently did fail for some reason. I accept this. This happens all the time, and that is exactly why I have a raid 5 array. I was like ok no big deal, I’ll just send it in for an RMA. But shortly after, another drive looked like it failed. I saw this in /proc/mdstat:

…[U__U]

meaning only two of the four drives were left. Later attempts at rebooting the machine left the raid volume completely inaccessible. Other clues that indicated a drive failure:

From /var/log/messages:

Dec 29 19:29:06 onyx kernel: Buffer I/O error on device md0, logical block 0
Dec 29 19:29:06 onyx kernel: lost page write due to I/O error on md0
Dec 29 19:29:06 onyx kernel: EXT2-fs error (device md0): ext2_readdir: bad page in #2

There was also some output in dmesg I found by typing

dmesg | less

(But I didn’t write it down, and now dmesg outputs information from the last boot which successfully brought up the array with 3/4 drives.)

I was convinced I hadn’t actually lost 2/4 drives at the same time, and set out to figure out a way to bring it back.

After several hours of looking through forums and reading the mdadm documentation, I was able to get the array back running on 3/4 drives.

I created the configuration file /etc/mdadm.conf:

DEVICE /dev/sd[abcd]1
ARRAY /dev/md0 devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1

Then ran:

/sbin/mdadm --assemble -f /dev/md0
mdadm: forcing event count in /dev/sdc1(2) from 1077319 upto 1077330
mdadm: clearing FAULTY flag for device 1 in /dev/md0 for /dev/sdc1
mdadm: /dev/md0 has been started with 3 drives (out of 4).
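What the forced assemble is doing, as far as I understand it: every member's superblock carries an event counter, and a member whose counter lags behind the others is treated as out of date. With -f, mdadm bumps a slightly stale member up to the newest count so the array can start. A tiny Python sketch of that comparison follows. The numbers for sdc1 come from the output above; the counts for the other two drives are assumed, and the function name is made up.

```python
def stale_members(events):
    """Return {device: how far its event count lags behind the newest member}."""
    newest = max(events.values())
    return {dev: newest - count for dev, count in events.items() if count < newest}


# sdc1's counts are from the mdadm output above; the other two are assumed.
EVENTS = {"/dev/sda1": 1077330, "/dev/sdc1": 1077319, "/dev/sdd1": 1077330}
print(stale_members(EVENTS))
```

A small lag like this one suggests the drive dropped out shortly before the array stopped, which is the case where forcing it back in is most likely to be safe.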

Now I just have to RMA this drive very quickly before another drive actually does fail.

Failed disk when creating RAID5 array

I spent about an hour bashing my head against a wall with this one.

I just put four 400 GB drives into my linux server. I didn’t feel like getting a hardware raid card, so I went with the linux software raid solution.

I run the command to create the array:

mdadm -v --create /dev/md1 --chunk=128 --level=5 --raid-devices=4 /dev/hda1 /dev/hdb1 /dev/hdc1 /dev/hdh1

And then it starts to build the array. I check the status with

watch -n 1 cat /proc/mdstat

In the output of /proc/mdstat, it shows all the drives in the array and which ones are failed. Like so:

md1 : active raid5 hdh1[4] hdc1[2] hdb1[1] hda1[0]
1172126208 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUU_]
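The bracketed field at the end is the part to watch: one character per slot, U for a member that is up and _ for one that is not. If you would rather have a script read it, here is a minimal Python sketch; the function name is made up, and it only looks at that one field.

```python
import re


def down_slots(mdstat_line):
    """Return the slot indexes shown as '_' in the [UU_U]-style status field."""
    match = re.search(r"\[([U_]+)\]", mdstat_line)
    if match is None:
        raise ValueError("no [U_] status field found")
    return [i for i, flag in enumerate(match.group(1)) if flag == "_"]


line = "1172126208 blocks level 5, 128k chunk, algorithm 2 [4/4] [UUU_]"
print(down_slots(line))
```

For the line above this reports slot 3, the last drive in the list.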

In Webmin, I see that the array is created, but one of the drives has failed. At first I thought I had a bad drive. But it turns out I can get any of the four drives to fail depending on the order that I specify them. Whichever drive is last in the list is the one that fails.

Finally, after an hour and a half of trying various combinations of creating arrays with 2, 3, 4 disks, raid 0 and 1, a regular file system, dinner, and several unsuccessful google searches, I stumbled upon an email from Neil Brown, the creator of the mdadm program, which explains everything.

RAID5 rebuild question

Here is the answer:

.
.
.
4/ Assume that the parity blocks are all correct, but that one drive is missing (i.e. the array is degraded). This is repaired by reconstructing what should have been on the missing drive, onto a spare. This involves reading all the ‘good’ drives in parallel, calculating them missing block (whether data or parity) and writing it to the ‘spare’ drive. The ‘spare’ will be written to a few (10s or 100s of) blocks behind the blocks being read off the ‘good’ drives, but each drive will run completely sequentially and so at top speed.

On a new array where most of the parity blocks are probably bad, ‘4’ is clearly the best option. ‘mdadm’ makes sure this happens by creating a raid5 array not with N good drives, but with N-1 good drives and one spare. …

So after all that, it was working correctly, and if I had just let it rebuild, it would go back to normal.

But, I had also come across a different website which mentioned the --force option to mdadm. I tried that, and lo and behold, the array was set up correctly!

This guy had the same problem, which is where I got the link to Neil Brown’s email.
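The reconstruction step in Neil Brown's email comes down to XOR: the parity block is the XOR of the data blocks in a stripe, so any one missing block (data or parity) is the XOR of all the others. A minimal Python sketch with made-up four-byte blocks; real chunks are far larger, and the function name is my own.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


data = [b"disk", b"one!", b"two."]
parity = xor_blocks(data)
# Pretend the second block is gone, then rebuild it from the rest plus parity.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt)
```

This is the same calculation whether the block being written to the spare is data or parity, which is why starting a new array as N-1 drives plus one spare works.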

problems when unmounting filesystems in linux

I’m reconfiguring my RAID 5 array in one of my servers (called malachite), and having some problems doing so.

What I’m trying to do is to completely delete the current RAID 5 array, and rebuild it with different sized partitions on the same disks.

I can’t unmount the raid array directly, because linux reports back to me:

# umount /raid
umount: /raid: device is busy

So I’m trying to trace this back to see why it’s busy. I’m guessing it’s because this folder is mounted with nfs to my other server (called onyx). However, I also can’t unmount that folder because it is also busy. There is a neat command which tells you what processes are accessing a certain file. I found it here.

# fuser -v -m /RAIDmalachite

will list all processes using the /RAIDmalachite folder. After doing this, I saw that indeed, Samba is using that folder.
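For the curious, part of what fuser reports can be approximated by walking /proc. The Python sketch below only checks each process's current working directory, so it is much narrower than fuser -m, which also reports open files and other uses of the mount. It assumes a Linux-style /proc, and the function name is made up.

```python
import os


def pids_with_cwd_under(path, proc_root="/proc"):
    """Return PIDs whose current working directory is at or below path."""
    target = os.path.realpath(path)
    prefix = target.rstrip("/") + "/"
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as 'self' or 'meminfo'
        try:
            cwd = os.readlink(os.path.join(proc_root, entry, "cwd"))
        except OSError:
            continue  # process exited, or we lack permission to look
        if cwd == target or cwd.startswith(prefix):
            found.append(int(entry))
    return sorted(found)
```

Run as root, pids_with_cwd_under("/RAIDmalachite") would list the processes sitting inside that folder. As with fuser, the kernel nfs server will not show up here.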

Once I took care of that, I still couldn’t unmount the drive on malachite. This is probably because nfs is still exporting it. Note that the kernel nfs server doesn’t count as a process, so it won’t show up by running the above command. It is necessary to stop nfs with:

# /etc/init.d/nfs stop

Now I can unmount the drive:

# umount /raid

At this point, I can now go back into Webmin and delete the RAID array, and then remove all the partitions on the drives.

After creating my new raid array, I can watch the status of it rebuilding in real time from the command line:

# watch -n 1 cat /proc/mdstat

What is even better is that I can format, and start copying files back to the array even before it has completely been rebuilt. The files just won’t be redundant until after it finishes rebuilding. Looks like it will take about 2 hours to rebuild the 650 GB array.

Trouble restarting Apache

My VPS was restarted yesterday, and apparently Apache wasn’t shut down gracefully. I tried to go to my website, but Apache wasn’t running. I SSH’d into my machine to restart apache. I typed

/etc/init.d/httpd restart

I can’t remember exactly what it said back to me, but it gave me the PID of Apache (1039) and said it was running. Apache was most definitely not running. I found this out by getting a list of all running processes:

ps -ax

That gave me a list of PIDs corresponding to programs, and 1039 was not httpd. It occurred to me that if Apache thinks it’s running, it has to have some way of knowing that. In Apache’s log file directory, there is a file which holds the PID of Apache when it’s running. That file is called httpd.pid. By removing that file, Apache no longer thinks it’s already running, and will actually try to start. If you don’t know where httpd.pid is, run this command:

find / -name httpd.pid 2>/dev/null

That should find it. If not, it is probably in /usr/local/apache2/logs if you installed with the default values. Then you just need to rm httpd.pid, and restart apache again. I got everything back up and running with just under 10 minutes of troubleshooting!
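The check I did by hand (read the PID from the file, then see whether that PID is really httpd) can be scripted. Here is a minimal Python sketch, assuming a Linux /proc where /proc/<pid>/comm holds the process name. The function name is made up, and the path in the usage note is just the default install location mentioned above.

```python
import os


def pid_file_is_stale(pid_file, expected_name, proc_root="/proc"):
    """True if the PID in pid_file is not a running process named expected_name."""
    with open(pid_file) as handle:
        pid = handle.read().strip()
    try:
        with open(os.path.join(proc_root, pid, "comm")) as handle:
            running_name = handle.read().strip()
    except OSError:
        return True  # nothing is running under that PID
    # A different program may have been given the old PID after the reboot.
    return running_name != expected_name
```

Something like pid_file_is_stale("/usr/local/apache2/logs/httpd.pid", "httpd") returning True would mean the file is safe to remove before starting Apache again.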
