Archive for category Linux
Linux Software RAID and LVM
Posted by aaron in Hardware, Linux, Server Software on August 31st, 2009
I wanted to create a RAID 5 array with four 1tb drives. I only have three, but I have a 750gb and 320gb drive lying around. I figured there was probably a way to combine them into a 1tb drive that I could use with the others.
Using Linux’s LVM, I can create a logical volume from the two smaller drives that is as big as a 1tb drive.
$ pvdisplay
Lists all physical volumes managed by LVM. First we have to create the physical volumes for LVM. I prefer to create the volumes out of partitions, although you can do it from raw drives too.
First let’s use fdisk to create “Linux LVM” partitions on the two drives.
$ fdisk /dev/sdb
Press "n" to create a new partition
Press "t" to set the partition type, and enter "8e" for Linux LVM.
When creating the partition, accepting the defaults will make it use the whole drive. I want to use the entire 750 drive and only part of the 320 drive, so that in total they have the same number of blocks as one of the 1tb drives. I first created the “Linux RAID” partitions on the 1tb drives so I could see how many cylinders fdisk listed, which ended up being 121601. Then I created a partition the full size of the 750 drive (91201 cylinders), and a 30400-cylinder partition (121601 - 91201) on the 320 drive.
$ fdisk -l /dev/sdb
Disk /dev/sdb: 320.0 GB, 320072933376 bytes
255 heads, 63 sectors/track, 38913 cylinders
   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1               1       30400   244187968+  8e  Linux LVM
/dev/sdb2           30401       38913    68380672+  83  Linux

$ fdisk -l /dev/sdj
Disk /dev/sdj: 750.1 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders
   Device Boot      Start         End      Blocks   Id  System
/dev/sdj1               1       91201   732572001   8e  Linux LVM
Now that I have the two partitions ready, I moved on to LVM setup.
$ pvcreate /dev/sdb1
$ pvcreate /dev/sdj1
This sets up the two partitions as physical volumes for LVM.
Next is creating a volume group:
$ vgcreate vg_tb /dev/sdb1 /dev/sdj1
This creates a volume group called “vg_tb” using the two physical volumes sdb1 and sdj1.
Let’s take a look at what we have so far:
$ pvdisplay
  --- Physical volume ---
  PV Name               /dev/sdb1
  VG Name               vg_tb
  PV Size               232.88 GB / not usable 832.50 KB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              59616
  Free PE               59616
  Allocated PE          0
  PV UUID               70hPKX-n11U-RcB6-0Kyt-1SOP-ni7E-2Y9hcE

  --- Physical volume ---
  PV Name               /dev/sdj1
  VG Name               vg_tb
  PV Size               698.64 GB / not usable 2.34 MB
  Allocatable           yes
  PE Size (KByte)       4096
  Total PE              178850
  Free PE               178850
  Allocated PE          0
  PV UUID               PzFb9b-lapG-KdT3-78nh-Gq75-F0Lo-I3xCrl

  --- Physical volume ---
  PV Name               /dev/sda2
  VG Name               VolGroup00
  PV Size               111.60 GB / not usable 2.86 MB
  Allocatable           yes
  PE Size (KByte)       32768
  Total PE              3571
  Free PE               1
  Allocated PE          3570
  PV UUID               1wM65Z-3QGd-vDiq-mq1R-YhEE-Ackp-hh3g13

$ vgdisplay
  --- Volume group ---
  VG Name               vg_tb
  System ID
  Format                lvm2
  Metadata Areas        2
  Metadata Sequence No  1
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                0
  Open LV               0
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               931.51 GB
  PE Size               4.00 MB
  Total PE              238466
  Alloc PE / Size       0 / 0
  Free  PE / Size       238466 / 931.51 GB
  VG UUID               olg9GP-x1sC-sFAD-TgWY-KIIx-YWNt-kL763n

  --- Volume group ---
  VG Name               VolGroup00
  System ID
  Format                lvm2
  Metadata Areas        1
  Metadata Sequence No  3
  VG Access             read/write
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                1
  Act PV                1
  VG Size               111.59 GB
  PE Size               32.00 MB
  Total PE              3571
  Alloc PE / Size       3570 / 111.56 GB
  Free  PE / Size       1 / 32.00 MB
  VG UUID               IZ25LV-oMOG-DwKK-QuFN-bqqp-ClUe-d71k5l
So far so good. The next step is to create a logical volume in the new volume group:
$ lvcreate vg_tb -n onetb -l 100%VG
  Logical volume "onetb" created
This creates a new logical volume called “onetb” in the “vg_tb” group using 100% of the group’s available space. Now let’s take a look at the list of logical volumes:
$ lvdisplay
  --- Logical volume ---
  LV Name                /dev/vg_tb/onetb
  VG Name                vg_tb
  LV UUID                sQKaq9-D6Mv-p8it-vWGW-O7DX-FGmC-cl5FSh
  LV Write Access        read/write
  LV Status              available
  # open                 0
  LV Size                931.51 GB
  Current LE             238466
  Segments               2
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:2

  --- Logical volume ---
  LV Name                /dev/VolGroup00/LogVol00
  VG Name                VolGroup00
  LV UUID                G5hlbb-tA3S-qhTS-03us-f9dl-1Vxy-9vDSU5
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                109.62 GB
  Current LE             3508
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:0

  --- Logical volume ---
  LV Name                /dev/VolGroup00/LogVol01
  VG Name                VolGroup00
  LV UUID                7NZrY9-1wSJ-4fRp-VPnM-V07u-9rj8-vmtxTx
  LV Write Access        read/write
  LV Status              available
  # open                 1
  LV Size                1.94 GB
  Current LE             62
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:1
You can ignore all of the VolGroup00 things, those are the auto-created volumes from when I installed Fedora.
At this point, I have a new device at /dev/vg_tb/onetb which is the same size as my 1tb drives, and I can use it exactly as I would use the 1tb partition at /dev/sdf1.
Now it’s time to create the RAID 5 array from these four volumes.
$ mdadm -v --create /dev/md1 --chunk=128 --level=5 --raid-devices=4 /dev/sdf1 /dev/sdh1 /dev/sdi1 /dev/vg_tb/onetb
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: layout defaults to left-symmetric
mdadm: size set to 976756608K
mdadm: array /dev/md1 started.
The array will begin syncing, and you can watch it by running:
$ watch -n 1 "cat /proc/mdstat"
Personalities : [raid6] [raid5] [raid4]
md1 : active raid5 dm-2[4] sdi1[2] sdh1[1] sdf1[0]
      2930269824 blocks level 5, 128k chunk, algorithm 2 [4/3] [UUU_]
      [=>...................]  recovery =  5.5% (54017536/976756608) finish=593.1min speed=25925K/sec

md0 : active raid5 sdg1[0] sdc1[3] sde1[2] sdd1[1]
      2197715712 blocks level 5, 64k chunk, algorithm 2 [4/4] [UUUU]

unused devices: <none>
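As a rough sanity check on the finish estimate in that output, the arithmetic below divides the blocks still to be synced by the reported speed. This is only a sketch: it assumes the block counts in /proc/mdstat are in the same 1K units as the speed figure, and the variable names are made up for illustration.

```shell
# Numbers copied from the md1 recovery line in the mdstat output above.
total_blocks=976756608     # per-device size being synced, in 1K blocks
done_blocks=54017536       # blocks synced so far
speed_k_per_sec=25925      # reported speed in K/sec

remaining=$(( total_blocks - done_blocks ))
seconds=$(( remaining / speed_k_per_sec ))
minutes=$(( seconds / 60 ))

echo "about $minutes minutes remaining"   # about 593 minutes remaining
```

That lines up with the finish=593.1min that mdstat reports, so the estimate is simply remaining work divided by the current speed and will change whenever the speed does.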
While this is syncing, we can create the ext3 filesystem.
$ mke2fs -j -b 4096 -m 0 -E stride=32,stripe-width=96 /dev/md1
This creates an ext2 filesystem with journaling (ext3), with a 4kb block size and 0% of the blocks reserved for the superuser. The stride is the RAID chunk size divided by the filesystem block size (128k / 4k = 32). The stripe width is the stride times the number of data disks in the array (32 x 3 = 96). A 4-disk RAID 5 array holds three disks’ worth of data, with the remaining disk’s worth of space used for parity.
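If you are building an array with a different chunk size or number of drives, the same calculation can be sketched in a few lines of shell arithmetic. The variable names here are just for illustration, and the values are the ones used in this post.

```shell
# Values from this post; change them for a different array.
chunk_kb=128        # mdadm --chunk=128
block_kb=4          # mke2fs -b 4096
raid_devices=4      # number of devices in the RAID 5 array

data_disks=$(( raid_devices - 1 ))        # RAID 5 gives up one disk's worth to parity
stride=$(( chunk_kb / block_kb ))         # filesystem blocks per chunk
stripe_width=$(( stride * data_disks ))   # filesystem blocks per full stripe

echo "-E stride=$stride,stripe-width=$stripe_width"
```

With the numbers above this prints the same -E option that was passed to mke2fs.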
This will take some time, and will significantly slow down the sync process. Mine dropped from 25 MB/s to around 1 MB/s. I figure I’ll let it create the filesystem so I can start copying data to it right away, and let it finish its sync on its own time.
While you’re at it, you should set up munin to monitor the SMART data from all the drives as well as the status of the array using my munin-raid-monitor plugin.
Additional Reading:
- http://linux-raid.osdl.org/index.php/RAID_setup
- http://bfish.xaedalus.net/2006/11/software-raid-5-in-ubuntu-with-mdadm/
- http://www.linuxquestions.org/questions/linux-hardware-18/how-can-i-override-the-5.00-reserved-for-the-super-user-mkfs.ext3-creates-616546/
- http://www.linuxconfig.org/Linux_lvm_-_Logical_Volume_Manager
- http://www.centos.org/docs/5/html/Cluster_Logical_Volume_Manager/LV_create.html
Workaround for Comcast blocking port 25
Comcast just started blocking port 25 outgoing. I have several computers at home configured to send email reports of cron jobs. Of course they do this by trying to send mail on port 25 from inside the house to my mail server outside. Now that Comcast is blocking that, I need some other way for my emails to be delivered.
The easiest solution I could come up with was to tell my mail server to listen on another port such as 587, and have my firewall route requests for port 25 to port 587. Here is the iptables rule to do that!
iptables -t nat -A PREROUTING -p tcp -i eth0 -d xx.xx.xx.xx --dport 25 -j DNAT --to-destination :587
Where xx.xx.xx.xx is the IP address of my mail server. Now all the computers inside think they are communicating with my mail server on port 25, but the firewall secretly passes the request on to port 587 instead.
SSHfs on OS X via Samba
Posted by aaron in Apple/os x, Linux, Server Software on November 20th, 2008
Why, you ask?
Because OS X’s version of sshfs is wonky.
My solution is to use an intermediate linux server which does the sshfs mount, then serves that to os x over a samba share. Another benefit to this is that you only need one samba share to mount all your sshfs connections, since the linux server will be taking care of those. Also if you had a windows computer you wanted to use this with, it would also be able to mount the samba share from the linux server.
Let’s get started.
I ran into a couple tricky config issues while setting this up. This post is here mostly for me to remember them for next time I need to do them. If anybody else happens to stumble across this and finds it useful, then that’s a bonus.
On the linux server, you’ll need to install fuse-sshfs as well as samba. In Fedora, this can be done like so:
$ yum install samba
$ yum install fuse-sshfs
You need to add the linux user to the fuse group, and create a samba user account.
$ usermod -a -G fuse aaron
$ smbpasswd -a aaron
Here’s something that isn’t normally covered in fuse tutorials. In order to allow samba to access the fuse mount, you need to create a file, /etc/fuse.conf with the following contents:
user_allow_other
If you don’t do that, the mounted folder just disappears from the samba share.
You need to make some changes to the /etc/samba/smb.conf file in order for symlinks to be shared to os x. I’m not sure if this is required for fuse to work, but it’s nice to have anyway:
unix extensions = no     (add this outside of a share definition)
follow symlinks = yes    (add this inside a share definition)
Of course, you need to configure the firewall to allow access to samba (tcp and udp ports 445, 137, 138, 139 should do). You’ll also need to make sure samba starts when the machine boots.
$ chkconfig smb on
We’re almost there.
To actually mount the sshfs folder, you’ll run a command which looks something like this. Note the extra option at the end:
sshfs username@example.com:/home/username mount_target -o allow_other
Now you can mount your home folder on the linux server over samba, and you’ll see a folder mount_target, which is the sshfs mount.
Note: you’ll probably want to set up your ssh server with public key authentication so you don’t have to enter your password every time you connect. This is not the topic of this post, so I won’t bother mentioning how. There are plenty of other tutorials on the Internet.
I hope this covers it, but feel free to comment if I’ve left anything out.
Redundant web & database servers on a budget using Virtual Private Servers
Posted by aaron in Linux, Server Software, Website Development on September 22nd, 2008
Background
First let me just say that I have been struggling with this problem for quite some time now. The problem is to provide redundancy for a website so that the website continues to run even if there is a problem with one of the servers it’s running on.
In a typical simple server setup, there is a single machine running the web and mysql servers. The machine can be either a dedicated server, or as I have been using, a VPS. I have been running my websites off of VPSs for several years now, with minimal trouble. This works most of the time, but having a site on just one machine means a Single Point Of Failure. If something is wrong with that server, the websites are non-functional until it is fixed. The trouble I have run into falls under a few categories:
- A problem with the physical host
- A problem at the VPS level (operating system, Apache or MySQL errors, etc)
- A problem at the network level
Problems with the physical host
Problems with the physical host do occur. With a VPS, these are completely out of your control, and are the responsibility of the hosting provider. Some examples of things I’ve encountered include a failed RAID array, a corrupted filesystem on the host, requiring a several-hour-long fsck, or an unplugged power cord. The worst issue I’ve had was when the provider said they had lost 2 drives in a RAID 5 array, and all the VPSs on that host were completely gone. Luckily I had a backup of the system and was up and running on a new VPS within a couple hours.
Problems at the application level
I haven’t actually run into many problems at the VPS level compared to the other types of problems. However the latest issue I’m having does fall into this category. Apache periodically starts crashing part way through serving a page with the error “[notice] child pid 21106 exit signal Segmentation fault (11)”. Visitors see a completely blank page some of the time.
Problems at the network level
By far the most frequently occurring problem I encounter is network-related. These problems are usually out of both my and the hosting provider’s control. If there is a problem with the network, the downtime can vary greatly, anywhere from 5 minutes to 12 hours. It can be caused by a Denial of Service attack on a completely different server in the same datacenter, or it could just be a routing issue somewhere along the path from me to the server.
A typical redundant setup will cover the first two categories (problems with the physical host and problems at the VPS level). Typical setups may include one or more load balancers in front of multiple application servers. If a machine goes down, the load balancers can stop sending requests to it. This works great if you’re trying to protect against servers failing. However if the load balancers are all on the same network, unless the network has multiple redundant paths, the whole system is still vulnerable to network issues.
My Solution
Since I most frequently encounter network issues, I can’t get away with just a typical load-balancing solution. What I really need is a copy of the entire website in a geographically different location. Here is my solution:
One VPS in Dallas, TX (called triton), and another VPS in Newark, NJ (called proteus). (Yes, I name my servers after Greek mythology.) Triton holds the master copies of the websites’ php files, and proteus gets a copy of them via rsync. If I ever need to update the site, I edit the files on triton and then rsync them to proteus. Here is where the redundancy comes in. My DNS entries point the domain to both IP addresses. This means during normal operation, visitors will be more or less distributed between the two hosts evenly. If one server goes down, I can stop resolving DNS queries to it, and the worst that will happen is some dead pages for as long as the TTL on the domain.
This works as long as you’re just serving static content. Serving dynamic content, such as from a database, gets a little more complicated. MySQL’s NDB clusters are apparently only effective when run within a high speed network, with at least a 10 MBPS connection between them. Replication turns out to be more along the lines of what I’m looking for.
Replication to the rescue!
Replication is designed for a one-way sync between a master and slave. However, it is possible to configure two servers to be both a master and a slave. They will both notify each other of changes made to their databases. There is one trick you need to do in order to prevent primary key conflicts if rows are written to both databases while the link is down. It involves setting the auto_increment offset and increment, so that one server will only create even keys, and the other creates only odd keys.
/etc/my.cnf
auto_increment_increment = 2
auto_increment_offset = 1
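To see why these two settings keep the servers from colliding, here is a small illustrative sketch that lists the first few keys each server would hand out. It assumes the second server uses the same increment with auto_increment_offset = 2, which the config above does not show, and keys_for is a made-up helper, not anything from MySQL.

```shell
# Illustration only: print the first COUNT auto-increment keys a server
# would generate for a given offset and increment.
keys_for() {
  offset=$1; increment=$2; count=$3
  key=$offset
  while [ "$count" -gt 0 ]; do
    printf '%s ' "$key"
    key=$(( key + increment ))
    count=$(( count - 1 ))
  done
  echo
}

keys_for 1 2 5   # offset 1, as in the config above: 1 3 5 7 9
keys_for 2 2 5   # assumed offset 2 on the other server: 2 4 6 8 10
```

The two sequences never overlap, so rows inserted on both servers while the link is down can be replicated later without primary key conflicts.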
Here’s some dry reading on replication from the MySQL manual. Here is a slightly clearer guide to replication which sums everything up pretty nicely. Overall, replication was pretty easy to set up. It seems to be pretty robust as well. I simulated network problems by adding firewall rules to block the servers from each other. I was able to continue to interact with each database, and the changes were all carried over when the link came back up.
Feel free to comment if you have any experience or insights into configuring web and database servers! I’m curious to hear what other people have done.
Another raid5 scare, and how to recover an apparently trashed array
Posted by aaron in Hardware, Linux, Troubleshooting on August 18th, 2008
This morning after waking up to lots of thunder and lightning, I got a text message saying my raid5 array had failed. Only this time, 2 of the 3 drives were missing. Since both of those drives were actually mounted via a vblade share (on a different physical machine), I assumed that the other server had freaked out during a power surge. I quickly rebooted the machine to bring back the vblade shares, but then the trouble started.
At some point, the array was “started” but had two faulty drives. I tried --remove and --add to remove and re-add the “faulty” drives. This had the effect of bringing the array back “online” with all the drives as spares. I removed the drives again, and tried the trick I used last time:
mdadm --assemble -f /dev/md0 /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2
However, this also didn’t work. It showed the array with /dev/sda2 and /dev/etherd/e4.2 as spares, and e4.1 was nowhere to be seen. At this point I was a little more than worried that I had done something to trash the array. That’s when a google search led me to this handy command:
mdadm -E /dev/sda2
This prints out the superblock information that is present on the hard drive. This told me that the e4.2 drive had not been damaged, since I was able to see information there. Also, the UUIDs on all three drives still matched. However, the bottom section of the report differed on all the drives.
A few google searches later, and I came across this:
mdadm --create --assume-clean --level=5 --raid-devices=3 /dev/md0 /dev/sda2 /dev/etherd/e4.2 /dev/etherd/e4.1
Using the --assume-clean flag tells mdadm not to start initializing (resyncing) the array, so the data on the drives is left alone. However, what I didn’t realize was that it would still write new superblocks and reset the UUIDs. That command brought the array back online, at least according to /proc/mdstat, but when I tried to mount it, it couldn’t figure out the filesystem.
That’s when I realized that the *order* in which you specify the drives to the --create command actually matters. I re-ran the command like this:
mdadm --create --assume-clean --level=5 --raid-devices=3 /dev/md0 /dev/sda2 /dev/etherd/e4.1 /dev/etherd/e4.2
The array came back online, and I was able to mount it!
So while RAID 5 protects against a single hard drive failing, it does not protect against me running stupid commands on the array. I’m going to have to start backing up my raid arrays onto other drives…
Google Treasure Hunt – Question 2
I just stumbled across Google’s Treasure Hunt. They’re always doing neat things, so I checked it out. It’s a series of four problem-solving puzzles released one week at a time. I found out about it after the first question was already over, so I jumped in starting with the second question.
Question 2: File sum and multiplication problem
In this problem you are given a zip folder with hundreds of text files nested in folders an arbitrary number of levels deep. The question asks to find the sum of the nth line of all files with abc in their path, ending in .xyz. You do this twice, then calculate the product of the two sums. You can download the file I got from them at the link below.
googletreasurehunt.zip
My question is below:
Sum of line 3 for all files with path or name containing jkl and ending in .rtf
Sum of line 5 for all files with path or name containing stu and ending in .rtf
Multiply all the above sums together and enter the product below.
I thought this would be a good chance to practice my bash scripting, so I decided to give it a shot.
$ find . -type f -wholename "*jkl*\.rtf"
-type f restricts the search to files, not directories. -wholename matches the pattern against the whole path, not just the file name. Once we run this, we get a list of all the files we’ll need to process. Now we need to create a running total of the values in the nth line in each file. This is where a bash ‘for’ loop and ‘awk’ come in handy.
$ for x in $(find . -type f -wholename "*jkl*\.rtf")
The ‘$( )’ tells bash to execute that command and substitute its output. The ‘for’ loop then walks through that output one word at a time, and the variable x will contain each filename. (This works here because none of the paths contain spaces.)
$ awk 'NR==3' filename.txt
This is an awk command that prints the 3rd line of filename.txt. If we can stick this into the ‘for’ loop, we can get a running total.
sum1=0
for x in $(find . -type f -wholename "*jkl*\.rtf")
do
  sum1=$(( `awk 'NR==3' $x` + $sum1 ))
done
echo $sum1
Now we just repeat for the other pattern, and we end up with our sum in two variables. Calculating the product is simple.
echo $(( $sum1 * $sum2 ))
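For comparison, here is one possible way to wrap the same idea in a reusable function that also copes with spaces in file names. This is a sketch rather than the approach used above: sum_nth_line is a hypothetical helper name, it uses -path (the portable spelling of -wholename), and it assumes the chosen line of every matching file holds a plain integer.

```shell
# Hypothetical helper: sum line N of every file under DIR whose path
# contains SUBSTR and ends in .EXT.
# Assumes line N of each matching file is a plain integer.
sum_nth_line() {
  dir=$1; substr=$2; ext=$3; n=$4
  # find hands the file names straight to awk, so spaces in paths are safe;
  # awk prints line N of each file (FNR restarts at 1 for every file).
  find "$dir" -type f -path "*${substr}*.${ext}" \
    -exec awk -v n="$n" 'FNR == n && NF { print $1 }' {} + |
  {
    total=0
    while IFS= read -r value; do
      total=$(( total + value ))
    done
    echo "$total"
  }
}

# Usage with the patterns from the question above:
# sum1=$(sum_nth_line . jkl rtf 3)
# sum2=$(sum_nth_line . stu rtf 5)
# echo $(( sum1 * sum2 ))
```

Letting find pass the file names directly to awk avoids the word splitting that the ‘for’ loop relies on.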
Personally, I thought this puzzle was a little bit simplistic. I would have expected them to throw in some tricks, so that a poorly written search command would return some other files that would mess with your result. Now that I’ve said that, I hope I got it right! I find out in 10 hours if my answer is correct.
Feel free to comment if you have a different and/or better solution to this problem!
How to keep a MySQL connection alive
Posted by aaron in Linux, Server Software on May 26th, 2007
Ok, so I wasted about 3 hours on this, and have only sort of found a solution. I have a PHP script which is going to be running 24/7 checking a message queue and then interacting with a database when a message is received. It was mostly working, until it sat there for a day and I tried to use it the next day. The PHP script had quit, with the error “MySQL server has gone away.” I figured there must be some sort of time limit that the mysql server will keep an idle connection alive. I couldn’t find it, but luckily I was talking to nick, and he did! He found this blog entry. So I changed the my.cnf file to set the timeout to one week:
wait_timeout=604800
interactive_timeout=604800
That’ll do for now. The other alternative would be to check if the connection is alive before running a query, and connecting if it isn’t. I couldn’t get that one to work, though.
simple guide to creating a RAM disk
Posted by aaron in Linux, Server Software on May 20th, 2007
I had a thought of using a RAM disk as a message queue for a messaging application I am working on, as opposed to creating a table in a database, or using a regular flat text file. Every time I need a message to be sent, I’ll add a text file on the RAM disk, then the sending process can scan for files. All of this is happening in RAM, so it won’t be thrashing the disk!
Here is a guide I found which was very straightforward. I got the disk set up in a few minutes.
http://www.vanemery.com/Linux/Ramdisk/ramdisk.html
I didn’t bother changing the size of my disks, and it turns out they defaulted to 16mb anyway.
This can be added to a shell script and set to run on startup by calling it in /etc/rc.local
/sbin/mke2fs -q -m 0 /dev/ram0
/bin/mount /dev/ram0 /mnt/rd
/bin/chown van:root /mnt/rd
/bin/chmod 0750 /mnt/rd
iptables not restoring firewall rules on startup
Posted by aaron in Linux, Server Software on January 31st, 2007
One of my servers has a nasty habit of not reloading the firewall rules when it boots up. Iptables starts up just fine, but doing iptables-save shows that there are no rules listed. The weird part is that on all my other servers, I never had to do anything special to get it to remember the rules on restart.
A short google search and I found the answer. Or at least a solution. Well, I haven’t actually tested it, but it looks like it worked.
service iptables save
It reports that the firewall rules are saved to /etc/sysconfig/iptables.
I found out about that trick here.
2/4 failed drives in a raid5 array??
Posted by aaron in Hardware, Linux, Troubleshooting on December 30th, 2006
Apparently raid arrays don’t like it when you kill power to the machine. We’ve been doing remodeling in the house, and I’ve been turning on and off breakers. Apparently I forgot which breaker the servers were on, and turned it off accidentally a few times.
One of the drives apparently did fail for some reason. I accept this. This happens all the time, and that is exactly why I have a raid 5 array. I was like ok no big deal, I’ll just send it in for an RMA. But shortly after, another drive looked like it failed. I saw this in /proc/mdstat:
…[U__U]
meaning only two of the four drives were left. Future attempts at rebooting the machine resulted in the raid volume not being accessible at all. Other clues that indicated a drive failure:
From /var/log/messages:
Dec 29 19:29:06 onyx kernel: Buffer I/O error on device md0, logical block 0
Dec 29 19:29:06 onyx kernel: lost page write due to I/O error on md0
Dec 29 19:29:06 onyx kernel: EXT2-fs error (device md0): ext2_readdir: bad page in #2
There was also some output in dmesg I found by typing
dmesg | less
(But I didn’t write it down, and now dmesg outputs information from the last boot which successfully brought up the array with 3/4 drives.)
I was convinced I hadn’t actually lost 2/4 drives at the same time, and set out to figure out a way to bring it back.
After several hours of looking through forums and reading the mdadm documentation, I was able to get the array back running on 3/4 drives.
I created the configuration file /etc/mdadm.conf:
DEVICE /dev/sd[abcd]1
ARRAY /dev/md0 devices=/dev/sda1,/dev/sdb1,/dev/sdc1,/dev/sdd1
Then ran:
/sbin/mdadm --assemble -f /dev/md0
mdadm: forcing event count in /dev/sdc1(2) from 1077319 upto 1077330
mdadm: clearing FAULTY flag for device 1 in /dev/md0 for /dev/sdc1
mdadm: /dev/md0 has been started with 3 drives (out of 4).
Now I just have to RMA this drive very quickly before another drive actually does fail.
