Tuesday, August 29, 2006

Creating a 4-disk RAID10 using mdadm

Since I can't seem to find instructions on how to do this (yet)...

I'm going to create a 4-disk RAID10 array using Linux Software RAID and mdadm. The old way is to create individual RAID1 volumes and then stripe a RAID0 volume over the RAID1 arrays. That requires creating extra /dev/mdN nodes which can be confusing to the admin that follows you.

1) Create the /dev/mdN node for the new RAID10 array. In my case, I already have /dev/md0 to /dev/md4 so I'm going to create /dev/md5 (note that "5" appears twice in the command).

# mknod /dev/md5 b 9 5

2) Use fdisk on the (4) drives to create a single primary partition of type "fd" (Linux raid autodetect) on each. Note that I have *nothing* on these brand new drives, so I don't care if it wipes out data.
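
For reference, a typical session on one of the new drives looks roughly like this (the exact prompts vary slightly between fdisk versions):

# fdisk /dev/sdc
(inside fdisk: "n" to create a new primary partition covering the whole disk,
"t" to change its type to "fd" Linux raid autodetect, then "w" to write and exit)
# fdisk -l /dev/sdc
(verify the partition table, then repeat for sdd, sde and sdf)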

3) Create the mdadm RAID set using 4 devices and a level of RAID10.

# mdadm --create /dev/md5 -v --raid-devices=4 --chunk=32 --level=raid10 /dev/sdc1 /dev/sdd1 /dev/sde1 /dev/sdf1

Which will result in the following output:

mdadm: layout defaults to n2
mdadm: size set to 732571904K
mdadm: array /dev/md5 started.

# cat /proc/mdstat

Personalities : [raid1] [raid10]
md5 : active raid10 sdf1[3] sde1[2] sdd1[1] sdc1[0]
1465143808 blocks 32K chunks 2 near-copies [4/4] [UUUU]
[>....................] resync = 0.2% (3058848/1465143808) finish=159.3min speed=152942K/sec


As you can see from the resync speed, the RAID10 array manages around 150MB/s. The regular RAID1 arrays only manage about 75MB/s of throughput (the same as a single 750GB drive).

A final note. My mdadm.conf file is completely empty on this system. That works well for simple systems, but you'll want to create a configuration file in more complex setups.
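
If you do decide to create one, the simplest approach (the same one shown in the Gentoo install posts further down the page) is to let mdadm generate the ARRAY lines and then review them by hand:

# mdadm --detail --scan >> /etc/mdadm.conf
# vi /etc/mdadm.conf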

Updates:

Most of the arrays that I've built have been based on 7200 RPM SATA drives. For small arrays (4 disks with a hot spare), you can often find enough ports on the motherboard. For larger arrays, you'll need to look for PCIe SATA controllers. I've used Promise and 3ware SATA RAID cards. Basically, any card that allows the SATA drives to be seen and is supported directly in the Linux kernel is a good bet (going forward, we're going to switch to Areca at work).

Starting in 2011 we switched over to using LSI SAS adapters with either 8 or 16 ports (connected via 1:4 mini-SAS breakout cables). The latest one that I've had good results with is the LSI SAS 9201-16i. We are using a SuperMicro SAS enclosure along with a total of (10) 300GB 15k RPM SAS drives.

(Note that the SuperMicro CSE-M35T-1 or CSE-M35T-1B is the SATA version.  If you want the SAS version you have to look for CSE-M35TQB or CSE-M35TQ.  These are very good enclosures which fit five 3.5" SAS drives into the space of three 5.25" bays.  The TQB enclosure can hold SAS or SATA drives, while the other version is SATA only.)

Example of creating a 6-drive RAID10 array:

# mdadm --create /dev/md5 --raid-devices=6 --spare-devices=1 --layout=n2 --level=raid10 /dev/sd[a-g]1

In this case, we're setting up a 6-drive RAID10 array along with 1 hot-spare. Disks sda to sdg all have a single partition on them, tagged as "fd" Linux RAID in fdisk.

"n2" is the default RAID10 layout for mdadm and is a good default that provides balanced performance for reads and writes.  It is also the most common RAID 1+0 definition where each of the mirror disks is identical to the other mirror disk with no offset.

"o2" is a modest offset version where each sector on the next disk is slightly offset from the first disk.

"f2" is an optional layout that has better read performance, but worse write performance.

Sunday, August 27, 2006

irq 7: nobody cared (try booting with the "irqpoll" option)

Not sure what I'm going to do about this error on the AMD64 Asus M2N32-SLI Deluxe motherboard.

Aug 27 20:16:42 san1-azure irq 7: nobody cared (try booting with the "irqpoll" option)
Aug 27 20:16:42 san1-azure
Aug 27 20:16:42 san1-azure Call Trace: {__report_bad_irq+48}
Aug 27 20:16:42 san1-azure {note_interrupt+472} {__do_IRQ+183}
Aug 27 20:16:42 san1-azure {do_IRQ+57} {default_idle+0}
Aug 27 20:16:42 san1-azure {ret_from_intr+0} {default_idle+43}
Aug 27 20:16:42 san1-azure {cpu_idle+151} {start_secondary+1141}
Aug 27 20:16:42 san1-azure handlers:
Aug 27 20:16:42 san1-azure [] (usb_hcd_irq+0x0/0x54)
Aug 27 20:16:42 san1-azure Disabling IRQ #7


Putting "irqpoll" on the end of the kernel line in grub.conf causes the system to panic during boot (has to do with the 2nd core in the X2 chip).

Saturday, August 26, 2006

Gentoo AMD64 on Asus M2N32-SLI Deluxe (part 5)

There were a few minor things that I had to fix after the reboot before I could SSH back in: I hadn't pointed the swap line in /etc/fstab at the proper mdadm RAID volume, the 3c905B moved from eth7 to eth4 after the reboot, I had to use the "noapic" kernel option, and I'm still getting the IRQ7 warning.

But the system is mostly in a workable state at this point. So it's time to start installing administration packages and cleaning up the install. I'll save the existing kernel as my "base" configuration in case I screw things up.

The first package that I always install is screen ("emerge screen"). That provides me with multiple virtual terminals in my single SSH connection. Even better, if I disconnect accidentally, I don't lose my state: programs that were running within screen sessions will continue running. After reconnecting, I can type "screen -x" and reattach to my old sessions.
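
A quick sketch of the day-to-day usage (the session name is just an example):

# emerge screen
# screen -S work
(start a named session; detach at any time with Ctrl-a d)
# screen -x work
(reattach later, even from a brand new SSH connection)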

Other key packages to install, even on a basic system like this one, are:

app-benchmarks/bonnie
app-editors/vim
app-misc/colordiff
app-portage/gentoolkit
app-text/tree
dev-util/subversion
net-analyzer/iptraf
net-analyzer/nettop
net-analyzer/nload
net-misc/ntp
sys-apps/dstat
sys-apps/eject
sys-apps/smartmontools
sys-process/atop

san1-azure ~ # emerge -pv bonnie vim colordiff gentoolkit tree subversion iptraf nettop nload ntp dstat eject smartmontools atop

These are the packages that I would merge, in order:

Calculating dependencies ...done!
[ebuild N ] app-benchmarks/bonnie-2.0.6 6 kB
[ebuild N ] dev-util/ctags-5.5.4-r2 254 kB
[ebuild N ] app-editors/vim-core-7.0.17 -acl -bash-completion -livecd +nls 5,997 kB
[ebuild N ] app-editors/vim-7.0.17 -acl -bash-completion -cscope +gpm -minimal (-mzscheme) +nls +perl +python -ruby -vim-pager -vim-with-x 0 kB
[ebuild N ] app-vim/gentoo-syntax-20051221 -ignore-glep31 18 kB
[ebuild N ] app-misc/colordiff-1.0.5-r2 13 kB
[ebuild N ] app-portage/gentoolkit-0.2.2 84 kB
[ebuild N ] app-text/tree-1.5.0 -bash-completion 25 kB
[ebuild N ] dev-libs/apr-0.9.12 +ipv6 -urandom 1,024 kB
[ebuild N ] dev-libs/apr-util-0.9.12 +berkdb -gdbm -ldap 578 kB
[ebuild N ] net-misc/neon-0.26.1 +expat -gnutls +nls -socks5 +ssl -static +zlib 763 kB
[ebuild N ] dev-util/subversion-1.3.2-r1 -apache2 -bash-completion +berkdb -emacs -java +nls -nowebdav +perl +python -ruby +zlib 6,674 kB
[ebuild N ] net-analyzer/iptraf-2.7.0-r1 +ipv6 410 kB
[ebuild N ] net-libs/libpcap-0.9.4 +ipv6 415 kB
[ebuild N ] sys-libs/slang-1.4.9-r2 -cjk -unicode 628 kB
[ebuild N ] net-analyzer/nettop-0.2.3 22 kB
[ebuild N ] net-analyzer/nload-0.6.0 118 kB
[ebuild N ] net-misc/ntp-4.2.0.20040617-r3 -caps -debug +ipv6 -logrotate -openntpd -parse-clocks (-selinux) +ssl 2,403 kB
[ebuild N ] sys-apps/dstat-0.6.0-r1 35 kB
[ebuild N ] sys-apps/eject-2.1.0-r1 +nls 65 kB
[ebuild N ] mail-client/mailx-support-20030215 8 kB
[ebuild N ] net-libs/liblockfile-1.06-r1 31 kB
[ebuild N ] mail-client/mailx-8.1.2.20040524-r1 126 kB
[ebuild N ] sys-apps/smartmontools-5.36 -static 528 kB
[ebuild N ] sys-process/acct-6.3.5-r1 300 kB
[ebuild N ] sys-process/atop-1.15 102 kB

Total size of downloads: 20,638 kB
san1-azure ~ #


That should take all of about 0.1 seconds on this Athlon64 X2. (I joke, slightly... the box is quite snappy even with only 7200rpm SATA drives.)

Now I'm going to configure Subversion. There have been some changes in that process, which I learned through trial and error. Folders that I would recommend placing under version control are:

/boot (most files)
/etc (most files, especially configuration files)
/usr/local/sbin (local sysadmin scripts that you create)
/usr/src (the .config file, make sure you add the actual directory with "svn add -N" before adding the "linux" symbolic link)

I know there are other folders to add, but I typically add them on the fly as I start customizing the system.
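
For what it's worth, the basic in-place setup for /etc goes roughly like this (the repository path matches the /var/svn volume from the install posts below; the files you add will vary):

# svnadmin create /var/svn/config
# svn mkdir file:///var/svn/config/etc -m "area for /etc"
# cd /etc
# svn checkout file:///var/svn/config/etc .
# svn add fstab conf.d make.conf
# svn commit -m "initial checkin of /etc"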

Gentoo AMD64 on Asus M2N32-SLI Deluxe (part 4)

This is a record of the kernel flags that I'm going to use for my AMD64 system. It's an Asus M2N32-SLI Deluxe (NVIDIA nForce 590 SLI MCP chipset) with an Athlon64 X2 4200+ chip along with 2GB of RAM. Hard drives are hooked up to the onboard SATA-II controller (NVIDIA nForce 590 SLI MCP chipset). Plus the motherboard has a pair of onboard gigabit ethernet NICs (Marvell 88E1116) and a Silicon Image Sil3132 SATA-II controller. Other chips on the motherboard are the nVidia C51XE, nVidia MCP55PXE, AD1988B, and TSB43AB22A.

In addition, I'll have even more hard drives hooked up to a HighPoint RocketRAID 2300 PCIe card. There's also a 3Com 3C905B PCI ethernet card installed along with a pair of Intel PRO/1000 PCIe gigabit NICs.

# emerge mdadm
# emerge lvm2
# cd /usr/src/linux
# make menuconfig


Linux Kernel v2.6.17-gentoo-r4 Configuration
Code maturity level options
General setup
Loadable module support
Processor type and features
--> Processor family (changed to "AMD-Opteron/Athlon64")
--> Preemption Model (No Forced Preemption (Server))
Power management options (ACPI, APM)
Bus options (PCI, etc.)
Executable file formats
Device drivers
--> ATA/ATAPI/MFM/RLL support
--> --> generic/default IDE chipset support (should already be ON)
--> --> --> ATI IXP chipset IDE support (turn OFF)
--> --> --> Intel PIIXn chipsets support (turn OFF)
--> --> --> IT821X IDE support (turn OFF)
--> SCSI device support
--> --> SCSI generic support (turn this ON)
--> --> SCSI low-level drivers
--> --> --> Serial ATA (SATA) support (should already be ON)
--> --> --> --> Intel PIIX/ICH SATA support (turn OFF)
--> --> --> --> Silicon Image SATA support (turn OFF)
--> --> --> --> Silicon Image 3124/3132 SATA support (turn ON as BUILT-IN)
--> --> --> --> VIA SATA support (turn OFF)
--> Multi-device support (should already be ON)
--> --> RAID support (turn it ON as BUILT-IN)
--> --> --> RAID-1 mirroring mode (turn it ON as BUILT-IN)
--> --> --> RAID-10 mirroring striping mode (turn it ON as BUILT-IN)
--> --> Device mapper support (turn ON as BUILT-IN)
--> Networking support
--> --> Ethernet (1000Mbit)
--> --> --> Intel(R) PRO/1000 Gigabit Ethernet support (turn ON)
--> --> --> Broadcom Tigon3 support (turn OFF)
--> Character Devices
--> --> Intel/AMD/VIA HW Random Number Generator (should be ON)
--> --> Intel 440LX/BX/GX, I8xx and E7x05 chipset support (turn it OFF)
--> Sound
--> --> Sound card support (turn OFF)
File systems
--> Network File Systems
--> --> SMB file system support (turn ON as BUILT-IN)
--> --> CIFS support (turn ON as BUILT-IN)
Profiling support
Kernel hacking
Security options
Cryptographic options
--> Cryptographic API (turn ON)
--> --> HMAC support (NEW) (turn ON as BUILT-IN)
--> --> (turn ON all other options as MODULE)
Library routines

Now we can compile and copy the kernel to the /boot partition.

# make && make modules_install
# ls -l /boot
# ls -l arch/x86_64/boot
# df
# cp arch/x86_64/boot/bzImage /boot/kernel-2.6.17-25Aug2006-2300
# cp System.map /boot/System.map-2.6.17-25Aug2006-2300
# cp .config /boot/config-2.6.17-25Aug2006-2300
# ls -l /boot


Next is Chapter 8, Configuring your System.

(chroot) livecd linux # nano -w /etc/fstab

My fstab (there are lines not shown):

/dev/md0                /boot           ext2            noauto,noatime  1 2        
/dev/md1 / ext3 noatime 0 1
/dev/md3 none swap sw 0 0
/dev/cdroms/cdrom0 /mnt/cdrom iso9660 noauto,ro 0 0
#/dev/fd0 /mnt/floppy auto noauto 0 0

/dev/vgmirror/home /home ext3 noatime 0 3
/dev/vgmirror/tmp /tmp ext2 noatime 0 3
/dev/vgmirror/vartmp /var/tmp ext2 noatime 0 3
/dev/vgmirror/log1 /var/log ext3 noatime 0 3
/dev/vgmirror/portage /usr/portage ext3 noatime 0 3

/dev/vgmirror/svn /var/svn ext3 noatime 0 4
/dev/vgmirror/backupsys /backup/system ext3 noatime 0 4


Now for some final clean-up work:

(chroot) livecd linux # nano -w /etc/conf.d/hostname
(chroot) livecd linux # nano -w /etc/conf.d/net
config_eth7=( "192.168.142.100 netmask 255.255.255.0" )
routes_eth7=( "default gw 192.168.142.1" )
(chroot) livecd linux # cd /etc/init.d
(chroot) livecd init.d # ln -s net.lo net.eth7
(chroot) livecd init.d # rc-update add net.eth7 default
* net.eth7 added to runlevel default
* rc-update complete.
(chroot) livecd init.d # cat /etc/resolv.conf
(verify your DNS servers if you specified a static IP)
(chroot) livecd init.d # nano -w /etc/conf.d/clock
CLOCK_SYSTOHC="yes"
(chroot) livecd init.d # passwd
(set your root password to something you will remember)
(chroot) livecd init.d # passwd
New UNIX password:
Retype new UNIX password:
passwd: password updated successfully
(chroot) livecd init.d #
# emerge syslog-ng
# rc-update add syslog-ng default
# emerge dcron
# rc-update add dcron default
# crontab /etc/crontab
# /usr/bin/ssh-keygen -t dsa -b 2048 -f /etc/ssh/ssh_host_dsa_key -N ""
(the key may take a minute to generate)
# chmod 600 /etc/ssh/ssh_host_dsa_key
# chmod 644 /etc/ssh/ssh_host_dsa_key.pub
# rc-update add sshd default


Now it's time for grub.

(chroot) livecd init.d # emerge grub
(chroot) livecd init.d # ls -l /boot
total 3468
-rw-r--r-- 1 root root 1090703 Aug 26 00:35 System.map-2.6.17-25Aug2006-2300
lrwxrwxrwx 1 root root 1 Aug 25 18:09 boot -> .
-rw-r--r-- 1 root root 28714 Aug 26 00:35 config-2.6.17-25Aug2006-2300
drwxr-xr-x 2 root root 1024 Aug 26 01:03 grub
-rw-r--r-- 1 root root 2397504 Aug 26 00:35 kernel-2.6.17-25Aug2006-2300
drwx------ 2 root root 12288 Aug 25 16:46 lost+found
(chroot) livecd init.d # nano -w /boot/grub/grub.conf
# Which listing to boot as default. 0 is the first, 1 the second etc.
default 0
timeout 30

# Aug 2006 Base Installation (software RAID, LVM2)
title=Gentoo Linux 2.6.17 (Aug 25 2006) BASE INSTALL
root (hd0,0)
kernel /kernel-2.6.17-25Aug2006-2300 root=/dev/md1

# Aug 2006 Base Installation (software RAID, LVM2) - NOAPIC
title=Gentoo Linux 2.6.17 (Aug 25 2006) BASE NOAPIC
root (hd0,0)
kernel /kernel-2.6.17-25Aug2006-2300 root=/dev/md1 noapic
(chroot) livecd init.d # grub --no-floppy
grub> find /grub/stage1
(hd0,0)
(hd1,0)
grub> root (hd0,0)
grub> setup (hd0)
grub> device (hd0) /dev/sdb
grub> root (hd0,0)
grub> setup (hd0)
grub> quit


Time to exit the chroot, unmount everything, and try a reboot.

livecd / # cat /proc/mounts
rootfs / rootfs rw 0 0
tmpfs / tmpfs rw 0 0
/dev/hda /mnt/cdrom iso9660 ro 0 0
/dev/loop/0 /mnt/livecd squashfs ro 0 0
proc /proc proc rw,nodiratime 0 0
sysfs /sys sysfs rw 0 0
udev /dev tmpfs rw,nosuid 0 0
devpts /dev/pts devpts rw 0 0
tmpfs /mnt/livecd/lib64/firmware tmpfs rw 0 0
tmpfs /mnt/livecd/usr/portage tmpfs rw 0 0
usbfs /proc/bus/usb usbfs rw 0 0
/dev/md1 /mnt/gentoo ext3 rw,data=ordered 0 0
/dev/md0 /mnt/gentoo/boot ext2 rw,nogrpid 0 0
/dev/vgmirror/tmp /mnt/gentoo/tmp ext2 rw,nogrpid 0 0
/dev/vgmirror/vartmp /mnt/gentoo/var/tmp ext2 rw,nogrpid 0 0
/dev/vgmirror/home /mnt/gentoo/home ext3 rw,data=ordered 0 0
/dev/vgmirror/portage /mnt/gentoo/usr/portage ext3 rw,data=ordered 0 0
/dev/vgmirror/log1 /mnt/gentoo/var/log ext3 rw,data=ordered 0 0
/dev/vgmirror/svn /mnt/gentoo/var/svn ext3 rw,data=ordered 0 0
/dev/vgmirror/backupsys /mnt/gentoo/backup/system ext3 rw,data=ordered 0 0
none /mnt/gentoo/proc proc rw,nodiratime 0 0
udev /mnt/gentoo/dev tmpfs rw,nosuid 0 0
livecd / # unmount /mnt/gentoo/backup/system /mnt/gentoo/var/svn /mnt/gentoo/var/log /mnt/gentoo/usr/portage
-bash: unmount: command not found
livecd / # umount /mnt/gentoo/backup/system /mnt/gentoo/var/svn /mnt/gentoo/var/log /mnt/gentoo/usr/portage
livecd / # umount /mnt/gentoo/home /mnt/gentoo/var/tmp /mnt/gentoo/tmp
livecd / # umount /mnt/gentoo/boot /mnt/gentoo/dev /mnt/gentoo/proc /mnt/gentoo
livecd / # reboot


Remove the LiveCD and cross your fingers. Success!

Friday, August 25, 2006

Gentoo AMD64 on Asus M2N32-SLI Deluxe (part 3)

It's now time to start referring to the Gentoo Installation Handbook for AMD64. While I'm pretty sure that my install method works, it's worth verifying against the handbook. Plus I'm using a minimal CD to do the installation, so things will be slightly different than normal.

(I use my own recipe for the initial configuration due to the mix of Software RAID + LVM2. It has served me well over the past few years and works well.)

This page starts with section 5 in the Gentoo handbook (Installing the Gentoo Installation Files).

livecd / # date
Fri Aug 25 21:23:52 UTC 2006
livecd / # cd /mnt/gentoo
livecd gentoo # links http://www.gentoo.org/main/en/mirrors.xml


Follow the directions on the Gentoo handbook page to download the correct stage3 tarball for your install. The steps are roughly thus:


  1. Pick an HTTP mirror from the list (arrow up/down then press [Enter] on the link)
  2. Go into the "releases/" folder
  3. Go into the "amd64/" folder (note: Not all mirrors carry the AMD64 folder, you may need to pick another mirror)
  4. Go into the "current/" folder
  5. Go into the "stages/" folder
  6. Find the stage3-amd64-NNNN.N.tar.bz2 tarball, highlight it, and press [D] to download
  7. Press [Q] to quit out of links
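
(If you already know the address of a good mirror, you can skip links and grab the tarball directly; the path follows the same releases/amd64/current/stages/ layout shown above, with SOMEMIRROR replaced by a real mirror name.)

livecd gentoo # wget http://SOMEMIRROR/releases/amd64/current/stages/stage3-amd64-2006.0.tar.bz2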


Now you should have the tarball in /mnt/gentoo:

livecd gentoo # ls -l
total 105797
drwxr-xr-x 3 root root 4096 Aug 25 21:11 backup
drwxr-xr-x 3 root root 1024 Aug 25 20:46 boot
drwxr-xr-x 3 root root 4096 Aug 25 20:59 home
drwx------ 2 root root 16384 Aug 25 20:47 lost+found
-rw-r--r-- 1 root root 108186115 Aug 25 22:05 stage3-amd64-2006.0.tar.bz2
drwxrwxrwt 3 root root 4096 Aug 25 20:59 tmp
drwxr-xr-x 3 root root 4096 Aug 25 21:00 usr
drwxr-xr-x 5 root root 4096 Aug 25 21:10 var
livecd gentoo #


Extract the tarball:

livecd gentoo # tar xvjpf stage3-*.tar.bz2

You'll follow similar steps for the portage tarball. In fact, we probably should've downloaded it at the same time to /mnt/gentoo.

livecd gentoo # tar xvjf /mnt/gentoo/portage-20060123.tar.bz2 -C /mnt/gentoo/usr

Set up your make flags. Since I have a dual-core X2 CPU, I'm using "-j3" for MAKEOPTS.

livecd gentoo # vi /mnt/gentoo/etc/make.conf
# These settings were set by the catalyst build script that automatically built this stage
# Please consult /etc/make.conf.example for a more detailed example
CFLAGS="-march=k8 -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CXXFLAGS="${CFLAGS}"
MAKEOPTS="-j3"


Now we start in on Section 6 (Installing the Gentoo Base System). Time to pick mirrors and other things.

livecd gentoo # mirrorselect -i -o >> /mnt/gentoo/etc/make.conf
livecd gentoo # mirrorselect -i -r -o >> /mnt/gentoo/etc/make.conf
livecd gentoo # cat /mnt/gentoo/etc/make.conf
# These settings were set by the catalyst build script that automatically built this stage
# Please consult /etc/make.conf.example for a more detailed example
CFLAGS="-march=k8 -O2 -pipe"
CHOST="x86_64-pc-linux-gnu"
CXXFLAGS="${CFLAGS}"
MAKEOPTS="-j3"


GENTOO_MIRRORS="http://gentoo.arcticnetwork.ca/ http://www.gtlib.gatech.edu/pub/gentoo http://gentoo.chem.wisc.edu/gentoo/ http://gentoo.mirrors.pair.com/ "

SYNC="rsync://rsync.namerica.gentoo.org/gentoo-portage"
livecd gentoo #


Copy the resolv.conf file and mount /proc and /dev:

livecd gentoo # cp -L /etc/resolv.conf /mnt/gentoo/etc/resolv.conf
livecd gentoo # mount -t proc none /mnt/gentoo/proc
livecd gentoo # mount -o bind /dev /mnt/gentoo/dev


We're ready to chroot and start the build.

livecd gentoo # chroot /mnt/gentoo /bin/bash
livecd / # env-update
>>> Regenerating /etc/ld.so.cache...
livecd / # source /etc/profile
livecd / # export PS1="(chroot) $PS1"
(chroot) livecd / #


Read the next section carefully! I use an extremely limited set of USE flags (turning off all multimedia and graphical support).

(chroot) livecd / # emerge --sync

(chroot) livecd / # ls -FGg /etc/make.profile
lrwxrwxrwx 1 50 Aug 25 22:10 /etc/make.profile -> ../usr/portage/profiles/default-linux/amd64/2006.0/
(chroot) livecd / # nano -w /etc/make.conf
USE="-alsa -apm -arts -bitmap-fonts -gnome -gtk -gtk2 -kde -mad -mikmod -motif -opengl -oss -qt -quicktime -sdl -truetype -truetype-fonts -type1-fonts -X -xmms -xv"
(chroot) livecd / # ls /usr/share/zoneinfo
(chroot) livecd / # ln -sf /usr/share/zoneinfo/EST5EDT /etc/localtime
(chroot) livecd / # date
(chroot) livecd / # zdump GMT
(chroot) livecd / # zdump EST5EDT


Read section 7 carefully. I'm using the default "gentoo-sources" kernel.

(chroot) livecd / # USE="-doc symlink" emerge gentoo-sources

I'll cover configuration of the kernel in the next post.

Gentoo AMD64 on Asus M2N32-SLI Deluxe (part 2)

As I said last time, I'm setting this unit up with the following partitions on the 2-disk RAID1 set (sda and sdb):

sda1 (md0) - 128MB for /boot
sda2 (md1) - 8GB for the root partition (primary)
sda3 (md2) - 8GB for a root partition (backup operating system)
sda4 - place-holder for extended partition
sda5 (md3) - 4GB swap partition
sda6 (md4) - 679GB in LVM2 volume group (vgmirror)

Since I already created the RAID arrays last time, this time I only need to start up the RAID sets.

# modprobe md
# modprobe raid1
# for i in 0 1 2 3 4; do mknod /dev/md$i b 9 $i; done
# mdadm --assemble /dev/md0 /dev/sda1 /dev/sdb1
# mdadm --assemble /dev/md1 /dev/sda2 /dev/sdb2
# mdadm --assemble /dev/md2 /dev/sda3 /dev/sdb3
# mdadm --assemble /dev/md3 /dev/sda5 /dev/sdb5
# mdadm --assemble /dev/md4 /dev/sda6 /dev/sdb6


The LVM2 area on /dev/md4 is already created as well:

# modprobe dm-mod
# pvscan
PV /dev/md4 VG vgmirror lvm2 [679.39 GB / 679.39 GB free]
Total: 1 [679.39 GB] / in use: 1 [679.39 GB] / in no VG: 0 [0 ]
# vgscan
Reading all physical volumes. This may take a while...
Found volume group "vgmirror" using metadata type lvm2


Create the basic mdadm configuration file. While mdadm is able to figure out most things automatically, it's useful to give it hints.

# mdadm --detail --scan >> /etc/mdadm.conf
# vi /etc/mdadm.conf


Here's my mdadm.conf file. Notice the use of UUIDs by mdadm to ensure that it always matches up the correct partitions with the mdadm device numbers. This also should ensure that the mdadm device numbers never change.

ARRAY /dev/md4 level=raid1 num-devices=2 UUID=9b76e544:c3775946:1458656b:c78ce692
ARRAY /dev/md3 level=raid1 num-devices=2 UUID=a441836c:a3801fed:a6e616da:dd829ebc
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=84c2076e:edd3ceaf:595ae236:381044ca
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=b4da9f10:265c3868:db128369:583c900e
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=ada9bc71:2044d255:74204255:b1ba5cd1


Next we create the file systems:

livecd / # mke2fs /dev/md0
livecd / # mke2fs -j /dev/md1
livecd / # mke2fs -j /dev/md2
livecd / # mkswap /dev/md3 ; swapon /dev/md3
livecd / # mount /dev/md1 /mnt/gentoo
livecd / # mkdir /mnt/gentoo/boot ; mount /dev/md0 /mnt/gentoo/boot


For the LVM2 volumes, things are a bit more complex. The majority of the action is going to take place inside of the root volumes, since we are only doing a minimal build in order to prep for a Xen Domain0 kernel. However, there are a few volumes that are worth placing in LVM so that they are available to both the primary and the backup operating system. (Mostly /home and /usr/portage, along with the temporary volumes.)

# lvcreate -L2G -ntmp vgmirror
# lvcreate -L2G -nvartmp vgmirror
# lvcreate -L2G -nhome vgmirror
# lvcreate -L4G -nportage vgmirror
# ls -l /dev/vgmirror
# lvscan
# mke2fs /dev/vgmirror/tmp
# mke2fs /dev/vgmirror/vartmp
# mke2fs -j /dev/vgmirror/home
# mke2fs -j /dev/vgmirror/portage
# mkdir /mnt/gentoo/tmp ; mount /dev/vgmirror/tmp /mnt/gentoo/tmp
# chmod 1777 /mnt/gentoo/tmp
# mkdir /mnt/gentoo/var
# mkdir /mnt/gentoo/var/tmp ; mount /dev/vgmirror/vartmp /mnt/gentoo/var/tmp
# chmod 1777 /mnt/gentoo/var/tmp
# mkdir /mnt/gentoo/home ; mount /dev/vgmirror/home /mnt/gentoo/home
# mkdir /mnt/gentoo/usr
# mkdir /mnt/gentoo/usr/portage ; mount /dev/vgmirror/portage /mnt/gentoo/usr/portage


Now for the supplementary volumes (logs, subversion and system backup). I'm using separate volumes for the log files of the primary vs secondary operating system. The secondary (backup) O/S only gets 1GB of space for its log files.

# lvcreate -L4G -nlog1 vgmirror
# lvcreate -L1G -nlog2 vgmirror
# lvcreate -L2G -nsvn vgmirror
# lvcreate -L16G -nbackupsys vgmirror
# mke2fs -j /dev/vgmirror/log1
# mke2fs -j /dev/vgmirror/log2
# mke2fs -j /dev/vgmirror/svn
# mke2fs -j /dev/vgmirror/backupsys
# mkdir /mnt/gentoo/var/log ; mount /dev/vgmirror/log1 /mnt/gentoo/var/log
# mkdir /mnt/gentoo/var/svn ; mount /dev/vgmirror/svn /mnt/gentoo/var/svn
# mkdir /mnt/gentoo/backup
# mkdir /mnt/gentoo/backup/system ; mount /dev/vgmirror/backupsys /mnt/gentoo/backup/system


Whew, that's a big batch of LVM volumes. But it prevents problems down the road.

At this point, everything is set up and ready for the initial install (or a chroot into an existing system for repairs).

Gentoo AMD64 on Asus M2N32-SLI Deluxe (part 1)

Time to build the base Gentoo Linux O/S. While I plan on switching over to a Xen hypervisor kernel in a few days, I still need to get a base Gentoo system up and running. But first, let me document what sort of machine I'm building and the reasoning behind some of the decisions.

The unit is a custom-built system that will serve as a test unit for building out an iSCSI SAN (and eventually serve as part of that SAN if it tests well). There will be multiple NICs so that I can bond NICs for bandwidth and so that I can connect NICs to multiple switches in the SAN mesh (for fault-tolerance). Initially, it will have only (2) SATA drives installed, but the eventual loadout will have a total of (14) SATA drives.

Since I needed (2) SATA RAID cards and (2) Intel PRO/1000 dual-port server gigabit NICs, I needed a motherboard with multiple PCIe slots. The Asus M2N32-SLI Deluxe meets that requirement with (2) x16, (1) x4, and (1) x1 slots. Plus it has (2) PCI slots. These slots will be populated as follows:

PCIe x16: Intel PRO/1000 dual-port PCIe x4
PCIe x4: 8-port SATA RAID card PCIe x4
PCIe x1: HighPoint RocketRAID 2300 SATA-II 4-port PCIe x1
PCIe x16: Intel PRO/1000 dual-port PCIe x4
PCI: 3com 3C905B NIC
PCI: old PCI video card

So I'm using the x16 slots for something other than a video card (which works on newer BIOS revisions).

Plus, the motherboard is an AM2 board with the newer AMD Pacifica virtualization technology (AMD-V). That will do a better job of running Xen than the current 940-pin Opteron CPUs. This motherboard also supports ECC memory, which I will be installing in a few weeks. Other features include: (6) SATA-II ports, (2) gigabit Marvell NICs, (1) internal SATA-II port on a Silicon Image chip, (1) eSATA port on the Silicon Image chip, (10) USB ports, and (2) FireWire ports.

Not a cheap board ($200 retail), but it has quite a bit of headroom. The PCIe architecture should also perform better than the older PCI motherboards, which is important in a 14-disk SAN unit.

All of this is being installed in a Thermaltake Armor tower case (VA8000BNS). The Armor case has (2) internal hard drive bays (actually 3, but I'm only using 2 for thermal reasons) and (11) 5.25" bays on the front. One of those 5.25" bays is occupied by the floppy drive mount location, the power and reset switches, and the power and HD LEDs. That leaves me with (10) 5.25" bays.

In those (10) 5.25" bays, I'm installing a DVD-RW in the top and then filling the rest of the (9) bays with 4:3 SATA hot-plug back planes. These will hold (12) SATA-II hard drives and allow us to swap out hard drives easily (even if we don't hotplug we can still minimize downtime). The two internal drives are not as easy to replace, but they are still fairly easy to get at and replace in under 30 minutes.

My initial configuration plan is a (2) drive RAID1 using the internal bays and the motherboard SATA-II connectors. That will allow me to get the OS up and running and do some limited testing. After that I will add (4) HDs to the first hotswap bay, connect them to the HighPoint 2300 card, and run them as RAID10 for twice the performance of a 2-drive RAID1 set.

Down the road we will add the 8-port SATA RAID card and fill the other 8 bays in the front. These will be configured as a (6) drive RAID10 set (3x performance) with (2) hot spare drives. That will fill out the unit. If I'm pressed for space and capacity, I could add a 3rd drive to the internal drive bay as a hot spare (risking more heat) and set up the last (8) drives up front as an 8-drive RAID10.

The current power supply is a 750W ThermalTake ToughPower unit. Ideally, I'd install a redundant PSU in this case, but I'll tackle that at a future date. One thing that I do wish I had done was buy the "modular" 750W PSU. On that unit, the component power cables connect to plugs on the PSU, making it easier to swap out a failed PSU without re-wiring all of the components. However, given all of the room inside the Armor case, re-wiring is not that difficult or time-consuming.

I'm also not ready to spend $500-$600 on a redundant PSU until I know whether the 750W can handle (15) hard drives plus an Athlon64 X2 4200+ with multiple NICs and expansion cards. I'm pretty sure that it will. The base recommendation for an Athlon64 X2 with a simple setup is 300-350W. Figure that each additional hard drive adds 20W to the load (overstatement) and 15x20W = 300W. That puts us up around 600-650W not including the extra expansion cards. So the 750W should perform very well.

...

For the install, I plan on a partition layout like so (both sda and sdb are configured identically and then mirrored together):

Disk /dev/sda: 750.1 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes

Device Boot Start End Blocks Id System
/dev/sda1 1 17 136521 fd Linux raid autodetect
/dev/sda2 18 1015 8016435 fd Linux raid autodetect
/dev/sda3 1016 2013 8016435 fd Linux raid autodetect
/dev/sda4 2014 91201 716402610 5 Extended
/dev/sda5 2014 2512 4008186 fd Linux raid autodetect
/dev/sda6 2513 91201 712394361 fd Linux raid autodetect


Partition #1 is the boot partition. I generally go with 128MB as it allows me to have a dozen or so kernels set up in grub.

Partition #2 and #3 are 8GB install partitions. I plan on installing once to the first 8GB partition, then cloning the install to the second 8GB partition. That way, worst case, if the primary install gets hosed, we can boot to the second partition which is still functional.

Partition #4 is simply the extended (logical) partition place-holder.

Partition #5 will be used for the Linux swap area. I always mirror my swap so that the machine keeps running even if one of the drives fails.

Partition #6 is for LVM2 and will be sub-divided up. I estimate that I'll use about 50-100GB for the O/S which will leave 600GB or so for use by clients.

That's a fairly conservative partition layout and should provide me with enough flexibility to recover from most situations without having to toss an install CD into the unit.
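
As for cloning the primary install onto partition #3, a straight rsync from the running system should do it. A rough sketch, assuming /dev/md2 has already been formatted:

# mkdir /mnt/rootclone
# mount /dev/md2 /mnt/rootclone
# rsync -aHx --delete / /mnt/rootclone/
(-x keeps rsync on the root filesystem, so /boot, /proc and the LVM mounts are skipped)
# nano -w /mnt/rootclone/etc/fstab
(point the root entry at /dev/md2, then umount and add a grub.conf entry with root=/dev/md2)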

...

Other sysadmin tricks that I plan on using are:

- Using Subversion to store the contents of /boot, /etc, and other system configuration files. That will give me a history of changes to the system. This has been working very well on my other systems.

- Using Bacula or rdiff-backup or rsync or dd to create snapshots of the working root volumes. In the worst case, we restore the root volumes from backups.

- Sharing the portage tree between the two root partitions. This is low-risk and will save space and time.

- Sharing the /tmp and /var/tmp LVM partitions between the two root partitions. Naturally, the swap partition is also shared between the two root partitions.

- Putting /home on its own LVM partition and sharing it between the two root partitions. Moderately risky, but since this is a server-only headless setup with no users, it's not a big deal.

...

So it's a moderately complex setup, but provides me with multiple fall-back positions and restoration options. Anything from reverting a configuration file to a newer version, to booting a known-good root partition, to restoring the system from backups.

And there's always the last resort of putting a Gentoo boot CD in the unit, starting networking and the SSH daemon, and attempting to fix the unit remotely.

Thursday, August 24, 2006

Gentoo AMD64 and Asus M2N32-SLI Deluxe

Finally got my Asus M2N32-SLI Deluxe motherboard set up and ready for installation. Tossed the 2006.0 Gentoo AMD64 CD in and told it to use 80x25 for the old PCI video card that I have in it. Unfortunately, it hangs after a bit.

So my first plan of attack is to update the Asus BIOS from 0406 to 0603 (which is 2 revisions newer). The 0603 revision was released on June 29, 2006. The Asus BIOS includes an EZ-FLASH tool in the BIOS Setup (Tools menu), all I have to do is burn the BIOS file to a CD-ROM or diskette.

Update: It was actually easier to put the BIOS update on a USB flash drive and connect it to one of the USB ports.

Update #2: I had to boot the kernel using:

boot: gentoo noapic noacpi

Which gets me past the hang at:

io scheduler noop registered
io scheduler deadline registered


I'm also using a very old PCI video card so I have to specify my video mode at the prompt (I usually pick 80x43).

Instead it now hangs at "Letting udev process events". Could be time to try the "gentoo-nofb noapic" kernel option (no frame buffer). Hmm, that got me farther, but still no luck. I'm reading that the nForce 590 chipsets aren't well supported by Linux yet. Trying with the "noapic" option alone resulted in a hang while entering runlevel 3.

Time for plan B... seeing if the latest Ubuntu 6.06 CD works on this system. That may give me some hints. From what I'm reading I have to use the "apic=off" option on the Ubuntu 6.06 CD. Hmm... that hung as well.

# gentoo-nofb noapic noacpi nolapic

Hmm... still hangs. Off to do some more research.

Update #3: Finally got a system that seems to be working.

Reverted the BIOS from the 0604 revision back to the 0503 revision. Now I can boot the Gentoo AMD64 2006.0 minimal CD using the options gentoo-nofb noapic. I still get the IRQ7 error that seems to be bugging other Gentoo users, but the system is stable enough to start the install.

Xen plans

So as I wait for parts to be delivered for the Test SAN unit, I'm reading up on Xen and iSCSI and debating what my plan of attack will be. My options (roughly) are:

1) Build a normal Gentoo server and load iscsitarget on top of it. This is the simplest of all options and something that I'm very comfortable with. However, if I then need to run other services on top of the unit, I run the risk of making the iSCSI server portion unstable.

2) Xen with various services running in DomUs. A bit more complex but offers a lot more flexibility. I can start setting up virtual servers for the sub-tasks and then migrate them off of the test unit when I have more hardware available.

What I'm not sure about is whether to run iscsitarget in its own DomU or in Dom0. If I run iscsitarget in its own DomU, then it becomes easy to restart the iSCSI target in the worst case without taking down the entire box.
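
Wherever it ends up running, the iscsitarget configuration itself is mostly just an entry per exported volume in /etc/ietd.conf. A minimal sketch (the IQN and the logical volume name are made up for illustration):

Target iqn.2006-08.lan.home:san1.test0
    Lun 0 Path=/dev/vgmirror/testvol,Type=fileio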

Links:

Infrastructure virtualisation with Xen advisory

Monday, August 21, 2006

Rsync and SSH on Windows 2003 Server

Taking another stab at setting up RSync and SSH on our Windows 2003 servers. The goal is that we can upload web files to a central server and then have it synchronize the other servers in the array. Once again, I'm going to use the cwRsync and copSSH packages (latest version is 2.0.9).

Installation on a Windows 2003 Domain Controller:

  1. Download cwRSync, open up the ZIP file, then extract/run cwRsync_Server_x.x.x_Installer.exe.
  2. Click "Next" to move past the splash screen
  3. Click "I Agree" to move past the license screen
  4. Select both the "Rsync Server" and "OpenSSH Server" (unless you have already installed and configured SSH) then click "Next"
  5. Choose your installation location, the default is "C:\Program Files\cwRsyncServer"
  6. Click "Install" to begin the installation process
  7. cwRsync will install and create a default service account with a randomly generated password.
  8. Write down the service account password.
  9. Click "Close" when the install has finished.


So now if you look in "Active Directory Users and Computers", there should be a newly created account called "SvcwRsync". Since we are installing this on a domain controller, you should rename this account to "SvcwRsync_SERVERNAME" so that it doesn't cause problems for other installations. You'll also need to change the login details for the "RsyncServer" and "OpenSSH SSHD" services.

Once you have things configured, make sure to go to the Services control panel and set both services to start up automatically. I also recommend configuring the Recovery tab so that the services are automatically restarted after 2 or 5 minutes.
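
The same changes can be scripted from a command prompt if you're doing this on several servers; roughly (the account is the renamed one from above, the password is a placeholder, and the DOMAIN\ prefix should match wherever the account actually lives):

sc config RsyncServer obj= "DOMAIN\SvcwRsync_SERVERNAME" password= "ServicePasswordHere" start= auto
sc config "OpenSSH SSHD" obj= "DOMAIN\SvcwRsync_SERVERNAME" password= "ServicePasswordHere" start= auto
sc failure RsyncServer reset= 86400 actions= restart/120000/restart/300000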

...

Now to start locking things down. First, I'm going to restrict which interfaces (IP addresses) the cwRsync service can listen on by adding an address line to rsyncd.conf.

address = 127.0.0.1

On the machine that you will be using to talk to the rsync daemon on the host server, you'll also need the cwRsync tools installed along with OpenSSH. Because the rsync daemon now only listens on 127.0.0.1 (localhost), we'll need to create an SSH tunnel from the client machine to the host server before we can talk to it.

On the client machine:

1. Create a new folder under "C:\Program Files\cwRsyncServer\home" for the new user. In my particular case, I'm calling my user "backuppull" because I am pulling backup files off of the rsync server and down to my local machine.

2. Create a ".ssh" folder under that new home folder.

3. Open up a command window (Start, Run, "cmd") and change directories to the home folder ("C:\Program Files\cwRsyncServer\home\backuppull")

4. Create SSH keys for this user. Since we want to do this sync in a batch file without user interaction, they'll need to be created with empty passphrases. You may wish to use the "-b 2048" option to create stronger keys (recommended for RSA; DSA keys can only be 1024 bits).

mkdir .ssh
..\..\ssh-keygen -t rsa -N "" -b 2048 -f .ssh\id_rsa
..\..\ssh-keygen -t dsa -N "" -b 1024 -f .ssh\id_dsa

5. You will now need to transfer the public key files to the host server. Again, you will create a new home directory for the user in the "C:\Program Files\cwRsyncServer\home" folder tree along with creating a ".ssh" folder under that home folder. The two files that need to be copied are:

id_dsa.pub
id_rsa.pub

6. Now append the contents of these files to the ".ssh/authorized_keys" file on the host server.

type id_dsa.pub >> authorized_keys
type id_rsa.pub >> authorized_keys

7. Now to configure SSHD on the host server. You will need to find and edit the sshd_config file (probably in "C:\Program Files\cwRsyncServer\etc"). The following changes should be made to the default settings in the current version.

PermitRootLogin no
PasswordAuthentication no
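
With the keys in place and sshd locked down, the actual pull from the client machine ends up looking roughly like this (the host name, rsync module name, local path, and the 8730 local port are all placeholders):

ssh -N -L 8730:127.0.0.1:873 backuppull@hostserver01
(leave that window open, then from a second command window:)
rsync -av rsync://localhost:8730/backups/ /cygdrive/c/backups/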

Sunday, August 20, 2006

Benchmarking - hdparm (follow-up)

If you recall, the last time I tested hdparm on my Celeron 566MHz system, I was getting very poor read performance on /dev/hda.

So I added in a HTP302 (HighPoint Rocket133) 2-port PCI card and moved hda over to the new card.

I now get 20MB/s buffered reads from both the primary and secondary disk in the RAID1 array and 30MB/s buffered reads from the mirror set, which is a lot better than the 3.1MB/s I was getting from hda and the 3.5MB/s from the overall RAID1 array.

That should make the system feel a lot snappier. Now the bottleneck will probably be the CPU or the 100Mbit ethernet card.

Saturday, August 19, 2006

SAN testing switch

For testing out the SAN, it looks like I can make use of either an SMC SMCGS16-SMART or an SMC SMCGS24-SMART switch. These are 16- and 24-port gigabit switches that support link aggregation. Price on the 16-port unit is only $260 or so, and the 24-port switch sells for $320.

I'm still figuring out how I want to support bonding in the unit. I have the (4) ports from the Intel Pro adapters plus the (2) ports on the motherboard. My initial plan is to bond the Intel adapters together into two pairs, then hook each pair to a different switch for fault-tolerance. That would give me either 200MB/s or 400MB/s of aggregate bandwidth.

I'm still trying to find out what I would need to do on the switches to support this. It's possible that the two switches would need to have some sort of interconnects. Alternately, I may simply have (2) switches installed and bond all (4) Intel NICs together as a single adapter for 400MB/s of bandwidth. In the remote case that the one switch fails, recovery would involve moving all of the cables to the 2nd switch.
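
On the Linux side, the single-bond version is only a few commands. A rough sketch (round-robin mode, the interface names, and the IP are just placeholders; the switch still needs a matching link-aggregation setup):

# modprobe bonding mode=balance-rr miimon=100
# ifconfig bond0 192.168.142.50 netmask 255.255.255.0 up
# ifenslave bond0 eth0 eth1 eth2 eth3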

On the disk drives, I would probably need a 12-drive RAID 1+0 array in order to drive those 4 bonded NICs. Figure that I'm able to write 40MB/s to a single RAID1 array in the unit. Putting 6 of those RAID1s together in a RAID0 set would give me around 240MB/s.

I could possibly go as high as a 16-drive RAID10 which would put me up around 320MB/s. Again, it all depends on how good the performance is with a single RAID1 spindle pair.

Looking at my test results from Bonnie, I was only seeing around 15MB/s of performance from 300GB 5400RPM drives. A naive estimate is that moving to 750GB 7200RPM drives would drive performance up by about 3.3x, which would be around 50MB/s. A more realistic estimate is around 33MB/s, though it could be as low as 20MB/s.

RAID10 (semi-random performance)
4-spindle: 40-66MB/s
8-spindle: 80-132MB/s
12-spindle: 120-198MB/s
16-spindle: 160-264MB/s

RAID10 (sequential reads/writes)
4-spindle: 120MB/s
8-spindle: 240MB/s
12-spindle: 360MB/s
16-spindle: 480MB/s

Those are big S.W.A.G. estimates. In reality, performance will probably be closer to the semi-random performance numbers. Which means that the first set of drives needs to be configured into an 8-spindle RAID10 in order to be viable. A PCI motherboard would choke on this amount of bandwidth, but the newer PCIe motherboards should be able to handle it.

SAN design - part 2

Trying to decide how to allocate disks within the SAN unit. I have (14) or (17) slots. For now, I'll assume that the 5:3 bay units work which will give me a total of 17 disks.

-------------------------------------------------
BAYS / DRIVE CONFIGURATION (2 INT, 15 BAY-COOLER)
-------------------------------------------------
BAY    CONTROLLER   CONFIG A     CONFIG B
INT1   M/B          RAID1(A)     RAID1(A)
INT2   ''           ''           ''
BCA1   ''           RAID1(B1)    RAID1(B1)
BCA2   ''           RAID1(B2)    RAID1(B2)
BCA3   ''           RAID1(B3)    RAID1(B3)
BCA4   HP2300       RAID1(B1)    RAID1(B1)
BCA5   ''           RAID1(B2)    RAID1(B2)
BCB1   ''           RAID1(B3)    RAID1(B3)
BCB2   ''           HOT SPARE    HOT SPARE
BCB3   HP2320       RAID1(C1)    RAID6(C)
BCB4   ''           ''           ''
BCB5   ''           RAID1(C2)    ''
BCC1   ''           ''           ''
BCC2   ''           RAID1(C3)    ''
BCC3   ''           ''           ''
BCC4   ''           RAID1(C4)    ''
BCC5   ''           ''           ''


INTx - Internal drive bay at back of case
BCAx - 5:3 bay cooler
BCBx - 5:3 bay cooler
BCCx - 5:3 bay cooler

M/B - Indicates that I'm using the 5 SATA ports on the motherboard
HP2300 - HighPoint RocketRAID 2300 PCIe x1 SATA 4-port
HP2320 - HighPoint RocketRAID 2320 PCIe x4 SATA 8-port

A) In the first configuration I have:

700GB RAID1
2100GB RAID0 over (3) RAID1 sets
2800GB RAID0 over (4) RAID1 sets
====
5600GB total (11200GB gross capacity)

B) The second configuration sets up a 8-disk RAID6 array

700GB RAID1
2100GB RAID0 over (3) RAID1 sets
4200GB RAID6 over 8 disks
====
7000GB total (11200GB gross capacity)

The RAID0 over (3) RAID1 sets (a.k.a. RAID 10) should give me roughly 3x the performance of a regular RAID1 volume. Reads and writes should both see a 3x improvement over a simple RAID1.

For the RAID6 volume, I estimate that read performance will be 6x that of the RAID1 set but I'm not sure what write performance will be.

Benchmarking - bonnie

The first test is done on /dev/md3 (composed of partitions on /dev/hda and /dev/sda).

nogitsune tmp # bonnie -s 16384 -m nogitsune-vgmirror
File './Bonnie.9315', size: 17179869184
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
nogitsun 16384 16714 39.6 21765 8.4 12889 4.3 42827 80.7 55645 5.0 155.6 0.4


Test results for a 3-disk RAID5 volume on Nogitsune:

nogitsune backup # bonnie -s 16384 -m raid5-{3}
File './Bonnie.9472', size: 17179869184
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
raid5-{3 16384 12250 30.9 15062 8.3 10926 6.3 34752 65.0 84890 15.4 134.5 0.7
nogitsune backup #


These are test results for the RAID1 set on the VIA C3 unit.

nezumi / # bonnie -s 2047 -m nezumi-raid1
File './Bonnie.22025', size: 2146435072
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
nezumi-r 2047 2637 94.2 27421 57.8 12256 18.9 2665 93.7 29279 25.7 231.7 5.9
nezumi / #


Here's is the Celeron 566MHz unit showing performance for the RAID1 array:

coppermine svn # bonnie -s 2047 -m coppermine-raid1
File './Bonnie.8564', size: 2146435072
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
coppermi 2047 1087 19.7 1272 6.6 1020 19.5 3505 95.4 6763 78.9 143.8 5.5
coppermine svn #


The following is the Celeron 566MHz talking to a single 120GB 5400rpm drive:

coppermine backup # bonnie -s 2047 -m direct          
File './Bonnie.8837', size: 2146435072
Writing with putc()...done
Rewriting...done
Writing intelligently...done
Reading with getc()...done
Reading intelligently...done
Seeker 1...Seeker 2...Seeker 3...start 'em...done...done...done...
-------Sequential Output-------- ---Sequential Input-- --Random--
-Per Char- --Block--- -Rewrite-- -Per Char- --Block--- --Seeks---
Machine MB K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU K/sec %CPU /sec %CPU
direct 2047 4517 83.2 11374 58.1 9073 54.7 6932 98.6 36154 52.7 99.8 3.2
coppermine backup #

Benchmarking - hdparm

So as I prepare for the iSCSI build, I need to start gathering tools to help me find bottlenecks. As well as establishing baseline performance estimates.

hdparm -tT {blockdevice name} - This tool ships standard with most (all?) versions of Linux. It's a read-only, non-destructive (if used properly), command that can be used to test raw read performance from any block device. So you can check individual drives in a RAID array as well as doing a quick check of the over all RAID array performance. Run time is generally around 10 seconds.

Here are some sample runs from my firewall box (VIA C3 600MHz, 1GB RAM, 2x60GB 5400rpm notebook drives):

nezumi backup1 # hdparm -tT /dev/md2

/dev/md2:
Timing cached reads: 276 MB in 2.02 seconds = 136.88 MB/sec
Timing buffered disk reads: 106 MB in 3.06 seconds = 34.61 MB/sec

nezumi backup1 # hdparm -tT /dev/hda2

/dev/hda2:
Timing cached reads: 276 MB in 2.02 seconds = 136.78 MB/sec
Timing buffered disk reads: 90 MB in 3.04 seconds = 29.58 MB/sec

nezumi backup1 # hdparm -tT /dev/hdc3

/dev/hdc3:
Timing cached reads: 276 MB in 2.02 seconds = 136.87 MB/sec
Timing buffered disk reads: 88 MB in 3.04 seconds = 28.91 MB/sec

nezumi backup1 # hdparm -tT /dev/md2

/dev/md2:
Timing cached reads: 276 MB in 2.02 seconds = 136.49 MB/sec
Timing buffered disk reads: 108 MB in 3.03 seconds = 35.65 MB/sec

nezumi backup1 # hdparm -tT /dev/md2

/dev/md2:
Timing cached reads: 276 MB in 2.02 seconds = 136.72 MB/sec
Timing buffered disk reads: 108 MB in 3.00 seconds = 35.99 MB/sec


What you will notice is that the "cached" reads value has more to do with memory speed than with the speed of the disk. The buffered disk read values are closer to real-world performance values. There are no surprises in the performance of the VIA C3 system.

However, on my older Gigabyte Celeron 566MHz system, there's a big surprise:

coppermine thomas # hdparm -tT /dev/hda4

/dev/hda4:
Timing cached reads: 336 MB in 2.00 seconds = 167.99 MB/sec
Timing buffered disk reads: 10 MB in 3.18 seconds = 3.15 MB/sec

coppermine thomas # hdparm -tT /dev/hde4

/dev/hde4:
Timing cached reads: 336 MB in 2.01 seconds = 167.32 MB/sec
Timing buffered disk reads: 86 MB in 3.07 seconds = 27.99 MB/sec

coppermine thomas # hdparm -tT /dev/hdg1

/dev/hdg1:
Timing cached reads: 336 MB in 2.00 seconds = 167.65 MB/sec
Timing buffered disk reads: 58 MB in 3.02 seconds = 19.20 MB/sec

coppermine thomas # hdparm -tT /dev/md2

/dev/md2:
Timing cached reads: 336 MB in 2.01 seconds = 166.99 MB/sec
Timing buffered disk reads: 12 MB in 3.28 seconds = 3.66 MB/sec

coppermine thomas # hdparm -tT /dev/md3

/dev/md3:
Timing cached reads: 336 MB in 2.00 seconds = 167.65 MB/sec
Timing buffered disk reads: 10 MB in 3.18 seconds = 3.15 MB/sec


This shows that there are severe performance issues with any block devices using /dev/hda (such as /dev/md2 and /dev/md3). The motherboard chipset is either extremely slow, or there are performance bottlenecks that need to be investigated.

The test results also show that the old Celeron 566MHz is slightly faster than the VIA C3 on cached reads (167MB/s vs 137MB/s). So if I can find and fix the bottleneck for /dev/hda, I should see a significant increase in performance from this particular unit. For comparison, here are the same tests run on the nogitsune box:

nogitsune etc # hdparm -tT /dev/hda

/dev/hda:
Timing cached reads: 3220 MB in 2.00 seconds = 1609.90 MB/sec
Timing buffered disk reads: 172 MB in 3.02 seconds = 56.95 MB/sec

nogitsune etc # hdparm -tT /dev/sda

/dev/sda:
Timing cached reads: 3236 MB in 2.00 seconds = 1617.90 MB/sec
Timing buffered disk reads: 184 MB in 3.00 seconds = 61.25 MB/sec

nogitsune etc # hdparm -tT /dev/md3

/dev/md3:
Timing cached reads: 3208 MB in 2.00 seconds = 1603.90 MB/sec
Timing buffered disk reads: 188 MB in 3.03 seconds = 62.08 MB/sec

nogitsune etc # hdparm -tT /dev/hde1

/dev/hde1:
Timing cached reads: 3228 MB in 2.00 seconds = 1613.90 MB/sec
Timing buffered disk reads: 136 MB in 3.01 seconds = 45.15 MB/sec

nogitsune etc # hdparm -tT /dev/hdg1

/dev/hdg1:
Timing cached reads: 3232 MB in 2.00 seconds = 1615.90 MB/sec
Timing buffered disk reads: 136 MB in 3.00 seconds = 45.27 MB/sec

nogitsune etc # hdparm -tT /dev/md4

/dev/md4:
Timing cached reads: 3244 MB in 2.00 seconds = 1621.90 MB/sec
Timing buffered disk reads: 136 MB in 3.00 seconds = 45.33 MB/sec

nogitsune etc # hdparm -tT /dev/hdk1

/dev/hdk1:
Timing cached reads: 3232 MB in 2.00 seconds = 1615.90 MB/sec
Timing buffered disk reads: 136 MB in 3.01 seconds = 45.21 MB/sec

nogitsune etc # hdparm -tT /dev/hdo1

/dev/hdo1:
Timing cached reads: 3224 MB in 2.00 seconds = 1611.90 MB/sec
Timing buffered disk reads: 136 MB in 3.00 seconds = 45.33 MB/sec

nogitsune etc # hdparm -tT /dev/hds1

/dev/hds1:
Timing cached reads: 3224 MB in 2.00 seconds = 1611.90 MB/sec
Timing buffered disk reads: 134 MB in 3.01 seconds = 44.49 MB/sec

nogitsune etc # hdparm -tT /dev/md5

/dev/md5:
Timing cached reads: 3224 MB in 2.00 seconds = 1611.90 MB/sec
Timing buffered disk reads: 266 MB in 3.01 seconds = 88.43 MB/sec

nogitsune etc # hdparm -tT /dev/sdb1

/dev/sdb1:
Timing cached reads: 3232 MB in 2.00 seconds = 1615.90 MB/sec
Timing buffered disk reads: 162 MB in 3.01 seconds = 53.85 MB/sec


What we see here is that for the RAID1 arrays, speed is equivalent to the disks within the array. But for the RAID5 array with (3) disks, speed is 2x that of the single disks within the array.

Friday, August 18, 2006

Starting an iSCSI SAN unit

So here's my first stab at a SAN unit that can hold 14 or 17 SATA drives. All of the drives will be mounted in hot-swap trays which should make things much easier. Some of the components are a bit overkill (such as the pair of dual-port server NICs) but I'm planning ahead to when we have two gigabit switches and we want to connect multiple gigabit ports together for speed.

For the case, motherboard and misc parts:

$0159 Thermaltake Armor VA8000BNS Black Chassis: 1.0mm SECC
$0190 Thermaltake ToughPower W0117RU ATX12V/ EPS12V 750W
$0035 DVD-RW (BLACK)
$0050 misc parts (fans, cables)
$0182 MB-BA22658 AMD Athlon64 X2 4200+ AM2 (WINDSOR)
$0200 XXXXXXXXXX Asus M2N32-SLI DLX
$0138 XXXXXXXXXX Mwave 2GB DDR2 533 (1GB x 2)
$0009 XXXXXXXXXX Assemble & Test

The motherboard is an ASUS M2N32-SLI with multiple PCIe slots (2 x16, 1 x4, 1 x1), 2 PCI slots, 8 SATA-II ports, and dual NICs on an nForce 590 chipset. I plan on using the expansion slots as follows:

PCIe x16: Intel Pro/1000 PT Dual-Port PCIe x4
PCIe x4: HighPoint RocketRAID 2320 8-port PCIe x4
PCIe x1: HighPoint RocketRAID 2300 4-port PCIe x1
PCIe x16: Intel Pro/1000 PT Dual-Port PCIe x4
PCI:
PCI: PCI video card

Note: I expect that you will need a fairly recent BIOS version in order to use the first x16 slot for something other than a video card.

Prices for the SATA controllers and NICs:

$0167 INTEL PRO/1000 PT DUAL PORT EXPI9402PT gigabit PCIe x4
$0140 HighPoint RocketRAID 2300 PCIe x1 (4-port SATA-II)
$0260 HighPoint RocketRAID 2320 PCIe x4 (8-port SATA-II)

Note: I'm not 100% sure that I'm going to use the RocketRAID 2320. I still need to do some research to verify that it works properly in Linux without special binary drivers (I want it to work as a regular controller card). Otherwise I may use either the PROMISE SuperTrak EX8350 PCIe x4 (8-port SATA-II) or the 3ware 9590SE-8ML PCIe x4 (8-port SATA-II).

Other components:

$0350 Seagate Barracuda 7200.10 750GB SATA-II
$0110 Athena Power 5:3 SATA-II Backplane SATA3051B (350SATA)

The 5:3 backplane should allow me to fit 15 drives into the front 5.25" expansion bays on the Thermaltake Armor case. If I don't care for the design of the 5:3 backplane there are more conservative 4:3 backplanes that will allow me to fit 12 drives into the front bays.

Or I could even use some 4:3 SCSI SCA hot-swap enclosures along with a PCIe x4 SCSI card. That lets me have both SATA and SCSI drives in the same unit. A pair of those would give me 6 SATA drives and 8 SCSI drives.

Estimated cost for the starter kit is $3200, including a pair of 750GB SATA drives. Since I'll be building two of these, and then gradually expanding them:

$3200 02 - 700GB SAN (2 base disks per SAN) -- $4.57/GB
$3200 02 - redundancy for 700GB SAN (2 base disks per SAN) -- $9.14/GB
$1400 04 - expansion to 1.4TB (2 data disks per SAN added) -- $5.57/GB
$2800 08 - expansion to 2.8TB or reconfig to 3.5TB -- $3.78 to $3.03/GB
$2800 12 - expansion to 4.2TB or reconfig to 6.3TB -- $3.19 to $2.13/GB
$1400 14 - expansion to 4.9TB or reconfig to 7.7TB -- $3.02 to $1.92/GB

The 2nd column is the number of drives installed within a single SAN unit. The first capacity is if I stick with my initial plan of RAID1 within the SAN unit and then RAID1 across the SAN fabric for redundancy. The second capacity is if I RAID5 within the unit and then RAID1 across the SAN fabric for redundancy.

I may also expand the memory in the SAN boxes to 6GB or 8GB down the road to maximize any possible read-caching.

For home use, I wouldn't bother to build a 2nd SAN unit for redundancy. Downtime in the case of a blown power supply would only be about 2 hours (assuming you have one on hand) to a day or two. I could also cut some costs by going with less expensive NICs or SATA controllers, but that doesn't gain you much.

The other advantage of building out slowly is that you can take advantage of the 1TB and 2TB drives that might appear in 2007-2009.

Thursday, August 17, 2006

Virtualization and SANs

One of my next projects at the office is to start migrating us from individual servers with direct attached storage (DAS) to a virtual server running on a storage area network (SAN).

Individual servers with DAS
+ Easy configuration
- A downed server takes services and data offline
+ Cheap
- Inflexible
- Wasted storage space
+ Higher utilization of raw storage capacity (50-90%)

Virtual servers with SAN
- Complexity
- Cost
+ Virtual servers can move from host to host on the fly
+ Data is accessible to a virtual server no matter which host
+ Fault-tolerant
+ Server redundancy
- Lower utilization of raw storage capacity (25%-40%)

For virtualization, you set up a hypervisor layer on the server hardware and all servers run on top of that layer as virtual servers. With most virtualization setups, this means that a virtual guest server can be moved to other server hardware on-the-fly when needed. So if you want to take down a host server for maintenance, you can simply move all of the guest O/Ss off to other servers temporarily.

Naturally, this works best if the data for your virtual servers is stored on a SAN rather than on local disks (DAS). That way, when the server hardware is taken down, it doesn't affect the availability of data. The downside is that you now have a central point of failure (the SAN) for multiple servers. So you need to take care and design in redundancy / fault-tolerance / failover.

For the host servers, redundancy is as easy as having multiple host servers connected up to the SAN fabric (so they can talk to their data stores).

For the SAN fabric, redundancy can be done in various ways. The easiest is to simply have two switches and two network paths between each host server and the SAN units. There are also more complex topologies (core-edge, full-mesh, etc) but for the small business, a pair of switches will probably provide enough fault-tolerance.

The SAN storage units are the remaining weak link. It should be possible to RAID together multiple SAN units to provide fault-tolerance, but that's something that I'm still exploring with regard to iSCSI. The other downside of RAID'ing multiple SAN units together is that it generally cuts the net storage amount in half. So if you need N GB of data storage, you need N * 4 GB of raw storage (assuming RAID1 within the SAN and RAID1 across two SAN units). The upside is that you could lose 3 of the 4 disks holding any given block before you would lose data.

(If you construct your SAN as a 12-disk RAID6 with 83% net storage, a RAID1 of two SAN units would give you 42% net storage instead of only 25%. Whether performance would be adequate is uncertain.)

...

Now, eventually, we may bring in the big guns such as EMC / IBM / Dell / Whoever. But we estimate that we can get a multi-terabyte test SAN up and running for around $5000-$10000 using commodity hardware and SATA drives.

If things don't work out, then we will use the commodity hardware for other projects and call in the big guns.

Estimated costs for redundant SAN storage (these are very rough ballpark figures for fully-populated SANs):

$2.00/GB - homegrown
$4.50/GB - pre-built SATA iSCSI solutions
$13.33/GB - pre-built SCSI iSCSI solutions

Costs for half populated SANs are about double those $/GB values. The SCSI SAN unit can only hold about 2/5 the capacity of the SATA SAN units.

Tuesday, August 08, 2006

Linksys to Shorewall

Over the past week, I've been encountering slowdowns on my DSL connection. Normally, I can get 1.5Mbps download speeds, which is what I'm provisioned for and what the line is capable of. But over the past week or so, my top download speed has gradually fallen to around 300-500Kbps (1/5 to 1/3 of the usual bandwidth). It didn't seem to matter what protocol I was using at the time either.

So I did some testing. With the LinkSys router involved, I was seeing download rates of 300-500Kbps. But if I hooked a laptop directly to the DSL modem, I could get 1.5Mbps again. Ah ha! That shows clearly that the issue is not any sort of bandwidth shaping or traffic limiting by the ISP but rather something strange going on with my Linksys BEFW11S4 router.

Since the router is a few years old (circa 2001), it could simply be age related. Maybe a heatsink has come loose (making the CPU throttle back) or a capacitor has failed or something else.

Rather than adding a new hardware-based router to the mix, I decided to press my old VIA C3 box into service as the new firewall/NAT. It's something I had been considering doing for a while, but kept putting it off due to the complexities involved. Mostly, I worry about misconfiguring the box, allowing it to get hacked and turned into someone else's playtoy.

It took me around 2 hours to configure Shorewall for Gentoo on my C3 box. So far it seems to be working fine and provides me with some new diagnostic tools (such as "nettop"). According to the ShieldsUp! portscan service, everything is stealthed except for the tcp/113 ident port which is simply closed.
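
For the curious, the core of the configuration is just a handful of files under /etc/shorewall. A rough two-interface sketch (eth0 facing the DSL modem, eth1 facing the LAN; this is trimmed down from the two-interface sample that ships with Shorewall 3.x, so treat it as an outline rather than my exact files):

/etc/shorewall/zones:
fw firewall
net ipv4
loc ipv4

/etc/shorewall/interfaces:
net eth0 detect dhcp
loc eth1 detect

/etc/shorewall/policy:
loc net ACCEPT
net all DROP info
all all REJECT info

/etc/shorewall/masq:
eth0 eth1

Plus whatever ACCEPT entries you need in /etc/shorewall/rules (SSH from the local zone and so on).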

Now I can go read up on my Linux security books and figure out if I want to continue using Shorewall or not.

Follow-up note: Unfortunately, I was never able to get PPTP pass-through working. I was trying to use Shorewall in a SOHO environment where I create a VPN connection from my laptop out through the Shorewall NAT to a PPTP VPN server on the public internet. But something in either the Linux kernel or the Shorewall / IPTables configuration is blocking all or some of the PPTP traffic. So I had to drop the Linksys hardware router back in so that I could get work done.