Sunday, October 03, 2010

3ware 9650SE failing RAID6 array in Linux

I'm using a 3ware 9650SE 16-port controller in my Linux server, set up as one big RAID6 array that is presented to Linux as a single disk. However, it's currently operating in degraded mode due to at least one disk failure. See 3ware RAID maintenance with tw_cli for a link to the 3ware documentation PDF. See also Fixing a degraded disk on the 3ware raid (blur).

# tw_cli help

Copyright (c) 2009 AMCC
AMCC/3ware CLI (version 2.00.09.012)

Commands  Description
-------------------------------------------------------------------
show      Displays information about controller(s), unit(s) and port(s).
flush     Flush write cache data to units in the system.
rescan    Rescan all empty ports for new unit(s) and disk(s).
update    Update controller firmware from an image file.
commit    Commit dirty DCB to storage on controller(s).     (Windows only)
/cx       Controller specific commands.
/cx/ux    Unit specific commands.
/cx/px    Port specific commands.
/cx/phyx  Phy specific commands.
/cx/bbu   BBU specific commands.                               (9000 only)
/cx/ex    Enclosure specific commands.                       (9690SA only)
/ex       Enclosure specific commands.                      (9KSX/SE only)

Certain commands are qualified with constraints of controller type/model
support.  Please consult the tw_cli documentation for explanation of the
controller-qualifiers.

Type help <command> to get more details about a particular command.
For more detail information see tw_cli's documentation. 

So if we take a look at the drives installed:

# tw_cli show

Ctl   Model        (V)Ports  Drives   Units   NotOpt  RRate   VRate  BBU
------------------------------------------------------------------------
c6    9650SE-16ML  16        7        1       1       4       4      1  

The columns are as follows:

Ports - # of drive ports on the card
Drives - # of drives connected
Units - # of RAID units created on the card
NotOpt - "not optimal"
RRate - "rebuild rate"
VRate - "verify rate"
BBU - Battery backup

# tw_cli /c6 show

Unit  UnitType  Status         %RCmpl  %V/I/M  Stripe  Size(GB)  Cache  AVrfy
------------------------------------------------------------------------------
u0    RAID-6    DEGRADED       -       -       64K     4889.37   OFF    OFF    

Port   Status           Unit   Size        Blocks        Serial
---------------------------------------------------------------
p0     NOT-PRESENT      -      -           -             -
p1     NOT-PRESENT      -      -           -             -
p2     NOT-PRESENT      -      -           -             -
p3     NOT-PRESENT      -      -           -             -
p4     NOT-PRESENT      -      -           -             -
p5     NOT-PRESENT      -      -           -             -
p6     OK               u0     698.63 GB   1465149168    5QD4####            
p7     OK               u0     698.63 GB   1465149168    3QD0####            
p8     NOT-PRESENT      -      -           -             -
p9     OK               u0     698.63 GB   1465149168    3QD0####            
p10    NOT-PRESENT      -      -           -             -
p11    OK               u0     698.63 GB   1465149168    5QD3####            
p12    OK               u0     698.63 GB   1465149168    3QD0####            
p13    OK               u0     698.63 GB   1465149168    5QD4####            
p14    OK               u0     698.63 GB   1465149168    3QD0####            
p15    NOT-PRESENT      -      -           -             -

As you can see, the drives on port 8 and port 10 have failed, which means our RAID6 array is in dire shape. After testing, one of the drives had failed completely; the other was merely suspect and was put back into the array. I did the rebuild in the BIOS, but while the array is rebuilding you will see the following:

# tw_cli /c6/u0 show all
/c6/u0 status = DEGRADED-RBLD
/c6/u0 is rebuilding with percent completion = 13%(A)
/c6/u0 is not verifying, its current state is DEGRADED-RBLD
/c6/u0 is initialized.
/c6/u0 Write Cache = off
/c6/u0 volume(s) = 1
/c6/u0 name = vg6                  
/c6/u0 serial number = 5QD40L3K00005F00#### 
/c6/u0 Ignore ECC policy = off       
/c6/u0 Auto Verify Policy = off       
/c6/u0 Storsave Policy = protection  
/c6/u0 Command Queuing Policy = on        
/c6/u0 Parity Number = 2         

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-6    DEGRADED-RBLD  13%(A)  -       -     64K     4889.37   
u0-0     DISK      OK             -       -       p14   -       698.481   
u0-1     DISK      OK             -       -       p13   -       698.481   
u0-2     DISK      OK             -       -       p12   -       698.481   
u0-3     DISK      OK             -       -       p11   -       698.481   
u0-4     DISK      DEGRADED       -       -       p10   -       698.481   
u0-5     DISK      OK             -       -       p9    -       698.481   
u0-6     DISK      DEGRADED       -       -       -     -       698.481   
u0-7     DISK      OK             -       -       p7    -       698.481   
u0-8     DISK      OK             -       -       p6    -       698.481   
u0/v0    Volume    -              -       -       -     -       4889.37

Specifically, we can see that the array is 13% of the way through the rebuild after only 32 minutes. I have not yet replaced the drive on port 8, as I'm going to wait for the array to finish rebuilding before I jostle it again.
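Once the replacement drive for port 8 is installed, the second rebuild can also be started from tw_cli instead of the BIOS. This is only a sketch based on the 3ware CLI guide; double-check the exact syntax against the documentation that matches your firmware:

# tw_cli /c6 rescan
# tw_cli /c6/u0 start rebuild disk=p8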

Notes:

I strongly recommend that you feed the output of "tw_cli /c# show" and "tw_cli /c#/u# show all" into text files daily and parse them for issues, or mail them to a monitoring email address. Being able to tell the technician to pull drive XYZ with a specific serial number helps eliminate errors, but that's hard to do if you don't keep track of your serial numbers. A rough example of such a check is below.
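Here's a minimal sketch of that kind of daily check. The controller number, temp file, alert address, and the list of states to grep for are all assumptions; adjust to taste and drop it into /etc/cron.daily/ or a crontab entry:

#!/bin/bash
# Sketch: dump 3ware controller/unit status and mail it if anything looks unhealthy.
CTL=c6                          # controller number from "tw_cli show"
ALERT=monitoring@example.com    # placeholder monitoring address
REPORT=/tmp/tw_cli-$CTL-report.txt

tw_cli /$CTL show > "$REPORT"
tw_cli /$CTL/u0 show all >> "$REPORT"

# DEGRADED, DEGRADED-RBLD, REBUILDING and INOPERABLE all warrant a closer look.
if grep -Eq 'DEGRADED|REBUILD|INOPERABLE' "$REPORT"; then
    mail -s "3ware status warning on $(hostname)" "$ALERT" < "$REPORT"
fi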

On the systems I administer, we have a /reports/configuration folder where we consolidate all those types of reports. Things like the output of pvscan, lvscan, df, ntpq, /proc/mdstat, etc. all get dumped into text files daily and then committed to the central SVN repository for the server with FSVS. When things go bad later, we can step back through the SVN repository and look at the various reports at previous points in time.
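As a rough sketch of what that collection script looks like (the report list and commit message here are examples, not our exact script):

#!/bin/bash
# Sketch: dump daily configuration reports and commit them with FSVS.
REPORTS=/reports/configuration
mkdir -p "$REPORTS"

pvscan           > "$REPORTS/pvscan.txt"  2>&1
lvscan           > "$REPORTS/lvscan.txt"  2>&1
df -h            > "$REPORTS/df.txt"
ntpq -p          > "$REPORTS/ntpq.txt"    2>&1
cat /proc/mdstat > "$REPORTS/mdstat.txt"

# The /reports tree is versioned along with the rest of the system, so each
# day's snapshot becomes one more revision in the server's SVN repository.
cd / && fsvs ci -m "daily configuration reports" /reports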

Example of bad capacitors from a few years ago

A few years back, we bought a bunch of GeForce 6200LE PCIe video cards for various uses in servers and desktops. I liked them at the time because they are fanless (one less moving part to break). However, in the last two years, we've had a lot of them fail due to bad capacitors.

And when one of these capacitors "pops", it sounds like a small firecracker going off in the room. Very noticeable at the time. It reminds me of the old miniature pop-caps: small paper bags with a single grain of gunpowder (or flash powder?) mixed in with a few pea-sized rocks, which you could throw at a hard surface to make a popping noise.





Friday, October 01, 2010

FSVS: Updated install on CentOS 5.5

(Also see my older post on this: FSVS - Install on CentOS 5 or FSVS - Install on CentOS 5.4. Or the original post where I explained the power of FSVS for sysadmins.)

Once again, I'm starting with the assumption that this is a pretty bare-bones CentOS 5.5 server install, with only the "server" package group (or no package groups at all) being selected during the initial install.  The basic steps remain the same for situations where you merely want to use FSVS and SVN to keep track of changes to your system:


  1. Set up the RPMForge repository
  2. Install the packages needed for FSVS
  3. Download and compile FSVS
  4. Configure ignore patterns
  5. Do the base check-ins
For the most part, I try to do this process as early in the lifespan of the server as possible.  But there are always a few minor things that get done before I get this far (creating an initial user, doing a "yum update" to patch the system up, etc.).  Even a mature server can benefit from adding FSVS, but you'll find it much more useful the longer you use it, as it gives you a quick index to "what changed and why did you change it".

Setting up RPMForge

In order to get the latest Subversion packages for your system, you'll have to add RPMForge as a source repository. The CentOS base repository only has Subversion 1.4.2 and the latest is currently 1.6.12. I recommend doing this in conjunction with the yum-priorities package.

# yum install yum-priorities

After installing the yum-priorities package, you should edit the CentOS-Base.repo file found under /etc/yum.repos.d/. For the base repositories, I recommend priority values of 1 through 3. For example, in the "[base]" section, you would add two lines to the end of that section:

[base]
...
priority=1
exclude=subversion-*

The "priority=" tells yum-priorities that if it finds a package in multiple repositories, that [base] should take precedence.  The "exclude=" line keeps us from pulling Subversion from the [base] repository (and we'll instead pull it from RPMForge).

Now we can install the RPMForge repository (see Using RPMForge).  You'll need to look at the release folder in order to get the RPM name.

# cd /root/
# mkdir software
# cd software
# wget http://packages.sw.be/rpmforge-release/rpmforge-release-0.5.1-1.el5.rf.x86_64.rpm
# rpm -Uhv rpmforge-release-0.5.1-1.el5.rf.x86_64.rpm
# cd /etc/yum.repos.d/

Now you should edit the rpmforge.repo file and insert a priority= line; I recommend a value of 10 or 25. In addition, I suggest telling the yum package manager to only pull in specific packages from RPMForge (or any third-party repository).  This is a bit of overkill, but yum-priorities is a bit of overkill in itself, and I've run into issues in the past where priorities weren't enough and I wished I had used "includepkgs" instead.  (Note that while "exclude" is used to exclude packages, the opposite term is "includepkgs", not "include".)  This will make the end of your rpmforge.repo file look like:


[rpmforge]
...
priority=10
includepkgs=subversion-*

You can verify that you'll pull in the latest Subversion package with the following command:

# yum info subversion
Available Packages
Name       : subversion
Arch       : x86_64
Version    : 1.6.12
Release    : 0.1.el5.rf
Size       : 6.8 M
Repo       : rpmforge
Summary    : Modern Version Control System designed to replace CVS
URL        : http://subversion.tigris.org/
License    : BSD

Install the packages needed for FSVS

# yum install subversion subversion-devel ctags apr apr-devel gcc gdbm gdbm-devel pcre pcre-devel apr-util-devel

Download and compile FSVS

As always, you shouldn't compile code as the root user.

# su username
$ mkdir -p ~/software/fsvs
$ cd ~/software/fsvs
$ wget http://download.fsvs-software.org/fsvs-1.2.2.tar.bz2
$ tar xjf fsvs-1.2.2.tar.bz2
$ cd fsvs-1.2.2
$ ./configure
$ make
$ exit
# cp /home/username/software/fsvs/fsvs-1.2.2/src/fsvs /usr/local/bin/
# chmod 755 /usr/local/bin/fsvs

Creating the repository on the SVN server

This is how we set up users on our SVN server. Machine accounts are prefixed with "sys-" in front of the machine name. The SVN repository name matches the name of the machine. In general, only the machine account should have write access to the repository, although you may wish to add other users to the group so that they can gain read-only access.

# useradd -m sys-www-test
# passwd sys-www-test
# svnadmin create /var/svn/sys-www-test
# cd /var/svn
# chmod -R 750 sys-www-test
# chmod -R g+s sys-www-test/db
# chown -R sys-www-test:sys-www-test sys-www-test

Back on the source machine (our test machine), we'll need to create an SSH key that can be used on our SVN server. You may wish to use a slightly larger RSA key (3200 bits or 4096 bits) if you're working on an extra sensitive server. But a key size of 2048 bits should be secure for another decade for this purpose.

# cd /root/
# mkdir .ssh
# chmod 700 .ssh
# cd .ssh
# /usr/bin/ssh-keygen -N '' -C 'svn key for root@hostname' -t rsa -b 2048 -f root@hostname
# chmod 600 *
# cat root@hostname.pub


Copy this public key to the clipboard, or send it to the SVN server or the SVN server administrator. Then we'll need to create a ~/.ssh/config file to tell SSH what account name, port, and key file to use when talking to the SVN server.

# vi /root/.ssh/config
Host svn.tgharold.com
Port 22
User sys-www-test
IdentityFile /root/.ssh/root@hostname
# chmod 600 *


Back on the SVN server, you'll need to finish configuration of the user that will add files to the SVN repository.

# su sys-www-test
$ cd ~/
$ mkdir .ssh
$ chmod 700 .ssh
$ cd .ssh
$ cat >> authorized_keys
(paste in the SSH public key from the other server, then press Ctrl-D)
$ chmod 600 *

Now you'll want to prepend the following to the key line in the authorized_keys file (it all stays on one line, separated from the key by a space).

command="/usr/bin/svnserve -t -r /var/svn",no-agent-forwarding,no-pty,no-port-forwarding,no-X11-forwarding

That ensures (mostly) that the key can only be used to run the svnserve command and that it can't be used to access a command shell on the SVN server. Test the configuration back on the original server by issuing the "svn info" command. Alternatively, you can try to ssh to the SVN repository server. Errors will usually be logged either in /var/log/secure on the source server or in the same log file on the SVN repository server. Here's an example of a successful connection:

# ssh svn.tgharold.com
( success ( 2 2 ( ) ( edit-pipeline svndiff1 absent-entries commit-revprops depth log-revprops partial-replay ) ) )

This shows that the key is running the "svnserve" command automatically.
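The "svn info" test mentioned above would look something like this, using the repository URL that we'll hand to FSVS in the next step; on a brand-new repository it should report revision 0:

# svn info svn+ssh://svn.tgharold.com/sys-www-test/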

Connect the system to the SVN repository

The very first command that you'll need to issue for FSVS is the "urls" (or "initialize") command. This tells FSVS what repository will be used to store the files.

# cd /
# mkdir /var/spool/fsvs
# mkdir /etc/fsvs/
# fsvs urls svn+ssh://svn.tgharold.com/sys-www-test/

You may see the following error, which means you need to create the /var/spool/fsvs folder, then reissue the fsvs urls command.

stat() of waa-path "/var/spool/fsvs/" failed. Does your local WAA storage area exist?

The following error means that you forgot to create the /etc/fsvs/ folder.

Cannot write to the FSVS_CONF path "/etc/fsvs/".

Configure ignore patterns and do the base check-ins

When constructing ignore patterns, it generally works best to add a few directories at a time to the SVN repository. Everyone has different directories that they won't want to version, so you'll need to tailor the following to match your configuration. However, I generally recommend starting with the list below (this is the output from "fsvs ignore dump", which you can pipe into a file, edit, then pipe back into "fsvs ignore load"; see the sketch after the list):

group:ignore,./backup/
group:ignore,./bin/
group:ignore,./dev/
group:ignore,./etc/fsvs/
group:ignore,./etc/gconf/
group:ignore,./etc/gdm/
group:ignore,./home/
group:ignore,./lib/
group:ignore,./lib64/
group:ignore,./lost+found
group:ignore,./media/
group:ignore,./mnt/
group:ignore,./proc/
group:ignore,./root/
group:ignore,./sbin/
group:ignore,./selinux/
group:ignore,./srv/
group:ignore,./sys/
group:ignore,./tmp/
group:ignore,./usr/bin/
group:ignore,./usr/include/
group:ignore,./usr/kerberos/
group:ignore,./usr/lib/
group:ignore,./usr/lib64
group:ignore,./usr/libexec/
group:ignore,./usr/sbin/
group:ignore,./usr/share/
group:ignore,./usr/src/
group:ignore,./usr/tmp/
group:ignore,./usr/X11R6/
group:ignore,./var/cache/
group:ignore,./var/gdm/
group:ignore,./var/lib/
group:ignore,./var/lock/
group:ignore,./var/log/
group:ignore,./var/mail/
group:ignore,./var/run/
group:ignore,./var/spool/
group:ignore,./var/tmp/
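A quick sketch of that dump/edit/load round-trip (the file name is just an example):

# cd /
# fsvs ignore dump > /root/fsvs-ignore-patterns.txt
# vi /root/fsvs-ignore-patterns.txt
# fsvs ignore load < /root/fsvs-ignore-patterns.txt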

Then you'll want to either ignore or encrypt the SSH keys and other sensitive files.

# cd /
# fsvs ignore group:ignore,./root/.ssh
# fsvs ignore group:ignore,./etc/shadow*
# fsvs ignore group:ignore,./etc/ssh/ssh_host_key
# fsvs ignore group:ignore,./etc/ssh/ssh_host_dsa_key
# fsvs ignore group:ignore,./etc/ssh/ssh_host_rsa_key

You can check what FSVS is going to version by using the "fsvs status pathname" command (such as "fsvs status /etc"). Once you are happy with the selection in a particular path, you can run the following command:

# fsvs ci -m "base check-in" /etc

Repeat this for the various top-level trees until you have checked everything in. Then you should do one last check-in at the root level to catch anything you might have missed.
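That final catch-all commit would look something like this (the commit message is just an example):

# cd /
# fsvs ci -m "remaining base check-in" /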