Saturday, June 19, 2004

Gentoo: Segmentation fault in vgscan during boot

Now for the other error that I got during the initial bootup.
* Using /etc/modules.autoload.d/kernel-2.6 as config:
* Loading module dm-mod... [ ok ]
* Autoloaded 1 module(s)
* Setting up the Logical Volume Manager...
/sbin/rc: line 429: 4422 Segmentation Fault /sbin/vgscan >/dev/nul [ ok ]
* Starting up RAID devices: ...
* Checking all filesystems...
/dev/md0: clean, 39/18072 files, 5573/72192 blocks
fsck.ext: No such file or directory while trying to open /dev/vgmirror/opt
/dev/vgmirror/opt:
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193

("No such file or directory..." error repeats for all of the other
logical volumes in the volume group(s) on the system)

My initial guess is that the software RAID is not starting before LVM tries to load. I may have to edit the ordering in "/etc/init.d/checkfs". However, since RAID is built into the kernel and LVM is a module, the RAID arrays should already have been started before the LVM code runs.

Looking closer at the boot screen, I can see the "md:" lines correctly autodetecting the RAID arrays. So RAID support seems to be working fine. In fact, if I login to maintenance mode, and "cat /proc/mdstat", all of the RAID arrays show up correctly.
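For a quick check from maintenance mode, something like the following sketch could summarize the arrays instead of reading the raw file by eye. The function name "list_md_arrays" is my own invention, and it assumes the usual /proc/mdstat layout where each array line starts with "mdN : state".

```shell
# Sketch only: print "name state" for every md array found in
# mdstat-formatted text on stdin. Assumes array lines look like
# "md0 : active raid1 sdb1[1] sda1[0]".
list_md_arrays() {
    awk '$1 ~ /^md[0-9]+$/ && $2 == ":" { print $1, $3 }'
}
```

Run it as "list_md_arrays < /proc/mdstat". Any array that is missing or not "active" would point back at the RAID layer rather than LVM.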

Attempt #1:

Moved things around in /etc/init.d/checkfs. No change in the end result, except that the messages are reordered ("Starting up RAID devices" now appears before "Setting up the Logical Volume Manager") and the error message changes to "4422 Segmentation Fault /sbin/vgscan >/dev/nul". Probably a dry hole in terms of finding and fixing the real problem.

Attempt #2:

Flipped back to the Gentoo LVM2 documents to see if I missed anything in setting up the LVM set to auto-mount at startup. Booted into maintenance mode and used "vgscan -v" to let vgscan attempt to find all of the volume groups. "vgscan" takes a while to run, but verbose (-v) mode at least lets you see some status. On my setup, "vgscan" correctly located the "vgmirror" volume group.

Looked at the "/etc/lvm" folder on the root volume using "ls -la /etc/lvm" and saw something surprising. There is a ".cache" file which is huge (mine was 10881785 bytes). Doing a "cat" of the contents, I see some entries like "/dev/discs/disc2/discs/disc2.../disc2/md/254", which looks like a recursive loop of some sort.
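Scrolling through a 10 MB file with "cat" is painful, so here is a rough sketch of how the looping entries could be picked out automatically. The function name "find_looping_paths" and the threshold are assumptions of mine, and it assumes each cache entry sits on its own line. It flags any line in which a single path component shows up more than a given number of times.

```shell
# Sketch only: print lines from stdin in which one path component
# (the text between slashes) repeats more than max_repeats times.
# A default of 2 is an arbitrary choice for illustration.
find_looping_paths() {
    max_repeats="${1:-2}"
    awk -v max="$max_repeats" '
    {
        split("", seen)                # reset the per-line counters
        n = split($0, parts, "/")
        for (i = 1; i <= n; i++) {
            if (parts[i] == "") continue
            if (++seen[parts[i]] > max) { print; break }
        }
    }'
}
```

Something like "find_looping_paths < /etc/lvm/.cache | head" should then show just the first few suspicious entries.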

Hint #2: running "vgdisplay -vv" shows the error message "Too many levels of symbolic links" after each of those long entries. The same problem shows up with "vgscan -vv". I finally changed my "/etc/lvm/lvm.conf" file to look like the following. With this in place, vgscan and vgdisplay find the volume group on my RAID array very quickly and no longer segfault while looking at other devices:
devices {
    scan = ["/dev/md"]
    filter = ["a|^/dev/md/3$|", "r/.*/"]
}

Note that this filter only allows vgscan to scan the "md3" device. This keeps vgscan from scanning other devices that don't need to be scanned on my system (and fixes the segfault issue where it goes into infinite recursion on certain devices). If you need to scan other RAID devices (/dev/md1, etc.) or other physical partitions, then you'll need to adjust the "accept" portion of the filter.
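To sanity-check a filter before rebooting, the matching order can be imitated in a few lines of shell. This is only a sketch under my own assumptions: "check_filter" is a made-up helper, the "a:REGEX" / "r:REGEX" notation is a simplification of the lvm.conf syntax, and grep's extended regular expressions stand in for LVM's own regex engine. It follows the rule that the first pattern to match a device path decides the outcome, and that a path matching nothing is accepted.

```shell
# Sketch only: check_filter DEVICE PATTERN...
# Each PATTERN is "a:REGEX" (accept) or "r:REGEX" (reject), tried in order.
check_filter() {
    dev="$1"; shift
    for pat in "$@"; do
        action="${pat%%:*}"            # text before the first colon
        regex="${pat#*:}"              # text after the first colon
        if printf '%s\n' "$dev" | grep -Eq -- "$regex"; then
            if [ "$action" = "a" ]; then echo accept; else echo reject; fi
            return 0
        fi
    done
    echo accept                        # nothing matched
}
```

For example, "check_filter /dev/md/1 'a:^/dev/md/3$' 'r:.*'" reports that /dev/md/1 is rejected by my filter, while widening the accept pattern to 'a:^/dev/md/[0-9]+$' would let every numbered device under /dev/md through.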

Save the file, shut down the RAID arrays (raidstop -a /dev/md0 for each /dev/md* device), and reboot. My server now boots up correctly.

Links:

Re: mkfs.xfs on software raid5 (2.6.5 kernel) - MD array /dev/md2 not in clean state (alt.os.linux.gentoo) - Shows the exact error message that I'm seeing.

Re: [gentoo-user] LVM2 Date: 2004-03-31 15:40:13 PST (linux.gentoo.user) - Talks about the ordering in which software RAID and the LVM modules load.

Example of lvm.conf file - shows a more complex lvm.conf file, complete with multiple filters.

A more complete example lvm.conf file

3 comments:

Anonymous said...

Thanx, this helped me out..

Couldn't get software raid + lvm2 working on a fresh gentoo install.

Two comments:

Your lvm.conf contains a typo
It should read "devices {" without the "="

Your lvm.conf explicitly accepts only /dev/md/3; this causes all other RAID devices to be excluded for physical volume creation.

Greetz

Ramon

Thomas said...

Fixed the syntax error (thanks for the pointer). Also added a note to clarify that the filter will only allow md3 to be scanned for logical volumes.

Anonymous said...

Thanks, I encountered the same problem and didn't know where to start to track it down. I'm not certain if an update to genkernel or the gentoo-dev-sources (2.6.7) is the culprit. Where did all of the entries under /dev/md come from? Why are they needed?