Strange "filesystem busy" problem

From: Rainer Krienke
Date: Fri Apr 16 2004 - 03:01:18 EST


Hello,

I am running several NFS and SMB server using SuSE Linux 8.2 with a SuSE
patched kernel k_smp4G-2.4.21-202 (the one from suse9.0).
Basically the system is running very stable and without problems. Before I can
describe the problem I have to describe the architecture in a few words:

The data that are served by NFS and SMB are on LVM2 logical volumes (lvm2
tools version 2.2.00.05, device-mapper 4.0.1, about 10 LVs) using the xfs
filesystem( version 1.3.1). The LVM2 physical volume is a software raid 1.
The size of the software mirror is about 1TB. This mirror again is using two
hardware raid5 devices equipped with IDE disks that are connected by
fibrechannel to the host. There are two seperate paths to each hardware raid,
so its a multipath setup. In short:

xfs_fs: LVM2: md_soft_raid1: md_multipath: Fibrechannel_hardware_raid_5

We are making heavy use of the automounter (Version 4.4.0) in that every user
mount is a bind mount on the parent directory holding all the users data of
this server. Usually the are about 200 user mounts (in 5000) at one time on
one server.

The problem is that when rebooting the machine the xfs filesystems that lie on
the LVM2 logical volumes cannot be umounted because the kernel claims they
are busy. This causes a resync of the software mirror after the reboot which
due to the size of 1TB takes quite a long time and increases the risk of data
loss, if during the resync more than one IDE disk should die on the source
part of the mirror.

The same busy-problem happens when I change to single user mode and then try
to umount the xfs filesystems manually. Most of them will umount but there
is always one or two of them that won't. I tried to run fuser and lsof on the
filesystem to see whats making it busy, but both tools do not report
anything. OK fuser -v says "kernel mount" but does not report any processes
accessing the filesystem. You can take a look at the list of all processes
still running in single user mode here:

http://www.uni-koblenz.de/~krienke/tmp/processes.txt

At the moment I am simply out of ideas what to try. So what I am looking for
is a way to find out more about why the filesystems seem to be busy although
there is nothing making them busy any longer. Since the system is in
production I have only little time to reboot it and analyse the problem and I
cannot install a debug kernel, but perhaps there are other ways to find out
more?

Thanks for any hint
Rainer
--
---------------------------------------------------------------------------
Rainer Krienke, Universitaet Koblenz, Rechenzentrum, Raum A022
Universitaetsstrasse 1, 56070 Koblenz, Tel: +49 261287 -1312, Fax: -1001312
Mail: krienke@xxxxxxxxxxxxxx, Web: http://www.uni-koblenz.de/~krienke
Get my public PGP key: http://www.uni-koblenz.de/~krienke/mypgp.html
---------------------------------------------------------------------------

Attachment: pgp00000.pgp
Description: signature