Re: Many open/close on same files yields "No such file or directory".

From: Andrew Morton
Date: Fri May 02 2008 - 01:40:07 EST


On Thu, 01 May 2008 17:34:46 +0200 Jesper Krogh <jesper@xxxxxxxx> wrote:

> Hi list.
>
> I have a "fairly" reproducible problem. When a program opens and closes
> the same file many times, it eventually ends up with a "no such file or
> directory". Test program that can reproduce the problem on my setup:
>
> root@hest:~# cat test-file-c.c
> #include <stdlib.h>
> #include <stdio.h>
> #include <fcntl.h>
> #include <unistd.h>
>
> int main(int argc, char *argv[]) {
>     unsigned long i = 0;    /* number of successful open/close cycles */
>     int fh;
>     char *filename;
>
>     filename = argv[1];
>
>     /* Open and close the same file until open() fails. */
>     while (1) {
>         fh = open(filename, O_RDONLY);
>         if (fh == -1) {
>             printf("Failed to open %s\n", filename);
>             printf("Open number: %lu\n", i);
>             exit(10);
>         }
>         close(fh);
>         i++;
>     }
>
>     exit(0);    /* not reached */
> }
> root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
> Failed to open /z/bio/databases/online/index/index-by-accno
> Open number: 61785000
> root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
> Failed to open /z/bio/databases/online/index/index-by-accno
> Open number: 120929685
> (The problem is not isolated to a single file on the filesystem).
>

What an amazing bug.

> strace on the program reveals that the system indeed returns a "No such
> file or directory" to the program.
>
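
For anyone re-running the test without strace, here is a minimal sketch of
the same loop that keeps the errno from the failing open(), so the error
code shows up next to the iteration count. The helper name
open_close_loop() and its parameters are made up for illustration; they
are not part of the original report.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the original report): open and close
 * `path` up to `max_iters` times. Returns the number of successful
 * iterations; if open() fails, *err_out holds its errno, otherwise 0.
 */
unsigned long open_close_loop(const char *path, unsigned long max_iters,
                              int *err_out)
{
    unsigned long i;

    assert(path != NULL && err_out != NULL);
    *err_out = 0;

    for (i = 0; i < max_iters; i++) {
        int fd = open(path, O_RDONLY);

        if (fd == -1) {
            *err_out = errno;   /* save before another call can clobber it */
            break;
        }
        close(fd);
    }
    return i;
}
```

A caller could pass argv[1] and a large max_iters, then print the returned
count together with strerror() of the saved error. An ENOENT here would
match what strace showed.
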
> This is run on an Ubuntu Gutsy (vendor kernel): 2.6.22-14-server on a
> 4.5TB ext3 filesystem on an LVM volume. The volume was created on
> dapper (2 releases back) and has simply been carried along through the
> upgrades.

The test program runs (almost) entirely out of RAM and won't care about the
storage hardware.

> I cannot reproduce it on other disks attached to the same server or on
> other servers attached to similar disksystems.

hmm.

I guess it would be interesting to remount that filesystem with `noatime'
to eliminate the last bit of I/O and block-related code.
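
A rough way to see whether atime updates are in play for the file itself
is to compare st_atime before and after a batch of open/close cycles. The
sketch below does only that: it says nothing about directories or symlinks
along the path, and st_atime has one-second granularity, so treat the
result as a hint. The function name atime_changed_by_loop() is
hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if st_atime of `path` differs after
 * `iters` open/close cycles, 0 if it is unchanged, and -1 if stat()
 * or open() fails.
 */
int atime_changed_by_loop(const char *path, unsigned long iters)
{
    struct stat before, after;
    unsigned long i;

    assert(path != NULL);

    if (stat(path, &before) == -1)
        return -1;

    for (i = 0; i < iters; i++) {
        int fd = open(path, O_RDONLY);

        if (fd == -1)
            return -1;
        close(fd);
    }

    if (stat(path, &after) == -1)
        return -1;

    /* Seconds-resolution comparison of the access timestamps. */
    return before.st_atime != after.st_atime;
}
```

If this returns 0 both before and after the remount, the file's own atime
was probably never being written by this loop, which would narrow down
what `noatime' can actually change here.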

> The filesystem was taken offline yesterday for a forced fsck and it was
> found to be clean.
>
> The diskarray is a quite old Fibrenetix FX1200 with 12xPATA disks
> in raid5 (with hotspare) exposed to the OS as 3 SCSI-disks of
> 2+2+0.5TB assembled with LVM afterwards. The SCSI-controller is a:
> 05:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X
> Fusion-MPT Dual Ultra320 SCSI (rev c1)
>
> What suggestions do you have to solve this problem?
>
> I'm about to mkfs.ext3 the volume and spool it back in from the backup,
> but somehow I'm not convinced that it will solve the problem at all.
> It may just be a hardware problem, but dmesg doesn't show anything.
>
> We actually got the problem from a perl-script, but this seems to be the
> minimal program that reproduces the problem.

I'd suspect that after 1e8 loops your CPU got too hot and started to
misbehave.