Many open/close on same files yeilds "No such file or directory".

From: Jesper Krogh
Date: Thu May 01 2008 - 11:49:23 EST


Hi list.

I have a "fairly" reproducible problem. When a program opens and closes
the same file many times, it eventually ends up with a "no such file or
directory". Test program that can reproduce the problem on my setup:

root@hest:~# cat test-file-c.c
#include <stdlib.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
unsigned long i=0;
int fh;
char *filename;

filename=argv[1];

while(1) {
fh=open(filename, O_RDONLY);
if (fh==-1) {
printf("Failed to open %s\n", filename);
printf("Open number: %ld\n",i);
exit(10);
}
close(fh);
i++;
}

exit(0);
}
root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
Failed to open /z/bio/databases/online/index/index-by-accno
Open number: 61785000
root@hest:~# ./test-file-c /z/bio/databases/online/index/index-by-accno
Failed to open /z/bio/databases/online/index/index-by-accno
Open number: 120929685
(The problem is not isolate to a single file on the filesystem).


strace on the program reviel that the system indeed return a "No such
file or directory" to the program.

This is run on an Ubuntu Gutsy (vendor kernel): 2.6.22-14-server on an
4.5TB ext3 filesystem on an LVM volume. The volume was created on a
dapper (2 releases back) and has just followed with during upgrades.
I cannot reproduce it on other disks attached to the same server or on
other servers attached to similar disksystems.

The filesystem was taken offline yesterday for a forced fsck and it was
found to be clean.

The diskarray is a quite old Fibrenetix FX1200 with 12xPATA disk
in raid5 (with hotspare) exposed to the OS as 3 SCSI-disks of
2+2+0.5TB assembled with LVM afterwards. The SCSI-controller is a:
05:08.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev c1)

What suggestions do you have to solve this problem?

I'm about to mkfs.ext3 the volume and spool it back in from the backup,
but somehow I'm not convinced that it will solve the problem at all.
It may just be a hardware problem, but dmesg doesnt tell anything.

We actually got the problem from a perl-script, but this seems to be the
minimal program that reproduces the problem.

Jesper
--
Jesper


--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/