CD-ROM Access Crashes [Patches enclosed]

Leonard N. Zubkoff (lnz@dandelion.com)
Tue, 28 May 1996 20:33:32 -0700


I am consolidating replies to a number of messages regarding problems with SCSI
CD-ROMs, rather than replying to each thread individually. I spent some time
over the last couple of days investigating the problems that have been
reported. Initial testing was with Linux 2.0pre8.

Note: Alexey Kuznetsov's proposed changes to filemap.c did not fix any of this
in my testing. How about a unified diff next time, without the #ifdefs, by the
way? It's much easier to apply.

(1) Commercial CD-ROMs without physical errors read just fine without any
problems, both through the ISO9660 interface and through a "dd if=/dev/sr0"
command.

gwynedd:~# dd if=/dev/sr0 of=/dev/null
1330724+0 records in
1330724+0 records out

CD-ROM CAPACITY: 332681, Sector Size: 2048
Max size:332529 Log zone size:2048
First datazone:32 Root inode number 65536

(2) Recordable CDs (CD-Rs) do seem to have a problem. I can read one just
fine through the ISO9660 interface but not with a "dd if=/dev/sr0" command.

gwynedd:~# dd if=/dev/sr0 of=/dev/null
dd: /dev/sr0: I/O error
981760+0 records in
981760+0 records out

CD-ROM CAPACITY: 245442, Sector Size: 2048
Max size:245440 Log zone size:2048
First datazone:28 Root inode number 57344

scsi1: CCB #31858 Target 5: Result 2 Host Adapter Status 00 Target Status 02
scsi1: CDB 08 03 BE C0 02 00
scsi1: Sense F0 00 03 00 03 BE C1 0A 00 00 00 00 11 05 00 00
scsi1 channel 0 : resetting for second half of retries.
SCSI bus is being reset for host 1 channel 0.
scsi1: Sending Bus Device Reset CCB #31859 to Target 5
scsi1: Bus Device Reset CCB #31859 to Target 5 Completed
scsi1: CCB #31860 Target 5: Result 2 Host Adapter Status 00 Target Status 02
scsi1: CDB 08 03 BE C0 02 00
scsi1: Sense 70 00 06 00 00 00 00 0A 00 00 00 00 29 00 00 00
scsi1: CCB #31861 Target 5: Result 2 Host Adapter Status 00 Target Status 02
scsi1: CDB 08 03 BE C0 02 00
scsi1: Sense 70 00 02 00 00 00 00 0A 00 00 00 00 04 01 00 00
CDROM not ready. Make sure you have a disc in the drive.
CD-ROM I/O error: dev 0b:00, sector 981760

READ_6: block = 245440, count = 2
Sense Key = 3, Additional Sense Code = 11, Qualifier = 5
MEDIUM ERROR: L-EC UNCORRECTABLE ERROR

The problem here seems to be that the result of the READ CAPACITY command is
inconsistent with the data actually on the disc. It says that there are 245442
blocks on the CD, but attempting to read blocks 245440 and 245441 together
(dd stopped at 981760 512-byte records, and 981760/4 = 245440 in 2048-byte
blocks) fails on block 245441. I can only conclude from this that something
about the CD-ROM burning process used by cdwrite is at fault. I don't see
anything wrong with the kernel, though I'm not an expert in the CD-ROM
specifications. It would be interesting to see if there is any difference in
results for CD-Rs made with any of the commercial CD-R programs. Note that the
commercial CD shows a much higher capacity than the ISO9660 header indicates,
but those extra blocks are all readable.

In any event, I don't see a read ahead issue here, since we would have
attempted to read block 245441 anyway based on the capacity.

Oops. Looks like the above analysis is not quite correct. I just checked the
SCSI specification and it seems there is a kernel bug after all. According to
the specification for the READ CAPACITY command, a CD-ROM drive is *allowed* to
return a value up to 75 sectors beyond the last readable block on the medium.
This means that if we are to allow reading from sr devices without errors,
we're going to have to detect the case of a MEDIUM ERROR within 75 sectors of
the end of the device, and not signal an error at all.

I'm not quite sure yet how we want to do this. A good first step would be to
resurrect the medium error code that Andries was working on, which I assume
was never completed. I think that we will also need to update the
device->capacity and sr_sizes values to match the detected capacity, and modify
block_read to see an end-of-file for this case. Not at all pretty. Comments?

However, there is still another problem here. After the above error,
attempting a second dd command completely kills the system. In fact, most of
the deaths have been so spectacular that no information was available on the
screen. It gets worse...

(3) Commercial CD-ROMs with physical errors also crash the system completely
with no usable trace. I attempted to read my error test CD and the screen
flashed the most amazing colors, and error messages scrolled by at a ferocious
rate, or the system locked up with no indication of why. Something is
obviously seriously wrong here.

Investigating further (i.e. several hours later), I find that the bug is an
interaction between the following patch from 1.3.85:

--- v1.3.84/linux/drivers/scsi/scsi_ioctl.c Sat Dec 30 11:59:29 1995
+++ linux/drivers/scsi/scsi_ioctl.c Mon Apr 8 15:47:34 1996
@@ -149,6 +149,10 @@

result = SCpnt->result;
SCpnt->request.rq_status = RQ_INACTIVE;
+
+ if(SCpnt->device->scsi_request_fn)
+ (*SCpnt->device->scsi_request_fn)();
+
wake_up(&SCpnt->device->device_wait);
return result;
}

and the code in do_sr_request that re-locks the CD-ROM door. Whenever this
code in do_sr_request is triggered, it calls scsi_ioctl, which locks the door
and then calls back to the request function due to the above patch, which calls
back to lock the door, etc. Instant recursive death! This same death would
occur with removable disks as well, by the way.

Since the above code really is necessary, as I recall from earlier bug reports,
we can't just remove it. A reasonable fix for this problem seems to be quite
simple:

--- linux/drivers/scsi/scsi_ioctl.c- Sat May 18 12:19:09 1996
+++ linux/drivers/scsi/scsi_ioctl.c Tue May 28 15:42:10 1996
@@ -150,7 +150,7 @@
result = SCpnt->result;
SCpnt->request.rq_status = RQ_INACTIVE;

- if(SCpnt->device->scsi_request_fn)
+ if (!SCpnt->device->was_reset && SCpnt->device->scsi_request_fn)
(*SCpnt->device->scsi_request_fn)();

wake_up(&SCpnt->device->device_wait);

This fix is enough to keep the system from crashing, but it's not really the
entire answer. I've done a bit more investigation since the last time a MEDIUM
ERROR issue came up, and I think for the moment at least we are safest in not
resetting the device or bus in response. I still think there might be cases
where a bus reset would reinitialize a device and thereby fix a bogus MEDIUM
ERROR condition, but given the present state of the SCSI subsystem, I think not
issuing a reset is the more conservative choice.

With the following additional patch, I can access all the files on my test
CD-ROM that do not actually have a physical error.

--- linux/drivers/scsi/scsi.c- Tue May 28 20:04:54 1996
+++ linux/drivers/scsi/scsi.c Tue May 28 19:48:52 1996
@@ -1569,9 +1569,15 @@
case SUGGEST_IS_OK:
break;
case SUGGEST_REMAP:
+#ifdef DEBUG
+ printk("SENSE SUGGEST REMAP - status = FINISHED\n");
+#endif
+ status = FINISHED;
+ exit = DRIVER_SENSE | SUGGEST_ABORT;
+ break;
case SUGGEST_RETRY:
#ifdef DEBUG
- printk("SENSE SUGGEST REMAP or SUGGEST RETRY - status = MAYREDO\n");
+ printk("SENSE SUGGEST RETRY - status = MAYREDO\n");
#endif
status = MAYREDO;
exit = DRIVER_SENSE | SUGGEST_RETRY;
@@ -1606,6 +1612,9 @@
status = REDO;
break;
case SUGGEST_REMAP:
+ status = FINISHED;
+ exit = DRIVER_SENSE | SUGGEST_ABORT;
+ break;
case SUGGEST_RETRY:
status = MAYREDO;
exit = DRIVER_SENSE | SUGGEST_RETRY;

There's still some work to be done so that requests can partially succeed when
the medium error is not for the first block. The above crash problem is
dangerous enough that I don't want this patch to wait for the more complex
code.

Leonard