Re: [BUG 2.6.17-git] kmem_cache_create: duplicate cache scsi_cmd_cache

From: Linus Torvalds
Date: Fri May 12 2006 - 17:27:35 EST




On Fri, 12 May 2006, Russell King wrote:
>
> On Fri, May 12, 2006 at 01:50:46PM -0700, Linus Torvalds wrote:
> >
> > No, you introduced a regression. This isn't the SCSI layer being evil,
> > this is the "regressions aren't acceptable".
>
> No - I introduced a correct fix.

Not so.

You introduced a commit that fixed one thing, and broke another thing.

This is not rocket science, Russell. This is how programming works. There
are dependencies on behaviour, and if you change behaviour, it can fix one
problem while breaking another one, because different sides depend on
different semantics.

Real complexity is in linkages and communication. Surprise, surprise.

The _real_ broken ends up being elsewhere. It might just be a
communication error (two different subsystems got the wrong idea because
of bad communication about how something _should_ be used), but it migth
also be some bad design.

But saying that the two one-liners were the "correct fix" is premature
just because they fixed the symptoms for _you_.

> It seems that fixing simple bugs cause other bugs, and that means we're
> heading into a maintainability nightmare.

That's just stupid, Russell.

You DID NOT FIX A SIMPLE BUG.

You made a change, that on the face of it _looked_ like a simple fix for a
simple bug. It wasn't. No amount of you saying so makes it so. It broke
something else, and suddenly what you claim was a simple and clear bug-fix
obviously isn't.

Ergo, it ended up not being "fixing simple bugs" at all.

> No idea I'm afraid, and since I've had a _really_ extremely shitty couple
> of days I'm not about to start going to look for it.

Right. You're pissy. That's fine. But you having a shitty couple of days
doesn't mean you need to take it out on everybody else. Especially not
when people actually offered to help, and asked for details on exactly
what the problem that _you_ tracked was.

You depend on the exact same infrastructure that SCSI depends on. SCSI has
different expectations of that infrastructure than you do. Arguably, there
are more SCSI users out there than MMC users, so when you claim that it's
SCSI's fault, you're being biased. It's equally possible that the fault
was in the MMC layer all along, and that the MMC layers use of the
structure was the thing that caused the oops for the MMC later.

Since you didn't bother to even point to the oopses or the patch that
caused then now, you aren't actually giving any reason to believe that
your view of the situation is the only right one.

So you havign a hissy-fit actually makes it harder for people to even help
you. James already said

"If Russell and Jens can tell me what they're trying to do I'll
see if there's another way to do it."

but you didn't, you just started ranting.

See? Not very useful, is it?

> > Really, the added ref-count should be gotten by whoever holds on to the
> > thing, and it sounds like it's the hotplug event that caused this and
> > should have held on to its hotplug reference.
>
> ... which would be the genhd layer - it's keeping the driverfs_dev
> around so that the genhd layer can generate hotplug events using it
> at mount/umount time. Thanks for just re-confirming my original fix
> and that it's SCSI which is the real problem. 8)

No. You want your fix to be the correct one, but the fact is, it may well
not be that at all. Maybe - once we look at what the oops is - we find
that it really _is_ the MMC layer that just doesn't follow the assumptions
that the device layer was doing.

I said "sounds like". And without help, we can't much help you. And being
pissy about it, will just make everybody more sure that it's the MMC
layer.

Reference counts aren't always simple. Exactly because you can get a
"recursive" reference count, where a device holds on to its reference
count just by account of existing - and that is a BUG, because such a
reference count will obviously never go down to zero.

Now, not going down to zero will certainly hide any problems that come
from the object being released early, because it will _never_ be released.
And it sounds like this is actually what your patch _really_ caused. The
oops went away, but perhaps at the cost of a leak?

And hey, maybe your patch was proper, and the bug really _is_ in the SCSI
layer. In which case we get back to that message from James that you
ignored.

Linus
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/