Re: ipc,sem: sysv semaphore scalability

From: Linus Torvalds
Date: Fri Mar 29 2013 - 17:12:20 EST


On Fri, Mar 29, 2013 at 1:41 PM, Linus Torvalds
<torvalds@xxxxxxxxxxxxxxxxxxxx> wrote:
>
> The alternative would be to make sure the thing is always locked (and
> in a rcu-read-safe region) before putref/getref. The only place (apart
> from the initial allocation, which doesn't matter, because nothing can
> see if itf that path fails) seems to be that freeque(), but I didn't
> check everything.
>
> Moving the
>
> msg_unlock(msq);
>
> to the end of freeque() might be the way to go. It's going to cause us
> to hold the lock for longer, but I'm not sure we care for that path.

Uhhuh. I think shm_destroy() has the same pattern. And I think that
one has locking reasons why it has to do the shm_unlock() before
tearing some things down, although I'm not sure..

The good news is that shm doesn't seem to have any users of
ipc_rcu_get/putref(), so I don't see anything to race against. So it
has the same buggy pattern, but I don't think it can trigger anything.

And ipc/sem.c has the same bug-pattern in freeary(). It does
"sem_unlock(sma)" followed by "ipc_rcu_putref(sma);", and it *does*
seem to have code that that can race against (sem_lock_and_putref()).
The whole "sem_putref()" tries to be careful and gets the lock for the
last ref, but getting the lock doesn't help when freeary() does the
refcount access without it.

The semaphore case seems to argue fairly strongly for an "atomic_t
refcount", because right now it does a lot of "sem_putref()" stuff
that just gets the lock for the putref. So not only is that
insufficient (if it races against freeary(), but it's also more
expensive than just an atomic refcount would have been.

I dunno. I'm still not sure this is triggerable, but it looks bad. But
both the semaphore case and the msg cases seem to be solvable by
moving the unlock down, and shm seem to have no getref/putref users to
race with, so this (whitespace-damaged) patch *may* be sufficient:

diff --git a/ipc/msg.c b/ipc/msg.c
index 31cd1bf6af27..338d8e2b589b 100644
--- a/ipc/msg.c
+++ b/ipc/msg.c
@@ -284,7 +284,6 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
expunge_all(msq, -EIDRM);
ss_wakeup(&msq->q_senders, 1);
msg_rmid(ns, msq);
- msg_unlock(msq);

tmp = msq->q_messages.next;
while (tmp != &msq->q_messages) {
@@ -297,6 +296,7 @@ static void freeque(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)
atomic_sub(msq->q_cbytes, &ns->msg_bytes);
security_msg_queue_free(msq);
ipc_rcu_putref(msq);
+ msg_unlock(msq);
}

/*
diff --git a/ipc/sem.c b/ipc/sem.c
index 58d31f1c1eb5..1cf024b9eac0 100644
--- a/ipc/sem.c
+++ b/ipc/sem.c
@@ -766,12 +766,12 @@ static void freeary(struct ipc_namespace *ns,
struct kern_ipc_perm *ipcp)

/* Remove the semaphore set from the IDR */
sem_rmid(ns, sma);
- sem_unlock(sma);

wake_up_sem_queue_do(&tasks);
ns->used_sems -= sma->sem_nsems;
security_sem_free(sma);
ipc_rcu_putref(sma);
+ sem_unlock(sma);
}

static unsigned long copy_semid_to_user(void __user *buf, struct
semid64_ds *in, int version)

(I didn't check very carefully, it's possible that we end up having
some locking problem if we move the unlocks down later, but it *looks*
fine)

Anybody?

Linus
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/