Re: [PATCH 0/2] shmem: fix faulting into a hole while it's punched, take 3

From: Vlastimil Babka
Date: Tue Jul 22 2014 - 06:06:37 EST


On 07/22/2014 10:07 AM, Hugh Dickins wrote:
On Mon, 21 Jul 2014, Sasha Levin wrote:
On 07/19/2014 07:44 PM, Hugh Dickins wrote:
Otherwise, I've been unable to reproduce the shmem_fallocate hang.
Great. Andrew, I think we can say that it's now safe to send
1/2 shmem: fix faulting into a hole, not taking i_mutex
2/2 shmem: fix splicing from a hole while it's punched
on to Linus whenever suits you.

(You have some other patches in the mainline-later section of the
mmotm/series file: they're okay too, but not in doubt as these two were.)

I think we may need to hold off on sending them...

It seems that this code in shmem_fault():

        /*
         * shmem_falloc_waitq points into the shmem_fallocate()
         * stack of the hole-punching task: shmem_falloc_waitq
         * is usually invalid by the time we reach here, but
         * finish_wait() does not dereference it in that case;
         * though i_lock needed lest racing with wake_up_all().
         */
        spin_lock(&inode->i_lock);
        finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
        spin_unlock(&inode->i_lock);

is problematic. I'm not sure what changed, but it seems to be causing everything
from NULL ptr derefs:

[ 169.922536] BUG: unable to handle kernel NULL pointer dereference at 0000000000000631

To memory corruptions:

[ 1031.264226] BUG: spinlock bad magic on CPU#1, trinity-c99/25740

And hangs:

[ 212.010020] INFO: rcu_preempt detected stalls on CPUs/tasks:

Ugh.

I'm so tired of this, I'm flailing around here, and could use some help.

But there is one easy change which might do it: please would you try
changing the TASK_KILLABLE a few lines above to TASK_UNINTERRUPTIBLE.

Hello,

I have discussed this with Michal Hocko, as he was recently fighting with wait queues, and indeed he quickly found a possible race. And yes, it's related to the TASK_KILLABLE state.

The problem would manifest when the waiting faulting task is woken by a signal (which TASK_KILLABLE allows): the signal wakeup changes its state to TASK_WAKING, and later it becomes TASK_RUNNING again.
Either value means that a wake_up_all() from the punching task will skip the faulter's shmem_fault_wait entry when clearing up the wait queue (see the p->state & state check in try_to_wake_up()).
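
For reference, the check I mean is this one; a trimmed sketch of try_to_wake_up(), quoted from memory and reduced to the part that matters here, not the whole function:

        static int
        try_to_wake_up(struct task_struct *p, unsigned int state, int wake_flags)
        {
                ...
                /*
                 * 'state' is the mask of states the waker is willing to wake
                 * (TASK_NORMAL for wake_up_all()).  A task that a signal has
                 * already moved to TASK_WAKING/TASK_RUNNING doesn't match, so
                 * it is skipped and its wait entry stays linked on the queue.
                 */
                if (!(p->state & state))
                        goto out;
                ...
        }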

Then, in the faulter's finish_wait(), &wait->task_list will not be empty, so it will try to access the wait queue head, which might already be gone from the puncher's stack...
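
For reference again, finish_wait() looks roughly like this (quoting kernel/sched/wait.c from memory, so treat it as a sketch): it locks and dereferences the waitq head q exactly when the wait entry is still linked, which is the situation the skipped faulter is left in:

        void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
        {
                unsigned long flags;

                __set_current_state(TASK_RUNNING);
                /*
                 * Entry still linked: nobody woke us through the waitq, so we
                 * must take q->lock and unlink ourselves -- but q may by now
                 * point into the puncher's dead stack frame.
                 */
                if (!list_empty_careful(&wait->task_list)) {
                        spin_lock_irqsave(&q->lock, flags);
                        list_del_init(&wait->task_list);
                        spin_unlock_irqrestore(&q->lock, flags);
                }
        }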

So if this is true, the change to TASK_UNINTERRUPTIBLE will avoid the problem, but it would be nicer to keep the KILLABLE state.
I think it could be done by testing whether the wait queue still exists and is the same one, before attempting finish_wait(). If it no longer exists, the faulter can skip finish_wait() altogether, because it must already be TASK_RUNNING.

        shmem_falloc = inode->i_private;
        if (shmem_falloc && shmem_falloc->waitq == shmem_falloc_waitq)
                finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
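
Spelled out with the locking around it (just a sketch of how it would slot into the i_lock section of shmem_fault() quoted above, untested):

        spin_lock(&inode->i_lock);
        /*
         * Only touch the waitq if the puncher still publishes it in
         * i_private; otherwise the puncher and its on-stack waitq are
         * gone, we are already TASK_RUNNING, and skipping finish_wait()
         * avoids following a dangling pointer into the dead stack frame.
         */
        shmem_falloc = inode->i_private;
        if (shmem_falloc && shmem_falloc->waitq == shmem_falloc_waitq)
                finish_wait(shmem_falloc_waitq, &shmem_fault_wait);
        spin_unlock(&inode->i_lock);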

It might still be theoretically possible that, although it has the same address, it's not the same wait queue, but that doesn't hurt correctness. I might be uselessly taking some other waitq's lock, but inode->i_lock still protects me from other faulters that are in the same situation. The puncher itself is already gone.

However, it's quite ugly, and if there is some wait queue debugging mode (I haven't checked) that e.g. verifies that wait queues and wait objects are empty before destruction, it wouldn't like this at all...


I noticed when deciding on the i_lock'ing there, that a lot of the
difficulty in races between the two ends, came from allowing KILLABLE
at the faulting end. Which was a nice courtesy, but one I'll gladly
give up on now, if it is responsible for these troubles.

Please give it a try, I don't know what else to suggest. And I've
no idea why the problem should emerge only now. If this change
does appear to fix it, please go back and forth with and without,
to gather confidence that the problem is still reproducible without
the fix. If fix it turns out to be - touch wood.

Thanks,
Hugh

