Re: [PATCH] xfs: fix use-after-free on CIL context on shutdown

From: Dave Chinner
Date: Mon Jun 15 2020 - 22:39:03 EST


On Tue, Jun 16, 2020 at 09:16:09AM +0800, yukuai (C) wrote:
> On 2020/6/11 10:45, Dave Chinner wrote:
> >
> > From: Dave Chinner <dchinner@xxxxxxxxxx>
> >
> > xlog_wait() on the CIL context can reference a freed context if the
> > waiter doesn't get scheduled before the CIL context is freed. This
> > can happen when a task is on the hard throttle and the CIL push
> > aborts due to a shutdown. This was detected by generic/019:
> >
> > thread 1 thread 2
> >
> > __xfs_trans_commit
> > xfs_log_commit_cil
> > <CIL size over hard throttle limit>
> > xlog_wait
> > schedule
> > xlog_cil_push_work
> > wake_up_all
> > <shutdown aborts commit>
> > xlog_cil_committed
> > kmem_free
> >
> > remove_wait_queue
> > spin_lock_irqsave --> UAF
> >
> > Fix it by moving the wait queue to the CIL rather than keeping it in
> > in the CIL context that gets freed on push completion. Because the
> > wait queue is now independent of the CIL context and we might have
> > multiple contexts in flight at once, only wake the waiters on the
> > push throttle when the context we are pushing is over the hard
> > throttle size threshold.
>
> Hi, Dave,
>
> How do you think about the following fix:
>
> 1. use autoremove_wake_func(), and remove remove_wait_queue() to
> avoid UAF.
> 2. add finish_wait().
>
> @@ -576,12 +576,13 @@ xlog_wait(
> __releases(lock)
> {
> DECLARE_WAITQUEUE(wait, current);
> + wait.func = autoremove_wake_function;
>
> add_wait_queue_exclusive(wq, &wait);
> __set_current_state(TASK_UNINTERRUPTIBLE);
> spin_unlock(lock);
> schedule();
> - remove_wait_queue(wq, &wait);
> + finish_wait(wq, &wait);
> }

Yes, that would address this specific symptom of the problem, but it
doesn't fix the problem root cause: that the wq can be freed while
this function sleeps. IMO, this sort of change leaves a trap for
future modifications - all the code calling xlog_wait() assumes the
embedded wq the task is sleeping on still exists after waiting so we
really should be fixing the problem the incorrect existence
guarantee in the CIL code that you tripped over.

Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx