Re: [PATCH v2] sctp: fix refcount bug in sctp_wfree

From: Marcelo Ricardo Leitner
Date: Tue Mar 17 2020 - 23:55:57 EST


On Wed, Mar 18, 2020 at 10:45:51AM +0800, Qiujun Huang wrote:
> On Wed, Mar 18, 2020 at 1:30 AM Marcelo Ricardo Leitner
> <marcelo.leitner@xxxxxxxxx> wrote:
> >
> > Hi,
> >
> > On Tue, Mar 17, 2020 at 11:55:36PM +0800, Qiujun Huang wrote:
> > > Do accounting for skb's real sk.
> > > In some case skb->sk != asoc->base.sk:
> > >
> > > migrate routing sctp_check_transmitted routing
> > > ------------ ---------------
> > sctp_close();
> > lock_sock(sk2);
> > sctp_primitive_ABORT();
> > sctp_do_sm();
> > sctp_cmd_interpreter();
> > sctp_cmd_process_sack();
> > sctp_outq_sack();
> > sctp_check_transmitted();
> >
> > lock_sock(sk1);
> > sctp_getsockopt_peeloff();
> > sctp_do_peeloff();
> > sctp_sock_migrate();
> > > lock_sock_nested(sk2);
> > > mv the transmitted skb to
> > > the it's local tlist
> >
> >
> > How can sctp_do_sm() be called in the 2nd column so that it bypasses
> > the locks in the left column, allowing this mv to happen?
> >
> > >
> > > sctp_for_each_tx_datachunk(
> > > sctp_clear_owner_w);
> > > sctp_assoc_migrate();
> > > sctp_for_each_tx_datachunk(
> > > sctp_set_owner_w);
> > >
> > > put the skb back to the
> > > assoc lists
> > > ----------------------------------------------------
> > >
> > > The skbs which held bysctp_check_transmitted were not changed
> > > to newsk. They were not dealt with by sctp_for_each_tx_datachunk
> > > (sctp_clear_owner_w/sctp_set_owner_w).
> >
> > It would make sense but I'm missing one step earlier, I'm not seeing
> > how the move to local list is allowed/possible in there. It really
> > shouldn't be possible.
>
> I totally agree that.
> My mistake. So I added some log in my test showing the case:
> The backtrace:
> sctp_close
> sctp_primitive_ABORT
> sctp_do_sm
> sctp_association_free
> __sctp_outq_teardown
> /* Throw away unacknowledged chunks. */
> list_for_each_entry(transport, &q->asoc->peer.transport_addr_list,
> transports) {
> printk("[%d]deal with transmitted %#llx from transport %#llx %s,
> %d\n", raw_smp_processor_id(),
> &transport->transmitted, transport, __func__, __LINE__);
> while ((lchunk = sctp_list_dequeue(&transport->transmitted)) != NULL) {
>
> The trouble skb is from another peer sk in the same asoc, but
> accounted to the base.sk.

Hmm, not sure how you got that out of that debug msg, but okay.
Even if so, how would this trouble skb be accounted on the wrong sk by
then?

Asking because the fix that we want may be a better locking, to
prevent this situation from happening, than compensating for it in
sctp_wfree(). But for that we need to understand how this happened.

Marcelo