Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression

From: Feng Tang
Date: Fri Jun 04 2021 - 03:52:33 EST


On Fri, Jun 04, 2021 at 03:04:11PM +0800, Feng Tang wrote:
> Hi Linus,
>
> Sorry for the late response.
>
> On Mon, May 24, 2021 at 05:11:37PM -1000, Linus Torvalds wrote:
> > On Mon, May 24, 2021 at 5:00 PM kernel test robot <oliver.sang@xxxxxxxxx> wrote:
> > >
> > > FYI, we noticed a -9.2% regression of will-it-scale.per_thread_ops due to commit:
> > > commit: 57efa1fe5957694fa541c9062de0a127f0b9acb0 ("mm/gup: prevent gup_fast from racing with COW during fork")
> >
> > Hmm. This looks like one of those "random fluctuations" things.
> >
> > It would be good to hear if other test-cases also bisect to the same
> > thing, but this report already says:
> >
> > > In addition to that, the commit also has significant impact on the following tests:
> > >
> > > +------------------+---------------------------------------------------------------------------------+
> > > | testcase: change | will-it-scale: will-it-scale.per_thread_ops 3.7% improvement |
> >
> > which does kind of reinforce that "this benchmark gives unstable numbers".
> >
> > The perf data doesn't even mention any of the GUP paths, and on the
> > pure fork path the biggest impact would be:
> >
> > (a) maybe "struct mm_struct" changed in size or had a different cache layout
>
> Yes, this seems to be the cause of the regression.
>
> The test case has many threads doing map/unmap at the same time,
> so the process's rw_semaphore 'mmap_lock' is highly contended.
>
> Before the patch (with 0day's kconfig), the mmap_lock was split
> across 2 cachelines: the 'count' sat in one line and the other
> members in the next, which luckily avoided some cache bouncing.
> After the patch, the whole 'mmap_lock' is pushed into a single
> cacheline, which may cause the regression.
>
> Below is the pahole info:
>
> - before the patch
>
> spinlock_t page_table_lock; /* 116 4 */
> struct rw_semaphore mmap_lock; /* 120 40 */
> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> struct list_head mmlist; /* 160 16 */
> long unsigned int hiwater_rss; /* 176 8 */
>
> - after the patch
>
> spinlock_t page_table_lock; /* 124 4 */
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct rw_semaphore mmap_lock; /* 128 40 */
> struct list_head mmlist; /* 168 16 */
> long unsigned int hiwater_rss; /* 184 8 */
>
> perf c2c log can also confirm this.
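
The geometry implied by those pahole offsets can be double-checked with a
few lines of plain arithmetic: with 64-byte cachelines, the 40-byte
rw_semaphore starting at offset 120 straddles cachelines 1 and 2, while at
offset 128 it sits entirely inside cacheline 2. A minimal userspace sketch
(no kernel headers; the offsets and size are taken from the pahole dump
above, the 64-byte line size is an assumption of this example):

/*
 * Sanity-check of the cacheline geometry reported by pahole above.
 * Purely illustrative userspace arithmetic; assumes 64-byte cachelines
 * and uses the offsets/size from the dump (mmap_lock is 40 bytes,
 * at offset 120 before the patch and 128 after it).
 */
#include <stdio.h>

static void span(const char *tag, unsigned int off, unsigned int size)
{
	unsigned int first = off / 64;
	unsigned int last  = (off + size - 1) / 64;

	printf("%s: bytes %u..%u -> cacheline(s) %u..%u (%s)\n",
	       tag, off, off + size - 1, first, last,
	       first == last ? "single line" : "split across two lines");
}

int main(void)
{
	span("before patch", 120, 40);	/* bytes 120..159: lines 1 and 2 */
	span("after  patch", 128, 40);	/* bytes 128..167: line 2 only   */
	return 0;
}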

We've tried a patch which can fix the regression. The newly added
member 'write_protect_seq' is 4 bytes long, and putting it into an
existing 4-byte hole restores the original layout (and the lost
performance) without affecting the alignment of most other members.
Please review the following patch, thanks!

- Feng
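
For illustration only (the member names and layout below are simplified
stand-ins, not the real struct mm_struct and not the actual patch): a
4-byte member dropped into an existing 4-byte alignment hole changes
neither the structure size nor the offsets of the members that follow it,
which is the whole point of the relocation.

/*
 * Illustrative only: simplified stand-in members, not the real
 * struct mm_struct and not the actual patch.  A 4-byte member that
 * lands in an existing 4-byte hole changes neither the structure
 * size nor the offsets of the members that follow it.
 */
#include <stdio.h>
#include <stddef.h>

struct layout_before {
	long mm_count;		/* 8 bytes, offset 0             */
	int  map_count;		/* 4 bytes, offset 8             */
				/* 4-byte hole at offset 12      */
	long start_code;	/* 8 bytes, offset 16            */
};

struct layout_after {
	long mm_count;		/* 8 bytes, offset 0             */
	int  map_count;		/* 4 bytes, offset 8             */
	int  write_protect_seq;	/* fills the hole, offset 12     */
	long start_code;	/* 8 bytes, offset 16, unchanged */
};

int main(void)
{
	printf("size: before %zu, after %zu\n",
	       sizeof(struct layout_before), sizeof(struct layout_after));
	printf("offsetof(start_code): before %zu, after %zu\n",
	       offsetof(struct layout_before, start_code),
	       offsetof(struct layout_after, start_code));
	return 0;
}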