Re: deferred rss update instead of sloppy rss

From: Andrew Morton
Date: Mon Nov 22 2004 - 17:15:16 EST


Christoph Lameter <clameter@xxxxxxx> wrote:
>
> One way to solve the rss issues is--as discussed--to put rss into the
> task structure and then have the page fault increment that rss.
>
> The problem is then that the proc filesystem must do an extensive scan
> over all threads to find users of a certain mm_struct.
>
> The following patch does put the rss into task_struct. The page fault
> handler is then incrementing current->rss if the page_table_lock is not
> held.
>
> The timer interrupt checks if task->rss is non zero (when doing
> stime/utime updates. rss is defined near those so its hopefully on the
> same cacheline and has a minimal impact).
>
> If rss is non zero and the page_table_lock and the mmap_sem can be taken
> then the mm->rss will be updated by the value of the task->rss and
> task->rss will be zeroed. This avoids all proc issues. The only
> disadvantage is that rss may be inaccurate for a couple of clock ticks.
>
> This also adds some performance (sorry only a 4p system):
>
> sloppy rss:
>
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 4 10 1 0.593s 29.897s 30.050s 85973.585 85948.565
> 4 10 2 0.616s 42.184s 22.045s 61247.450 116719.558
> 4 10 4 0.559s 44.918s 14.076s 57641.255 177553.945
>
> deferred rss:
> Gb Rep Threads User System Wall flt/cpu/s fault/wsec
> 4 10 1 0.565s 29.429s 30.000s 87395.518 87366.580
> 4 10 2 0.500s 33.514s 18.002s 77067.935 145426.659
> 4 10 4 0.533s 44.455s 14.085s 58269.368 176413.196

hrm. I cannot see anywhere in this patch where you update task_struct.rss.

> Index: linux-2.6.9/kernel/exit.c
> ===================================================================
> --- linux-2.6.9.orig/kernel/exit.c 2004-11-22 09:51:58.000000000 -0800
> +++ linux-2.6.9/kernel/exit.c 2004-11-22 11:16:02.000000000 -0800
> @@ -501,6 +501,9 @@
> /* more a memory barrier than a real lock */
> task_lock(tsk);
> tsk->mm = NULL;
> + /* only holding mmap_sem here maybe get page_table_lock too? */
> + mm->rss += tsk->rss;
> + tsk->rss = 0;
> up_read(&mm->mmap_sem);

mmap_sem needs to be held for writing, surely?

> + /* Update mm->rss if necessary */
> + if (p->rss && p->mm && down_write_trylock(&p->mm->mmap_sem)) {
> + if (spin_trylock(&p->mm->page_table_lock)) {
> + p->mm->rss += p->rss;
> + p->rss = 0;
> + spin_unlock(&p->mm->page_table_lock);
> + }
> + up_write(&p->mm->mmap_sem);
> + }
> }

I'd also suggest that you do:

tsk->rss++;
if (tsk->rss > 16) {
spin_lock(&mm->page_table_lock);
mm->rss += tsk->rss;
spin_unlock(&mm->page_table_lock);
tsk->rss = 0;
}

just to prevent transient gross inaccuracies. For some value of "16".
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/