Re: [PATCH v3 0/2] mm/ksm: add fork-exec support for prctl

From: Stefan Roesch
Date: Fri Sep 22 2023 - 12:21:54 EST



Andrew Morton <akpm@xxxxxxxxxxxxxxxxxxxx> writes:

> On Thu, 21 Sep 2023 09:47:07 -0700 Stefan Roesch <shr@xxxxxxxxxxxx> wrote:
>
>> A process can enable KSM with the prctl system call. When the process is
>> forked the KSM flag is inherited by the child process.
>
> I guess that's logical, as it's still the same program.
>
>> However if the
>> process is executing an exec system call directly after the fork, the
>> KSM setting is cleared. This patch series addresses this problem.
>
> Well... who said it's a problem? There's nothing in our documentation
> about this(?). Why is the current behavior wrong? If the new program
> wants KSM, it can turn on KSM.
>
> This significant change in user-visible behavior deserves much more
> explanation and justification, please. Including an explanation of why
> it's OK to change kernel behavior under existing users' feet like this,

Today we have two ways to enable KSM:

1) madvise system call
This enables KSM for a specific memory region and has been available
for a long time.

2) prctl system call
This is a recent addition that enables KSM for the entire process.
In addition, when the process forks, the child inherits the KSM setting.

This change only affects the second case.
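
For reference, the two mechanisms look roughly like this from userspace.
This is only a minimal sketch and not part of the patch series: the
helper names are made up for illustration, and the fallback defines
for PR_SET_MEMORY_MERGE / PR_GET_MEMORY_MERGE are there only in case
the installed headers predate the prctl.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <sys/prctl.h>

/* Fallbacks for older headers; values as in the uapi prctl.h. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/* (1) madvise: opt a single mapping into KSM (hypothetical helper).
 * Returns 0 on success, -1 with errno set on failure. */
int ksm_enable_region(void *addr, size_t len)
{
	return madvise(addr, len, MADV_MERGEABLE);
}

/* (2) prctl: opt the whole process into KSM (hypothetical helper).
 * Returns 0 on success, -1 with errno set on failure, e.g. on a
 * kernel without KSM or without the prctl. */
int ksm_enable_process(void)
{
	return prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0);
}

/* Query the process-wide setting: 1 = enabled, 0 = disabled,
 * -1 with errno set if the prctl is not supported. */
int ksm_process_enabled(void)
{
	return prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0);
}
```

With (1) the application has to know which mappings are worth
merging; with (2) the caller does not need any knowledge about the
memory layout of the workload.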

One of the use cases for (2) was to support the ability to enable
KSM for cgroups. This allows systemd to enable KSM for the seed
process. By enabling it in the seed process all child processes inherit
the setting.

This works correctly when the process is forked. However, it doesn't
support the fork/exec workflow: the setting is cleared on exec.
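
The fork half of this can be checked from userspace with something
like the sketch below. It is an illustration only: the function name
is hypothetical, and it assumes the parent has already enabled KSM
with prctl(PR_SET_MEMORY_MERGE, 1, 0, 0, 0). To check the fork/exec
case, the exec'ed program would have to issue PR_GET_MEMORY_MERGE
itself, which is not shown here.

```c
#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fallbacks for older headers; values as in the uapi prctl.h. */
#ifndef PR_SET_MEMORY_MERGE
#define PR_SET_MEMORY_MERGE 67
#endif
#ifndef PR_GET_MEMORY_MERGE
#define PR_GET_MEMORY_MERGE 68
#endif

/*
 * Fork a child and report the KSM setting the child sees
 * (hypothetical helper): 1 = enabled, 0 = disabled, -1 = error or
 * prctl not supported.
 */
int ksm_flag_in_forked_child(void)
{
	int status, code;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: query the setting and pass it back as exit code. */
		int on = prctl(PR_GET_MEMORY_MERGE, 0, 0, 0, 0);

		/* An exec here would replace the program image; the new
		 * program would then have to do the query itself. */
		_exit(on == 1 ? 1 : on == 0 ? 0 : 2);
	}

	/* Parent: collect the child's answer. */
	if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
		return -1;

	code = WEXITSTATUS(status);
	return code == 2 ? -1 : code;
}
```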

From the previous cover letter:

...
Use case 3:
With the madvise call, sharing opportunities are only enabled for the
current process: it is a workload-local decision. A considerable number
of sharing opportunities may exist across multiple workloads or jobs (if
they are part of the same security domain). Only a higher level entity
like a job scheduler or container can know for certain if it's running
one or more instances of a job. That job scheduler, however, doesn't have
the necessary internal workload knowledge to make targeted madvise calls.
...


In addition, it can be a bit surprising that fork keeps the KSM
setting while fork/exec does not.