Re: proc_flush_task oops

From: Alexey Dobriyan
Date: Fri Dec 22 2017 - 05:13:33 EST


On 12/22/17, Eric W. Biederman <ebiederm@xxxxxxxxxxxx> wrote:
> Dave Jones <davej@xxxxxxxxxxxxxxxxx> writes:
>
>> On Thu, Dec 21, 2017 at 07:31:26PM -0600, Eric W. Biederman wrote:
>> > Dave Jones <davej@xxxxxxxxxxxxxxxxx> writes:
>> >
>> > > On Thu, Dec 21, 2017 at 12:38:12PM +0200, Alexey Dobriyan wrote:
>> > >
>> > > > > with proc_mnt still set to NULL is a mystery to me.
>> > > > >
>> > > > > Is there any chance the idr code doesn't always return the
>> lowest valid
>> > > > > free number? So init gets assigned something other than 1?
>> > > >
>> > > > Well, this theory is easy to test (attached).
>> > >
>> > > I didn't hit this BUG, but I hit the same oops in proc_flush_task.
>> >
>> > Scratch one idea.
>> >
>> > If it isn't too much trouble can you try this.
>> >
>> > I am wondering if somehow the proc_mnt that is NULL is somewhere in
>> the
>> > middle of the stack of pid namespaces.
>> >
>> > This adds two warnings. The first just reports which pid namespace in
>> > the stack of pid namespaces is problematic, and the pid number in that
>> > pid namespace. Which should give a whole lot more to go by.
>> >
>> > The second warning complains if we manage to create a pid namespace
>> > where the parent pid namespace is not properly set up. The test to
>> > prevent that looks quite robust, but at this point I don't know where
>> to
>> > look.
>>
>> Progress ?
>>
>> [ 1653.030190] ------------[ cut here ]------------
>> [ 1653.030852] 1/1: 2 no proc_mnt
>> [ 1653.030946] WARNING: CPU: 2 PID: 4420 at kernel/pid.c:213
>> alloc_pid+0x24f/0x2a0
>
> Yes. I don't know why Alexey's patch did not fire but this is
> confirmation that the first pid allocated was #2 and not #1.
>
> Which explains the pid_mnt not being set, and it is definitely the new
> code, changing from the pid bitmap+hash table to an idr.
>
> So it looks like idr_alloc_cyclic in some configuration for the first
> allocation returns value #2 instead of value #1.
>
> I don't know that code, and it is quite complicated so I will have to
> stare at it some more to even guess why it is doing that.
>
> This is confirmation that reverting those pid changes will fix the
> problem. As they are definitely at fault.
>
>
> Hmm. After a little more staring I have a hunch what is going wrong.
> It is just possible that there is a failure in alloc_pid during the
> first pid allocation and then idr_next gets left at 2. I need to sleep
> before I can think of a patch to test that.
>
> Hmm. A failure and then restart would also explain why Alexey's patch
> did not fire. An incomplete reset of state.

You are right (you are also right about sysctl :-\)

unshare
fork
alloc_pid in level 1 succeeds
alloc_pid in level 0 fails, ->idr_next is 2
fork
alloc pid 2
exit

Reliable reproducer and fail injection patch attached

I'd say proper fix is allocating pids in the opposite order
so that failure in the last layer doesn't move IDR cursor
in baby child pidns.


BUG: unable to handle kernel NULL pointer dereference at (null)
IP: proc_flush_task+0x7b/0x180
PGD 3dbb5067 P4D 3dbb5067 PUD 3d11c067 PMD 0
Oops: 0000 [#1] PREEMPT SMP
CPU: 1 PID: 2775 Comm: trinity Not tainted
4.15.0-rc4-00202-gead68f216110-dirty #12
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.10.2-1.fc27 04/01/2014
RIP: 0010:proc_flush_task+0x7b/0x180
RSP: 0018:ffffc9000016bca8 EFLAGS: 00010296
RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000032
RDX: ffffc9000016bcbc RSI: ffffc9000016bcc8 RDI: 0000000000000001
RBP: ffffc9000016bcbb R08: 0000000036373732 R09: 0000000000000000
R10: 0000000000000000 R11: ffffc9000016bcbc R12: 0000000000000002
R13: 0000000000000000 R14: 0000000000000002 R15: ffff88003e2aee00
FS: 00007f9b52c86700(0000) GS:ffff88003fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000003d24b000 CR4: 00000000000006a0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
release_task+0x3c/0x430
? thread_group_cputime_adjusted+0x35/0x40
wait_consider_task+0x7a3/0x7d0
do_wait+0xe5/0x1f0
kernel_wait4+0x74/0x110
? task_stopped_code+0x50/0x50
SyS_wait4+0x77/0x80
? handle_mm_fault+0x4a/0x50
? __do_page_fault+0x1a9/0x3e0
? entry_SYSCALL_64_fastpath+0x13/0x6c
entry_SYSCALL_64_fastpath+0x13/0x6c
diff --git a/kernel/pid.c b/kernel/pid.c
index b13b624e2c49..c9ef3328a52a 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -146,6 +146,8 @@ void free_pid(struct pid *pid)

struct pid *alloc_pid(struct pid_namespace *ns)
{
+ static int try = 0;
+ bool fail = false;
struct pid *pid;
enum pid_type type;
int i, nr;
@@ -153,6 +155,9 @@ struct pid *alloc_pid(struct pid_namespace *ns)
struct upid *upid;
int retval = -ENOMEM;

+ if (strncmp(current->comm, "trinity", 7) == 0)
+ printk("%s: pid_ns level %d\n", __func__, ns->level);
+
pid = kmem_cache_alloc(ns->pid_cachep, GFP_KERNEL);
if (!pid)
return ERR_PTR(retval);
@@ -163,6 +168,9 @@ struct pid *alloc_pid(struct pid_namespace *ns)
for (i = ns->level; i >= 0; i--) {
int pid_min = 1;

+ if (strncmp(current->comm, "trinity", 7) == 0)
+ printk("%s: '%s' level %d\n", __func__, current->comm, tmp->level);
+
idr_preload(GFP_KERNEL);
spin_lock_irq(&pidmap_lock);

@@ -173,12 +181,25 @@ struct pid *alloc_pid(struct pid_namespace *ns)
if (idr_get_cursor(&tmp->idr) > RESERVED_PIDS)
pid_min = RESERVED_PIDS;

+ if (fail && strncmp(current->comm, "trinity", 7) == 0) {
+ fail = false;
+ nr = -ENOMEM;
+ goto xxx;
+ }
+
/*
* Store a null pointer so find_pid_ns does not find
* a partially initialized PID (see below).
*/
+ if (strncmp(current->comm, "trinity", 7) == 0) {
+ printk("BEFORE ->idr_next %u\n", tmp->idr.idr_next);
+ }
nr = idr_alloc_cyclic(&tmp->idr, NULL, pid_min,
pid_max, GFP_ATOMIC);
+ if (strncmp(current->comm, "trinity", 7) == 0) {
+ printk(" AFTER ->idr_next %u\n", tmp->idr.idr_next);
+ }
+xxx:
spin_unlock_irq(&pidmap_lock);
idr_preload_end();

@@ -187,6 +208,12 @@ struct pid *alloc_pid(struct pid_namespace *ns)
goto out_free;
}

+ if (try == 0 && strncmp(current->comm, "trinity", 7) == 0) {
+ if (tmp->level == 1) {
+ fail = true;
+ }
+ }
+
pid->numbers[i].nr = nr;
pid->numbers[i].ns = tmp;
tmp = tmp->parent;
@@ -229,6 +256,9 @@ struct pid *alloc_pid(struct pid_namespace *ns)
spin_unlock_irq(&pidmap_lock);

kmem_cache_free(ns->pid_cachep, pid);
+ if (strncmp(current->comm, "trinity", 7) == 0) {
+ try++;
+ }
return ERR_PTR(retval);
}

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

int main(void)
{
int pid;

unshare(CLONE_NEWPID);
fork();
pid = fork();
if (pid == 0) {
return 0;
}
waitpid(pid, NULL, 0);
return 0;
}