Re: Atomicity in syscalls ie fork [was Meaning of a kernel oops?]

Michael L. Galbraith (
Thu, 25 Sep 1997 20:45:49 +0200 (MET DST)

On Wed, 24 Sep 1997, Bill Hawes wrote:

> Anne wrote:
> > Ok, re-compiled kernel w/o modules and the problem didn't go away.
> > However, the stack trace still looks a bit sparse. I tried 2.0.30 again,
> > with and without modules, and it still doesn't exhibit the problem.
> Hi Anne,
> I think I see what's causing your squake problem. It looks like a
> possible race condition whereby a task has an mm struct but no page dir
> yet.
> Could you try the attached patch and see whether the error message
> prints out? It may not fix the problem, but this will help confirm the
> diagnosis.
> Regards,
> Bill

Hello Bill, other Gurus (and fellow tinkers :),

The oops which I posted earlier and sent you via private mail seems to confirm
that there is a race condition present. It is not only present in 2.0.??, but
also in 2.1.56. Before I go further and (likely) step in it ;) here is the oops
with what my I've come up with so far using my handy dandy assembler book.

Unable to handle kernel NULL pointer dereference at virtual address 00000bfc
current->tss.cr3 = 043ba000, `r3 = 043ba000
*pde = 00000000
Oops: 0000
CPU: 0
EIP: 0010:[<c013f952>]
EFLAGS: 00010217
eax: 00000000 ebx: 00000bfc ecx: bffff9ec edx: 000002ff
esi: bffff9ec edi: 00000000 ebp: bffff9f5 esp: c43f7f1c
ds: 0018 es: 0018 ss: 0018
Process top (pid: 231, process nr: 12, stackpage=c43f7000)
Stack: 00000000 c013f9db c2e78000 bffff9ec 00000216 c432b000 c23e8ea0 000007ff
c013fad7 c2e78000 bffff9ec bffff9f5 c432b000 00003ac6 c432b000 c01408a7
00003ac6 c432b000 c014096b c432b000 00003ac6 0000000a c23e8ea0 ffffffea
Call Trace: [<c013f9db>] [<c013fad7>] [<c01408a7>] [<c014096b>] [<c01225bb>] [<c010926a>]
Code: 8b 13 85 d2 74 20 89 d0 25 fb 0f 00 00 83 f8 63 74 18 52 68
Using `/boot/2.1.56/' to map addresses to symbols.

>>EIP: c013f952 <get_phys_addr+2a/8c>
Trace: c013f9db <get_array+27/6c>
Trace: c013fad7 <get_arg+57/60>
Trace: c01408a7 <get_process_array+63/88>
Trace: c014096b <array_read+9f/274>
Trace: c01225bb <sys_read+a7/ec>
Trace: c010926a <system_call+3a/40>

Code: c013f952 <get_phys_addr+2a/8c>
Code: c013f952 <get_phys_addr+2a/8c> 8b 13 movl (%ebx),%edx
Code: c013f954 <get_phys_addr+2c/8c> 85 d2 testl %edx,%edx
Code: c013f956 <get_phys_addr+2e/8c> 74 20 je c013f978 <get_phys_addr+50/8c>
Code: c013f958 <get_phys_addr+30/8c> 89 d0 movl %edx,%eax
Code: c013f95a <get_phys_addr+32/8c> 25 fb 0f 00 00 andl $0xffb,%eax
Code: c013f965 <get_phys_addr+3d/8c> 83 f8 63 cmpl $0x63,%eax
Code: c013f968 <get_phys_addr+40/8c> 74 18 je c013f97c <get_phys_addr+54/8c>
Code: c013f96a <get_phys_addr+42/8c> 52 pushl %edx
Code: c013f96b <get_phys_addr+43/8c> 68 00 90 90 90 pushl $0x90909000

vmlinux: file format elf32-i386

Disassembly of section .text:

static unsigned long get_phys_addr(struct task_struct * p, unsigned long ptr)
c013f928 <get_phys_addr> pushl %ebx
c013f929 <get_phys_addr+1> movl 0x8(%esp,1),%eax // mov p into eax
c013f92d <get_phys_addr+5> movl 0xc(%esp,1),%ecx // mov ptr into ecx
pgd_t *page_dir;
pmd_t *page_middle;
pte_t pte;

if (!p || !p->mm || ptr >= TASK_SIZE)
c013f931 <get_phys_addr+9> testl %eax,%eax // p != 0 or else
c013f933 <get_phys_addr+b> je c013f978 <get_phys_addr+50>
c013f935 <get_phys_addr+d> movl 0x3ac(%eax),%eax // p+0x3ac=p->mm
c013f93b <get_phys_addr+13> testl %eax,%eax // p->mm != 0 or else
c013f93d <get_phys_addr+15> je c013f978 <get_phys_addr+50>
c013f93f <get_phys_addr+17> cmpl $0xbfffffff,%ecx // cmp TASK_SIZE-1 to ptr
c013f945 <get_phys_addr+1d> ja c013f978 <get_phys_addr+50>
return 0;
page_dir = pgd_offset(p->mm,ptr);
c013f947 <get_phys_addr+1f> movl %ecx,%edx // mov ptr into edx
c013f949 <get_phys_addr+21> shrl $0x16,%edx // shift ptr right 16
c013f94c <get_phys_addr+24> movl 0x8(%eax),%eax // mov p->mm->pgd into eax
c013f94f <get_phys_addr+27> leal (%eax,%edx,4),%ebx // uhoh! p->mm->pgd NULL
if (pgd_none(*page_dir)) // ..not good
return 0; // all this is optimized
if (pgd_bad(*page_dir)) { // away by gcc on x86
printk("bad page directory entry %08lx\n", pgd_val(*page_dir));
return 0;
page_middle = pmd_offset(page_dir,ptr);
if (pmd_none(*page_middle))
c013f952 <get_phys_addr+2a> movl (%ebx),%edx <--OOPS

Exactly as you stated.. have p->mm, but no p->mm->pgd.

Here comes the part where I _hope_ I did't screw up :-/

Taking a look around, I find that p->mm->pgd is initialized via..

sys_fork() -> do_fork() -> ... -> copy_mm() -> new_page_tables() ...

Now, (ulp) what I don't see is anything on a UP machine which insures atomicity
for fork. Lock/unlock_kernel() is optimized away by gcc on UP, and even if it
weren't, it wouldn't do anything anyway.. so it _can_ be otimized away.

It looks (to me) like fork can (does) get preempted after copy_mm() but before
new_page_tables() and therefore produce oops. I haven't found any place else
where pgd is initialized. If that's true, then fork() must be being preempted.

Changing if (!p || !p->mm || ptr >= TASK_SIZE) to..
if (!p || !p->mm || !p->mm->pgd || ptr >= TASK_SIZE)
should avoid the oops, but that looks like a bandaid.

o Is there atomicity insurance for UP that I'm not seeing?

o Is there any reason for sys_fork() and sys_clone() being protected by
lock_kernel() _and_ do_fork() also locking? I only see one path.

The same applies (or not) to sys_read. Sys_read() -> array_read() [change inode
into pid] -> get_process_array(pid) -> get_arg(pid) [proc/array.c]

I can't find anything which guarantees that pid will remain valid throughout
it's lifetime (until we get to find_task_by_pid) _if_ preemption is allowed.

Educational raps up side the head very welcome if this is dainbramaged.