Re: Memory leak in 2.4.27 kernel, using mmap raw packet sockets

From: Marcelo Tosatti
Date: Fri Oct 15 2004 - 15:12:58 EST


On Thu, Oct 14, 2004 at 10:50:14AM -0400, bgagnon@xxxxxxxxxxxxx wrote:
> (Please cc me directly, since I am not a subscriber to this list)
>
> Hi,
>
> I discovered a memory leak in the mmap raw packet socket implementation,
> specifically when the client application core dumps (elf format). More
> specifically, the entire ring buffer memory leaks under such
> circumstances. I have reproduced the problem with our custom app, which
> links with Phil Wood's mmap version of libpcap (
> http://plublic.lanl.gov/cpw) as well as with tcpdump binding against the
> same library. I have asserted the problem presence in kernels 2.4.26 and
> 2.4.27. It has been resolved in kernel 2.6.8.
>
> In a nutshell, the reference count of the ring buffer page frame is not
> handled properly by the core dump function elf_core_dump() in
> fs/binfmt_elf.c. After exiting this function, all ring buffer page frames
> reference counts have been increased by one, which impedes their release
> when the socket is closed.
>
> The reference count mishap is related to the fact that the ring buffer
> page frames have their PG_reserved bit set when they are created by the
> raw packet socket. According to the comments in mm.h, this is to ensure
> the pages don't get swapped out to disk. elf_core_dump() calls
> get_user_pages(), which increments the reference count, irrespective of
> the PG_reserved bit, while the subsequent put_page() call only decrements
> it if the PG_reserved bit is clear.
>
> According to a similar kernel mailing list thread (
> http://www.ussg.iu.edu/hypermail/linux/kernel/0311.3/0495.html), the page
> reference count field is not significant if the PG_reserved bit is set.
> Applied to our particular case, the raw packet socket implementation
> should unconditionally reset the reference count when the socket is
> closed, so that the memory gets released. However, I fear this would cause
> user space applications to crash: according to the mmap man page, the
> memory mapping should survive the socket closure:
>
> "The munmap system call deletes the mappings for the specified address
> range, and causes further references to addresses within the range to
> generate invalid memory references. The region is also automatically
> unmapped when the process is terminated. On the other hand, closing
> the file descriptor does not unmap the region."
>
> So, on 2.4 kernels, we are facing the following situation:
>
> 1- The reference count information IS important to control the ring buffer
> lifetime
> 2- The PG_reserved bit causes the reference count not to be handled
> properly by different kernel subsystems, the core dump functions being one
> of them
>
> >From what I could understand, 2.6 developer fixed the get_user_pages()
> code so it does not increment to reference count if the page has the
> PG_reserved bit set. I don't know if this is readily applicable to 2.4
> kernels, since it may break some other subsystem that rely on its current
> behavior.
>
> So, unless a better solution is found, I would propose a localized fix for
> the leak, in function elf_core_dump(), starting at line 1275 (comments
> are my own) :
>
> if (get_user_pages(current, current->mm, addr, 1, 0, 1, /*BG: this
> ALWAYS will increment the ref count*/
> &page, &vma) <= 0) {
> DUMP_SEEK (file->f_pos + PAGE_SIZE);
> } else {
> if (page == ZERO_PAGE(addr)) {
> DUMP_SEEK (file->f_pos + PAGE_SIZE);
> } else {
> void *kaddr;
> flush_cache_page(vma, addr);
> kaddr = kmap(page);
> DUMP_WRITE(kaddr, PAGE_SIZE);
> flush_page_to_ram(page);
> kunmap(page);
> }
> /*BG: Start of fix*/
> if (PageReserved(page))
> /* BG: just undo the ref count increase done in
> get_user_pages*/
> put_page_testzero(page);
> else
> put_page(page);
> /*BG: end of fix */
> }

Hi Bernand,

I prefer doing the "if (PageReserved(page)) put_page_testzero(page)" as
you propose instead of changing get_user_pages(), as there are several
users which rely on its behaviour.

I have applied your fix to the 2.4 BK tree.

Thanks.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/