[Bisected Regression in 2.6.32.8] i915 with KMS enabled causes memory corruption when resuming from suspend-to-disk

From: M. Vefa Bicakci
Date: Sat Mar 13 2010 - 00:18:50 EST


Hello,

As the subject suggests, I have noticed that enabling the KMS (kernel
mode-setting) feature of the i915 module on any kernel version after
2.6.32.7 causes memory corruption after resuming from suspend-to-disk.

My hardware is a Toshiba Satellite A100, with an Intel graphics card.
I am using an up-to-date version of Debian Sid. Here are the lspci
entries for my graphics card:

=== 8< ===
00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller [8086:27a2] (rev 03) (prog-if 00 [VGA controller])
00:02.1 Display controller [0380]: Intel Corporation Mobile 945GM/GMS/GME, 943/940GML Express Integrated Graphics Controller [8086:27a6] (rev 03)
=== >8 ===

After upgrading from 2.6.32.7 to 2.6.32.9, I started getting a lot of
segfaults from different programs after resuming from suspend-to-disk.
Searching the Internet, I found that other people have run into this
problem as well, and that it is not a new one:

http://bbs.archlinux.org/viewtopic.php?id=91375
https://bugzilla.redhat.com/show_bug.cgi?id=537494
http://bugzilla.kernel.org/show_bug.cgi?id=13811

Even though some people say that they have had this problem for a long time,
I have only noticed it after upgrading to 2.6.32.9.

Booting with "nomodeset" makes the problem go away, so I concluded that
the problem is in i915.
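
For anyone repeating the check above, it helps to confirm that the option
actually reached the kernel before drawing conclusions. The following is
only a minimal sketch; the helper name has_nomodeset is made up for this
illustration and is not part of any tool:

```shell
# Hypothetical helper: succeed if the given kernel command line contains
# the "nomodeset" option as a separate word.
has_nomodeset() {
    case " $1 " in
        *" nomodeset "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a running system (reads the command line the kernel booted with):
# has_nomodeset "$(cat /proc/cmdline)" && echo "booted with nomodeset"
```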

Then I used the following command to bisect the changes that i915 has
seen between 2.6.32.7 and 2.6.32.9:

git bisect start v2.6.32.9 v2.6.32.7 -- ./drivers/gpu/drm/

At each step of the bisection, I ran at least 3 cycles of suspend-to-disk
and resume. Every version I tried showed memory corruption after resuming
from suspend-to-disk.
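
The rule I applied at each step can be sketched roughly as follows. This
is an illustration only: the helper name bisect_verdict and the "ok" and
"corrupt" labels are assumptions made for this sketch, and the outcome of
each cycle was judged by hand.

```shell
# Hypothetical helper: derive the bisect verdict for one kernel build from
# the outcomes of several suspend/resume cycles ("ok" or "corrupt").
# A single corrupt cycle is enough to mark the build as bad.
bisect_verdict() {
    for outcome in "$@"; do
        if [ "$outcome" != "ok" ]; then
            echo bad
            return 0
        fi
    done
    echo good
}

# Usage after testing one build (run inside the kernel tree being bisected):
# git bisect "$(bisect_verdict ok ok corrupt)"   # expands to: git bisect bad
```

Running several cycles per build matters because one clean resume does
not prove that a build is good.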

git then reported that the culprit is the first change to i915 after the
2.6.32.7 release, so 2.6.32.8 introduced the regression I am experiencing.
Here's the "git bisect log" output:

=== 8< ===
# bad: [7f5e918e62cbc9ac27c2f47d3c3dd4b86f67ff0e] Linux 2.6.32.9
# good: [b4bdd73ce865213a5653dc424873e8da37e858cc] Linux 2.6.32.7
git bisect start 'v2.6.32.9' 'v2.6.32.7' '--' './drivers/gpu/drm/'
# bad: [192ff23a2206eb5136c779bfed73171a4d214ad6] drm/i915: Add HP nx9020/SamsungSX20S to ACPI LID quirk list
git bisect bad 192ff23a2206eb5136c779bfed73171a4d214ad6
# bad: [6240058ce3725f5e708e1c17c3a676217e44ba9b] drm/i915: disable hotplug detect before Ironlake CRT detect
git bisect bad 6240058ce3725f5e708e1c17c3a676217e44ba9b
# bad: [61d4374b51386dd40c03fd15df5a7f97347de688] drm/i915: Reload hangcheck timer too for Ironlake
git bisect bad 61d4374b51386dd40c03fd15df5a7f97347de688
# bad: [d8e0902806c0bd2ccc4f6a267ff52565a3ec933b] drm/i915: Selectively enable self-reclaim
git bisect bad d8e0902806c0bd2ccc4f6a267ff52565a3ec933b

d8e0902806c0bd2ccc4f6a267ff52565a3ec933b is the first bad commit
commit d8e0902806c0bd2ccc4f6a267ff52565a3ec933b
Author: Chris Wilson <chris@xxxxxxxxxxxxxxxxxx>
Date: Wed Jan 27 13:36:32 2010 +0000

drm/i915: Selectively enable self-reclaim

commit 4bdadb9785696439c6e2b3efe34aa76df1149c83 upstream.

Having missed the ENOMEM return via i915_gem_fault(), there are probably
other paths that I also missed. By not enabling NORETRY by default these
paths can run the shrinker and take memory from the system (but not from
our own inactive lists because our shrinker can not run whilst we hold
the struct mutex) and this may allow the system to survive a little longer
whilst our drivers consume all available memory.

References:
OOM killer unexpectedly called with kernel 2.6.32
http://bugzilla.kernel.org/show_bug.cgi?id=14933

v2: Pass gfp into page mapping.
v3: Use new read_cache_page_gfp() instead of open-coding.

...
=== >8 ===

For the record, just to confirm that this commit is actually the culprit,
I took a vanilla 2.6.32.9 source tree and reverted only this commit. I am
happy to let you know that with this commit reverted, I can no longer
reproduce the memory corruption issue.
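
In a git clone of the stable tree, an equivalent of that test could look
roughly like the sketch below. The helper name revert_on_top and the
branch name revert-test are arbitrary choices made for this illustration:

```shell
# Hypothetical helper: create a test branch at a base tag and revert a
# single commit on top of it. Assumes it is run inside a clone of the
# stable kernel tree; the branch name "revert-test" is arbitrary.
revert_on_top() {
    base=$1
    commit=$2
    git checkout -b revert-test "$base" &&
        git revert --no-edit "$commit"
}

# Usage for the case described above:
# revert_on_top v2.6.32.9 d8e0902806c0bd2ccc4f6a267ff52565a3ec933b
```

The resulting tree then needs to be built, booted and put through the
same suspend-to-disk and resume cycles as before.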

However, as I noted above, some people have had this problem for a longer
time. So I am not sure if the commit above causes the bug or if it makes
the bug easier to trigger.

Finally, I would like to note that this regression is going to be important,
because, as you know, Intel's X11 driver will no longer support user-mode
mode-setting starting with version 2.10.0, which leaves KMS as the only option.

If there is any help I can provide in fixing this regression, please let me
know. I am willing to try patches.

Regards,

M. Vefa Bicakci
