Re: [PATCH] RAS: Add a tracepoint for reporting memory controller events

From: Mauro Carvalho Chehab
Date: Tue May 29 2012 - 10:02:25 EST


On 29-05-2012 08:58, Borislav Petkov wrote:
> On Thu, May 24, 2012 at 03:00:53PM -0300, Mauro Carvalho Chehab wrote:

<ironic comments skipped>

>> the address and address mask is needed, as most memory controllers can't point
>> to a single address, because the register that stores the address doesn't have
>> enough bits to store the full content of the instruction pointer register, or because
>> of some other internal device issues.
>>
>> So, two different "addresses" could actually point to the same group of transistors
>> inside a DIMM.
>>
>> Also, higher values of grains may affect the error statistics. For example, i3200_edac
>> driver has a grain that can be 64 MB, while other devices have a grain of 1.
>
> I think you mean
>
> #define I3200_TOM_SHIFT 26 /* 64MiB grain */

>
> which is the Top-Of-Memory shift value. How is that grain in the sense of error
> granularity I can't fathom.
>

It seems you were unable to read the comments at the function that fills dimm->grain:

	/*
	 * The dram rank boundary (DRB) reg values are boundary addresses
	 * for each DRAM rank with a granularity of 64MB. DRB regs are
	 * cumulative; the last one will contain the total memory
	 * contained in all ranks.
	 */
	for (i = 0; i < mci->nr_csrows; i++) {
		unsigned long nr_pages;
		struct csrow_info *csrow = &mci->csrows[i];

		nr_pages = drb_to_nr_pages(drbs, stacked,
			i / I3200_RANKS_PER_CHANNEL,
			i % I3200_RANKS_PER_CHANNEL);

		if (nr_pages == 0)
			continue;

		for (j = 0; j < nr_channels; j++) {
			struct dimm_info *dimm = csrow->channels[j].dimm;

			dimm->nr_pages = nr_pages / nr_channels;
			dimm->grain = nr_pages << PAGE_SHIFT;
			...


Assuming that errors follow a Gaussian distribution, the PDF parameters (mean, standard
deviation) when grain is equal to 1 are completely different from those when grain is 64 MB.

That means that any correlation function used by a stochastic process analysis
will need to take the grain into account, in order to detect whether a series of errors
is due to random noise or to a physical problem at the device.
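
To make the grain point concrete, here is a minimal user-space sketch (not
kernel code). The helpers grain_base() and same_grain() are hypothetical names
used only for illustration, and the sketch assumes the grain is a non-zero
power of two. It shows that two reported addresses can only be told apart if
they fall into different grain-sized blocks:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: first address of the grain-sized block that
 * contains addr. Assumes grain is a non-zero power of two.
 */
static inline uint64_t grain_base(uint64_t addr, uint64_t grain)
{
	assert(grain != 0 && (grain & (grain - 1)) == 0);
	return addr & ~(grain - 1);
}

/*
 * Hypothetical helper: true when the two addresses fall into the same
 * grain-sized block, i.e. the hardware report cannot distinguish them.
 */
static inline bool same_grain(uint64_t a, uint64_t b, uint64_t grain)
{
	return grain_base(a, grain) == grain_base(b, grain);
}
```

With a grain of 1, same_grain(0x1000, 0x1001, 1) is false, so the two errors
count as distinct locations. With a grain of 64 MB (64ULL << 20), the same
pair collapses into one block, and an analysis that ignores the grain would
treat them as repeated hits on a single location.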

> Oh, and by the way, this define is unused and can be removed.

Feel free to submit a patch for it.

Regards,
Mauro