Re: [GIT PULL] EDAC fixes for 3.8

From: Borislav Petkov
Date: Sat Mar 09 2013 - 10:47:03 EST


On Thu, Mar 07, 2013 at 11:02:13AM -0300, Mauro Carvalho Chehab wrote:
> Sure. See below:
>
> [ 19.062902] EDAC MC: Ver: 3.0.0
> [ 19.088757] EDAC DEBUG: edac_mc_sysfs_init: device mc created
> [ 19.284745] AMD64 EDAC driver v3.4.0
> [ 19.299082] EDAC amd64: DRAM ECC enabled.
> [ 19.315960] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 0, MCG_CTL: 0x3f, NB MSR is enabled

^^^^^^^
Whoops, where did core 1 go? Strange.

> [ 19.321115] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 2, MCG_CTL: 0x3f, NB MSR is enabled
> [ 19.321118] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 3, MCG_CTL: 0x3f, NB MSR is enabled
> [ 19.321120] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 4, MCG_CTL: 0x3f, NB MSR is enabled
> [ 19.321123] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 5, MCG_CTL: 0x3f, NB MSR is enabled
> [ 19.321125] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 6, MCG_CTL: 0x3f, NB MSR is enabled
> [ 19.321140] EDAC amd64: F10h detected (node 0).
> [ 19.327072] EDAC DEBUG: reserve_mc_sibling_devs: F1: 0000:00:18.1
> [ 19.327074] EDAC DEBUG: reserve_mc_sibling_devs: F2: 0000:00:18.2
> [ 19.327076] EDAC DEBUG: reserve_mc_sibling_devs: F3: 0000:00:18.3
> [ 19.327078] EDAC DEBUG: read_mc_regs: TOP_MEM: 0x00000000e0000000
> [ 19.327081] EDAC DEBUG: read_mc_regs: TOP_MEM2: 0x0000000420000000

Looks about right - 16G.
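
A rough cross-check of that figure, as a sketch only. It assumes TOP_MEM marks the end of DRAM below 4G and TOP_MEM2 the end of DRAM mapped above 4G; the helper name usable_dram_bytes is made up for illustration and is not part of the driver:

```python
GIB = 1 << 30

def usable_dram_bytes(top_mem: int, top_mem2: int) -> int:
    """DRAM below the 4G boundary plus DRAM mapped above it (assumed layout)."""
    four_g = 4 * GIB
    above_4g = max(top_mem2 - four_g, 0)  # nothing above 4G if TOP_MEM2 is unset
    return top_mem + above_4g

# Values taken from the read_mc_regs lines quoted above.
TOP_MEM = 0x00000000E0000000
TOP_MEM2 = 0x0000000420000000

total = usable_dram_bytes(TOP_MEM, TOP_MEM2)
hole = 4 * GIB - TOP_MEM  # gap between TOP_MEM and 4G

print(total // GIB, "GiB total, hole of", hex(hole))
```

The gap below 4G computed here can be compared against the offset reported in the F1xF0 (DRAM Hole Address) line further down in the log.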

> [ 19.327087] EDAC DEBUG: read_dram_ctl_register: F2x110 (DCTSelLow): 0x000005e4, High range addrs at: 0x0
> [ 19.327089] EDAC DEBUG: read_dram_ctl_register: DCTs operate in unganged mode
> [ 19.327091] EDAC DEBUG: read_dram_ctl_register: Address range split per DCT: no
> [ 19.327093] EDAC DEBUG: read_dram_ctl_register: data interleave for ECC: enabled, DRAM cleared since last warm reset: yes
> [ 19.327095] EDAC DEBUG: read_dram_ctl_register: channel interleave: enabled, interleave bits selector: 0x3
> [ 19.327099] EDAC DEBUG: read_mc_regs: DRAM range[0], base: 0x0000000000000000; limit: 0x000000021fffffff
> [ 19.327101] EDAC DEBUG: read_mc_regs: IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=0
> [ 19.327104] EDAC DEBUG: read_mc_regs: DRAM range[1], base: 0x0000000220000000; limit: 0x000000041fffffff
> [ 19.327107] EDAC DEBUG: read_mc_regs: IntlvEn=Disabled; Range access: RW IntlvSel=0 DstNode=1
> [ 19.327114] EDAC DEBUG: read_dct_base_mask: DCSB0[0]=0x00000000 reg: F2x40
> [ 19.327117] EDAC DEBUG: read_dct_base_mask: DCSB1[0]=0x00000000 reg: F2x140
> [ 19.327119] EDAC DEBUG: read_dct_base_mask: DCSB0[1]=0x00000000 reg: F2x44
> [ 19.327121] EDAC DEBUG: read_dct_base_mask: DCSB1[1]=0x00000000 reg: F2x144
> [ 19.327123] EDAC DEBUG: read_dct_base_mask: DCSB0[2]=0x00000001 reg: F2x48
> [ 19.327125] EDAC DEBUG: read_dct_base_mask: DCSB1[2]=0x00000001 reg: F2x148
> [ 19.327129] EDAC DEBUG: read_dct_base_mask: DCSB0[3]=0x00000101 reg: F2x4c
> [ 19.327131] EDAC DEBUG: read_dct_base_mask: DCSB1[3]=0x00000101 reg: F2x14c
> [ 19.327134] EDAC DEBUG: read_dct_base_mask: DCSB0[4]=0x00000000 reg: F2x50
> [ 19.327136] EDAC DEBUG: read_dct_base_mask: DCSB1[4]=0x00000000 reg: F2x150
> [ 19.327138] EDAC DEBUG: read_dct_base_mask: DCSB0[5]=0x00000000 reg: F2x54
> [ 19.327140] EDAC DEBUG: read_dct_base_mask: DCSB1[5]=0x00000000 reg: F2x154
> [ 19.327142] EDAC DEBUG: read_dct_base_mask: DCSB0[6]=0x00000201 reg: F2x58
> [ 19.327144] EDAC DEBUG: read_dct_base_mask: DCSB1[6]=0x00000201 reg: F2x158
> [ 19.327146] EDAC DEBUG: read_dct_base_mask: DCSB0[7]=0x00000301 reg: F2x5c
> [ 19.327148] EDAC DEBUG: read_dct_base_mask: DCSB1[7]=0x00000301 reg: F2x15c
> [ 19.327150] EDAC DEBUG: read_dct_base_mask: DCSM0[0]=0x00000000 reg: F2x60
> [ 19.327152] EDAC DEBUG: read_dct_base_mask: DCSM1[0]=0x00000000 reg: F2x160
> [ 19.327155] EDAC DEBUG: read_dct_base_mask: DCSM0[1]=0x00f83ce0 reg: F2x64
> [ 19.327157] EDAC DEBUG: read_dct_base_mask: DCSM1[1]=0x00f83ce0 reg: F2x164
> [ 19.327159] EDAC DEBUG: read_dct_base_mask: DCSM0[2]=0x00000000 reg: F2x68
> [ 19.327161] EDAC DEBUG: read_dct_base_mask: DCSM1[2]=0x00000000 reg: F2x168
> [ 19.327163] EDAC DEBUG: read_dct_base_mask: DCSM0[3]=0x00f83ce0 reg: F2x6c
> [ 19.327165] EDAC DEBUG: read_dct_base_mask: DCSM1[3]=0x00f83ce0 reg: F2x16c
> [ 19.327169] EDAC DEBUG: dump_misc_regs: F3xE8 (NB Cap): 0x0200df5f
> [ 19.327170] EDAC DEBUG: dump_misc_regs: NB two channel DRAM capable: yes
> [ 19.327172] EDAC DEBUG: dump_misc_regs: ECC capable: yes, ChipKill ECC capable: yes
> [ 19.327175] EDAC DEBUG: amd64_dump_dramcfg_low: F2x090 (DRAM Cfg Low): 0x00080100
> [ 19.327179] EDAC DEBUG: amd64_dump_dramcfg_low: DIMM type: buffered; all DIMMs support ECC: yes
> [ 19.327181] EDAC DEBUG: amd64_dump_dramcfg_low: PAR/ERR parity: enabled
> [ 19.327183] EDAC DEBUG: amd64_dump_dramcfg_low: DCT 128bit mode width: 64b
> [ 19.327185] EDAC DEBUG: amd64_dump_dramcfg_low: x4 logical DIMMs present: L0: no L1: no L2: no L3: no
> [ 19.327187] EDAC DEBUG: dump_misc_regs: F3xB0 (Online Spare): 0x00000000
> [ 19.327189] EDAC DEBUG: dump_misc_regs: F1xF0 (DRAM Hole Address): 0xe0002003, base: 0xe0000000, offset: 0x20000000
> [ 19.327190] EDAC DEBUG: dump_misc_regs: DramHoleValid: yes
> [ 19.327193] EDAC DEBUG: amd64_debug_display_dimm_sizes: F2x080 (DRAM Bank Address Mapping): 0x00005050
> [ 19.327195] EDAC MC: DCT0 chip selects:
> [ 19.327196] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 19.333141] EDAC amd64: MC: 2: 1024MB 3: 1024MB
> [ 19.339225] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 19.344247] EDAC amd64: MC: 6: 1024MB 7: 1024MB
> [ 19.348948] EDAC DEBUG: amd64_debug_display_dimm_sizes: F2x180 (DRAM Bank Address Mapping): 0x00005050
> [ 19.348949] EDAC MC: DCT1 chip selects:
> [ 19.348954] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 19.353656] EDAC amd64: MC: 2: 1024MB 3: 1024MB
> [ 19.358365] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 19.363086] EDAC amd64: MC: 6: 1024MB 7: 1024MB
> [ 19.367799] EDAC amd64: using x8 syndromes.
> [ 19.371996] EDAC DEBUG: amd64_dump_dramcfg_low: F2x190 (DRAM Cfg Low): 0x00080100
> [ 19.371998] EDAC DEBUG: amd64_dump_dramcfg_low: DIMM type: buffered; all DIMMs support ECC: yes
> [ 19.372003] EDAC DEBUG: amd64_dump_dramcfg_low: PAR/ERR parity: enabled
> [ 19.372005] EDAC DEBUG: amd64_dump_dramcfg_low: DCT 128bit mode width: 64b
> [ 19.372007] EDAC DEBUG: amd64_dump_dramcfg_low: x4 logical DIMMs present: L0: no L1: no L2: no L3: no
> [ 19.372009] EDAC DEBUG: f1x_early_channel_count: Data width is not 128 bits - need more decoding
> [ 19.372011] EDAC amd64: MCT channel count: 2
> [ 19.376292] EDAC DEBUG: edac_mc_alloc: allocating 1904 bytes for mci data (16 ranks, 16 csrows/channels)
> [ 19.376323] EDAC DEBUG: init_csrows: node 0, NBCFG=0x4af0005c[ChipKillEccCap: 1|DramEccEn: 1]
> [ 19.376325] EDAC DEBUG: init_csrows: MC node: 0, csrow: 2
> [ 19.376327] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 2, channel: 0, DBAM idx: 5
> [ 19.376329] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
> [ 19.376331] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 2, channel: 1, DBAM idx: 5
> [ 19.376333] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
> [ 19.376335] EDAC amd64: CS2: Registered DDR3 RAM
> [ 19.380967] EDAC DEBUG: init_csrows: Total csrow2 pages: 524288
> [ 19.380970] EDAC DEBUG: init_csrows: MC node: 0, csrow: 3
> [ 19.380971] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 3, channel: 0, DBAM idx: 5
> [ 19.380973] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
> [ 19.380975] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 3, channel: 1, DBAM idx: 5
> [ 19.380977] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
> [ 19.380978] EDAC amd64: CS3: Registered DDR3 RAM
> [ 19.385610] EDAC DEBUG: init_csrows: Total csrow3 pages: 524288
> [ 19.385612] EDAC DEBUG: init_csrows: MC node: 0, csrow: 6
> [ 19.385614] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 6, channel: 0, DBAM idx: 5
> [ 19.385615] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
> [ 19.385617] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 6, channel: 1, DBAM idx: 5
> [ 19.385619] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
> [ 19.385620] EDAC amd64: CS6: Registered DDR3 RAM
> [ 19.390240] EDAC DEBUG: init_csrows: Total csrow6 pages: 524288
> [ 19.390242] EDAC DEBUG: init_csrows: MC node: 0, csrow: 7
> [ 19.390244] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 7, channel: 0, DBAM idx: 5
> [ 19.390246] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
> [ 19.390248] EDAC DEBUG: amd64_csrow_nr_pages: csrow: 7, channel: 1, DBAM idx: 5
> [ 19.390250] EDAC DEBUG: amd64_csrow_nr_pages: nr_pages/channel: 262144
> [ 19.390254] EDAC amd64: CS7: Registered DDR3 RAM
> [ 19.394875] EDAC DEBUG: init_csrows: Total csrow7 pages: 524288

[ … ]

> [ 19.395385] EDAC MC0: Giving out device to 'amd64_edac' 'F10h': DEV 0000:00:18.2
> [ 19.402852] EDAC amd64: DRAM ECC enabled.
> [ 19.406879] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 1, MCG_CTL: 0x3f, NB MSR is enabled

Here's core 1, WTF? On the second node? Great.

> [ 19.406882] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 7, MCG_CTL: 0x3f, NB MSR is enabled
> [ 19.406884] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 8, MCG_CTL: 0x3f, NB MSR is enabled
> [ 19.406887] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 9, MCG_CTL: 0x3f, NB MSR is enabled
> [ 19.406889] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 10, MCG_CTL: 0x3f, NB MSR is enabled
> [ 19.406891] EDAC DEBUG: amd64_nb_mce_bank_enabled_on_node: core: 11, MCG_CTL: 0x3f, NB MSR is enabled

[ … ]

On Thu, Mar 07, 2013 at 09:57:03AM -0300, Mauro Carvalho Chehab wrote:
> This is what the csrows nodes show:
>
> /sys/devices/system/edac/mc/mc0/csrow2/size_mb:2048
> /sys/devices/system/edac/mc/mc0/csrow3/size_mb:2048
> /sys/devices/system/edac/mc/mc0/csrow6/size_mb:2048
> /sys/devices/system/edac/mc/mc0/csrow7/size_mb:2048
> /sys/devices/system/edac/mc/mc1/csrow2/size_mb:2048
> /sys/devices/system/edac/mc/mc1/csrow3/size_mb:2048
> /sys/devices/system/edac/mc/mc1/csrow6/size_mb:2048
> /sys/devices/system/edac/mc/mc1/csrow7/size_mb:2048

This is correct.

Each chip select has 1024M per DCT, but since we have 2 DCTs per node,
that's 1024M * 2 = 2G per chip select of an MC.
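
The same arithmetic spelled out as a small sketch. The per-DCT sizes are copied from the "DCT0/DCT1 chip selects" log lines; the variable names are only illustrative, and the node count of 2 is assumed from the mc0/mc1 listing:

```python
DCTS_PER_NODE = 2
NODES = 2  # mc0 and mc1 in the sysfs listing

# Populated chip selects and their size in MB on a single DCT.
populated_cs_mb_per_dct = {2: 1024, 3: 1024, 6: 1024, 7: 1024}

# A csrow of the MC spans both DCTs, so its size is the per-DCT size doubled.
csrow_size_mb = {cs: mb * DCTS_PER_NODE
                 for cs, mb in populated_cs_mb_per_dct.items()}

node_total_mb = sum(csrow_size_mb.values())
system_total_mb = node_total_mb * NODES

print(csrow_size_mb)
print(system_total_mb // 1024, "GiB")
```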

> Total size is 16Gb, but the number of ranks are wrong.

Well, chip select != rank, remember?

> This is what's reported by the new API:
>
> /sys/devices/system/edac/mc/mc0/rank12/size:2048
> /sys/devices/system/edac/mc/mc0/rank13/size:2048
> /sys/devices/system/edac/mc/mc0/rank14/size:2048
> /sys/devices/system/edac/mc/mc0/rank15/size:2048
> /sys/devices/system/edac/mc/mc0/rank4/size:2048
> /sys/devices/system/edac/mc/mc0/rank5/size:2048
> /sys/devices/system/edac/mc/mc0/rank6/size:2048
> /sys/devices/system/edac/mc/mc0/rank7/size:2048
> /sys/devices/system/edac/mc/mc1/rank12/size:2048
> /sys/devices/system/edac/mc/mc1/rank13/size:2048
> /sys/devices/system/edac/mc/mc1/rank14/size:2048
> /sys/devices/system/edac/mc/mc1/rank15/size:2048
> /sys/devices/system/edac/mc/mc1/rank4/size:2048
> /sys/devices/system/edac/mc/mc1/rank5/size:2048
> /sys/devices/system/edac/mc/mc1/rank6/size:2048
> /sys/devices/system/edac/mc/mc1/rank7/size:2048
>
> Here, the number of ranks are ok, but the size is wrong.
>
> This is what the edac debug logs say:
>
> [ 18.829184] EDAC amd64: F10h detected (node 0).
> [ 18.829206] EDAC MC: DCT0 chip selects:
> [ 18.829207] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 18.829219] EDAC amd64: MC: 2: 1024MB 3: 1024MB
> [ 18.829220] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 18.829221] EDAC amd64: MC: 6: 1024MB 7: 1024MB
> [ 18.829222] EDAC MC: DCT1 chip selects:
> [ 18.829223] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 18.829223] EDAC amd64: MC: 2: 1024MB 3: 1024MB
> [ 18.829224] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 18.829225] EDAC amd64: MC: 6: 1024MB 7: 1024MB
>
> [ 18.923914] EDAC amd64: F10h detected (node 1).
> [ 18.956025] EDAC MC: DCT0 chip selects:
> [ 18.956028] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 18.962055] EDAC amd64: MC: 2: 1024MB 3: 1024MB
> [ 18.968167] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 18.974252] EDAC amd64: MC: 6: 1024MB 7: 1024MB
> [ 18.980333] EDAC MC: DCT1 chip selects:
> [ 18.980335] EDAC amd64: MC: 0: 0MB 1: 0MB
> [ 18.986415] EDAC amd64: MC: 2: 1024MB 3: 1024MB
> [ 18.991454] EDAC amd64: MC: 4: 0MB 5: 0MB
> [ 18.996155] EDAC amd64: MC: 6: 1024MB 7: 1024MB
> [ 19.000854] EDAC amd64: using x8 syndromes.
>
> Here, everything is fine.

So, actually, to satisfy the new API, you'll probably need to pass down
the information above, i.e. the chip selects *per* DCT, which also
equal the ranks.
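
Roughly what I mean, as a sketch and not driver code. The rank index formula (chip select * number of channels + DCT) is only inferred from the rank numbers in the sysfs listing above, and rank_sizes_mb is a made-up helper name:

```python
NR_CHANNELS = 2  # DCT0 and DCT1

# Per-DCT chip select sizes in MB, as printed in the debug log for one node.
CS_MB_PER_DCT = {
    0: {0: 0, 1: 0, 2: 1024, 3: 1024, 4: 0, 5: 0, 6: 1024, 7: 1024},
    1: {0: 0, 1: 0, 2: 1024, 3: 1024, 4: 0, 5: 0, 6: 1024, 7: 1024},
}

def rank_sizes_mb(cs_mb_per_dct, nr_channels=NR_CHANNELS):
    """Map each populated (chip select, DCT) pair to one rank with its per-DCT size."""
    ranks = {}
    for dct, selects in cs_mb_per_dct.items():
        for cs, size_mb in selects.items():
            if size_mb:  # skip unpopulated chip selects
                ranks[cs * nr_channels + dct] = size_mb
    return ranks

ranks = rank_sizes_mb(CS_MB_PER_DCT)
print(sorted(ranks))               # rank indices for this node
print(sum(ranks.values()), "MB")   # node total when each rank carries the per-DCT size
```

With the per-DCT size attached to each rank, the rank count stays the same as in the new API listing while the node total matches what the csrow nodes report.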

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.