Re: [RFC PATCH] memory driver: make phys_index/end_phys_index reflect the start/end section number

From: Nathan Fontenot
Date: Fri Apr 11 2014 - 14:55:15 EST


On 04/09/2014 11:17 PM, Li Zhong wrote:
> On Wed, 2014-04-09 at 12:39 -0500, Nathan Fontenot wrote:
>> On 04/08/2014 02:47 PM, Dave Hansen wrote:
>>>
>>> That document really needs to be updated to stop referring to sections
>>> (at least in the descriptions of the user interface). We can not change
>>> the units of phys_index/end_phys_index without also changing
>>> block_size_bytes.
>>>
>>
>> Here is a first pass at updating the documentation.
>>
>> I have tried to update the documentation to refer to memory blocks instead
>> of memory sections where appropriate and added a paragraph to explain
>> that memory blocks are made of memory sections.
>>
>> Thoughts?
>
> If we all agree to hide the information about sections, then I think we
> also need to update the section id's used for phys_index/end_phys_index,
> something like the following on top of yours?
>
> --
> diff --git a/Documentation/memory-hotplug.txt b/Documentation/memory-hotplug.txt
> index 92d15e2..9fbb025 100644
> --- a/Documentation/memory-hotplug.txt
> +++ b/Documentation/memory-hotplug.txt
> @@ -138,10 +138,7 @@ is described under /sys/devices/system/memory as
> /sys/devices/system/memory/memoryXXX
> (XXX is the memory block id.)
>
> -Now, XXX is defined as (start_address_of_section / section_size) of the first
> -section contained in the memory block. The files 'phys_index' and
> -'end_phys_index' under each directory report the beginning and end section id's
> -for the memory block covered by the sysfs directory. It is expected that all
> +For the memory block covered by the sysfs directory, it is expected that all
> memory sections in this range are present and no memory holes exist in the
> range. Currently there is no way to determine if there is a memory hole, but
> the existence of one should not affect the hotplug capabilities of the memory
> @@ -155,16 +152,14 @@ This device covers address range [0x100000000 ... 0x140000000)
> Under each memory block, you can see 4 or 5 files, the end_phys_index file
> being a recent addition and not present on older kernels.
>
> -/sys/devices/system/memory/memoryXXX/start_phys_index
> +/sys/devices/system/memory/memoryXXX/phys_index
> /sys/devices/system/memory/memoryXXX/end_phys_index
> /sys/devices/system/memory/memoryXXX/phys_device
> /sys/devices/system/memory/memoryXXX/state
> /sys/devices/system/memory/memoryXXX/removable
>
> -'phys_index' : read-only and contains section id of the first section
> - in the memory block, same as XXX.
> -'end_phys_index' : read-only and contains section id of the last section
> - in the memory block.
> +'phys_index' : read-only and contains memory block id, same as XXX.
> +'end_phys_index' : read-only and contains memory block id, same as XXX.
> 'state' : read-write
> at read: contains online/offline state of memory.
> at write: user can specify "online_kernel",
> --
>
> Not sure whether it is proper to remove end_phys_index, too?

If we are going to leave the code as it is today, such that the phys_index
and end_phys_index files both contain the same value, I don't see why we should
not remove end_phys_index.

Li Zhong, unless anyone has objections, can you submit a patch to update the
files in sysfs and the documentation?
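
For anyone following along, here is a rough sketch of the arithmetic the
documentation describes once everything is expressed in memory blocks. The
helper names are made up for illustration, and the sketch assumes that
block_size_bytes is reported in hexadecimal without a 0x prefix. It uses the
1GiB example from the document.

```python
def parse_sysfs_hex(text):
    """Parse a sysfs value, assumed here to be hex without a 0x prefix."""
    return int(text.strip(), 16)


def block_id(phys_addr, block_size):
    """Memory block id (the XXX in memoryXXX) containing phys_addr."""
    return phys_addr // block_size


def block_range(block, block_size):
    """Half-open physical address range [start, end) covered by a block."""
    start = block * block_size
    return start, start + block_size


# Hypothetical contents of /sys/devices/system/memory/block_size_bytes
# for a 1GiB memory block size.
block_size = parse_sysfs_hex("40000000\n")

print(block_id(0x100000000, block_size))             # memory block id
print([hex(a) for a in block_range(4, block_size)])  # [start, end)
```

With phys_index and end_phys_index both reporting the block id, a tool only
needs the block id and block_size_bytes to recover the address range.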

-Nathan

>
> Thanks,
> Zhong
>
>
>
>
>>
>> -Nathan
>> ---
>> Documentation/memory-hotplug.txt | 113 ++++++++++++++++++++-------------------
>> 1 file changed, 59 insertions(+), 54 deletions(-)
>>
>> Index: linux/Documentation/memory-hotplug.txt
>> ===================================================================
>> --- linux.orig/Documentation/memory-hotplug.txt
>> +++ linux/Documentation/memory-hotplug.txt
>> @@ -88,16 +88,21 @@ phase by hand.
>>
>> 1.3. Unit of Memory online/offline operation
>> ------------
>> -Memory hotplug uses SPARSEMEM memory model. SPARSEMEM divides the whole memory
>> -into chunks of the same size. The chunk is called a "section". The size of
>> -a section is architecture dependent. For example, power uses 16MiB, ia64 uses
>> -1GiB. The unit of online/offline operation is "one section". (see Section 3.)
>> +Memory hotplug uses SPARSEMEM memory model which allows memory to be divided
>> +into chunks of the same size. These chunks are called "sections". The size of
>> +a memory section is architecture dependent. For example, power uses 16MiB, ia64
>> +uses 1GiB.
>> +
>> +Memory sections are combined into chunks referred to as "memory blocks". The
>> +size of a memory block is architecture dependent and represents the logical
>> +unit upon which memory online/offline operations are to be performed. The
>> +default size of a memory block is the same as memory section size unless an
>> +architecture specifies otherwise. (see Section 3.)
>>
>> -To determine the size of sections, please read this file:
>> +To determine the size (in bytes) of a memory block please read this file:
>>
>> /sys/devices/system/memory/block_size_bytes
>>
>> -This file shows the size of sections in byte.
>>
>> -----------------------
>> 2. Kernel Configuration
>> @@ -123,14 +128,15 @@ config options.
>> (CONFIG_ACPI_CONTAINER).
>> This option can be kernel module too.
>>
>> +
>> --------------------------------
>> -4 sysfs files for memory hotplug
>> +3. sysfs files for memory hotplug
>> --------------------------------
>> -All sections have their device information in sysfs. Each section is part of
>> -a memory block under /sys/devices/system/memory as
>> +All memory blocks have their device information in sysfs. Each memory block
>> +is described under /sys/devices/system/memory as
>>
>> /sys/devices/system/memory/memoryXXX
>> -(XXX is the section id.)
>> +(XXX is the memory block id.)
>>
>> Now, XXX is defined as (start_address_of_section / section_size) of the first
>> section contained in the memory block. The files 'phys_index' and
>> @@ -141,13 +147,13 @@ range. Currently there is no way to dete
>> the existence of one should not affect the hotplug capabilities of the memory
>> block.
>>
>> -For example, assume 1GiB section size. A device for a memory starting at
>> +For example, assume 1GiB memory block size. A device for a memory starting at
>> 0x100000000 is /sys/device/system/memory/memory4
>> (0x100000000 / 1Gib = 4)
>> This device covers address range [0x100000000 ... 0x140000000)
>>
>> -Under each section, you can see 4 or 5 files, the end_phys_index file being
>> -a recent addition and not present on older kernels.
>> +Under each memory block, you can see 4 or 5 files, the end_phys_index file
>> +being a recent addition and not present on older kernels.
>>
>> /sys/devices/system/memory/memoryXXX/start_phys_index
>> /sys/devices/system/memory/memoryXXX/end_phys_index
>> @@ -185,6 +191,7 @@ For example:
>> A backlink will also be created:
>> /sys/devices/system/memory/memory9/node0 -> ../../node/node0
>>
>> +
>> --------------------------------
>> 4. Physical memory hot-add phase
>> --------------------------------
>> @@ -227,11 +234,10 @@ You can tell the physical address of new
>>
>> % echo start_address_of_new_memory > /sys/devices/system/memory/probe
>>
>> -Then, [start_address_of_new_memory, start_address_of_new_memory + section_size)
>> -memory range is hot-added. In this case, hotplug script is not called (in
>> -current implementation). You'll have to online memory by yourself.
>> -Please see "How to online memory" in this text.
>> -
>> +Then, [start_address_of_new_memory, start_address_of_new_memory +
>> +memory_block_size) memory range is hot-added. In this case, hotplug script is
>> +not called (in current implementation). You'll have to online memory by
>> +yourself. Please see "How to online memory" in this text.
>>
>>
>> ------------------------------
>> @@ -240,36 +246,36 @@ Please see "How to online memory" in thi
>>
>> 5.1. State of memory
>> ------------
>> -To see (online/offline) state of memory section, read 'state' file.
>> +To see (online/offline) state of a memory block, read 'state' file.
>>
>> % cat /sys/device/system/memory/memoryXXX/state
>>
>>
>> -If the memory section is online, you'll read "online".
>> -If the memory section is offline, you'll read "offline".
>> +If the memory block is online, you'll read "online".
>> +If the memory block is offline, you'll read "offline".
>>
>>
>> 5.2. How to online memory
>> ------------
>> Even if the memory is hot-added, it is not at ready-to-use state.
>> -For using newly added memory, you have to "online" the memory section.
>> +For using newly added memory, you have to "online" the memory block.
>>
>> -For onlining, you have to write "online" to the section's state file as:
>> +For onlining, you have to write "online" to the memory block's state file as:
>>
>> % echo online > /sys/devices/system/memory/memoryXXX/state
>>
>> -This onlining will not change the ZONE type of the target memory section,
>> -If the memory section is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
>> +This onlining will not change the ZONE type of the target memory block.
>> +If the memory block is in ZONE_NORMAL, you can change it to ZONE_MOVABLE:
>>
>> % echo online_movable > /sys/devices/system/memory/memoryXXX/state
>> -(NOTE: current limit: this memory section must be adjacent to ZONE_MOVABLE)
>> +(NOTE: current limit: this memory block must be adjacent to ZONE_MOVABLE)
>>
>> -And if the memory section is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
>> +And if the memory block is in ZONE_MOVABLE, you can change it to ZONE_NORMAL:
>>
>> % echo online_kernel > /sys/devices/system/memory/memoryXXX/state
>> -(NOTE: current limit: this memory section must be adjacent to ZONE_NORMAL)
>> +(NOTE: current limit: this memory block must be adjacent to ZONE_NORMAL)
>>
>> -After this, section memoryXXX's state will be 'online' and the amount of
>> +After this, memory block XXX's state will be 'online' and the amount of
>> available memory will be increased.
>>
>> Currently, newly added memory is added as ZONE_NORMAL (for powerpc, ZONE_DMA).
>> @@ -284,22 +290,22 @@ This may be changed in future.
>> 6.1 Memory offline and ZONE_MOVABLE
>> ------------
>> Memory offlining is more complicated than memory online. Because memory offline
>> -has to make the whole memory section be unused, memory offline can fail if
>> -the section includes memory which cannot be freed.
>> +has to make the whole memory block be unused, memory offline can fail if
>> +the memory block includes memory which cannot be freed.
>>
>> In general, memory offline can use 2 techniques.
>>
>> -(1) reclaim and free all memory in the section.
>> -(2) migrate all pages in the section.
>> +(1) reclaim and free all memory in the memory block.
>> +(2) migrate all pages in the memory block.
>>
>> In the current implementation, Linux's memory offline uses method (2), freeing
>> -all pages in the section by page migration. But not all pages are
>> +all pages in the memory block by page migration. But not all pages are
>> migratable. Under current Linux, migratable pages are anonymous pages and
>> -page caches. For offlining a section by migration, the kernel has to guarantee
>> -that the section contains only migratable pages.
>> +page caches. For offlining a memory block by migration, the kernel has to
>> +guarantee that the memory block contains only migratable pages.
>>
>> -Now, a boot option for making a section which consists of migratable pages is
>> -supported. By specifying "kernelcore=" or "movablecore=" boot option, you can
>> +Now, a boot option for making a memory block which consists of migratable pages
>> +is supported. By specifying "kernelcore=" or "movablecore=" boot option, you can
>> create ZONE_MOVABLE...a zone which is just used for movable pages.
>> (See also Documentation/kernel-parameters.txt)
>>
>> @@ -315,28 +321,27 @@ creates ZONE_MOVABLE as following.
>> Size of memory for movable pages (for offline) is ZZZZ.
>>
>>
>> -Note) Unfortunately, there is no information to show which section belongs
>> +Note: Unfortunately, there is no information to show which memory block belongs
>> to ZONE_MOVABLE. This is TBD.
>>
>>
>> 6.2. How to offline memory
>> ------------
>> -You can offline a section by using the same sysfs interface that was used in
>> -memory onlining.
>> +You can offline a memory block by using the same sysfs interface that was used
>> +in memory onlining.
>>
>> % echo offline > /sys/devices/system/memory/memoryXXX/state
>>
>> -If offline succeeds, the state of the memory section is changed to be "offline".
>> +If offline succeeds, the state of the memory block is changed to be "offline".
>> If it fails, some error core (like -EBUSY) will be returned by the kernel.
>> -Even if a section does not belong to ZONE_MOVABLE, you can try to offline it.
>> -If it doesn't contain 'unmovable' memory, you'll get success.
>> +Even if a memory block does not belong to ZONE_MOVABLE, you can try to offline
>> +it. If it doesn't contain 'unmovable' memory, you'll get success.
>>
>> -A section under ZONE_MOVABLE is considered to be able to be offlined easily.
>> -But under some busy state, it may return -EBUSY. Even if a memory section
>> -cannot be offlined due to -EBUSY, you can retry offlining it and may be able to
>> -offline it (or not).
>> -(For example, a page is referred to by some kernel internal call and released
>> - soon.)
>> +A memory block under ZONE_MOVABLE is considered to be able to be offlined
>> +easily. But under some busy state, it may return -EBUSY. Even if a memory
>> +block cannot be offlined due to -EBUSY, you can retry offlining it and may be
>> +able to offline it (or not). (For example, a page is referred to by some kernel
>> +internal call and released soon.)
>>
>> Consideration:
>> Memory hotplug's design direction is to make the possibility of memory offlining
>> @@ -373,11 +378,11 @@ MEMORY_GOING_OFFLINE
>> Generated to begin the process of offlining memory. Allocations are no
>> longer possible from the memory but some of the memory to be offlined
>> is still in use. The callback can be used to free memory known to a
>> - subsystem from the indicated memory section.
>> + subsystem from the indicated memory block.
>>
>> MEMORY_CANCEL_OFFLINE
>> Generated if MEMORY_GOING_OFFLINE fails. Memory is available again from
>> - the section that we attempted to offline.
>> + the memory block that we attempted to offline.
>>
>> MEMORY_OFFLINE
>> Generated after offlining memory is complete.
>> @@ -413,8 +418,8 @@ node if necessary.
>> --------------
>> - allowing memory hot-add to ZONE_MOVABLE. maybe we need some switch like
>> sysctl or new control file.
>> - - showing memory section and physical device relationship.
>> - - showing memory section is under ZONE_MOVABLE or not
>> + - showing memory block and physical device relationship.
>> + - showing memory block is under ZONE_MOVABLE or not
>> - test and make it better memory offlining.
>> - support HugeTLB page migration and offlining.
>> - memmap removing at memory offline.
>
>
