Re: [PATCH v11 15/15] HMM: add documentation explaining HMM internals and how to use it.

From: Randy Dunlap
Date: Wed Oct 21 2015 - 23:24:07 EST


Hi,

Some corrections and a few questions...

On 10/21/15 14:00, Jérôme Glisse wrote:
> This add documentation on how HMM works and a more in depth view of how it
> should be use by device driver writers.
>
> Signed-off-by: Jérôme Glisse <jglisse@xxxxxxxxxx>
> ---
> Documentation/vm/hmm.txt | 219 +++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 219 insertions(+)
> create mode 100644 Documentation/vm/hmm.txt
>
> diff --git a/Documentation/vm/hmm.txt b/Documentation/vm/hmm.txt
> new file mode 100644
> index 0000000..febed50
> --- /dev/null
> +++ b/Documentation/vm/hmm.txt
> @@ -0,0 +1,219 @@
> +Heterogeneous Memory Management (HMM)
> +-------------------------------------
> +
> +The raison d'être of HMM is to provide a common API for device driver that

drivers

> +wants to mirror a process address space on there device and/or migrate system

want their

> +memory to device memory. Device driver can decide to only use one aspect of

drivers

> +HMM (mirroring or memory migration), for instance some device can directly
> +access process address space through hardware (for instance PCIe ATS/PASID),
> +but still want to benefit from memory migration capabilities that HMM offer.
> +
> +While HMM rely on existing kernel infrastructure (namely mmu_notifier) some

relies

> +of its features (memory migration, atomic access) require integration with
> +core mm kernel code. Having HMM as the common intermediary is more appealing

MM

> +than having each device driver hooking itself inside the common mm code.

MM

> +
> +Moreover HMM as a layer allows integration with DMA API or page reclaimation.

reclamation.

> +
> +
> +Mirroring address space on the device:
> +--------------------------------------
> +
> +Device that can't directly access transparently the process address space, need
> +to mirror the CPU page table into there own page table. HMM helps to keep the

their

> +device page table synchronize with the CPU page table. It is not expected that

synchronized

> +the device will fully mirror the CPU page table but only mirror region that are

regions

> +actively accessed by the device. For that reasons HMM only helps populating and

reason

> +synchronizing device page table for range that the device driver explicitly ask

ranges asks

or is only one range supported?


> +for.
> +
> +Mirroring address space inside the device page table is easy with HMM :

HMM:

> +
> + /* Create a mirror for the current process for your device. */
> + your_hmm_mirror->hmm_mirror.device = your_hmm_device;
> + hmm_mirror_register(&your_hmm_mirror->hmm_mirror);
> +
> + ...
> +
> + /* Mirror memory (in read mode) between addressA and addressB */
> + your_hmm_event->hmm_event.start = addressA;
> + your_hmm_event->hmm_event.end = addressB;

Multiple events (ranges) can be specified?
Is hmm_event.end (addressB) included or excluded from the range?

> + your_hmm_event->hmm_event.etype = HMM_DEVICE_RFAULT;
> + hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> + /* HMM callback into your driver with the >update() callback. During the
> + * callback use the HMM page table to populate the device page table. You
> + * can only use the HMM page table to populate the device page table for
> + * the specified range during the >update() callback, at any other point in
> + * time the HMM page table content should be assume to be undefined.

assumed

> + */
> + your_hmm_device->update(mirror, event);
> +
> + ...
> +
> + /* Process is quiting or device done stop the mirroring and cleanup. */

quitting or device done; stop

> + hmm_mirror_unregister(&your_hmm_mirror->hmm_mirror);
> + /* Device driver can free your_hmm_mirror */
> +
> +
> +HMM mirror page table:
> +----------------------
> +
> +Each hmm_mirror object is associated with a mirror page table that HMM keeps
> +synchronize with the CPU page table by using the mmu_notifier API. HMM is using

synchronized

> +its own generic page table format because it needs to store DMA address, which

addresses,

> +are bigger than long on some architecture, and have more flags per entry than

architectures,

> +radix tree allows.
> +
> +The HMM page table mostly mirror x86 page table layout. A page holds a global

mirrors

> +directory and each entry points to a lower level directory. Unlike regular CPU
> +page table, directory level are more aggressively freed and remove from the HMM

tables, levels removed

> +mirror page table. This means device driver needs to use the HMM helpers and to

drivers need

> +follow directive on when and how to access the mirror page table. HMM use the

uses

> +per page spinlock of directory page to synchronize update of directory ie update

pages directory, i.e.,

> +can happen on different directory concurently.

concurrently.

> +
> +As a rules the mirror page table can only be accessed by device driver from one

rule by a device driver

> +of the HMM device callback. Any access from outside a callback is illegal and

callbacks.

> +gives undertimed result.

undetermined
or undefined

> +
> +Accessing the mirror page table from a device callback needs to use the HMM
> +page table helpers. A loop to access entry for a range of address looks like :

entries addresses looks like:

> +
> + /* Initialize a HMM page table iterator. */

an HMM

> + struct hmm_pt_iter iter;
> + hmm_pt_iter_init(&iter, &mirror->pt)
> +
> + /* Get pointer to HMM page table entry for a given address. */
> + dma_addr_t *hmm_pte;
> + hmm_pte = hmm_pt_iter_walk(&iter, &addr, &next);

what are 'addr' and 'next'? (types)

> +
> +If there is no valid entry directory for given range address then hmm_pte is
> +NULL. If there is a valid entry directory then you can access the hmm_pte and
> +the pointer will stay valid as long as you do not call hmm_pt_iter_walk() with
> +the same iter struct for a different address or call hmm_pt_iter_fini().
> +
> +While the HMM page table entry pointer stays valid you can only modify the
> +value it is pointing to by using one of HMM helpers (hmm_pte_*()) as other
> +threads might be updating the same entry concurrently. The device driver only
> +need to update an HMM page table entry to set the dirty bit, so driver should

needs drivers

> +only be using hmm_pte_set_dirty().
> +
> +Similarly to extract information the device driver should use one of the helper

helpers

> +like hmm_pte_dma_addr() or hmm_pte_pfn() (if HMM is not doing DMA mapping which
> +is a device driver at initialization parameter).
> +
> +
> +Migrating system memory to device memory:
> +-----------------------------------------
> +
> +Device like discret GPU often have there own local memory which offer bigger

Devices discrete GPUs their

> +bandwidth and smaller latency than access to system memory for the GPU. This
> +local memory is not necessarily accessible by the CPU. Device local memory will
> +remain revealent for the foreseeable future as bandwidth of GPU memory keep

relevant keeps

> +increasing faster than bandwidth of system memory and as latency of PCIe does
> +not decrease.
> +
> +Thus to maximize use of device like GPU, program need to use the device memory.

devices like GPUs, programs

> +Userspace API wants to make this as transparent as it can be, so that there is
> +no need for complex modification of applications.
> +
> +Transparent use of device memory for range of address of a process require core

requires

> +mm code modifications. Adding a new memory zone for devices memory did not make

MM device

> +sense given that such memory is often only accessible by the device only. This
> +is why we decided to use a special kind of swap, migrated memory is mark as a

swap; marked

> +special swap entry inside the CPU page table.
> +
> +While HMM handles the migration process, it does not decide what range or when
> +to migrate memory. The decision to perform such migration is under the control
> +of the device driver. Migration back to system memory happens either because
> +the CPU try to access the memory or because device driver decided to migrate

tries

> +the memory back.
> +
> +
> + /* Migrate system memory between addressA and addressB to device memory. */
> + your_hmm_event->hmm_event.start = addressA;
> + your_hmm_event->hmm_event.end = addressB;

is hmm_event.end (addressB) inclusive or exclusive?
i.e., is it end_of_copy + 1?
i.e., is the size of the copy addressB - addressA or
addressB - addressA + 1?
i.e., is addressB = addressA + size
or is addressB = addressA + size - 1

In my experience it is usually better to have a start_address and size
instead of start_address and end_address.
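
To make the question concrete, here is a minimal sketch of the two
conventions. The struct and helper names below are made up for illustration
and are not part of the HMM patch; which convention hmm_event actually uses
is exactly what the documentation should state.

```c
#include <stddef.h>

/* Hypothetical types for illustration only; not from the HMM patch. */
struct range_excl {
	unsigned long start;
	unsigned long end;	/* one past the last byte: [start, end) */
};

struct range_incl {
	unsigned long start;
	unsigned long last;	/* the last byte itself: [start, last] */
};

/* Size of a half-open range: end - start, no adjustment needed. */
static unsigned long range_excl_size(const struct range_excl *r)
{
	return r->end - r->start;
}

/* Size of an inclusive range: needs the + 1. */
static unsigned long range_incl_size(const struct range_incl *r)
{
	return r->last - r->start + 1;
}
```

With addressA = 0x1000 and a size of 0x2000, the exclusive form would have
addressB = 0x3000 and the inclusive form would have addressB = 0x2fff. A
start_address plus size interface avoids the ambiguity altogether.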

> + your_hmm_event->hmm_event.etype = HMM_COPY_TO_DEVICE;
> + hmm_mirror_fault(&your_hmm_mirror->hmm_mirror, &your_hmm_event->hmm_event);
> + /* HMM callback into your driver with the >copy_to_device() callback.
> + * Device driver must allocate device memory, DMA system memory to device
> + * memory, update the device page table to point to device memory and
> + * return. See hmm.h for details instructions and how failure are handled.

detailed failures

> + */
> + your_hmm_device->copy_to_device(mirror, event, dst, addressA, addressB);
> +
> +
> +Right now HMM only support migrating anonymous private memory. Migration of

supports

> +share memory and more generaly file mapped memory is on the road map.

shared generally

> +
> +
> +Locking consideration and overall design:
> +-----------------------------------------
> +
> +As a rule HMM will handle proper locking on the behalf of the device driver,
> +as such device driver does not need to take any mm lock before calling into

MM

> +the HMM code.
> +
> +HMM is also responsible of the hmm_device and hmm_mirror object lifetime. The

for

> +device driver can only free those after calling hmm_device_unregister() or
> +hmm_mirror_unregister() respectively.
> +
> +All the lock inside any of the HMM structure should never be use by the device

locks structures

> +driver. They are intended to be use only and only by HMM code. Below is short

used only by the HMM code.

> +description of the 3 main locks that exist for HMM internal use. Educational
> +purpose only.
> +
> +Each process mm has one and only one struct hmm associated with it. Each hmm

MM

> +struct can be use by several different mirror. There is one and only one mirror

mirrors.

> +per mm and device pair. So in essence the hmm struct is the core that dispatch

MM dispatches

> +everything to every single mirror, each of them corresponding to a specific
> +device. The list of mirror for an hmm struct is protected by a semaphore as it

mirrors

> +sees mostly read access.
> +
> +Each time a device fault a range of address it calls hmm_mirror_fault(), HMM

faults

> +keeps track, inside the hmm struct, of each range currently being faulted. It
> +does that so it can synchronize with any CPU page table update. If there is a
> +CPU page table update then a callback through mmu_notifier will happen and HMM
> +will try to interrupt the device page fault that conflict (ie address range

conflicts (i.e.,

> +overlap with the range being updated) and wait for them to back off. This
> +insure that at no point in time the device driver see transient page table

ensures sees

> +information. The list of active fault is protected by a spinlock, query on

faults spinlock;

> +that list should be short and quick (we haven't gather enough statistic on

gathered statistics

> +that side yet to have a good idea of the average access pattern).
> +
> +Each device driver wanting to use HMM must register one and only one hmm_device
> +struct per physical device with HMM. The hmm_device struct have pointer to the

has

> +device driver call back and keeps track of active mirrors for a given device.

callback

> +The active mirrors list is protected by a spinlock.
> +
> +
> +Future work:
> +------------
> +
> +Improved atomic access by the device to system memory. Some platform bus (PCIe)

buses

> +offer limited number of atomic memory operations, some platform do not even

operations; platforms

> +have any kind of atomic memory operations by a device. In order to allow such
> +atomic operation we want to map page read only the CPU while the device perform

operations pages read-only in the CPU performs

> +its operation. For this we need a new case inside the CPU write fault code path
> +to synchronize with the device.
> +
> +We want to allow program to lock a range of memory inside device memory and

allow a program

> +forbid CPU access while the memory is lock inside the device. Any CPU access

locked

> +to locked range would result in SIGBUS. We think that madvise() would be the
> +right syscall into which we could plug that feature.
> +
> +In order to minimize kernel memory consumption and overhead of DMA mapping, we
> +want to introduce new DMA API that allows to manage mapping on IOMMU directory
> +page basis. This would allow to map/unmap/update DMA mapping in bulk and
> +minimize IOMMU update and flushing overhead. Moreover this would allow to
> +improve IOMMU bad access reporting for DMA address inside those directory.
> +
> +Because update to the device page table might require "heavy" synchronization
> +with the device, the mmu_notifier callback might have to sleep while HMM is
> +waiting for the device driver to report device page table update completion.
> +This is especialy bad if this happens during page reclaimation, this might

especially reclamation;

> +bring the system to pause. We want to mitigate this, either by maintaining a
> +new intermediate lru level in which we put pages actively mirrored by a device

LRU

> +or by some other mecanism. For time being we advice that device driver that

mechanism. advise

> +use HMM explicitly explain this corner case so that user are aware that this

users

> +can happens if there is memory pressure.

happen
>


--
~Randy