[RFC PATCH 01/25] Documentation: kvx: Add basic documentation

From: Yann Sionneau
Date: Tue Jan 03 2023 - 12:03:00 EST


Add some documentation for kvx arch and its Linux port.

CC: Jonathan Corbet <corbet@xxxxxxx>
CC: linux-doc@xxxxxxxxxxxxxxx
CC: linux-kernel@xxxxxxxxxxxxxxx
Co-developed-by: Clement Leger <clement.leger@xxxxxxxxxxx>
Signed-off-by: Clement Leger <clement.leger@xxxxxxxxxxx>
Co-developed-by: Guillaume Thouvenin <gthouvenin@xxxxxxxxx>
Signed-off-by: Guillaume Thouvenin <gthouvenin@xxxxxxxxx>
Signed-off-by: Yann Sionneau <ysionneau@xxxxxxxxx>
---
Documentation/kvx/kvx-exceptions.txt | 246 ++++++++++++++++++++++++
Documentation/kvx/kvx-iommu.txt | 183 ++++++++++++++++++
Documentation/kvx/kvx-mmu.txt | 272 +++++++++++++++++++++++++++
Documentation/kvx/kvx-smp.txt | 36 ++++
Documentation/kvx/kvx.txt | 268 ++++++++++++++++++++++++++
5 files changed, 1005 insertions(+)
create mode 100644 Documentation/kvx/kvx-exceptions.txt
create mode 100644 Documentation/kvx/kvx-iommu.txt
create mode 100644 Documentation/kvx/kvx-mmu.txt
create mode 100644 Documentation/kvx/kvx-smp.txt
create mode 100644 Documentation/kvx/kvx.txt

diff --git a/Documentation/kvx/kvx-exceptions.txt b/Documentation/kvx/kvx-exceptions.txt
new file mode 100644
index 000000000000..11368287bd48
--- /dev/null
+++ b/Documentation/kvx/kvx-exceptions.txt
@@ -0,0 +1,246 @@
+Exceptions
+==========
+On kvx, handlers are set using $ev (exception vector) register which
+specifies a base address.
+An offset is added to $ev upon exception and the result is used as
+"Next $pc".
+The offset depends on which exception vector the cpu wants to jump to:
+* $ev + 0x00 for debug
+* $ev + 0x40 for trap
+* $ev + 0x80 for interrupt
+* $ev + 0xc0 for syscall
+
+Then, handlers are laid in the following order:
+
+ _____________
+ | |
+ | Syscall |
+ |_____________|
+ | |
+ | Interrupts |
+ |_____________|
+ | |
+ | Traps |
+ |_____________|
+ | | ^
+ | Debug | | Stride
+BASE -> |_____________| v
+
+
+Interrupts, and traps are serviced similarly, ie:
+- Jump to handler
+- Save all registers
+- Prepare the call (do_IRQ or trap_handler)
+- restore all registers
+- return from exception
+
+entry.S file is (as for other architectures) the entry point into the kernel.
+It contains all assembly routines related to interrupts/traps/syscall.
+
+Syscall handling
+================
+
+When executing a syscall, it must be done using "scall $r6"
+where $r6 contains the syscall number. Using this convention allow to
+modify and restart a syscall from the kernel.
+
+Syscalls are handled differently than interrupts/exceptions. From an ABI
+point of view, scalls are like function calls: any caller saved register
+can be clobbered by the syscall. However, syscall parameters are passed
+using registers r0 through r7. These registers must be preserved to avoid
+cloberring them before the actual syscall function.
+
+On syscall from userspace (scall instruction), the processor will put
+the syscall number in $es.sn and switch from user to kernel privilege
+mode. kvx_syscall_handler will be called in kernel mode.
+
+The following steps are then taken:
+
+- Switch to kernel stack
+- Extract syscall number
+- Check that the syscall number is not bogus
+ - If so, set syscall func to a not implemented one
+- Check if tracing is enabled
+ - If so, jump to trace_syscall_enter
+ - Save syscall arguments (r0 -> r7) on stack in pt_regs
+ - Call do_trace_syscall_enter function
+- Restore syscall arguments since they have been modified by C call
+- Call the syscall function
+- Save $r0 in pt_regs since it can be cloberred afterward
+- If tracing was enabled, call trace_syscall_exit
+- Call work_pending
+- Return to user !
+
+The trace call is handled out of the fast path. All slow path handling
+is done in another part of code to avoid messing with the cache.
+
+Signals
+=======
+
+Signals are handled when exiting kernel before returning to user.
+When handling a signal, the path is the following:
+
+1 - User application is executing normally
+ Then any exception happens (syscall, interrupt, trap)
+2 - The exception handling path is taken
+ and before returning to user, pending signals are checked
+3 - Signal are handled by do_signal
+ Registers are saved and a special part of the stack is modified
+ to create a trampoline to call rt_sigreturn
+ $spc is modified to jump to user signal handler
+ $ra is modified to jump to sigreturn trampoline directly after
+ returning from user signal handler.
+4 - User signal handler is called after rfe from exception
+ when returning, $ra is retored to $pc, resulting in a call
+ to the syscall trampoline.
+5 - syscall trampoline is executed, leading to rt_sigreturn syscall
+6 - rt_sigreturn syscall is executed
+ Previous registers are restored to allow returning to user correctly
+7 - User application is restored at the exact point it was interrupted
+ before.
+
+
+ +----------+
+ | 1 |
+ | User app | @func
+ | (user) |
+ +---+------+
+ |
+ | it/trap/scall
+ |
+ +---v-------+
+ | 2 |
+ | exception |
+ | handling |
+ | (kernel) |
+ +---+-------+
+ |
+ | Check if signal are pending, if so, handle signals
+ |
+ +---v--------+
+ | 3 |
+ | do_signal |
+ | handling |
+ | (kernel) |
+ +----+-------+
+ |
+ | Return to user signal handler
+ |
+ +----v------+
+ | 4 |
+ | signal |
+ | handler |
+ | (user) |
+ +----+------+
+ |
+ | Return to sigreturn trampoline
+ |
+ +----v-------+
+ | 5 |
+ | syscall |
+ |rt_sigreturn|
+ | (user) |
+ +----+-------+
+ |
+ | Syscall to rt_sigreturn
+ |
+ +----v-------+
+ | 6 |
+ | sigreturn |
+ | handler |
+ | (kernel) |
+ +----+-------+
+ |
+ | Modify context to return to original func
+ |
+ +----v-----+
+ | 7 |
+ | User app | @func
+ | (user) |
+ +----------+
+
+Registers handling
+==================
+
+MMU is disabled in all exceptions paths, during register save and restoration.
+This will prevent from triggering MMU fault (such as TLB miss) which could
+clobber the current register state. Such event can occurs when RWX mode is
+enabled and the memory accessed to save register can trigger a TLB miss.
+Aside from that which is common for all exceptions path, registers are saved
+differently regarding the type of exception.
+
+Interrupts and traps
+--------------------
+
+When interrupt and traps are triggered, we only save the caller-saved registers.
+Indeed, we rely on the fact that C code will save and restore callee-saved and
+hence, there is no need to save them. This path is the following:
+
+ +------------+ +-----------+ +---------------+
+IT | Save caller| C Call | Execute C | Ret | Restore caller| Ret from IT
++--->+ saved +--------->+ handler +------->+ saved +----->
+ | registers | +-----------+ | registers |
+ +------------+ +---------------+
+
+However, when returning to user, we check if there is work_pending. If a signal
+is pending and there is a signal handler to be called, then we need all
+registers to be saved on the stack in the pt_regs before executing the signal
+handler and restored after that. Since we only saved caller-saved registers, we
+need to also save callee-saved registers to restore them correctly when
+returning to user. This path is the following (a bit more complicated !):
+
+ +------------+
+ | Save caller| +-----------+ Ret +------------+
+ IT | saved | C Call | Execute C | to asm | Check work |
+ +--->+ registers +--------->+ handler +------->+ pending |
+ | to pt_regs | +-----------+ +--+---+-----+
+ +------------+ | |
+ Work pending | | No work pending
+ +--------------------------------------------+ |
+ | |
+ | +------------+
+ v |
+ +------+------+ v
+ | Save callee | +-------+-------+
+ | saved | | Restore caller| RFE from IT
+ | registers | | saved +------->
+ | to pt_regs | | registers |
+ +--+-------+--+ | from pt_regs |
+ | | +-------+-------+
+ | | +---------+ ^
+ | | | Execute | |
+ | +-------->+ needed +-----------+
+ | | work |
+ | +---------+
+ |Signal handler ?
+ v
++----+----------+ RFE to user +-------------+ +--------------+
+| Copy all | handler | Execute | ret | rt_sigreturn |
+| registers +------------>+ user signal +------>+ trampoline |
+| from pt_regs | | handler | | to kernel |
+| to user stack | +-------------+ +------+-------+
++---------------+ |
+ syscall rt_sigreturn |
+ +-------------------------------------------------+
+ |
+ v
++--------+-------+ +-------------+
+| Recopy all | | Restore all | RFE
+| registers from +--------------------->+ saved +------->
+| user stack | Return | registers |
+| to pt_regs | from sigreturn |from pt_regs |
++----------------+ (via ret_from_fork) +-------------+
+
+
+Syscalls
+--------
+As explained before, for syscalls, we can use whatever callee-saved registers
+we want since syscall are seen as a "classic" call from ABI pov.
+Only different path is the one for clone. For this path, since the child expects
+to find same callee-registers content than his parent, we must save them before
+executing the clone syscall and restore them after that for the child. This is
+done via a redefinition of __sys_clone in assembly which will be called in place
+of the standard sys_clone. This new call will save callee saved registers
+in pt_regs. Parent will return using the syscall standard path. Freshly spawned
+child however will be woken up via ret_from_fork which will restore all
+registers (even if caller saved are not needed).
diff --git a/Documentation/kvx/kvx-iommu.txt b/Documentation/kvx/kvx-iommu.txt
new file mode 100644
index 000000000000..96b74ce71acb
--- /dev/null
+++ b/Documentation/kvx/kvx-iommu.txt
@@ -0,0 +1,183 @@
+IOMMU
+=====
+
+General Overview
+----------------
+
+To exchange data between device and users through memory, the driver has to
+set up a buffer by doing some kernel allocation. The address of the buffer is
+virtual and the physical one is obtained through the MMU. When the device wants
+to access the same physical memory space it uses a bus address. This address is
+obtained by using the DMA mapping API. The Coolidge SoC includes several IOMMUs for clusters,
+PCIe peripherals, SoC peripherals, and more; that will translate this "bus address"
+into a physical one during DMA operations.
+
+The bus addresses are IOVA (I/O Virtual Address) or DMA addresses. This
+addresses can be obtained by calling the allocation functions of the DMA APIs.
+It can also be obtained through classical kernel allocation of physical
+contiguous memory and then calling mapping functions of the DMA API.
+
+In order to be able to use the kvx IOMMU we have implemented the IOMMU DMA
+interface in arch/kvx/mm/dma-mapping.c. DMA functions are registered by
+implementing arch_setup_dma_ops() and generic IOMMU functions. Generic IOMMU
+are calling our specific IOMMU functions that adding or removing mappings
+between DMA addresses and physical addresses in the IOMMU TLB.
+
+Specifics IOMMU functions are defined in the kvx IOMMU driver. A kvx IOMMU
+driver is managing two physical hardware IOMMU used for TX and RX. In the next
+section we described the HW IOMMUs.
+
+
+Cluster IOMMUs
+--------------
+
+IOMMUs on cluster are used for DMA and cryptographic accelerators.
+There are six IOMMUs connected to the:
+ - cluster DMA tx
+ - cluster DMA rx
+ - first non secure cryptographic accelerator
+ - second non secure cryptographic accelerator
+ - first secure cryptographic accelerator
+ - second secure cryptographic accelerator
+
+SoC peripherals IOMMUs
+----------------------
+
+Since SoC peripherals are connected to an AXI bus, two IOMMUs are used: one for
+each AXI channel (read and write). These two IOMMUs are shared between all master
+devices and DMA. These two IOMMUs will have the same entries but need to be configured
+independently.
+
+PCIe IOMMUs
+-----------
+
+There is a slave IOMMU (read and write from the MPPA to the PCIe endpoint)
+and a master IOMMU (read and write from a PCIe endpoint to system DDR).
+The PCIe root complex and the MSI/MSI-X controller have been designed to use
+the IOMMU feature when enabled. (For example for supporting endpoint that
+support only 32 bits addresses and allow them to access any memory in a
+64 bits address space). For security reason it is highly recommended to
+activate the IOMMU for PCIe.
+
+IOMMU implementation
+--------------------
+
+The kvx is providing several IOMMUs. Here is a simplified view of all IOMMUs
+and translations that occurs between memory and devices:
+
+ +---------------------------------------------------------------------+
+ | +------------+ +---------+ | CLUSTER X |
+ | | Cores 0-15 +---->+ Crypto | +-----------|
+ | +-----+------+ +----+----+ |
+ | | | |
+ | v v |
+ | +-------+ +------------------------------+ |
+ | | MMU | +----+ IOMMU x4 (secure + insecure) | |
+ | +---+---+ | +------------------------------+ |
+ | | | |
+ +--------------------+ |
+ | | | |
+ v v | |
+ +---+--------+-+ | |
+ | MEMORY | | +----------+ +--------+ +-------+ |
+ | +<-|-----+ IOMMU Rx |<----+ DMA Rx |<----+ | |
+ | | | +----------+ +--------+ | | |
+ | | | | NoC | |
+ | | | +----------+ +--------+ | | |
+ | +--|---->| IOMMU Tx +---->| DMA Tx +---->+ | |
+ | | | +----------+ +--------+ +-------+ |
+ | | +------------------------------------------------+
+ | |
+ | | +--------------+ +------+
+ | |<--->+ IOMMU Rx/Tx +<--->+ PCIe +
+ | | +--------------+ +------+
+ | |
+ | | +--------------+ +------------------------+
+ | |<--->+ IOMMU Rx/Tx +<--->+ master Soc Peripherals |
+ | | +--------------+ +------------------------+
+ +--------------+
+
+
+There is also an IOMMU dedicated to the crypto module but this module will not
+be accessed by the operating system.
+
+We will provide one driver to manage IOMMUs RX/TX. All of them will be
+described in the device tree to be able to get their particularities. See
+the example below that describes the relation between IOMMU, DMA and NoC in
+the cluster.
+
+IOMMU is related to a specific bus like PCIe we will be able to specify that
+all peripherals will go through this IOMMU.
+
+### IOMMU Page table
+
+We need to be able to know which IO virtual addresses (IOVA) are mapped in the
+TLB in order to be able to remove entries when a device finishes a transfer and
+release memory. This information could be extracted when needed by computing all
+sets used by the memory and then reads all sixteen ways and compare them to the
+IOVA but it won't be efficient. We also need to be able to translate an IOVA
+to a physical address as required by the iova_to_phys IOMMU ops that is used
+by DMA. Like previously it can be done by extracting the set from the address
+and comparing the IOVA to each sixteen entries of the given set.
+
+A solution is to keep a page table for the IOMMU. But this method is not
+efficient for reloading an entry of the TLB without the help of an hardware
+page table. So to prevent the need of a refill we will update the TLB when a
+device request access to memory and if there is no more slot available in the
+TLB we will just fail and the device will have to try again later. It is not
+efficient but at least we won't need to manage the refill of the TLB.
+
+This leads to an issue with the memory that can be used for transfer between
+device and memory (see Limitations below). As we only support 4Ko page size we
+can only map 8Mo. To be able to manage bigger transfer we can implement the
+huge page table in the Linux kernel and use a page table that match the size of
+huge page table for a given IOMMU (typically the PCIe IOMMU).
+
+As we won't refill the TLB we know that we won't have more than 128*16 entries.
+In this case we can simply keep a table with all possible entries.
+
+### Maintenance interface
+
+It is possible to have several "maintainers" for the same IOMMU. The driver is
+using two of them. One that writes the TLB and another interface reads TLB. For
+debug purpose it is possible to display the content of the tlb by using the
+following command in gdb:
+
+ gdb> p kvx_iommu_dump_tlb( <iommu addr>, 0)
+
+Since different management interface are used for read and write it is safe to
+execute the above command at any moment.
+
+### Interrupts
+
+IOMMU can have 3 kind of interrupts that corresponds to 3 different types of
+errors (no mapping. protection, parity). When the IOMMU is shared between
+clusters (SoC periph and PCIe) then fifteen IRQs are generated according to the
+configuration of an association table. The association table is indexed by the
+ASN number (9 bits) and the entry of the table is a subscription mask with one
+bit per destination. Currently this is not managed by the driver.
+
+The driver is only managing interrupts for the cluster. The mode used is the
+stall one. So when an interrupt occurs it is managed by the driver. All others
+interrupts that occurs are stored and the IOMMU is stalled. When driver cleans
+the first interrupt others will be managed one by one.
+
+### ASN (Address Space Number)
+
+This is also know as ASID in some other architecture. Each device will have a
+given ASN that will be given through the device tree. As address space is
+managed at the IOMMU domain level we will use one group and one domain per ID.
+ASN are coded on 9 bits.
+
+Device tree
+-----------
+
+Relationships between devices, DMAs and IOMMUs are described in the
+device tree (see Documentation/devicetree/bindings/iommu/kalray,kvx-iommu.txt
+for more details).
+
+Limitations
+-----------
+
+Only supporting 4 KB page size will limit the size of mapped memory to 8 MB
+because the IOMMU TLB can have at most 128*16 entries.
diff --git a/Documentation/kvx/kvx-mmu.txt b/Documentation/kvx/kvx-mmu.txt
new file mode 100644
index 000000000000..a3ebbef36981
--- /dev/null
+++ b/Documentation/kvx/kvx-mmu.txt
@@ -0,0 +1,272 @@
+MMU
+===
+
+Virtual addresses are on 41 bits for kvx when using 64-bit mode.
+To differentiate kernel from user space, we use the high order bit
+(bit 40). When bit 40 is set, then the higher remaining bits must also be set to
+1. The virtual address must be extended with 1 when the bit 40 is set,
+if not the address must be zero extended. Bit 40 is set for kernel space
+mappings and not set for user space mappings.
+
+Memory Map
+==========
+
+In Linux physical memories are arranged into banks according to the cost of an
+access in term of distance to a memory. As we are UMA architecture we only have
+one bank and thus one node.
+
+A node is divided into several kind of zone. For example if DMA can only access
+a specific area in the physical memory we will define a ZONE_DMA for this purpose.
+In our case we are considering that DMA can access all DDR so we don't have a specific
+zone for this. On 64 bit architecture all DDR can be mapped in virtual kernel space
+so there is no need for a ZONE_HIGHMEM. That means that in our case there is
+only one ZONE_NORMAL. This will be updated if DMA cannot access all memory.
+
+Currently, the memory mapping is the following for 4KB page:
+
++-----------------------+-----------------------+------+-------+--------------+
+| Start | End | Attr | Size | Name |
++-----------------------+-----------------------+------+-------+--------------+
+| 0000 0000 0000 0000 | 0000 003F FFFF FFFF | --- | 256GB | User |
+| 0000 0040 0000 0000 | 0000 007F FFFF FFFF | --- | 256GB | MMAP |
+| 0000 0080 0000 0000 | FFFF FF7F FFFF FFFF | --- | --- | Gap |
+| FFFF FF80 0000 0000 | FFFF FFFF FFFF FFFF | --- | 512GB | Kernel |
+| FFFF FF80 0000 0000 | FFFF FF8F FFFF FFFF | RWX | 64GB | Direct Map |
+| FFFF FF90 0000 0000 | FFFF FF90 3FFF FFFF | RWX | 1GB | Vmalloc |
+| FFFF FF90 4000 0000 | FFFF FFFF FFFF FFFF | RW | 447GB | Free area |
++-----------------------+-----------------------+------+-------+--------------+
+
+Enable the MMU
+==============
+
+All kernel functions and symbols are in virtual memory except for kvx_start()
+function which is loaded at 0x0 in physical memory.
+To be able to switch from physical addresses to virtual addresses we choose to
+setup the TLB at the very beginning of the boot process to be able to map both
+pieces of code. For this we added two entries in the LTLB. The first one,
+LTLB[0], contains the mapping between virtual memory and DDR. Its size is 512MB.
+The second entry, LTLB[1], contains a flat mapping of the first 2MB of the SMEM.
+Once those two entries are present we can enable the MMU. LTLB[1] will be
+removed during paging_init() because once we are really running in virtual space
+it will not be used anymore.
+In order to access more than 512MB DDR memory, the remaining memory (> 512MB) is
+refill using a comparison in kernel_perf_refill that does not walk the kernel
+page table, thus having a faster refill time for kernel. These entries are
+inserted into the LTLB for easier computation (4 LTLB entries). The drawback of
+this approach is that mapped entries are using RWX protection attributes,
+leading to no protection at all.
+
+Kernel strict RWX
+=================
+
+CONFIG_STRICT_KERNEL_RWX is enabled by default in default_defconfig.
+Once booted, if CONFIG_STRICT_KERNEL_RWX is enable, the kernel text and memory
+will be mapped in the init_mm page table. Once mapped, the refill routine for
+the kernel is patched to always do a page table walk, bypassing the faster
+comparison but enforcing page protection attributes when refilling.
+Finally, the LTLB[0] entry is replaced by a 4K one, mapping only exceptions with
+RX protection. It allows us to never trigger nomapping on nomapping refill
+routine which would (obviously) not work... Once this is done, we can flush the
+4 LTLB entries for kernel refill in order to be sure there is no stalled
+entries and that new entries inserted in JTLB will apply.
+
+By default, the following policy is applied on vmlinux sections:
+- init_data: RW
+- init_text: RX (or RWX if parameter rodata=off)
+- text: RX (or RWX if parameter rodata=off)
+- rodata: RW before init, RO after init
+- sdata: RW
+
+Kernel RWX mode can then be switched on/off using /sys/kvx/kernel_rwx file.
+
+Privilege Level
+================
+Since we are using privilege levels on kvx, we make use of the virtual
+spaces to be in the same space as the user. The kernel will have the
+$ps.mmup set in kernel (PL1) and unset for user (PL2).
+As said in kvx documentation, we have two cases when the kernel is
+booted:
+- Either we have been booted by someone (bootloader, hypervisor, etc)
+- Or we are alone (boot from flash)
+
+In both cases, we will use the virtual space 0. Indeed, if we are alone
+on the core, then it means nobody is using the MMU and we can take the
+first virtual space. If not alone, then when writing an entry to the tlb
+using writetlb instruction, the hypervisor will catch it and change the
+virtual space accordingly.
+
+Memblock
+========
+
+When the kernel starts there is no memory allocator available. One of the first
+step in the kernel is to detect the amount of DDR available by getting this
+information in the device tree and initialize the low-level "memblock" allocator.
+
+We start by reserving memory for the whole kernel. For instance with a device
+tree containing 512Mo of DDR you could see the following boot messages:
+
+setup_bootmem: Memory : 0x100000000 - 0x120000000
+setup_bootmem: Reserved: 0x10001f000 - 0x1002d1bc0
+
+During the paging init we need to set:
+ - min_low_pfn that is the lowest PFN available in the system
+ - max_low_pfn that indicates the end if NORMAL zone
+ - max_pfn that is the number of pages in the system
+
+This setting is used for dividing memory into pages and for configuring the
+zone. See the memory map section for more information about ZONE.
+
+Zones are configured in free_area_init_core(). During start_kernel() other
+allocations are done for command line, cpu areas, PID hash table, different
+caches for VFS. This allocator is used until mem_init() is called.
+
+mem_init() is provided by the architecture. For MPPA we just call
+free_all_bootmem() that will go through all pages that are not used by the
+low level allocator and mark them as not used. So physical pages that are
+reserved for the kernel are still used and remain in physical memory. All pages
+released will now be used by the buddy allocator.
+
+Peripherals
+===========
+
+Peripherals are mapped using standard ioremap infrastructure, therefore
+mapped addresses are located in the vmalloc space.
+
+LTLB Usage
+==========
+
+LTLB is used to add resident mapping which allows for faster MMU lookup.
+Currently, the LTLB is used to map some mandatory kernel pages and to allow fast
+accesses to l2 cache (mailbox and registers).
+When CONFIG_STRICT_KERNEL_RWX is disabled, 4 entries are reserved for kernel
+TLB refill using 512MB pages. When CONFIG_STRICT_KERNEL_RWX is enabled, these
+entries are unused since kernel is paginated using the same mecanism than for
+user (page walking and entries in JTLB)
+
+Page Table
+==========
+
+We only support three levels for the page table and 4KB for page size.
+
+3 levels page table
+-------------------
+
+...-----+--------+--------+--------+--------+--------+
+ 40|39 32|31 24|23 16|15 8|7 0|
+...-----++-------+--+-----+---+----+----+---+--------+
+ | | | |
+ | | | +---> [11:0] Offset (12 bits)
+ | | +-------------> [20:12] PTE offset (9 bits)
+ | +-----------------------> [29:21] PMD offset (9 bits)
+ +----------------------------------> [39:30] PGD offset (10 bits)
+Bits 40 to 64 are signed extended according to bit 39. If bit 39 is equal to 1
+we are in kernel space.
+
+As 10 bits are used for PGD we need to allocate 2 pages.
+
+PTE format
+==========
+
+About the format of the PTE entry, as we are not forced by hardware for choices,
+we choose to follow the format described in the RiscV implementation as a
+starting point.
+
+ +---------+--------+----+--------+---+---+---+---+---+---+------+---+---+
+ | 63..23 | 22..13 | 12 | 11..10 | 9 | 8 | 7 | 6 | 5 | 4 | 3..2 | 1 | 0 |
+ +---------+--------+----+--------+---+---+---+---+---+---+------+---+---+
+ PFN Unused S PageSZ H G X W R D CP A P
+ where:
+ P: Present
+ A: Accessed
+ CP: Cache policy
+ D: Dirty
+ R: Read
+ W: Write
+ X: Executable
+ G: Global
+ H: Huge page
+ PageSZ: Page size as set in TLB format (0:4Ko, 1:64Ko, 2:2Mo, 3:512Mo)
+ S: Soft/Special
+ PFN: Page frame number (depends on page size)
+
+Huge bit must be somewhere in the first 12 bits to be able to detect it
+when reading the PMD entry.
+
+PageSZ must be on bit 10 and 11 because it matches the TEL.PS bits. And
+by doing that it is easier in assembly to set the TEL.PS to PageSZ.
+
+Fast TLB refill
+===============
+
+kvx core does not feature a hardware page walker. This work must be done
+by the core in software. In order to optimize TLB refill, a special fast
+path is taken when entering in kernel space.
+In order to speed up the process, the following actions are taken:
+# Save some registers in a per process scratchpad
+# If the trap is a nomapping then try the fastpath
+# Save some more registers for this fastpath
+# Check if faulting address is a memory direct mapping one.
+ # If entry is a direct mapping one and RWX is not enabled, add an entry into LTLB
+ # If not, continue
+# Try to walk the page table
+ # If entry is not present, take the slowpath (do_page_fault)
+# Refill the tlb properly
+# Exit by restoring only a few registers
+
+ASN Handling
+============
+
+Disclaimer: Some part of this are taken from ARC architecture.
+
+kvx MMU provides 9-bit ASN (Address Space Number) in order to tag TLB entries.
+It allows for multiple process with the same virtual space to cohabit without
+the need to flush TLB everytime we context switch.
+kvx implementation to use them is based on other architectures (such as arc
+or xtensa) and uses a wrapping ASN counter containing both cycle/generation and
+asn.
+
++---------+--------+
+|63 10|9 0|
++---------+--------+
+ Cycle ASN
+
+This ASN counter is incremented monotonously to allocate new ASNs. When the
+counter reaches 511 (9 bit), TLB is completely flushed and a new cycle is
+started. A new allocation cycle, post rollover, could potentially reassign an
+ASN to a different task. Thus the rule is to reassign an ASN when the current
+context cycles does not match the allocation cycle.
+The 64 bit @cpu_asn_cache (and mm->asn) have 9 bits MMU ASN and rest 55 bits
+serve as cycle/generation indicator and natural 64 bit unsigned math
+automagically increments the generation when lower 9 bits rollover.
+When the counter completely wraps, we reset the counter to first cycle value
+(ie cycle = 1). This allows to distinguish context without any ASN and old cycle
+generated value with the same operation (XOR on cycle).
+
+Huge page
+=========
+
+Currently only 3 level page table has been implemented for 4Ko base page size.
+So the page shift is 12 bits, the pmd shift is 21 and the pgdir shift is 30
+bits. This choice implies that for 4Ko base page size if we use a PMD as a huge
+page the size will be 2Mo and if we use a PUD as a huge page it will be 1Go.
+
+To support other huge page sizes (64Ko and 512Mo) we need to use several
+contiguous entries in the page table. For huge page of 64Ko we will need to
+use 16 entries in the PTE and for a huge page of 512Mo it means that 256
+entries in PMD will be used.
+
+Debug
+=====
+
+In order to debug the page table and tlb entries, gdb scripts contains commands
+which allows to dump the page table:
+- lx-kvx-page-table-walk
+ - Display the current process page table by default
+- lx-kvx-tlb-decode
+ - Display the content of $tel and $teh into something readable
+
+Other commands available in kvx-gdb are the following:
+- mppa-dump-tlb
+ - Display the content of TLBs (JTLB and LTLB)
+- mppa-lookup-addr
+ - Find physical address matching a virtual one
diff --git a/Documentation/kvx/kvx-smp.txt b/Documentation/kvx/kvx-smp.txt
new file mode 100644
index 000000000000..1b69d77db8cd
--- /dev/null
+++ b/Documentation/kvx/kvx-smp.txt
@@ -0,0 +1,36 @@
+SMP
+===
+
+On kvx, 5 clusters are organized as groups of 16 processors + 1
+secure core (RM) for each cluster. These 17 processors are L1$ coherent
+for TCM (tightly Coupled Memory). A mixed hw/sw L2$ is present to have
+cache coherency on DDR as well as TCM.
+The RM manager is not meant to run Linux so, 16 processors are available
+for SMP.
+
+Booting
+=======
+
+When booting the kvx processor, only the RM is woken up. This RM will
+execute a portion of code located in a section named .rm_firmware.
+By default, a simple power off code is embedded in this section.
+To avoid embedding the firmware in kernel sources, the section is patched
+using external tools to add the L2$ firmware (and replace the default firmware).
+Before executing this firmware, the RM boots the PE0. PE0 will then enable L2
+coherency and request will be stalled until RM boots the L2$ firmware.
+
+Locking primitives
+==================
+
+spinlock/rwlock are using the kernel standard queued spinlock/rwlocks.
+These primitives are based on cmpxch and xchg. More particularly, it uses xchg16
+which is implemented as a read modify write with acswap on 32 bit word since
+kvx does not have cmpxchg for size < 32bits.
+
+IPI
+===
+
+An IPI controller allows to communicate between CPUs using a simple
+memory mapped register. This register can simply be written using a mask to
+trigger interrupts directly to the cores matching the mask.
+
diff --git a/Documentation/kvx/kvx.txt b/Documentation/kvx/kvx.txt
new file mode 100644
index 000000000000..8ce0703de681
--- /dev/null
+++ b/Documentation/kvx/kvx.txt
@@ -0,0 +1,268 @@
+kvx Core Implementation
+=======================
+
+This documents will try to explain any architecture choice for the kvx
+linux port.
+
+Regarding the peripheral, we MUST use device tree to describe ALL
+peripherals. The bindings should always start with "kalray,kvx" for all
+core related peripherals (watchdog, timer, etc)
+
+System Architecture
+===================
+
+On kvx, we have 4 levels of privilege level starting from 0 (most
+privileged one) to 3 (less privilege one). A system of owners allows
+to delegate ownership of resources by using specials system registers.
+
+The 2 main software stacks for Linux Kernel are the following:
+
++-------------+ +-------------+
+| PL0: Debug | | PL0: Debug |
++-------------+ +-------------+
+| PL1: Linux | | PL1: HyperV |
++-------------+ +-------------+
+| PL2: User | | PL2: Linux |
++-------------+ +-------------+
+| | | PL3: User |
++-------------+ +-------------+
+
+In both cases, the kvx support for privileges has been designed using
+only relative PL and thus should work on both configurations without
+any modifications.
+
+When booting, the CPU is executing in PL0 and owns all the privileges.
+This level is almost dedicated to the debug routines for the debugguer.
+It only needs to own few privileges (breakpoint 0 and watchpoint 0) to
+be able to debug a system executing in PL1 to PL3.
+Debug routines are not always there for instance when the kernel is
+executing alone (booted from flash).
+In order to ease the load of debug routines, software convention is to
+jump directly to PL1 and let PL0 for the debug.
+When the kernel boots, it checks if the current privilege level is 0
+($ps.pl is the only absolute value). If so, then it will delegate
+almost all resources to PL1 and use a RFE to lower its execution
+privilege level (see asm_delegate_pl in head.S).
+If the current PL is already different from 0, then it means somebody
+is above us and we need to request resource to inform it we need them. It will
+then either delegate them to us directly or virtualize the delegation.
+All privileges levels have their set of banked registers (ps, ea, sps,
+sr, etc) which contain privilege level specific values.
+$sr (system reserved) is banked and will hold the current task_struct.
+This register is reserved and should not be touched by any other code.
+For more information, refer to the kvx system level architecture manual.
+
+Boot
+====
+
+On kvx, the RM (Secure Core) of Cluster 0 will boot first. It will then be able
+to boot a firmware. This firmware is stored in the rm_firmware section.
+The first argument ($r0) of this firmware will be a pointer to a function with
+the following prototype: void firmware_init_done(uint64_t features). This
+function is responsible of describing the features supported by the firmware and
+will start the first PE after that.
+By default, the rm_firmware function act as the "default" firmware. This
+function does nothing except calling firmware_init_done and then goes to sleep.
+In order to add another firmware, the rm_firmware section is patched using
+objcopy. The content of this section is then replaced by the provided firmware.
+This firmware will do an init and then call firmware_init_done before running
+the main loop.
+When the PE boots, it will check for the firmware features to enable or disable
+specific core features (L2$ for instance).
+
+When entering the C (kvx_lowlevel_start) the kernel will look for a special
+magic in $r0 (0x494C314B). This magic tells the kernel if there is arguments
+passed by a bootloader.
+Currently, the following values are passed through registers:
+ - r1: pointer to command line setup by bootloader
+ - r2: device tree
+
+If this magic is not set, then, the command line will be the one
+provided in the device tree (see bootargs). The default device tree is
+not builtin but will be patched by the runner used (simulator or jtag) in the
+dtb section.
+
+A default stdout-path is desirable to allow early printk.
+
+Boot Memory Allocator
+=====================
+
+The boot memory allocator is used to allocate memory before paging is enabled.
+It is initialized with DDR and also with the shared memory. This first one is
+initialized during the setup_bootmem() and the second one when calling
+early_init_fdt_scan_reserved_mem().
+
+
+Virtual and physical memory
+===========================
+
+The mapping used and the memory management is described in
+Documentation/kvx/kvx-mmu.txt.
+Our Kernel is compiled using virtual addresses that starts at
+0xffffff0000000000. But when it is started the kernel uses physical addresses.
+Before calling the first function arch_low_level_start() we configure 2 entries
+of the LTLB.
+
+The first entry will map the first 1G of virtual address space to the first
+1G of DDR:
+ - TLB[0]: 0xffffff0000000000 -> 0x100000000 (size 512Mo)
+
+The second entry will be a flat mapping of the first 512 Ko of the SMEM. It
+is required to have this flat mapping because there is still code located at
+this address that needs to be executed:
+ - TLB[1]: 0x0 -> 0x0 (size 512Ko)
+
+Once virtual space reached the second entry is removed.
+
+To be able to set breakpoints when MMU is enabled we added a label called
+gdb_mmu_enabled. If you try to set a breakpoint on a function that is in
+virtual memory before the activation of the MMU this address as no signification
+for GDB. So, for example, if you want to break on the function start_kernel()
+you will need to run:
+
+ kvx-gdb -silent path_to/vmlinux \
+ -ex 'tbreak gdb_mmu_enabled' -ex 'run' \
+ -ex 'break start_kernel' \
+ -ex 'continue'
+
+We will also add an option to kvx-gdb to simplify this step.
+
+Timers
+======
+
+The free-runinng clock (clocksource) is based on the DSU. This clock is
+not interruptible and never stops even if core go into idle.
+
+Regarding the tick (clockevent), we use the timer 0 available on the core.
+This timer allows to set a periodic tick which will be used as the main
+tick for each core. Note that this clock is percpu.
+
+get_cycles implementation is based on performance counter. One of them
+is used to count cycles. Note that since this is used only when the core
+is running, there is no need to worry about core sleeping (which will
+stop the cycle counter)
+
+Context switching
+=================
+
+context switching is done in entry.S. When spawning a fresh thread,
+copy_thread is called. During this call, we setup callee saved register
+r20 and r21 to special values containing the function to call.
+
+The normal path for a kernel thread will be the following:
+
+ 1 - Enter copy_thread_tls and setup callee saved registers which will
+ be restored in __switch_to.
+ 2 - set r20 and r21 (in thread_struct) to function and argument and
+ ra to ret_from_kernel_thread.
+ These callee saved will be restored in switch_to.
+ 3 - Call _switch_to at some point.
+ 4 - Save all callee saved register since switch_to is seen as a
+ standard function call by the caller.
+ 5 - Change stack pointer to the new stack
+ 6 - At the end of switch to, set sr0 to the new task and use ret to
+ jump to ret_from_kernel_thread (address restored from ra).
+ 7 - In ret_from_kernel_thread, execute the function with arguments by
+ using r20, r21 and we are done
+
+For more explanation, you can refer to https://lwn.net/Articles/520227/
+
+User thread creation
+====================
+
+We are using almost the same path as copy_thread to create it.
+The detailed path is the following:
+
+ 1 - Call start_thread which will setup user pc and stack pointer in
+ task regs. We also set sps and clear privilege mode bit.
+ When returning from exception, it will "flip" to user mode.
+ 2 - Enter copy_thread_tls and setup callee saved registers which will
+ be restored in __switch_to. Also, set the "return" function to be
+ ret_from_fork which will be called at end of switch_to
+ 3 - set r20 (in thread_struct) with tracing information.
+ (simply by lazyness to avoid computing it in assembly...)
+ 4 - Call _switch_to at some point.
+ 5 - The current pc will then be restored to be ret_from fork.
+ 6 - Ret from fork calls schedule_tail and then check if tracing is
+ enabled. If so call syscall_trace_exit
+ 7 - finally, instead of returning to kernel, we restore all registers
+ that have been setup by start_thread by restoring regs stored on
+ stack
+
+L2$ handling
+============
+
+On kvx, the L2$ is handled by a firmware running on the RM. This firmware needs
+various information to be aware of its configuration and communicate with the
+kernel. In order to do that, when firmware is starting, the device tree is given
+as parameter along with the "registers" zone. This zone is simply a memory area
+where data are exchanged between kernel <-> L2$. When some commands are written
+to it, the kernel sends an interrupt using a mailbox.
+If the L2$ node is not present in the device tree, then, the RM will directly go
+into sleeping.
+
+Boot diagram:
+
+ RM PE 0
+ +
+ +---------+ |
+ | Boot | |
+ +----+----+ |
+ | |
+ v |
+ +-----+-----+ |
+ | Prepare | |
+ | L2 shared | |
+ | memory | |
+ |(registers)| |
+ +-----+-----+ |
+ | | +-----------+
+ +------------------->+ Boot |
+ | | +-----+-----+
+ v | |
+ +--------+---------+ | |
+ | L2 firmware | | |
+ | parameters: | | |
+ | r0 = registers | | |
+ | r1 = DTB | | |
+ +--------+---------+ | |
+ | | |
+ v | |
+ +-------+--------+ | +------+------+
+ | L2 firmware | | | Wait for L2 |
+ | execution | | | to be ready |
+ +-------+--------+ | +------+------+
+ | | |
+ +------v-------+ | v
+ | L2 requests | | +------+------+
++--->+ handling | | | Enable |
+| +-------+------+ | | L2 caching |
+| | | +------+------+
+| | | |
++------------+ + v
+
+
+Since this driver is started early (before SMP boot), A lot of drivers are not
+yet probed (mailboxes, iommu, etc) and thus can not be used.
+
+Building
+========
+
+In order to build the kernel, you will need a complete kvx toolchain.
+First, setup the config using the following command line
+
+$ make ARCH=kvx O=your_directory default_defconfig
+
+Adjust any configuration option you may need and then, build the kernel:
+
+$ make ARCH=kvx O=your_directory -j12
+
+You will finally have a vmlinux image ready to be run.
+
+$ kvx-mppa -- vmlinux
+
+Additionally, you may want to debug it. To do so, use kvx-gdb:
+
+$ kvx-gdb vmlinux
+
+
--
2.37.2