ANNOUNCE: Hugepage (HPD) block device driver.

From: Dr. Greg Wettstein
Date: Sun Jun 15 2014 - 06:30:31 EST


Good morning, I hope this note finds the weekend going well for
everyone.

Izzy, our Golden Retriever, and I headed out to our lake place last
weekend. I was fighting off a miserable early summer cold so I didn't
feel up to cycling and the fish weren't biting so Izzy suggested we
catch up on our backlogged projects by getting our HPD driver ready
for a release.

So while Izzy intensely monitored the family of geese that
periodically swim by the cabin I got the driver into reasonable shape
for people to start testing. It is now a week later and Izzy thought
that given the work schedule coming up I had better get the package
out on the FTP server.

So without further ado, on behalf of Izzy and Enjellic Systems
Development, I would like to announce the first testing release of the
HugePage Block Device (HPD) driver. Source is available at the
following URL:

ftp://ftp.enjellic.com/pub/hpd/hpd_driver-1.0beta.tar.gz

The HPD driver implements a dynamically configurable RAM-based block
device which uses the kernel hugepage infrastructure and magazines to
provide the memory-based backing for the block devices. It borrows
its heritage from the existing brd ramdisk code, with the primary
differences being dynamic configurability and the backing methodology.

Izzy has watched the discussion of the relevance of hugepages with
some interest. It is his contention that the HPD driver may offer one
of the most useful applications of this infrastructure. There are
obvious advantages for a ramdisk in handling its backing store in
larger units, and NUMA support falls out naturally since the
hugepage infrastructure is NUMA aware.

Block devices are created by writing the desired size of the block
device, in bytes, to the following pseudo-file:

/sys/fs/hpd/create

On a NUMA capable platform there will be additional pseudo-files of
the following form:

/sys/fs/hpd/create_nodeN

Writing to one of these files constrains the ramdisk to use memory
from only the specified node. On NUMA platforms a request through the
/sys/fs/hpd/create file will interleave the ramdisk allocation across
all memory-capable nodes.
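
For the impatient, here is a minimal userspace sketch of the interface
described above. It is not part of the tarball, the 16 GiB size is
just an example value, and it assumes the hugepage pool has already
been populated as described below:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	/* Pick create for interleaved, or create_nodeN for node-pinned. */
	const char *node = "/sys/fs/hpd/create";
	unsigned long long size = 16ULL << 30;	/* size in bytes: 16 GiB */
	FILE *fp = fopen(node, "w");

	if (fp == NULL) {
		perror(node);
		return EXIT_FAILURE;
	}
	fprintf(fp, "%llu\n", size);
	if (fclose(fp) != 0) {
		perror(node);
		return EXIT_FAILURE;
	}
	return EXIT_SUCCESS;
}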

Prior to creating an HPD device an allocation of hugepages must be
made to the free page pool. This can be done by writing the desired
number of pages to the following pseudo-file:

/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages

The following pseudo-files can be used to make allocations which are
pinned to a specific memory-capable NUMA node:

/sys/devices/system/node/nodeN/hugepages/hugepages-2048kB/nr_hugepages

A hugepage allocation can also be requested by passing the following
argument on the kernel command line:

nr_hugepages=N

Depending on the activity of your machine this may be needed, since
memory fragmentation may limit the number of order-9 pages which are
available.
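
As a concrete example, the following sketch (again purely illustrative
and assuming 2 MiB hugepages) reserves the 8192 pages needed to back
the 16 GiB device from the earlier sketch; for a node-pinned pool,
point it at the per-node nr_hugepages file instead:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *pool =
		"/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages";
	/* 16 GiB of backing at 2 MiB per hugepage -> 8192 pages. */
	unsigned long long pages = (16ULL << 30) / (2ULL << 20);
	FILE *fp = fopen(pool, "w");

	if (fp == NULL) {
		perror(pool);
		return EXIT_FAILURE;
	}
	fprintf(fp, "%llu\n", pages);
	if (fclose(fp) != 0) {
		perror(pool);
		return EXIT_FAILURE;
	}
	return EXIT_SUCCESS;
}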

A ramdisk is deleted by writing the value 1 to the following
pseudo-file:

/sys/block/hpdN/device/delete
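
A corresponding sketch for tearing a device down again; hpd0 is just
an example, substitute the minor you want to delete:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	FILE *fp = fopen("/sys/block/hpd0/device/delete", "w");

	if (fp == NULL || fputs("1\n", fp) == EOF || fclose(fp) != 0) {
		perror("hpd0 delete");
		return EXIT_FAILURE;
	}
	return EXIT_SUCCESS;
}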

We have found the driver to be particularly useful in testing our SCST
implementation, extensions and infrastructure. It is capable of
sustaining line-rate 10+ Gbps throughput, which allows target
infrastructure to be tested and verified with FIO running in verify
mode. The NULLIO target, while fast of course, does not allow
verification of I/O since there is no persistent backing.

Measured I/O latency on 4K block sizes is approximately five
microseconds. Based on that Izzy thought we should get this released
for our brethren in the storage appliance industry. He suggests that
pretty impressive appliance benchmark numbers can be obtained by
using an HPD-based cache device with bcache in writeback mode..... :-)

The driver includes a small patch to mm/hugetlb.c which adds two
exported functions for allocation and release of generic hugepages.
This was needed since there was not a suitable API for
allocating/releasing extended size pages in a NUMA-aware fashion.

From an architectural perspective the HPD driver differs from the
current ramdisk driver by using a single extended size page to hold
the array of page pointers for the backing store rather than a radix
tree mapping sectors to pages. This limits the size of an individual
block device to one-half of a terabyte.

A single major with 128 minors is supported. Izzy recommends using
RAID0 if single block device semantics are needed and you have a
machine with 64 terabytes of RAM handy. Hopefully this won't be a
major limitation for anyone other than the SGI boys.
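
For those who like to check the arithmetic, the half terabyte and 64
terabyte figures fall straight out of the pointer math, assuming 2 MiB
hugepages and 8 byte page pointers on a 64-bit build:

#include <stdio.h>

int main(void)
{
	unsigned long long hugepage = 2ULL << 20;	     /* 2 MiB metadata page */
	unsigned long long ptrs = hugepage / sizeof(void *); /* 262144 on 64-bit */
	unsigned long long per_dev = ptrs * hugepage;	     /* bytes per device */

	printf("per-device limit: %llu GiB\n", per_dev >> 30);	       /* 512 */
	printf("128-minor limit : %llu TiB\n", (128 * per_dev) >> 40); /* 64 */
	return 0;
}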

The driver has been tested pretty extensively but public releases are
legendary for brown bag issues. Please let us know if you test this
and run into any issues.

Finally, thanks and kudos should go to Izzy for prompting all the work
on this. Anyone who has been snooted by a Golden Retriever will tell
you how difficult they are to resist once they put their mind to
something. Anyone who finds this driver useful should note that he
enjoys the large Milk Bone (tm) dog biscuits.... :-)

Best wishes for a productive week.

Dr Greg and Izzy.

As always,
Dr. G.W. Wettstein, Ph.D.   Enjellic Systems Development, LLC.
4206 N. 19th Ave.           Specializing in information
Fargo, ND 58102             infra-structure development.
PH: 701-281-1686
FAX: 701-281-3949           EMAIL: greg@xxxxxxxxxxxx
------------------------------------------------------------------------------
"One problem with monolithic business structures is losing sight
of the fundamental importance of mathematics. Consider committees;
commonly forgotten is the relationship that given a projection of N
individuals to complete an assignment the most effective number of
people to assign to the committee is given by f(N) = N - (N-1)."
-- Dr. G.W. Wettstein
Guerrilla Tactics for Corporate Survival