Re: [PATCH RFC v2 net-next 05/16] bpf: introduce syscall(BPF, ...) and BPF maps

From: Kees Cook
Date: Wed Jul 23 2014 - 14:02:23 EST


On Thu, Jul 17, 2014 at 9:19 PM, Alexei Starovoitov <ast@xxxxxxxxxxxx> wrote:
> BPF syscall is a demux for different BPF releated commands.
>
> 'maps' is a generic storage of different types for sharing data between kernel
> and userspace.
>
> The maps can be created from user space via BPF syscall:
> - create a map with given type and attributes
> fd = bpf_map_create(map_type, struct nlattr *attr, int len)
> returns fd or negative error
>
> - close(fd) deletes the map
>
> Next patch allows userspace programs to populate/read maps that eBPF programs
> are concurrently updating.
>
> maps can have different types: hash, bloom filter, radix-tree, etc.
>
> The map is defined by:
> . type
> . max number of elements
> . key size in bytes
> . value size in bytes
>
> Next patches allow eBPF programs to access maps via API:
> void * bpf_map_lookup_elem(u32 fd, void *key);
> int bpf_map_update_elem(u32 fd, void *key, void *value);
> int bpf_map_delete_elem(u32 fd, void *key);
>
> This patch establishes core infrastructure for BPF maps.
> Next patches implement lookup/update and hashtable type.
> More map types can be added in the future.
>
> syscall is using type-length-value style of passing arguments to be backwards
> compatible with future extensions to map attributes. Different map types may
> use different attributes as well.
> The concept of type-lenght-value is borrowed from netlink, but netlink itself
> is not applicable here, since BPF programs and maps can be used in NET-less
> configurations.
>
> Signed-off-by: Alexei Starovoitov <ast@xxxxxxxxxxxx>
> ---
> Documentation/networking/filter.txt | 69 +++++++++++
> include/linux/bpf.h | 43 +++++++
> include/uapi/linux/bpf.h | 24 ++++
> kernel/bpf/Makefile | 2 +-
> kernel/bpf/syscall.c | 225 +++++++++++++++++++++++++++++++++++
> 5 files changed, 362 insertions(+), 1 deletion(-)
> create mode 100644 include/linux/bpf.h
> create mode 100644 kernel/bpf/syscall.c
>
> diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt
> index ee78eba78a9d..e14e486f69cd 100644
> --- a/Documentation/networking/filter.txt
> +++ b/Documentation/networking/filter.txt
> @@ -995,6 +995,75 @@ BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg
> Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
> 2 byte atomic increments are not supported.
>
> +eBPF maps
> +---------
> +'maps' is a generic storage of different types for sharing data between kernel
> +and userspace.
> +
> +The maps are accessed from user space via BPF syscall, which has commands:
> +- create a map with given id, type and attributes
> + map_id = bpf_map_create(int map_id, map_type, struct nlattr *attr, int len)
> + returns positive map id or negative error

Looks like these docs need updating for the fd-based approach instead
of the map_id approach?

> +
> +- delete map with given map id
> + err = bpf_map_delete(int map_id)
> + returns zero or negative error
> +
> +- lookup key in a given map referenced by map_id
> + err = bpf_map_lookup_elem(int map_id, void *key, void *value)
> + returns zero and stores found elem into value or negative error
> +
> +- create or update key/value pair in a given map
> + err = bpf_map_update_elem(int map_id, void *key, void *value)
> + returns zero or negative error
> +
> +- find and delete element by key in a given map
> + err = bpf_map_delete_elem(int map_id, void *key)
> +
> +userspace programs uses this API to create/populate/read maps that eBPF programs
> +are concurrently updating.
> +
> +maps can have different types: hash, bloom filter, radix-tree, etc.
> +
> +The map is defined by:
> + . id
> + . type
> + . max number of elements
> + . key size in bytes
> + . value size in bytes
> +
> +The maps are accesible from eBPF program with API:
> + void * bpf_map_lookup_elem(u32 map_id, void *key);
> + int bpf_map_update_elem(u32 map_id, void *key, void *value);
> + int bpf_map_delete_elem(u32 map_id, void *key);
> +
> +If eBPF verifier is configured to recognize extra calls in the program
> +bpf_map_lookup_elem() and bpf_map_update_elem() then access to maps looks like:
> + ...
> + ptr_to_value = map_lookup_elem(const_int_map_id, key)
> + access memory [ptr_to_value, ptr_to_value + value_size_in_bytes]
> + ...
> + prepare key2 and value2 on stack of key_size and value_size
> + err = map_update_elem(const_int_map_id2, key2, value2)
> + ...
> +
> +eBPF program cannot create or delete maps
> +(such calls will be unknown to verifier)
> +
> +During program loading the refcnt of used maps is incremented, so they don't get
> +deleted while program is running
> +
> +bpf_map_update_elem() can fail if maximum number of elements reached.
> +if key2 already exists, bpf_map_update_elem() replaces it with value2 atomically
> +
> +bpf_map_lookup_elem() can return null or ptr_to_value
> +ptr_to_value is read/write from the program point of view.
> +
> +The verifier will check that the program accesses map elements within specified
> +size. It will not let programs pass junk values as 'key' and 'value' to
> +bpf_map_*_elem() functions, so these functions (implemented in C inside kernel)
> +can safely access the pointers in all cases.
> +
> Testing
> -------
>
> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
> new file mode 100644
> index 000000000000..57af236a0eb4
> --- /dev/null
> +++ b/include/linux/bpf.h
> @@ -0,0 +1,43 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +#ifndef _LINUX_BPF_H
> +#define _LINUX_BPF_H 1
> +
> +#include <uapi/linux/bpf.h>
> +#include <linux/workqueue.h>
> +
> +struct bpf_map;
> +struct nlattr;
> +
> +/* map is generic key/value storage optionally accesible by eBPF programs */
> +struct bpf_map_ops {
> + /* funcs callable from userspace (via syscall) */
> + struct bpf_map *(*map_alloc)(struct nlattr *attrs[BPF_MAP_ATTR_MAX + 1]);
> + void (*map_free)(struct bpf_map *);
> +};
> +
> +struct bpf_map {
> + atomic_t refcnt;
> + int map_id;
> + enum bpf_map_type map_type;
> + u32 key_size;
> + u32 value_size;
> + u32 max_entries;
> + struct bpf_map_ops *ops;
> + struct work_struct work;
> +};
> +
> +struct bpf_map_type_list {
> + struct list_head list_node;
> + struct bpf_map_ops *ops;
> + enum bpf_map_type type;
> +};
> +
> +void bpf_register_map_type(struct bpf_map_type_list *tl);
> +struct bpf_map *bpf_map_get(u32 map_id);
> +
> +#endif /* _LINUX_BPF_H */
> diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
> index 3ff5bf5045a7..dcc7eb97a64a 100644
> --- a/include/uapi/linux/bpf.h
> +++ b/include/uapi/linux/bpf.h
> @@ -300,4 +300,28 @@ struct bpf_insn {
> __s32 imm; /* signed immediate constant */
> };
>
> +/* BPF syscall commands */
> +enum bpf_cmd {
> + /* create a map with given type and attributes
> + * fd = bpf_map_create(bpf_map_type, struct nlattr *attr, int len)
> + * returns fd or negative error
> + * map is deleted when fd is closed
> + */
> + BPF_MAP_CREATE,
> +};
> +
> +enum bpf_map_attributes {
> + BPF_MAP_UNSPEC,
> + BPF_MAP_KEY_SIZE, /* size of key in bytes */
> + BPF_MAP_VALUE_SIZE, /* size of value in bytes */
> + BPF_MAP_MAX_ENTRIES, /* maximum number of entries in a map */
> + __BPF_MAP_ATTR_MAX,
> +};
> +#define BPF_MAP_ATTR_MAX (__BPF_MAP_ATTR_MAX - 1)
> +#define BPF_MAP_MAX_ATTR_SIZE 65535
> +
> +enum bpf_map_type {
> + BPF_MAP_TYPE_UNSPEC,
> +};
> +
> #endif /* _UAPI__LINUX_BPF_H__ */
> diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
> index 6a71145e2769..e9f7334ed07a 100644
> --- a/kernel/bpf/Makefile
> +++ b/kernel/bpf/Makefile
> @@ -1 +1 @@
> -obj-y := core.o
> +obj-y := core.o syscall.o
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> new file mode 100644
> index 000000000000..c4a330642653
> --- /dev/null
> +++ b/kernel/bpf/syscall.c
> @@ -0,0 +1,225 @@
> +/* Copyright (c) 2011-2014 PLUMgrid, http://plumgrid.com
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
> + * General Public License for more details.
> + */
> +#include <linux/bpf.h>
> +#include <linux/syscalls.h>
> +#include <net/netlink.h>
> +#include <linux/anon_inodes.h>
> +
> +/* mutex to protect insertion/deletion of map_id in IDR */
> +static DEFINE_MUTEX(bpf_map_lock);
> +static DEFINE_IDR(bpf_map_id_idr);
> +
> +/* maximum number of outstanding maps */
> +#define MAX_BPF_MAP_CNT 1024
> +static u32 bpf_map_cnt;
> +
> +static LIST_HEAD(bpf_map_types);
> +
> +static struct bpf_map *find_and_alloc_map(enum bpf_map_type type,
> + struct nlattr *tb[BPF_MAP_ATTR_MAX + 1])
> +{
> + struct bpf_map_type_list *tl;
> + struct bpf_map *map;
> +
> + list_for_each_entry(tl, &bpf_map_types, list_node) {
> + if (tl->type == type) {
> + map = tl->ops->map_alloc(tb);
> + if (IS_ERR(map))
> + return map;
> + map->ops = tl->ops;
> + map->map_type = type;
> + return map;
> + }
> + }
> + return ERR_PTR(-EINVAL);
> +}
> +
> +/* boot time registration of different map implementations */
> +void bpf_register_map_type(struct bpf_map_type_list *tl)
> +{
> + list_add(&tl->list_node, &bpf_map_types);
> +}
> +
> +/* called from workqueue */
> +static void bpf_map_free_deferred(struct work_struct *work)
> +{
> + struct bpf_map *map = container_of(work, struct bpf_map, work);
> +
> + /* grab the mutex and free the map */
> + mutex_lock(&bpf_map_lock);
> +
> + bpf_map_cnt--;
> + idr_remove(&bpf_map_id_idr, map->map_id);
> +
> + mutex_unlock(&bpf_map_lock);
> +
> + /* implementation dependent freeing */
> + map->ops->map_free(map);
> +}
> +
> +/* decrement map refcnt and schedule it for freeing via workqueue
> + * (unrelying map implementation ops->map_free() might sleep)
> + */
> +static void __bpf_map_put(struct bpf_map *map)
> +{
> + if (atomic_dec_and_test(&map->refcnt)) {
> + INIT_WORK(&map->work, bpf_map_free_deferred);
> + schedule_work(&map->work);
> + }
> +}
> +
> +/* find map by id and decrement its refcnt
> + *
> + * can be called without any locks held
> + *
> + * returns true if map was found
> + */
> +static bool bpf_map_put(u32 map_id)
> +{
> + struct bpf_map *map;
> +
> + rcu_read_lock();
> + map = idr_find(&bpf_map_id_idr, map_id);
> +
> + if (!map) {
> + rcu_read_unlock();
> + return false;
> + }
> +
> + __bpf_map_put(map);
> + rcu_read_unlock();
> +
> + return true;
> +}
> +
> +/* called with bpf_map_lock held */
> +struct bpf_map *bpf_map_get(u32 map_id)
> +{
> + BUG_ON(!mutex_is_locked(&bpf_map_lock));
> +
> + return idr_find(&bpf_map_id_idr, map_id);
> +}
> +
> +static int bpf_map_release(struct inode *inode, struct file *filp)
> +{
> + struct bpf_map *map = filp->private_data;
> +
> + __bpf_map_put(map);
> + return 0;
> +}
> +
> +static const struct file_operations bpf_map_fops = {
> + .release = bpf_map_release,
> +};
> +
> +static const struct nla_policy map_policy[BPF_MAP_ATTR_MAX + 1] = {
> + [BPF_MAP_KEY_SIZE] = { .type = NLA_U32 },
> + [BPF_MAP_VALUE_SIZE] = { .type = NLA_U32 },
> + [BPF_MAP_MAX_ENTRIES] = { .type = NLA_U32 },
> +};
> +
> +/* called via syscall */
> +static int map_create(enum bpf_map_type type, struct nlattr __user *uattr, int len)
> +{
> + struct nlattr *tb[BPF_MAP_ATTR_MAX + 1];
> + struct bpf_map *map;
> + struct nlattr *attr;
> + int err;
> +
> + if (len <= 0 || len > BPF_MAP_MAX_ATTR_SIZE)
> + return -EINVAL;
> +
> + attr = kmalloc(len, GFP_USER);
> + if (!attr)
> + return -ENOMEM;
> +
> + /* copy map attributes from user space */
> + err = -EFAULT;
> + if (copy_from_user(attr, uattr, len) != 0)
> + goto free_attr;
> +
> + /* perform basic validation */
> + err = nla_parse(tb, BPF_MAP_ATTR_MAX, attr, len, map_policy);
> + if (err < 0)
> + goto free_attr;
> +
> + /* find map type and init map: hashtable vs rbtree vs bloom vs ... */
> + map = find_and_alloc_map(type, tb);
> + if (IS_ERR(map)) {
> + err = PTR_ERR(map);
> + goto free_attr;
> + }
> +
> + atomic_set(&map->refcnt, 1);
> +
> + mutex_lock(&bpf_map_lock);
> +
> + if (bpf_map_cnt >= MAX_BPF_MAP_CNT) {
> + mutex_unlock(&bpf_map_lock);
> + err = -ENOSPC;
> + goto free_map;
> + }
> +
> + /* allocate map id */
> + err = idr_alloc(&bpf_map_id_idr, map, 1 /* min map_id */, 0, GFP_USER);
> +
> + if (err > 0)
> + bpf_map_cnt++;
> +
> + map->map_id = err;
> +
> + mutex_unlock(&bpf_map_lock);
> +
> + if (err < 0)
> + /* failed to allocate map id */
> + goto free_map;
> +
> + err = anon_inode_getfd("bpf-map", &bpf_map_fops, map, O_RDWR | O_CLOEXEC);
> +
> + if (err < 0)
> + /* failed to allocate fd */
> + goto free_map_id;
> +
> + /* user supplied array of map attributes is no longer needed */
> + kfree(attr);
> +
> + return err;
> +
> +free_map_id:
> + /* grab the mutex and free the map */
> + mutex_lock(&bpf_map_lock);
> +
> + bpf_map_cnt--;
> + idr_remove(&bpf_map_id_idr, map->map_id);
> +
> + mutex_unlock(&bpf_map_lock);
> +free_map:
> + map->ops->map_free(map);
> +free_attr:
> + kfree(attr);
> + return err;
> +}
> +
> +SYSCALL_DEFINE5(bpf, int, cmd, unsigned long, arg2, unsigned long, arg3,
> + unsigned long, arg4, unsigned long, arg5)
> +{
> + if (!capable(CAP_SYS_ADMIN))
> + return -EPERM;

It might be valuable to have a comment here describing why this is
currently limited to CAP_SYS_ADMIN.

> +
> + switch (cmd) {
> + case BPF_MAP_CREATE:
> + return map_create((enum bpf_map_type) arg2,
> + (struct nlattr __user *) arg3, (int) arg4);

I'd recommend requiring arg5 == 0 here, just for future flexibility.

-Kees

> + default:
> + return -EINVAL;
> + }
> +}
> --
> 1.7.9.5
>



--
Kees Cook
Chrome OS Security
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/