# **Blackwire**<sup>TM</sup>: WireGuard in HDL, 100+ Gbit/s on an FPGA NIC

Leon Woestenberg

# WireGuard: FAST, MODERN, SECURE VPN TUNNEL

- State-of-the-art, clean-slate protocol for building network overlays with encryption and authentication: IP/IP, many use cases, trusted crypto.
- A *WireGuard peer* is a host in the VPN, identified by its public key. Each peer can potentially reach every remote peer directly: **meshing**.
- A *WireGuard* <u>endpoint</u> is the public (or outer) UDP address of a peer. It may change when the endpoint **roams**; the tunnel stays up.


- Each peer has a set of <u>allowed IP</u> address prefixes within the tunnel.
- Multicasting is possible.
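As a rough illustration of how peers, endpoints and allowed IPs relate, here is a minimal Python sketch. It is a software model only, not Blackwire code; the `Peer` class, the `on_authenticated_packet` function and all addresses and keys are hypothetical names chosen for this example. It assumes the packet has already been decrypted and authenticated for the given peer.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class Peer:
    public_key: str          # identifies the peer (placeholder string here)
    allowed_ips: list        # inner prefixes this peer may use as source
    endpoint: tuple = None   # outer (ip, udp_port); may change when roaming


def on_authenticated_packet(peer, outer_src, inner_src_ip):
    """Handle a packet that already passed authentication for `peer`."""
    # Roaming: remember the outer address the latest valid packet came from.
    peer.endpoint = outer_src
    # Accept the inner packet only if its source lies in an allowed prefix.
    src = ipaddress.ip_address(inner_src_ip)
    return any(src in ipaddress.ip_network(p) for p in peer.allowed_ips)


peer = Peer("hypothetical-public-key", ["10.0.0.2/32"], ("198.51.100.7", 51820))
accepted = on_authenticated_packet(peer, ("203.0.113.9", 40000), "10.0.0.2")
print(accepted, peer.endpoint)
```

In this model the peer moves to a new outer address and the tunnel state simply follows it, while an inner source address outside the allowed prefixes is rejected.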





# Why WireGuard on an FPGA?

- The speed growth curve of Ethernet is steeper than that of CPUs. Offloading 'infrastructure tasks' such as VPN encryption and authentication to SmartNIC FPGAs becomes cost-effective.
- Deterministic, guaranteed behaviour vs. best-effort on a CPU.

#### Example use case:



9th February 2023 – The Video Services Forum (VSF) has further enhanced the Reliable Internet Streaming Transport (RIST) protocol with the use of WireGuard VPN in RIST devices.

### Map WireGuard into our FPGA design space




### Map WireGuard; Talk Numbers, challenges

- 100 Gbit/s data path: 512-bit wide AXI Stream at 250 MHz; worst case, one packet header in each clock cycle.
- Roughly one handshake per minute per active peer connection.
- 1024 peers $\rightarrow$ 58 ms budget per handshake (60 s / 1024); take a conservative design budget: 25 ms?
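The numbers on this slide can be checked with a few lines of Python. This is only a back-of-the-envelope sketch; the variable names are made up for the example, and the 60-second interval is the slide's "roughly one handshake per minute" taken literally.

```python
# Data plane: raw bus throughput and the clock needed for 100 Gbit/s.
BUS_WIDTH_BITS = 512
CLOCK_HZ = 250e6
bus_gbps = BUS_WIDTH_BITS * CLOCK_HZ / 1e9
min_clock_mhz = 100e9 / BUS_WIDTH_BITS / 1e6

# Control plane: time budget per handshake if all peers are active.
PEERS = 1024
HANDSHAKE_INTERVAL_S = 60.0   # assumption: exactly one handshake per minute
budget_ms = HANDSHAKE_INTERVAL_S / PEERS * 1e3
handshakes_per_s = PEERS / HANDSHAKE_INTERVAL_S

DESIGN_BUDGET_MS = 25.0       # the conservative figure proposed on the slide
headroom = budget_ms / DESIGN_BUDGET_MS

print(f"bus throughput   : {bus_gbps:.0f} Gbit/s")
print(f"min clock        : {min_clock_mhz:.1f} MHz for 100 Gbit/s")
print(f"handshake budget : {budget_ms:.1f} ms ({handshakes_per_s:.1f} handshakes/s)")
print(f"headroom at 25 ms: {headroom:.2f}x")
```

The 512-bit bus at 250 MHz carries 128 Gbit/s raw, so it has margin over 100 Gbit/s, and a 25 ms design budget leaves a bit more than a factor of two of slack against the 58 ms bound.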


### Initial design choices for implementation



### Design choices: (De)coupling data/control







### **Example: Allowed IP address prefix lookup**



### Allowed IP lookup: Longest prefix match using binary tree



Figure: matching an address against the prefix set (tree). Pipeline: one IP address bit per clock.

| Pipeline stage  | Memory |
|-----------------|--------|
| Lookup Stage 0  | RAM    |
| Lookup Stage 1  | RAM    |
| Lookup Stage 2  | RAM    |
| Lookup Stage 3  | RAM    |
| •••             | •••    |
| Lookup Stage 31 | RAM    |

Result: longest prefix match

- Multi-stage pipeline, one stage for each address bit.
- 32 bits/stages for IPv4, 128 for IPv6, 129 for both.
- Each stage has its own memory (LUT-based distributed RAM or BRAM).
- Use an optimally balanced tree (reduces worst-case RAM use!).
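To make the idea concrete, here is a small software model of longest prefix match that consumes one address bit per step. It is a sketch under stated assumptions, not the Blackwire RTL: it uses a plain, uncompressed binary trie rather than the optimally balanced tree mentioned above, it handles IPv4 only, and the class name `PrefixTrie` and the peer labels are hypothetical. Each loop iteration in `lookup` plays the role of one lookup stage with its own memory.

```python
import ipaddress

ADDRESS_BITS = 32  # IPv4 only in this sketch


class PrefixTrie:
    """Plain binary trie keyed on address bits, most significant bit first."""

    def __init__(self):
        # Each node is [child_for_bit_0, child_for_bit_1, peer_or_None].
        self.root = [None, None, None]

    def insert(self, cidr, peer):
        net = ipaddress.IPv4Network(cidr)
        addr = int(net.network_address)
        node = self.root
        for i in range(net.prefixlen):
            bit = (addr >> (ADDRESS_BITS - 1 - i)) & 1
            if node[bit] is None:
                node[bit] = [None, None, None]
            node = node[bit]
        node[2] = peer  # a prefix ends at this depth

    def lookup(self, ip):
        addr = int(ipaddress.IPv4Address(ip))
        node = self.root
        best = node[2]  # a /0 prefix, if present, matches everything
        for i in range(ADDRESS_BITS):  # one iteration models one stage
            bit = (addr >> (ADDRESS_BITS - 1 - i)) & 1
            node = node[bit]
            if node is None:
                break
            if node[2] is not None:
                best = node[2]  # deeper match means longer prefix
        return best


trie = PrefixTrie()
trie.insert("10.0.0.0/8", "peer-a")
trie.insert("10.1.0.0/16", "peer-b")
print(trie.lookup("10.1.2.3"))   # matches both prefixes, the /16 wins
print(trie.lookup("10.2.3.4"))   # matches only the /8
print(trie.lookup("192.0.2.1"))  # no prefix matches
```

In software the loop exits early when the path ends; in a pipeline every address would pass through all stages, which is what makes the throughput independent of the prefix set.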



Vivado: report\_design\_analysis -logic\_level\_distribution

#### **Optimized for f.max:**

Figure: block diagram of one lookup stage. Signal names visible in the diagram include stage\_sel, bitpos, ip\_addr, prefix\_xor, prefix\_mask, result, location, child\_stage and child\_location, with a prefix memory and clock clk.

Figure: matching addresses against the prefix set for RX and TX.

| RX lookup | Shared memory | TX lookup |
|-----------|---------------|-----------|
| RX        | TDP RAM #0    | TX        |
| RX        | TDP RAM #1    | TX        |
| RX        | TDP RAM #2    | TX        |
| RX        | TDP RAM #3    | TX        |
| •••       | •••           | •••       |
| RX        | TDP RAM #31   | TX        |

Result: longest prefix matches

- → Two pipeline stages per bit
- → Balanced combinatorial logic (≤ 4 levels) between registers
- → 400 MHz on UltraScale+
- → True dual-port RAM (RX reads, TX reads, RISC-V writes).
- → 800 million Allowed IP address prefix lookups per second (400 MHz × 2 read ports: RX and TX).
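The throughput figure follows directly from the clock rate and the two read ports. The short calculation below also estimates the lookup latency; that part is an assumption (one clock cycle per pipeline stage, 32 IPv4 bits, nothing else in the path) and is not a number stated on the slide.

```python
CLOCK_HZ = 400e6
READ_PORTS = 2        # RX and TX each look up through their own RAM port
ADDRESS_BITS = 32     # IPv4
STAGES_PER_BIT = 2    # "two pipeline stages per bit"

lookups_per_s = CLOCK_HZ * READ_PORTS
pipeline_depth = ADDRESS_BITS * STAGES_PER_BIT
latency_ns = pipeline_depth / CLOCK_HZ * 1e9   # assumes one cycle per stage

print(f"{lookups_per_s / 1e6:.0f} million lookups/s")
print(f"{pipeline_depth} stages, about {latency_ns:.0f} ns per IPv4 lookup")
```

Because the pipeline accepts a new address every cycle on each port, the depth adds latency but does not reduce the lookup rate.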

### Blackwire builds upon top-notch OSH:

- → Blackwire uses SpinalHDL (Thanks Charles Papon!)
  - write cycle-efficient RTL in less code
  - zero-cost (no overhead) abstractions
  - the Spinal (building blocks) library is a piece of art
- → Blackwire **uses Corundum** (Thanks Alex Forencich!)
  - SGDMA NIC design for PCIe and Ethernet FPGA boards
  - comes with a Linux kernel device driver

And tools: Verilator, GHDL, CocoTB, GTKWave, SymbiYosys, ... Thanks to everyone committing to those projects. Thanks for explaining **formal verification** (Thanks Matt Venn!)

### Blackwire Project Status (9/2023)

- → Open-sourced HDL on GitHub, some WIP to follow: https://github.com/brightai-nl/BlackwireOverview
  - FAQ: the actual code repositories are listed in the README!

- → Are We WireGuard Yet (AWWY)?
  - ♦ 75% done;
  - ♦ 25% to do, see README on GitHub for TODOs.

## Blackwire IP Core



https://github.com/brightai-nl/BlackwireOverview

### Blackwire: 'wg0' but implemented on FPGA



### Blackwire: integrated in network infrastructure



## Blackwire WireGuard

### Thanks! *Questions?*

- https://github.com/brightai-nl/BlackwireOverview
- → Q&A e-mail: Leon Woestenberg <leon@brightai.nl>

# **Complementary Slides**

### **Blackwire FPGA Resources (for ~100 Gbit/s)**

Alveo U50 example with Corundum + WireGuard (BRAM, no URAM)

RX path is 128 Gbit/s, TX path is 64 Gbit/s in this design

| Name | CLB LUTs<br>(871680) | CLB Registers<br>(1743360) |        |       |       |        | LUT as Logic<br>(871680) | LUT as Memory<br>(403200) | Block RAM<br>Tile (1344) |       |        |
|------|----------------------|----------------------------|--------|-------|-------|--------|--------------------------|---------------------------|--------------------------|-------|--------|
| fpga | 31.39%               | 32.04%                     | 18.67% | 0.36% | 0.11% | 58.96% | 26.90%                   | 9.71%                     | 25.78%                   | 6.09% | 21.91% |

#### Resources (roadmap: move more registers into BRAM/URAM)

| Name                                 | CLB LUTs<br>(871680) | CLB Registers<br>(1743360) |       |     |   |       | LUT as Logic<br>(871680) | LUT as Memory<br>(403200) | Block RAM<br>Tile (1344) |    |      |
|--------------------------------------|----------------------|----------------------------|-------|-----|---|-------|--------------------------|---------------------------|--------------------------|----|------|
| app.app_block_inst (mqnic_app_block) | 211501               | 462593                     | 19298 | 649 | 8 | 50192 | 182601                   | 28900                     | 153                      | 21 | 1300 |

add the following numbers for 100 Gbit/s full duplex:

subtract the following numbers for ~60 Gbit/s full duplex

| Name                            | CLB LUTs<br>(871680) | CLB Registers<br>(1743360) |      |     |   |       | LUT as Logic<br>(871680) | LUT as Memory<br>(403200) | Block RAM<br>Tile (1344) |   |     |
|---------------------------------|----------------------|----------------------------|------|-----|---|-------|--------------------------|---------------------------|--------------------------|---|-----|
| packetTx_tx (BlackwireTransmit) | 67906                | 156822                     | 6397 | 185 | 4 | 17931 | 59587                    | 8319                      | 20.5                     | 0 | 432 |



Olof Kindgren, Award-winning Engineer and Actor at Qamcom:

This is super interesting. I have been looking at doing exactly the same thing. Is this a proprietary or open source implementation? Would love to read some more about the work





Olof Kindgren, Award-winning Engineer and Actor at Qamcom:

Ah! It looks like it is written in SpinalHDL too :D I'm sure **Charles Papon** must be happy to see that :)