Accelerating network packet transmission with eBPF

Addo Zhang
11 min read · Mar 23, 2023

In the previous article, Tracing Network Packets in Kubernetes, we traced the trajectory of network packets through the Kubernetes network. At the end of that article, we posed a hypothesis: if two sockets in the same kernel space could transmit data directly, could we eliminate the latency caused by the kernel network protocol stack?

Whether it is network communication between two containers in the same pod or between two pods on the same node, the traffic is processed within the same kernel space, and the two endpoint sockets live in the same memory. As we summarized at the beginning of the previous article, the transmission of a packet is essentially a socket-addressing process. We can push the question further: if, for communication between two sockets on the same node, we could quickly locate the remote socket (that is, find its address in memory), we could skip the kernel protocol stack and accelerate packet transmission.

The two endpoint sockets are the client socket and the server socket of an established connection, and they are associated with each other through IP addresses and ports: the client socket's local address and port are the server socket's remote address and port, and the client socket's remote address and port are the server socket's local address and port.

Once the client and server have established a connection, if the combination of local address + port and remote address + port can be used to locate a socket, then we only need to swap the local and remote address + port pairs to locate the remote socket. We can then write data directly to the remote socket (in fact, to the socket's receive queue, RXQ, which we will not expand on here), bypassing the kernel network stack (including netfilter/iptables) and even the processing of the NIC.

How can this be implemented? As the title suggests, with eBPF.

What is eBPF?

The Linux kernel has always been an ideal place to implement monitoring/observability, networking, and security features. However, in many cases this is not easy, as these tasks require modifying kernel source code or loading kernel modules, adding new abstractions on top of existing ones. eBPF is a revolutionary technology that allows running sandboxed programs in the kernel without modifying kernel source code or loading kernel modules.

By making the Linux kernel programmable, it becomes possible to build more intelligent and feature-rich infrastructure software on top of existing abstractions (rather than adding new ones), without increasing system complexity or sacrificing performance and security.

Use cases

The following is an excerpt from the eBPF.io website.

In networking, using eBPF can speed up packet processing without leaving the kernel space. Additional protocol parsers can be added, and any forwarding logic can be easily written to meet evolving needs.

In observability, using eBPF enables custom metric collection and kernel aggregation, as well as generating visibility events and data structures from a multitude of sources without exporting samples.

In tracing and profiling, attaching eBPF programs to tracepoints and kernel and user-space probe points provides powerful inspection capabilities and unique insights to address system performance issues.

In security, combining the ability to see and understand all system calls with a packet- and socket-level view of all networking creates security systems that can run in more contexts with a greater level of control.

Event-driven

eBPF programs are event-driven, meaning they are executed when the kernel or application triggers a hook point. Predefined hook types include system calls, function entry/exit, kernel tracepoints, network events, and more.

The Linux kernel provides a set of BPF hooks on system calls and the network stack, which can trigger the execution of BPF programs. Below are some commonly used hooks:

  • XDP: This is the earliest hook point at which a BPF program can be triggered, in the network driver upon receiving a packet. Since the packet has not yet entered the kernel's network protocol stack and no high-cost operations have been performed, such as allocating an sk_buff for it, this hook is ideal for running filtering programs that drop malicious or unwanted traffic, and for other common DDoS protection mechanisms.
  • Traffic Control Ingress/Egress: A BPF program attached to the traffic control (tc) ingress hook of a network interface. This hook executes before L3 of the network stack and can access most of a packet's metadata. It can handle operations within the same node, such as applying L3/L4 endpoint policies and forwarding traffic to endpoints. A CNI typically uses a virtual ethernet interface (veth) to connect containers to the host network namespace; by attaching a tc ingress hook to the host-side veth (or to the eth0 interface inside the container), you can monitor all traffic leaving the container. The hook can also handle operations across nodes: by attaching another BPF program to the tc egress hook, Cilium can monitor all traffic entering and leaving a node and enforce policies.

The above two hooks belong to the network event type; below are two more hooks related to socket system calls.

  • Socket operations: A hook attached to a specific cgroup that runs on socket operations. For example, a BPF sockops program attached to cgroup/sock_ops can monitor socket state changes (obtained from bpf_sock_ops), in particular the ESTABLISHED state. When a socket enters the ESTABLISHED state and the TCP peer is also on the current node (or is a local proxy), its information is stored. Alternatively, a program attached to the cgroup/connect4 operation is executed when a connection to an IPv4 address is initialized, and can modify the address and port.
  • Socket send: This hook runs on every send operation performed by a socket. The hook can inspect the message, drop it, pass it on to the kernel's network protocol stack, or redirect it to another socket. Here, we use it to quickly address the peer socket.

Map

An important aspect of eBPF programs is the ability to share collected information and store states. To do this, eBPF programs can use the concept of eBPF Maps to store and retrieve data. eBPF Maps can be accessed from eBPF programs and also from user-space applications through system calls.

There are several types of Maps, such as Hash Maps, Arrays, LRU (Least Recently Used) Hash Maps, Circular Buffers, Stack Trace, etc.

For example, the program from the previous section that is executed every time a message is sent is actually attached to a socket Hash Map, in which the socket is the value of the key-value pair.
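The socket map used in the implementation below is such a map: a sockhash whose values are sockets. In the sample code it is declared roughly as follows (a sketch in the legacy bpf_map_def style; the key type and entry count are illustrative assumptions):

```c
struct bpf_map_def __section("maps") sock_ops_map = {
    .type        = BPF_MAP_TYPE_SOCKHASH,    /* values are sockets */
    .key_size    = sizeof(struct sock_key),  /* local + remote addr/port */
    .value_size  = sizeof(int),              /* socket reference held by the kernel */
    .max_entries = 65535,                    /* illustrative capacity */
    .map_flags   = 0,
};
```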

Helper Functions

eBPF programs cannot call arbitrary kernel functions. Doing so would bind an eBPF program to a specific kernel version and complicate compatibility. Instead, eBPF programs make calls to helper functions, which are well-known and stable APIs provided by the kernel.

These helper functions provide different functionalities, such as generating random numbers, getting the current time and date, accessing eBPF Maps, getting process/cgroup context, manipulating network packets, and forwarding logic.

Implementation

Having covered the essentials of eBPF, we should now have a general idea of the implementation. We need two eBPF programs: one to maintain the socket map, and one to forward messages to the peer socket. Thanks go to Idan Zach for his sample code ebpf-sockops; I made some minor modifications to improve readability.

The original code uses 16777343 to represent the address 127.0.0.1 and 4135 to represent the port 10000; these are the values in network byte order, read back as host-order integers on a little-endian machine.

Maintaining the socket map: sockops

The program is attached to sock_ops. It monitors the socket state, and when the state is BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB or BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB, it uses the helper function bpf_sock_hash_update[^1] to save the socket as a value in the socket map, with a key composed of the local address + port and the remote address + port.

In the actual handler, the program checks the current socket's information and saves to the socket map only those sockets whose destination or local address and port are 127.0.0.1 and 10000.

// Forward declarations; bpf_sock_ops_ipv6 is omitted from this excerpt
static inline void bpf_sock_ops_ipv4(struct bpf_sock_ops *skops);
static inline void bpf_sock_ops_ipv6(struct bpf_sock_ops *skops);

__section("sockops")
int bpf_sockmap(struct bpf_sock_ops *skops)
{
    __u32 family, op;

    family = skops->family;
    op = skops->op;

    //printk("<<< op %d, port = %d --> %d\n", op, skops->local_port, skops->remote_port);
    switch (op) {
    case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
    case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
        if (family == AF_INET6)
            bpf_sock_ops_ipv6(skops);
        else if (family == AF_INET)
            bpf_sock_ops_ipv4(skops);
        break;
    default:
        break;
    }
    return 0;
}

// 127.0.0.1 in network byte order: 127 + (1 << 24) = 16777343
static const __u32 loopback_ip = 127 + (1 << 24);

static inline void bpf_sock_ops_ipv4(struct bpf_sock_ops *skops)
{
    struct sock_key key = {};
    int ret;

    sk_extract4_key(skops, &key);
    // Only track connections where either endpoint is 127.0.0.1:10000
    if (key.dip4 == loopback_ip || key.sip4 == loopback_ip) {
        if (key.dport == bpf_htons(SERVER_PORT) || key.sport == bpf_htons(SERVER_PORT)) {
            ret = sock_hash_update(skops, &sock_ops_map, &key, BPF_NOEXIST);
            printk("<<< ipv4 op = %d, port %d --> %d\n", skops->op, key.sport, key.dport);
            if (ret != 0)
                printk("*** FAILED %d ***\n", ret);
        }
    }
}

Message Passing: sk_msg

The program is attached to the socket map. It triggers on every message send and uses the current socket's remote address + port and local address + port as the key to look up the peer socket in the map. If the lookup succeeds, the client and server are on the same node, and the data can be written directly to the peer socket with the bpf_msg_redirect_hash[^2] helper function.

Instead of calling bpf_msg_redirect_hash directly, a wrapper named msg_redirect_hash is used, because calling the former directly would cause verification to fail.

Similar to sockops, message redirection is also targeted at messages with either the destination address and port or the local address and port being "127.0.0.1" and "10000".

__section("sk_msg")
int bpf_redir(struct sk_msg_md *msg)
{
    __u64 flags = BPF_F_INGRESS;
    struct sock_key key = {};

    sk_msg_extract4_key(msg, &key);
    // See whether the source or destination IP is the local host
    if (key.dip4 == loopback_ip || key.sip4 == loopback_ip) {
        // See whether the source or destination port is 10000
        if (key.dport == bpf_htons(SERVER_PORT) || key.sport == bpf_htons(SERVER_PORT)) {
            //int len1 = (__u64)msg->data_end - (__u64)msg->data;
            //printk("<<< redir_proxy port %d --> %d (%d)\n", key.sport, key.dport, len1);
            msg_redirect_hash(msg, &sock_ops_map, &key, flags);
        }
    }
    return SK_PASS;
}

Testing

Environment

  • Ubuntu 20.04
  • Kernel 5.15.0-1034

Install dependencies.

sudo apt update && sudo apt install make clang llvm gcc-multilib linux-tools-$(uname -r) linux-cloud-tools-$(uname -r) linux-tools-generic

Clone code to local.

git clone https://github.com/addozhang/ebpf-sockops
cd ebpf-sockops

Compile the BPF program and load to kernel.

sudo ./load.sh

Install iperf3.

sudo apt install iperf3

Start the iperf3 server.

iperf3 -s -p 10000

Run the iperf3 client to send requests.

iperf3 -c 127.0.0.1 -t 10 -l 64k -p 10000

Run the trace.sh script to watch the logs; you will see 4 log entries, because iperf3 establishes two connections (op 4 is BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB on the client side, op 5 is BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB on the server side).

./trace.sh
iperf3-7744    [001] d...1   838.985683: bpf_trace_printk: <<< ipv4 op = 4, port 45189 --> 4135
iperf3-7744 [001] d.s11 838.985733: bpf_trace_printk: <<< ipv4 op = 5, port 4135 --> 45189
iperf3-7744 [001] d...1 838.986033: bpf_trace_printk: <<< ipv4 op = 4, port 45701 --> 4135
iperf3-7744 [001] d.s11 838.986078: bpf_trace_printk: <<< ipv4 op = 5, port 4135 --> 45701

How can we confirm that the kernel network stack has been bypassed? Capture packets with tcpdump and check. The capture below contains only the handshake and termination traffic; the message-sending traffic in between bypasses the kernel network stack completely.

sudo tcpdump -i lo port 10000 -vvv
tcpdump: listening on lo, link-type EN10MB (Ethernet), capture size 262144 bytes
13:23:31.761317 IP (tos 0x0, ttl 64, id 50214, offset 0, flags [DF], proto TCP (6), length 60)
localhost.34224 > localhost.webmin: Flags [S], cksum 0xfe30 (incorrect -> 0x5ca1), seq 2753408235, win 65495, options [mss 65495,sackOK,TS val 166914980 ecr 0,nop,wscale 7], length 0
13:23:31.761333 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
localhost.webmin > localhost.34224: Flags [S.], cksum 0xfe30 (incorrect -> 0x169a), seq 3960628312, ack 2753408236, win 65483, options [mss 65495,sackOK,TS val 166914980 ecr 166914980,nop,wscale 7], length 0
13:23:31.761385 IP (tos 0x0, ttl 64, id 50215, offset 0, flags [DF], proto TCP (6), length 52)
localhost.34224 > localhost.webmin: Flags [.], cksum 0xfe28 (incorrect -> 0x3d56), seq 1, ack 1, win 512, options [nop,nop,TS val 166914980 ecr 166914980], length 0
13:23:31.761678 IP (tos 0x0, ttl 64, id 59057, offset 0, flags [DF], proto TCP (6), length 60)
localhost.34226 > localhost.webmin: Flags [S], cksum 0xfe30 (incorrect -> 0x4eb8), seq 3068504073, win 65495, options [mss 65495,sackOK,TS val 166914981 ecr 0,nop,wscale 7], length 0
13:23:31.761689 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
localhost.webmin > localhost.34226: Flags [S.], cksum 0xfe30 (incorrect -> 0x195d), seq 874449823, ack 3068504074, win 65483, options [mss 65495,sackOK,TS val 166914981 ecr 166914981,nop,wscale 7], length 0
13:23:31.761734 IP (tos 0x0, ttl 64, id 59058, offset 0, flags [DF], proto TCP (6), length 52)
localhost.34226 > localhost.webmin: Flags [.], cksum 0xfe28 (incorrect -> 0x4019), seq 1, ack 1, win 512, options [nop,nop,TS val 166914981 ecr 166914981], length 0
13:23:41.762819 IP (tos 0x0, ttl 64, id 43056, offset 0, flags [DF], proto TCP (6), length 52)
localhost.webmin > localhost.34226: Flags [F.], cksum 0xfe28 (incorrect -> 0x1907), seq 1, ack 1, win 512, options [nop,nop,TS val 166924982 ecr 166914981], length 0
13:23:41.763334 IP (tos 0x0, ttl 64, id 59059, offset 0, flags [DF], proto TCP (6), length 52)
localhost.34226 > localhost.webmin: Flags [F.], cksum 0xfe28 (incorrect -> 0xf1f4), seq 1, ack 2, win 512, options [nop,nop,TS val 166924982 ecr 166924982], length 0
13:23:41.763348 IP (tos 0x0, ttl 64, id 43057, offset 0, flags [DF], proto TCP (6), length 52)
localhost.webmin > localhost.34226: Flags [.], cksum 0xfe28 (incorrect -> 0xf1f4), seq 2, ack 2, win 512, options [nop,nop,TS val 166924982 ecr 166924982], length 0
13:23:41.763588 IP (tos 0x0, ttl 64, id 50216, offset 0, flags [DF], proto TCP (6), length 52)
localhost.34224 > localhost.webmin: Flags [F.], cksum 0xfe28 (incorrect -> 0x1643), seq 1, ack 1, win 512, options [nop,nop,TS val 166924982 ecr 166914980], length 0
13:23:41.763940 IP (tos 0x0, ttl 64, id 14090, offset 0, flags [DF], proto TCP (6), length 52)
localhost.webmin > localhost.34224: Flags [F.], cksum 0xfe28 (incorrect -> 0xef2e), seq 1, ack 2, win 512, options [nop,nop,TS val 166924983 ecr 166924982], length 0
13:23:41.763952 IP (tos 0x0, ttl 64, id 50217, offset 0, flags [DF], proto TCP (6), length 52)
localhost.34224 > localhost.webmin: Flags [.], cksum 0xfe28 (incorrect -> 0xef2d), seq 2, ack 2, win 512, options [nop,nop,TS val 166924983 ecr 166924983], length 0

Summary

By introducing eBPF, we have shortened the datapath for same-node communication: the kernel network stack is bypassed and the two endpoint sockets are connected directly.

This design is suitable for communication between two applications in the same pod and communication between two pods on the same node.


Addo Zhang

CNCF Ambassador | LF APAC OpenSource Evangelist | Microsoft MVP | SA and Evangelist at https://flomesh.io | Programmer | Blogger | Mazda Lover | Ex-BBer