Exploring eBPF Implementation through Linux Source Code
Last year, I delved into eBPF and shared several eBPF-related learning notes focusing on its applications. To prepare for my upcoming article, I’ve decided to start with the Linux source code this time, aiming for a deeper understanding of how eBPF works. Thus, this piece is another learning note. If you’re intrigued by the workings of eBPF, feel free to join me on this journey. Any feedback on the article is highly appreciated.
I won’t give an extensive introduction to eBPF here. For that, you can refer to my other articles, Accelerate network packets transmission with eBPF and Tracing packets datapath in Kubernetes network, to get a basic understanding of eBPF and its applications in network acceleration.
Moving forward, we will use the program bpf_sockops from eBPF sockops as an example, in conjunction with the Linux v6.8 source code to explore the workings of eBPF.
BPF Program Operations
In the load.sh script, the loading and attaching operations of the program are completed. The following commands use bpftool to perform the loading and attaching of the BPF program, respectively.
# Load
sudo bpftool prog load bpf_sockops.o "/sys/fs/bpf/bpf_sockop"
# Attach
sudo bpftool cgroup attach "/sys/fs/cgroup/unified/" sock_ops pinned "/sys/fs/bpf/bpf_sockop"
Here, bpftool is a command-line tool for managing and manipulating BPF programs and maps. It is built on top of the bpf() system call.
Loading
sudo bpftool prog load bpf_sockops.o "/sys/fs/bpf/bpf_sockop"
The command bpftool prog load loads bpf_sockops.o into the kernel and pins the loaded program at the path /sys/fs/bpf/bpf_sockop.
bpftool loads the BPF program by calling bpf() with the command BPF_PROG_LOAD and passing in the loading options bpf_prog_load_opts:
syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr))
- syscall bpf() is the entry point of the bpf system call.
- __sys_bpf dispatches on the command, here BPF_PROG_LOAD.
- bpf_prog_load checks permissions, allocates memory for the program, initializes it, runs the verifier, creates a file descriptor (fd) for it, etc.
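To make the load call more concrete, here is a minimal sketch of how user space could fill union bpf_attr for BPF_PROG_LOAD by hand. It is not what bpftool does internally: bpftool goes through libbpf, which parses the ELF object and fills in many more fields. The names demo_insns, fill_load_attr and load_demo_prog are made up for this example, and the two-instruction program is only a placeholder.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes syscall() in <unistd.h> */
#endif
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Placeholder program body (assumption): "r0 = 1; exit". */
const struct bpf_insn demo_insns[] = {
	{ .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 1 },
	{ .code = BPF_JMP | BPF_EXIT },
};

/* Fill the BPF_PROG_LOAD fields of union bpf_attr for a sock_ops program. */
void fill_load_attr(union bpf_attr *attr, const struct bpf_insn *insns,
		    uint32_t insn_cnt, const char *license)
{
	memset(attr, 0, sizeof(*attr));           /* unused fields must be zero */
	attr->prog_type = BPF_PROG_TYPE_SOCK_OPS;
	attr->insns = (uint64_t)(uintptr_t)insns; /* pointers travel as u64 */
	attr->insn_cnt = insn_cnt;
	attr->license = (uint64_t)(uintptr_t)license;
}

/* Returns a program fd on success, or -1 with errno set on failure. */
int load_demo_prog(void)
{
	union bpf_attr attr;

	fill_load_attr(&attr, demo_insns,
		       sizeof(demo_insns) / sizeof(demo_insns[0]), "GPL");
	return (int)syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

Calling load_demo_prog() needs sufficient privileges (for example root). On success the returned fd refers to the verified program inside the kernel.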
Once the program is successfully loaded, it can then be attached.
Attaching
sudo bpftool cgroup attach "/sys/fs/cgroup/unified/" sock_ops pinned "/sys/fs/bpf/bpf_sockop"
The command bpftool cgroup attach attaches the loaded program (pinned in the filesystem at /sys/fs/bpf/bpf_sockop) to the cgroup /sys/fs/cgroup/unified/, with the attachment type sock_ops. This sock_ops is defined by the libbpf library used by bpftool and also serves as an ELF section name. It corresponds to the BPF program type BPF_PROG_TYPE_SOCK_OPS, and the attachment type is BPF_CGROUP_SOCK_OPS.
In eBPF programming, ELF (Executable and Linkable Format) files are used to store compiled eBPF programs and related data. An ELF file consists of multiple sections, each containing different types of information, such as program code, symbol tables, debug information, etc.
The sock_ops type in libbpf maps to the BPF program type BPF_PROG_TYPE_SOCK_OPS and the attach type BPF_CGROUP_SOCK_OPS, and corresponds to the section (__section) named sockops in the program bpf_sockops.c.
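As an illustration of the section mechanism, the following is a minimal skeleton of a program placed in the sockops section. It is not the code of bpf_sockops.c: the function name sockops_skeleton is hypothetical and the body does nothing. The __section macro is defined locally here, on the assumption that no included header already provides it.

```c
#include <linux/bpf.h>

#ifndef __section
#define __section(NAME) __attribute__((section(NAME), used))
#endif

/* A loader such as libbpf looks at the section name "sockops" and derives
 * the program type BPF_PROG_TYPE_SOCK_OPS and the attach type
 * BPF_CGROUP_SOCK_OPS from it. */
__section("sockops")
int sockops_skeleton(struct bpf_sock_ops *skops)
{
	(void)skops;   /* a real program would inspect skops->op here */
	return 1;
}

/* The license string is passed to the kernel during BPF_PROG_LOAD. */
char _license[] __section("license") = "GPL";
```

A file like this would typically be compiled with something like clang -O2 -target bpf -c, after which the resulting object contains a sockops section next to the usual ELF sections.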
About the sock_ops attach point:
sock_ops typically refers to a series of functions and operations in the Linux kernel that handle socket operations. sock_ops can include a range of operations, such as creating sockets, binding sockets to specific addresses and ports, listening for connection requests from other sockets, accepting connection requests, sending and receiving data, and closing sockets, among others. These operations are usually provided through a set of predefined APIs, such as the POSIX socket API, which defines a series of functions like socket(), bind(), listen(), accept(), send(), recv(), and close() for application programs to call.
This time, bpftool performs the BPF_PROG_ATTACH operation via the bpf() system call, passing in the attachment options bpf_prog_attach_opts to complete the process.
syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr))
- syscall bpf() is the entry point of the bpf system call.
- bpf_prog_attach
- cgroup_bpf_prog_attach
- cgroup_bpf_attach
- __cgroup_bpf_attach checks if a program of the same attach type exists on the cgroup, replacing it if so.
- bpf_prog_put: if an old program was replaced, releases the reference to it.
- static_branch_inc: if not, increments the count for that attach type in cgroup_bpf_enabled_key.
cgroup_bpf_enabled_key holds one static key per cgroup BPF attach type and acts as a counter of attached programs of that type. This counter is consulted at runtime to decide whether the hook has any work to do.
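The role of this counter can be shown with a small user-space model. This is only a sketch of the idea, not kernel code: the kernel uses static keys (patched jump labels) rather than a plain integer array, and every demo_ name below is invented for the example.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's per-attach-type static keys. */
enum demo_atype { DEMO_CGROUP_SOCK_OPS, DEMO_ATYPE_MAX };

static int demo_enabled_key[DEMO_ATYPE_MAX];  /* one count per attach type */
static int demo_prog_runs;                    /* how often the "program" ran */

/* Attach side: a replacement leaves the count alone (the old program is
 * dropped instead), a first attachment increments it. */
void demo_attach(enum demo_atype atype, bool replaced_old_prog)
{
	if (!replaced_old_prog)
		demo_enabled_key[atype]++;
}

/* Hook side: skip all work unless a program of this type is attached. */
int demo_run_sock_ops(void)
{
	if (demo_enabled_key[DEMO_CGROUP_SOCK_OPS] == 0)
		return 0;       /* fast path, nothing attached anywhere */
	demo_prog_runs++;       /* stand-in for running the attached programs */
	return 1;
}
```

The point of the design is that a kernel without any sock_ops program attached pays almost nothing at the hook site, because the check sits in front of all other work.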
With this, we have successfully attached the program to the cgroup’s sock_ops attach point.
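For readers who want to see the attach step without bpftool, here is a rough sketch that issues the same two commands directly: BPF_OBJ_GET to turn the pinned path back into a program fd, then BPF_PROG_ATTACH on the cgroup directory. The helper names (sys_bpf, get_pinned_prog, fill_attach_attr, attach_sock_ops) are assumptions of this example. bpftool itself goes through libbpf wrappers rather than raw syscalls.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes syscall() in <unistd.h> */
#endif
#include <fcntl.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

long sys_bpf(int cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* BPF_OBJ_GET: open a program that was pinned in the BPF filesystem. */
int get_pinned_prog(const char *path)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.pathname = (uint64_t)(uintptr_t)path;
	return (int)sys_bpf(BPF_OBJ_GET, &attr);
}

/* Fill the BPF_PROG_ATTACH fields for a cgroup sock_ops attachment. */
void fill_attach_attr(union bpf_attr *attr, int cgroup_fd, int prog_fd)
{
	memset(attr, 0, sizeof(*attr));
	attr->target_fd = cgroup_fd;          /* fd of the cgroup directory */
	attr->attach_bpf_fd = prog_fd;        /* fd of the loaded program */
	attr->attach_type = BPF_CGROUP_SOCK_OPS;
}

/* Returns 0 on success, -1 on failure. */
int attach_sock_ops(const char *cgroup_path, const char *pin_path)
{
	union bpf_attr attr;
	int cgroup_fd, prog_fd, ret;

	cgroup_fd = open(cgroup_path, O_RDONLY);
	if (cgroup_fd < 0)
		return -1;
	prog_fd = get_pinned_prog(pin_path);
	if (prog_fd < 0) {
		close(cgroup_fd);
		return -1;
	}
	fill_attach_attr(&attr, cgroup_fd, prog_fd);
	ret = (int)sys_bpf(BPF_PROG_ATTACH, &attr);
	close(prog_fd);    /* the cgroup keeps its own reference */
	close(cgroup_fd);
	return ret;
}
```

With the paths used in this article, the call would be attach_sock_ops("/sys/fs/cgroup/unified/", "/sys/fs/bpf/bpf_sockop"), run with sufficient privileges.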
Socket Operations (sock_ops)
Socket operations are numerous; here we take the server-side accept operation during connection establishment as an example.
Starting with the system call accept:
- accept
- __sys_accept4_file
- do_accept: here, ops->accept() corresponds to proto_ops inet_stream_ops, the operations for stateful (stream) sockets such as TCP.
- inet_stream_ops.accept
- inet_accept: calls sk1->sk_prot->accept(), where sk_prot provides the specific operations for the TCP protocol, proto tcp_prot.
- tcp_prot.accept
- inet_csk_accept: waits for and returns a connection whose three-way handshake has completed. The handshake itself is processed on the receive path: inet_init registers net_protocol tcp_protocol, the implementation of the TCP protocol, for IPPROTO_TCP, and its handler is tcp_v4_rcv.
- tcp_v4_rcv: at this point, the first phase of the handshake begins, with the socket still in the TCP_LISTEN state.
- tcp_v4_do_rcv: handles the state transitions for each phase of the handshake until the connection is established.
- tcp_rcv_state_process: let’s focus on the final handshake phase, where the server receives the client’s ACK, completing the connection establishment.
- tcp_init_transfer: called with the op BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB; prepares the socket for data transfer.
- bpf_skops_established
- BPF_CGROUP_RUN_PROG_SOCK_OPS: executes the BPF program attached with the type BPF_CGROUP_SOCK_OPS.
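The tail of this chain, where the final ACK triggers the sock_ops hook, can be pictured with a toy state machine. It is a heavy simplification and not kernel code: in the kernel the listening socket stays in the listen state and a separate child socket goes through SYN_RECV, whereas this model folds both into one structure. struct demo_conn, demo_skops_established and demo_rcv are names made up for the sketch. Only the BPF_TCP_* and BPF_SOCK_OPS_* constants come from linux/bpf.h.

```c
#include <linux/bpf.h>

struct demo_conn {
	int state;     /* one of the BPF_TCP_* values */
	int last_op;   /* last sock_ops op reported, -1 if none yet */
};

/* Stand-in for bpf_skops_established(): just record which op fired. */
static void demo_skops_established(struct demo_conn *c, int op)
{
	c->last_op = op;
}

/* Server side: a SYN moves the connection to SYN_RECV, the final ACK
 * reports the op and then moves it to ESTABLISHED. */
void demo_rcv(struct demo_conn *c, int is_syn, int is_ack)
{
	if (c->state == BPF_TCP_LISTEN && is_syn) {
		c->state = BPF_TCP_SYN_RECV;
	} else if (c->state == BPF_TCP_SYN_RECV && is_ack) {
		demo_skops_established(c, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
		c->state = BPF_TCP_ESTABLISHED;
	}
}
```

Feeding the model a SYN and then an ACK leaves it in BPF_TCP_ESTABLISHED with last_op set to BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB, which is the moment at which an attached sock_ops program would run.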
BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB is the operator reported when a server-side connection, the one socket.accept() returns, completes its establishment. It's one of many sock_ops operators. These operators can be viewed as events (Event-driven part), with the execution of programs being event-driven. For example:
- The operator for completing the three-way handshake from the client side is BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB;
- The operator for a socket entering the listen state is BPF_SOCK_OPS_TCP_LISTEN_CB;
- The operator for data acknowledgment is BPF_SOCK_OPS_DATA_ACK_CB;
- The operator for TCP state changes is BPF_SOCK_OPS_STATE_CB.
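Viewed as events, these operators are what a sock_ops program branches on. The sketch below shows that dispatch in plain C so it can be compiled and tried in user space. The demo_action values and demo_dispatch are hypothetical, and a real program would call BPF helpers in each branch instead of returning an action code. Only operators that I am sure are defined in linux/bpf.h are used.

```c
#include <linux/bpf.h>

/* Hypothetical action codes, used only by this sketch. */
enum demo_action {
	DEMO_IGNORE,
	DEMO_TRACK_CONN,
	DEMO_NOTE_LISTEN,
	DEMO_NOTE_STATE,
};

/* Decide what to do based on the event carried in skops->op. */
enum demo_action demo_dispatch(const struct bpf_sock_ops *skops)
{
	switch (skops->op) {
	case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:   /* client side connected */
	case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:  /* server side connected */
		return DEMO_TRACK_CONN;
	case BPF_SOCK_OPS_TCP_LISTEN_CB:           /* socket entered listen */
		return DEMO_NOTE_LISTEN;
	case BPF_SOCK_OPS_STATE_CB:                /* TCP state changed */
		return DEMO_NOTE_STATE;
	default:
		return DEMO_IGNORE;                /* events we do not handle */
	}
}
```

One caveat when moving from this sketch to a real program: some callbacks, BPF_SOCK_OPS_STATE_CB among them, are only delivered after the program has enabled them for the socket with the bpf_sock_ops_cb_flags_set helper.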
The execution of BPF programs follows accordingly and is not elaborated further here. For those interested, more analysis is available here (Implementation part).