Tracing the Packet Datapath in a Kubernetes Network

7 min read · Mar 20, 2023

For me, networking and operating system kernels are both unfamiliar and intriguing. I hope to uncover how they really work by peeling away the layers of mist.

In my previous article, I delved into the Kubernetes network model, and this time I want to go deeper: to understand the transmission of packets in Kubernetes, to prepare for learning Kubernetes’ eBPF network acceleration, and to deepen my understanding of networks and operating system kernels. There may be some omissions in the article, and I welcome everyone’s feedback.

Before we begin, I can summarize what I learned in one sentence: the flow of a packet is essentially the process of addressing a socket file descriptor (socket fd). That address is not just the memory address behind the socket fd; it also includes the socket’s network address.

In Unix and Unix-like systems, everything is a file, and sockets too are operated on through file descriptors.
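To make the "socket is a file" point concrete, here is a minimal sketch, assuming a Linux or other Unix-like host. It creates a connected socket pair and then moves data using only the generic file calls on the raw descriptors. The variable names are illustrative.

```python
import os
import socket

# Create two already-connected sockets inside this process.
left, right = socket.socketpair()

# Each socket is backed by an ordinary integer file descriptor.
left_fd = left.fileno()
right_fd = right.fileno()

# The generic write(2)/read(2) calls work on those descriptors,
# just as they would on a regular file or a pipe.
os.write(left_fd, b"hello via fd")
received = os.read(right_fd, 1024)
print(left_fd, right_fd, received)

left.close()
right.close()
```

The socket-specific calls (send, recv, and so on) add network semantics on top, but the handle the application holds is still a file descriptor.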

Basic Knowledge

Packet

Since we are going to discuss the flow of packets, let’s first look at what a packet is.

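A network packet is a unit of data transmitted over a computer network. Strictly speaking, the unit has a different name at each layer (a frame at the link layer, a packet or datagram at the network layer, a segment for TCP), but this article uses “packet” for all of them. Let’s take the most common case, a TCP packet carried over Ethernet, as an example. It contains the following parts: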

Ethernet header: link-layer information, mainly the destination and source MAC addresses, plus the EtherType field that identifies the payload format, which here is an IP packet.
IP header: network-layer information, mainly the total length and the source and destination IP addresses, plus the protocol field that identifies the payload format, which here is TCP.
TCP header: transport-layer information, including the source port and destination port.
Data: the application payload, generally layer 7 data such as HTTP.
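The layout above can be sketched with Python's struct module. This is an illustration under assumptions, not a capture from a real cluster: the MAC addresses, the source IP 10.42.1.8, and the port 40000 are made up, options are omitted, and the checksum fields are left at zero.

```python
import socket
import struct

def build_frame(payload: bytes) -> bytes:
    """Hand-assemble Ethernet + IPv4 + TCP headers around a payload."""
    tcp = struct.pack(
        "!HHIIBBHHH",
        40000, 80,        # source port, destination port
        0, 0,             # sequence and acknowledgment numbers
        5 << 4, 0x18,     # data offset (5 words = 20 bytes), flags PSH+ACK
        65535, 0, 0,      # window, checksum (left at 0), urgent pointer
    )
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0,                        # version 4, header length 5 words
        20 + len(tcp) + len(payload),   # total length
        0, 0,                           # identification, flags/fragment offset
        64, 6, 0,                       # TTL, protocol (6 = TCP), checksum
        socket.inet_aton("10.42.1.8"),  # assumed source address
        socket.inet_aton("10.42.1.9"),
    )
    eth = struct.pack(
        "!6s6sH",
        bytes.fromhex("020000000002"),  # destination MAC (made up)
        bytes.fromhex("020000000001"),  # source MAC (made up)
        0x0800,                         # EtherType: IPv4
    )
    return eth + ip + tcp + payload

def format_mac(raw: bytes) -> str:
    return ":".join(f"{b:02x}" for b in raw)

def parse_frame(frame: bytes) -> dict:
    """Read back the fields described in the list above."""
    dst_mac, src_mac, ether_type = struct.unpack("!6s6sH", frame[:14])
    ip = frame[14:]
    ihl = (ip[0] & 0x0F) * 4                 # IP header length in bytes
    (total_length,) = struct.unpack("!H", ip[2:4])
    tcp = ip[ihl:total_length]
    src_port, dst_port = struct.unpack("!HH", tcp[:4])
    data_offset = (tcp[12] >> 4) * 4         # TCP header length in bytes
    return {
        "dst_mac": format_mac(dst_mac),
        "src_mac": format_mac(src_mac),
        "ether_type": ether_type,
        "protocol": ip[9],
        "src_ip": socket.inet_ntoa(ip[12:16]),
        "dst_ip": socket.inet_ntoa(ip[16:20]),
        "src_port": src_port,
        "dst_port": dst_port,
        "data": tcp[data_offset:],
    }

frame = build_frame(b"GET / HTTP/1.1\r\n\r\n")
info = parse_frame(frame)
print(info)
```

Each header only says what comes next (EtherType, then the IP protocol number), which is what lets the receiver peel the layers off one at a time.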

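The checksums (in the IP and TCP headers) and the FCS (frame check sequence, at the end of the Ethernet frame) are not covered in detail here. They are used to detect whether a packet was corrupted during transmission; they are not cryptographic, so they do not protect against deliberate tampering.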
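For the curious, the checksum used by the IP and TCP headers is the 16-bit one's-complement sum described in RFC 1071. The sketch below is a plain, unoptimized version of it; the sample bytes are the ones from the RFC's worked example, and the function name is my own.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum as described in RFC 1071."""
    if len(data) % 2:
        data += b"\x00"              # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:               # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

sample = bytes.fromhex("0001f203f4f5f6f7")
checksum = internet_checksum(sample)
print(hex(checksum))  # 0x220d

# The receiver sums the data together with the checksum; an intact packet gives 0.
intact = internet_checksum(sample + checksum.to_bytes(2, "big"))

# Changing a single byte makes the verification fail.
corrupted = internet_checksum(b"\x01" + sample[1:] + checksum.to_bytes(2, "big"))
print(intact, corrupted != 0)
```

The Ethernet FCS is a different algorithm (a CRC-32) and is normally computed and checked by the NIC.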

The process of an application sending data to the network through a socket can be understood simply as encapsulation: the data is wrapped first in a TCP header, then in an IP header, then in an Ethernet header. Conversely, turning an Ethernet frame received from the network back into data the application can process is decapsulation. Both encapsulation and decapsulation are performed by the kernel network protocol stack.

Next, we will explain the processing of sockets and the kernel network protocol stack.

Socket

A socket is a programming interface used in computer networking. It sits between user space (where user applications run) and the kernel network protocol stack (the kernel component that encapsulates and decapsulates data).

As a programming interface, sockets provide the following operations (only some are listed):

socket
connect
bind
listen
accept
Data transmission: send, sendto, sendmsg, recv, recvfrom, recvmsg
getsockname
getpeername
getsockopt, setsockopt to get or set socket layer or protocol layer options
close
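Most of the operations listed above appear in even the smallest TCP exchange. The following sketch runs a one-shot server and a client in the same process over the loopback interface; the uppercase "echo" behavior and the function name serve_once are just for illustration.

```python
import socket
import threading

def serve_once(listener: socket.socket) -> None:
    """Accept one connection, read one message, reply in uppercase."""
    conn, _peer = listener.accept()        # accept
    with conn:
        data = conn.recv(1024)             # recv
        conn.sendall(data.upper())         # send (sendall loops over send)

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)    # socket
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)  # setsockopt
listener.bind(("127.0.0.1", 0))      # bind; port 0 lets the kernel pick a free port
listener.listen(1)                   # listen
server_addr = listener.getsockname() # getsockname reveals the chosen port

worker = threading.Thread(target=serve_once, args=(listener,))
worker.start()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server_addr)          # connect
client.sendall(b"ping")
reply = client.recv(1024)
peer = client.getpeername()          # getpeername: the address we are connected to
client.close()                       # close

worker.join()
listener.close()
print(reply, peer == server_addr)
```

Everything below these calls, from building headers to picking the interface, is the kernel's job.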

The following figure shows the effect of each operation:

Before explaining the kernel network protocol stack, let’s first talk about the data structure of packets in memory: sk_buff.

sk_buff

sk_buff (socket buffer) is the data structure the Linux kernel uses to manage network packets. It holds the information and attributes of a received or transmitted packet, such as the protocol, the data length, and (through the headers it points to) the source and destination addresses. The same sk_buff is handed from layer to layer, and the structure is used regardless of the protocol, whether TCP, UDP, ICMP, or something else.

sk_buff is used at every layer of the Linux network protocol stack: the data link layer, the network layer, the transport layer, and so on. The structure has many fields, but four of them matter most here: head, data, tail, and end, which mark the boundaries of the buffer and of the packet inside it. Each layer works on the same sk_buff by adjusting these pointers: moving data backward to add a header (encapsulation) and forward to “remove” one (decapsulation).

Because only the pointers move, the packet data itself is not copied as it passes from layer to layer, which greatly improves efficiency.
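The pointer movement is easier to see in a toy model. The class below is not the kernel structure; it is a hypothetical Python stand-in that keeps four offsets named after the sk_buff fields head, data, tail, and end, with methods loosely modeled on the kernel helpers skb_put, skb_push, and skb_pull. The header contents are placeholder strings.

```python
class ToySkb:
    """Toy model of the head/data/tail/end offsets over a single buffer."""

    def __init__(self, size: int, headroom: int) -> None:
        self.buf = bytearray(size)
        self.head = 0          # start of the allocated buffer
        self.data = headroom   # start of the current packet contents
        self.tail = headroom   # end of the current packet contents
        self.end = size        # end of the allocated buffer

    def put(self, payload: bytes) -> None:
        """Append payload at the tail (compare skb_put)."""
        if self.tail + len(payload) > self.end:
            raise ValueError("not enough tailroom")
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)

    def push(self, header: bytes) -> None:
        """Prepend a header by moving data backward (compare skb_push)."""
        if self.data - len(header) < self.head:
            raise ValueError("not enough headroom")
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header

    def pull(self, length: int) -> bytes:
        """'Remove' a header by moving data forward (compare skb_pull)."""
        if length > self.tail - self.data:
            raise ValueError("pull past the end of the packet")
        header = bytes(self.buf[self.data:self.data + length])
        self.data += length
        return header

    def contents(self) -> bytes:
        return bytes(self.buf[self.data:self.tail])

skb = ToySkb(size=64, headroom=16)
skb.put(b"payload")                  # the payload lands at offset 16
for header in (b"TCP|", b"IP|", b"ETH|"):
    skb.push(header)                 # sending: each layer prepends its header
on_the_wire = skb.contents()

removed = [skb.pull(4), skb.pull(3), skb.pull(4)]  # receiving: strip them again
payload_offset = skb.data
print(on_the_wire, removed, payload_offset)
```

Throughout, the payload bytes stay at the same offset in the buffer; only data moves.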

Kernel Network Protocol Stack

The diagram and explanation below show what happens in the kernel when an application sends or receives data:

Packing

The application uses the sendmsg operation of a socket to send data (netfilter, traffic control, and queue discipline are not discussed in detail here):

1. Allocate an sk_buff.
2. The network protocol stack processing begins here.
3. Set transport layer information (source and destination port numbers in this case).
4. Find the route based on the destination IP.
5. Set network layer information (source and destination IP addresses, etc.).
6. Call netfilter (LOCAL_OUT).
7. Set the interface and protocol.
8. Call netfilter (POST_ROUTING).
9. If the packet is too long, fragment it.
10. L2 addressing, i.e., find the MAC address of the device that owns the destination IP address (or of the next hop toward it).
11. Set data link layer information.
12. The kernel network protocol stack operation is now complete.
13. Call tc (traffic control) egress (redirect the packet if necessary).
14. Enter the queue discipline (qdisc).
15. Write to the NIC (network interface controller).
16. Send to the network.
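Of the steps above, the route lookup is the one that decides which interface the packet leaves through, so it is worth a small sketch. The table below is a simplified, hypothetical view from inside a pod (a real kernel keeps loopback routes in a separate local table), and the lookup uses longest-prefix matching.

```python
import ipaddress

# Hypothetical, simplified routing table: (destination prefix, output interface).
ROUTES = [
    (ipaddress.ip_network("127.0.0.0/8"), "lo"),
    (ipaddress.ip_network("10.42.1.0/24"), "eth0"),
    (ipaddress.ip_network("0.0.0.0/0"), "eth0"),   # default route
]

def lookup_route(destination: str) -> tuple:
    """Return (matched prefix, interface) using longest-prefix match."""
    address = ipaddress.ip_address(destination)
    candidates = [(net, dev) for net, dev in ROUTES if address in net]
    if not candidates:
        raise LookupError(f"no route to {destination}")
    network, device = max(candidates, key=lambda entry: entry[0].prefixlen)
    return str(network), device

print(lookup_route("127.0.0.1"))   # ('127.0.0.0/8', 'lo')
print(lookup_route("10.42.1.9"))   # ('10.42.1.0/24', 'eth0')
print(lookup_route("8.8.8.8"))     # ('0.0.0.0/0', 'eth0')
```

The first two lookups correspond to the container-to-container and pod-to-pod scenarios discussed later in the article.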

Unpacking

The NIC receives a packet from the network (direct memory access, netfilter, and traffic control are not discussed in detail here):

1. Write the packet into memory via DMA (direct memory access): the NIC writes it directly, without relying on the CPU.
2. Allocate an sk_buff and fill in metadata, such as the protocol (Ethernet type) and the receiving network interface.
3. Save the link layer information in the mac_header field of the sk_buff and “remove” it from the packet (by moving the pointer).
4. Network protocol stack processing begins.
5. Save the network layer information in the network_header field.
6. Call tc ingress.
7. “Remove” the network layer information.
8. Save the transport layer information in the transport_header field.
9. Call netfilter (PRE_ROUTING).
10. Look up the route.
11. Merge fragmented packets back together (reassembly).
12. Call netfilter (LOCAL_IN).
13. “Remove” the transport layer information.
14. Look up the socket listening on the destination port, or send a reset if there is none.
15. Write the data to the socket’s receive queue.
16. Send a signal to notify the waiting application that data has been written to the queue.
17. The operation of the kernel network protocol stack is now complete.
18. Dequeue the sk_buff from the socket receive queue.
19. Write the data to the application’s buffer.
20. Release the sk_buff.
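The last part of the receive path, finding the socket and queuing the data, can be sketched as a toy model. This is a deliberate simplification: the class and method names are hypothetical, and a real kernel matches on the full address and port tuple rather than on the destination port alone.

```python
from collections import deque

class ToyReceivePath:
    """Toy model: look up a listener by port, then queue the data or reset."""

    def __init__(self) -> None:
        self.receive_queues = {}          # destination port -> receive queue

    def listen(self, port: int) -> deque:
        queue = deque()
        self.receive_queues[port] = queue
        return queue

    def deliver(self, dst_port: int, payload: bytes) -> str:
        queue = self.receive_queues.get(dst_port)
        if queue is None:
            return "RST"                  # nobody is listening: send a reset
        queue.append(payload)             # write to the socket's receive queue
        return "QUEUED"

stack = ToyReceivePath()
httpbin_queue = stack.listen(80)

first = stack.deliver(80, b"GET / HTTP/1.1\r\n\r\n")
second = stack.deliver(8080, b"GET / HTTP/1.1\r\n\r\n")

# The application later dequeues the data, as recv would.
app_data = httpbin_queue.popleft()
print(first, second, app_data)
```

This lookup is the "addressing of a socket" that the whole datapath leads up to.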

Kubernetes Network Model

The other piece of background knowledge is the Kubernetes network model; for that, refer to my previous article, A Deep Dive into the Kubernetes Network Model and Communication.

Packet Flow in Kubernetes

We continue with the three communication scenarios from the previous article. In all of them, pods communicate using pod IP addresses. Access through a Service is left out here, because it would require a much longer discussion of netfilter.

Container-to-Container Communication within a Pod

Two containers within a pod usually communicate over the loopback address 127.0.0.1. In step #4 of the packing process (routing), the kernel determines that the packet should be sent through the loopback interface lo.

Pod-to-Pod Communication on the Same Node

In step #4 of the packing process (routing), the request sent by curl is determined to leave through the pod’s eth0 interface. It then reaches the node’s root network namespace through veth1, the other end of the veth pair that eth0 belongs to.

veth1 is connected to the other pods through the bridge cni0 and their virtual Ethernet interfaces vethX. In step #10 (L2 addressing), an ARP request is sent through the bridge to all connected interfaces, asking which of them owns the destination IP address of the original request (10.42.1.9 here).

Once the MAC address of veth0 is obtained, the packet’s link layer information is set in step #11. After the packet is sent, it enters the eth0 interface of the httpbin pod through veth0, and the unpacking process begins.

There is nothing special about the unpacking process; it ends by determining the socket that httpbin uses.
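The ARP lookup on the bridge in the same-node case can be modeled in a few lines. This is a toy sketch, not how the bridge is implemented: each port is treated as answering on behalf of the pod behind it, and the MAC addresses and every IP other than 10.42.1.9 are made up.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BridgePort:
    name: str     # interface attached to the cni0 bridge
    pod_ip: str   # IP of the pod reachable through this port
    mac: str      # MAC address returned in the ARP reply

def arp_resolve(ports: List[BridgePort], sender: str, target_ip: str) -> Optional[str]:
    """Flood the request to every port except the sender; return the owner's MAC."""
    for port in ports:
        if port.name == sender:
            continue              # the request is not sent back where it came from
        if port.pod_ip == target_ip:
            return port.mac       # the owner of the IP replies
    return None                   # nobody on this bridge owns the address

cni0_ports = [
    BridgePort("veth1", "10.42.1.8", "02:00:00:00:00:01"),   # curl pod (assumed IP)
    BridgePort("veth0", "10.42.1.9", "02:00:00:00:00:02"),   # httpbin pod
    BridgePort("vethX", "10.42.1.10", "02:00:00:00:00:03"),  # some other pod
]

same_node = arp_resolve(cni0_ports, "veth1", "10.42.1.9")
other_node = arp_resolve(cni0_ports, "veth1", "10.42.2.17")
print(same_node, other_node)
```

The None result for an address outside this node is the situation the next scenario deals with.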

Pod-to-Pod Communication on Different Nodes

Here things are slightly different. When the ARP request sent through cni0 receives no response, the host’s routing table in the root namespace is used to determine the IP address of the destination host. An ARP request is then sent through the host’s eth0, and the destination host responds. Its MAC address is written into the packet in step #11.

Once the packet arrives at the destination host, the unpacking process begins and eventually enters the destination pod.

At the cluster level, there is a routing table that stores the pod IP subnet of each node. The cluster as a whole has a pod address range (such as 10.42.0.0/16 in k3s), and each node is assigned its own subnet from that range when it joins the cluster: 10.42.0.0/24, 10.42.1.0/24, 10.42.2.0/24, and so on. A request is forwarded to the node whose pod subnet contains the destination IP.
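The subnet arithmetic can be checked with Python's ipaddress module. In this sketch the node names are hypothetical and the subnets are simply handed out in order; how a real cluster assigns them depends on the distribution and the CNI plugin.

```python
import ipaddress

CLUSTER_CIDR = ipaddress.ip_network("10.42.0.0/16")
NODE_NAMES = ["node-0", "node-1", "node-2"]   # hypothetical node names

# subnets(new_prefix=24) yields 10.42.0.0/24, 10.42.1.0/24, 10.42.2.0/24, ...
node_subnets = dict(zip(NODE_NAMES, CLUSTER_CIDR.subnets(new_prefix=24)))

def node_for_pod(pod_ip: str):
    """Return the node whose pod subnet contains pod_ip, or None."""
    address = ipaddress.ip_address(pod_ip)
    for node, subnet in node_subnets.items():
        if address in subnet:
            return node
    return None

for node, subnet in node_subnets.items():
    print(node, subnet)
print(node_for_pod("10.42.1.9"))    # node-1
print(node_for_pod("10.42.2.17"))   # node-2
print(node_for_pod("10.43.0.1"))    # None: outside the pod address range
```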

Summary

In all three scenarios, the packet is processed by the kernel network protocol stack twice (including netfilter processing): once on the sending side and once on the receiving side. This holds even when both endpoints are in the same pod or on the same node, where everything actually takes place within a single kernel.

If two sockets within the same kernel space can transfer data directly, can we eliminate the latency caused by the kernel network protocol stack processing?

Continued in the next section.