
TCP Protocol

Introduction

OSI Seven-Layer Network Model

MTU (Maximum Transmission Unit): the maximum packet size a network device or interface can carry
MSS (Maximum Segment Size): the maximum amount of payload a single TCP segment can carry

  • Physical Layer. PDU: Bit

  • Data Link Layer. PDU: Frame. Protocols: Ethernet, Wi-Fi (IEEE 802.11). Ethernet frame size: 64~1518 bytes (payload 46~1500 bytes)

  • Network Layer. PDU: Packet. Protocols: IP, ICMP, BGP. IP MTU = 1518 - 14 (frame header) - 4 (CRC) = 1500 bytes

  • Transport Layer. PDU: Segment (TCP) or Datagram (UDP). Protocols: TCP, UDP. MSS = 1500 (Ethernet MTU) - 20 (IP header) - 20 (TCP header) = 1460 bytes

  • Session Layer. PDU: Data stream

  • Presentation Layer. PDU: Message. Protocols: SSL/TLS

  • Application Layer. PDU: Message. Protocols: HTTP, SMTP, SSH, Telnet
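The MTU/MSS arithmetic above can be checked directly. A minimal sketch using the standard Ethernet II / IPv4 / TCP header sizes (assuming no IP or TCP options):

```python
# Standard header sizes for Ethernet II / IPv4 / TCP without options.
ETH_HEADER = 14   # dst MAC (6) + src MAC (6) + EtherType (2)
ETH_CRC = 4       # frame check sequence
IP_HEADER = 20    # minimal IPv4 header
TCP_HEADER = 20   # minimal TCP header

max_frame = 1518                            # largest standard Ethernet frame
ip_mtu = max_frame - ETH_HEADER - ETH_CRC   # 1500: payload available to IP
mss = ip_mtu - IP_HEADER - TCP_HEADER       # 1460: payload available to TCP

print(ip_mtu, mss)  # 1500 1460
```

IP or TCP options (e.g. TCP timestamps) shrink the usable MSS further, which is why captured MSS values sometimes fall below 1460.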

TCP Header Format

(figure: TCP header format)

A TCP connection is identified by a five-tuple: (src_ip, src_port, dst_ip, dst_port, protocol). Packets sharing the same five-tuple belong to the same connection.
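Because the five-tuple identifies a connection, it makes a natural lookup key. A small illustrative sketch (the addresses and names are hypothetical):

```python
from collections import namedtuple

FiveTuple = namedtuple("FiveTuple", "src_ip src_port dst_ip dst_port protocol")

# Two connections from the same host to the same server port differ only
# in the client's source port, yet they are distinct connections.
conns = {}
conns[FiveTuple("10.0.0.1", 51000, "10.0.0.2", 80, "tcp")] = "conn-1"
conns[FiveTuple("10.0.0.1", 51001, "10.0.0.2", 80, "tcp")] = "conn-2"

print(len(conns))  # 2: changing any one field yields a different connection
```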

  • Sequence Number is the packet sequence number, used to solve the network packet reordering problem.

  • Acknowledgement Number is the ACK -- used to confirm receipt, solving the problem of packet loss.

  • Window, also called Advertised-Window, is the well-known Sliding Window, used for flow control.

  • TCP Flag is the packet type, primarily used to control the TCP state machine.

TCP State Machine

(figures: TCP state machine; 3-way handshake and 4-way teardown sequence diagrams)
  • For the 3-way handshake to establish a connection, the main purpose is to initialize the Sequence Number's initial value. Both communicating parties must notify each other of their initialized Sequence Number (abbreviated as ISN: Initial Sequence Number) -- hence the name SYN, which stands for Synchronize Sequence Numbers. These are the x and y in the diagram above. This number will be used as the sequence number for subsequent data communication, ensuring that data received at the application layer will not be disordered due to network transmission issues (TCP uses this sequence number to reassemble data).

  • For the 4-way teardown, a close look shows it is really 2 rounds of Fin + Ack: because TCP is full-duplex, each direction must be closed with its own Fin and Ack. Since one side usually closes passively, it looks like the so-called 4-way teardown. If both sides close simultaneously, both enter the CLOSING state and then reach TIME_WAIT. The diagram below shows a simultaneous close by both sides (you can also trace it on the TCP state machine)

(figure: simultaneous close)

Important notes:

  • SYN_RECV state: If the server never receives the final ACK of the handshake, it retransmits the SYN+ACK packet. On Linux the default is 5 retries with exponential backoff starting at 1s, totaling 1s + 2s + 4s + 8s + 16s + 32s = 2^6 - 1 = 63s; only after the 63s timeout does TCP tear the connection down. Optimization parameters: 1) tcp_synack_retries to reduce the retry count. 2) tcp_max_syn_backlog and net.core.somaxconn to enlarge the SYN half-connection queue. 3) tcp_abort_on_overflow to reject connections (dropping the ACK) when the full connection queue is full; tcp_syncookies hashes the five-tuple into a cookie and returns it in the SYN+ACK, and the client carries it back to establish the connection without occupying the queue (not recommended to enable as a default)

  • ISN initialization: The ISN is tied to a pseudo-clock that increments the ISN by one every 4 microseconds until it exceeds 2^32, then wraps around to 0. One ISN cycle is approximately 4.55 hours. Assuming a TCP segment's lifetime on the network does not exceed the Maximum Segment Lifetime (MSL), as long as the MSL value is less than 4.55 hours, the ISN will not be reused

  • MSL and TIME_WAIT: The timeout from TIME_WAIT to CLOSED is set to 2*MSL (RFC 793 defines MSL as 2 minutes; on Linux the TIME_WAIT duration is the compile-time constant TCP_TIMEWAIT_LEN, 60 seconds -- note that net.ipv4.tcp_fin_timeout controls FIN_WAIT_2, not TIME_WAIT). Reasons: 1) TIME_WAIT ensures the final ACK has enough time to reach the peer; if the passive closer does not receive it, it retransmits its Fin, and one such round trip takes at most 2 MSLs. 2) It prevents segments from this connection from being confused with a subsequent connection reusing the same five-tuple (delayed packets could otherwise be mixed into the new connection)

  • Too many TIME_WAITs: A client making many short-lived connections under high concurrency can accumulate large numbers of TIME_WAIT sockets. Optimization parameters: 1) tcp_tw_reuse to reuse connections; requires tcp_timestamps=1 on both ends (use with care). 2) tcp_tw_recycle, which assumed the peer had tcp_timestamps enabled and compared timestamps to recycle connections; it broke clients behind NAT and was removed in newer kernels. 3) tcp_max_tw_buckets caps the number of TIME_WAIT sockets (default 180000); beyond that, the kernel destroys them and logs a warning
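The SYN+ACK retry timeline in the SYN_RECV note above is easy to verify. A back-of-envelope sketch:

```python
# 5 retransmissions with exponential backoff starting at 1s; the final
# 32s is the wait after the last retry before giving up entirely.
retries = 5
waits = [2 ** i for i in range(retries + 1)]  # [1, 2, 4, 8, 16, 32]
print(waits, sum(waits))  # totals 63s == 2**6 - 1
```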

The TIME_WAIT state only exists on the side that actively disconnects. For HTTP servers, it is recommended to enable keepalive (browsers will reuse a single TCP connection to handle multiple HTTP requests; enabled by default in HTTP/1.1 and above), letting the client actively disconnect

Sequence Number in Data Transmission

Wireshark filter expression: ip.addr == 172.22.3.29 && tcp.port == 9000

(figure: filtered capture)

(figure: sequence numbers in the capture)

The SeqNum increases by exactly the number of bytes transmitted: the next segment's SeqNum equals the current SeqNum plus the current payload length.

Note: Wireshark uses Relative SeqNum for friendlier display. You can uncheck it in the protocol preferences from the right-click menu to see the "Absolute SeqNum".

TCP Retransmission Mechanism

Note: The receiver's ACK acknowledges only contiguously received data, i.e., it carries the sequence number of the next byte it expects (a cumulative ACK)

  1. Timeout retransmission mechanism: For 5 data segments (1-5), when segment 3 is not received:

  • Retransmit only the timed-out packet, i.e., segment 3 (saves bandwidth, but slow)

  • Retransmit everything from the timed-out packet onward, i.e., segments 3, 4, 5 (faster, but wastes bandwidth)

  2. Fast Retransmit mechanism: Fast Retransmit is data-driven rather than timer-driven. The receiver keeps ACKing the next byte it expects, which points at the packet that may have been lost: segment 1 arrives, so ACK 2 is sent back; segment 2 is lost; segment 3 arrives, so ACK 2 is sent again; segments 4 and 5 arrive, but the receiver still sends ACK 2 because segment 2 has not arrived. After receiving three duplicate ACK 2s, the sender knows segment 2 is missing and retransmits it immediately. When the receiver finally gets segment 2, since segments 3, 4, 5 were already buffered, it ACKs 6. Question: should the sender retransmit only the packet indicated by the duplicate ACKs, or all packets after it as well?

  3. Selective Acknowledgment (SACK): Requires a SACK option in the TCP header. The ACK field still carries the normal cumulative ACK, while the SACK option reports the noncontiguous blocks of data already received. (figure: SACK blocks) From the returned SACK, the sender knows exactly which data has arrived and which has not, which optimizes the Fast Retransmit algorithm. Both sides must support this option; the Linux kernel parameter net.ipv4.tcp_sack=1 enables it. Note the receiver reneging issue: the receiver is allowed to discard data it has already SACKed (it may need the memory for something more important), so the sender cannot rely on SACK alone. It must still track cumulative ACKs and maintain the retransmission timer, and if subsequent ACKs do not advance, the SACKed data must still be retransmitted.

  4. Duplicate SACK (D-SACK): Addresses duplicate delivery; it uses the first SACK block to tell the sender which data was received more than once

  • ACK packet loss: If the first SACK block's range is covered by the cumulative ACK, it is a D-SACK. In the example, the ACKs for 3500 and 4000 were lost on the way back; the response to the retransmission carries ACK=4000 and SACK=3000-3500, and because that SACK range lies below the ACK, it is a D-SACK, telling the sender the data was not lost -- the ACK packets were. (figure: D-SACK on ACK loss)

  • Network delay: If the first SACK segment's range is covered by the second SACK segment, it is a D-SACK. As shown in the diagram, the network packet (1000-1499) was delayed by the network, causing the sender not to receive the ACK. The three subsequent packets that arrived triggered the "Fast Retransmit algorithm", so retransmission occurred. But when the retransmission happened, the delayed packet also arrived, so a SACK=1000-1500 was sent back. Since the ACK had already reached 3000, this SACK is a D-SACK -- indicating that a duplicate packet was received.

In this case, the sender knows that the retransmission triggered by the "Fast Retransmit algorithm" was not because the sent packet was lost, nor because the response ACK packet was lost, but because of network delay.

The Linux kernel parameter net.ipv4.tcp_dsack=1 enables this feature. (figure: D-SACK on network delay)

Benefits of using D-SACK:

  1. It lets the sender know whether the sent packet was lost or the returning ACK packet was lost.

  2. It reveals whether the RTO was set too small, causing a spurious retransmission.

  3. It reveals whether a packet sent earlier arrived later on the network (i.e., reordering).

  4. It reveals whether the network duplicated the data packet.
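The duplicate-ACK exchange in the Fast Retransmit description above can be sketched with a toy cumulative-ACK receiver (illustrative only, not real TCP):

```python
def acks_for(segments_received):
    """Return the cumulative ACK sent after each arriving segment:
    always the lowest-numbered segment not yet received."""
    expected, got, acks = 1, set(), []
    for seg in segments_received:
        got.add(seg)
        while expected in got:
            expected += 1
        acks.append(expected)
    return acks

acks = acks_for([1, 3, 4, 5])   # segment 2 was lost
print(acks)                     # [2, 2, 2, 2]: three duplicate ACK 2s

if acks.count(acks[-1]) - 1 >= 3:        # sender sees 3 duplicates
    print("fast retransmit segment", acks[-1])

# Once the retransmitted segment 2 arrives, the ACK jumps straight to 6,
# because segments 3, 4, 5 were already buffered.
print(acks_for([1, 3, 4, 5, 2])[-1])  # 6
```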

TCP RTT Algorithm

RTT (Round Trip Time): The time from when a packet is sent to when the ACK returns. If the sender sends at time t0 and receives the ACK at time t1, the RTT sample = t1 - t0

RTO (Retransmission TimeOut): the timeout TCP waits before retransmitting; it must track the RTT so that retransmissions are neither too eager (wasting bandwidth) nor too late (hurting latency)

Algorithms for deriving the RTO from RTT samples: the classic algorithm (an exponentially weighted moving average), the Karn/Partridge algorithm (ignore RTT samples from retransmitted segments and back off the timer), and the Jacobson/Karels algorithm (also tracks RTT variance; the basis of RFC 6298, used by modern TCP stacks)
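The Jacobson/Karels estimator can be sketched in a few lines. This follows the RFC 6298 formulation with the standard constants alpha = 1/8, beta = 1/4, K = 4 (variable names are the RFC's):

```python
def rto(samples):
    """Feed RTT samples (ms) through SRTT/RTTVAR smoothing; return the RTO."""
    srtt = rttvar = None
    for r in samples:
        if srtt is None:
            srtt, rttvar = r, r / 2            # first measurement
        else:
            rttvar = 0.75 * rttvar + 0.25 * abs(srtt - r)
            srtt = 0.875 * srtt + 0.125 * r
    return srtt + 4 * rttvar                   # RTO = SRTT + 4 * RTTVAR

print(rto([100, 100, 100]))  # 212.5: stable RTT, the variance term decays
print(rto([100, 100, 300]))  # 437.5: one spike inflates RTTVAR and the RTO
```

RFC 6298 additionally clamps the RTO to a minimum of 1 second and doubles it on each retransmission (Karn's backoff); both are omitted in this sketch.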

TCP Sliding Window

TCP header field Window (Advertised-Window): The receiver tells the sender how much buffer space it has available to receive data

(figure: send and receive buffer pointers)
  • On the receiver side, LastByteRead points to the position up to which the application has read the TCP buffer, NextByteExpected points to the end of the contiguously received data, and LastByteRcvd points to the last byte received. Between NextByteExpected and LastByteRcvd there can be gaps where data has not yet arrived.

  • On the sender side, LastByteAcked points to the last byte acknowledged by the receiver (confirmed delivered), LastByteSent points to the last byte sent (bytes between LastByteAcked and LastByteSent are in flight: sent but not yet acknowledged), and LastByteWritten points to where the upper-layer application is currently writing. Therefore:

  • The receiver reports AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead), i.e., the buffer size minus the buffered-but-unread data, in the ACK it sends back;

  • The sender controls the size of data sent based on this window to ensure the receiver can handle it
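A worked example of the bookkeeping above, with hypothetical numbers:

```python
MaxRcvBuffer = 65535     # total receive buffer (bytes)
LastByteRead = 1000      # application has read up to byte 1000
NextByteExpected = 3001  # bytes up to 3000 received contiguously

# Buffered-but-unread data occupies (NextByteExpected - 1) - LastByteRead
# bytes; the remainder of the buffer can be advertised to the sender.
AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead)
print(AdvertisedWindow)  # 63535
```

As the application reads (LastByteRead advances), the window reopens; if data arrives faster than it is read, the window shrinks toward zero.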

Sender sliding window example: (figures: the window before and after sliding)


Zero window

After the window becomes 0, the sender sends ZWP (Zero Window Probe) packets to the receiver, asking the receiver to ACK its window size. This is typically set to 3 attempts, each about 30-60 seconds apart (different implementations may vary). If the window is still 0 after 3 attempts, some TCP implementations will send RST to disconnect.

Note: Wherever there is waiting, DDoS attacks are possible. Zero Window is no exception. Some attackers establish an HTTP connection, send a GET request, then set the Window to 0. The server can only wait and perform ZWP. Attackers can then send a large number of such concurrent requests to exhaust server resources.

In Wireshark, you can use tcp.analysis.zero_window to filter packets, then use "Follow TCP Stream" from the right-click menu to see the ZeroWindowProbe and ZeroWindowProbeAck packets

Silly Window Syndrome

When the receiver is too busy to drain the Receive Window, the window it advertises to the sender becomes smaller and smaller. Eventually, if the receiver frees up just a few bytes and advertises them, the sender eagerly sends those few bytes. With an MSS of 1460, shipping a few bytes of payload under 40 bytes of IP and TCP headers is a severe waste of bandwidth. The solution is to avoid acting on small windows: only advertise or fill a window once it is large enough. This can be implemented on both the receiver and sender sides.

  • On the receiver side, if received data causes the window size to fall below a certain value, it can directly ACK(0) back to the sender, closing the window and preventing the sender from sending more data. When the receiver has processed enough data so that the window size is greater than or equal to MSS, or when half the receiver buffer is empty, it can reopen the window to let the sender send data.

  • When caused by the sender side, the well-known Nagle's algorithm is used. It also relies on delaying: new data is sent only when 1) Window Size >= MSS or the accumulated data reaches MSS, or 2) the ACK for previously sent data has arrived; otherwise the data keeps accumulating.
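Nagle's algorithm is enabled by default on TCP sockets; latency-sensitive applications (interactive protocols, many small writes) commonly disable it per socket with the TCP_NODELAY option, trading the batching described above for lower latency. A minimal sketch:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("Nagle disabled:", bool(nodelay))
s.close()
```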

TCP Congestion Handling

  1. Slow Start

  • At the start of a newly established connection, initialize cwnd = 1, meaning one MSS-sized data segment can be transmitted.

  • For each ACK received, cwnd++ (linear increase per ACK)

  • For each RTT passed, cwnd = cwnd*2 (exponential increase per RTT)

  • ssthresh (slow start threshold): when cwnd >= ssthresh, the "congestion avoidance algorithm" takes over

  2. Congestion Avoidance: ssthresh is commonly initialized to 65535 bytes. Once cwnd reaches this value, the algorithm is:

  • When an ACK is received, cwnd = cwnd + 1/cwnd

  • For each RTT passed, cwnd = cwnd + 1 (linear growth)

  3. Congestion event detected by RTO timeout: having to wait for the RTO to expire before retransmitting means the situation is very bad, so TCP reacts strongly:

  • ssthresh = cwnd / 2

  • cwnd is reset to 1

  • Re-enter the slow start algorithm

  4. Fast Retransmit: retransmission is triggered by 3 duplicate ACKs instead of waiting for the RTO timeout.

  • TCP Tahoe reacts the same way as on an RTO timeout.

  • TCP Reno instead does:

    • cwnd = cwnd / 2

    • ssthresh = cwnd

    • Enter the Fast Recovery algorithm

  5. Fast Recovery

  • cwnd = ssthresh + 3 * MSS (the 3 accounts for the 3 segments confirmed by the duplicate ACKs)

  • Retransmit the segment indicated by the duplicate ACKs

  • If more duplicate ACKs arrive, cwnd = cwnd + 1

  • If a new ACK arrives, cwnd = ssthresh, then enter the congestion avoidance algorithm.
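The rules above can be traced with a toy Reno-style simulation (one event per RTT, cwnd in units of MSS; an illustrative sketch, not a faithful kernel implementation):

```python
def step(cwnd, ssthresh, event):
    if event == "rto":                  # timeout: harsh reaction
        return 1, max(cwnd // 2, 2)
    if event == "3dup":                 # 3 duplicate ACKs (Reno)
        cwnd = max(cwnd // 2, 2)
        return cwnd, cwnd
    if cwnd < ssthresh:                 # slow start: double per RTT
        return min(cwnd * 2, ssthresh), ssthresh
    return cwnd + 1, ssthresh           # congestion avoidance: +1 per RTT

cwnd, ssthresh, trace = 1, 8, []
for ev in ["ack"] * 5 + ["3dup"] + ["ack"] * 3:
    cwnd, ssthresh = step(cwnd, ssthresh, ev)
    trace.append(cwnd)
print(trace)  # [2, 4, 8, 9, 10, 5, 6, 7, 8]
```

The trace shows the exponential ramp, the switch to linear growth at ssthresh, the halving on 3 duplicate ACKs, and linear growth resuming from the new, lower plateau.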

(figure: congestion control algorithm diagram)

TCP Full Connection and Half-Connection Queues

Half-Connection Queue Overflow & SYN Flood

Test using the TCP server side

Full Connection Queue Overflow

