TCP protocol
Introduction
OSI Seven-Layer Network Model
MTU (Maximum Transmission Unit): the largest packet a network device or interface can carry in a single frame
MSS (Maximum Segment Size): the largest TCP payload that fits in one segment
Physical Layer. PDU name: Bit
Data Link Layer. PDU name: Frame. Protocols: Ethernet, Wi-Fi (IEEE 802.11). Ethernet frame size = 64~1518 Bytes (payload 46~1500 Bytes)
Network Layer. PDU name: Packet. Protocols: IP, ICMP (BGP is often listed here as a routing protocol, although it runs on top of TCP). IP MTU = 1518 - 14 (Frame Header) - 4 (CRC) = 1500 Bytes
Transport Layer. PDU name: Segment (TCP) or Datagram (UDP). Protocols: TCP, UDP. MSS = 1500 (IP MTU) - 20 (IP Header) - 20 (TCP Header) = 1460 Bytes
Session Layer. PDU name: Data Stream
Presentation Layer. PDU name: Message. Protocols: SSL/TLS
Application Layer. PDU name: Message. Protocols: HTTP, SMTP, SSH, Telnet
TCP Header Format
A TCP connection is identified by a five-tuple: (src_ip, src_port, dst_ip, dst_port, protocol)
Sequence Number is the sequence number of the first data byte in the segment; it lets the receiver put reordered data back in order.
Acknowledgement Number is the ACK: the next sequence number the receiver expects, used to confirm receipt and detect packet loss.
Window, also called Advertised-Window, is the basis of the well-known Sliding Window, used for flow control.
TCP Flags indicate the packet type (SYN, ACK, FIN, RST, ...) and mainly drive the TCP state machine.
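The fields above can be read out of the fixed 20-byte part of a TCP header. The following is a minimal sketch, assuming a header without options; the names `TcpHeader`, `parse_tcp_header` and `build_example_syn` are made up for this illustration.

```python
import struct
from typing import NamedTuple


class TcpHeader(NamedTuple):
    src_port: int
    dst_port: int
    seq: int          # Sequence Number
    ack: int          # Acknowledgement Number
    data_offset: int  # header length in 32-bit words
    flags: frozenset  # names of the flag bits that are set
    window: int       # Advertised-Window


# Flag bits in the low bits of the 16-bit offset/flags field
FLAG_BITS = {"FIN": 0x01, "SYN": 0x02, "RST": 0x04, "PSH": 0x08, "ACK": 0x10, "URG": 0x20}


def parse_tcp_header(raw: bytes) -> TcpHeader:
    """Parse the fixed 20-byte TCP header (network byte order); options are ignored."""
    src, dst, seq, ack, off_flags, window, _checksum, _urgent = struct.unpack(
        "!HHIIHHHH", raw[:20]
    )
    flags = frozenset(name for name, bit in FLAG_BITS.items() if off_flags & bit)
    return TcpHeader(src, dst, seq, ack, off_flags >> 12, flags, window)


def build_example_syn() -> bytes:
    """Hypothetical SYN segment: port 54321 -> 9000, ISN 1000, window 65535."""
    return struct.pack("!HHIIHHHH", 54321, 9000, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)


if __name__ == "__main__":
    print(parse_tcp_header(build_example_syn()))
```

The IP addresses and protocol number of the five-tuple live in the IP header, so only the two ports appear here.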
TCP State Machine


The main purpose of the 3-way handshake is to initialize the Sequence Number. Both parties must tell each other their Initial Sequence Number (ISN), hence the name SYN, which stands for Synchronize Sequence Numbers. These are the x and y in the diagram above. Subsequent data exchange is numbered from these values, so data delivered to the application layer is not put out of order by the network (TCP uses the sequence numbers to reassemble data).
The 4-way teardown is really 2 rounds: because TCP is full-duplex, each direction needs its own FIN and ACK. Usually one side closes passively, which gives the familiar 4 steps. If both sides disconnect simultaneously, they enter the CLOSING state and then reach the TIME_WAIT state. The diagram below shows simultaneous disconnection by both sides (you can also follow along with the TCP state machine)
Important notes:
SYN_RECV state: If the server does not receive the ACK that completes the handshake, it resends the SYN+ACK packet. In Linux, the default is 5 retries, starting from 1s and doubling each time. Counting the final 32s wait after the 5th retry, the total is 1s + 2s + 4s + 8s + 16s + 32s = 2^6 - 1 = 63s, and only then does TCP drop the connection. Optimization parameters: 1) tcp_synack_retries to reduce the retry count. 2) tcp_max_syn_backlog and net.core.somaxconn to increase the SYN half-connection queue (net.core.somaxconn also caps the full connection queue). 3) tcp_abort_on_overflow controls what happens when the full connection queue is full: with 0 (the default) the final ACK is dropped, with 1 the connection is rejected with an RST. 4) tcp_syncookies hashes the five-tuple into a cookie and returns it in the SYN+ACK; the client carries it back in its ACK to establish the connection (not recommended to enable)
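The 63s figure can be reproduced with a few lines. This is only an arithmetic sketch of the doubling schedule described above; `synack_timeline` is a hypothetical helper, not a kernel interface.

```python
def synack_timeline(retries: int = 5, initial_timeout: float = 1.0):
    """Return (times at which SYN+ACK is resent, time at which the connection is dropped).

    Assumes the timeout starts at `initial_timeout` seconds and doubles after
    every retry, as described in the text.
    """
    elapsed = 0.0
    timeout = initial_timeout
    resend_times = []
    for _ in range(retries):
        elapsed += timeout          # wait for the current timeout to expire
        resend_times.append(elapsed)
        timeout *= 2                # exponential backoff
    elapsed += timeout              # the last retry must time out as well
    return resend_times, elapsed


if __name__ == "__main__":
    resends, give_up = synack_timeline()
    print("resend at:", resends, "give up at:", give_up)
```

Lowering the retry count (the role of tcp_synack_retries) shortens the total quickly: with 2 retries the same schedule gives up after 1 + 2 + 4 = 7 seconds.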
ISN initialization: The ISN is tied to a pseudo-clock that increments the ISN by one every 4 microseconds until it exceeds 2^32 - 1, then wraps around to 0. RFC793 gives one ISN cycle as approximately 4.55 hours (2^32 x 4 microseconds works out to about 4.77 hours). Assuming a TCP segment's lifetime on the network does not exceed the Maximum Segment Lifetime (MSL), the ISN will not be reused while old segments are still in flight, as long as the MSL is well below the cycle time
MSL and TIME_WAIT: The timeout from TIME_WAIT state to CLOSED state is set to 2*MSL. RFC793 defines MSL as 2 minutes; Linux effectively uses 30s, because its TIME_WAIT duration is fixed at 60s in the kernel (the parameter net.ipv4.tcp_fin_timeout controls the FIN_WAIT_2 timeout, not TIME_WAIT). Reasons: 1) TIME_WAIT ensures enough time for the peer to receive the ACK. If the passive closing side does not receive the ACK, it resends its FIN; the ACK travelling one way plus the retransmitted FIN coming back takes at most 2 MSLs. 2) It gives delayed packets of this connection time to expire, so they cannot be mistaken for packets of a later connection that reuses the same five-tuple
Too many TIME_WAITs: A client that opens many short-lived connections under high concurrency may accumulate too many TIME_WAIT sockets. Optimization parameters: 1) tcp_tw_reuse lets new outgoing connections reuse sockets in TIME_WAIT; it requires tcp_timestamps=1 (not highly recommended). 2) tcp_tw_recycle assumes the peer has tcp_timestamps enabled and compares timestamps to recycle connections quickly; it breaks clients behind NAT and was removed in Linux 4.12. 3) tcp_max_tw_buckets limits the number of TIME_WAIT sockets, default value 180000. When the limit is exceeded, the system destroys the excess sockets and prints a warning
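As a quick check of the clock-driven ISN described above, the sketch below models the counter with integer microseconds. The function names are illustrative only; real stacks add randomization on top of such a clock. Note that the plain arithmetic gives roughly 4.77 hours per cycle, slightly more than the 4.55 hours quoted above.

```python
SEQ_SPACE = 2 ** 32        # size of the 32-bit sequence number space
TICK_MICROSECONDS = 4      # the pseudo-clock advances once every 4 microseconds


def isn_at(microseconds_since_start: int) -> int:
    """ISN produced by the idealized clock after the given number of microseconds."""
    return (microseconds_since_start // TICK_MICROSECONDS) % SEQ_SPACE


def isn_cycle_hours() -> float:
    """Time for the counter to run through the whole sequence space, in hours."""
    return SEQ_SPACE * TICK_MICROSECONDS / 1_000_000 / 3600


if __name__ == "__main__":
    print(f"ISN after 1 second: {isn_at(1_000_000)}")
    print(f"cycle length: {isn_cycle_hours():.2f} hours")
```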
The TIME_WAIT state only exists on the side that actively disconnects. For HTTP servers, it is recommended to enable keepalive (browsers will reuse a single TCP connection to handle multiple HTTP requests; enabled by default in HTTP/1.1 and above), letting the client actively disconnect
Sequence Number in Data Transmission
Wireshark filter expression: ip.addr == 172.22.3.29 && tcp.port == 9000
The SeqNum increases by the number of payload bytes transmitted (SYN and FIN each consume one sequence number as well).
Note: Wireshark uses Relative SeqNum for friendlier display. You can uncheck it in the protocol preferences from the right-click menu to see the "Absolute SeqNum".
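The relationship between bytes sent, absolute and relative sequence numbers can be sketched as follows. The helper names are hypothetical; the only facts relied on are that sequence numbers are 32-bit and wrap around, and that SYN and FIN each consume one number.

```python
SEQ_MOD = 2 ** 32  # sequence numbers are 32-bit and wrap around


def next_seq(seq: int, payload_len: int, syn: bool = False, fin: bool = False) -> int:
    """Sequence number of the byte following this segment."""
    return (seq + payload_len + int(syn) + int(fin)) % SEQ_MOD


def relative_seq(absolute: int, isn: int) -> int:
    """Relative SeqNum in the style of Wireshark's display: offset from the ISN."""
    return (absolute - isn) % SEQ_MOD


if __name__ == "__main__":
    isn = 1000
    after_syn = next_seq(isn, 0, syn=True)   # the SYN consumes one number
    after_data = next_seq(after_syn, 500)    # 500 payload bytes
    print(after_syn, after_data, relative_seq(after_data, isn))
```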
TCP Retransmission Mechanism
Note: The ACK from the receiver to the sender only acknowledges the last contiguous packet
Timeout retransmission mechanism: For 5 data segments (1-5), when segment 3 is not received:
Only retransmit the timed-out packet, i.e., segment 3 (saves bandwidth, slow)
Retransmit all packets after the timeout, i.e., segments 3, 4, 5 (slightly better, wastes bandwidth)
Fast Retransmit mechanism
The Fast Retransmit algorithm is data-driven rather than time-driven. The receiver always ACKs the first segment it is still missing. The first segment arrives, so ACK 2 is sent back. Segment 2 is not received for some reason. Segment 3 arrives, so ACK 2 is still sent. Segments 4 and 5 arrive, but ACK 2 is still sent because segment 2 has not been received. The sender receives three duplicate ACK=2 confirmations and concludes that segment 2 has not arrived, so it immediately retransmits segment 2. The receiver then gets segment 2 and, since segments 3, 4, 5 have already been received, it ACKs 6
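The walk-through above can be replayed with a small simulation. It is a sketch under simplifying assumptions: segments are numbered 1, 2, 3, ... instead of by byte, and the sender retransmits after 3 duplicate ACKs (the original ACK plus three repeats). Both function names are made up for illustration.

```python
def receiver_acks(arrivals):
    """Cumulative ACK sent by the receiver for each arriving segment number."""
    received = set()
    next_expected = 1
    acks = []
    for seg in arrivals:
        received.add(seg)
        # Advance past every segment that is now contiguous.
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks


def fast_retransmit_trigger(acks, dup_threshold: int = 3):
    """Return the segment to retransmit once `dup_threshold` duplicate ACKs are seen, else None."""
    last_ack = None
    dup_count = 0
    for ack in acks:
        if ack == last_ack:
            dup_count += 1
            if dup_count == dup_threshold:
                return ack
        else:
            last_ack = ack
            dup_count = 0
    return None


if __name__ == "__main__":
    acks = receiver_acks([1, 3, 4, 5])      # segment 2 is lost
    print("ACKs:", acks)
    print("retransmit segment:", fast_retransmit_trigger(acks))
    print("ACKs after retransmission:", receiver_acks([1, 3, 4, 5, 2]))
```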
Question: Does Fast Retransmit retransmit only the segment named by the duplicate ACKs, or that segment and everything sent after it?
Selective Acknowledgment (SACK): Requires adding a SACK option in the TCP header. The ACK is still the Fast Retransmit ACK, while SACK reports the received data fragments
The sender can use the returned SACK to know which data has arrived and which has not, thus optimizing the Fast Retransmit algorithm. Of course, this protocol requires support on both sides. Linux kernel parameter net.ipv4.tcp_sack=1 enables this feature
Note: Receiver reneging issue. The receiver has the right to discard data it has already SACKed, for example because it needs the memory for more important things, so the sender cannot fully rely on SACK. It still needs the ACK and must maintain the timeout. If subsequent ACKs do not advance, the SACKed data still needs to be retransmitted.
Duplicate SACK (D-SACK): Addresses the problem of receiving duplicate data, primarily using SACK to tell the sender which data was received in duplicate
ACK packet loss: If the first SACK segment's range is covered by the ACK, it is a D-SACK. As shown in the diagram, two ACK packets (3500, 4000) were lost in the request. The third packet returns ACK=4000 SACK=3000-3500, making this SACK a D-SACK packet, indicating the data was not lost but the ACK packets were.

Network delay: If the first SACK segment's range is covered by the second SACK segment, it is a D-SACK. As shown in the diagram, the network packet (1000-1499) was delayed by the network, causing the sender not to receive the ACK. The three subsequent packets that arrived triggered the "Fast Retransmit algorithm", so retransmission occurred. But when the retransmission happened, the delayed packet also arrived, so a SACK=1000-1500 was sent back. Since the ACK had already reached 3000, this SACK is a D-SACK -- indicating that a duplicate packet was received.
In this case, the sender knows that the retransmission triggered by the "Fast Retransmit algorithm" was not because the sent packet was lost, nor because the response ACK packet was lost, but because of network delay.
Linux kernel parameter net.ipv4.tcp_dsack=1 enables this feature
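The two D-SACK rules described above (first SACK block covered by the cumulative ACK, or first SACK block covered by the second SACK block) can be written as a small predicate. This is a sketch that ignores sequence number wraparound; `is_dsack` and the `(left, right)` block representation are assumptions of this example.

```python
def is_dsack(ack: int, sack_blocks) -> bool:
    """Decide whether the first SACK block reports duplicate data.

    `sack_blocks` is a list of (left_edge, right_edge) pairs in the order they
    appear in the SACK option; right_edge is exclusive.
    """
    if not sack_blocks:
        return False
    first_left, first_right = sack_blocks[0]
    # Rule 1: the first block lies entirely below the cumulative ACK.
    if first_right <= ack:
        return True
    # Rule 2: the first block is contained in the second block.
    if len(sack_blocks) > 1:
        second_left, second_right = sack_blocks[1]
        if second_left <= first_left and first_right <= second_right:
            return True
    return False


if __name__ == "__main__":
    print(is_dsack(4000, [(3000, 3500)]))   # ACK-loss example from the text
    print(is_dsack(3000, [(1000, 1500)]))   # network-delay example from the text
    print(is_dsack(1000, [(1500, 2000)]))   # ordinary SACK
```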
Benefits of using D-SACK: it lets the sender know
Whether the sent packet was lost or the returning ACK packet was lost.
Whether the timeout was set too small, causing unnecessary retransmission.
Whether packets sent earlier arrived later on the network (also called reordering).
Whether the network duplicated the data packet.
TCP RTT Algorithm
RTT (Round Trip Time): The time from when a packet is sent to when the ACK returns. If the sender sends at time t0 and receives the ACK at time t1, the RTT sample = t1 - t0
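RTO (Retransmission TimeOut): How long the sender waits for an ACK before retransmitting. It has to track the RTT: set too small it causes unnecessary retransmissions, set too large it delays recovery from loss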
Algorithms: Classic algorithm (weighted moving average), Karn/Partridge algorithm, Jacobson/Karels algorithm
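As an illustration of the last algorithm in the list, here is a minimal sketch of a Jacobson/Karels style estimator using the constants standardized in RFC 6298 (alpha = 1/8, beta = 1/4, RTO = SRTT + 4 * RTTVAR, 1 second initial and minimum RTO). The class name and the `retransmitted` flag are choices made for this example; the flag applies Karn's rule of not sampling retransmitted segments.

```python
class RtoEstimator:
    ALPHA = 1 / 8       # gain for the smoothed RTT
    BETA = 1 / 4        # gain for the RTT variation
    K = 4               # multiplier applied to the variation
    INITIAL_RTO = 1.0   # seconds, used before any sample exists

    def __init__(self, min_rto: float = 1.0):
        self.min_rto = min_rto
        self.srtt = None
        self.rttvar = None

    @property
    def rto(self) -> float:
        if self.srtt is None:
            return max(self.min_rto, self.INITIAL_RTO)
        return max(self.min_rto, self.srtt + self.K * self.rttvar)

    def sample(self, rtt: float, retransmitted: bool = False) -> float:
        """Feed one RTT sample (seconds) and return the updated RTO."""
        if retransmitted:
            # Karn's rule: the ACK is ambiguous, so the sample is ignored.
            return self.rto
        if self.srtt is None:
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # RTTVAR is updated first, using the old SRTT.
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt
        return self.rto


if __name__ == "__main__":
    est = RtoEstimator(min_rto=0.0)
    for r in (0.100, 0.200, 0.110):
        print(f"sample={r:.3f}s -> rto={est.sample(r):.4f}s")
```

The minimum is a parameter because implementations differ on it; the 1 second default here follows the RFC, not any particular kernel.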
TCP Sliding Window
TCP header field Window (Advertised-Window): The receiver tells the sender how much buffer space it has available to receive data

On the receiver side, LastByteRead points to the position the application has read up to in the TCP buffer, NextByteExpected points just past the last byte of contiguous received data, and LastByteRcvd points to the last byte received. There can be gaps between NextByteExpected and LastByteRcvd where data has not yet arrived.
On the sender side, LastByteAcked points to the last byte acknowledged by the receiver, LastByteSent points to the last byte sent (bytes between LastByteAcked and LastByteSent are sent but not yet acknowledged), and LastByteWritten points to where the upper-layer application is currently writing. Therefore:
The receiver reports its AdvertisedWindow = MaxRcvBuffer - ((NextByteExpected - 1) - LastByteRead), i.e. the buffer space not occupied by unread data, in the ACK sent back to the sender;
The sender limits the data in flight to this window (EffectiveWindow = AdvertisedWindow - (LastByteSent - LastByteAcked)) to ensure the receiver can handle it
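The window bookkeeping can be sketched with two small functions. They follow a common textbook formulation in which the advertised window is the buffer space not occupied by unread in-order data, and the sender subtracts the bytes already in flight; the function names are hypothetical.

```python
def advertised_window(max_rcv_buffer: int, next_byte_expected: int, last_byte_read: int) -> int:
    """Free receive buffer space: total size minus bytes received in order but not yet read."""
    buffered = (next_byte_expected - 1) - last_byte_read
    return max_rcv_buffer - buffered


def effective_window(advertised: int, last_byte_sent: int, last_byte_acked: int) -> int:
    """How many more bytes the sender may put on the wire right now."""
    in_flight = last_byte_sent - last_byte_acked
    return max(0, advertised - in_flight)


if __name__ == "__main__":
    # Receiver: 4096-byte buffer, bytes 1..1000 received in order, none read yet.
    adv = advertised_window(4096, next_byte_expected=1001, last_byte_read=0)
    # Sender: bytes up to 2000 sent, bytes up to 1000 acknowledged.
    print(adv, effective_window(adv, last_byte_sent=2000, last_byte_acked=1000))
```

When the effective window reaches 0 the sender must stop, which is the zero window situation.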
Sender sliding window example: Before sliding
After sliding 

Zero window
After the window becomes 0, the sender sends ZWP (Zero Window Probe) packets to the receiver, asking the receiver to ACK its window size. This is typically set to 3 attempts, each about 30-60 seconds apart (different implementations may vary). If the window is still 0 after 3 attempts, some TCP implementations will send RST to disconnect.
Note: Wherever there is waiting, DDoS attacks are possible. Zero Window is no exception. Some attackers establish an HTTP connection, send a GET request, then set the Window to 0. The server can only wait and perform ZWP. Attackers can then send a large number of such concurrent requests to exhaust server resources.
In Wireshark, you can use tcp.analysis.zero_window to filter packets, then use "Follow TCP Stream" from the right-click menu to see the ZeroWindowProbe and ZeroWindowProbeAck packets
Silly Window Syndrome
When the receiver is too busy to consume data from the Receive Window, the sender's window becomes smaller and smaller. Eventually, if the receiver frees up a few bytes and tells the sender there are now a few bytes of window available, the sender will eagerly send those few bytes. With MSS=1460, sending such small data with IP and TCP headers wastes bandwidth. Solution: Avoid responding to small window sizes; only respond when the window size is large enough. This can be implemented on both the receiver and sender sides.
On the receiver side, if received data causes the window size to fall below a certain value, it can directly ACK(0) back to the sender, closing the window and preventing the sender from sending more data. When the receiver has processed enough data so that the window size is greater than or equal to MSS, or when half the receiver buffer is empty, it can reopen the window to let the sender send data.
When caused by the sender side, the well-known Nagle's algorithm is used. It also works by delaying: a segment is sent immediately only if 1) Window Size >= MSS and Data Size >= MSS (a full segment can go out), or 2) the ACKs for all previously sent data have been received; otherwise, data is accumulated.
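The sender-side decision can be condensed into one predicate. This is a simplified sketch of Nagle's algorithm (no timers, no TCP_NODELAY handling); the function and parameter names are made up for illustration.

```python
def nagle_should_send(data_len: int, window: int, mss: int, unacked_bytes: int) -> bool:
    """Return True if the sender should transmit now, False if it should keep accumulating."""
    if data_len == 0 or window == 0:
        return False            # nothing to send, or the receiver has closed the window
    if data_len >= mss and window >= mss:
        return True             # a full-sized segment can go out
    if unacked_bytes == 0:
        return True             # nothing in flight, so a small segment is allowed
    return False                # small data while earlier data is unacknowledged: wait


if __name__ == "__main__":
    mss = 1460
    print(nagle_should_send(1460, 8192, mss, unacked_bytes=500))  # full segment
    print(nagle_should_send(10, 8192, mss, unacked_bytes=0))      # idle connection
    print(nagle_should_send(10, 8192, mss, unacked_bytes=500))    # must accumulate
```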
TCP Congestion Handling
Slow Start
At the start of a newly established connection, initialize cwnd = 1, meaning one MSS-sized data segment can be transmitted.
For each ACK received, cwnd++ (linear increase per ACK)
As a result, for each RTT passed, cwnd = cwnd*2 (exponential increase per RTT)
ssthresh (slow start threshold). When cwnd >= ssthresh, the "congestion avoidance algorithm" begins
Congestion Avoidance
Generally, ssthresh is set to 65535 bytes. When cwnd reaches this value, the algorithm is as follows:
When an ACK is received, cwnd = cwnd + 1/cwnd
For each RTT passed, cwnd = cwnd + 1
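How the per-ACK rules turn into doubling and then roughly +1 per RTT can be seen in a short simulation. It is a sketch that measures cwnd in segments rather than bytes and assumes one ACK per segment in flight and no loss; `cwnd_per_rtt` is a hypothetical helper.

```python
def cwnd_per_rtt(rtts: int, ssthresh: float, initial_cwnd: float = 1.0):
    """Return cwnd (in segments) at the start and after each of `rtts` round trips."""
    cwnd = initial_cwnd
    history = [cwnd]
    for _ in range(rtts):
        acks = int(cwnd)                 # one ACK per segment sent this round trip
        for _ in range(acks):
            if cwnd < ssthresh:
                cwnd += 1                # slow start: +1 per ACK
            else:
                cwnd += 1 / cwnd         # congestion avoidance: +1/cwnd per ACK
        history.append(cwnd)
    return history


if __name__ == "__main__":
    for rtt, cwnd in enumerate(cwnd_per_rtt(rtts=6, ssthresh=8)):
        print(f"after RTT {rtt}: cwnd = {cwnd:.2f}")
```

With ssthresh = 8 the window doubles from 1 to 8 in three round trips and then grows by slightly less than one segment per RTT, because the per-ACK increments shrink as cwnd grows within a round.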
Congestion Event (Fast Retransmit)
Wait for the RTO timeout, then retransmit the data packet. TCP considers this situation very bad and reacts strongly:
ssthresh = cwnd / 2
cwnd reset to 1
Enter the slow start algorithm
Fast Retransmit algorithm, which initiates retransmission upon receiving 3 duplicate ACKs without waiting for the RTO timeout.
TCP Tahoe reacts the same way as to an RTO timeout.
TCP Reno reacts as follows:
cwnd = cwnd / 2
ssthresh = cwnd
Enter the Fast Recovery algorithm
Fast Recovery
cwnd = ssthresh + 3 * MSS (the 3 accounts for the 3 data packets that the duplicate ACKs confirm have been received)
Retransmit the data packet specified by the Duplicated ACKs
If more duplicated ACKs are received, cwnd = cwnd + 1
If a new ACK is received, cwnd = ssthresh, then enter the congestion avoidance algorithm.
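The Reno reactions listed above fit into a small state machine. The sketch below measures cwnd in segments (so "3 * MSS" becomes 3), follows the rules as stated in the text, and leaves out the actual retransmission; the class and state names are invented for this example.

```python
class RenoSender:
    def __init__(self, cwnd: float = 1.0, ssthresh: float = 64.0):
        self.cwnd = cwnd
        self.ssthresh = ssthresh
        self.dup_acks = 0
        self.state = "slow_start" if cwnd < ssthresh else "congestion_avoidance"

    def on_timeout(self):
        """RTO expired: the strong reaction."""
        self.ssthresh = self.cwnd / 2
        self.cwnd = 1.0
        self.dup_acks = 0
        self.state = "slow_start"

    def on_dup_ack(self):
        if self.state == "fast_recovery":
            self.cwnd += 1                       # each further duplicate ACK inflates cwnd
            return
        self.dup_acks += 1
        if self.dup_acks == 3:                   # Fast Retransmit triggers here
            self.ssthresh = self.cwnd / 2
            self.cwnd = self.ssthresh + 3        # enter Fast Recovery
            self.state = "fast_recovery"

    def on_new_ack(self):
        self.dup_acks = 0
        if self.state == "fast_recovery":
            self.cwnd = self.ssthresh            # deflate and leave Fast Recovery
            self.state = "congestion_avoidance"
        elif self.cwnd < self.ssthresh:
            self.cwnd += 1                       # slow start
            if self.cwnd >= self.ssthresh:
                self.state = "congestion_avoidance"
        else:
            self.cwnd += 1 / self.cwnd           # congestion avoidance
            self.state = "congestion_avoidance"


if __name__ == "__main__":
    s = RenoSender(cwnd=20, ssthresh=64)
    for _ in range(3):
        s.on_dup_ack()
    print(s.state, s.cwnd, s.ssthresh)
    s.on_new_ack()
    print(s.state, s.cwnd, s.ssthresh)
```

A Tahoe sender would call the equivalent of `on_timeout` on the third duplicate ACK instead of entering Fast Recovery.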
Algorithm diagram 
TCP Full Connection and Half-Connection Queues
Half-Connection Queue Overflow & SYN Flood
Test using the TCP server side
Full Connection Queue Overflow