TCP(7)			  Linux Programmer’s Manual		       TCP(7)



NAME
       tcp - TCP protocol

SYNOPSIS
       #include <sys/socket.h>
       #include <netinet/in.h>
       #include <netinet/tcp.h>
       tcp_socket = socket(PF_INET, SOCK_STREAM, 0);

DESCRIPTION
       This is an implementation of the TCP protocol defined in RFC793,
       RFC1122 and RFC2001 with the NewReno and SACK extensions.  It provides
       a reliable, stream-oriented, full-duplex connection between two
       sockets on top of ip(7), for both v4 and v6 versions.  TCP guarantees
       that the data arrives in order and retransmits lost packets.  It
       generates and checks a per-packet checksum to catch transmission
       errors.  TCP does not preserve record boundaries.

       A fresh TCP socket has no remote or local address and is not fully
       specified.  To create an outgoing TCP connection, use connect(2) to
       establish a connection to another TCP socket.  To receive new incoming
       connections, first bind(2) the socket to a local address and port and
       then call listen(2) to put the socket into the listening state.  After
       that, a new socket for each incoming connection can be accepted using
       accept(2).  A socket which has had accept(2) or connect(2)
       successfully called on it is fully specified and may transmit data.
       Data cannot be transmitted on listening or not yet connected sockets.
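
       The two setup paths described above can be sketched as follows.  This
       is a minimal illustration that assumes IPv4 on the loopback interface;
       the helper names tcp_listen_loopback and tcp_connect_loopback are
       hypothetical and not part of any library.

```c
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Create a TCP socket listening on 127.0.0.1.  Binding to port 0 lets
 * the kernel pick a free port, which is stored in *port (host byte
 * order).  Returns the listening descriptor, or -1 on error. */
int tcp_listen_loopback(unsigned short *port)
{
    struct sockaddr_in addr;
    socklen_t len = sizeof(addr);
    int fd = socket(PF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(0);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 8) < 0 ||
        getsockname(fd, (struct sockaddr *)&addr, &len) < 0) {
        close(fd);
        return -1;
    }
    *port = ntohs(addr.sin_port);
    return fd;
}

/* Open an outgoing connection to 127.0.0.1:port.  Returns the
 * connected descriptor, or -1 on error. */
int tcp_connect_loopback(unsigned short port)
{
    struct sockaddr_in addr;
    int fd = socket(PF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = htons(port);
    if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

       A server would call accept(2) on the descriptor returned by the first
       helper; only the accepted and the connected sockets can carry data.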

       Linux supports RFC1323 TCP high performance extensions.	These include
       Protection Against Wrapped Sequence  Numbers  (PAWS),  Window  Scaling
       and  Timestamps.	  Window  scaling allows the use of large (> 64K) TCP
       windows in order to support links with high latency or bandwidth.   To
       make use of them, the send and receive buffer sizes must be increased.
       They  can   be	set   globally	 with	the   net.ipv4.tcp_wmem	  and
       net.ipv4.tcp_rmem  sysctl variables, or on individual sockets by using
       the SO_SNDBUF and SO_RCVBUF  socket  options  with  the	setsockopt(2)
       call.

       The maximum sizes for socket buffers declared via the SO_SNDBUF and
       SO_RCVBUF mechanisms are limited by the global net.core.rmem_max and
       net.core.wmem_max sysctls.  Note that TCP actually allocates twice the
       size of the buffer requested in the setsockopt(2) call, so a
       subsequent getsockopt(2) call will not return the same buffer size as
       was requested.  TCP uses the extra space for administrative purposes
       and internal kernel structures, and the sysctl variables reflect the
       larger sizes compared to the actual TCP windows.  On individual
       connections, the socket buffer size must be set prior to the
       listen(2) or connect(2) call in order to take effect.  See socket(7)
       for more information.
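
       The difference between the requested and the reported buffer size can
       be observed with a short sketch such as the one below.  The function
       name set_rcvbuf is hypothetical, and the exact value reported depends
       on the kernel and on net.core.rmem_max.

```c
#include <sys/socket.h>

/* Request a receive buffer of `requested` bytes and return the size
 * the kernel reports afterwards, or -1 on error.  Call this before
 * listen(2) or connect(2) so the setting can take effect. */
int set_rcvbuf(int fd, int requested)
{
    int actual = 0;
    socklen_t len = sizeof(actual);

    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &actual, &len) < 0)
        return -1;
    return actual;  /* typically larger than `requested` on Linux */
}
```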

       TCP supports urgent data.  Urgent data is used to signal the receiver
       that some important message is part of the data stream and that it
       should be processed as soon as possible.  To send urgent data, specify
       the MSG_OOB option to send(2).  When urgent data is received, the
       kernel sends a SIGURG signal to the process or process group that has
       been set as the owner of the socket using the SIOCSPGRP or FIOSETOWN
       ioctls.  When the SO_OOBINLINE socket option is enabled, urgent data
       is put into the normal data stream (and can be tested for by the
       SIOCATMARK ioctl); otherwise it can be received only when the MSG_OOB
       flag is set for recv(2) or recvmsg(2).

       Linux 2.4 introduced a number of changes for improved  throughput  and
       scaling,	 as  well  as enhanced functionality.  Some of these features
       include support for zerocopy sendfile(2), Explicit Congestion  Notifi-
       cation, new management of TIME_WAIT sockets, keep-alive socket options
       and support for Duplicate SACK extensions.

ADDRESS FORMATS
       TCP is built on top of IP (see ip(7)).  The address formats defined by
       ip(7)  apply  to TCP.  TCP only supports point-to-point communication;
       broadcasting and multicasting are not supported.

SYSCTLS
       These variables can be accessed by the /proc/sys/net/ipv4/*  files  or
       with the sysctl(2) interface.  In addition, most IP sysctls also apply
       to TCP; see ip(7).
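
       A program can read one of these variables through the /proc
       interface.  The sketch below assumes the variable holds a single
       integer (vectors such as tcp_mem would need a different format), and
       the name read_tcp_sysctl is hypothetical.

```c
#include <stdio.h>

/* Read a single-integer TCP sysctl, for example "tcp_keepalive_time",
 * from /proc/sys/net/ipv4/.  Returns 0 on success, -1 on failure. */
int read_tcp_sysctl(const char *name, long *value)
{
    char path[256];
    FILE *fp;
    int n, ok;

    n = snprintf(path, sizeof(path), "/proc/sys/net/ipv4/%s", name);
    if (n < 0 || n >= (int)sizeof(path))
        return -1;
    fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%ld", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```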

       tcp_abort_on_overflow
	      Enable resetting connections if the listening service is too
	      slow and unable to keep up and accept them.  It is not enabled
	      by default, which means that if an overflow occurs because of
	      a burst, the connection will recover.  Enable this option
	      _only_ if you are really sure that the listening daemon cannot
	      be tuned to accept connections faster.  Enabling this option
	      can harm the clients of your server.

       tcp_adv_win_scale
	      Count  buffering	overhead  as  bytes/2^tcp_adv_win_scale	  (if
	      tcp_adv_win_scale	 >  0) or bytes-bytes/2^(-tcp_adv_win_scale),
	      if it is <= 0. The default is 2.

	      The socket receive buffer space is shared between the
	      application and kernel.  TCP maintains part of the buffer as
	      the TCP window; this is the size of the receive window
	      advertised to the other end.  The rest of the space is used as
	      the "application" buffer, used to isolate the network from
	      scheduling and application latencies.  The tcp_adv_win_scale
	      default value of 2 implies that the space used for the
	      application buffer is one fourth that of the total.
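
	      The arithmetic above can be written out as a small sketch.  The
	      function names are hypothetical, and the code mirrors only the
	      formula given here, not the kernel source.

```c
/* Bytes of a receive buffer counted as buffering overhead:
 * bytes/2^scale if scale > 0, else bytes - bytes/2^(-scale). */
long adv_win_overhead(long bytes, int scale)
{
    if (scale > 0)
        return bytes >> scale;
    return bytes - (bytes >> -scale);
}

/* Bytes left over for the advertised TCP window. */
long adv_win_window(long bytes, int scale)
{
    return bytes - adv_win_overhead(bytes, scale);
}
```

	      With the default receive buffer of 87380 bytes and a scale of
	      2, the overhead is 21845 bytes and the window is 65535 bytes.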

       tcp_app_win
	      This  variable  defines  how  many  bytes of the TCP window are
	      reserved for buffering overhead.

	      max(window/2^tcp_app_win, mss) bytes of the window are reserved
	      for the application buffer.  A value of 0 implies that no
	      amount is reserved.  The default value is 31.

       tcp_dsack
	      Enable RFC2883 TCP Duplicate SACK support.  It  is  enabled  by
	      default.

       tcp_ecn
	      Enable RFC2481 Explicit Congestion Notification.  It is not
	      enabled by default.  When enabled, connectivity to some
	      destinations could be affected due to older, misbehaving routers
	      along the path causing connections to be dropped.

       tcp_fack
	      Enable TCP Forward Acknowledgement support.  It is  enabled  by
	      default.

       tcp_fin_timeout
	      How  many	 seconds  to  wait  for a final FIN packet before the
	      socket is forcibly closed.  This is strictly a violation of the
	      TCP  specification,  but	required to prevent denial-of-service
	      (DoS) attacks.  The default value in 2.4 kernels	is  60,	 down
	      from 180 in 2.2.

       tcp_keepalive_intvl
	      The  number  of  seconds	between	 TCP  keep-alive probes.  The
	      default value is 75 seconds.

       tcp_keepalive_probes
	      The maximum number of TCP keep-alive probes to send before giv-
	      ing  up  and  killing the connection if no response is obtained
	      from the other end.  The default value is 9.

       tcp_keepalive_time
	      The number of seconds a connection needs to be idle before TCP
	      begins sending out keep-alive probes.  Keep-alives are only
	      sent when the SO_KEEPALIVE socket option is enabled.  The
	      default value is 7200 seconds (2 hours).  An idle connection is
	      terminated after approximately an additional 11 minutes (9
	      probes at intervals of 75 seconds) when keep-alive is enabled.

	      Note that underlying connection tracking mechanisms and  appli-
	      cation timeouts may be much shorter.
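
	      The time until a dead peer is detected follows directly from
	      the three keep-alive sysctls.  A one-line sketch, with a
	      hypothetical function name:

```c
/* Seconds from the last activity until an unresponsive peer is
 * declared dead: the idle time plus one interval per unanswered
 * probe. */
long keepalive_detection_time(long idle, long intvl, long probes)
{
    return idle + intvl * probes;
}
```

	      With the defaults (7200, 75, 9) this gives 7875 seconds, that
	      is, 675 seconds (about 11 minutes) beyond the idle time.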

       tcp_max_orphans
	      The  maximum  number of orphaned (not attached to any user file
	      handle) TCP sockets allowed in the system.  When this number is
	      exceeded,	 the  orphaned	connection  is reset and a warning is
	      printed.	This limit exists only to prevent simple DoS attacks.
	      Lowering	this  limit  is	 not  recommended. Network conditions
	      might require you to increase the number	of  orphans  allowed,
	      but  note	 that  each  orphan can eat up to ~64K of unswappable
	      memory.  The default initial value is set equal to  the  kernel
	      parameter	 NR_FILE.  This initial default is adjusted depending
	      on the memory in the system.

       tcp_max_syn_backlog
	      The maximum number of queued connection requests which have
	      still not received an acknowledgement from the connecting
	      client.  If this number is exceeded, the kernel will begin
	      dropping requests.  The default value of 256 is increased to
	      1024 when the memory present in the system is adequate or
	      greater (>= 128MB), and reduced to 128 for those systems with
	      very low memory (<= 32MB).  If this needs to be increased above
	      1024, it is recommended that TCP_SYNQ_HSIZE in
	      include/net/tcp.h be modified to keep
	      TCP_SYNQ_HSIZE*16<=tcp_max_syn_backlog, and that the kernel be
	      recompiled.

       tcp_max_tw_buckets
	      The maximum number of sockets in TIME_WAIT state allowed in the
	      system.	This limit exists only to prevent simple DoS attacks.
	      The default value of NR_FILE*2 is	 adjusted  depending  on  the
	      memory  in  the system.  If this number is exceeded, the socket
	      is closed and a warning is printed.

       tcp_mem
	      This is a vector of 3 integers: [low, pressure,  high].	These
	      bounds are used by TCP to track its memory usage.	 The defaults
	      are calculated at boot time from the amount of  available	 mem-
	      ory.

	      low  - TCP doesn’t regulate its memory allocation when the num-
	      ber of pages it has allocated globally is below this number.

	      pressure - when the amount of memory allocated by	 TCP  exceeds
	      this  number  of	pages,	TCP moderates its memory consumption.
	      This memory pressure state is exited once the number  of	pages
	      allocated falls below the low mark.

	      high  -  the  maximum  number of pages, globally, that TCP will
	      allocate.	 This value overrides any other limits imposed by the
	      kernel.

       tcp_orphan_retries
	      The maximum number of attempts made to probe the other end of a
	      connection which has been closed by our end.  The default value
	      is 8.

       tcp_reordering
	      The  maximum  a  packet can be reordered in a TCP packet stream
	      without TCP assuming packet loss and  going  into	 slow  start.
	      The  default  is 3.  It is not advisable to change this number.
	      This is a packet reordering detection metric designed to	mini-
	      mize  unnecessary back off and retransmits provoked by reorder-
	      ing of packets on a connection.

       tcp_retrans_collapse
	      Try to send full-sized  packets  during  retransmit.   This  is
	      enabled by default.

       tcp_retries1
	      The  number of times TCP will attempt to retransmit a packet on
	      an established connection normally, without the extra effort of
	      getting  the network layers involved.  Once we exceed this num-
	      ber of retransmits, we first have the network layer update  the
	      route  if	 possible before each new retransmit.  The default is
	      the RFC specified minimum of 3.

       tcp_retries2
	      The maximum number of times a TCP packet is retransmitted in
	      established state before giving up.  The default value is 15,
	      which corresponds to a duration of approximately 13 to 30
	      minutes, depending on the retransmission timeout.  The RFC1122
	      specified minimum limit of 100 seconds is typically deemed too
	      short.

       tcp_rfc1337
	      Enable  TCP  behaviour  conformant  with RFC 1337.  This is not
	      enabled by default.  When not enabled, if a RST is received  in
	      TIME_WAIT	 state, we close the socket immediately without wait-
	      ing for the end of the TIME_WAIT period.

       tcp_rmem
	      This is a vector of 3 integers:  [min,  default,	max].	These
	      parameters  are  used  by TCP to regulate receive buffer sizes.
	      TCP dynamically adjusts the size of the receive buffer from the
	      defaults	listed below, in the range of these sysctl variables,
	      depending on memory available in the system.

	      min - minimum size of the	 receive  buffer  used	by  each  TCP
	      socket.	The  default value is 4K, and is lowered to PAGE_SIZE
	      bytes in low memory systems.  This value is used to ensure that
	      in memory pressure mode, allocations below this size will still
	      succeed.	This is not used to bound the  size  of	 the  receive
	      buffer declared using SO_RCVBUF on a socket.

	      default  -  the  default	size  of the receive buffer for a TCP
	      socket.  This value overwrites the initial default buffer	 size
	      from  the	 generic global net.core.rmem_default defined for all
	      protocols.  The default value is 87380 bytes, and is lowered to
	      43689  in	 low  memory systems.  If larger receive buffer sizes
	      are desired, this value should  be  increased  (to  affect  all
	      sockets).	     To	    employ    large    TCP    windows,	  the
	      net.ipv4.tcp_window_scaling must be enabled (default).

	      max - the maximum size of the receive buffer used by  each  TCP
	      socket.	  This	 value	 does	not   override	 the   global
	      net.core.rmem_max.  This is not used to limit the size  of  the
	      receive  buffer  declared	 using	SO_RCVBUF  on  a socket.  The
	      default value of 87380*2 bytes is lowered to 87380 in low	 mem-
	      ory systems.

       tcp_sack
	      Enable  RFC2018  TCP Selective Acknowledgements.	It is enabled
	      by default.

       tcp_stdurg
	      Enable the strict RFC793	interpretation	of  the	 TCP  urgent-
	      pointer field.  The default is to use the BSD-compatible inter-
	      pretation of the urgent-pointer, pointing	 to  the  first	 byte
	      after the urgent data.  The RFC793 interpretation is to have it
	      point to the last byte of urgent data.  Enabling this option
	      may lead to interoperability problems.

       tcp_synack_retries
	      The maximum number of times a SYN/ACK segment for a passive TCP
	      connection will be retransmitted.	 This number  should  not  be
	      higher than 255. The default value is 5.

       tcp_syncookies
	      Enable TCP syncookies.  The kernel must be compiled with
	      CONFIG_SYN_COOKIES.  Send out syncookies when the syn backlog
	      queue  of	 a socket overflows.  The syncookies feature attempts
	      to protect a socket from a SYN flood attack.   This  should  be
	      used  as	a last resort, if at all.  This is a violation of the
	      TCP protocol, and conflicts with other areas of TCP such as TCP
	      extensions.   It can cause problems for clients and relays.  It
	      is not recommended as a tuning  mechanism	 for  heavily  loaded
	      servers  to  help	 with overloaded or misconfigured conditions.
	      For   recommended	  alternatives	  see	 tcp_max_syn_backlog,
	      tcp_synack_retries, tcp_abort_on_overflow.

       tcp_syn_retries
	      The maximum number of times initial SYNs for an active TCP con-
	      nection attempt will be retransmitted.  This value  should  not
	      be  higher than 255.  The default value is 5, which corresponds
	      to approximately 180 seconds.

       tcp_timestamps
	      Enable RFC1323 TCP timestamps.  This is enabled by default.

       tcp_tw_recycle
	      Enable fast recycling of TIME-WAIT sockets.  It is not  enabled
	      by default.  Enabling this option is not recommended since this
	      causes problems when working with NAT (Network Address Transla-
	      tion).

       tcp_window_scaling
	      Enable  RFC1323  TCP window scaling.  It is enabled by default.
	      This feature allows the use of a large window (> 64K) on a  TCP
	      connection,  should the other end support it.  Normally, the 16
	      bit window length field in the TCP  header  limits  the  window
	      size  to	less  than 64K bytes.  If larger windows are desired,
	      applications can increase the size of their socket buffers  and
	      the window scaling option will be employed.  If
	      tcp_window_scaling is disabled, TCP will not negotiate the use
	      of window scaling with the other end during connection setup.

       tcp_wmem
	      This  is	a  vector  of 3 integers: [min, default, max].	These
	      parameters are used by TCP to regulate send buffer sizes.	  TCP
	      dynamically  adjusts  the	 size  of  the	send  buffer from the
	      default values listed below, in the range of these sysctl vari-
	      ables, depending on memory available.

	      min  - minimum size of the send buffer used by each TCP socket.
	      The default value is 4K bytes.  This value is  used  to  ensure
	      that  in memory pressure mode, allocations below this size will
	      still succeed.  This is not used to bound the size of the	 send
	      buffer declared using SO_SNDBUF on a socket.

	      default - the default size of the send buffer for a TCP socket.
	      This value overwrites the initial default buffer size from  the
	      generic global net.core.wmem_default defined for all protocols.
	      The default value is 16K bytes.  If larger  send	buffer	sizes
	      are  desired,  this  value  should  be increased (to affect all
	      sockets).	 To employ large TCP  windows,	the  sysctl  variable
	      net.ipv4.tcp_window_scaling must be enabled (default).

	      max  -  the  maximum  size  of the send buffer used by each TCP
	      socket.	 This	value	does   not   override	the    global
	      net.core.wmem_max.   This	 is not used to limit the size of the
	      send buffer declared using SO_SNDBUF on a socket.	 The  default
	      value  is	 128K  bytes.	It is lowered to 64K depending on the
	      memory available in the system.

SOCKET OPTIONS
       To set or get a TCP socket option, call getsockopt(2) to read or	 set-
       sockopt(2)  to  write the option with the option level argument set to
       SOL_TCP.	 In addition, most SOL_IP socket options  are  valid  on  TCP
       sockets. For more information see ip(7).

       TCP_CORK
	      If  set,	don’t  send  out  partial frames.  All queued partial
	      frames are sent when the option is cleared again.	 This is use-
	      ful  for	prepending headers before calling sendfile(2), or for
	      throughput optimization.	This option cannot be  combined	 with
	      TCP_NODELAY.   This  option should not be used in code intended
	      to be portable.
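
	      The header-then-file pattern mentioned above might look like
	      the following sketch.  The function names are hypothetical.
	      IPPROTO_TCP is used as the option level; on Linux it has the
	      same value as SOL_TCP.  Short writes are not retried here.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set (on = 1) or clear (on = 0) TCP_CORK on a TCP socket. */
int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Queue a header and a file body while corked, then uncork so that
 * any remaining partial frame is sent.  Returns the number of file
 * bytes handed to sendfile(2), or -1 on error. */
ssize_t send_header_then_file(int sock, const void *hdr, size_t hdr_len,
                              int file_fd, size_t file_len)
{
    ssize_t sent = -1;

    if (set_cork(sock, 1) < 0)
        return -1;
    if (write(sock, hdr, hdr_len) == (ssize_t)hdr_len)
        sent = sendfile(sock, file_fd, NULL, file_len);
    set_cork(sock, 0);  /* clearing the option flushes queued data */
    return sent;
}
```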

       TCP_DEFER_ACCEPT
	      Allows a listener to be awakened only when data arrives on the
	      socket.  Takes an integer value (seconds), which can bound the
	      maximum number of attempts TCP will make to complete the
	      connection.  This option should not be used in code intended to
	      be portable.

       TCP_INFO
	      Used to collect information  about  this	socket.	  The  kernel
	      returns	a   struct   tcp_info	as   defined   in   the	 file
	      /usr/include/linux/tcp.h.	 This option should not	 be  used  in
	      code intended to be portable.
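
	      A minimal sketch of reading the structure follows.  It assumes
	      the Linux header <linux/tcp.h> named above is installed and
	      reads only the tcpi_state field; the function name is
	      hypothetical.

```c
#include <linux/tcp.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Return the TCP state number the kernel reports for this socket
 * (the tcpi_state field of struct tcp_info), or -1 on error. */
int tcp_state_of(int fd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    memset(&info, 0, sizeof(info));
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) < 0)
        return -1;
    return info.tcpi_state;
}
```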

       TCP_KEEPCNT
	      The  maximum  number of keepalive probes TCP should send before
	      dropping the connection.	This option should  not	 be  used  in
	      code intended to be portable.

       TCP_KEEPIDLE
	      The  time	 (in  seconds)	the  connection	 needs to remain idle
	      before TCP starts	 sending  keepalive  probes,  if  the  socket
	      option  SO_KEEPALIVE  has been set on this socket.  This option
	      should not be used in code intended to be portable.

       TCP_KEEPINTVL
	      The time (in  seconds)  between  individual  keepalive  probes.
	      This option should not be used in code intended to be portable.
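
	      The three keepalive options are usually set together with
	      SO_KEEPALIVE.  A sketch under that assumption, with a
	      hypothetical function name and IPPROTO_TCP as the option level
	      (the same value as SOL_TCP on Linux):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Enable keepalive on one socket and override the system-wide
 * timing: first probe after `idle` seconds, then one probe every
 * `intvl` seconds, giving up after `count` unanswered probes.
 * Returns 0 on success, -1 on error. */
int enable_keepalive(int fd, int idle, int intvl, int count)
{
    int on = 1;

    if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
                   &idle, sizeof(idle)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
                   &intvl, sizeof(intvl)) < 0 ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
                   &count, sizeof(count)) < 0)
        return -1;
    return 0;
}
```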

       TCP_LINGER2
	      The lifetime of orphaned FIN_WAIT2 state sockets.	 This  option
	      can  be used to override the system wide sysctl tcp_fin_timeout
	      on this socket.  This is not to be confused with the  socket(7)
	      level option SO_LINGER.  This option should not be used in code
	      intended to be portable.

       TCP_MAXSEG
	      The maximum segment size for outgoing  TCP  packets.   If	 this
	      option  is set before connection establishment, it also changes
	      the MSS value announced to the other end in the initial packet.
	      Values  greater  than  the  (eventual)  interface	 MTU  have no
	      effect.  TCP will also impose its minimum	 and  maximum  bounds
	      over the value provided.

       TCP_NODELAY
	      If  set, disable the Nagle algorithm.  This means that segments
	      are always sent as soon as possible, even if there  is  only  a
	      small  amount  of	 data.	 When not set, data is buffered until
	      there is a sufficient amount to send out, thereby avoiding  the
	      frequent	sending	 of small packets, which results in poor uti-
	      lization of the network.	This option cannot  be	used  at  the
	      same time as the option TCP_CORK.
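
	      Disabling the Nagle algorithm takes a single call.  The wrapper
	      name below is hypothetical:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable (on = 1) or re-enable (on = 0) the Nagle algorithm. */
int set_nodelay(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```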

       TCP_QUICKACK
	      Enable quickack mode if set or disable quickack mode if
	      cleared.  In quickack mode, acks are sent immediately, rather
	      than delayed if needed in accordance with normal TCP operation.
	      This flag is not permanent; it only enables a switch to or from
	      quickack mode.  Subsequent operation of the TCP protocol will
	      once again enter/leave quickack mode depending on internal
	      protocol processing and factors such as delayed ack timeouts
	      occurring and data transfer.  This option should not be used in
	      code intended to be portable.

       TCP_SYNCNT
	      Set  the	number of SYN retransmits that TCP should send before
	      aborting the attempt to connect.	It cannot exceed  255.	 This
	      option should not be used in code intended to be portable.

       TCP_WINDOW_CLAMP
	      Bound  the  size	of  the advertised window to this value.  The
	      kernel imposes  a	 minimum  size	of  SOCK_MIN_RCVBUF/2.	 This
	      option should not be used in code intended to be portable.

IOCTLS
       These ioctls can be accessed using ioctl(2).  The correct syntax is:

	      int value;
	      error = ioctl(tcp_socket, ioctl_type, &value);

       SIOCINQ
	      Returns the amount of queued unread data in the receive buffer.
	      Argument is a pointer to an integer.  The socket must not be in
	      LISTEN state, otherwise an error (EINVAL) is returned.

       SIOCATMARK
	      Returns true when all urgent data has already been received by
	      the user program.  This is used together with SO_OOBINLINE.
	      The argument is a pointer to an integer for the test result.

       SIOCOUTQ
	      Returns  the  amount of unsent data in the socket send queue in
	      the passed integer value pointer.	 The socket must  not  be  in
	      LISTEN state, otherwise an error (EINVAL) is returned.
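
       The queue-length ioctls can be wrapped as in the sketch below.  It
       assumes the Linux header <linux/sockios.h>, which defines SIOCINQ and
       SIOCOUTQ; the function names are hypothetical.

```c
#include <linux/sockios.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

/* Bytes queued in the receive buffer but not yet read, or -1 on
 * error (for example EINVAL on a listening socket). */
int tcp_unread_bytes(int fd)
{
    int value = 0;

    return ioctl(fd, SIOCINQ, &value) < 0 ? -1 : value;
}

/* Bytes still unsent in the socket send queue, or -1 on error. */
int tcp_unsent_bytes(int fd)
{
    int value = 0;

    return ioctl(fd, SIOCOUTQ, &value) < 0 ? -1 : value;
}
```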

ERROR HANDLING
       When  a	network	 error occurs, TCP tries to resend the packet.	If it
       doesn’t succeed after some time, either ETIMEDOUT or the last received
       error on this connection is reported.

       Some  applications  require a quicker error notification.  This can be
       enabled with the SOL_IP level IP_RECVERR	 socket	 option.   When	 this
       option  is  enabled, all incoming errors are immediately passed to the
       user program.  Use this option with care - it makes TCP less  tolerant
       to routing changes and other normal network conditions.
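
       Enabling the option is a single setsockopt(2) call.  In the sketch
       below the function name is hypothetical, and IPPROTO_IP is used as the
       option level; on Linux it has the same value as SOL_IP.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Ask for incoming errors to be passed to the program immediately. */
int enable_recverr(int fd)
{
    int on = 1;

    return setsockopt(fd, IPPROTO_IP, IP_RECVERR, &on, sizeof(on));
}
```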

NOTES
       When an error that occurred during connection setup is reported on a
       socket write, SIGPIPE is raised only if the SO_KEEPALIVE socket
       option is set.

       TCP has no real out-of-band data; it has urgent data.  In Linux this
       means that if the other end sends newer out-of-band data, the older
       urgent data is inserted as normal data into the stream (even when
       SO_OOBINLINE is not set).  This differs from BSD-based stacks.

       Linux uses the BSD compatible interpretation  of	 the  urgent  pointer
       field by default.  This violates RFC1122, but is required for interop-
       erability with other stacks.  It can  be	 changed  by  the  tcp_stdurg
       sysctl.

ERRORS
       EPIPE  The  other end closed the socket unexpectedly or a read is exe-
	      cuted on a shut down socket.

       ETIMEDOUT
	      The other end didn’t acknowledge retransmitted data after	 some
	      time.

       EAFNOTSUPPORT
	      Passed socket address type in sin_family was not AF_INET.

       Any  errors  defined for ip(7) or the generic socket layer may also be
       returned for TCP.

BUGS
       Not all errors are documented.
       IPv6 is not described.

VERSIONS
       Support	for  Explicit  Congestion  Notification,  zerocopy  sendfile,
       reordering support and some SACK extensions (DSACK) were introduced in
       2.4.  Support for forward acknowledgement (FACK), TIME_WAIT recycling,
       per connection keepalive socket options and sysctls were introduced in
       2.3.

       The default values and descriptions for	the  sysctl  variables	given
       above are applicable for the 2.4 kernel.

AUTHORS
       This  man  page	was originally written by Andi Kleen.  It was updated
       for 2.4 by Nivedita Singhvi with input from Alexey Kuznetsov’s
       Documentation/networking/ip-sysctls.txt document.

SEE ALSO
       socket(7),  socket(2),  ip(7),  bind(2),	 listen(2),  accept(2),	 con-
       nect(2),	 sendmsg(2),  recvmsg(2),  sendfile(2),	 sysctl(2),  getsock-
       opt(2).

       RFC793 for the TCP specification.
       RFC1122	for the TCP requirements and a description of the Nagle algo-
       rithm.
       RFC1323 for TCP timestamp and window scaling options.
       RFC1337 for a description of TIME_WAIT assassination hazards.
       RFC2481 for a description of Explicit Congestion Notification.
       RFC2581 for TCP congestion control algorithms.
       RFC2018 and RFC2883 for SACK and extensions to SACK.




Linux Man Page			  2003-08-21			       TCP(7)