opensm

TriggerTek Logo
abcdefghijklmnopqrstuvwxyz_
OPENSM(8)		      OpenIB Management			    OPENSM(8)



NAME
       opensm - InfiniBand subnet manager and administration (SM/SA)


SYNOPSIS
       opensm  [-c(ache-options)]  [-g(uid)[=]<GUID  in	 hex>] [-l(mc) <LMC>]
       [-p(riority)  <PRIORITY>]  [-smkey  <SM_Key>]  [-r(eassign_lids)]  [-R
       <engine name> | --routing_engine <engine name>] [-z | --connect_roots]
       [-M <file name> | --lid_matrix_file <file name>]	 [-U  <file  name>  |
       --ucast_file  <file  name>]  [-S	 |  --sadb_file	 <file	name>]	[-a |
       --root_guid_file <path to file>] [-u | --cn_guid_file <path to  file>]
       [-o(nce)]  [-s(weep) <interval>] [-t(imeout) <milliseconds>] [-maxsmps
       <number>] [-console [off | local | socket | loopback]]  [-console-port
       <port>]	  [-i(gnore-guids)    <equalize-ignore-guids-file>]   [-f   |
       --log_file]  [-L	 |  --log_limit	 <size	in  MB>]  [-e(rase_log_file)]
       [-P(config)]   [-Q   |	--qos]	 [-N   |   --no_part_enforce]  [-y  |
       --stay_on_fatal]	 [-B  |	 --daemon]  [-I	 |  --inactive]	  [--perfmgr]
       [--perfmgr_sweep_time_s	 <seconds>]   [--prefix_routes_file   <path>]
       [-v(erbose)] [-V] [-D <flags>] [-d(ebug) <number>] [-h(elp)] [-?]


DESCRIPTION
       opensm is an InfiniBand compliant Subnet Manager	 and  Administration,
       and runs on top of OpenIB.

       opensm  provides an implementation of an InfiniBand Subnet Manager and
       Administration. Such a software entity is required to run for in order
       to  initialize  the InfiniBand hardware (at least one per each Infini-
       Band subnet).

       opensm also now contains an experimental version of a performance man-
       ager as well.

       opensm  defaults	 were designed to meet the common case usage on clus-
       ters with up to a few hundred  nodes.  Thus,  in	 this  default	mode,
       opensm  will scan the IB fabric, initialize it, and sweep occasionally
       for changes.

       opensm attaches to a specific IB port on the local machine and config-
       ures  only the fabric connected to it. (If the local machine has other
       IB ports, opensm will ignore the	 fabrics  connected  to	 those	other
       ports).	If  no	port  is  specified,  it will select the first "best"
       available port.

       opensm can present the available ports and prompt for a port number to
       attach to.

       By  default,  the  run  is  logged to two files: /var/log/messages and
       /var/log/opensm.log.  The first file will register only general	major
       events,	whereas	 the  second will include details of reported errors.
       All errors reported in this second file should be treated  as  indica-
       tors  of	 IB  fabric  health issues.  (Note that when a fatal and non-
       recoverable error occurs, opensm will exit.)  Both  log	files  should
       include the message "SUBNET UP" if opensm was able to setup the subnet
       correctly.


OPTIONS
       -c, --cache-options
	      Write out a list of all tunable  OpenSM  parameters,  including
	      their  current values from the command line as well as defaults
	      for   others,   into   the    file    OSM_CACHE_DIR/opensm.opts
	      (OSM_CACHE_DIR defaults to /var/cache/opensm if the correspond-
	      ing environment variable is not set). The options file is	 then
	      used  for	 subsequent  OpenSM  invocations but any command line
	      options take precedence.

       -g, --guid
	      This option specifies the local  port  GUID  value  with	which
	      OpenSM  should  bind.  OpenSM may be bound to 1 port at a time.
	      If GUID given is 0, OpenSM displays a  list  of  possible	 port
	      GUIDs  and  waits	 for user input.  Without -g, OpenSM tries to
	      use the default port.

       -l, --lmc
	      This option specifies the subnet’s LMC value.   The  number  of
	      LIDs  assigned to each port is 2^LMC.  The LMC value must be in
	      the range 0-7.  LMC values > 0  allow  multiple  paths  between
	      ports.  LMC values > 0 should only be used if the subnet topol-
	      ogy actually provides multiple paths between ports, i.e. multi-
	      ple   interconnects   between  switches.	 Without  -l,  OpenSM
	      defaults to LMC = 0, which allows	 one  path  between  any  two
	      ports.

       -p, --priority
	      This  option specifies the SM´s PRIORITY.	 This will effect the
	      handover cases, where master is chosen by	 priority  and	GUID.
	      Range  goes  from	 0 (default and lowest priority) to 15 (high-
	      est).

       -smkey This option specifies the SM´s SM_Key  (64  bits).   This	 will
	      effect SM authentication.

       -r, --reassign_lids
	      This  option  causes  OpenSM to reassign LIDs to all end nodes.
	      Specifying -r on a running subnet may disrupt  subnet  traffic.
	      Without  -r,  OpenSM  attempts to preserve existing LID assign-
	      ments resolving multiple use of same LID.

       -R, --routing_engine
	      This option chooses routing engine instead of Min Hop algorithm
	      (default).  Supported engines: minhop, updn, file, ftree, lash,
	      dor

       -z, --connect_roots
	      This option enforces a routing engine (currently up/down	only)
	      to  make	connectivity between root switches and in this way to
	      be fully IBA complaint. In many cases this can  violate  "pure"
	      deadlock free algorithm, so use it carefully.

       -M, --lid_matrix_file
	      This option specifies the name of the lid matrix dump file from
	      where switch lid matrices (min hops tables will be loaded.

       -U, --ucast_file
	      This option specifies the name of the unicast  dump  file	 from
	      where switch forwarding tables will be loaded.

       -S, --sadb_file
	      This  option  specifies  the  name  of the SA DB dump file from
	      where SA database will be loaded.

       -a, --root_guid_file
	      Set the root nodes for the Up/Down or  Fat-Tree  routing	algo-
	      rithm  to the guids provided in the given file (one to a line).

       -u, --cn_guid_file
	      Set the compute nodes for the Fat-Tree routing algorithm to the
	      guids provided in the given file (one to a line).

       -o, --once
	      This  option  causes  OpenSM to configure the subnet once, then
	      exit.  Ports remain in the ACTIVE state.

       -s, --sweep
	      This option specifies the	 number	 of  seconds  between  subnet
	      sweeps.  Specifying -s 0 disables sweeping.  Without -s, OpenSM
	      defaults to a sweep interval of 10 seconds.

       -t, --timeout
	      This option specifies the time in milliseconds used for  trans-
	      action  timeouts.	  Specifying -t 0 disables timeouts.  Without
	      -t, OpenSM defaults to a timeout value of 200 milliseconds.

       -maxsmps
	      This option specifies the number of VL15 SMP  MADs  allowed  on
	      the  wire at any one time.  Specifying -maxsmps 0 allows unlim-
	      ited outstanding SMPs.  Without -maxsmps, OpenSM defaults to  a
	      maximum of 4 outstanding SMPs.

       -console [off | local | socket | loopback]
	      This  option  brings up the OpenSM console (default off).	 Note
	      that the socket and loopback options will only be available  if
	      OpenSM was built with --enable-console-socket.

       -console-port <port>
	      Specify  an  alternate  telnet  port  for	 the  socket  console
	      (default 10000).	Note that this option only appears if  OpenSM
	      was built with --enable-console-socket.

       -i, -ignore-guids <equalize-ignore-guids-file>
	      This  option  provides  the  means to define a set of ports (by
	      guid) that will be ignored by the link load equalization	algo-
	      rithm.

       -x, --honor_guid2lid
	      This  option  forces OpenSM to honor the guid2lid file, when it
	      comes  out  of  Standby  state,  if  such	 file  exists	under
	      OSM_CACHE_DIR, and is valid.  By default, this is FALSE.

       -f, --log_file
	      This  option defines the log to be the given file.  By default,
	      the log goes to /var/log/opensm.log.  For	 the  log  to  go  to
	      standard output use -f stdout.

       -L, --log_limit <size in MB>
	      This option defines maximal log file size in MB. When specified
	      the log file will be truncated upon reaching this limit.

       -e, --erase_log_file
	      This option will cause deletion of the log file (if  it  previ-
	      ously exists). By default, the log file is accumulative.

       -P, --Pconfig
	      This  option defines the optional partition configuration file.
	      The default name is ´/etc/opensm/opensm-partitions.conf´.

       --prefix_routes_file=path
	      Prefix routes control  how  the  SA  responds  to	 path  record
	      queries  for  off-subnet	DGIDs.	By default, the SA fails such
	      queries. The PREFIX ROUTES section below describes  the  format
	      of    the	   configuration   file.    The	  default   path   is
	      /etc/ofa/opensm-prefix-routes.conf.

       -Q, --qos
	      This option enables QoS setup. It is disabled by default.

       -N, --no_part_enforce
	      This option disables partition enforcement on  switch  external
	      ports.

       -y, --stay_on_fatal
	      This  option  will cause SM not to exit on fatal initialization
	      issues: if SM discovers duplicated guids or  a  12x  link	 with
	      lane  reversal  badly configured.	 By default, the SM will exit
	      on these errors.

       -B, --daemon
	      Run in daemon mode - OpenSM will run in the background.

       -I, --inactive
	      Start SM in inactive rather than init SM	state.	 This  option
	      can  be  used  in	 conjunction  with the perfmgr so as to run a
	      standalone performance manager without SM/SA.  However, this is
	      NOT currently implemented in the performance manager.

       -perfmgr
	      Enable  the perfmgr.  Only takes effect if --enable-perfmgr was
	      specified at configure time.

       -perfmgr_sweep_time_s <seconds>
	      Specify the sweep time for the performance manager  in  seconds
	      (default	is  180	 seconds).   Only  takes  effect if --enable-
	      perfmgr was specified at configure time.

       -v, --verbose
	      This option increases the log verbosity level.  The  -v  option
	      may  be  specified  multiple times to further increase the ver-
	      bosity level.  See the -D option for more information about log
	      verbosity.

       -V     This  option  sets  the  maximum verbosity level and forces log
	      flushing.	 The -V option is equivalent to ´-D 0xFF -d 2´.	  See
	      the -D option for more information about log verbosity.

       -D     This  option  sets the log verbosity level.  A flags field must
	      follow  the  -D  option.	 A  bit	 set/clear   in	  the	flags
	      enables/disables a specific log level as follows:

	       BIT    LOG LEVEL ENABLED
	       ----   -----------------
	       0x01 - ERROR (error messages)
	       0x02 - INFO (basic messages, low volume)
	       0x04 - VERBOSE (interesting stuff, moderate volume)
	       0x08 - DEBUG (diagnostic, high volume)
	       0x10 - FUNCS (function entry/exit, very high volume)
	       0x20 - FRAMES (dumps all SMP and GMP frames)
	       0x40 - ROUTING (dump FDB routing information)
	       0x80 - currently unused.

	      Without  -D, OpenSM defaults to ERROR + INFO (0x3).  Specifying
	      -D 0 disables all messages.  Specifying  -D  0xFF	 enables  all
	      messages	(see -V).  High verbosity levels may require increas-
	      ing the transaction timeout with the -t option.

       -d, --debug
	      This option specifies a debug option.  These  options  are  not
	      normally	needed.	  The  number  following -d selects the debug
	      option to enable as follows:

	       OPT   Description
	       ---    -----------------
	       -d0  - Ignore other SM nodes
	       -d1  - Force single threaded dispatching
	       -d2  - Force log flushing after each log message
	       -d3  - Disable multicast support

       -h, --help
	      Display this usage info then exit.

       -?     Display this usage info then exit.


ENVIRONMENT VARIABLES
       The following environment variables control opensm behavior:

       OSM_TMP_DIR - controls the directory in which the temporary files gen-
       erated  by  opensm  are	created.  These files are: opensm-subnet.lst,
       opensm.fdbs,  and  opensm.mcfdbs.  By  default,	this   directory   is
       /var/log.

       OSM_CACHE_DIR  - opensm stores certain data to the disk such that sub-
       sequent	runs  are  consistent.	The   default	directory   used   is
       /var/cache/opensm.  The following files are included in it:

	guid2lid - stores the LID range assigned to each GUID

	opensm.opts - an optional file that holds a complete set of opensm
		      configuration options


NOTES
       When opensm receives a HUP signal, it starts a new heavy sweep as if a
       trap was received or a topology change was found.

       Also, SIGUSR1 can be used to trigger a reopen  of  /var/log/opensm.log
       for logrotate purposes.


PARTITION CONFIGURATION
       The   default   name   of  OpenSM  partitions  configuration  file  is
       ´/etc/ofa/opensm-partitions.conf´. The default may be changed by using
       --Pconfig (-P) option with OpenSM.

       The  default  partition will be created by OpenSM unconditionally even
       when  partition	configuration  file  does  not	exist  or  cannot  be
       accessed.

       The  default partition has P_Key value 0x7fff. OpenSM´s port will have
       full membership in default partition. All other end  ports  will	 have
       partial membership.

       File Format

       Comments:

       Line  content  followed	after ´#´ character is comment and ignored by
       parser.

       General file format:

       <Partition Definition>:<PortGUIDs list> ;

       Partition Definition:

       [PartitionName][=PKey][,flag[=value]][,defmember=full|limited]

	PartitionName - string, will be used with logging. When omitted
			empty string will be used.
	PKey	      - P_Key value for this partition. Only low 15 bits will
			be used. When omitted will be autogenerated.
	flag	       - used to indicate IPoIB capability of this partition.
	defmember=full|limited - specifies default membership for port guid
			list. Default is limited.

       Currently recognized flags are:

	ipoib	    - indicates that this partition may be used for IPoIB, as
		      result IPoIB capable MC group will be created.
	rate=<val>  - specifies rate for this IPoIB MC group
		      (default is 3 (10GBps))
	mtu=<val>   - specifies MTU for this IPoIB MC group
		      (default is 4 (2048))
	sl=<val>    - specifies SL for this IPoIB MC group
		      (default is 0)
	scope=<val> - specifies scope for this IPoIB MC group
		      (default is 2 (link local)).  Multiple scope settings
		      are permitted for a partition.

       Note  that  values  for	rate,  mtu,  and scope should be specified as
       defined in the IBTA specification (for example, mtu=4 for 2048).

       PortGUIDs list:

	PortGUID	 - GUID of partition member EndPort. Hexadecimal
			   numbers should start from 0x, decimal numbers
			   are accepted too.
	full or limited	 - indicates full or limited membership for this
			   port.  When omitted (or unrecognized) limited
			   membership is assumed.

       There are two useful keywords for PortGUID definition:

	- ’ALL’ means all end ports in this subnet.
	- ’SELF’ means subnet manager’s port.

       Empty list means no ports in this partition.

       Notes:

       White space is permitted between delimiters (’=’, ’,’,’:’,’;’).

       The line can be wrapped after ’:’ followed after Partition  Definition
       and between.

       PartitionName does not need to be unique, PKey does need to be unique.
       If PKey is repeated then those partition configurations will be merged
       and first PartitionName will be used (see also next note).

       It is possible to split partition configuration in more than one defi-
       nition, but then PKey should be explicitly specified  (otherwise	 dif-
       ferent PKey values will be generated for those definitions).

       Examples:

	Default=0x7fff : ALL, SELF=full ;

	NewPartition , ipoib : 0x123456=full, 0x3456789034=limi, 0x2134af2306
       ;

	YetAnotherOne = 0x300 : SELF=full ;
	YetAnotherOne = 0x300 : ALL=limited ;

	ShareIO = 0x80 , defmember=full : 0x123451, 0x123452;
	# 0x123453, 0x123454 will be limited
	ShareIO = 0x80 : 0x123453, 0x123454, 0x123455=full;
	# 0x123456, 0x123457 will be limited
	ShareIO	 =   0x80   :	defmember=limited   :	0x123456,   0x123457,
       0x123458=full;
	ShareIO = 0x80 , defmember=full : 0x123459, 0x12345a;
	ShareIO	  =  0x80  ,  defmember=full  :	 0x12345b,  0x12345c=limited,
       0x12345d;


       Note:

       The following rule is equivalent to how OpenSM used to  run  prior  to
       the partition manager:

	Default=0x7fff,ipoib:ALL=full;


QOS CONFIGURATION
       There  are  a  set  of QoS related low-level configuration parameters.
       All these parameter names are prefixed by "qos_"	 string.  Here	is  a
       full list of these parameters:

	qos_max_vls    - The maximum number of VLs that will be on the subnet
	qos_high_limit - The limit of High Priority component of VL
			 Arbitration table (IBA 7.6.9)
	qos_vlarb_low  - Low priority VL Arbitration table (IBA 7.6.9)
			 template
	qos_vlarb_high - High priority VL Arbitration table (IBA 7.6.9)
			 template
			 Both VL arbitration templates are pairs of
			 VL and weight
	qos_sl2vl      - SL2VL Mapping table (IBA 7.6.6) template. It is
			 a list of VLs corresponding to SLs 0-15 (Note
			 that VL15 used here means drop this SL)

       Typical default values (hard-coded in OpenSM initialization) are:

	qos_max_vls=15
	qos_high_limit=0
	qos_vlarb_low=0:0,1:4,2:4,3:4,4:4,5:4,6:4,7:4,8:4,9:4,10:4,11:4,12:4,13:4,14:4
	qos_vlarb_high=0:4,1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0,9:0,10:0,11:0,12:0,13:0,14:0
	qos_sl2vl=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

       The syntax is compatible with rest of OpenSM configuration options and
       values may be stored in OpenSM config file (cached options file).

       In  addition  to	 the  above, we may define separate QoS configuration
       parameters sets for various target types.  As  targets,	we  currently
       support	CAs,  routers,	switch	external ports, and switch’s enhanced
       port 0. The names of  such  specialized	parameters  are	 prefixed  by
       "qos_<type>_"  string.  Here is a full list of the currently supported
       sets:

	qos_ca_	 - QoS configuration parameters set for CAs.
	qos_rtr_ - parameters set for routers.
	qos_sw0_ - parameters set for switches’ port 0.
	qos_swe_ - parameters set for switches’ external ports.

       Examples:
	qos_sw0_max_vls=2
	qos_ca_sl2vl=0,1,2,3,5,5,5,12,12,0,
	qos_swe_high_limit=0


PREFIX ROUTES
       Prefix routes control how the SA responds to path record	 queries  for
       off-subnet  DGIDs.   By default, the SA fails such queries.  Note that
       IBA does not specify how the SA should obtain off-subnet	 path  record
       information.   The  prefix routes configuration is meant as a stop-gap
       until the specification is completed.

       Each line in the configuration file is a 64-bit prefix followed	by  a
       64-bit  GUID, separated by white space.	The GUID specifies the router
       port on the local subnet that will handle the prefix.  Blank lines are
       ignored, as is anything between a # character and the end of the line.
       The prefix and GUID are both in	hex,  the  leading  0x	is  optional.
       Either,	or both, can be wild-carded by specifying an asterisk instead
       of an explicit prefix or GUID.

       When responding to a path record query for an off-subnet DGID,  opensm
       searches for the first prefix match in the configuration file.  There-
       fore, the order of the lines in the configuration file is important: a
       wild-carded  prefix at the beginning of the configuration file renders
       all subsequent lines useless.  If there is no match, then opensm fails
       the  query.  It is legal to repeat prefixes in the configuration file,
       opensm will return the path to the first available matching router.  A
       configuration  file  with a single line where both prefix and GUID are
       wild-carded means that a path record query specifying  any  off-subnet
       DGID should return a path to the first available router.	 This config-
       uration yields the  same	 behaviour  formerly  achieved	by  compiling
       opensm with -DROUTER_EXP.


ROUTING
       OpenSM now offers five routing engines:

       1.   Min	 Hop Algorithm - based on the minimum hops to each node where
       the path length is optimized.

       2.  UPDN Unicast routing algorithm - also based on the minimum hops to
       each  node,  but	 it  is	 constrained to ranking rules. This algorithm
       should be chosen if the subnet is not a pure Fat	 Tree,	and  deadlock
       may occur due to a loop in the subnet.

       3.   Fat	 Tree  Unicast	routing	 algorithm - this algorithm optimizes
       routing for congestion-free "shift" communication pattern.  It  should
       be chosen if a subnet is a symmetrical Fat Trees of various types, not
       just K-ary-N-Trees: non-constant K, not fully staffed, any CBB  ratio.
       Similar to UPDN, Fat Tree routing is constrained to ranking rules.

       4.  LASH	 unicast  routing  algorithm - uses Infiniband virtual layers
       (SL) to provide deadlock-free shortest-path routing  while  also	 dis-
       tributing  the  paths between layers. LASH is an alternative deadlock-
       free topology-agnostic routing algorithm to the non-minimal UPDN algo-
       rithm avoiding the use of a potentially congested root node.

       5. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
       avoids port equalization except for redundant links between  the	 same
       two  switches.  This provides deadlock free routes for hypercubes when
       the fabric is cabled as a hypercube and for meshes when	cabled	as  a
       mesh (see details below).

       OpenSM also supports a file method which can load routes from a table.
       See ´Modular Routing Engine´ for more information on this.

       The basic routing algorithm is comprised of two stages:

       1. MinHop matrix calculation
	  How many hops are required to get from each port to each LID ?
	  The algorithm to fill these tables is different if you run standard
       (min hop) or Up/Down.
	  For standard routing, a "relaxation" algorithm is used to propagate
       min hop from every destination LID through neighbor switches
	  For Up/Down routing, a BFS from  every  target  is  used.  The  BFS
       tracks  link  direction (up or down) and avoid steps that will perform
       up after a down step was used.

       2. Once MinHop matrices exist, each switch is  visited  and  for	 each
       target LID a decision is made as to what port should be used to get to
       that LID.
	  This step is common to standard and Up/Down routing. Each port  has
       a counter counting the number of target LIDs going through it.
	  When	there  are  multiple  alternative ports with same MinHop to a
       LID, the one with less previously assigned ports is selected.
	  If LMC > 0, more checks  are	added:	Within	each  group  of	 LIDs
       assigned to same target port,
	  a. use only ports which have same MinHop
	  b. first prefer the ones that go to different systemImageGuid (then
       the previous LID of the same LMC group)
	  c. if none - prefer those which go through another NodeGuid
	  d. fall back to the number of paths  method  (if  all	 go  to	 same
       node).

       Effect of Topology Changes

       OpenSM  will  preserve  existing routing in any case where there is no
       change in the fabric switches unless the -r  (--reassign_lids)  option
       is specified.

       -r
       --reassign_lids
		 This option causes OpenSM to reassign LIDs to all
		 end nodes. Specifying -r on a running subnet
		 may disrupt subnet traffic.
		 Without -r, OpenSM attempts to preserve existing
		 LID assignments resolving multiple use of same LID.

       If  a link is added or removed, OpenSM does not recalculate the routes
       that do not have to change. A route has to change if the	 port  is  no
       longer UP or no longer the MinHop. When routing changes are performed,
       the same algorithm for balancing the routes is invoked.

       In the case of using the file based routing, any topology changes  are
       currently  ignored  The ’file’ routing engine just loads the LFTs from
       the file specified, with no reaction to real topology. Obviously, this
       will not be able to recheck LIDs (by GUID) for disconnected nodes, and
       LFTs for non-existent switches  will  be	 skipped.  Multicast  is  not
       affected by ’file’ routing engine (this uses min hop tables).


       Min Hop Algorithm

       The  Min Hop algorithm is invoked when neither UPDN or the file method
       are specified.

       The Min Hop algorithm is divided into two stages: computation of	 min-
       hop  tables  on every switch and LFT output port assignment. Link sub-
       scription is also equalized with the ability to override based on port
       GUID. The latter is supplied by:

       -i <equalize-ignore-guids-file>
       -ignore-guids <equalize-ignore-guids-file>
		 This option provides the means to define a set of ports
		 (by guid) that will be ignored by the link load
		 equalization algorithm. Note that only endports (CA,
		 switch port 0, and router ports) and not switch external
		 ports are supported.

       LMC awareness routes based on (remote) system or switch basis.


       Purpose of UPDN Algorithm

       The  UPDN algorithm is designed to prevent deadlocks from occurring in
       loops of the subnet. A loop-deadlock is a situation in which it is  no
       longer  possible	 to send data between any two hosts connected through
       the loop. As such, the UPDN routing algorithm should be	used  if  the
       subnet  is  not a pure Fat Tree, and one of its loops may experience a
       deadlock (due, for example, to high pressure).

       The UPDN algorithm is based on the following main stages:

       1.  Auto-detect root nodes - based on  the  CA  hop  length  from  any
       switch in the subnet, a statistical histogram is built for each switch
       (hop num vs number of occurrences). If the histogram reflects  a	 spe-
       cific  column  (higher  than  others)  for  a certain node, then it is
       marked as a root node. Since the algorithm is statistical, it may  not
       find  any  root	nodes. The list of the root nodes found by this auto-
       detect stage is used by the ranking process stage.

	   Note 1: The user can override the node list manually.
	   Note 2: If this stage cannot find any root nodes, and the user did
		   not specify a guid list file, OpenSM defaults back to the
		   Min Hop routing algorithm.

       2.   Ranking  process  -	 All root switch nodes (found in stage 1) are
       assigned a rank of 0. Using the BFS algorithm, the rest of the  switch
       nodes in the subnet are ranked incrementally. This ranking aids in the
       process of enforcing rules that ensure loop-free paths.

       3.  Min Hop Table setting - after ranking is done, a BFS algorithm  is
       run  from  each (CA or switch) node in the subnet. During the BFS pro-
       cess, the FDB table of each switch node traversed by BFS	 is  updated,
       in reference to the starting node, based on the ranking rules and guid
       values.

       At the end of the process, the updated  FDB  tables  ensure  loop-free
       paths through the subnet.

       Note: Up/Down routing does not allow LID routing communication between
       switches that are located inside spine "switch systems".	  The  reason
       is  that	 there	is no way to allow a LID route between them that does
       not break the Up/Down rule.  One ramification of this is that you can-
       not run SM on switches other than the leaf switches of the fabric.


       UPDN Algorithm Usage

       Activation through OpenSM

       Use  ’-R updn’ option (instead of old ’-u’) to activate the UPDN algo-
       rithm.  Use ’-a <root_guid_file>’ for adding an UPDN  guid  file	 that
       contains	 the root nodes for ranking.  If the ‘-a’ option is not used,
       OpenSM uses its auto-detect root nodes algorithm.

       Notes on the guid list file:

       1.   A valid guid file specifies one guid in each line. Lines with  an
       invalid format will be discarded.
       2.    The  user	should	specify the root switch guids. However, it is
       also possible to specify CA guids; OpenSM will use  the	guid  of  the
       switch  (if  it	exists)	 that connects the CA to the subnet as a root
       node.


       Fat-tree Routing Algorithm

       The fat-tree algorithm optimizes	 routing  for  "shift"	communication
       pattern.	  It  should be chosen if a subnet is a symmetrical or almost
       symmetrical fat-tree of various types.  It supports not just  K-ary-N-
       Trees, by handling for non-constant K, cases where not all leafs (CAs)
       are present, any CBB  ratio.   As  in  UPDN,  fat-tree  also  prevents
       credit-loop-deadlocks.

       If  the	root  guid  file  is not provided (’-a’ or ’--root_guid_file’
       options), the topology has to be pure fat-tree that complies with  the
       following rules:
	 - Tree rank should be between two and eight (inclusively)
	 - Switches of the same rank should have the same number
	   of UP-going port groups*, unless they are root switches,
	   in which case the shouldn’t have UP-going ports at all.
	 - Switches of the same rank should have the same number
	   of DOWN-going port groups, unless they are leaf switches.
	 - Switches of the same rank should have the same number
	   of ports in each UP-going port group.
	 - Switches of the same rank should have the same number
	   of ports in each DOWN-going port group.
	 - All the CAs have to be at the same tree level (rank).

       If  the	root  guid  file is provided, the topology doesn’t have to be
       pure fat-tree, and it should only comply with the following rules:
	 - Tree rank should be between two and eight (inclusively)
	 - All the Compute Nodes** have to be at the same tree level  (rank).
	   Note that non-compute node CAs are allowed here to be at different
	   tree ranks.

       * ports that are connected to the same remote switch are referenced as
       ´port group´.

       **   list  of  compute  nodes  (CNs)  can  be  specified	 by  ´-u´  or
       ´--cn_guid_file´ OpenSM options.

       Topologies that do not comply cause a fallback  to  min	hop  routing.
       Note  that this can also occur on link failures which cause the topol-
       ogy to no longer be "pure" fat-tree.

       Note that although fat-tree algorithm supports trees with  non-integer
       CBB  ratio,  the routing will not be as balanced as in case of integer
       CBB ratio.  In addition to this, although the  algorithm	 allows	 leaf
       switches to have any number of CAs, the closer the tree is to be fully
       populated, the more effective the "shift" communication	pattern	 will
       be.   In	 general,  even	 if the root list is provided, the closer the
       topology to a pure and symmetrical  fat-tree,  the  more	 optimal  the
       routing will be.

       The  algorithm also dumps compute node ordering file (opensm-ftree-ca-
       order.dump) in the same directory where the OpenSM log  resides.	 This
       ordering	 file  provides the CN order that may be used to create effi-
       cient communication pattern, that will match the routing tables.

       Activation through OpenSM

       Use ’-R ftree’ option to activate the  fat-tree	algorithm.   Use  ’-a
       <root_guid_file>’  to  provide  root  nodes  for	 ranking. If the ‘-a’
       option is not used, routing algorithm will detect roots automatically.
       Use  ’-u	 <root_cn_file>’ to provide the list of compute nodes. If the
       ‘-u’ option is not used, all the CAs are considered as compute  nodes.

       Note:  LMC > 0 is not supported by fat-tree routing. If this is speci-
       fied, the default routing algorithm is invoked instead.


       LASH Routing Algorithm

       LASH is an acronym for LAyered SHortest Path Routing. It is  a  deter-
       ministic	 shortest path routing algorithm that enables topology agnos-
       tic deadlock-free routing within communication networks.

       When computing the routing function, LASH analyzes the network  topol-
       ogy for the shortest-path routes between all pairs of sources / desti-
       nations and groups these paths into virtual layers in such a way as to
       avoid deadlock.

       Note  LASH analyzes routes and ensures deadlock freedom between switch
       pairs. The link from HCA between and switch does not need virtual lay-
       ers as deadlock will not arise between switch and HCA.

       In more detail, the algorithm works as follows:

       1)  LASH	 determines  the  shortest-path between all pairs of source /
       destination switches. Note, LASH ensures the same SL is used  for  all
       SRC/DST - DST/SRC pairs and there is no guarantee that the return path
       for a given DST/SRC will be the reverse of the route SRC/DST.

       2) LASH then begins an SL assignment process where a route is assigned
       to  a layer (SL) if the addition of that route does not cause deadlock
       within that layer. This is achieved by  maintaining  and	 analysing  a
       channel	dependency  graph for each layer. Once the potential addition
       of a path could lead to deadlock, LASH opens a new layer and continues
       the process.

       3)  Once	 this  stage has been completed, it is highly likely that the
       first layers processed will contain more paths than the	latter	ones.
       To  better  balance the use of layers, LASH moves paths from one layer
       to another so that the number of paths in each layer averages out.

       Note, the implementation of LASH in opensm attempts to use as few lay-
       ers  as	possible.  This	 number can be less than the number of actual
       layers available.

       In general LASH is a very flexible algorithm.  It  can,	for  example,
       reduce  to Dimension Order Routing in certain topologies, it is topol-
       ogy agnostic and fares well in the face of faults.

       It has been shown that for both regular and irregular topologies, LASH
       outperforms  Up/Down. The reason for this is that LASH distributes the
       traffic more evenly through a network, avoiding the bottleneck  issues
       related to a root node and always routes shortest-path.

       The algorithm was developed by Simula Research Laboratory.


       Use ’-R lash -Q ’ option to activate the LASH algorithm.

       Note: QoS support has to be turned on in order that SL/VL mappings are
       used.

       Note: LMC > 0 is not supported by the LASH routing. If this is  speci-
       fied, the default routing algorithm is invoked instead.


       DOR Routing Algorithm

       The  Dimension  Order  Routing algorithm is based on the Min Hop algo-
       rithm and so uses shortest paths.  Instead of  spreading	 traffic  out
       across  different  paths	 with  the same shortest distance, it chooses
       among the available shortest paths based on an ordering of dimensions.
       Each  port must be consistently cabled to represent a hypercube dimen-
       sion or a mesh dimension.  Paths are grown from a destination back  to
       a  source using the lowest dimension (port) of available paths at each
       step.  This provides the ordering necessary to avoid  deadlock.	 When
       there  are  multiple links between any two switches, they still repre-
       sent only one dimension and traffic is  balanced	 across	 them  unless
       port  equalization is turned off.  In the case of hypercubes, the same
       port must be used throughout the fabric	to  represent  the  hypercube
       dimension and match on both ends of the cable.  In the case of meshes,
       the dimension should consistently use the same pair of ports, one port
       on one end of the cable, and the other port on the other end, continu-
       ing along the mesh dimension.

       Use ’-R dor’ option to activate the DOR algorithm.


       Routing References

       To learn more about deadlock-free routing, see the  article  "Deadlock
       Free  Message  Routing  in Multiprocessor Interconnection Networks" by
       William J Dally and Charles L Seitz (1985).

       To learn more about the up/down algorithm, see the article  "Effective
       Strategy to Compute Forwarding Tables for InfiniBand Networks" by Jose
       Carlos Sancho, Antonio Robles,  and  Jose  Duato	 at  the  Universidad
       Politecnica de Valencia.

       To  learn  more about LASH and the flexibility behind it, the require-
       ment for layers, performance comparisons to other algorithms, see  the
       following articles:

       "Layered	 Routing  in  Irregular Networks", Lysne et al, IEEE Transac-
       tions on Parallel and  Distributed  Systems,  VOL.16,  No12,  December
       2005.

       "Routing	 for  the ASI Fabric Manager", Solheim et al. IEEE Communica-
       tions Magazine, Vol.44, No.7, July 2006.

       "Layered Shortest Path (LASH) Routing in Irregular  System  Area	 Net-
       works",	Skeie et al. IEEE Computer Society Communication Architecture
       for Clusters 2002.


       Modular Routine Engine

       Modular routing engine structure allows for the ease of "plugging" new
       routing modules.

       Currently,  only	 unicast  callbacks  are  supported. Multicast can be
       added later.

       One existing routing module is up-down "updn", which may be  activated
       with ’-R updn’ option (instead of old ’-u’).

       General usage is: $ opensm -R ’module-name’

       There  is  also	a  trivial  routing  module which is able to load LFT
       tables from a dump file.

       Main features:

	- this will load switch LFTs and/or LID matrices (min hops tables)
	- this will load switch LFTs according to the path entries introduced
	  in the dump file
	-  no  additional  checks  will	 be  performed (such as "is port con-
       nected",
	  etc.)
	- in case when fabric LIDs were changed this will try to reconstruct
	  LFTs correctly if endport GUIDs are represented in the dump file
	  (in order to disable this, GUIDs may be removed from the dump file
	   or zeroed)

       The dump file format is compatible with output of ’ibroute’  util  and
       for whole fabric can be generated with dump_lfts.sh script.

       To activate file based routing module, use:

	 opensm -R file -U /path/to/dump_file

       If  the	dump_file  is  not  found or is in error, the default routing
       algorithm is utilized.

       The ability to dump switch lid matrices (aka min hops tables) to	 file
       and later to load these is also supported.

       The  usage  is  similar to unicast forwarding tables loading from dump
       file (introduced by ’file’ routing engine), but new  lid	 matrix	 file
       name  should be specified by -M or --lid_matrix_file option. For exam-
       ple:

	 opensm -R file -M ./opensm-lid-matrix.dump

       The dump file is named ´opensm-lid-matrix.dump´ and will be  generated
       in   standard   opensm  dump  directory	(/var/log  by  default)	 when
       OSM_LOG_ROUTING logging flag is set.

       When routing engine ’file’ is activated, but dump file is  not  speci-
       fied  or not cannot be open default lid matrix algorithm will be used.

       There is also a switch forwarding tables dumper which generates a file
       compatible  with	 dump_lfts.sh  output. This file can be used as input
       for forwarding tables loading by ’file’ routing engine.	Both  or  one
       of options -U and -M can be specified together with ´-R file´.


FILES
       /etc/opensm/prefix-routes.conf
	      default prefix routes file.


AUTHORS
       Hal Rosenstock
	      <hal@xsigo.com>

       Sasha Khapyorsky
	      <sashak@voltaire.com>

       Eitan Zahavi
	      <eitan@mellanox.co.il>

       Yevgeny Kliteynik
	      <kliteyn@mellanox.co.il>

       Thomas Sodring
	      <tsodring@simula.no>



OpenIB				 Aug 16, 2007			    OPENSM(8)