irdma RDMA FreeBSD* driver for Intel(R) Ethernet Controller E810
================================================================
September 20, 2022

Contents
========

- Overview
- Prerequisites
- Building and Installation
- Testing
- Configuration
- DCB Configuration in FreeBSD
- DSCP Configuration
- Interoperability
- Known Issues
- Support

================================================================================


Overview
--------

- Support for iWARP and RoCEv2 protocols.
    - See Testing chapter for configuration details.
- Support for IB_PD_UNSAFE_GLOBAL_RKEY
    - Intel recommends this feature be used only in a closed network in which
      all applications running on all systems on the network are trusted.
      If the global_rkey is compromised, it may result in unfettered access
      to any RDMA-aware user space application on the network.

Prerequisites
-------------

- FreeBSD version 12.2, 13.0 or later.
- Kernel configuration:
    Please add the following kernel configuration options:
        include GENERIC
        options OFED
        options OFED_DEBUG_INIT
        options COMPAT_LINUXKPI
        options SDP
        options IPOIB_CM

        nodevice ice
- For the irdma driver to work, an if_ice module with an RDMA interface
  is required. The interface is available in if_ice version 0.28.2 or later.
  The RDMA interface may be turned on or off using the if_ice tunable:
    hw.ice.irdma
  To change it, put:
    hw.ice.irdma=1
  in the /boot/loader.conf file. A reboot is needed for the change to take
  effect. The RDMA interface is turned on by default (the value is 1).

Building and Installation
-------------------------

1. Untar ice-<version>.tar.gz and irdma-<version>.tar.gz:
    tar -xf ice-<version>.tar.gz
    tar -xf irdma-<version>.tar.gz
2. Install the if_ice driver:
    cd ice-<version>/
    make
    make install
3. Install the irdma driver:
    cd irdma-<version>/src/
    make clean
    make ICE_DIR=$PATH_TO_ICE/ice-<version>/
    make install

Testing
-------
1. To load the irdma driver, run:
     kldload irdma
   If if_ice is not already loaded, the system will load it automatically.
   If the irdma driver does not load, check whether the value of
     sysctl hw.ice.irdma
   is 1. To change the value, put:
     hw.ice.irdma=1
   in /boot/loader.conf and reboot.
2. To check that the driver was loaded, run:
     sysctl -a | grep infiniband
   Typically, if everything goes well, around 190 entries per PF will appear.
3. Each interface of the card may work in either iWARP or RoCEv2 mode.
   To select the mode, add:
     dev.irdma<interface_number>.roce_enable=1
   to /boot/loader.conf, where <interface_number> is the number of the ice
   interface on which the RoCEv2 protocol should be enabled.

   For instance:
     dev.irdma0.roce_enable=0
     dev.irdma1.roce_enable=1
   keeps iWARP mode on interface ice0 and enables RoCEv2 mode on interface
   ice1. RoCEv2 is the default mode.

   To check the irdma roce_enable status, run:
     sysctl dev.irdma<interface_number>.roce_enable
   for instance:
     sysctl dev.irdma2.roce_enable
   A returned value of '0' indicates iWARP mode; a value of '1' indicates
   RoCEv2 mode.

   Note: An interface configured in one mode will not be able to connect
   to a node configured in the other mode.
   Note: RoCEv2 requires a proper DCB configuration in order to ensure
   lossless Ethernet. Improperly configured DCB may lead to significant
   performance loss or connectivity issues. See the DCB Configuration
   section for an example of how to configure DCB on a FreeBSD system.
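   The roce_enable check above can be wrapped in a small shell helper that
   translates the sysctl value into a mode name. This is an illustrative
   sketch only; the irdma_mode function is hypothetical and not part of the
   driver:

```shell
#!/bin/sh
# Illustrative helper (hypothetical, not part of the driver): translate the
# value returned by `sysctl -n dev.irdma<N>.roce_enable` into a mode name.
irdma_mode() {
    case "$1" in
        0) echo "iWARP" ;;
        1) echo "RoCEv2" ;;
        *) echo "unknown"; return 1 ;;
    esac
}

irdma_mode 0   # prints "iWARP"
irdma_mode 1   # prints "RoCEv2"
```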
4. Enable flow control in the ice driver:
     sysctl dev.ice.<interface_num>.fc=3
   Enable flow control on the switch your system is connected to. See your
   switch documentation for details.
   Note: The FC setting and PFC are mutually exclusive; if both are set,
   only one of them will actually work.
5. The source code for krping software is provided with the kernel in
   /usr/src/sys/contrib/rdma/krping/. To compile the software, change
   directory to /usr/src/sys/modules/rdma/krping/ and invoke the following:
     make clean
     make
     make install
     kldload krping
6. Start a krping server on one machine:
    echo size=64,count=1,port=6601,addr=100.0.0.189,server > /dev/krping
7. Connect a client from another machine:
    echo size=64,count=1,port=6601,addr=100.0.0.189,client > /dev/krping
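
   The krping parameter string used in the two steps above can also be
   composed by a small helper. This is an illustrative sketch; the
   krping_cmd function is hypothetical and not part of krping:

```shell
#!/bin/sh
# Illustrative helper (hypothetical, not part of krping): compose the
# parameter string written to /dev/krping for a given role and address.
krping_cmd() {
    role=$1; addr=$2; port=${3:-6601}
    echo "size=64,count=1,port=${port},addr=${addr},${role}"
}

krping_cmd server 100.0.0.189   # prints "size=64,count=1,port=6601,addr=100.0.0.189,server"
```

   On a real system the result would be redirected to /dev/krping, e.g.
   echo "$(krping_cmd server 100.0.0.189)" > /dev/krping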

==============================================================================


Configuration
-------------

The following sysctl options are available:
- dev.irdma<interface_number>.debug
    defines level of debug messages.
    Typical value: 1 for errors only, 0x7fffffff for full debug.
- dev.irdma<interface_number>.roce_enable
    enables RoCEv2 protocol usage on the <interface_number> interface.
    By default, the RoCEv2 protocol is used.
- dev.irdma<interface_number>.dcqcn_enable
    enables the DCQCN algorithm for RoCEv2.
    Note: "roce_enable" must also be set for this sysctl to take effect.
    Note: The change may be set at any time, but it will be applied only to
          newly created QPs.
- dev.irdma<interface_number>.dcqcn_cc_cfg_valid
    indicates that all DCQCN parameters are valid and should be updated
    in registers or QP context.
    Note: "roce_enable" must also be set for this tunable to take effect.
- dev.irdma<interface_number>.dcqcn_min_dec_factor
    The minimum factor by which the current transmit rate can be
    changed when processing a CNP. Value is given as a percentage
    (1-100).
    Note: "roce_enable" and "dcqcn_cc_cfg_valid" must also be set for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_min_rate_MBps
    The minimum rate limit, in Mbits per second.
    Note: "roce_enable" and "dcqcn_cc_cfg_valid" must also be set for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_F
    The number of times to stay in each stage of bandwidth recovery.
    Note: "roce_enable" and "dcqcn_cc_cfg_valid" must also be set for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_T
    The number of microseconds that should elapse before increasing the
    CWND in DCQCN mode.
    Note: "roce_enable" and "dcqcn_cc_cfg_valid" must also be set for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_B
    The number of bytes to transmit before updating CWND in DCQCN mode.
    Note: "roce_enable" and "dcqcn_cc_cfg_valid" must also be set for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_rai_factor
    The number of MSS to add to the congestion window in additive
    increase mode.
    Note: "roce_enable" and "dcqcn_cc_cfg_valid" must also be set for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_hai_factor
    The number of MSS to add to the congestion window in hyperactive
    increase mode.
    Note: "roce_enable" and "dcqcn_cc_cfg_valid" must also be set for this
          tunable to take effect.
- dev.irdma<interface_number>.dcqcn_rreduce_mperiod
    The minimum time between 2 consecutive rate reductions for a single
    flow. Rate reduction will occur only if a CNP is received during
    the relevant time interval.
    Note: "roce_enable" and "dcqcn_cc_cfg_valid" must also be set for this
          tunable to take effect.
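
The DCQCN tunables above are applied per irdma interface. The following is
an illustrative sketch that only prints the sysctl commands that would
apply a set of such settings; the dcqcn_cmds function is hypothetical, and
the values shown are example placeholders, not recommendations:

```shell
#!/bin/sh
# Illustrative sketch (hypothetical helper): emit the sysctl commands that
# would apply DCQCN-related settings on a given irdma interface.
dcqcn_cmds() {
    iface=$1; shift
    for kv in "$@"; do
        echo "sysctl dev.irdma${iface}.${kv}"
    done
}

dcqcn_cmds 0 roce_enable=1 dcqcn_enable=1 dcqcn_cc_cfg_valid=1
```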


DCB Configuration in FreeBSD
----------------------------
    In order for RoCEv2 traffic to work without problems, DCB should be
    configured on the ice driver.
    DCB (Data Center Bridging) allows a host to accept configuration from
    its link partner (willing mode), or to set its own configuration
    (non-willing mode).

    DCB on E810 devices is intended to be used with a switch or non-willing
    partner with DCBX/LLDP that will handle DCB configuration.

    Note: E810 family devices do not support the combination of FW DCB and
          non-willing mode (i.e., the firmware will not try to configure
          the link partner).

    Note: FreeBSD 13.0 and earlier do not have SW DCB support, which means
          they cannot currently support the combination of SW DCB and
          willing mode, because there is no software to accept the
          configuration and handle negotiation and adapter configuration.

    Currently, the driver supports "willing" mode when the FW LLDP agent
    (DCBX) is enabled, and "non-willing" mode otherwise. The DCB
    configuration may be limited in the latter case.

    Note: The kernel needs to have https://reviews.freebsd.org/D31485
          (iflib: Allow drivers to determine which queue to TX on) applied
          in order to support DCB.

    Configuration of FW LLDP Agent:
        sysctl dev.ice.<iface_num>.fw_lldp_agent=1

        1 enables FW-LLDP, 0 disables FW-LLDP.
        For the ice driver to be able to send lldp packets, you need to disable
        the lldp filter:
            kenv hw.ice.debug.enable_tx_lldp_filter=0

    View/Edit DCB ETS Settings:
        sysctl dev.ice.<iface_num>.ets_min_rate

            In "willing" mode (fw_lldp_agent=1), displays the current ETS
            bandwidth table. In "non-willing" mode, displays and allows setting
            the table.
            The sysctl accepts an input that consists of a comma-separated list
            of numbers [0-100], that must all add up to 100. These correspond
            to the minimum bandwidth allocations allowed for each traffic
            class.

            For instance:
                sysctl dev.ice.<iface_num>.ets_min_rate=30,10,10,10,10,10,10,10
            This configures every traffic class except TC 0 to a minimum of
            10% bandwidth; TC 0 instead has a 30% minimum bandwidth.

            Note: When setting ets_min_rate, non-zero values are allowed
            only for TCs that are in use in up2tc_map. Therefore, up2tc_map
            should be set before ets_min_rate.
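
            Since the list must add up to exactly 100, it can be worth
            checking a candidate value before writing it with sysctl. This
            is an illustrative sanity check only; the check_ets_rates
            function is hypothetical and not part of the driver:

```shell
#!/bin/sh
# Illustrative sanity check (hypothetical, not part of the driver): verify
# that a comma-separated ets_min_rate list adds up to exactly 100.
check_ets_rates() {
    total=0
    for v in $(echo "$1" | tr ',' ' '); do
        total=$((total + v))
    done
    if [ "$total" -eq 100 ]; then
        echo "ok"
    else
        echo "sum is $total, must be 100"
        return 1
    fi
}

check_ets_rates 30,10,10,10,10,10,10,10   # prints "ok"
```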

    User Priority to Traffic Class Mapping:
        sysctl dev.ice.<iface_num>.up2tc_map

            In "willing" mode (fw_lldp_agent=1), displays the current ETS
            priority assignment table. In "non-willing" mode, displays and
            allows setting the table.
            Input must be in this format: 0,1,2,3,4,5,6,7
            Where the first number is the TC for UP0, second number is the TC
            for UP1, etc.

    Priority Flow Control Configuration:
        sysctl dev.ice.<iface_num>.pfc

            In "willing" mode (fw_lldp_agent=1), displays the current Priority
            Flow Control configuration. In "non-willing" mode, displays and
            allows setting the configuration.

            Input/Output is in this format: 0xff
            where each bit of the hexadecimal number indicates whether PFC
            is enabled on the corresponding Traffic Class.
            For instance:
                sysctl dev.ice.<iface_num>.pfc=0x81
            enables PFC on TC0 and TC7.
            Settings for disabled TCs with this sysctl are ignored.

            This sysctl only writes the new configuration when the adapter
            is in non-willing mode.
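
            The bitmask can be built from a list of traffic-class numbers
            with a small helper. This is an illustrative sketch only; the
            pfc_mask function is hypothetical and not part of the driver:

```shell
#!/bin/sh
# Illustrative helper (hypothetical, not part of the driver): build the pfc
# bitmask from a list of traffic-class numbers (0-7).
pfc_mask() {
    mask=0
    for tc in "$@"; do
        mask=$((mask | (1 << tc)))
    done
    printf '0x%02x\n' "$mask"
}

pfc_mask 0 7   # prints "0x81", i.e. PFC enabled on TC0 and TC7
```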

    Debug sysctls
        sysctl dev.ice.<iface_num>.debug.local_dcbx_cfg
        sysctl dev.ice.<iface_num>.debug.remote_dcbx_cfg
        sysctl dev.ice.<iface_num>.debug.pf_vsi_cfg

        Print out more information when the ICE_DBG_DCB flag is set
        (debug_mask=0x400) in the ice driver.

DSCP Configuration
------------------
    In order to enable the DSCP (Differentiated Services Code Point)
    feature, some additional configuration is needed.

    Enable DSCP mode in the ice driver:
      sysctl dev.ice.<iface_num>.pfc_mode=1

    Set traffic classes for DSCP:
      sysctl dev.ice.<iface_num>.dscp2tc_map.0-7
    for instance, to assign four DSCPs to four traffic classes:
      sysctl dev.ice.0.dscp2tc_map.0-7=0,1,2,3,0,0,0,0

    Assign minimum guaranteed bandwidth to each TC using:
      sysctl dev.ice.<iface_num>.ets_min_rate
    for instance, on ice0 assign 70% to TC0, 10% each to TC1, TC2, and TC3,
    and 0% to the others:
      sysctl dev.ice.0.ets_min_rate=70,10,10,10,0,0,0,0
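
    With the example map above, DSCP 1 goes to TC1, DSCP 2 to TC2, and so
    on. The lookup can be sketched as follows; this is illustrative only,
    and the dscp_to_tc function is hypothetical, not part of the driver:

```shell
#!/bin/sh
# Illustrative lookup (hypothetical, not part of the driver): given the
# value written to dscp2tc_map.0-7 and a DSCP in that range, print the
# traffic class it maps to (cut fields are 1-indexed).
dscp_to_tc() {
    map=$1; dscp=$2
    echo "$map" | cut -d, -f$((dscp + 1))
}

dscp_to_tc 0,1,2,3,0,0,0,0 2   # prints "2"
```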


Interoperability
----------------
- irdma and libirdma versions
  Starting with irdma 1.1.3, there is a feature to check which versions of
  irdma and libirdma are actually in use. Prior to that version, irdma
  notes the current version of the kernel driver in dmesg, but only on
  load; when irdma has been loaded for a very long time, that information
  can be lost.
  To check the current irdma version, it is enough to use:
    sysctl dev.irdma<interface_number>.drv_ver
  To check the current libirdma version, one may use the shared function
  libirdma_query_device(). For example, a simple program used to determine
  the version can look like this:

    #include <stdio.h>
    #include <libirdma/irdma_uquery.h>

    int main(void)
    {
            struct libirdma_device attrs;

            libirdma_query_device(NULL, &attrs);
            printf("libirdma version in use: %s %zu %d\n",
                   attrs.lib_ver, sizeof(attrs.lib_ver),
                   attrs.query_ver);

            return 0;
    }

  And it can be compiled using:
    cc -I/usr/src/contrib/ofed/ -lirdma \
       libirdma_ver_test.c -o libirdma_ver_test
  Please note that, in case of any problems with user space tools, it is
  always worth checking whether the versions of the kernel driver and
  libirdma match.

Known Issues
------------
- krping is unable to bind to an address belonging to a VLAN interface.
  This appears to be a problem in rdma_copy_addr() in ib_addr.c.
- During extensive traffic in RoCEv2 mode with multiple QPs, CQP Error 8029
  may occur. This can result in a kernel panic.
- During extensive traffic in RoCEv2 mode using send operations, multiple
  AE 0x50a errors may occur, even when PFC is enabled and correctly
  configured.
- During traffic in RoCEv2 mode using a large number of QPs (>64), a PE
  Critical Error may occur. In such circumstances the card may become
  inoperable, and a reboot is required to restore RDMA capability.

Support
-------
For general information, go to the Intel support website at:
www.intel.com/support/

or the Intel Wired Networking project hosted by Sourceforge at:
http://sourceforge.net/projects/e1000

If an issue is identified with the released source code on a supported
kernel with a supported adapter, email the specific information related to the
issue to e1000-rdma@lists.sourceforge.net



================================================================================


License
-------

This software is available to you under a choice of one of two
licenses. You may choose to be licensed under the terms of the GNU
General Public License (GPL) Version 2, available from the file
COPYING in the main directory of this source tree, or the
OpenFabrics.org BSD license below:

  Redistribution and use in source and binary forms, with or
  without modification, are permitted provided that the following
  conditions are met:

  - Redistributions of source code must retain the above
    copyright notice, this list of conditions and the following
    disclaimer.

  - Redistributions in binary form must reproduce the above
    copyright notice, this list of conditions and the following
    disclaimer in the documentation and/or other materials
    provided with the distribution.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.


================================================================================


Trademarks
----------

(c) Intel Corporation. Intel, the Intel logo, and other Intel marks are
trademarks of Intel Corporation or its subsidiaries. 
Other names and brands may be claimed as the property of others. 


