Introduction to mlx RDMA NIC counters and metrics


Overview

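The mlx5 driver exposes a set of mlx NIC parameters and counters through Linux sysfs, under /sys/class/infiniband/mlx5_x/ports/1/counters and /sys/class/infiniband/mlx5_x/ports/1/hw_counters. Each counter records how many times a particular kind of event has occurred, such as a specific error or a received packet. Understanding these counters gives a clearer picture of how the mlx NIC is running, and monitoring them makes it faster to locate the root cause of RDMA errors.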
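Since each counter is a small file holding one integer, a snapshot can be collected with a few lines of Python. The following is a minimal sketch, not a definitive tool: the helper names `read_counters` and `read_port_counters`, the default device name `mlx5_0`, and the `root` parameter are assumptions made for illustration.

```python
from pathlib import Path


def read_counters(directory):
    """Return {counter_name: int_value} for every integer-valued file in `directory`.

    A missing directory yields an empty dict, so the helper can be called
    on hosts that do not expose every counter directory.
    """
    path = Path(directory)
    if not path.is_dir():
        return {}
    counters = {}
    for entry in sorted(path.iterdir()):
        if not entry.is_file():
            continue
        try:
            counters[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Skip files that cannot be read or do not hold a plain integer.
            continue
    return counters


def read_port_counters(device="mlx5_0", port=1, root="/sys/class/infiniband"):
    """Snapshot both counter directories of one port.

    `device`, `port`, and `root` are illustrative defaults (assumptions);
    adjust them to the device names present on the host.
    """
    base = Path(root) / device / "ports" / str(port)
    return {
        "counters": read_counters(base / "counters"),
        "hw_counters": read_counters(base / "hw_counters"),
    }
```

Calling `read_port_counters("mlx5_0")` on a host with an mlx5 device would return two dictionaries, one per directory. On a host without the device, both dictionaries are simply empty.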

hw_counters

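  • rnr_nak_retry_err: With the local host as the sender, the number of RNR NAK packets received from the peer. This counter increases when the receiving QP's SRQ has no free receive entries left.
  • out_of_buffer: With the local host as the receiver, the number of packets that arrived when no receive buffer was available. This counter increases when the local QP's SRQ has run out of posted receive entries.
  • out_of_sequence: The number of packets received out of order.
  • local_ack_timeout_err: The number of times a sent RDMA request timed out waiting for its acknowledgment.
  • packet_seq_err: The number of NAK (sequence error) packets received by the local host.
  • req_cqe_error: The number of CQEs that completed with an error on the local host, as the requester.
  • duplicate_request: The number of duplicate request packets received by the local host.
  • np_ecn_marked_roce_packets: The number of ECN-marked RoCE packets received by the local host, acting as the notification point (np).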
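These counters are cumulative, so a monitoring check usually cares about how much a counter grew between two samples rather than its absolute value. The sketch below assumes two snapshots are already available as plain dictionaries mapping counter name to value, however they were collected. The names `WATCHED`, `counter_deltas`, and `growing_counters` are hypothetical and exist only for this illustration.

```python
# Counters from the list above whose growth usually deserves a look.
WATCHED = (
    "rnr_nak_retry_err",
    "out_of_buffer",
    "out_of_sequence",
    "local_ack_timeout_err",
    "packet_seq_err",
    "req_cqe_error",
    "duplicate_request",
    "np_ecn_marked_roce_packets",
)


def counter_deltas(before, after):
    """Return {name: after - before} for counters present in both snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}


def growing_counters(before, after, watched=WATCHED):
    """Return the watched counters that increased between the two snapshots."""
    deltas = counter_deltas(before, after)
    return {name: deltas[name] for name in watched if deltas.get(name, 0) > 0}
```

For example, if `out_of_buffer` grows while `rnr_nak_retry_err` grows on the peer, the two counters together point at a receiver that is not posting receive buffers fast enough.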

counters

  • port_rcv_data: Total number of data octets, divided by 4 (that is, counted in 32-bit words), received on all VLs. This is a 64-bit counter.
  • port_rcv_packets: Total number of packets received. This may include packets containing errors. This is a 64-bit counter.
  • port_xmit_data: Total number of data octets, divided by 4 (that is, counted in 32-bit words), transmitted on all VLs. This is a 64-bit counter.
  • port_xmit_packets: Total number of packets transmitted on all VLs from this port. This may include packets with errors.
  • unicast_rcv_packets: Total number of unicast packets received, including unicast packets containing errors.
  • unicast_xmit_packets: Total number of unicast packets transmitted on all VLs from the port. This may include unicast packets with errors.
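Because port_rcv_data and port_xmit_data store the octet count divided by 4, a raw value has to be multiplied by 4 to get bytes before computing throughput. The following is a small sketch of that arithmetic; the function names `data_counter_to_bytes` and `throughput_gbps` are assumptions introduced here, not part of any driver or tool.

```python
def data_counter_to_bytes(value):
    """Convert a port_rcv_data / port_xmit_data reading to bytes (octets).

    The counter stores octets divided by 4, so multiply by 4 to undo that.
    """
    return value * 4


def throughput_gbps(before, after, interval_seconds):
    """Average throughput in Gbit/s between two readings of a data counter."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    transferred_bytes = data_counter_to_bytes(after - before)
    return transferred_bytes * 8 / interval_seconds / 1e9
```

As a worked example, a counter that advances by 312,500,000 in one second corresponds to 1.25 GB transferred, which is 10 Gbit/s.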

References

  1. Understanding mlx5 Linux Counters and Status Parameters
  2. Understanding mlx5 ethtool Counters
  3. Nak Errors