2021/04/30

Azure Stack HCI Troubleshooting "UseRdmaForStorage"

Azure Stack HCI Troubleshooting the Cluster Object "UseRdmaForStorage"


*** Disclaimer ***
s2d.dk is not responsible for any errors, or for the results obtained from the use of this information on s2d.dk. All information on this site is provided as "draft notes" and "as is", with no guarantee of completeness, accuracy, timeliness or of the results obtained from the use of this information. Always test in a lab setup before using any of the information in a production environment.
For any reference links to other websites, we encourage you to read the privacy statements of the third-party websites.
The names of actual companies and products mentioned herein may be the trademarks of their respective owners.
***

Last update: 2021.04.30

Get-HealthFault lists an event with the Reason:
The cluster detected network connectivity issues that prevent Storage Spaces Direct from working properly.
To ensure consistent performance and data safety, Storage Spaces Direct has stopped using remote direct memory access (RDMA) even if RDMA-capable hardware is present and enabled.
Storage Spaces Direct will continue to flow but with diminished performance using TCP/IP.
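
To find this fault on a running cluster, a minimal sketch could look like the following. It assumes you run it on a cluster node (or in a remote session to one) and that the fault's Reason text contains the string "RDMA", as in the message quoted above.

```powershell
# List all current health faults reported by the Health Service.
Get-HealthFault

# Narrow the list to faults whose Reason mentions RDMA and show every property.
Get-HealthFault |
    Where-Object { $_.Reason -match 'RDMA' } |
    Format-List *
```

If nothing is returned, the Health Service is not currently reporting an RDMA-related fault.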

Azure Stack HCI (20H2)

Azure Stack HCI (20H2) has a new toggle switch called "UseRdmaForStorage", which gets flipped to 0 (Off) when network issues are detected. The issue detection looks for spontaneous SMB disconnects and, if they occur often without an obvious explanation (e.g., the node restarting), the cluster stops relying on RDMA/RoCE as a precaution. If you are confident that the network issue is fixed, you can flip the setting back to 1 (On).

***

When the cluster disables RDMA and changes to TCP...

Microsoft-Windows-FailoverClustering/Operational Event 5163:

  • Cluster service disabled RDMA on the SMB instance for SBL IO on this node. All IO for this instance will now go over TCP connections only.
  • Cluster service disabled RDMA on the SMB instance for CSV IO on this node. All IO for this instance will now go over TCP connections only.

When you enable RDMA again...

Microsoft-Windows-FailoverClustering/Operational Event 5164:

  • Cluster service enabled RDMA on the SMB instance for SBL IO on this node.
  • Cluster service enabled RDMA on the SMB instance for CSV IO on this node.
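
To see when RDMA was switched off and on again, the two event IDs above can be pulled from every node into one timeline. This is a sketch, assuming it runs on a cluster node with the FailoverClusters module available and that remote event log access to the other nodes is allowed.

```powershell
# Event IDs taken from the text above: 5163 = RDMA disabled, 5164 = RDMA enabled.
$filter = @{
    LogName = 'Microsoft-Windows-FailoverClustering/Operational'
    Id      = 5163, 5164
}

# Query each node and merge the results into one timeline.
# SilentlyContinue hides the "no events found" error on nodes without matches.
Get-ClusterNode | ForEach-Object {
    Get-WinEvent -ComputerName $_.Name -FilterHashtable $filter -ErrorAction SilentlyContinue
} | Sort-Object TimeCreated |
    Format-Table TimeCreated, MachineName, Id, Message -AutoSize -Wrap
```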

***

Events that you will see in the minutes before the disabling of RDMA...

Microsoft-Windows-SMBClient/Connectivity Event 30804:

A network connection was disconnected.
Instance name: \Device\SmbVsa
Server name: x.x.x.x
Server address: x.x.x.x:445
Connection type: Rdma
InterfaceId: 
Guidance:
This indicates that the client's connection to the server was disconnected.
Frequent, unexpected disconnects when using an RDMA over Converged Ethernet (RoCE) adapter may indicate a network misconfiguration. RoCE requires Priority Flow Control (PFC) to be configured for every host, switch and router on the RoCE network. Failure to properly configure PFC will cause packet loss, frequent disconnects and poor performance.
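
To judge how often these disconnects happen, the 30804 events can be counted per hour. The sketch below runs on a single node; the 24-hour look-back window is an arbitrary choice for illustration, and the match on "Connection type: Rdma" relies on the message layout shown above.

```powershell
# Look-back window (assumption for this sketch; adjust as needed).
$since = (Get-Date).AddHours(-24)

Get-WinEvent -FilterHashtable @{
    LogName   = 'Microsoft-Windows-SMBClient/Connectivity'
    Id        = 30804
    StartTime = $since
} -ErrorAction SilentlyContinue |
    # Keep only disconnects of RDMA connections.
    Where-Object { $_.Message -match 'Connection type:\s*Rdma' } |
    # Bucket by hour so bursts of disconnects stand out.
    Group-Object { $_.TimeCreated.ToString('yyyy-MM-dd HH:00') } |
    Select-Object Name, Count
```

Hours with many disconnects that do not line up with a planned node restart are the ones worth comparing against the switch and NIC configuration.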

***

Health Service:

If Get-HealthFault reports that RDMA is off and Azure Stack HCI (20H2) is using TCP/IP:

Review the Cluster Objects:
  • Get-Cluster | fl *
  • Cluster Object: UseRdmaForStorage (1=On) or (0=Off)
If the value is 0, review and validate the RDMA/RoCE and DCB settings with tools like:
  • "netstat -xan" shows the RDMA SMB connections
  • "Validate-DCB"
  • "Perfmon /sys", then add counters for network and RDMA
Change the Cluster Object:
  • (Get-Cluster).UseRdmaForStorage=1
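
The review and reset steps above can be sketched as one short sequence. The host-side checks (Get-NetAdapterRdma, Get-NetQosFlowControl, Get-SmbMultichannelConnection) are additions that the list above does not mention; they complement, and do not replace, the tools named there. Run the checks on each node, and only re-enable RDMA once the network issue is confirmed fixed.

```powershell
# 1. Read the cluster property (1 = On, 0 = Off).
$cluster = Get-Cluster
"UseRdmaForStorage = $($cluster.UseRdmaForStorage)"

# 2. Host-side checks (run on each node).
# RDMA state per network adapter.
Get-NetAdapterRdma | Format-Table Name, Enabled -AutoSize
# Priority Flow Control state per 802.1p priority.
Get-NetQosFlowControl | Format-Table Priority, Enabled -AutoSize
# SMB connections used by the storage bus layer (SBL) instance.
Get-SmbMultichannelConnection -SmbInstance SBL

# 3. Only after the network issue is fixed: turn RDMA back on.
if ($cluster.UseRdmaForStorage -eq 0) {
    $cluster.UseRdmaForStorage = 1
}

# 4. Confirm the new value.
(Get-Cluster).UseRdmaForStorage
```

If the underlying problem is still present, expect the cluster to set the value back to 0 and log Event 5163 again.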
Links: