PowerScale: InfiniBand Queue Pairs may be Stuck in a Degraded State After a Primary Path is Brought Down

Summary: This article describes an issue where a node's InfiniBand queue pairs remain stuck in a MIGRATED state after the primary path goes down and comes back up (flaps), affecting OneFS versions 9.1 and 9.2.1. ...

Article Content


Symptoms

InfiniBand queue pairs are backend sockets with two paths for redundancy. If both paths (int-a and int-b) are up, SDP resolves the path and marks the queue pair as "ARMED." An armed queue pair indicates that a node has a high availability (redundant) path to another node. However, if one of a node's paths is not available (down), the queue pair may be marked as "MIGRATED." A migrated queue pair indicates that a node does not have a high availability (redundant) path to another node.

When the primary path goes down, the node switches to the alternate path as expected, with no impact, and the node's queue pairs are marked as migrated. When the primary path is brought back up, however, the node is unable to resolve the path again, and the queue pairs are left in a migrated state.

If the new path (the alternate path now in use) is then brought down, the node briefly splits. After that, however, both paths become resolvable and the queue pair is marked as armed (highly available or redundant).

The following message is seen in the messages file of any node experiencing this issue:
/boot/kernel.amd64/kernel: [sdp_apm.c:178](pid 0="kernel")(tid=XXXXXX) SDP: alternate path record lookup failed: -22

All of the following must be true to determine that a node is experiencing this issue:

  • The messages file contains recent occurrences of the following log line.
/boot/kernel.amd64/kernel: [sdp_apm.c:178](pid 0="kernel")(tid=XXXXXX) SDP: alternate path record lookup failed: -22
  • The Isilon Health Check Framework reports the "infiniband_socket_state" check for the node as not being ARMED.
infiniband_socket_state - WARNING

* At least one socket is not armed. This node is not properly connected to the following nodes: ['X, Y, Z']
  • There are no backend interfaces that are inactive.
# isi_for_array -s ifconfig ib0 | grep inactive; isi_for_array -s ifconfig ib1 | grep inactive
  • There are no backend interfaces that are OACTIVE.
# isi_for_array -s ifconfig ib0 | grep OACTIVE; isi_for_array -s ifconfig ib1 | grep OACTIVE
isilon-1:  -> :isilon-2       |A| _OK_        |B| _OK_        |F| _OK_
isilon-1:  -> :isilon-3       |A| _OK_        |B| _OK_        |F| _OK_
isilon-1:  -> :isilon-4       |A| _OK_        |B| _OK_        |F| _OK_
isilon-1:  -> :isilon-5       |A| _OK_        |B| _OK_        |F| _OK_
isilon-1:  -> :isilon-6       |A| _OK_        |B| _OK_        |F| _OK_
isilon-1:  -> :isilon-7       |A| _OK_        |B| _OK_        |F| _OK_
isilon-1:  -> :isilon-8       |A| _OK_        |B| _OK_        |F| _OK_
isilon-1:  -> :isilon-9       |A| _OK_        |B| _OK_        |F| _OK_
isilon-1:  -> :isilon-10      |A| _OK_        |B| _OK_        |F| _OK_
If you believe you are running into this issue, open a support case for an Isilon Engineer to investigate and confirm.
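
The log and interface checks above can be scripted as a rough first pass. The following is a minimal sketch, not a supported tool: the function names count_sdp_failures and flagged_backend_lines are hypothetical, and the caller is assumed to supply the path to the messages file and the ifconfig output. It does not replace the health check or an engineer's confirmation.

```shell
#!/bin/sh
# Hypothetical helper: count occurrences of the SDP alternate path
# failure message in a log file. $1 is the path to the messages file
# (assumption: the caller knows where that file lives on the node).
count_sdp_failures() {
    # -F treats the message as a fixed string; "|| true" keeps the
    # exit status at 0 when the count is zero.
    grep -cF -- 'SDP: alternate path record lookup failed: -22' "$1" || true
}

# Hypothetical helper: read ifconfig output on stdin and print only the
# lines that mention "inactive" or "OACTIVE". No output means neither
# condition was found, which matches the symptom list above.
flagged_backend_lines() {
    grep -E 'inactive|OACTIVE' || true
}
```

As a usage example, count_sdp_failures /var/log/messages prints the number of matching lines (the path is an assumption), and isi_for_array -s ifconfig ib0 | flagged_backend_lines prints nothing when no ib0 interface is inactive or OACTIVE.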

Cause

After the link has flapped, Sockets Direct Protocol (SDP) retains stale information, which results in an invalid argument error (the -22 in the log message) when it attempts to resolve the path.

Resolution

There is no fix for version 9.1.0. 
The issue is resolved in version 9.2.1.16.
The issue is resolved in versions 9.3 and 9.4.
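
The version statements above can be expressed as a small lookup. This is a minimal sketch under stated assumptions: classify_onefs_version is a hypothetical helper, the dotted version format (for example 9.2.1.16) is assumed, and 9.2.1 patch levels above 16 are assumed to retain the fix. Any version this article does not name is reported as unknown rather than guessed.

```shell
#!/bin/sh
# Hypothetical helper: classify a OneFS version string against the
# releases named in this article. $1 is the version, e.g. 9.2.1.16.
classify_onefs_version() {
    case "$1" in
        9.1|9.1.0|9.1.0.*)
            echo "affected-no-fix" ;;
        9.2.1)
            echo "affected-fix-in-9.2.1.16" ;;
        9.2.1.*)
            # Extract the fourth field (the patch level) and compare it.
            patch=${1#9.2.1.}
            patch=${patch%%.*}
            case "$patch" in
                ''|*[!0-9]*)
                    echo "unknown" ;;
                *)
                    # Assumption: patch levels after 16 keep the fix.
                    if [ "$patch" -ge 16 ]; then
                        echo "fixed"
                    else
                        echo "affected-fix-in-9.2.1.16"
                    fi ;;
            esac ;;
        9.3|9.3.*|9.4|9.4.*)
            echo "fixed" ;;
        *)
            # Not covered by this article.
            echo "unknown" ;;
    esac
}
```

For example, classify_onefs_version 9.2.1.15 prints affected-fix-in-9.2.1.16, while classify_onefs_version 9.3 prints fixed.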

The following can be performed to work around the issue. WARNING: All options to work around the issue require a node down event and should be performed during a maintenance window.

  • Identify the node that had its link flap. 
  • Reboot the node. The reboot clears all paths and brings them back to an ARMED state.
OR
  • Identify the node that had its link flap.
  • Bring down the path currently in use, which forces the node to split. For example, if ib0 (int-a) flapped, it is likely that the node is using path ib1 (int-b). Bring down ib1 and bring it back up using "ifconfig." The node then resolves both the ib0 and ib1 paths and is brought back into an ARMED state.
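
The second workaround can be sketched as a dry-run helper that only prints the commands for review and runs nothing. The name bounce_plan is hypothetical, and the sketch assumes, as in the example above, that the interface opposite the one that flapped is the path currently in use. Confirm this on the node before running anything, and only during a maintenance window.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print (do not execute) the ifconfig
# commands for the second workaround. $1 is the interface that flapped.
bounce_plan() {
    case "$1" in
        ib0) active=ib1 ;;   # ib0 (int-a) flapped, so ib1 (int-b) is assumed active
        ib1) active=ib0 ;;   # ib1 (int-b) flapped, so ib0 (int-a) is assumed active
        *)
            echo "usage: bounce_plan ib0|ib1" >&2
            return 1 ;;
    esac
    # Bring the assumed active path down, then back up.
    echo "ifconfig $active down"
    echo "ifconfig $active up"
}
```

For example, bounce_plan ib0 prints the two ifconfig commands for ib1, which an administrator can then review and run on the affected node.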

Article Properties


Affected Product

Isilon Switch QDR

Last Published Date

05 Jan 2023

Version

8

Article Type

Solution