Instructions
Issue Summary
End users may experience severe impact on some or all hosts connected to VPLEX as a result of issues such as slow drain, array target controller faults, CRC errors, switch ASIC faults, or switch reboots. The VPLEX back end uses a round-robin policy, so an issue on one fabric may impact all host paths on that fabric and may affect paths on the other fabric as well.
For switch and array teams
If an end user reports widespread impact as a result of a single component failure, slow drain, or a similar issue, check with the end user to see whether VPLEX is in the environment. If VPLEX is in the environment and the extent of the problem is known, request that the end user block the affected path(s) on the switch. If VPLEX is in the environment and the affected paths are not known, engage Dell EMC Customer Support, explain the issue, and mention this article.
For the VPLEX Team
If there is an SR in which the end user is reporting ongoing impact and poorly performing back-end paths are the suspected cause, identify the poorly performing paths and block them in VPLEX. If the affected paths are not evident, engage a coach for assistance. Switch and array collaborations can be done once the impact has ended.
Background
VPLEX to Array I/O Flow
VPLEX operates much like a clustered host environment. Each director, which receives I/O from the host, is responsible for completing that I/O. Each director has multiple paths across both fabrics to each LUN. Each VPLEX director is responsible for balancing the I/O across all the available active paths.
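The effect of balancing I/O in strict rotation can be illustrated with a minimal sketch. This is not VPLEX code; the path labels and the `dispatch` helper are hypothetical and only show that a round-robin selector gives every active path the same share of I/O, whether or not that path is healthy.

```python
from collections import Counter
from itertools import cycle

# Hypothetical path labels: two paths per fabric for one director and one LUN.
ACTIVE_PATHS = [
    "fabric-A/path-0",
    "fabric-A/path-1",
    "fabric-B/path-0",
    "fabric-B/path-1",
]


def dispatch(paths, io_count):
    """Assign io_count I/Os to paths in strict rotation and count them per path."""
    rotation = cycle(paths)
    return Counter(next(rotation) for _ in range(io_count))


# Every path receives an equal share; path health plays no part in the choice.
share = dispatch(ACTIVE_PATHS, 1000)
for path in ACTIVE_PATHS:
    print(path, share[path])
```

Because the rotation ignores path health, a degraded path keeps receiving its full share of I/O until it is removed from the set of active paths.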
VPLEX Fault Detection and Mitigation
The primary method VPLEX uses for detecting and mitigating path faults is to monitor the ratio of timeouts on each path. If 90 percent of the I/O on a path times out in two consecutive 15-second periods, VPLEX will banish the affected path and no longer use it. VPLEX will then periodically probe the banished path and un-banish it if I/O completes successfully on this path again.
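The banish rule described above can be sketched as follows. This is an illustration only, not the VPLEX implementation: the function name and input format are hypothetical, and treating "90 percent" as "at least 90 percent" is an assumption.

```python
# Values taken from the description above; the names are hypothetical.
BANISH_PERCENT = 90          # percent of I/O that must time out in a period
CONSECUTIVE_PERIODS = 2      # number of back-to-back 15-second periods required


def should_banish(period_stats):
    """Decide whether a path meets the banish condition.

    period_stats is a list of (timed_out, total) I/O counts, one tuple per
    15-second period, oldest first. Only the most recent periods are checked.
    """
    recent = period_stats[-CONSECUTIVE_PERIODS:]
    if len(recent) < CONSECUTIVE_PERIODS:
        return False
    # Integer arithmetic avoids floating-point rounding at exactly 90 percent.
    # Assumption: a period with no I/O does not count toward banishing.
    return all(
        total > 0 and timed_out * 100 >= total * BANISH_PERCENT
        for timed_out, total in recent
    )


print(should_banish([(95, 100), (92, 100)]))  # both periods over the threshold
print(should_banish([(95, 100), (50, 100)]))  # second period under the threshold
```

A path that times out on half of its I/O in every period never satisfies this condition, which is the situation the next section describes.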
How Problems Can Arise
Because the threshold for banishing a path is high, probing is frequent, and the threshold for un-banishing a path is low, unhealthy paths may continue to be used by VPLEX. As a result, VPLEX may send a significant amount of I/O through poorly performing paths or paths which have experienced soft faults. This I/O either times out or takes an excessive amount of time to complete, which significantly elevates response times across all host paths. This may result in performance data unavailability for any or all hosts connected to the VPLEX.
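A rough worked example shows how one degraded path can dominate the average response time. All numbers here are hypothetical (four paths, a 1 ms healthy response time, a 10-second timeout, and a 50 percent timeout ratio on one path); the sketch assumes I/O is spread evenly across paths and is not a model of actual VPLEX behavior.

```python
def mean_response_ms(path_count, degraded_paths, timeout_ratio, healthy_ms, timeout_ms):
    """Mean response time when I/O is spread evenly across all paths.

    On a degraded path, timeout_ratio of the I/O takes timeout_ms and the
    remainder completes in healthy_ms. All inputs are illustrative.
    """
    healthy_paths = path_count - degraded_paths
    degraded_mean = (1 - timeout_ratio) * healthy_ms + timeout_ratio * timeout_ms
    return (healthy_paths * healthy_ms + degraded_paths * degraded_mean) / path_count


# Hypothetical figures: 4 paths, 1 ms healthy latency, 10 s timeout.
baseline = mean_response_ms(4, 0, 0.0, 1.0, 10_000.0)
one_bad_path = mean_response_ms(4, 1, 0.5, 1.0, 10_000.0)
print(f"all paths healthy: {baseline} ms")
print(f"one path at 50% timeouts: {one_bad_path} ms")
```

With these illustrative figures, the mean response time rises from 1 ms to about 1.25 seconds, even though the degraded path stays well below the 90 percent timeout ratio needed to banish it.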
Recommendation
Upgrade to VPLEX GeoSynchrony target code 6.2 P3 or later for improved relief/handling of the above conditions. Refer to release notes for more details about back-end path management functionality.