Client computers perform slowly. Certain jobs, particularly those that run on the cluster, either fail or take longer than expected to complete.
Performance issues are typically caused by network traffic, network configuration issues, client or cluster processing load, or a combination thereof. This article describes several effective ways to troubleshoot performance issues.
Table of Contents:
- Using Isilon InsightIQ
- Troubleshooting without InsightIQ
- Network throughput
- Distribution of client connections
- SmartConnect
- Cluster throughput
- Cluster processing
- Queued operations
- CPU
Using Isilon InsightIQ
Using Isilon InsightIQ is the best way to monitor performance and to troubleshoot performance issues.
The Isilon InsightIQ virtual appliance enables you to monitor and analyze Isilon cluster activity through flexible, customizable chart views in the InsightIQ web-based application. These charts provide detailed information about cluster hardware, software, and file system and protocol operations. InsightIQ transforms data into visual information that highlights performance outliers, enabling you to quickly diagnose bottlenecks and optimize workflows.
For details on using InsightIQ, see the InsightIQ User Guide.
Troubleshooting without InsightIQ
If you are not using InsightIQ, you can run a variety of commands to investigate performance issues. Troubleshoot performance issues first by examining network and cluster throughput, then by examining cluster processing, and finally by examining individual node CPU rates.
Network throughput
Use a network testing tool such as Iperf to determine the throughput capabilities of the cluster and client computers on your network.
Using Iperf, run the following commands on the cluster and on a client. These commands set a TCP window size large enough to reveal whether the network link is a potential cause of latency issues.
Cluster
iperf -s -w 262144
Client
iperf -c <cluster IP> -w 262144
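The client-side Iperf run prints a bandwidth report when it finishes. The snippet below is a sketch of how you might pull the throughput figure out of that report for comparison across clients; the sample line is illustrative, not output from your network.

```shell
# Illustrative iperf report line (not real measured output).
sample='[  3]  0.0-10.0 sec  1.10 GBytes   945 Mbits/sec'
# Print the field that precedes the "Mbits/sec" unit label.
bw=$(echo "$sample" | awk '{ for (i = 1; i <= NF; i++) if ($(i+1) == "Mbits/sec") print $i }')
echo "Measured throughput: ${bw} Mbits/sec"
```

Throughput far below the link's rated speed (for example, well under 1000 Mbits/sec on a gigabit link) suggests the network path itself is contributing to the latency.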
Distribution of client connections
Check how many NFS and SMB clients are connected to the cluster to make sure the connections are not concentrated on one node.
- Open an SSH connection on any node in the cluster and log on using the "root" account.
- To check NFS clients, run the following command:
isi statistics query --nodes=all --stats=node.clientstats.connected.nfs,node.clientstats.active.nfs
The output displays the number of clients connected per node and how many of those clients are active on each node.
- To check SMB clients, run the following command:
isi statistics query --nodes=all --stats=node.clientstats.connected.smb,node.clientstats.active.smb1,node.clientstats.active.smb2
The output displays the number of clients connected per node and how many of those clients are active on each node.
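One way to judge whether the connections are evenly distributed is to compare the per-node connection counts from the output above. The snippet below is a minimal sketch; the node names and counts are made-up sample figures, not real command output, and the "2x spread" rule of thumb is an assumption, not an Isilon guideline.

```shell
# Hypothetical per-node connection counts (illustrative sample, not real output).
counts='node-1 210
node-2 195
node-3 12'
echo "$counts" | awk '
  { if (min == "" || $2 < min) min = $2; if ($2 > max) max = $2 }
  END { print "min=" min " max=" max; if (max > 2 * min) print "Connections look unevenly distributed" }'
```

A large spread between the busiest and quietest node suggests that clients are favoring one node and that connection balancing is worth investigating.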
SmartConnect
Check that the node on which SmartConnect is running is not overburdened with network traffic.
- Open an SSH connection on any node in the cluster and log on using the "root" account.
- Run the following command:
isi_for_array -sq 'ifconfig|grep em -A3'
The output displays a list of all the IP addresses that are bound to the external interface.
- Note any nodes that have one more IP address than the rest.
- Check the status of the nodes that you noted in the previous step by running the following command:
isi status
Check the throughput column of the output to determine the load on those nodes.
Cluster throughput
Assess cluster throughput by conducting write and read tests that measure how much time it takes to read from and write to a file. Conduct at least one write test and one read test, as follows.
Write test
- Open an SSH connection on any node in the cluster and log on using the "root" account.
- Change to the /ifs directory:
cd /ifs
- From the command line interface (CLI) on the cluster or from a UNIX or Linux client computer, use the dd command to write a new file to the cluster. Run the following command:
dd if=/dev/zero of=1GBfile bs=1024k count=1024
This command creates a sample 1GB file and reports how much time it took to write it to disk.
- From the output of this command, extrapolate how many MB per second can be written to disk in single-stream workflows.
- If you have a Mac client and want to conduct further analysis:
- Start Activity Monitor.
- Run the following command, where pathToFile is the file path of the targeted file:
cat /dev/zero > /pathToFile
This command helps measure the throughput of write operations on the Isilon cluster. (Although it is possible to run the dd command from a Mac client, results can be inconsistent.)
- Monitor the results of the command in the Activity Monitor's Network tab.
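The extrapolation described in the write test can be sketched as a simple division: the dd command writes 1024 MB, so dividing by the elapsed time it reports gives single-stream MB/s. The elapsed time below is a made-up example value, not a measured result.

```shell
# Hypothetical elapsed time reported by dd (in seconds); substitute your own.
elapsed=8.5
# 1024 MB written / elapsed seconds = single-stream write throughput in MB/s.
mb_per_sec=$(awk -v t="$elapsed" 'BEGIN { printf "%.1f", 1024 / t }')
echo "Single-stream write throughput: ${mb_per_sec} MB/s"
```

The same arithmetic applies to the read test: divide the size of the file read by the elapsed time dd reports.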
Read test
When measuring the throughput of read operations, be sure not to conduct read tests on the file that you created during the write test. Because that file has been cached, the results of your read tests would be inaccurate. Instead, test a read operation of a file that has not been cached. Find a file on the cluster that is larger than 1GB, and reference that file in the read test.
- Open an SSH connection on any node in the cluster and log on using the "root" account.
- From the CLI on the cluster or from a UNIX or Linux client computer, use the dd command to read a file on the cluster. Run the following command where pathToLargeFile is the file path of the targeted file:
dd if=/pathToLargeFile of=/dev/null bs=1024k
This command reads the targeted file and reports how much time it took to read it.
- If you have a Mac client and want to conduct further analysis:
- Start Activity Monitor.
- Run the following command, where pathToLargeFile is the file path of the targeted file:
time cp /pathToLargeFile /dev/null
This command helps measure the throughput of read operations on the Isilon cluster. (Although it is possible to run the dd command from a Mac client, results can be inconsistent.)
- Monitor the results of the command in the Activity Monitor's Network tab.
Cluster processing
Restripe jobs
Before examining the input/output operations per second (IOPS) of the cluster:
- Determine which jobs are running on the cluster. If restripe jobs such as AutoBalance, Collect, or MultiScan are running, consider why those jobs are running and if they should continue to run.
- Consider the type of data being consumed. If client computers are working with large video files or virtual machines (VMs), a restripe job requires more disk IOPS than normal.
- Consider temporarily pausing a restripe job. Doing so can significantly improve performance and might be a viable short-term solution to a performance issue.
Disk I/O
Examining disk I/O can help determine if certain disks are being overused.
By cluster
- Open an SSH connection on any node in the cluster and log on using the "root" account.
- Ascertain disk I/O by running the following command:
isi statistics pstat
- From the output of this command, divide the disk IOPS by the total number of disks in the cluster. For example, for an 8-node cluster using Isilon IQ 12000x nodes, which hosts 12 drives per node, you would divide the disk IOPS by 96.
For X-Series nodes and NL-Series nodes, you should expect to see disk IOPS of 70 or less for 100% random workflows, or disk IOPS of 140 or less for 100% sequential workflows. Because NL-Series nodes have less RAM and lower CPU speeds than X-Series nodes, X-Series nodes can handle higher disk IOPS.
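The per-disk calculation above can be sketched as follows. The cluster-wide IOPS figure here is a made-up example; read the real value from the output of isi statistics pstat. The 70/140 thresholds are the X-Series/NL-Series guidelines stated above.

```shell
# Hypothetical cluster-wide disk IOPS (example value, not real output).
cluster_iops=9600
# 8 nodes x 12 drives per node = 96 disks, as in the Isilon IQ 12000x example.
disks=$((8 * 12))
per_disk=$((cluster_iops / disks))
echo "Per-disk IOPS: ${per_disk}"
if [ "$per_disk" -le 70 ]; then
  echo "Within the guideline for 100% random workflows"
elif [ "$per_disk" -le 140 ]; then
  echo "Within the guideline for 100% sequential workflows"
else
  echo "Above guideline; disks may be overused"
fi
```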
By node and by disk
- Open an SSH connection on any node in the cluster and log on using the "root" account.
- Ascertain disk IOPS by node, which can help discover disks that are overused, by running the following command:
isi statistics query --nodes=all --stats=node.disk.xfers.rate.sum --top
- To determine how to query for statistics on a per disk basis, use the following command:
isi statistics describe --stats=all | grep disk
Queued operations
Another way to determine if disks are being overused is to check how many operations are queued for each disk in the cluster. For a single-stream SMB workflow, a queue depth of 4 can indicate an issue, while for highly concurrent NFS namespace operations, the queue can be much deeper.
- Open an SSH connection on any node in the cluster and log on using the "root" account.
- Determine how many operations are queued for each disk in the cluster by running the following command:
isi_for_array -s sysctl hw.iosched | grep total_inqueue
- Determine the latency caused by the queued operations:
sysctl -aN hw.iosched|grep bios_inqueue|xargs sysctl -D
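To scan the per-node output for disks whose queues exceed the single-stream SMB guideline of 4, you could filter it as sketched below. The sample lines are illustrative, not real sysctl output, and the exact output format on your cluster may differ.

```shell
# Hypothetical isi_for_array output lines (illustrative sample, not real output).
sample='node-1: hw.iosched.total_inqueue: 2
node-2: hw.iosched.total_inqueue: 9'
# Flag any node whose queued-operation count exceeds 4.
echo "$sample" | awk -F': ' '$3 > 4 { print $1 " has " $3 " queued operations" }'
```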
CPU
CPU issues are frequently traced to the operations that clients are performing on the cluster. Using the isi statistics command, you can determine the operations that are being performed on the cluster, cataloged by either network protocol or client computer.
- Open an SSH connection on any node in the cluster and log on using the "root" account.
- Determine which operations are being performed across the network and assess which of those operations are taking the most time by running the following command:
isi statistics protocol --orderby=TimeAvg --top
The output gives detailed statistics for all network protocols, ordered by how long the cluster takes to respond to clients. Although the output might not pinpoint the slowest operation exactly, it can point you in the right direction.
- To obtain more information about CPU processing, such as which nodes' CPUs are the most heavily used, run the following command:
isi statistics system --top
- To obtain the four processes on each node that are consuming the most CPU resources, run the following command:
isi_for_array -sq 'top -d1|grep PID -A4'