Article Number: 000018989
Introduction
This article provides the procedure for properly shutting down your Dell Isilon cluster and includes information about the risks associated with an improper cluster shutdown.
Nodes that are improperly shut down should not be left without system power for longer than the life of the NVRAM battery, which is approximately 3 to 5 days depending on the type of node. If data is still stored in a node's journal and the node is without system power for longer than the NVRAM battery life, that data is lost and the cluster must be rebuilt.
Contact Dell Isilon Technical Support for assistance if you have questions about the procedures or information in this article.
Procedure
The cluster shutdown procedure requires root credentials and serial console access to nodes in the cluster. The procedure is divided into six phases.
Read the entire procedure before beginning the shutdown process. This ensures that you understand the context and order for completing each step.
Phase 1: Perform preventative maintenance.
These steps are performed approximately 4-8 weeks before the scheduled shutdown. The purpose of this phase is to identify unknown or latent hardware or firmware issues that can impede the shutdown procedure.
If circumstances require an immediate cluster-wide shutdown, you can shut down all nodes simultaneously using the OneFS command-line interface or the OneFS Web administration interface.
If you must perform an emergency shutdown, Dell strongly recommends following all the steps in Phase 3 afterward to preserve the integrity of data.
1. Upload logs for historical reference if needed.
# isi_gather_info
2. Perform or Request an Isilon Health Check.
This evaluates the health of the cluster to ensure that it is in a supportable, operational state. The Health Check can be performed in either of two ways:
A. By the customer, using Isilon: How to Run the Isilon On-Cluster Analysis Tool or Dell Technologies PowerScale HealthCheck - PowerScale Info Hub.
B. By the Remote Reactive (Customer Support) team. This is available to all customers with an active maintenance agreement for clusters on supported code versions. If you meet these requirements, open a Service Request (SR) on the Dell Online Support site requesting an "Isilon Health Check," and provide full logs for the Health Check by running the following command:
# isi_gather_info
*The Health Check is not intended to fix cluster issues or to assess the cluster's configuration, performance, or workflow.
3. Perform a "cold reboot" of each node by performing the following steps. A maintenance window should be scheduled for this activity. Shut down each node in your cluster one at a time:
A. Open an SSH connection to any node and shut down the target node by running the following command:
isi config shutdown <node_lnn>
B. Verify that each node has powered off by confirming that the green power indicator LED on the back of the node is no longer illuminated.
C. Press the power button to power the node back on.
D. Verify that the node has rejoined the cluster and is healthy by running the isi status -q command and looking for -OK- in the Health DASR column of the output.
E. If a node encounters issues indicated in the Health DASR column, or fails to rejoin the cluster, resolve these issues before shutting down the next node.
An example of an issue is shown in the output below: node 1 has rejoined the cluster successfully, but the Health DASR column indicates that it needs attention.
mycluster-1# isi status -q
Cluster Name: mycluster
Cluster Health:     [ ATTN]
Cluster Storage:  HDD                 SSD
Size:             11G (23G Raw)       0 (0 Raw)
VHS Size:         11G
Used:             7.9G (69%)          0 (n/a)
Avail:            3.5G (31%)          0 (n/a)

                   Health  Throughput (bps)  HDD Storage       SSD Storage
ID |IP Address     |DASR |   In   Out  Total| Used / Size     |Used / Size
-------------------+-----+-----+-----+-----+-----------------+-----------------
  1|10.1.16.141    |-A-- |    0| 150K| 150K| 2.0G/ 2.8G( 69%)|    (No SSDs)
  2|10.1.16.142    |-OK- |  98K|  13K| 112K| 2.0G/ 2.8G( 69%)|    (No SSDs)
  3|10.1.16.143    |-OK- |    0|  44K|  44K| 2.0G/ 2.8G( 69%)|    (No SSDs)
  4|10.1.16.144    |-OK- |    0|  512|  512| 2.0G/ 2.8G( 69%)|    (No SSDs)
-------------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:          |  98K| 208K| 306K| 7.9G/  11G( 69%)|    (No SSDs)

Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
Resolve any hardware issues uncovered by the reboot before proceeding to the next phase.
Alternatively, you can perform a warm reboot of each node by running the following command:
isi config reboot <node_lnn>
However, Dell strongly recommends using the cold-reboot approach to more effectively identify latent hardware issues.
4. Schedule a maintenance window for a total cluster shutdown.
Phase 2: Shut down each node in the cluster.
These steps are to be performed on the day that you shut down your Isilon cluster. During a cluster-wide shutdown, some factors may impact or delay the shutdown process. For example, outstanding data writes to a node might affect the shutdown. The purpose of steps 1-2 is to ensure that all clients are disconnected from the cluster and data is properly saved from node journals to the file system prior to running the shutdown command. If you have iSCSI clients, ensure you shut down clients before the iSCSI service is disabled.
Step 3 describes how to shut down each node in your cluster sequentially using a serial console. This method is recommended because it enables you to verify that each node is properly shut down before proceeding to the next node, and make adjustments or fix issues as needed to ensure a proper cluster shutdown. However, this method may be time-consuming because it requires connecting a serial console to each node to run the shutdown command. The section, "Shut down all nodes in your cluster simultaneously," describes how to use the OneFS command-line interface or the OneFS web administration interface to shut down your cluster. This method is less time-consuming than step 3, but makes it more challenging to identify nodes which encounter issues during the shutdown process.
1. Isilon recommends isolating the cluster from clients to ensure that write-heavy clients do not impede the shutdown procedure. You can do this by disabling the client-facing services running on your cluster. Perform the following procedure to disable client-facing services:
A. Identify the client-facing services or protocols that are running on your cluster by running the following commands for each client-facing service:
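For example, to check the status of the SMB and NFS services (a minimal sketch based on the output shown in step B; the same isi services pattern applies to any other client-facing protocols in use on your cluster):
# isi services smb
# isi services nfs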
B. Document the services that are enabled on your cluster based on the output of each command. In the example below, the SMB service is enabled whereas the NFS service is disabled:
mycluster-4# isi services smb
Service 'smb' is enabled.
mycluster-4# isi services nfs
Service 'nfs' is disabled.
mycluster-4#
C. Disable client-facing services. After this step, all clients immediately lose their connection to the cluster. To disable a service, run the command that corresponds to each service that you have enabled, as in the example below:
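For example, to disable the SMB service (a sketch assuming the standard isi services disable syntax; repeat for each service that you documented as enabled):
# isi services smb disable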
If you have iSCSI clients, ensure that the iSCSI clients have unmounted their LUNs before performing step 2. Run the isi iscsi list command to confirm that all iSCSI clients are disconnected from the cluster.
2. Move data writes stored in node journals to the file system by running the isi_for_array isi_flush command. Output similar to the following appears on each node:
mycluster-4# isi_for_array isi_flush
mycluster-1: Flushing cache...
mycluster-1: Cache flushing complete.
If a node fails to flush, output similar to the following appears:
mycluster-4# isi_for_array isi_flush
mycluster-1: Flushing cache...
vinvalbuf: flush failed, 1 clean and 0 dirty bufs remaining
mycluster-2: Flushing cache...
fsync: giving up on dirty
If a node fails to flush, run the isi_for_array isi_flush command again. If any node still fails to flush, contact Dell Isilon Technical Support. All nodes must successfully flush before proceeding to the next step.
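To quickly scan the flush output for failures, you can pipe it through grep (a sketch; this matches the "flush failed" message shown above, but review the complete output as well):
# isi_for_array isi_flush | grep -i fail
If this command returns no output, no node reported a failed flush.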
3. Shut down each node in the cluster sequentially and monitor the output. This approach is recommended because it enables you to identify and resolve any issues before shutting down the next node in the cluster. Shut down each node by performing the following steps:
A. Attach a serial console to each node.
B. Run the following command:
isi config shutdown
When the node is successfully shut down, output similar to the following appears:
Powering the system off using ACPI
C. Watch the console and look for hardware-related failure events. Successful node journal saves are shown in the following output variations:
2014-03-22T00:35:19Z <1.5> mycluster-3(id11) isi_save_journal[44868]: Attempting to save journal to default location
2014-03-22T00:35:19Z <1.5> mycluster-3(id11) isi_save_journal[44868]: Saving journal to /var/journal/journal.gz
2014-03-22T00:35:19Z <1.5> mycluster-3(id11) isi_save_journal[44868]: All data saved successfully
2014-03-22T00:37:29Z <1.5> mycluster-3(id11) isi_save_journal[45074]: Attempting to save journal to default location
2014-03-22T00:37:29Z <1.5> mycluster-3(id11) isi_save_journal[45074]: A valid backup journal already exists. Not saving.

An example of a node journal save failure is highlighted in the output below:

2014-03-21T23:39:09Z <1.4> mycluster-3(id11) /sbin/shutdown: ERROR: Validation failed for backup journal. Shutdown aborted
2014-03-21T23:39:09Z <1.4> mycluster-3(id11) /sbin/shutdown: Failed command output:
If you receive an error that the node journal did not save, you can manually save the journal by performing the steps in Phase 3.
Shut down all nodes in the cluster simultaneously
If there is an emergency, you can shut down all nodes in the cluster simultaneously. However, this method is not recommended because it does not enable you to monitor the status and output of each node if an issue occurs. If you choose to follow these steps, Dell strongly recommends following all the steps in Phase 3 to verify that all nodes have properly shut down after performing the procedures below.
To shut down all nodes in your cluster, use the OneFS command-line interface or the OneFS web administration interface.
From the OneFS command-line interface
# isi config shutdown all
From the OneFS web administration interface
In OneFS 8.0 and later:
Click Cluster Management > Hardware Configuration > Shutdown & Reboot Controls
Click Shut down, and then click Submit.
Click Yes to confirm. A page appears stating that the cluster is now shutting down.
Phase 3: Verify that the nodes have successfully shut down.
Confirm that the nodes have properly shut down by looking at the power indicator light-emitting diode (LED) on the back of each node. All power indicator LEDs should appear dark, or OFF. This indicates that the node has successfully shut down.
*Contact Dell Isilon Technical Support if you have any doubts about the success of the shutdown operation, such as if a node does not shut down or the journal is not saved.
If the power indicator light on the back of the node is still illuminated, the node has not shut down. If the node has not shut down, or if you receive console output indicating that the node journal did not save properly (from Phase 2, step 3C), you must manually save the journal to ensure that the data is committed to disk before shutting down the node.
To manually save the journal and shut down the node, perform the following steps:
1. Attach a serial console to the node. Determine whether the node is responsive to the command-line interface.
A. If the node is responsive to the command-line interface, reboot the node by running the following command:
# isi config reboot
B. If the node is not responsive to the command-line interface, manually reboot the node by pressing and holding the power button on the back of the node. This causes the node to power off. Wait 30 seconds, and then press the power button once to boot the node back up again. Go to the next step.
2. After rebooting the node, log back in and use the following steps to save the journal:
A. Attempt to gracefully shut down the node again by running the following command:
# isi config shutdown
B. If the output still indicates that the journal did not save, manually save the journal by running the following command:
# isi_save_journal
C. If the journal still does not save, unmount the file system (/ifs) by running the following command:
# isi_kill_busy && umount /ifs
Then force save the journal by running the isi_save_journal command again.
D. Verify that the journal is saved by running the isi_checkjournal command.
# isi_checkjournal
E. Do not go to the next step until output indicates that the journal is successfully saved.
Contact Dell Isilon Technical Support if needed.
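For reference, a minimal sketch of the recovery sequence above for a node that is responsive to the command-line interface (the second isi_save_journal run after unmounting /ifs is an assumption based on step 2B):
# isi config reboot
# isi config shutdown
# isi_save_journal
# isi_kill_busy && umount /ifs
# isi_save_journal
# isi_checkjournal
Stop at the first point where the output indicates that the journal saved successfully; the later commands are needed only if the earlier ones fail.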
Phase 4: Disconnect the power source.
After your cluster has successfully shut down and the nodes are powered off, only then can the power source be disconnected from the cluster.
NVRAM batteries
When a client writes a file to a node, the writes are first stored in nonvolatile RAM (NVRAM) hosted on the node's journal card. Sometime later, OneFS commits those writes to disk. To protect the data stored in NVRAM in the event of an unscheduled power outage, each node is equipped with NVRAM batteries (two for redundancy). A node that is powered off but remains connected to a power source continues to refresh its NVRAM batteries. When the power source is disconnected from the node, the NVRAM batteries start to drain. Battery life in the current generation of nodes (X200, S200, X400, and NL400) is approximately five days. In the previous generation of nodes, NVRAM battery life is approximately three days.
Dell Technologies recommends properly shutting down nodes to avoid relying on NVRAM batteries for a substantial length of time during a power outage.
If the NVRAM batteries on a node drain completely, the node boots to read-only mode and stays in read-only mode for approximately 30 minutes until the NVRAM batteries fully charge. When the batteries are recharged, the node automatically returns to normal read/write mode.
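To confirm whether a node is operating in read-only mode, check the Health DASR column of the cluster status output; per the legend shown earlier, R indicates Read-Only:
# isi status -q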
Phase 5: Power on each node in the cluster
These steps are to be performed when you are ready to restart your Isilon cluster.
1. Restore the power source to each node.
2. Press the power button on the front panel or the back of each node to boot it.
3. Verify that all nodes have rejoined the cluster and are healthy by running the isi status -q command. Output similar to the following appears when the cluster is healthy:
Cluster Name: mycluster
Cluster Health:     [  OK ]
Cluster Storage:  HDD                 SSD
Size:             11G (23G Raw)       0 (0 Raw)
VHS Size:         11G
Used:             7.9G (69%)          0 (n/a)
Avail:            3.5G (31%)          0 (n/a)

                   Health  Throughput (bps)  HDD Storage       SSD Storage
ID |IP Address     |DASR |   In   Out  Total| Used / Size     |Used / Size
-------------------+-----+-----+-----+-----+-----------------+-----------------
  1|10.1.16.141    |-OK- |    0| 150K| 150K| 2.0G/ 2.8G( 69%)|    (No SSDs)
  2|10.1.16.142    |-OK- |  98K|  13K| 112K| 2.0G/ 2.8G( 69%)|    (No SSDs)
  3|10.1.16.143    |-OK- |    0|  44K|  44K| 2.0G/ 2.8G( 69%)|    (No SSDs)
  4|10.1.16.144    |-OK- |    0|  512|  512| 2.0G/ 2.8G( 69%)|    (No SSDs)
-------------------+-----+-----+-----+-----+-----------------+-----------------
Cluster Totals:          |  98K| 208K| 306K| 7.9G/  11G( 69%)|    (No SSDs)

Health Fields: D = Down, A = Attention, S = Smartfailed, R = Read-Only
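Optionally, confirm that each node rebooted recently by running a command across the cluster (a sketch combining the isi_for_array utility used earlier with the standard uptime command):
# isi_for_array uptime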
Phase 6: POST CHECK - Run a Health Check on the cluster
1. Upload a full log gather by running the following command:
# isi_gather_info --esrs
2. Perform or request an Isilon Health Check.
To run the health checks yourself, see Isilon: How to Run the Isilon On-Cluster Analysis Tool or Dell Technologies PowerScale HealthCheck - PowerScale Info Hub.
To request a health check from the Remote Reactive (Customer Support) team: this service is available to all customers with an active maintenance agreement for clusters on supported code versions. If you meet these requirements, open a Service Request (SR) on the Dell Online Support site requesting an "Isilon Health Check."
*The Health Check is not intended to fix cluster issues or to assess the cluster's configuration, performance, or workflow.