Failover Planning Guide and Checklist

Introduction to this Guide

Overview

The Eyeglass PowerScale edition greatly simplifies DR with DFS. The solution allows DFS to maintain targets (UNC paths) that point at both source and destination clusters.

The failover and failback operations are initiated from Eyeglass and move configuration data to the writeable copy of the UNC target. Grouping of shares by SyncIQ policy allows Eyeglass to automatically protect shares added to the PowerScale. Quotas are also detected and protected automatically.

The following checklist will help you plan and test your configuration for failover in the event that DR (Disaster Recovery) is needed.

Chapter 2 - Checklist to plan for Failover

Steps Before failover Day	Task	Description	Completed
0	Document DR Runbook plan	Organize the steps, their order, and the contacts required for each step when executing on failover day
0A	Submit support logs for failover readiness audit (7 days before planned event) (see image for case option to request assessment)	Failover Release Notes
0B	Take failover training labs to practice execution	https://www.supernaeyeglass.com/booking
1A	Review DR Design Best Practices. Review Failover Release Notes. Warning: Mandatory step for all customers. DR Assistance requires acceptance before continuing	Eyeglass and PowerScale Failover Best Practices; Failover Release Notes
1B	Upgrade Eyeglass to the latest version (Eyeglass releases include failover rules engine updates that add rules learned from other customer failovers, which continuously improve failover or avoid known failover issues). Failover Release Notes	Eyeglass PowerScale Edition Upgrade Guide
1C	Test DR procedures	Set up the Runbook Robot feature for continuous DR testing. Test failover with Superna Eyeglass. Test it again, again and again. Fail back. Review the results and logs to ensure that the steps Superna Eyeglass executes are understood. Consult the documentation on the failover mode you plan to implement. Execute the test plan before failover day to validate procedures
1D	Benchmark Failover (Access Zone)	Copy data into a test policy or the Runbook Robot access zone. (Note: The Robot can only use 1 policy for testing; to complete multi-policy testing, a test access zone must be created and configured for access zone failover.) Execute a test failover and use the failover log to find the time delta between the start of the log and the make writable step. This is the point at which failover is completed: failback steps now begin to execute, but clients are already able to write data to the target. Repeat the test with 2 policies and a known quantity of data so that both policies sync data and fail over, and again record the time difference between the make writable step and the first time stamp in the failover log. Repeat one more time with 3 policies and the same amount of data in each directory. Average the 3 test run times to the make writable step and use this value, which is unique to your environment (clusters, WAN, nodes in replication, etc.), to calculate estimated failover times if you have more than 3 policies. Note: The test access zone should have all configuration completed (hints, SPNs, shares, exports and quotas) so that the time estimates are as close as possible to the production configuration. Note: If the change rate is expected to be zero before the planned failover, skip the step to create changed data before failover. Note: The reason to create as many shares under each policy as in production is to measure the time the rename step takes to complete for each share; this step is a parallel operation but should be benchmarked on your clusters. Note: Failover logs include post-failover steps that prepare for failback and complete an audit of the clusters. THE FAILOVER JOB TIME DOES NOT REPRESENT THE TIME IT TAKES TO FAIL OVER. YOU MUST CALCULATE THE TIME TO THE MAKE WRITABLE STEP IN THE LOGS
1E	Benchmark Failover (DFS Mode)	Use the Access Zone with a DFS mode policy, or create a test DFS mode policy. Copy test data into the path. Create one or more shares in the path of the test policy (if you have more than one share under a policy in production, then create as many shares as you have in the production policy configuration). Create more than one policy, as in the step above (for example 3), to get a good time average. Create changed data if you plan to fail over with un-synced data (optional step). Run DFS mode failover on 1 policy, then 2, then 3. Record the time difference between the make writable step and the start of the failover log. Calculate the average time per policy (based on your production configuration). Use this number to estimate the time to complete your production failover. Note: The reason to create as many shares under each policy as in production is to measure the time the rename step takes to complete for each share; this step is a parallel operation but should be benchmarked on your clusters. Note: Failover logs include post-failover steps that prepare for failback and complete an audit of the clusters. THE FAILOVER JOB TIME DOES NOT REPRESENT THE TIME IT TAKES TO FAIL OVER. YOU MUST CALCULATE THE TIME TO THE MAKE WRITABLE STEP IN THE LOGS
2	Contact list for failover day	AD administrator; DNS administrator; cluster storage administrator; workstation and server administrators; application team for dependent applications; Change Management case entered for outage window
3	Reduce failover and failback time: run manual domain mark jobs on all SyncIQ policy paths (this speeds up failover because domain mark can take a long time to complete and lengthens the failover time)	Run this procedure on all policies. Domain mark
4	Count shares, exports, NFS aliases and quotas on source and target with the OneFS UI	Validates that the approximate config count is synced correctly (also verify with the Superna Eyeglass DR Dashboard). There should be no quotas synced on the target, only shares, exports and NFS aliases
5	Verify dual delegation in DNS before failover	This verifies that DNS is pre-configured for failover for all SmartConnect Zones that will be failed over (Access Zone failover fails over all SmartConnect Zones on all IP pools in the Access Zone)
6	DFS failover preparation	Using dfsutil, verify that clients that will be failing over show two active paths to storage and that the correct path is active. Verify that all DFS mounts have both referrals configured. Download the dfsutil tool for your OS type to check path resolution
7	Communicate the failover outage impact to application teams and business units that use the cluster	Schedule the maintenance window with application teams and business units. Be sure to explain that data loss will occur if data is written past the maintenance window start time
8	Set all policy schedules to every 15 minutes or less 1 day prior to the failover to ensure data stays in sync. This also ensures the failover speed will be optimized	This is a critical step to avoid long-running policies or long-running jobs that will extend your failover and maintenance window. Specifically, ensure run on change is never left enabled, since policies that are running cannot be controlled for failover.
Steps on the failover Day	Task	Description	Completed
0	SMB and NFS IO paused or stopped before failover start to avoid data loss	For the SMB protocol, the 2.0 or later feature can be used to block IO to shares with DR Assistant. This dynamically inserts a deny read permission before failover starts and removes it after failover completes. NFS requires the protocol to be disabled to guarantee no IO. Exports should be unmounted before disabling the protocol on the cluster.
1	Force run SyncIQ policies 1 hour before planned failover	Run each SyncIQ policy beforehand so that the policy run during failover will have less data to sync
2	Execute failover	How to Execute A Failover with DR Assistant
3	Monitor failover	How to Monitor the Eyeglass Assisted Failover
4	If required, data recovery guide	Failover Recovery Procedures
5	Ensure Active Directory admin is available	ADSIedit recovery steps are required and need Active Directory administrator access to the cluster machine accounts

After Failover	Test Data Access	Use the guided post-failover steps: How to Validate and troubleshoot A Successful Failover WHEN Data is NOT Accessible on the Target Cluster
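
The arithmetic in benchmark steps 1D and 1E can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not an Eyeglass tool: the time stamps are copied by hand from the failover logs, the time stamp format shown is an assumption, and the function names and sample values are hypothetical. It follows the "average time per policy" reading of step 1E and scales that average linearly with the production policy count, which is only a rough estimate.

```python
from datetime import datetime

# Assumed time stamp format; adjust to match what your failover log shows.
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def seconds_to_make_writable(log_start, make_writable):
    """Time delta from the first log time stamp to the make writable step."""
    start = datetime.strptime(log_start, LOG_TIME_FORMAT)
    writable = datetime.strptime(make_writable, LOG_TIME_FORMAT)
    return (writable - start).total_seconds()


def average_seconds_per_policy(runs):
    """runs is a list of (policy_count, seconds_to_make_writable) tuples."""
    if not runs:
        raise ValueError("at least one benchmark run is required")
    per_policy = [seconds / count for count, seconds in runs]
    return sum(per_policy) / len(per_policy)


def estimate_failover_seconds(runs, production_policy_count):
    """Rough linear estimate for a production failover (an assumption)."""
    return average_seconds_per_policy(runs) * production_policy_count


# Hypothetical values from three test runs with 1, 2 and 3 policies.
runs = [
    (1, seconds_to_make_writable("2024-01-10 09:00:00", "2024-01-10 09:04:00")),
    (2, seconds_to_make_writable("2024-01-10 10:00:00", "2024-01-10 10:07:00")),
    (3, seconds_to_make_writable("2024-01-10 11:00:00", "2024-01-10 11:09:00")),
]

estimate = estimate_failover_seconds(runs, production_policy_count=10)
print(f"Estimated time to make writable: {estimate / 60:.1f} minutes")
```

The estimate deliberately stops at the make writable step, because the total failover job time also includes the failback preparation and audit steps.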

Planning Check List Excel Download

Superna Eyeglass Failover Planning Checklist