Failover Planning Guide and Checklist


Introduction to this Guide

Overview

The Eyeglass PowerScale edition greatly simplifies DR with DFS.  The solution allows DFS to maintain targets (UNC paths) that point at both source and destination clusters.

The failover and failback operations are initiated from Eyeglass and move configuration data to the writeable copy of the UNC target. Grouping of shares by SyncIQ policy allows Eyeglass to automatically protect shares added to the PowerScale.  Quotas are also detected and protected automatically.

The following checklist will assist you in planning and testing your configuration for failover in the event that DR (Disaster Recovery) is needed.


Chapter 2 - Checklist to plan for Failover


Steps Before Failover Day

Task

Description

Completed

0

Document DR Runbook plan

  • Organize the steps, their order, and the contacts required for each step during execution on failover day


0A

Submit support logs for a failover readiness audit 7 days before the planned event (use the support case option to request the assessment)

Failover Release Notes


0B

Take failover training labs to practice execution 


1A

Review DR Design Best Practices

Review Failover Release notes

Warning: This is a mandatory step for all customers; DR assistance requires acceptance before continuing


1B

Upgrade Eyeglass to the latest version (Eyeglass releases include failover rules engine updates that add rules learned from other customer failovers, continuously improving failover and avoiding known issues). Failover Release Notes


1C

Test DR procedures

  • Set up the Runbook Robot feature for continuous DR testing

  • Test failover with Superna Eyeglass

  • Test it again, and again, and again

  • Failback

  • Review the results and logs to ensure the steps that Superna Eyeglass executes are understood

  • Consult the documentation for the failover mode you plan to implement

  • Execute test plan before failover day to validate procedures



1D

Benchmark Failover (Access Zone)

  • Copy data into a test policy or the Runbook Robot Access Zone (note: the Robot can only use one policy for testing; to complete multi-policy testing, a test Access Zone must be created and configured for Access Zone failover)

  • Execute a test failover and use the failover log to find the time delta between the make writable step and the start of the log. This is the point at which failover is complete; the remaining steps prepare for failback, and clients are able to write data to the target at this point.

  • Repeat the above with 2 policies and a known quantity of data so that both policies sync data and fail over. Record the time difference between the make writable log step and the beginning of the failover log.

  • Repeat one more time with 3 policies and the same amount of data in each directory.

  • Average the 3 test run times to the make writable step and use this value, which is unique to your environment (clusters, WAN, nodes in replication, etc.), to calculate estimated failover times if you have more than 3 policies (see the estimation sketch after this list).

  • Note: the test Access Zone should have all configuration completed (hints, SPNs, shares, exports and quotas) so that the time estimates are as close as possible to the production configuration.

  • Note: if the change rate is expected to be zero before the planned failover, skip the step to create changed data before failover.

  • Note: create as many shares under each policy as exist in production to capture the time for the share rename step to complete for each share; this step is a parallel operation but should be benchmarked on your clusters.

  • Note: failover logs include post-failover steps that prepare for failback and complete an audit of the clusters. The total failover job time DOES NOT REPRESENT THE TIME IT TAKES TO FAIL OVER; YOU MUST CALCULATE THE TIME TO THE MAKE WRITABLE STEP FROM THE LOGS.
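Once the three benchmark runs are recorded, the per-policy make writable time can be averaged and extrapolated. Below is a minimal sketch of one simple interpretation of that estimate (a linear per-policy scaling); the run times and the production policy count are placeholders, not measured values, and must be replaced with the deltas from your own failover logs.

# Estimate production failover time from benchmark runs.
# The recorded values below are placeholders; replace them with the
# "make writable" time deltas measured from your own failover logs.

# (policy_count, minutes from start of failover log to make writable step)
benchmark_runs = [
    (1, 4.0),   # placeholder: 1-policy test run
    (2, 7.5),   # placeholder: 2-policy test run
    (3, 11.0),  # placeholder: 3-policy test run
]

# Average the per-policy time across the three runs.
per_policy_minutes = sum(minutes / policies for policies, minutes in benchmark_runs) / len(benchmark_runs)

production_policy_count = 12  # placeholder: number of policies being failed over

estimate = per_policy_minutes * production_policy_count
print(f"Average make writable time per policy: {per_policy_minutes:.1f} min")
print(f"Estimated time to make writable for {production_policy_count} policies: {estimate:.1f} min")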



1E

Benchmark Failover (DFS Mode)

  • Use the Access Zone with a DFS mode policy or create a test DFS mode policy

  • Copy test data into the path

  • Create one or more shares in the path of the test policy (if you have more than one share under a policy in production, create as many shares as you have in the production policy configuration)

  • Create more than one policy as per the step above (for example, 3) to get a good time average

  • Create changed data if you plan to fail over with un-synced data (optional step)

  • Run a DFS mode failover on 1 policy, then 2, then 3. Record the time difference between the make writable step and the start of the failover log (a log-parsing sketch follows this list).

  • Calculate the average time per policy (based on your production configuration)

  • Use this number to estimate your production failover time

  • Note: create as many shares under each policy as exist in production to capture the time for the share rename step to complete for each share; this step is a parallel operation but should be benchmarked on your clusters.

  • Note: failover logs include post-failover steps that prepare for failback and complete an audit of the clusters. The total failover job time DOES NOT REPRESENT THE TIME IT TAKES TO FAIL OVER; YOU MUST CALCULATE THE TIME TO THE MAKE WRITABLE STEP FROM THE LOGS.
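Both benchmark procedures depend on measuring the delta from the start of the failover log to the make writable step. Below is a minimal sketch of that measurement, assuming a plain-text failover log where each line begins with an ISO-style timestamp and the step line contains the words "make writable"; the timestamp format, search string, and file name are assumptions that must be adjusted to match the log actually exported from Eyeglass.

import re
from datetime import datetime

# Assumed log layout: lines start with "YYYY-MM-DD HH:MM:SS" and the step of
# interest mentions "make writable". Adjust both to match the real log.
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})")
FORMAT = "%Y-%m-%d %H:%M:%S"

def make_writable_delta(log_path: str):
    start = None
    with open(log_path) as log:
        for line in log:
            match = TIMESTAMP.match(line)
            if not match:
                continue
            stamp = datetime.strptime(match.group(1), FORMAT)
            if start is None:
                start = stamp  # first timestamp = start of the failover log
            if "make writable" in line.lower():
                return stamp - start  # time from log start to make writable
    return None

if __name__ == "__main__":
    # "failover.log" is a hypothetical exported log file name.
    delta = make_writable_delta("failover.log")
    print(f"Make writable reached after: {delta}")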


2

Contact list for failover day

  • AD administrator

  • DNS administrator

  • Cluster storage Administrator

  • Workstation and server administrators

  • Application team for dependent applications

  • Change Management case entered for outage window


3

Reduce failover and failback time - run manual domain mark jobs on all SyncIQ policy paths (this will speed up failover, because domain mark can take a long time to complete and extends the failover window)

Run this procedure on all policies: Domain mark
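The domain mark runs can be scripted against each SyncIQ policy source path. Below is a minimal sketch, assuming SSH access to the source cluster and the OneFS command form isi job jobs start domainmark --root <path> --dm-type synciq; the management address and paths are placeholders, and the exact CLI syntax should be confirmed against your OneFS release before running.

import subprocess

# Placeholder values: replace with your cluster's management address and the
# source paths of your SyncIQ policies.
CLUSTER = "prod-cluster-mgmt.example.com"
POLICY_PATHS = [
    "/ifs/data/finance",
    "/ifs/data/engineering",
]

for path in POLICY_PATHS:
    # Assumed OneFS command form for pre-creating the SyncIQ domain; verify
    # the syntax for your OneFS version before relying on it.
    command = f"isi job jobs start domainmark --root {path} --dm-type synciq"
    print(f"Starting domain mark for {path}")
    subprocess.run(["ssh", f"root@{CLUSTER}", command], check=True)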


4

Count shares, exports, NFS aliases and quotas on source and target with the OneFS UI

Validates that the approximate config count is synced correctly (also verify the Superna Eyeglass DR Dashboard)

(there should be no quotas synced on the target - only shares, exports and NFS aliases; see the comparison sketch below)
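A minimal sketch for recording and comparing the counts, assuming the numbers are gathered manually from the OneFS UI on each cluster; the values shown are placeholders.

# Config object counts gathered manually from the OneFS UI on each cluster.
# The numbers below are placeholders; replace them with your own counts.
source_counts = {"shares": 42, "exports": 10, "nfs_aliases": 3, "quotas": 57}
target_counts = {"shares": 42, "exports": 10, "nfs_aliases": 3, "quotas": 0}

for item, expected in source_counts.items():
    actual = target_counts.get(item, 0)
    if item == "quotas":
        # Quotas are not synced to the target; they are applied on failover.
        status = "OK" if actual == 0 else "UNEXPECTED: quotas present on target"
    else:
        status = "OK" if actual == expected else "MISMATCH"
    print(f"{item:12} source={expected:4} target={actual:4}  {status}")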


5

Verify dual delegation in DNS before failover

This verifies that DNS is pre-configured for failover for all SmartConnect zones that will be failed over (Access Zone failover fails over all SmartConnect zones on all IP pools in the Access Zone). A delegation check sketch follows below.
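Below is a minimal sketch of a pre-failover delegation check, assuming the dnspython package (2.x) is installed and that dual delegation means the corporate DNS holds NS delegation records for each SmartConnect zone pointing at name server entries for both clusters; the DNS server address and zone names are placeholders.

import dns.flags
import dns.message
import dns.query
import dns.rdatatype  # pip install dnspython (2.x)

# Placeholder values: the corporate DNS server that holds the delegations and
# the SmartConnect zone names in the Access Zone being failed over.
CORPORATE_DNS = "10.0.0.53"
SMARTCONNECT_ZONES = ["data.example.com", "home.example.com"]

for zone in SMARTCONNECT_ZONES:
    # Non-recursive query so the corporate DNS returns its own delegation
    # records instead of chasing the referral down to a cluster.
    query = dns.message.make_query(zone, "NS")
    query.flags &= ~dns.flags.RD
    response = dns.query.udp(query, CORPORATE_DNS, timeout=5)
    delegations = [
        record.to_text()
        for rrset in (response.answer + response.authority)
        if rrset.rdtype == dns.rdatatype.NS
        for record in rrset
    ]
    # Dual delegation should show a name server entry for each cluster.
    status = "OK" if len(delegations) >= 2 else "CHECK DELEGATION"
    print(f"{zone}: {delegations}  [{status}]")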


6

DFS failover preparation

  1. Using dfsutil, verify that clients that will be failing over show two active paths to storage and that the correct path is active

  2. Verify all DFS mounts have both referrals configured

Download the dfsutil tool for your OS type

Check path resolution (a client-side check sketch follows below)
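Below is a minimal sketch of a client-side referral check, assuming a Windows client with dfsutil and Python installed and that dfsutil cache referral dumps the client's referral cache; the DFS folder path and cluster UNC targets are placeholders, and the output parsing is a rough text match rather than a definitive check.

import subprocess

# Placeholder values: the DFS folder being failed over and the UNC share
# targets on the source and target clusters.
DFS_FOLDER = r"\\example.com\dfs\data"
EXPECTED_TARGETS = [r"\\prod-cluster\data", r"\\dr-cluster\data"]

# Touch the DFS path so the client requests a referral and caches it.
subprocess.run(["cmd", "/c", "dir", DFS_FOLDER], capture_output=True)

# Dump the client referral cache; both cluster targets should be listed for
# the folder, with the active target marked by the client.
output = subprocess.run(
    ["dfsutil", "cache", "referral"],
    capture_output=True, text=True, check=True,
).stdout
print(output)

for target in EXPECTED_TARGETS:
    state = "listed" if target.lower() in output.lower() else "MISSING"
    print(f"{target}: {state}")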




7

Communicate the failover outage impact to application teams and business units that use the cluster

  1. Schedule a maintenance window with application and business units

  2. Explain that data loss will occur if data is written past the maintenance window start time



8

Set all policies' schedules to every 15 minutes or less 1 day prior to the failover to ensure data is staying in sync. This also ensures the failover speed will be optimized.

This step is critical to avoid long-running policies or long-running jobs that will extend your failover and maintenance window. Specifically, ensure that run-on-change is never left enabled, since policies that are running cannot be controlled for failover. A schedule audit sketch follows below.
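Below is a minimal sketch for auditing policy schedules before changing them, assuming SSH access to the source cluster and that isi sync policies list supports JSON output on your OneFS release; the management address is a placeholder, and the JSON field names and the "when-source-modified" value are assumptions to verify against your cluster's actual output.

import json
import subprocess

CLUSTER = "prod-cluster-mgmt.example.com"  # placeholder management address

# Assumes the OneFS CLI can emit JSON for this command on your release;
# confirm the option and output shape before relying on this.
raw = subprocess.run(
    ["ssh", f"root@{CLUSTER}", "isi sync policies list --format json"],
    capture_output=True, text=True, check=True,
).stdout

for policy in json.loads(raw):
    name = policy.get("name", "<unknown>")
    schedule = policy.get("schedule") or "<none>"
    # Flag run-on-change and unscheduled policies; both should be corrected
    # to a fixed schedule of 15 minutes or less before failover day.
    flag = "REVIEW" if schedule in ("<none>", "when-source-modified") else "ok"
    print(f"{name:30} schedule={schedule:30} {flag}")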

Steps on Failover Day

Task

Description

Completed

0

Pause or stop SMB and NFS IO before failover starts to avoid data loss

For the SMB protocol, the 2.0 or later feature can be used to block IO to shares with DR Assistant. This dynamically inserts a deny read permission before failover starts and removes it after failover completes.

NFS requires the protocol to be disabled to guarantee no IO. Exports should be unmounted before disabling the protocol on the cluster.


1

Force run SyncIQ policies 1 hour before the planned failover

Run each SyncIQ policy beforehand so that the failover policy run will have less data to sync


2

Execute failover

How to Execute A Failover with DR Assistant


3

Monitor failover

How to Monitor the Eyeglass Assisted Failover



4

If required, use the data recovery guide

Failover Recovery Procedures


5

Ensure Active Directory admin is available

If ADSIedit recovery steps are required, they need Active Directory Administrator access to the cluster machine accounts






After Failover

Test Data Access

Use the guided post-failover steps:

How to Validate and troubleshoot A Successful Failover WHEN Data is NOT Accessible on the Target Cluster



Planning Checklist Excel Download

  1. Superna Eyeglass Failover Planning Checklist
© Superna Inc