How to Execute A Failover with DR Assistant

Follow these steps to execute a failover. Note: The planning guide is the expected reference document for all planned failovers. Support expects that it has been used for planning.

How to Know When Uncontrolled Failover Should Be Used

Use this option in Eyeglass DR Assistant only with a full understanding of the data protection implications.



READ THIS FIRST: Using this option means you are failing away from the data and losing ALL changes at the moment the failover is started in Eyeglass.

NOTE: Uncontrolled failover should only be used when the Eyeglass VM DOES NOT have reachability to both clusters that replicate. If data access to Isilon is an issue BUT Eyeglass reachability is green on the LiveOPS icon, DO NOT USE UNCONTROLLED FAILOVER; USE CONTROLLED FAILOVER.


NOTE: Recovery from this failover mode WILL require manual steps to restore DR sync status and to fail back from the DR cluster to the Production cluster.

  • Recovery from uncontrolled failover is the customer's responsibility and is NOT covered by the Superna support contract.
  • This will require involvement with all vendors related to the equipment in the customer data center and receiving the green light from all vendors that the data center is ready to resume operations. This includes Isilon and all dependent components such as AD, DNS, other applications using Isilon services, and physical infrastructure (power, networking WAN links).


DO NOT BRING THE CLUSTER ONLINE WITHOUT PLANNING.  RESYNC PREP DOES NOT RUN, WHICH MEANS BOTH CLUSTERS WILL BE WRITEABLE.  YOU SHOULD DISCONNECT THE SOURCE CLUSTER AND PLAN A CONTROLLED RECOVERY FROM AN UNCONTROLLED FAILOVER.

Reasons you may choose to execute an uncontrolled failover include the following:

  1. WAN link is cut to the data center with a very long repair time to restore service
  2. Loss of power for extended periods of time to the production data center
  3. Damaged cluster or serious cluster issue (upgrade)
  4. Equipment failure blocking access to the cluster, or application server failures with long recovery times
  5. Networking failure that prevents users from accessing storage, where the Isilon management network has ALSO failed
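
The guidance in this section reduces to one question: can the Eyeglass VM reach both replicating clusters? The following is a minimal sketch of that decision, not part of Eyeglass. The function and parameter names are hypothetical and exist only for illustration.

```python
from enum import Enum


class FailoverMode(Enum):
    CONTROLLED = "controlled"
    UNCONTROLLED = "uncontrolled"


def choose_failover_mode(eyeglass_reaches_both_clusters: bool,
                         client_data_access_ok: bool) -> FailoverMode:
    """Return the failover mode suggested by the guidance above.

    Hypothetical helper: both inputs are values an operator would
    determine manually (for example from the LiveOPS reachability icon).
    """
    # client_data_access_ok is deliberately not consulted: a data access
    # problem alone does not justify an uncontrolled failover while
    # Eyeglass can still reach both clusters.
    if eyeglass_reaches_both_clusters:
        return FailoverMode.CONTROLLED
    return FailoverMode.UNCONTROLLED
```

For example, `choose_failover_mode(True, False)` returns `FailoverMode.CONTROLLED`, matching the rule that a data access issue with green Eyeglass reachability still calls for a controlled failover.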

Eyeglass Pre-Failover Check Important - Read me

IMPORTANT:

Making any changes to the SyncIQ Policies or related Eyeglass Configuration Replication Jobs during failover may produce unexpected results.

IMPORTANT:

Eyeglass Assisted Failover has a 45-minute timeout on each failover step. Any step that does not complete within this timeout period causes the failover to fail. This can occur if SyncIQ policies are already running when the failover job is started, or if SyncIQ steps take longer than expected to complete. The timeout can be changed, but lowering it does not accelerate failover.
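
To illustrate why lowering the timeout does not speed anything up, here is a minimal sketch of a per-step timeout loop. It is not Eyeglass code. The step representation, the polling interval, and all names are assumptions made for illustration; only the 45-minute default comes from the text above.

```python
import time
from typing import Callable, Iterable, List, Tuple

STEP_TIMEOUT_SECONDS = 45 * 60  # 45-minute per-step default described above


class FailoverStepTimeout(Exception):
    """Raised when a single step exceeds its timeout."""


def run_steps(steps: Iterable[Tuple[str, Callable[[], bool]]],
              timeout_seconds: float = STEP_TIMEOUT_SECONDS,
              poll_seconds: float = 5.0,
              clock: Callable[[], float] = time.monotonic,
              sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Wait for each (name, is_done) step in order; fail on the first timeout.

    Hypothetical model: is_done() returns True once the step has finished.
    """
    completed = []
    for name, is_done in steps:
        deadline = clock() + timeout_seconds  # the timer restarts per step
        while not is_done():
            if clock() >= deadline:
                raise FailoverStepTimeout(
                    f"step {name!r} did not complete within {timeout_seconds} s")
            sleep(poll_seconds)
        completed.append(name)
    return completed
```

In this model the timeout only bounds how long the loop waits. A step finishes when its own work is done, so a smaller value makes the job fail sooner but never makes it complete sooner.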

IMPORTANT:

If you delete configuration data (shares, exports, quotas) or modify a share name, NFS alias name, or NFS export path on the target cluster before failing over, and Eyeglass Configuration Replication has not run since that change, the object will incorrectly be deleted on the source cluster after failover. To prevent this, run Eyeglass Configuration Replication before the failover OR select the Config Sync checkbox on failover.

How to Fail Over Data with DR Assistant

This covers Access Zone/IP Pool mode, DFS policy mode, or SyncIQ mode.

To fail over data with DR Assistant:

  1. Consult the Failover Design Guide for how to monitor failover progress.
  2. Review the steps below for post-failover tasks.
  3. Verify manually that there are no open files on the Source Cluster. There should be no client access to the Failover Source cluster during failover, as this data will not be replicated.
  4. Open the DR Assistant icon.

  1. Select the Failover Type designed for your environment.
  2. Select the Source Cluster that has the writeable data to fail over.
  3. Leave all default check boxes for a planned controlled failover.
  1. Controlled failover
    1. Check if the source cluster is healthy and reachable (see the LiveOPS Dashboard icon).
    2. Uncheck if the source cluster is not healthy or reachable. This option is a REAL DR event. NOTE: Do not use this option unless you are lab testing OR you are prepared for manual steps to recover from the resulting end state. In this case, source cluster API calls are skipped and cached knowledge of shares and quotas is used to fail over (Real DR Event).
  1. IMPORTANT:
    1. Eyeglass Configuration Replication Jobs will be in USERDISABLED state on source and target cluster after an uncontrolled failover.
    2. Eyeglass requires that directories being failed over exist on the target cluster, which means SyncIQ policies must have run at least once prior to failover.
  1. Data Sync
    1. Check to run a final SyncIQ data sync Job as part of the failover
    2. Uncheck to skip the SyncIQ data sync step
  2. Config Sync
    1. Check to run a final Eyeglass Configuration Replication Job as part of the failover
    2. Uncheck to skip the Eyeglass Configuration Replication step
  3. SMB Data Integrity Failover
    1. Check to execute SMB Data Integrity Failover step. It disconnects any active SMB sessions prior to failover and ensures that no new sessions can be established on the failover source.
    2. NOTE: If shares use run as root with full control, access is no longer made as an Active Directory user; root is a Linux user on Isilon only. Any share with run as root bypasses all security and auditing, and its users cannot be locked out of the share. Shares with this permission will not lock out any users.
    3. Uncheck to skip SMB Data Integrity Failover step.
  4. Quota Sync
    1. This option allows skipping of quota failover, which leaves the quotas on the source cluster. Skipping is appropriate when thousands of quotas exist, which affects failover performance of SyncIQ operations. It also removes the risk of a quota scan job impacting SyncIQ operations on quotas that are flagged as needing a scan on the destination cluster.
    2. Checked means quotas will failover (create on target, delete on the source).
    3. Unchecked means quotas will not be failed over but will remain on the source cluster.
  5. Block Failover on Warning
    1. Blocks failover from starting if validation warnings are detected. All warnings in DR Dashboard will block a failover and must be reviewed before unchecking this option to continue.
    2. If you proceed with a failover despite warnings, you do so at your own risk to data. Recommendation: Open a case and get input from support.
  6. Quota Domain conflict with SyncIQ Validation:
    1. Allows override of the default validation that detects target cluster quotas with the quota scan pending flag set. This flag blocks running policies, resync prep, and make writeable steps from completing on policies that have newly created quotas and no quota domain created.
    2. See image below on policy quota domain validation check.
  7. This validation will block a failover attempt when checked and a warning validation is detected on the zone, policy, or IP pool (enabled).
  8. To continue anyway, uncheck this option and restart the failover. Once unchecked, you are taking the risk of SyncIQ policies failing either the Make Writeable step or the Resync Prep step.
  9. Solution: Run the quota scan job from the cluster jobs menu and allow the quota scan to complete the quota domain creation on all quotas with the flag set. Then start the failover again.
  10. NOTE: If multiple policies are defined, some policies may fail the Make Writeable step OR the Resync Prep step. In this release Eyeglass will continue to the next policy if a step fails.
  11. SyncIQ Resync Prep
    1. Check to execute the SyncIQ Resync Prep failover step (advanced setting; leave this at the default).
    2. Uncheck to skip the SyncIQ Resync Prep failover step. This is not recommended, as it will leave the system in a state where you will not be able to use Eyeglass to fail back. This is used ONLY when customers want to fail over in one direction and then create a new policy, or when they know how to manually recover and create the mirror policy.
    3. Disable SyncIQ Jobs on Failover Target (advanced setting; leave defaults)
      1. Disable on failover is optional, for when you do not want to configure failback and execute the sync job in the return direction. This is used when you want to verify systems before replicating data back to the source. Warning: Using this option WILL require manual steps to fail back.
  12. MUST READ: Uncheck Controlled Failover ONLY if this is a REAL DR event. NOTE: If this is unchecked, Eyeglass assumes the source cluster is destroyed and NO steps that provide failback are executed. The customer is responsible for recovery from uncontrolled failover; it is not covered by Superna support. NO automated recovery is possible after using this option. Customers are expected to make decisions that protect data at all times and to use this option only if the data is deemed not usable for business reasons.
      • Recovery from uncontrolled failover is the customer's responsibility and is NOT covered by the Superna support contract.
      • This will require involvement with all vendors related to the equipment in the customer data center and receiving the green light from all vendors that the data center is ready to resume operations. This includes Isilon and all dependent components such as AD, DNS, other applications using Isilon services, and physical infrastructure (power, networking WAN links).
  13. All recovery is manual if this option is used. Ensure the cluster you fail away from is no longer accessible to users and take steps to ensure it cannot be accessed.
  14. (Screenshot below) Review the best practices document. It is expected that all release best practices are read and understood before proceeding. This document also covers prep steps for failover, for example domain mark. This document is a must read for any failover.
  15. Select the policy or policies or Access Zone for the failover type selected.
  16. Check readiness again before continuing to ensure you understand the warnings and whether they will affect your failover. In general, warnings do not block failover; errors block failover.
  17. Review the Failover release notes, which cover special scenarios that must be assessed to see whether they affect your planned failover. This document requires acknowledgment before continuing. Failing to read this document can result in data loss.
  1. Per SyncIQ Policy or DFS failover validation Screen.
  2. This screen lists all policies selected for the failover.  NOTE: The previous policy selection screen only allows policies that are valid choices. The Selected policies will be summarized. (Screenshot below)
  1. Access Zone validation Screen (below). NOTE: This screen will show all policies within the Access Zone that are eligible for failover. If any policy is USER DISABLED or policy disabled it will be shown as “will NOT be failed over”. NOTE: Do not failover with a disabled policy unless you know the data protected by this policy does not need to be failed over. 
  1. Reference the screenshot below to see a successful validation screen.
  2. Review final confirmation screen. 
  3. Review link to recovery guide.  
  4. Final acceptance and point of no return.
  5. NOTE: The failover job can be canceled once started, but recovery steps will be manual.
  6. Start the failover with the Run button.
  7. Monitor progress with the clickable logs link.
  8. Click Watch to follow the failover in real time, or click Fetch to update the log window with current progress.
  9. NOTE: Failover jobs can be canceled with the cancel job link.
  10. ONLY USE IF DIRECTED BY SUPPORT
  11. WARNING: IF YOU CANCEL A FAILOVER, MANUAL RECOVERY OF NETWORKING POLICY STATE, SHARES, SPN, SMARTCONNECT IS REQUIRED. SUPPORT IS UNABLE TO ASSIST WITH RECOVERY FROM INTENTIONALLY CANCELING A FAILOVER.
  12. Failover status completes with success or failure.
  13. If the status is failure, download the support log from the DR Assistant History tab and open a support case.
  14. IMPORTANT: Always test data access after any failover, whether it ends in success or failure. Detailed steps are posted in How to Validate and troubleshoot A Successful Failover WHEN Data is NOT Accessible on the Target Cluster.
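
As a planning aid, the check boxes described in the steps above can be modeled as a simple record and reviewed before the failover is started. The sketch below is an assumption-laden illustration, not an Eyeglass API. The class, field names, and `review` function are hypothetical, and the "planned" example values are illustrative rather than the product's documented defaults.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FailoverOptions:
    """Hypothetical model of the DR Assistant check boxes (True = checked)."""
    controlled_failover: bool
    data_sync: bool
    config_sync: bool
    smb_data_integrity: bool
    quota_sync: bool
    block_on_warning: bool
    resync_prep: bool
    disable_synciq_jobs_on_target: bool


def review(options: FailoverOptions) -> List[str]:
    """Return the consequences the guide attaches to each non-standard choice."""
    notes = []
    if not options.controlled_failover:
        notes.append("Uncontrolled failover: all recovery is manual and not covered by support.")
    if not options.data_sync:
        notes.append("Data Sync unchecked: the final SyncIQ data sync is skipped.")
    if not options.config_sync:
        notes.append("Config Sync unchecked: run Eyeglass Configuration Replication before failover.")
    if not options.smb_data_integrity:
        notes.append("SMB Data Integrity unchecked: active SMB sessions are not disconnected.")
    if not options.quota_sync:
        notes.append("Quota Sync unchecked: quotas remain on the source cluster.")
    if not options.block_on_warning:
        notes.append("Block Failover on Warning unchecked: proceeding with warnings is at your own risk.")
    if not options.resync_prep:
        notes.append("Resync Prep unchecked: Eyeglass cannot be used to fail back.")
    if options.disable_synciq_jobs_on_target:
        notes.append("SyncIQ jobs disabled on target: failback requires manual steps.")
    return notes
```

An empty list from `review` means none of the choices flagged in this guide were made; any returned note should be read against the corresponding step before clicking Run.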
Copyright Superna LLC