About stretched clusters in Intelligent Infrastructure


Overview

The Intelligent Infrastructure (II) supports stretched clusters, an active Data Center deployment model in which a logical cluster contains two or more host servers in different locations. Stretched clusters use a failover process that significantly improves disaster preparedness. If one server fails, the virtual machines it hosts will automatically restart on another server either in the same location or at a secondary location, reducing downtime. The components required to enable stretched clusters include synchronously replicated storage, common network infrastructure, and sufficient compute resources.

Requirements

To take advantage of II stretched cluster failover between Data Centers, virtual machines (VMs) and the services they support must meet the following requirements.

Infrastructure requirements

The Data Center infrastructure must meet the following high-level requirements:

  • Network infrastructure supports common (same) IP space between campus Data Centers
  • Common storage environment between campus Data Centers
  • Common compute infrastructure between campus Data Centers
  • Logical configuration established to connect network, storage, and compute resources into a stretched cluster

Service requirements

The service environment's VMs must:

  • Be hosted within the Intelligent Infrastructure (II) service
  • Use replicated storage
  • Be configured to use Virtual Private LAN Service (VPLS) VLANs
  • Be assigned to a stretched cluster

Availability

II stretched cluster protection is active for every VM that shows protected status in the VM information report. The II system algorithms manage where each VM runs; during normal operations, each VM runs in its configured home campus Data Center.

A small percentage of VMs are not configured to use stretched clusters. These include VMs not configured to use VPLS-enabled networks and VMs whose owners requested hosting on a single site only. Also, not all workloads require infrastructure failover; examples include services that have a geographically redundant infrastructure (DNS, DHCP, containers), and system architectures that are geographically dependent on a single site.

Determine whether your VMs are protected

To determine whether your VMs are protected by II stretched clusters, generate a VM information report. The "Stretched_Cluster_Protected" column indicates whether each VM is protected.
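
If you manage many VMs, the check can be scripted. The following is a minimal sketch that assumes the report has been saved as CSV text. Only the "Stretched_Cluster_Protected" and "OwnerEmail" names come from this document; the "VM_Name" column, the "Yes"/"No" values, and the `unprotected_vms` helper are hypothetical, so adjust them to match the actual report.

```python
import csv
import io


def unprotected_vms(report_csv: str, name_column: str = "VM_Name") -> list[str]:
    """Return the names of VMs whose protected flag is not affirmative.

    Assumption: the report is CSV text, and a protected VM has one of the
    values below in the Stretched_Cluster_Protected column.
    """
    affirmative = {"yes", "true", "protected", "1"}
    reader = csv.DictReader(io.StringIO(report_csv))
    return [
        row[name_column]
        for row in reader
        # Treat a missing or empty cell as "not protected".
        if (row.get("Stretched_Cluster_Protected") or "").strip().lower()
        not in affirmative
    ]


# Hypothetical report contents, for illustration only.
sample = """VM_Name,OwnerEmail,Stretched_Cluster_Protected
web01,owner@example.edu,Yes
db01,owner@example.edu,No
"""

print(unprotected_vms(sample))  # ['db01']
```

Any VM the script lists would remain at risk during a Data Center failover.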

Failover events

A failover event occurs when one Data Center goes down, causing VMs to run in the remaining IU Data Center. Failover events are either planned or unplanned.

Planned failovers

An example of a planned failover event is a scheduled upgrade or a maintenance event at the Data Center level. A planned failover may also be initiated in response to a predicted failure.

Communication

The Storage and Virtualization (SAV) team uses the following channels to communicate planned failover events:

Unplanned failovers

An example of an unplanned failover event is a complete loss of Data Center power to racks for storage and compute resources, or a widespread compute resource failure in one Data Center with power loss to an individual rack.

Unplanned failovers cause affected VMs to reboot. Depending on boot dependencies, virtual server system administrators may need to restart application tiers. For example, if an application server starts before the database server, the application service may need to be restarted to establish database connectivity.
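
One way to reduce manual restarts is to have the application tier wait for its database before it starts. The following is a minimal sketch of that idea, not part of the II service. The `port_open` and `wait_until` helpers, the host name, the port, and the timeout values are all hypothetical.

```python
import socket
import time
from typing import Callable


def port_open(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 300.0,
    interval_s: float = 5.0,
) -> bool:
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A startup script could call `wait_until(lambda: port_open("db.example.edu", 5432))` and start the application service only if the call returns True. An open port shows that the database host is accepting connections, not that the database is fully ready, so an application-level check may still be needed.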

Communication

Once the Storage and Virtualization (SAV) team has determined the scope of the unplanned failover, the team will use the following channels to communicate the event:

  • Email messages sent directly to owners of affected VMs ("OwnerEmail" addresses)
  • VMADMIN-L mailing list (if the failover is environment-wide)

Backups and data recovery

In the event of an unplanned widespread failure in the Data Center, II stretched clusters improve service availability and ensure that VMs recover quickly in a crash-consistent state. Stretched clusters do not replace out-of-band backup solutions such as IU's Data Protection Service (DPS).

DPS provides options for recovering a VM's files or operating system from a historical state. The DPS AllDisks backup option lets you recover a deleted file or a corrupted operating system on a single VM.

Synchronous storage replication

If synchronous storage replication is suspended, VMs will continue to operate normally, but Data Center failover will not be available.

When synchronous storage replication resumes, VMs will continue to operate normally, replication will synchronize incremental changes that occurred, and stretched cluster capabilities will be restored.

Example failover scenarios

Planned failover due to scheduled upgrade or maintenance event at a Data Center

  • The Storage and Virtualization (SAV) team will migrate VMs from the Data Center undergoing maintenance to the remote IU Data Center.
  • If possible, storage synchronization will stay enabled throughout the event.
  • The SAV team will communicate details of the event through the mailing list.
  • VMs that are not protected by II stretched clusters will remain local and at risk during the maintenance event.

Complete loss of Data Center power to racks for storage and compute resources

  • For VMs protected by II stretched clusters, storage and compute resources will automatically fail over to the opposing campus.
  • Every VM on the impacted campus will experience a high availability (HA) event and restart on the remaining campus Data Center.
  • VMs will continue running in the available Data Center until power is restored to the original campus and its storage and compute resources can support the failback operation.
  • Once resources are available and storage is synchronized, VMs will automatically migrate back to the home Data Center.
  • The Storage and Virtualization (SAV) team will communicate details of the event through the mailing list.
  • VMs that are not protected by II stretched clusters will remain offline during the event. Owners may need to do full disaster recovery for these VMs.

Widespread compute resource failure in one Data Center with power lost to an individual rack

  • Impacted VMs will experience a high availability (HA) event.
  • VMs will restart at the local Data Center as long as there are sufficient compute resources.
  • If the compute resources remaining in the local Data Center's portion of the II stretched cluster are exhausted, protected VMs will power on in the remote IU Data Center.
  • When power is restored to the rack and servers are validated as operational, VMs will automatically migrate back to the home Data Center.
  • The Storage and Virtualization (SAV) team will communicate details of the event through the mailing list.
  • VMs that are not protected by II stretched clusters will remain offline during the event. VM owners may need to do full disaster recovery for these VMs.
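
The restart behavior in the two power-loss scenarios above can be summarized as a small decision function. This is an illustrative sketch only, not the II placement algorithm. The `Placement` and `restart_placement` names are hypothetical, and the sketch assumes that an unprotected VM stays offline only when no local compute capacity is available to restart it.

```python
from enum import Enum


class Placement(Enum):
    LOCAL = "restart in the local Data Center"
    REMOTE = "restart in the remote Data Center"
    OFFLINE = "remain offline until the local Data Center recovers"


def restart_placement(protected: bool, local_capacity_free: bool) -> Placement:
    """Where an impacted VM restarts after a high availability (HA) event.

    Assumption: a VM restarts locally whenever local compute capacity
    exists; only protected VMs may restart in the remote Data Center.
    """
    if local_capacity_free:
        return Placement.LOCAL
    if protected:
        return Placement.REMOTE
    return Placement.OFFLINE


# Complete Data Center power loss: no local capacity remains.
print(restart_placement(protected=True, local_capacity_free=False).value)
print(restart_placement(protected=False, local_capacity_free=False).value)
```

With a complete loss of power, `local_capacity_free` is always False, so protected VMs restart remotely and unprotected VMs stay offline, matching the bullets above.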

Network connection severed between campus Data Centers

  • VMs will continue to operate normally.
  • Storage synchronization will automatically be suspended.
  • Write access to volumes will be available in the Data Center where the last write occurred.
    • If last write occurred in Bloomington:
      • Write access to that volume will only be available in Bloomington until the network is back online and the volume is resynchronized.
      • Until the network is restored, access to that volume will only be available in Bloomington.
      • Once the network is available, compute resources will have access to storage through a single Data Center (local access in Bloomington, remote access from Indianapolis).
      • After volumes are resynchronized, compute resources will have local access to storage (local access in Bloomington, local access from Indianapolis).
    • If last write occurred in Indianapolis:
      • Write access to that volume will only be available in Indianapolis until the network is back online and the volume is resynchronized.
      • Until the network is restored, access to that volume will only be available in Indianapolis.
      • Once the network is available, compute resources will have access to storage through a single Data Center (local access in Indianapolis, remote access from Bloomington).
      • After volumes are resynchronized, compute resources will have local access to storage (local access in Indianapolis, local access from Bloomington).
  • The Storage and Virtualization (SAV) team will communicate details of the event through the mailing list.
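
The per-volume access rules above follow one pattern regardless of which campus had the last write. The sketch below restates them as a lookup. It illustrates the documented behavior and is not the storage system's implementation; the `storage_access` function and the "local", "remote", and "none" labels are hypothetical.

```python
SITES = ("Bloomington", "Indianapolis")


def storage_access(
    last_write_site: str, network_up: bool, resynchronized: bool
) -> dict[str, str]:
    """Return how compute resources at each site reach a given volume."""
    if last_write_site not in SITES:
        raise ValueError(f"unknown site: {last_write_site}")
    access = {}
    for site in SITES:
        if site == last_write_site:
            access[site] = "local"   # the last-write site keeps local access
        elif not network_up:
            access[site] = "none"    # link severed: the volume is unreachable
        elif not resynchronized:
            access[site] = "remote"  # served through the last-write site
        else:
            access[site] = "local"   # fully resynchronized
    return access


print(storage_access("Bloomington", network_up=False, resynchronized=False))
# {'Bloomington': 'local', 'Indianapolis': 'none'}
print(storage_access("Bloomington", network_up=True, resynchronized=False))
# {'Bloomington': 'local', 'Indianapolis': 'remote'}
```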

Get help

After a failover, validate that all services you support are responding as expected. If there are any issues, contact the Storage and Virtualization team at sav-request@iu.edu.

This is document bafp in the Knowledge Base.
Last modified on 2023-11-20 16:24:21.