Design Considerations for High Availability and Disaster Recovery of vRealize Automation - Part 1

vRealize Automation 7.3 makes several significant improvements over its predecessor. The most important are in the high availability and automatic failover/failback arena, and they are welcomed equally by VMware customers and by consultants like me, who always endeavour to design scalable and highly available solutions for their customers.


In this blog series, I will discuss the design options and choices we have for building a robust and highly available vRealize Automation based cloud infrastructure, either within a single site or across multiple data centers in a metro area with less than 5 ms of latency, to provide both scalability and some disaster recovery capabilities.

But before I delve deeper, let's first take a quick look at the key components and roles of a vRealize Automation based cloud infrastructure that need to be protected and should therefore be designed and deployed in a highly available manner.


  • vRealize Automation Appliance
  • Single Sign-On / Authentication Service
  • vRealize Orchestrator
  • Infrastructure Website
  • Infrastructure Manager Service
  • Distributed Execution Manager – Orchestrator
  • Distributed Execution Manager – Worker
  • Proxy Agents
  • Microsoft SQL Server

Depending on your availability and scalability requirements, these roles can be combined on a few machines or distributed across many.

For example, a standalone but distributed medium-scale design based on the VMware reference architecture, without any high availability, would have the following key building blocks.


This is a good, scalable design because every key role is distributed across different machines. It still provides no resiliency against component-level failure, however, because every role is deployed as a single instance. Customers who rely on vRealize Automation as the primary self-service portal for their cloud automation and orchestration requirements will also need high availability for their production deployments.
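To make the single-instance problem concrete, here is a minimal Python sketch that models a deployment as a list of roles with instance counts and flags any role that has no redundancy. The `Role` class, the `single_points_of_failure` helper and the instance counts are purely illustrative assumptions of mine and are not part of any VMware tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """A vRA role and how many instances of it are deployed (illustrative)."""
    name: str
    instances: int


def single_points_of_failure(roles):
    """Return the names of roles deployed with fewer than two instances."""
    return [role.name for role in roles if role.instances < 2]


# Hypothetical standalone distributed design: every role has one instance.
standalone_distributed = [
    Role("vRealize Automation Appliance", 1),
    Role("Infrastructure Website", 1),
    Role("Infrastructure Manager Service", 1),
    Role("Distributed Execution Manager - Worker", 1),
    Role("Microsoft SQL Server", 1),
]

if __name__ == "__main__":
    for name in single_points_of_failure(standalone_distributed):
        print(f"no redundancy: {name}")
```

In this model every role is reported, which is exactly the weakness of a standalone design. Note that having two or more instances is necessary but not sufficient: some roles are active-passive and may still need a manual failover.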

Therefore, the next step is to deploy multiple instances of these roles to provide high availability, either within a single site to ensure resiliency against component-level failure, or across two different sites to cover site-level failure and disaster recovery of the vRealize Automation self-service portal.


You can now even deploy vRealize Automation 7.3 across multiple sites that act as active-active sites during normal operations and, with minimal RTO and RPO, provide disaster recovery for vRA itself in case of a site-level failure, without using Site Recovery Manager or any third-party replication solution. This is largely dependent on your existing vSphere design, though. While the primary site is down, you will not be able to provision anything on it or perform any Day-2 operations on the virtual machines already deployed there, unless this design is used in conjunction with some other replication solution. I will discuss these scenarios in detail later in this series.


A design like this also involves many more building blocks, such as load balancers and Microsoft SQL Server clustering or an AlwaysOn Availability Group for the IaaS database. It also requires careful attention to each vRA component, because not every vRA role runs active-active and some components may need manual failover. For example, the vRA Postgres database on the appliance supports automatic failover only if three nodes are deployed and synchronous replication is configured between two of them.
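The Postgres condition above can be written down as a simple design check. This is only a sketch of the rule as stated in this post; the function name and the `"synchronous"` / `"asynchronous"` mode strings are my own assumptions and do not correspond to any vRA API or CLI.

```python
def supports_automatic_failover(appliance_nodes: int, replication_mode: str) -> bool:
    """Check the design rule for automatic appliance database failover.

    Rule as stated in the post: three appliance nodes, with synchronous
    replication between two of them. Treating "three or more" as
    acceptable is an assumption made for this sketch.
    """
    has_enough_nodes = appliance_nodes >= 3
    is_synchronous = replication_mode.strip().lower() == "synchronous"
    return has_enough_nodes and is_synchronous


if __name__ == "__main__":
    designs = [
        (2, "synchronous"),
        (3, "asynchronous"),
        (3, "synchronous"),
    ]
    for nodes, mode in designs:
        result = "automatic" if supports_automatic_failover(nodes, mode) else "manual"
        print(f"{nodes} nodes, {mode} replication: {result} failover")
```

Any design that fails this check should have a documented manual failover procedure for the appliance database.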


Let’s go through these intricacies and run through some of the design scenarios in the next part of this series.