The Important Rs for your Solr DR Plan – RTO and RPO

Dipsy Kapoor

Enterprise search, Apache Solr, search relevance and ranking, performance tuning, cloud-native architectures, reliability engineering, platform scalability, and engineering leadership.

Disaster Recovery

August 19, 2019

It’s common for platform and DevOps teams to run their Apache Solr search clusters, and the apps that depend on them, in the cloud, often through Solr-as-a-Service deployments. After a natural or man-made disaster, IT teams need to recover their search capabilities quickly to maintain business continuity. This article introduces the fundamentals of Solr disaster recovery and what it takes to build a defined recovery model for a modern search infrastructure.

There are two key metrics that guide what type of disaster recovery plan is best suited for your Solr Search needs: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). Although they sound very similar, they represent two different aspects of a disaster recovery strategy. Together, these metrics determine the backup strategy that your enterprise would need, and how often data would need to be synchronized between geographically distributed Solr clusters.

In managed environments such as SearchStax Managed Search, these recovery objectives help determine how backups, replication, and failover are configured to restore search infrastructure quickly when disruptions occur.

What is RTO for Solr Search Deployments?

Recovery Time Objective, or RTO, is the maximum amount of time that your Solr service can be unavailable in an emergency. The outage could be caused by a corrupt Solr index or by Solr replicas going into recovery mode, or a cloud provider region could go down because of a natural disaster or a cyber attack. In any of these situations, your business might face some downtime. The RTO sets the deadline for restoring your application when such an event occurs.

What is RPO in Solr Disaster Recovery?

Recovery Point Objective, or RPO, defines the maximum amount of data, measured as a window of time, that your business can tolerate losing in an emergency. For example, if your business can tolerate losing four hours of history in order to get back online quickly, then your RPO would be four hours.
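The four-hour example can be made concrete with a small calculation. The sketch below is illustrative only: the function names and timestamps are hypothetical, and it assumes that everything indexed after the most recent backup would be lost.

```python
from datetime import datetime, timedelta, timezone


def data_loss_window(last_backup_at: datetime, failure_at: datetime) -> timedelta:
    """Return the span of updates that would be lost if we restore the last backup."""
    if failure_at < last_backup_at:
        raise ValueError("failure time precedes the last backup")
    return failure_at - last_backup_at


def meets_rpo(last_backup_at: datetime, failure_at: datetime, rpo: timedelta) -> bool:
    """True when the data lost by restoring the last backup stays within the RPO."""
    return data_loss_window(last_backup_at, failure_at) <= rpo


# Hypothetical values: a four-hour RPO and a failure 3.5 hours after the backup.
rpo = timedelta(hours=4)
last_backup = datetime(2019, 8, 19, 6, 0, tzinfo=timezone.utc)
failure = datetime(2019, 8, 19, 9, 30, tzinfo=timezone.utc)

print(data_loss_window(last_backup, failure))  # 3:30:00
print(meets_rpo(last_backup, failure, rpo))    # True
```

If the same failure happened five hours after the backup, the check would fail, which is a signal that backups need to run more often.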

Why Disaster Recovery Is Harder for Search Systems

Traditional applications can often restore data quickly from transactional databases. Search systems operate differently.

Solr deployments must recover:

indexed document data
schema configuration
cluster state and replicas
query routing and load balancing

Large search indexes may require hours or days to rebuild if the original index is lost. That is why DR strategies for search focus on protecting both cluster availability and index durability. Modern Solr deployments increasingly adopt a defined recovery model that pairs infrastructure architecture with explicit RTO and RPO targets so recovery procedures are predictable and testable during real incidents.

How Do RTO and RPO Shape a Solr Disaster Recovery Strategy?

RPO determines how frequently you need to back up your data and/or synchronize it across your infrastructure. If your RPO is two hours, you need to make sure that a backup no more than two hours old is always available. RTO determines how quickly, given that backup, you must be able to recreate your infrastructure and restore the data.
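As a rough planning aid, the two targets can be turned into arithmetic. This is a simplified sketch under stated assumptions, not a sizing formula: it assumes a backup captures the index as of the moment it starts, so the worst-case loss is the backup interval plus the backup duration, and it assumes recovery time is the sum of provisioning, restore, and verification. All names and durations are hypothetical.

```python
from datetime import timedelta


def max_backup_interval(rpo: timedelta, backup_duration: timedelta) -> timedelta:
    """Longest gap between backup starts that keeps worst-case loss within the RPO.

    Assumption: worst-case loss = interval + backup_duration, because the
    previous backup is the newest usable one until the next one completes.
    """
    if backup_duration >= rpo:
        raise ValueError("a single backup takes longer than the RPO allows")
    return rpo - backup_duration


def estimated_recovery_time(provision: timedelta, restore: timedelta,
                            verify: timedelta) -> timedelta:
    """Sum the recovery phases: new cluster, restore from backup, smoke tests."""
    return provision + restore + verify


# Hypothetical numbers: two-hour RPO and RTO, 20-minute backups.
interval = max_backup_interval(timedelta(hours=2), timedelta(minutes=20))
recovery = estimated_recovery_time(
    provision=timedelta(minutes=15),
    restore=timedelta(minutes=60),
    verify=timedelta(minutes=10),
)

print(interval)                          # 1:40:00
print(recovery)                          # 1:25:00
print(recovery <= timedelta(hours=2))    # True
```

Measured durations from a real recovery drill are a better input than estimates, since restore time grows with index size.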

For some businesses, downtime is not critical. The RTO and RPO can be expressed in hours, and a traditional backup and restore capability is all you need. These can be well-served by a “cold” DR plan, where one starts a new Solr cluster after a disaster and restores the data and configurations from a backup.
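For a cold plan on SolrCloud, backup and restore are typically driven through the Collections API, which provides BACKUP and RESTORE actions. The sketch below only builds the request URLs and does not call a cluster. The host, collection names, backup name, and location are hypothetical, and the location is assumed to be storage that every Solr node can reach. Check the parameters against the reference guide for your Solr version before relying on them.

```python
from urllib.parse import urlencode


def collections_api_url(base_url: str, action: str, backup_name: str,
                        collection: str, location: str) -> str:
    """Build a Collections API URL for a BACKUP or RESTORE request."""
    if action not in ("BACKUP", "RESTORE"):
        raise ValueError(f"unsupported action: {action}")
    params = {
        "action": action,
        "name": backup_name,       # name of the backup to create or restore
        "collection": collection,  # source collection, or target for RESTORE
        "location": location,      # assumed to be shared storage
    }
    return f"{base_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Hypothetical endpoints and names, for illustration only.
backup_url = collections_api_url(
    "http://primary.example.com:8983/solr", "BACKUP",
    "nightly-2019-08-19", "products", "/mnt/solr-backups")
restore_url = collections_api_url(
    "http://dr.example.com:8983/solr", "RESTORE",
    "nightly-2019-08-19", "products_restored", "/mnt/solr-backups")

print(backup_url)
print(restore_url)
```

In a cold plan the backup request runs on a schedule against the primary, while the restore request is issued against a freshly provisioned cluster only after a disaster.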

Other businesses have aggressive RPO and RTO requirements. These businesses need a duplicate infrastructure that mirrors the primary datacenter in real time. If the primary goes down for any reason, the secondary assumes the load while repairs are made. This is a “hot” DR plan.
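The failover decision in a hot plan can be sketched as choosing the first healthy cluster from a preference-ordered list. This is a minimal illustration, not how any particular product implements failover: the cluster URLs are hypothetical, and the health probe is passed in as a function so the example runs without a live cluster. A real probe might call a Solr ping handler or a load balancer health check.

```python
from typing import Callable, Optional, Sequence


def pick_active_cluster(clusters: Sequence[str],
                        is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy cluster in preference order, or None if all are down."""
    for cluster in clusters:
        if is_healthy(cluster):
            return cluster
    return None


# Hypothetical endpoints: primary first, DR secondary second.
clusters = [
    "https://solr-primary.example.com/solr",
    "https://solr-dr.example.com/solr",
]

# Simulated probe results standing in for real health checks.
status = {clusters[0]: False, clusters[1]: True}

active = pick_active_cluster(clusters, lambda url: status[url])
print(active)  # https://solr-dr.example.com/solr
```

Because the secondary already holds a mirrored index, switching traffic is the whole recovery, which is what makes very short RTO targets achievable.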

In general, the more aggressive your RPO and RTO goals, the more cost you will incur to implement them.

Designing Solr disaster recovery starts with defining RTO and RPO. The next step is choosing the right architecture. Cold, warm, and hot disaster recovery models each offer different tradeoffs between cost, recovery speed, and operational complexity.

Dipsy Kapoor is VP of Engineering at SearchStax, leading teams that build and operate cloud-native search solutions. With a background in search, scalable systems and product engineering, she cares deeply about reliability, relevance, and shipping high-impact features that help customers succeed.