Building a world-wide, real-time search service

At Ozmo, we produce thousands of device and app tutorials and heavily utilize Amazon Web Services (AWS) managed Elasticsearch to enable customers to easily search for answers to their support inquiries. This started with a single Elasticsearch cluster in us-east-1 and eventually expanded to us-west-2, using a simple nightly sync job with us-east-1 as the master cluster location, illustrated below.

Elasticsearch Pipeline Architecture (Old) — Old search architecture

As Ozmo has grown, however, our customers need real-time updates wherever they are, in addition to not being susceptible to a single point of failure. Our search service should be able to easily expand to additional regions wherever needed. This post describes the many issues we faced in migrating to a highly available, real-time architecture and how we dealt with them.

Goals of the upgrade

First, we established our goals, which spanned four primary areas: make the service highly available, allow easy recovery, provide real-time updates and of course not impact the end-user at all during the process. Implementing real-time updates across all regions proved to be the largest task because this requires reliably replicating all data to all regions.

High availability

No single point of failure. If one region has issues or goes down it should not affect any other regions.
Fast fail-over, no matter where a request originates from. Requests normally routed to the us-east-1 region should automatically go to us-west-2 (and vice-versa) without any human intervention.

Disaster recovery

Elasticsearch should not be the main storage location for documents. Currently, quite a bit of processing must be done when retrieving metadata from S3, creating a properly structured indexable document, to finally indexing it into Elasticsearch. Ozmo wants something in the middle to hold the processed documents before indexing it into Elasticsearch.

Real-time updates

The above diagram shows data is only replicated to us-west-1 nightly. Instead, all regions should receive updates simultaneously.

User experience

No end-user impact. The changeover should be seamless and the end user should not notice anything different.

Options we considered

We considered Elastic-hosted Elasticsearch Cross-Cluster replication, but didn't pursue this option because it was just recently released when this project began. We had already been heavily utilizing AWS-managed Elasticsearch so instead, a replication middle layer was chosen to distribute documents globally prior to indexing.

Commonly mentioned replication layers include S3 region mirroring, DynamoDB, and Amazon Kinesis^[1]. DynamoDB stood out to us in particular because it acts as a database and has great support for triggering Lambdas with a 24-hour retry period.

What we did and why

To summarize our approach, data originates in S3, is sent to DynamoDB and replicated globally, then is sent to Elasticsearch. The new event-driven architecture, illustrated below, utilizes Lambdas to process and transport data between services.

Elasticsearch Pipeline Architecture (New) — New search architecture

We utilize Infrastructure-as-Code practices as much as possible to make testing and deployment easier, and CloudFormation works very well to manage all the various components (Lambdas, IAM policies, VPC subnets, etc). This project was a great way to try different strategies to see what worked best since building Lambdas with CloudFormation is still relatively new to us. Conveniently in this new pipeline, Elasticsearch updates are only generated internally, so even if there is an issue there is some flexibility to fix it before the end user is affected.

Issues building the new pipeline

Going into this project, some challenges were definitely anticipated around designing the service to be highly available and real time. This included learning about and working with AWS CloudFormation, Lambdas and other parts of the AWS ecosystem we hadn’t used before. Many smaller issues during document processing, however, came as a surprise during testing. These issues were harder to catch since they only came up in the final stage where new search results were compared to old results.

CloudFormation: We utilize CloudFormation to help set up and provision many infrastructure components, but most of these cases use a pre-built template. Combining CloudFormation with Lambdas meant learning about the details of how CloudFormation works and how to use Lambdas alongside it - both of which were somewhat new to us.
Lambdas: We are utilizing several Lambdas in production, but this project was a great opportunity to create a better workflow applicable to other teams for developing and testing Lambdas. The whole Lambda ecosystem is quite frustrating at first between deployment, versions, aliases, testing and tying it all together with CloudFormation. Many of these issues were solved by a simple Python wrapper around AWS tools along with an in-house build script to help local development.
Document variation: The final Elasticsearch documents had some variations that were not completely obvious until we started testing. The final month or so was mostly comparing Elasticsearch results, finding a small edge case, taking that edge case into account and repeating. We intentionally didn't simply create and restore an Elasticsearch snapshot to ensure the new pipeline was backwards-compatible.
S3 events: S3 only allows a single destination for a particular action/prefix/suffix pattern. Thankfully this just required a simple SNS trigger to fan out events to multiple destinations.

Conclusion

Most of the slowdowns encountered were due to not completely knowing how a service works (Lambdas, CloudFormation) or not being familiar with small internal metadata details. Eventually these details were worked though and the upgrade was very successful. The new architecture can easily be expanded to even more regions without much effort. Utilizing the new event driven architecture, documents are now processed almost instantly and available to customers no matter where they are in the world.

Published: May 27, 2020July 23, 2021

Goals of the upgrade

High availability

Disaster recovery

Real-time updates

User experience

Options we considered

What we did and why

Issues building the new pipeline

Conclusion

Share

Related Articles

Rogers features Ozmo’s interactive tutorials in new commercial

Understanding average handle time (and five ways to improve it)