At Ozmo, we produce thousands of device and app tutorials and heavily utilize Amazon Web Services (AWS) managed Elasticsearch to enable customers to easily search for answers to their support inquiries. This started with a single Elasticsearch cluster in us-east-1 and eventually expanded to us-west-2, using a simple nightly sync job with us-east-1 as the master cluster location, illustrated below.
As Ozmo has grown, however, our customers need real-time updates wherever they are, in addition to not being susceptible to a single point of failure. Our search service should be able to easily expand to additional regions wherever needed. This post describes the many issues we faced in migrating to a highly available, real-time architecture and how we dealt with them.
First, we established our goals, which spanned four primary areas: make the service highly available, allow easy recovery, provide real-time updates, and, of course, avoid any impact on the end user during the process. Implementing real-time updates across all regions proved to be the largest task because it requires reliably replicating all data to all regions.
We considered Elastic-hosted Elasticsearch cross-cluster replication, but didn’t pursue this option because it had only just been released when this project began. Because we were already heavily using AWS-managed Elasticsearch, we instead chose a replication middle layer to distribute documents globally prior to indexing.
Commonly mentioned replication layers include S3 region mirroring, DynamoDB, and Amazon Kinesis [1]. DynamoDB stood out to us in particular because it acts as a database and has great support for triggering Lambdas with a 24-hour retry period.
To summarize our approach, data originates in S3, is sent to DynamoDB and replicated globally, then is sent to Elasticsearch. The new event-driven architecture, illustrated below, utilizes Lambdas to process and transport data between services.
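To make the last hop of this pipeline more concrete, here is a minimal sketch of how a Lambda might turn a DynamoDB stream event into an Elasticsearch bulk request body. It is not our production code: the index name `tutorials`, the key attribute `doc_id`, and both function names are hypothetical, and it assumes the stream is configured to include new item images. Only the event shape (`Records`, `eventName`, `Keys`, `NewImage`) and the newline-delimited bulk format come from the AWS and Elasticsearch documentation.

```python
import json


def from_dynamodb_json(attr):
    """Convert one DynamoDB-typed attribute, e.g. {"S": "x"}, to a plain value."""
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        # DynamoDB sends numbers as strings; prefer int, fall back to float.
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_dynamodb_json(item) for item in value]
    if type_tag == "M":
        return {key: from_dynamodb_json(item) for key, item in value.items()}
    raise ValueError(f"unsupported DynamoDB type: {type_tag}")


def stream_event_to_bulk_body(event, index_name="tutorials", key_name="doc_id"):
    """Build a newline-delimited bulk body from a DynamoDB stream event.

    index_name and key_name are illustrative assumptions, not real settings.
    """
    lines = []
    for record in event.get("Records", []):
        doc_id = from_dynamodb_json(record["dynamodb"]["Keys"][key_name])
        if record["eventName"] == "REMOVE":
            # Deletions need only an action line.
            lines.append(json.dumps({"delete": {"_index": index_name, "_id": doc_id}}))
        else:
            # INSERT and MODIFY both re-index the full new image.
            image = record["dynamodb"]["NewImage"]
            document = {key: from_dynamodb_json(val) for key, val in image.items()}
            lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
            lines.append(json.dumps(document))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

Using the DynamoDB key as the Elasticsearch `_id` keeps the operation idempotent, so a record that is retried after a failure simply overwrites the same document rather than creating a duplicate.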
We utilize Infrastructure-as-Code practices as much as possible to make testing and deployment easier, and CloudFormation works very well to manage all the various components (Lambdas, IAM policies, VPC subnets, etc.). This project was a great way to try different strategies to see what worked best, since building Lambdas with CloudFormation is still relatively new to us. Conveniently, in this new pipeline Elasticsearch updates are only generated internally, so even if there is an issue, there is some flexibility to fix it before the end user is affected.
Going into this project, some challenges were definitely anticipated around designing the service to be highly available and real time. This included learning about and working with AWS CloudFormation, Lambdas and other parts of the AWS ecosystem we hadn’t used before. Many smaller issues during document processing, however, came as a surprise during testing. These issues were harder to catch since they only came up in the final stage where new search results were compared to old results.
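The comparison step mentioned above can be pictured with a small sketch like the following. It is an illustration only, not the tool we used: the function name and the idea of comparing ranked lists of document IDs per query are assumptions made for the example.

```python
def compare_results(old_ids, new_ids):
    """Compare two ranked lists of document IDs returned for the same query."""
    old_set, new_set = set(old_ids), set(new_ids)
    # Documents the old cluster returned that the new one does not.
    missing = [doc_id for doc_id in old_ids if doc_id not in new_set]
    # Documents that only the new cluster returns.
    added = [doc_id for doc_id in new_ids if doc_id not in old_set]
    # Among shared documents, check whether the ranking order changed.
    shared_old = [doc_id for doc_id in old_ids if doc_id in new_set]
    shared_new = [doc_id for doc_id in new_ids if doc_id in old_set]
    return {"missing": missing, "added": added, "reordered": shared_old != shared_new}
```

Running a check like this over a fixed set of representative queries gives a quick signal when a document-processing detail changes what customers would see.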
Most of the slowdowns encountered were due to not completely knowing how a service works (Lambdas, CloudFormation) or not being familiar with small internal metadata details. Eventually these details were worked through and the upgrade was very successful. The new architecture can easily be expanded to even more regions without much effort. With the new event-driven architecture, documents are now processed almost instantly and are available to customers no matter where they are in the world.