Serving Insights from Big Data: Approach using Microservices on AWS

Preliminary design for sharing insights, based on microservices and a Spark-based ML solution
  1. If your database exists only to answer customers' queries on data of varying quality and stage, it is adding to your infrastructure and maintenance costs. Running a non-cloud-native database on dedicated EC2 instances, and adding replication infrastructure and mechanisms to make the system resilient to unexpected crashes, adds considerably to your costs and processes. This becomes more painful if your data grows exponentially over time.
  2. With growing data and a growing customer base, increasing throughput becomes a nightmare that begins with optimizing complex queries, increasing instance sizes, and scaling APIs.
  • Durability
  • Security
  • Resilience
  • Increased request-rate performance (at least 5,500 GET/HEAD requests per second per prefix in an S3 bucket)
Microservice using S3 Select via the S3 SDK to GET data by passing SQL-like query syntax
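As a rough illustration of this approach, the sketch below shows how a microservice might call S3 Select through boto3's `select_object_content`. It assumes the object is a CSV file with a header row; the helper names `build_select_request` and `select_rows` are made up for this example, and the client is passed in so the function can be exercised without AWS credentials.

```python
import json


def build_select_request(bucket, key, sql):
    """Build keyword arguments for the S3 client's select_object_content call."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": sql,
        # Assumption: the object is a CSV file whose first row holds column names.
        "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
        # Ask S3 to return one JSON document per matching row.
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }


def select_rows(s3_client, bucket, key, sql):
    """Run the query against one object and return matching rows as dicts."""
    response = s3_client.select_object_content(
        **build_select_request(bucket, key, sql)
    )
    chunks = []
    # The payload is an event stream; only "Records" events carry row data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join before decoding, because a row may be split across two events.
    text = b"".join(chunks).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Hypothetical usage (bucket, key and column names are placeholders):
#   import boto3
#   rows = select_rows(
#       boto3.client("s3"),
#       "insights-bucket",
#       "daily/summary.csv",
#       "SELECT s.customer_id, s.score FROM S3Object s WHERE s.region = 'EU'",
#   )
```

Only the filtered rows travel back to the microservice, which is the main appeal of this option when each request needs a small slice of a large object.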
API-based insights rendering using a serverless query service, AWS Athena
  • An AWS Glue crawler refreshes Athena's catalog of the bucket prefixes from which data will be queried. As soon as data is refreshed under a bucket prefix, the crawler updates the corresponding Athena view as well.
  • Microservices use the AWS Athena SDK client to connect to Athena and run SQL statements that aggregate and fetch the data to be served to the calling entity.
  • Auto scalability to handle the load of your growing data and access demands
  • High throughput and low latency in accessing data
source: https://eng.uber.com/introducing-athenadriver/
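The Athena-backed flow can be sketched roughly as follows, using the boto3 Athena client calls `start_query_execution`, `get_query_execution` and `get_query_results`. This is a minimal sketch, not a production implementation: the function names, polling interval and timeout are assumptions for illustration, and only the first page of results is read.

```python
import time


def fetch_rows(athena, query_id):
    """Read the first page of results for a finished query as a list of dicts."""
    result = athena.get_query_results(QueryExecutionId=query_id)
    rows = result["ResultSet"]["Rows"]
    if not rows:
        return []
    # For SELECT statements the first row carries the column names.
    header = [cell.get("VarCharValue") for cell in rows[0]["Data"]]
    return [
        dict(zip(header, (cell.get("VarCharValue") for cell in row["Data"])))
        for row in rows[1:]
    ]


def run_athena_query(athena, sql, database, output_location,
                     poll_seconds=1.0, timeout_seconds=60.0):
    """Submit a query, wait until it finishes, and return (query_id, rows)."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes every result set to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    deadline = time.monotonic() + timeout_seconds
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Athena query {query_id} still {state} at timeout")
        time.sleep(poll_seconds)

    return query_id, fetch_rows(athena, query_id)


# Hypothetical usage (database, table and bucket names are placeholders):
#   import boto3
#   query_id, rows = run_athena_query(
#       boto3.client("athena"),
#       "SELECT region, count(*) AS orders FROM insights GROUP BY region",
#       database="analytics",
#       output_location="s3://query-results-bucket/athena/",
#   )
```

Returning the query ID alongside the rows matters for what follows: the ID is what lets a later request locate the stored result instead of running the statement again.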
  • An internal caching strategy can be implemented in the microservice layer to cache either the query results or the query ID of each query. Any further request for the same query can then be served directly from the cache, or by reading the stored results from S3 rather than re-running the query on Athena, which saves cost and time.
  • The query service's results bucket adds to storage cost, but with a proper archival or cleanup process the size of this bucket can be kept under control.
A minimalistic flow chart of the cache-based approach (thanks to StarUML)
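One possible shape for the query-ID cache is sketched below. It is an in-memory, single-process illustration under stated assumptions: the class `QueryIdCache`, the function `serve`, and the 15-minute default expiry are all invented for this example, and `run_query` and `fetch_results` stand in for whatever the microservice uses to execute a statement on Athena and to read a finished query's stored results.

```python
import hashlib
import time


class QueryIdCache:
    """Remember the Athena query ID of each recently executed SQL statement."""

    def __init__(self, ttl_seconds=900, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # cache key -> (query_id, expiry time)

    @staticmethod
    def key_for(sql):
        # Hash the statement so long SQL strings make compact, uniform keys.
        return hashlib.sha256(sql.strip().encode("utf-8")).hexdigest()

    def get(self, sql):
        key = self.key_for(sql)
        entry = self._entries.get(key)
        if entry is None:
            return None
        query_id, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the next request re-runs the query.
            del self._entries[key]
            return None
        return query_id

    def put(self, sql, query_id):
        self._entries[self.key_for(sql)] = (query_id, self._clock() + self._ttl)


def serve(sql, cache, run_query, fetch_results):
    """Return rows for sql, running it on Athena only on a cache miss.

    run_query(sql) -> query_id   executes the statement on Athena
    fetch_results(query_id)      reads the stored results of a finished query
    """
    query_id = cache.get(sql)
    if query_id is None:
        query_id = run_query(sql)
        cache.put(sql, query_id)
    return fetch_results(query_id)
```

The expiry would normally be tied to how often the underlying data is refreshed, so that a cached query ID never outlives the data it was computed from. A shared store would be needed in place of the dictionary if several service instances should benefit from the same cache.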
