Serving Insights from Big Data: Approach using Microservices on AWS

Tushar Bisht
10 min read · Mar 27, 2021

At ground zero, an ML solution is an orchestration of many submodules, beginning with data gathering, cleansing, processing, training, and scoring, through to result generation or application, and insights generation.

While the ML module is the brains of this ecosystem, the insights generated from the data, from the rawest inputs all the way to the results such systems produce, are just as important. One would like a user interface or dashboards with aggregable data, packed with the smart trends that form the foundation of your ML solution, along with impact analysis and the correctness of the model’s inference.

A common approach, when one is not going with the native cloud offerings (like Amazon QuickSight or Google’s Data Studio), is to build a microservice-based framework acting as a gateway between your users and the smart insights that become one of your application’s primary USPs.

So our initial design would be something like this (This blog assumes you are building such a system on AWS):

Preliminary design for insights sharing, based on microservices and a Spark-based ML solution

While the above design would be appropriate for many OLAP systems, which may even integrate with OLTP systems, for a pure analytics system some of the infrastructural components amount to an overhead in both performance and cost. Looking at the above solution:

  1. We are looking at an additional Spark pipeline or job that fetches HDFS-format data, runs transformation and grouping logic, and stores objects in the resident SQL or NoSQL database. While the runtime of this job might look trivial at the start, as the data grows you are looking at added infrastructure cost from the size and run time of the cluster, on top of a longer application SLA.
  2. If all you are using your database for is answering customer queries over data of varied quality and stage, you are adding to your infrastructure cost and maintenance. Running a non-cloud-native database on dedicated EC2 instances, plus the replication infrastructure and mechanisms needed to make the system resilient to unexpected crashes, is a considerable addition to your costs and processes. This becomes more painful as your data grows, potentially exponentially, in the times to come.
  3. With a growing data and customer base, increasing throughput becomes a nightmare that begins with optimizing complex queries, increasing instance sizes, and scaling APIs.

S3 is the logical datastore option when you build an OLAP system on AWS. You would use it to store the raw input formats that feed your primary cluster, the stage-wise processed and transformed data in Parquet or other HDFS formats, and eventually the output. What one underestimates is that S3, in conjunction with other AWS services, can potentially serve as your primary and only data source, at least for your OLAP and volume-heavy data insights needs.

Using S3 as your core data source comes with out-of-the-box features like:

  • Scalability
  • Durability
  • Security
  • Resilience
  • High request rate performance (at least 5,500 GET/HEAD requests per second per prefix in an S3 bucket)

So how would you query S3 holding raw data in CSV, JSON, or HDFS-based formats like Parquet to serve insights?

Option 1: S3 Select
Amazon released the S3 Select API, which allows you to run SQL-like statements and query data directly from your S3 prefixes.

Microservice using S3 Select via the S3 SDK to GET data by passing SQL-like query syntax

The above design shows how you would connect your microservices using the AWS S3 SDK to leverage S3 Select, directly querying your S3 bucket by executing SQL-like statements on file formats like JSON, CSV, and Parquet.
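As a rough illustration, a microservice handler could issue such a query with the AWS SDK for Python (boto3); the bucket, key, and column names below are purely hypothetical:

```python
import boto3

s3 = boto3.client("s3")

# Run an SQL-like filter directly against a single CSV object in S3.
response = s3.select_object_content(
    Bucket="my-analytics-bucket",   # hypothetical bucket
    Key="sales/2021/03/sales.csv",  # hypothetical object key
    ExpressionType="SQL",
    Expression="SELECT s.category, s.revenue FROM S3Object s WHERE s.category = 'electronics'",
    InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},
    OutputSerialization={"JSON": {}},
)

# The result arrives as an event stream; 'Records' events carry the payload bytes.
for event in response["Payload"]:
    if "Records" in event:
        print(event["Records"]["Payload"].decode("utf-8"))
```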

While this design has a slight edge over our previous one, since we cut the cost of a Spark process and a dedicated database instance, it does have a very basic flaw.

The S3 Select API is built to simply fetch and filter data from a single object at a time (it supports WHERE, but no grouping across objects). If you want to perform any kind of aggregation, you have to pull the filtered dataset into your microservices layer, perform the required aggregation there, and send the response to the requesting entity. So if you want to see the aggregated revenue of each category over the past 24 months, you would need to pull in the underlying sales records, an insane data volume and a nightmare for throughput and API scalability.

The optimal way: The solution we would apply to this problem is introducing a query service that lets us run SQL statements with aggregation ops on the underlying S3 data. In the AWS ecosystem, you leverage AWS Athena, a serverless query service that allows you to run queries, fetch data with low latency and high throughput, and serve it to your customers.

Serverless query service API-based insights rendering using AWS Athena

The above insights generation and rendering architecture operates in the following sequence:

  • Spark-based jobs create a flattened Parquet view of the data sources that can be queried to generate insights responses
  • A Glue-based crawler refreshes Athena’s catalog of the bucket prefixes from which data will be queried, so as soon as data is refreshed under a prefix, the corresponding Athena table is also updated by the crawler
  • Microservices use the AWS Athena SDK client to connect to the Athena service and run SQL statements that aggregate and fetch the data to be served to the calling entity (a rough sketch follows below)
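To make that last step concrete, here is a minimal sketch of how a microservice might submit an aggregation query through boto3; the database, table, column, and bucket names are assumptions for illustration:

```python
import time
import boto3

athena = boto3.client("athena")

# Submit an aggregation query against the Glue/Athena table backed by the
# flattened Parquet data (all names here are illustrative).
submission = athena.start_query_execution(
    QueryString="""
        SELECT category, SUM(revenue) AS total_revenue
        FROM sales_flat
        WHERE sale_date >= date_add('month', -24, current_date)
        GROUP BY category
    """,
    QueryExecutionContext={"Database": "insights_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/insights/"},
)
query_id = submission["QueryExecutionId"]

# Poll until Athena finishes; a production service would add backoff and timeouts.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(0.5)
```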

The above approach gives you the following advantages:

  • S3 as your sole data store, reducing the cost of maintaining a stand-alone database layer
  • Auto-scalability to handle the load of your growing data and access demands
  • High throughput and low latency in accessing data

But do you still have to run the same query over and over again every time a similar request is received?

The answer is no. However lucrative and easy the above design looks and sounds, you are always paying the cloud host something for the services you use; the simpler the service, the lower the per-use cost. The Athena API works in the following manner:

source: https://eng.uber.com/introducing-athenadriver/

The Athena service, while querying data from the source S3 bucket, saves the result set in a user-specified query results bucket. StartQueryExecution() returns a query ID, which serves as the key to monitor the query’s execution and to fetch its results from that bucket.
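As a rough illustration, fetching the stored results for an already-executed query looks something like this (the query ID is a placeholder):

```python
import boto3

athena = boto3.client("athena")

# A QueryExecutionId previously returned by StartQueryExecution (placeholder value).
query_id = "11111111-2222-3333-4444-555555555555"

# Athena reads this back from the query results bucket rather than rescanning the
# source data; the same rows are also persisted as <OutputLocation>/<query_id>.csv.
results = athena.get_query_results(QueryExecutionId=query_id, MaxResults=100)

for row in results["ResultSet"]["Rows"][1:]:  # the first row holds the column headers
    print([col.get("VarCharValue") for col in row["Data"]])
```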

So based on the above sequence diagram, a few things come to light:

  • Successful query results are stored in Athena’s query results bucket. Any later call with the same query ID does not require a fresh run in Athena; it is a direct fetch from the S3 bucket, avoiding any data scan.
  • An internal caching strategy can be implemented in the microservice layer, caching either the query results or the query ID per query, so that any further request for the same query can be served either directly from the cache or from the stored results in S3 rather than by re-running the query in Athena, which saves cost and time (a minimal sketch follows below).
  • Additional storage cost is incurred for the query results bucket’s data, but with a proper archival or cleanup process the size of this bucket can be kept well under control.

A minimalistic flow chart of the cache-based approach (thanks to StarUML)
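Here is a minimal sketch of that caching idea, assuming boto3 and a simple in-process dictionary; a production setup would more likely use a shared cache such as Redis/ElastiCache with a TTL tied to the data refresh cycle, and every name below is illustrative:

```python
import hashlib
import time
import boto3

athena = boto3.client("athena")

# In-process cache mapping a query's hash to its Athena QueryExecutionId.
_query_cache = {}

def run_cached_query(sql, database, output_location):
    key = hashlib.sha256(sql.encode("utf-8")).hexdigest()

    query_id = _query_cache.get(key)
    if query_id is None:
        # Cache miss: run the query in Athena (this is the call that bills a data scan).
        query_id = athena.start_query_execution(
            QueryString=sql,
            QueryExecutionContext={"Database": database},
            ResultConfiguration={"OutputLocation": output_location},
        )["QueryExecutionId"]
        while True:
            status = athena.get_query_execution(QueryExecutionId=query_id)
            state = status["QueryExecution"]["Status"]["State"]
            if state == "SUCCEEDED":
                _query_cache[key] = query_id
                break
            if state in ("FAILED", "CANCELLED"):
                raise RuntimeError(f"Athena query {query_id} ended in state {state}")
            time.sleep(0.5)

    # Cache hit (or freshly cached): read the stored result set; no new scan is billed.
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    return [[col.get("VarCharValue") for col in row["Data"]] for row in rows[1:]]
```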

Before we end this article, a quick note on the cost incurred with Athena: Athena charges based on the volume of data scanned by each query.

So let’s go with a very scary query scan. If every query scans 3 TB of data, your system runs around 300 such requests a day, and you have no caching or partitioning in place, you are looking at a cost of 136,875.00 USD a month, and that is for a very blunt and dumb system. The same figure with a stand-alone database, a replication mechanism, and the hosting cost of holding 3 TB of data is staggering! With proper caching at the microservice level and partitioning of the S3 data, you bring the cost down significantly.
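The arithmetic behind that figure, assuming Athena’s on-demand rate of 5 USD per TB scanned (the list price at the time of writing; check your region’s pricing):

```python
# Back-of-the-envelope Athena cost for the worst-case scenario above.
tb_scanned_per_query = 3
price_per_tb = 5.0            # USD per TB scanned (assumed Athena on-demand rate)
queries_per_day = 300
days_per_month = 365 / 12     # ~30.42

monthly_cost = tb_scanned_per_query * price_per_tb * queries_per_day * days_per_month
print(f"{monthly_cost:,.2f} USD per month")   # -> 136,875.00 USD per month
```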

Before I end this entry: this blog does not endorse AWS as your infrastructure solution. The above approach of serving insights from big data using microservices can be implemented on any cloud that provides similar services; we simply happened to be on AWS when we were finding a solution to our insights rendering and throughput problem. GCP has Bigtable and BigQuery, on which the above system could be designed, probably in an even more cost-effective manner.

In case you are looking for some sample code to help you test out the Athena service using the AWS SDK, refer to https://docs.aws.amazon.com/athena/latest/ug/code-samples.html.
