Sunday, September 1, 2019

Querying Athena tables without the limits of Athena

When AWS introduced Athena, we started moving more and more of our business use to it. Athena meant that we no longer had to raise an EMR to query our Parquet files and could easily play with writing queries anytime, and even save them. We even introduced CloudWatch scheduled jobs to run queries and email out reports.

However, when we started building a pipeline for processing data, we quickly hit the ceiling of maximum concurrent queries allowed by Athena, which is defined on an account basis! You can see the limits here. Notice how low they are! We soon realized that Athena may be problematic for automated pipelines with concurrent processes.

The good news is that once you have defined tables in Athena, these tables are automatically in the Glue Data Catalog of your AWS environment. This means that we have Hive tables that are globally accessible from any EMR we raise. This is much more efficient than having Spark read a path from S3 where our files are stored, since when you have a large number of files, Spark actually needs to scan the header information of each file before you can get to work. Instead, we can create the EMR with one extra parameter and then create our Spark session with Hive enabled. Once we do this, we can access all our Athena tables directly from Spark code.

The first necessary change is to add to the emr creation script the following:

--configurations '[
  {"Classification":"hive-site",
   "Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}},
  {"Classification":"spark-hive-site",
   "Properties":{"hive.metastore.client.factory.class":"com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"}}
]'

The next step is to change your code when you create your Spark session:

spark = SparkSession.builder.appName("SimpleApp").enableHiveSupport().getOrCreate()
After this you can select which Athena database you want to use:

   spark.sql("use dev")
   spark.sql("show tables").show()
Then you can easily query your tables:
   spark.sql("SELECT * FROM myTable LIMIT 10").show()
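Putting the pieces together, a complete job might look like the following sketch. The database name dev and table name myTable come from the examples above; it is meant to be submitted with spark-submit on an EMR created with the configuration shown earlier:

```python
def main():
    # Imported inside the function so this file can be imported off-cluster.
    from pyspark.sql import SparkSession

    # enableHiveSupport() is what lets Spark see the Glue/Athena catalog.
    spark = (SparkSession.builder
             .appName("SimpleApp")
             .enableHiveSupport()
             .getOrCreate())

    spark.sql("use dev")             # pick the Athena database
    spark.sql("show tables").show()  # sanity check: list its tables

    # Query an Athena table exactly as you would in the Athena console,
    # but without Athena's concurrency limits.
    spark.sql("SELECT * FROM myTable LIMIT 10").show()

    spark.stop()

# Call main() from your spark-submit entry point.
```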

Monday, March 4, 2019

Using CloudWatch Logs Insights to monitor your API Gateway

Recently, AWS released a new feature called CloudWatch Logs Insights. This feature allows us to easily write queries on CloudWatch Logs and create dashboards out of them. We leveraged this feature to easily monitor our API Gateway access logs and see a breakdown of all 4xx and 5xx statuses returned by our APIs. In this post, I will outline the steps necessary to make this happen.

The first thing to do is to enable logging in your API Gateway. If you are using AWS SAM CloudFormation, you will not be able to automate this step at this time. Instead, using the Console, go to your API Gateway page and select Stages on the left. Then select the stage of your deployment you wish to have logs for. Then select the Logs/Tracing tab and enable access logs in JSON format by clicking on JSON.

You will also need to specify the ARN of your log group. If you don't know it then you can easily find it by selecting logs in CloudWatch. Then edit the columns you see to include the ARN.
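If you prefer to look the ARN up programmatically rather than editing columns in the console, the CloudWatch Logs API exposes it. A hedged sketch using boto3; the log group name in the usage comment is a placeholder:

```python
def log_group_arn(name_prefix):
    """Sketch: return the ARN of the first log group matching a name prefix."""
    import boto3  # imported lazily; actually running this requires AWS credentials
    logs = boto3.client("logs")
    groups = logs.describe_log_groups(logGroupNamePrefix=name_prefix)["logGroups"]
    return groups[0]["arn"] if groups else None

# Example (placeholder name):
# arn = log_group_arn("my-api-access-logs")
```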

Once you have specified this, you will begin to have access logs sent to the ARN you have set up.
Since you have chosen JSON format, CloudWatch Logs Insights will automatically be able to identify the fields and help you write queries. From the CloudWatch Logs console you can click on Explore and then start to play. Here are a few queries we used, to get you started, which helped us break down our 5xx errors:

filter status >= 500 and status <= 599
| stats count(*) as count by resourcePath as url, status, bin(5m)
This simple query looks only at requests whose status is 5xx and then displays the URL, status, time in 5-minute intervals, and the number of such requests that appeared during each interval.
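The same query can also be run programmatically through the Logs Insights API, which is handy for ad-hoc scripts. A minimal sketch assuming boto3; the log group name you pass in is up to you:

```python
# The 5xx breakdown query from above, as a Python constant.
QUERY = (
    "filter status >= 500 and status <= 599 "
    "| stats count(*) as count by resourcePath as url, status, bin(5m)"
)

def run_insights_query(log_group, query=QUERY, minutes=60):
    """Sketch: start a Logs Insights query and poll until it finishes."""
    import time
    import boto3  # imported lazily; actually running this requires AWS credentials
    logs = boto3.client("logs")
    now = int(time.time())
    query_id = logs.start_query(
        logGroupName=log_group,
        startTime=now - minutes * 60,
        endTime=now,
        queryString=query,
    )["queryId"]
    while True:
        result = logs.get_query_results(queryId=query_id)
        if result["status"] in ("Complete", "Failed", "Cancelled"):
            return result
        time.sleep(1)
```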

Once you have this, you can click on the "Add to Dashboard" button to get a dashboard that tracks this data.

And if you just want the dashboard without playing, here is a CloudFormation template you can use: