
Written by Remastr, experienced developers creating Django & React applications.


AWS Athena Analytics Tool

Daniel Lacko 2021-07-12 10 minutes

Whenever you are generating a lot of data, a natural question comes up: how do you process it? Do you need to set up whole infrastructure and create a new full-fledged database? Perhaps you do not need to work with the data actively; you just need to access it occasionally and pull out the necessary information. AWS Athena provides a solution for the latter case.


AWS Athena - Analytics tool

As the headline suggests, AWS Athena is part of the AWS analytics tool set. It can help you create ad hoc databases or tables and run queries on them. All it needs is data in the form of CSV, JSON, or a columnar format such as Apache Parquet or ORC.

Athena Use-Case

Query services like Amazon Athena make it easy to run interactive queries against data directly in Amazon S3 without worrying about formatting data or managing infrastructure. For example, Athena is great if you just need to run a quick query on logs to help you debug an application. You just define a table for your data and start querying using standard SQL.
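As a rough sketch of that workflow (the bucket, table, and column names below are hypothetical placeholders, and the JSON SerDe is just one of several supported formats), defining a table over JSON logs in S3 and querying it could look like this:

```sql
-- Hypothetical example: map an external table onto JSON log files in S3.
-- Bucket path, table name, and columns are placeholders.
CREATE EXTERNAL TABLE app_logs (
  ts      STRING,
  level   STRING,
  message STRING
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://my-app-logs-bucket/logs/';

-- Then query it with standard SQL:
SELECT ts, message
FROM app_logs
WHERE level = 'ERROR'
LIMIT 100;
```

No servers to provision: Athena reads the files in the S3 location directly each time the query runs.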

It is not advisable to use Athena for production-grade, business-critical, or enterprise-scale workloads. For those cases, there are dedicated services such as Amazon Redshift or EMR.

Getting data to AWS Athena

Adding data source

In short, AWS Glue can extract, transform, and load data. It can be used to create a crawler that infers database and table schemas from your data in Amazon S3. The inferred schema is then stored in the AWS Glue Data Catalog. Athena can access this catalog, load the databases and tables, and run queries on them. AWS Glue is already integrated into AWS Athena. A tutorial on how to add a data source to Athena can be found here.
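As a rough sketch (the crawler name, IAM role, database name, and bucket path are all hypothetical placeholders), creating and starting such a crawler from the AWS CLI might look like this:

```shell
# Hypothetical example: create a Glue crawler over an S3 path and run it.
# The IAM role must allow Glue to read the bucket; all names are placeholders.
aws glue create-crawler \
  --name tutorial-logs-crawler \
  --role AWSGlueServiceRole-tutorial \
  --database-name tutorial_db \
  --targets '{"S3Targets": [{"Path": "s3://my-tutorial-bucket/logs/"}]}'

# Start the crawler; the inferred tables land in the Glue Data Catalog,
# and from there they are visible in Athena.
aws glue start-crawler --name tutorial-logs-crawler
```

The console flow linked above achieves the same result without the CLI.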

AWS service integration with Athena

Thanks to these integrations, Athena can fetch data from other AWS services. For example, if you have AWS CloudTrail running, you can create a table from CloudTrail -> Event History with just one click. Afterward, you can run queries on CloudTrail logs in the Athena query editor. Some other useful Athena integrations are:

  • AWS CloudFormation
  • Amazon CloudFront
  • Identity and Access Management (IAM)
  • Amazon S3 Inventory

You can read the full list of integrations here.

Using Athena query editor

(Screenshot: the Athena query editor)
The interface is quite self-explanatory. On the left side, you can select the data source and database and see which tables are present. On the right side, there is a text editor for SQL queries and buttons to run or save them. After you run a query, the results show up under the text editor. Some tips:

  • Save your query using "Save as". After that, any change will be saved automatically.
  • In the History tab (at the top), you can download query results in CSV file format.
  • Once you point Athena at your data, query results stay current. Say you are logging data to an S3 bucket and you create an Athena table over it. Those are logs we are talking about, so they keep coming in all the time, right? Intuition might suggest that Athena only shows the data that existed at the date and time you set the table up, but that is wrong: each query reads the bucket directly, so the results always reflect the latest data.

Example query: CloudTrail user activity

CloudTrail is an AWS service responsible for logging user activity. The logs are stored in an S3 bucket, and you can view them via CloudTrail -> Event History. It has one downside, though: Event History shows only the last 90 days of logs. One way of viewing all logs is using AWS Athena:

  1. Go to CloudTrail -> Event History.
  2. Click on "Create Athena table" to create a DB table out of the logs.
  3. Select a bucket where CloudTrail logs are stored.
  4. Click on "Create table".
  5. Go to AWS Athena.
  6. Select the table with CloudTrail logs.

Provided that you arrived at step 6 above, we can start writing a query. Let's see which services the user intern_user has accessed:

SELECT DISTINCT eventsource
FROM cloudtrail_logs_aws_dl_tutorial_bucket
WHERE useridentity.username LIKE '%intern_user%';

We can now run the query and see results under the query editor:
(Screenshot: example query results)

Athena pricing

You pay per query; more precisely, you are charged based on the amount of data scanned by each query. Pricing may differ per region. Examples:

  • N. Virginia = $5.00 per TB of data scanned
  • Cape Town = $6.00 per TB of data scanned
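To make the pricing model concrete, here is a small Python sketch that estimates query cost from the amount of data scanned. The per-TB rates are the examples above; actual rates vary by region and over time, so treat them as illustrative assumptions.

```python
# Estimate Athena query cost from bytes scanned.
# Rates per TB are the article's examples; real pricing varies by region.
PRICE_PER_TB = {
    "us-east-1": 5.00,   # N. Virginia
    "af-south-1": 6.00,  # Cape Town
}

TB = 1024 ** 4  # bytes per terabyte (binary)

def query_cost(bytes_scanned: int, region: str = "us-east-1") -> float:
    """Return the estimated cost in USD for a single query."""
    return bytes_scanned / TB * PRICE_PER_TB[region]

# A query scanning 250 GB in N. Virginia:
cost = query_cost(250 * 1024 ** 3)
print(f"${cost:.4f}")  # prints: $1.2207
```

The takeaway: cost scales linearly with bytes scanned, which is why the savings tips below all focus on shrinking what Athena has to read.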

Savings

It's all about the amount of data Athena has to read, so if you manage to reduce the size of your files, you can make some savings. If you compress files using GZIP, you might see up to 3:1 compression gains.

It can also be beneficial to store the data in more efficient formats. Columnar formats like Apache Parquet allow more efficient querying: Athena can read only the columns specified by a query, reducing the amount of data scanned.
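A back-of-the-envelope Python sketch of how these two techniques reduce the bytes scanned, and therefore the cost. The dataset size, compression ratio, and column counts are illustrative assumptions, not measurements, and the Parquet estimate deliberately ignores Parquet's own compression:

```python
# Illustrative estimate of scanned-data savings; all numbers are assumptions.
PRICE_PER_TB = 5.00          # example N. Virginia rate, USD per TB scanned
TB = 1024 ** 4

raw_bytes = 3 * TB           # uncompressed CSV dataset
gzip_ratio = 3               # ~3:1 GZIP compression, as mentioned above
columns_read = 5             # columns the query actually touches
total_columns = 20           # columns in the table

def cost(bytes_scanned: int) -> float:
    return bytes_scanned / TB * PRICE_PER_TB

csv_cost = cost(raw_bytes)                          # scan everything
gzip_cost = cost(raw_bytes // gzip_ratio)           # scan compressed files
# Parquet lets Athena skip the unused columns entirely:
parquet_cost = cost(raw_bytes * columns_read // total_columns)

print(csv_cost, gzip_cost, parquet_cost)  # prints: 15.0 5.0 3.75
```

Under these assumptions, either technique cuts the bill by a factor of three or more, and they can be combined.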

2021 Created by Remastr. All rights reserved.