
Athena In A Nutshell

Background:
We talked previously about BI (business intelligence) and data warehousing. One thing we haven't touched on is the AWS EMR (Elastic MapReduce) service, which provides managed Hadoop clusters. A Hadoop cluster is a special type of computational cluster designed specifically for storing and analysing huge amounts of unstructured data in a distributed computing environment. You can also run other popular distributed frameworks such as Apache Spark, HBase, Presto, and Flink in EMR, and interact with data in other AWS data stores such as Amazon S3 and Amazon DynamoDB.

It takes some effort to manage EMR or Hadoop clusters. What if we could store our data directly on S3 (the AWS object storage service) and analyse it in place?

This is exactly what Athena does: it is a data analysis service managed by AWS for analysing data stored on S3. Please note that this is not a database service; it can only query data stored directly on S3.


Data Queries

Athena is an interactive data query service that uses standard SQL as its query language. Athena has a serverless architecture and is fully managed by AWS.


Performance & Availability 

Athena draws on a warm pool of compute capacity that requires zero spin-up time, so queries can be run immediately. Parallelised queries increase performance, and compressing data reduces scanning costs. DDL queries and failed queries are not charged for.
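Since Athena bills by data scanned, compression and columnar formats directly reduce cost. A quick sketch of the arithmetic, assuming the commonly published on-demand rate of $5 per TB scanned (verify against current AWS pricing) and made-up compression ratios:

```python
# Assumed on-demand rate in USD per TB scanned -- check current Athena pricing.
PRICE_PER_TB_SCANNED = 5.00

def query_cost(bytes_scanned: int) -> float:
    """Cost of one Athena query, given the bytes actually scanned."""
    terabytes = bytes_scanned / 10**12
    return terabytes * PRICE_PER_TB_SCANNED

raw_csv = 10**12                 # 1 TB of raw CSV
compressed = raw_csv // 4        # ~4:1 gzip compression (assumed ratio)
columnar = compressed // 5       # columnar format scans only needed columns (assumed)

print(f"raw CSV:    ${query_cost(raw_csv):.2f}")
print(f"compressed: ${query_cost(compressed):.2f}")
print(f"columnar:   ${query_cost(columnar):.2f}")
```

The absolute numbers are illustrative; the point is that each step multiplies down the bytes scanned, and the bill scales linearly with them.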

All Athena resource pools are distributed across availability zones, making the service highly available, and the source data on S3 is highly available as well. Athena only reads from S3; the data on S3 is never modified by Athena.


Data

Athena data queries run against tables created within the service; the Data Definition Language (DDL) is based on Hive.

All tables in Athena are defined as external tables, hence deleting a table does not delete the data stored on S3.
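As a minimal sketch, a Hive-style external-table definition over CSV data might look like the following (the table name, columns and bucket path are all hypothetical):

```python
# Hypothetical Athena DDL: an external table over CSV files in S3.
# Dropping this table would leave the files under LOCATION untouched.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs (
    request_time  string,
    client_ip     string,
    status_code   int
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/weblogs/';
"""

print(ddl)
```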

The following types of data can be queried in RAW format:

Text, CSV, JSON, weblogs, AWS service logs, TSV files and compressed files.

For best performance, data can be converted to columnar formats such as ORC or Parquet.
https://orc.apache.org/
https://parquet.apache.org/

Avro is currently supported in the US East (Ohio) region.

https://avro.apache.org/


An ETL process is not required for the above types of data.

ETL:
Extract, transform, load (ETL) is the general procedure of copying data from one or more sources into a destination system which represents the data differently from the source(s).
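A toy illustration of the ETL pattern, with nothing Athena-specific about it (the field names are invented); the point of Athena is that, for the supported formats, this whole step can be skipped:

```python
import csv
import io
import json

def extract(csv_text: str) -> list:
    """Extract: read rows from a CSV source."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows: list) -> list:
    """Transform: reshape records for the destination system."""
    return [{"user": r["name"].lower(), "spend": float(r["amount"])} for r in rows]

def load(records: list) -> str:
    """Load: serialise into the destination format (JSON here)."""
    return json.dumps(records)

source = "name,amount\nAlice,10.50\nBob,3.00\n"
print(load(transform(extract(source))))
```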

Athena streams data directly from S3 for querying.

The query execution engine is Presto, which uses ANSI SQL.
https://prestosql.io/
Complex joins, nested queries and window functions are supported, as are complex data types such as arrays and structs. Data can be partitioned by any key.
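A sketch of the kind of ANSI SQL this enables, combining a window function with a filter on a partition key (the table, columns and partition key are hypothetical):

```python
# Hypothetical Athena/Presto query: rank requests per client with a
# window function, reading only one day's partition.
query = """
SELECT client_ip,
       status_code,
       COUNT(*) OVER (PARTITION BY client_ip) AS requests_per_client
FROM weblogs
WHERE day = '2019-01-01'
ORDER BY requests_per_client DESC;
"""

print(query)
```

Filtering on the partition key (`day` here) matters for cost as well as speed, since Athena then scans only the matching S3 prefixes.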

Output:
Results are streamed to the console or to S3, and query results can be downloaded to the desktop as CSV. Athena also has a JDBC driver, which enables external SQL clients to connect to Athena for querying. The JDBC driver also supports programmatic access via AWS IAM access keys.
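Besides the JDBC driver, queries can be submitted programmatically through the AWS SDKs. A sketch using the parameter shape of the Athena `StartQueryExecution` API (database, query and bucket names are hypothetical); the boto3 call itself is left commented out since it needs configured credentials:

```python
def build_query_request(sql: str, database: str, output_s3: str) -> dict:
    """Build the parameter dict for Athena's StartQueryExecution API call."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_s3},
    }

params = build_query_request(
    "SELECT status_code, COUNT(*) AS hits FROM weblogs GROUP BY status_code",
    database="my_example_db",
    output_s3="s3://my-example-bucket/athena-results/",
)

# With AWS credentials configured, this would be submitted as:
#   import boto3
#   athena = boto3.client("athena")
#   response = athena.start_query_execution(**params)
#   query_id = response["QueryExecutionId"]
print(params["QueryString"])
```

The `OutputLocation` is where Athena writes the result CSV on S3, regardless of how the query was submitted.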

Athena can be integrated with Amazon QuickSight, an AWS visualisation tool, where data is represented visually through dashboards.

The Athena catalogue consists of tables; table definitions, column names and partitions are defined via metadata. When a query runs, data is streamed from S3 into these tables for manipulation. The metadata is Hive metastore compatible.

PySpark code can be used to convert data into Parquet or ORC via EMR.

Code available on github:

https://github.com/aws-samples/aws-big-data-blog/tree/master/aws-blog-spark-parquet-conversion
