Athena is an AWS service that can scan data stored in S3 and return results based on queries, using standard SQL statements that you would use for a traditional database. It’s particularly useful for extracting data out of CSV and large text files.
How Much Does Athena Cost?
Athena only has one charge: GB of data processed. The going rate is $5 per TB, or about $0.005 per GB. That’s already quite cheap unless you’re processing gigantic files (which Athena certainly can), but there are a few ways to reduce this cost even more.
First, Athena supports
gzip compression, which you should take advantage of. Text files compress surprisingly well, and it’s not uncommon to see 3:1 or even 4:1 compression gains. If Athena has to scan the entire file, you’ll get a 70-80% cost savings if it’s gzipped.
Athena also supports a columnar format with Apache Parquet. If you use Parquet over CSV, Athena can limit the search to specific columns, saving you a lot of of money in the process.
Really, Athena is fairly cheap, but the cost is there to entice people into using Parquet and gzipping. If you want to learn how to automatically convert your CSV files to the Parquet format, you can do so with AWS Glue.
How to Get Started
Head over to the Athena Console, and choose “Get Started.” You’ll be brought to the query editor, where there is already a sample database and table created for you. You can play around with this if you’d like, but once you’re done, go ahead and create a new database:
CREATE DATABASE PRIMARY
Select the database in the sidebar once it’s created. Click “Create Table,” and select “from S3 Bucket Data”:
Upload your data to S3, and select “Copy Path” to get a link to it. You’ll want to create a new folder to store the file in, even if you only have one file, since Athena expects it to be under at least one folder.
Back in the Athena console, enter in the path to the folder containing your data (not a direct file path), and give your table a name. Make sure it’s importing into the same database you created before.
Next, select your data type. Athena supports a number of options, such as standard web server logs, CSV and TSV, JSON, and custom delimiters.
Next, you’ll tell Athena how your data is structured, adding names for each column. You can add columns manually, or in bulk (the easier option).
To add columns in bulk, you’ll need to generate key-value pairs with the name of the column and the type of data, like so:
If your CSV file has column headers, you can pipe the file to
head to grab the header, and to
sed to add a value. The following script sets each column to the “String” type:
cat search.csv | head -n1 | sed 's/\([^,]*\)/\1 string/g'
You can change it to the correct type in the Athena console, but it needs to be formatted like this for Athena to accept it at all.
Once you’re done configuring columns, create the table, and you’ll be brought back to the query editor and shown the query used to create the table. You’ll see the newly created table in the sidebar as well.
If everything worked correctly, you should now be able to access your table data from Athena.
Athena uses standard SQL syntax, so you can query this virtual database and table exactly like you would query any other database:
SELECT * FROM mtg WHERE cmc > 4
You can add multiple tables that pull data from different S3 folders, and query them in a similar fashion.