Columnar formats, such as Apache Parquet, offer great compression savings and are much easier to scan, process, and analyze than other formats such as CSV. In this article, we show you how to convert your CSV data to Parquet using AWS Glue.
What Is A Columnar Format?
CSV files, log files, and any other character-delimited files all effectively store data in columns. Each row has a certain number of fields separated by a delimiter, such as a comma or a space. But under the hood, these formats are still just lines of strings. There’s no easy way to scan just a single column of a CSV file.
This can be a problem with services like AWS Athena, which can run SQL queries on data stored in CSV and other delimited files. Even if you’re only querying a single column, Athena has to scan the entire file’s contents. Since Athena charges solely by the amount of data it scans, processing unnecessary data runs up your bill for nothing.
The solution is a true columnar format. Columnar formats store data in columns, much like a traditional relational database. The columns are stored together, and the data within each column is much more homogeneous, which makes it easier to compress. They’re not exactly human readable, but the applications processing them understand them just fine. In fact, because there’s less data to scan, they’re much easier to process.
Because Athena only has to scan a single column to select by that column, it drastically cuts down on costs, especially for larger datasets. If each file has 10 similarly sized columns and you only scan one, that’s roughly a 90% cost savings just from switching to Parquet.
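To make the difference concrete, here’s a small pure-Python sketch (the data and variable names are ours, not from any AWS tooling) contrasting the two layouts: pulling one column out of row-oriented CSV text still touches every byte of the file, while a columnar layout lets you read only the column you asked for.

```python
import csv
import io

# A tiny row-oriented "file": every row interleaves all three columns.
csv_text = "id,name,score\n1,alice,90\n2,bob,85\n3,carol,88\n"

# Pulling just the "score" column from CSV still scans the whole file.
rows = list(csv.DictReader(io.StringIO(csv_text)))
scores_from_csv = [row["score"] for row in rows]
bytes_scanned_csv = len(csv_text)  # every byte was read

# A columnar layout stores each column contiguously, so a query for one
# column only has to read that column's data.
columnar = {
    "id": ["1", "2", "3"],
    "name": ["alice", "bob", "carol"],
    "score": ["90", "85", "88"],
}
scores_from_columns = columnar["score"]
bytes_scanned_columnar = sum(len(v) for v in columnar["score"])

print(scores_from_csv == scores_from_columns)  # True: same answer
print(bytes_scanned_csv, bytes_scanned_columnar)
```

Real Parquet files add compression and per-column statistics on top of this layout, but the scanning argument is the same: less data touched per query.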
Convert Automatically Using AWS Glue
AWS Glue is a managed ETL service from Amazon that can convert datasets between formats. It’s primarily used as part of a pipeline to process data stored in delimited and other formats, and to load it into tables for querying with Athena. While it can be set up to run automatically, you can also run it manually, and with a bit of tweaking it can be used to convert CSV files to the Parquet format.
Head over to the AWS Glue Console and select “Get Started”. From the sidebar, click on “Add Crawler” and create a new crawler. The crawler will scan your data in S3 buckets and catalog its schema in a database for use in the conversion.
Give your crawler a name, and choose to import data from a data store. Select S3 (though DynamoDB is another option), and enter the path to a folder containing your files. If you just have one file you want to convert, put it in its own folder.
Next, you’re asked to create an IAM role for your crawler to operate as. Create the role, then choose it from the list. You may have to hit the refresh button next to it for it to appear.
Choose a database for the crawler to output to; if you’ve used Athena before, you can use your custom database, but if not the default one should work fine.
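The same crawler configuration can be built from code with the AWS SDK. This sketch only assembles the keyword arguments for boto3’s `glue.create_crawler` call; the crawler name, role ARN, and bucket path are placeholders for your own values.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for boto3's glue.create_crawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }

request = build_crawler_request(
    "csv-to-parquet-crawler",                          # hypothetical name
    "arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    "default",
    "s3://my-bucket/csv-input/",  # folder containing your CSV files
)
# boto3.client("glue").create_crawler(**request) would create it for real.
print(sorted(request))
```

The console walks you through the same fields one screen at a time; the SDK version is handy once you want the setup to be repeatable.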
If you want to automate the process, you can give your crawler a schedule so that it runs on a regular basis. If not, choose manual mode and execute it yourself from the console.
Once it’s created, go ahead and run the crawler to import the data into the database you chose. If everything worked, you should see your file imported with the proper schema. The data types for each column are assigned automatically based on the source input.
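If you’d rather confirm the crawl from code than from the console, the Glue API’s `get_table` call returns the inferred schema. A minimal sketch, assuming a trimmed response shaped like the real API (the table and column names here are made up):

```python
def columns_from_table(response):
    """Pull (name, type) pairs out of a glue.get_table response."""
    cols = response["Table"]["StorageDescriptor"]["Columns"]
    return [(c["Name"], c["Type"]) for c in cols]

# Trimmed-down shape of what boto3's glue.get_table returns; in real use
# you'd call boto3.client("glue").get_table(DatabaseName=..., Name=...).
sample = {
    "Table": {
        "Name": "csv_input",  # hypothetical table created by the crawler
        "StorageDescriptor": {
            "Columns": [
                {"Name": "id", "Type": "bigint"},
                {"Name": "name", "Type": "string"},
            ]
        },
    }
}
print(columns_from_table(sample))  # [('id', 'bigint'), ('name', 'string')]
```

If a column came through with the wrong type, you can correct it in the table’s schema before running the conversion job.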
Once your data is in the AWS system, you can convert it. From the Glue Console, switch over to the “Jobs” tab and create a new job. Give it a name, add your IAM role, and select “A Proposed Script Generated By AWS Glue” as the script the job runs.
Select your table on the next screen, then choose “Change Schema” to specify that this job runs a conversion.
Next, you have to select “Create Tables In Your Data Target”, specify Parquet as the format, and enter a new target path. Make sure this is an empty location without any other files.
Next, you can edit the schema of your file. This defaults to a one-to-one mapping of CSV columns to Parquet columns, which is likely what you want, but you can modify it if you need to.
Create the job, and you’ll be brought to a page that enables you to edit the Python script it runs. The default script should work fine, so hit “Save” and exit back to the jobs tab.
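The generated script is longer than it needs to be for this task, but it boils down to a read-from-catalog, write-as-Parquet pair. This stripped-down sketch shows the core of it as a function that takes an awsglue `GlueContext` (inside a real job, Glue constructs that context for you); the database and table names are whatever your crawler produced.

```python
def csv_to_parquet(glue_context, database, table_name, output_path):
    """Read a catalog table and write it back out to S3 as Parquet.

    glue_context is expected to be an awsglue GlueContext; database and
    table_name come from the crawler run, and output_path must be the
    empty S3 location you chose as the job's data target.
    """
    frame = glue_context.create_dynamic_frame.from_catalog(
        database=database, table_name=table_name
    )
    glue_context.write_dynamic_frame.from_options(
        frame=frame,
        connection_type="s3",
        connection_options={"path": output_path},
        format="parquet",
    )
```

The auto-generated script also maps and renames columns via an `ApplyMapping` transform between the read and the write; if you left the schema mapping one-to-one, that step just passes the data through.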
In our testing, the script always failed unless the IAM role was given specific permission to write to the location we specified the output to go to. You may have to manually edit the permissions from the IAM Management Console if you run into the same issue.
Otherwise, click “Run” and your script should start up. It may take a minute or two to process, but you should see the status in the info panel. When it’s done, you’ll see a new file created in S3.
This job can be configured to run off of triggers set by the crawler that imports the data, so the whole process can be automated from start to finish. If you’re importing server logs to S3 this way, this can be an easy method to convert them to a more usable format.
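One way to wire that up is a conditional Glue trigger: when the crawl succeeds, the conversion job starts. As before, this sketch only builds the arguments for boto3’s `glue.create_trigger` call, and the trigger, crawler, and job names are placeholders.

```python
def build_trigger_request(trigger_name, crawler_name, job_name):
    """Build the keyword arguments for boto3's glue.create_trigger call:
    a conditional trigger that starts job_name when crawler_name succeeds."""
    return {
        "Name": trigger_name,
        "Type": "CONDITIONAL",
        "StartOnCreation": True,
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "CrawlerName": crawler_name,
                    "CrawlState": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
    }

request = build_trigger_request(
    "run-conversion-after-crawl",  # hypothetical trigger name
    "csv-to-parquet-crawler",      # the crawler created earlier
    "csv-to-parquet-job",          # the conversion job created earlier
)
# boto3.client("glue").create_trigger(**request) would create it for real.
```

With the crawler on a schedule and a trigger like this in place, new CSV files dropped into the source folder flow through to Parquet with no manual steps.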