CSV file in S3 from a Trino Query
You want to generate a single CSV in a S3 bucket from a Trino SQL query result ?
Good idea ! We show here 2 options to do this with 2 simple examples of Python AWS Lambda functions.
Tests and examples are done with Starburst Galaxy, the Trino SaaS version, but it’s the same for a hosted Trino or Starburst Enterprise cluster.
We use 2 different options to generate a CSV file:
- With a CTAS (Create table) query.
- With a Python Pandas DataFrame object.
Code and examples are hosted in a GitHub repository.
Function with a CTAS query
The lambda_CTAS.py function uses a CTAS query to generate a single CSV file in a S3 bucket.
The CTAS query is like:
CREATE TABLE s3_catalog.tmp.your_file WITH (csv_separator = ',',external_location='s3://your_bucket/tmp/your_file', format='csv') as SELECT ....
Where tmp is an existing Schema in your Trino or Galaxy S3 Catalog (Glue or Hive), here named s3_catalog.
The extra steps into the function after the CTAS query run are to:
- Add .csv suffix to the file name
- Add columns name as header (from Columns name passed as function parameters)
- Copy the file in your specific bucket folder
- Remove the temporary table created (from the CTAS)
Important, you need also to set some properties to the session in order to generate only 1 CSV file not compressed.
session_properties:
- scale_writers : true
- writer_min_size : 1TB
- task_writer_count : 1
- s3_catalog.compression_codec : NONE
Function with a Pandas DataFrame
For the lambda_pandas.py function, we just load the SQL query output in a Pandas DataFrame object, and then we generate a CSV file in a bucket folder from the DataFrame.
No specific session properties need to be set but the query result must fit in-memory.
In the GitHub repository associated to this post, you can also find:
- Layer package in order to use Trino Python library and Python pandas objects in these Lambda functions (about Lambda layer).
- Python code to test the 2 functions.
- Flask Web application (in the s3-file-application folder) using the 2 Lambda functions.
Have fun!