CSV file in S3 from a Trino Query

Victor Coustenoble
2 min read · Apr 26, 2022

You want to generate a single CSV file in an S3 bucket from a Trino SQL query result?

Good idea! Here we show two options for doing this, with two simple examples of Python AWS Lambda functions.

Tests and examples are done with Starburst Galaxy, the SaaS version of Trino, but the approach is the same for a self-hosted Trino or Starburst Enterprise cluster.

We use two different options to generate a CSV file:

  • With a CTAS (Create table) query.
  • With a Python Pandas DataFrame object.
Architecture of the interaction between Starburst Galaxy and the AWS environment (Glue, S3, Lambda).

Code and examples are hosted in a GitHub repository.

Function with a CTAS query

The lambda_CTAS.py function uses a CTAS query to generate a single CSV file in an S3 bucket.

The CTAS query looks like:

CREATE TABLE s3_catalog.tmp.your_file WITH (csv_separator = ',', external_location = 's3://your_bucket/tmp/your_file', format = 'csv') AS SELECT ...

Here tmp is an existing schema in your Trino or Galaxy S3 catalog (Glue or Hive), named s3_catalog in this example.
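The CTAS statement can be assembled from the catalog, schema, table, and bucket names. A minimal sketch (the function name and parameters are illustrative, not part of the repository's code):

```python
def build_ctas(catalog, schema, table, bucket, select_sql):
    """Assemble a CTAS statement that writes a CSV-format external
    table under s3://<bucket>/<schema>/<table>."""
    return (
        f"CREATE TABLE {catalog}.{schema}.{table} "
        "WITH (csv_separator = ',', "
        f"external_location = 's3://{bucket}/{schema}/{table}', "
        "format = 'csv') "
        f"AS {select_sql}"
    )

print(build_ctas("s3_catalog", "tmp", "your_file", "your_bucket", "SELECT 1"))
```

Building the statement this way keeps the Lambda reusable across buckets and table names.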

After the CTAS query runs, the function performs a few extra steps:

  • Add a .csv suffix to the file name
  • Add the column names as a header row (the column names are passed as function parameters)
  • Copy the file to your specific bucket folder
  • Remove the temporary table created by the CTAS
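The post-CTAS steps above can be sketched with boto3. This is a simplified illustration, not the repository's exact code: bucket, prefix, and key names are placeholders, and dropping the temporary table itself would be a separate DROP TABLE statement sent back to Trino.

```python
def csv_header(columns):
    """Build the CSV header line from the column names passed as parameters."""
    return ",".join(columns) + "\n"

def finalize_csv(bucket, tmp_prefix, dest_key, columns):
    """Post-CTAS cleanup: prepend a header, write the result to the final
    .csv key, and delete the temporary CTAS output objects."""
    import boto3  # lazy import so the pure helper above runs standalone
    s3 = boto3.client("s3")
    # The CTAS writes its data file under the temporary prefix
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=tmp_prefix)["Contents"]
    body = s3.get_object(Bucket=bucket, Key=objects[0]["Key"])["Body"].read()
    # dest_key carries the .csv suffix, e.g. "exports/your_file.csv"
    s3.put_object(Bucket=bucket, Key=dest_key,
                  Body=csv_header(columns).encode() + body)
    # Remove the temporary CTAS output; the table is dropped via Trino SQL
    for obj in objects:
        s3.delete_object(Bucket=bucket, Key=obj["Key"])
```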

Important: you also need to set some session properties in order to generate a single, uncompressed CSV file.

session_properties:

  • scale_writers : true
  • writer_min_size : 1TB
  • task_writer_count : 1
  • s3_catalog.compression_codec : NONE
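With the Trino Python client, these session properties are passed at connection time. A sketch assuming the `trino` package, where host, user, and the s3_catalog/tmp names are placeholders:

```python
# Session properties forcing a single, uncompressed output file:
# one writer task, a writer-size floor large enough to avoid splitting,
# and no compression codec on the (hypothetical) s3_catalog catalog.
SESSION_PROPERTIES = {
    "scale_writers": "true",
    "writer_min_size": "1TB",
    "task_writer_count": 1,
    "s3_catalog.compression_codec": "NONE",
}

def connect(host, user):
    """Open a Trino connection with the session properties applied."""
    import trino  # lazy import: only needed when actually connecting
    return trino.dbapi.connect(
        host=host, port=443, user=user,
        http_scheme="https",
        catalog="s3_catalog", schema="tmp",
        session_properties=SESSION_PROPERTIES,
    )
```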

Function with a Pandas DataFrame

For the lambda_pandas.py function, we simply load the SQL query output into a Pandas DataFrame, then generate a CSV file in a bucket folder from the DataFrame.

No specific session properties need to be set, but the query result must fit in memory.
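The DataFrame path can be sketched as below. The helper is deliberately pure so it runs without a cluster: `rows` stands in for what a Trino cursor's fetchall() would return, and in the Lambda the resulting CSV text would then be uploaded with boto3.

```python
import io

import pandas as pd

def query_to_csv(rows, columns):
    """Load query results (a list of row tuples) into a DataFrame and
    render them as CSV text with a header row."""
    df = pd.DataFrame(rows, columns=columns)
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)  # index=False: no extra index column
    return buffer.getvalue()

# Example with two hypothetical rows
print(query_to_csv([(1, "a"), (2, "b")], ["id", "name"]))
```

Because the whole result set is materialized in the DataFrame, this approach is the simpler of the two but only suits results that fit in the Lambda's memory.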

In the GitHub repository associated with this post, you can also find:

Have fun!
