Movatterモバイル変換


[0]ホーム

URL:


Skip to content
DEV Community
Log in Create account

DEV Community

Yen Trinh
Yen Trinh

Posted on

Glue – Athena custom output fixed number Of Files

Situation:

When I only use partition clause, there are so many files in S3 bucket which is <1MB, this affect to the query speed and I want to make those become a bigger file.

Solution:

Solution 1: Use Athena "bucketing" method to custom the number of output file.

You can see this AWS blog for more information:
How can I set the number or size of files when I run a CTAS query in Athena?

However, there is one drawback if you use bucketing: Bucketed table do not support INSERT INTO query. Here comes the solution 2.

Solution 2: Use Glue repartition

The context is the same but now I want to use INSERT INTO query.

You can refer to this AWS blog for the procedure:
Build a Data Lake Foundation with AWS Glue and Amazon S3

Note that in the step numbered "13. View the job", we add the following code into the job:

datasource_df = dropnullfields3.repartition(<number of output file you want here>)
Enter fullscreen modeExit fullscreen mode

right after the line:

dropnullfields3 = DropNullFields.apply(frame = resolvechoice2, transformation_ctx = "dropnullfields3")
Enter fullscreen modeExit fullscreen mode

and edit the code:

datasink4 = glueContext.write_dynamic_frame.from_options(frame = dropnullfields3, connection_type = "s3", connection_options = {"path": "<your_s3_path>"}, format = "parquet", transformation_ctx = "datasink4")
Enter fullscreen modeExit fullscreen mode

to:

datasink4 = glueContext.write_dynamic_frame.from_options(frame = datasource_df, connection_type = "s3", connection_options = {"path": "<your_s3_path>"}, format = "parquet", transformation_ctx = "datasink4")
Enter fullscreen modeExit fullscreen mode

If you want to know more about Glue repartition:

Try querying with Athena
Create table:

CREATE EXTERNAL TABLE IF NOT EXISTS demo_query (  dispatching_base_num string,  pickup_date string,  locationid bigint)STORED AS PARQUETLOCATION 's3://athena-examples/parquet/'tblproperties ("parquet.compress"="SNAPPY");
Enter fullscreen modeExit fullscreen mode

Try to insert:

insert into demo_query ("dispatching_base_num", "pickup_date", "locationid") values ('aa23dtgt', '2020-12-03', 1234);
Enter fullscreen modeExit fullscreen mode

The insert query now should work. Success!

Top comments(1)

Subscribe
pic
Create template

Templates let you quickly answer FAQs or store snippets for re-use.

Dismiss
CollapseExpand
 
sasuke002a profile image
Sasuke Shelby
  • Joined

Industrial underfillepoxy adhesive glue
, circuit board-level adhesives, and adhesives for electronic goods have all been created by DeepMaterial.

Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment'spermalink.

For further actions, you may consider blocking this person and/orreporting abuse

Coding is life, curious at AWS
  • Location
    Hanoi
  • Joined

More fromYen Trinh

DEV Community

We're a place where coders share, stay up-to-date and grow their careers.

Log in Create account

[8]ページ先頭

©2009-2025 Movatter.jp