
Problem

Where do you find print statements from your Glue ETL jobs? You guys, this is killing me. Why is this not the easiest thing to find?


Situation

I am trying to look at properties of my tables and do some general debugging in the console for an AWS Glue ETL job. Throughout the job I log some things and print some things. The built-in functions that print a dynamic frame's schema return None, though, so I can't easily embed them into a log string. Here is the gist of my job:

```python
import some_stuff...

# Create and join tables
customer_churn = glueContext.create_dynamic_frame.from_catalog(
    database=db_name, table_name=tbl_customer_churn
)
customer_churn = customer_churn.join(
    paths1=["customer id"], paths2=["id"], frame2=other_table
)
logger.info("Customer_churn_joined:\n")
customer_churn.printSchema()

# ---- Write out the combined file ----
s_customer_churn = customer_churn.toDF().select("customer id")
logger.info("Customer_churn_just_cust_id:\n")
s_customer_churn.printSchema()
s_customer_churn.write.option("header", "true").format("csv").mode("overwrite").save(output_dir)
logger.info("output_dir: " + output_dir)
```
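One way around printSchema() returning None is to capture what it writes to stdout and put that into the log string. This is a minimal sketch, not part of the original job: `schema_to_string` is a hypothetical helper name, and it assumes printSchema() writes to Python's stdout (true for PySpark DataFrames; worth verifying for DynamicFrames on your Glue version):

```python
import io
from contextlib import redirect_stdout

def schema_to_string(frame):
    """Capture whatever printSchema() writes to stdout and return it as a string."""
    buf = io.StringIO()
    with redirect_stdout(buf):
        frame.printSchema()  # prints to stdout and returns None
    return buf.getvalue()

# The schema can then go into a single log line:
# logger.info("Customer_churn_joined:\n" + schema_to_string(customer_churn))
```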

Other relevant info:

  • Glue version: 5
  • Type: Spark
  • Language: Python3
  • Job observability metrics: True
  • Continuous Logging: True
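For reference, the console toggles above map to Glue job parameters. A sketch of the corresponding default arguments, assuming the parameter names from the Glue documentation (verify against your Glue version):

```shell
# Continuous logging to CloudWatch (default log group /aws-glue/jobs/logs-v2)
--enable-continuous-cloudwatch-log true
# Job observability metrics
--enable-observability-metrics true
```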

What I've Tried

I looked in the continuous logging tab and I see my logger statements, but no print statements come through. I saw the Output logs going to CloudWatch, so I clicked that link, but none of those logs had my print statements either. Why is this not the easiest thing to see?

asked Mar 5 at 16:26 by d-gg

1 Answer


All logs

  • logger.info()

Look in the log stream without the suffix. There are a lot of items logged here, so you will have to search.

Output logs

  • print()

  • dataframe.printSchema(), dataframe.show()

Look in the log stream without the suffix.

The streams with a suffix come from the individual workers, and their number depends on the number of workers defined for the job.

Example

```python
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql import Row

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
logger = glueContext.get_logger()

logger.info('Hello from logger.info will be in All Logs')
print('print will show up in output log')

testDf = spark.createDataFrame(
    [Row(test_data='dataframe printSchema() and show() will be in the output log')]
)
testDf.printSchema()
testDf.show()

job.commit()
```

Open All logs and you can search using the search box.


Or do a find in the browser.

Just understand you might have to scroll to the top and click There are older events to load. Load more. before you find it.

Open Output logs and you can do the same type of search.
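If console searching gets tedious, the same streams can be searched from the CLI. A hedged sketch, assuming the default log group name for stdout logs (/aws-glue/jobs/output) and a hypothetical job run ID as the stream name prefix:

```shell
# Search the Output logs (print/printSchema/show output) for a string
aws logs filter-log-events \
  --log-group-name /aws-glue/jobs/output \
  --log-stream-name-prefix jr_0123456789abcdef \
  --filter-pattern '"root"'
```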

answered Mar 7 at 16:18 by Tim Mylott