An AWS Professional Service open source initiative | aws-proserve-opensource@amazon.com
Quick Start¶
>>> pip install awswrangler

>>> # Optional modules are installed with:
>>> pip install 'awswrangler[redshift]'
import awswrangler as wr
import pandas as pd
from datetime import datetime

df = pd.DataFrame({"id": [1, 2], "value": ["foo", "boo"]})

# Storing data on the Data Lake
wr.s3.to_parquet(
    df=df,
    path="s3://bucket/dataset/",
    dataset=True,
    database="my_db",
    table="my_table",
)

# Retrieving the data directly from Amazon S3
df = wr.s3.read_parquet("s3://bucket/dataset/", dataset=True)

# Retrieving the data from Amazon Athena
df = wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db")

# Get a Redshift connection from the Glue Catalog and retrieve data from Redshift Spectrum
con = wr.redshift.connect("my-glue-connection")
df = wr.redshift.read_sql_query("SELECT * FROM external_schema.my_table", con=con)
con.close()

# Amazon Timestream Write
df = pd.DataFrame({
    "time": [datetime.now(), datetime.now()],
    "my_dimension": ["foo", "boo"],
    "measure": [1.0, 1.1],
})
rejected_records = wr.timestream.write(
    df,
    database="sampleDB",
    table="sampleTable",
    time_col="time",
    measure_col="measure",
    dimensions_cols=["my_dimension"],
)

# Amazon Timestream Query
wr.timestream.query("""
SELECT time, measure_value::double, my_dimension
FROM "sampleDB"."sampleTable" ORDER BY time DESC LIMIT 3
""")
Read The Docs¶
- What is AWS SDK for pandas?
- Install
- At scale
- Tutorials
- 1 - Introduction
- 2 - Sessions
- 3 - Amazon S3
- 4 - Parquet Datasets
- 5 - Glue Catalog
- 6 - Amazon Athena
- 7 - Redshift, MySQL, PostgreSQL, SQL Server and Oracle
- 8 - Redshift - COPY & UNLOAD
- 9 - Redshift - Append, Overwrite and Upsert
- 10 - Parquet Crawler
- 11 - CSV Datasets
- 12 - CSV Crawler
- 13 - Merging Datasets on S3
- 14 - Schema Evolution
- 15 - EMR
- 16 - EMR & Docker
- 17 - Partition Projection
- 18 - QuickSight
- 19 - Amazon Athena Cache
- 20 - Spark Table Interoperability
- 21 - Global Configurations
- 22 - Writing Partitions Concurrently
- 23 - Flexible Partitions Filter (PUSH-DOWN)
- 24 - Athena Query Metadata
- 25 - Redshift - Loading Parquet files with Spectrum
- 26 - Amazon Timestream
- 27 - Amazon Timestream - Example 2
- 28 - Amazon DynamoDB
- 29 - S3 Select
- 30 - Data API
- 31 - OpenSearch
- 33 - Amazon Neptune
- 34 - Distributing Calls Using Ray
- 35 - Distributing Calls on Ray Remote Cluster
- 36 - Distributing Calls on Glue Interactive sessions
- 37 - Glue Data Quality
- 38 - OpenSearch Serverless
- 39 - Athena Iceberg
- 40 - EMR Serverless
- 41 - Apache Spark on Amazon Athena
- Architectural Decision Records
- 1. Record architecture decisions
- 2. Handling unsupported arguments in distributed mode
- 3. Use TypedDict to group similar parameters
- 4. AWS SDK for pandas does not alter IAM permissions
- 5. Move dependencies to optional
- 6. Deprecate wr.s3.merge_upsert_table
- 7. Design of engine and memory format
- 8. Switching between PyArrow and Pandas based datasources for CSV/JSON I/O
- 9. Engine selection and lazy initialization
- API Reference
- Amazon S3
- AWS Glue Catalog
- Amazon Athena
- Amazon Redshift
- PostgreSQL
- MySQL
- Data API Redshift
- Data API RDS
- AWS Glue Data Quality
- OpenSearch
- Amazon Neptune
- DynamoDB
- Amazon Timestream
- AWS Clean Rooms
- Amazon EMR
- Amazon EMR Serverless
- Amazon CloudWatch Logs
- Amazon QuickSight
- AWS STS
- AWS Secrets Manager
- Amazon Chime
- Typing
- Global Configurations
- Engine and Memory Format
- Distributed - Ray
- Community Resources
- Logging
- Who uses AWS SDK for pandas?
- License
- Contributing