Movatterモバイル変換


[0]ホーム

URL:


Skip to content
DEV Community
Log in Create account

DEV Community

Cover image for Ingesting Data to Postgres
totalSophie
totalSophie

Posted on • Edited on

     

Ingesting Data to Postgres

DEZoomcamp study notes

To set up PostgreSQL in Docker, run the pgcli command, and execute SQL statements, you can follow these steps:

Step 1: Install Docker

Make sure you have Docker installed on your machine. You can download Docker from the official website:Docker.

Step 2: Pull PostgreSQL Docker Image

Open a terminal and pull the official PostgreSQL Docker image:

docker pull postgres
Enter fullscreen modeExit fullscreen mode

Step 3: Run PostgreSQL Container

Run a PostgreSQL container with a specified password for the default user 'postgres':

docker run--name mypostgres-ePOSTGRES_USER="root"-ePOSTGRES_PASSWORD="password"-ePOSTGRES_DB="ny_taxi"-v$(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data-p 5432:5432-d postgres
Enter fullscreen modeExit fullscreen mode

This command starts a PostgreSQL container named 'mypostgres' with the password 'password' and exposes port 5432 on the host.
e declares the environment variables.
v declares volume path

Step 4: Install pgcli

Install pgcli, a command-line interface for PostgreSQL, on your local machine:

pipinstallpgcli
Enter fullscreen modeExit fullscreen mode

Step 5: Connect to PostgreSQL using pgcli

Connect to the PostgreSQL database using pgcli:

pgcli-h localhost-p 5432-U root-d ny_taxi-W
Enter fullscreen modeExit fullscreen mode

h declares the host variable which is localhost connection port.
u is the username.
d is the database name
-W prompts the user for the password. After entering the command

Enter the password when prompted (use 'password' if you followed the previous steps).

Step 6: Execute SQL Statements

Once connected, you can execute SQL statements directly in the pgcli interface. For example:

-- Create a new databaseCREATEDATABASEmydatabase;-- Switch to the new database\cmydatabase-- Create a tableCREATETABLEmytable(idserialPRIMARYKEY,nameVARCHAR(100),ageINT);-- Insert some dataINSERTINTOmytable(name,age)VALUES('John',25),('Jane',30);-- Query the dataSELECT*FROMmytable;
Enter fullscreen modeExit fullscreen mode

Feel free to modify the SQL statements according to your requirements.

Step 7: To Exit pgcli and Stop the PostgreSQL Container

To exit pgcli, type\q. After that, stop and remove the PostgreSQL container:

docker stop mypostgresdockerrmmypostgres
Enter fullscreen modeExit fullscreen mode

Data Ingestion from CSV to PostgreSQL using Pandas and SQLAlchemy

Step 1: Setting Up the Environment:

  • Use Pandas to read the CSV file in chunks for more efficient processing.
  • Define a PostgreSQL connection string using SQLAlchemy.

Step 2: Creating the Table Schema:

  • Read the first chunk of data to create the initial table schema in the database.
  • Utilize Pandas'to_sql method to replace or create the table in the PostgreSQL database.

Step 3: Iterative Data Insertion:

  • Iterate through the remaining chunks of the CSV file.
  • Optimize timestamp data types using Pandas'to_datetime.
  • Append each chunk to the existing PostgreSQL table.

Final Code:

fromsqlalchemyimportcreate_enginefromtimeimporttimeimportpandasaspd# specify the database you want to use based on the docker run command we had# postgresql://username:password@localhost:port/dbnamedb_url='postgresql://root:password@localhost:5432/ny_taxi'engine=create_engine(db_url)# Chunksize for reading CSV and inserting into the databasechunk_size=100000# Create an iterator for reading CSV in chunkscsv_iter=pd.read_csv('2021_Yellow_Taxi_Trip_Data.csv',iterator=True,chunksize=chunk_size)# Get the first chunk to create the table schemafirst_chunk=next(csv_iter)first_chunk.to_sql(name='yellow_taxi_data',con=engine,if_exists='replace',index=False)# Loop through the remaining chunks and append to the tableforchunkincsv_iter:t_start=time()# Fix timestamp type issuechunk['tpep_pickup_datetime']=pd.to_datetime(chunk['tpep_pickup_datetime'])chunk['tpep_dropoff_datetime']=pd.to_datetime(chunk['tpep_dropoff_datetime'])# Append data to the existing tablechunk.to_sql(name='yellow_taxi_data',con=engine,if_exists='append',index=False)# Print a message and benchmark the timet_end=time()print(f'Inserted another chunk... took{t_end-t_start:.3f} second(s)')
Enter fullscreen modeExit fullscreen mode

Extra, Extra!!!

Animated GIF

Usingargparse to Parse Command Line Arguments

Utilizing theargparse standard library to efficiently parse command line arguments, this script downloads a CSV file from a specified URL and ingests its data into a PostgreSQL database.

fromtimeimporttimefromsqlalchemyimportcreate_engineimportpandasaspdimportargparseimportosdefmain(params):user=params.userpassword=params.passwordhost=params.hostport=params.portdb=params.dbtable_name=params.table_nameurl=params.urlcsv_name='output.csv'# Download the CSV using the os system function to execute command line arguments from Pythonos.system(f"wget{url} -O{csv_name}")engine=create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')df_iter=pd.read_csv(csv_name,iterator=True,chunksize=100000)df=next(df_iter)df.tpep_pickup_datetime=pd.to_datetime(df.tpep_pickup_datetime)df.tpep_dropoff_datetime=pd.to_datetime(df.tpep_dropoff_datetime)# Adding the column namesdf.head(n=0).to_sql(name=table_name,con=engine,if_exists="replace")# Adding the first batch of rowsdf.to_sql(name=table_name,con=engine,if_exists="append")whileTrue:t_start=time()df=next(df_iter)df.tpep_pickup_datetime=pd.to_datetime(df.tpep_pickup_datetime)df.tpep_dropoff_datetime=pd.to_datetime(df.tpep_dropoff_datetime)df.to_sql(name=table_name,con=engine,if_exists="append")t_end=time()print('Inserted another chunk... took %.3f second(s)'%(t_end-t_start))if__name__=='__main__':parser=argparse.ArgumentParser(description="Ingest CSV data to Postgres")parser.add_argument('--user',help="user name for postgres")parser.add_argument('--password',help="password for postgres")parser.add_argument('--host',help="host for postgres")parser.add_argument('--port',help="port for postgres")parser.add_argument('--db',help="database name for postgres")parser.add_argument('--table_name',help="name of the table where we will write the results to")parser.add_argument('--url',help="url of the CSV")args=parser.parse_args()# Dockerizing Ingestion ScriptIntheprovidedDockerfile:>**Dockerfile**>```{%endraw%}docker>FROMpython:3.9.1>>RUNapt-getinstallwget>RUNpipinstallpandassqlalchemypsycopg2>>WORKDIR/app>COPYingest_data.pyingest_data.py>>ENTRYPOINT["python","ingest_data.py"]>{%raw%}
Enter fullscreen modeExit fullscreen mode

Thepsychopg2 package is included to facilitate access to the PostgreSQL database from Python, serving as a valuable "database wrapper."

To build the Docker image, execute the following command:

bashdocker build -t taxi_ingest:v001 .
Enter fullscreen modeExit fullscreen mode

Now run the image instead of the script with the network argument and changing the database host...

You can serve the local file over HTTP on your machine and access it through your IP address by running this in its location
python3 -m http.server

bash# If your file is localURL="http://192.x.x.x:8000/2021_Yellow_Taxi_Trip_Data.csv"docker run -it \  --network=pg-network \  taxi_ingest:v001 \  --user=root \  --password=password \  --host=pg-database \  --port=5432 \  --db=ny_taxi \  --table_name=yellow_taxi_trips \  --url="${URL}"
Enter fullscreen modeExit fullscreen mode

Not yet...

Connecting pgAdmin and Postgres

pgCLI allows for quickly looking into data. But the more convenient way to work with a postgres database is to use the pgAdmin tool which is a web based GUI tool.

To install pgAdmin in a Docker container, you can follow these steps:

  1. Pull the pgAdmin Docker Image:Use the following command to pull the official pgAdmin Docker image from Docker Hub.
bash   docker pull dpage/pgadmin4
Enter fullscreen modeExit fullscreen mode
  1. Create a Docker Network:It's a good practice to create a Docker network to facilitate communication between the PostgreSQL container and the pgAdmin container.
bash   docker network create pgadmin-network
Enter fullscreen modeExit fullscreen mode
  1. Run the PostgreSQL Container:Now modify the postgres db run command
bashdocker run --name pg-database \--network pgadmin-network \-e POSTGRES_USER="root" \-e POSTGRES_PASSWORD="password" \-e POSTGRES_DB="ny_taxi" \-v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \-p 5432:5432 \-d postgres
Enter fullscreen modeExit fullscreen mode

Replacepassword with your desired PostgreSQL password.

  1. Run the pgAdmin Container:Now, you can run the pgAdmin container and link it to the PostgreSQL container.
bash   docker run --name pgadmin-container \              --network pgadmin-network \              -e PGADMIN_DEFAULT_EMAIL=myemail@example.com \              -e PGADMIN_DEFAULT_PASSWORD=mypassword \              -p 5055:80 \              -d dpage/pgadmin4
Enter fullscreen modeExit fullscreen mode

Replacemyemail@example.com andmypassword with your desired pgAdmin login credentials.

  1. Access pgAdmin:
    Open your web browser and navigate tohttp://localhost:5055. Log in with the credentials you provided in the previous step.

  2. Add PostgreSQL Server:
    In pgAdmin, click on "Add New Server" and fill in the necessary details to connect to the PostgreSQL server running in the Docker container.

  • Host name/address:postgres-container (the name of your PostgreSQL container)
  • Port:5432
  • Username:postgres
  • Password: (the password you set in step 3)

Now, you should be able to manage your PostgreSQL server using pgAdmin in a Docker container. Adjust the commands and parameters according to your specific requirements and environment.

Rather, we can also use Docker Compose

Create adocker-compose.yml.. Now, you don't specify the network

services:  pgdatabase:    image: postgres:latest    environment:      - POSTGRES_USER=root      - POSTGRES_PASSWORD=password      - POSTGRES_DB=ny_taxi    volumes:      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"    ports:      - "5432:5432"    container_name: mypostgres  pgadmin:    image: dpage/pgadmin4    environment:      - PGADMIN_DEFAULT_EMAIL=myemail@example.com      - PGADMIN_DEFAULT_PASSWORD=mypassword    ports:      - "5055:80"    container_name: pgadmin
Enter fullscreen modeExit fullscreen mode

To start Docker Composedocker-compose up
To run Docker Compose in the backgrounddocker-compose up -d
To view Docker Compose containersdocker-compose ps
To stop Docker Composedocker-compose down
To stop Docker Compose if you used the -d flagdocker-compose down -v

Top comments(0)

Subscribe
pic
Create template

Templates let you quickly answer FAQs or store snippets for re-use.

Dismiss

Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment'spermalink.

For further actions, you may consider blocking this person and/orreporting abuse

MLOps || Software Engineering
  • Pronouns
    she/her
  • Joined

More fromtotalSophie

DEV Community

We're a place where coders share, stay up-to-date and grow their careers.

Log in Create account

[8]ページ先頭

©2009-2025 Movatter.jp