Python scripts to import StackExchange data dump into Postgres DB.
This is a quick script to move the Stack Overflow data from the StackExchange data dump (Sept '14) to a PostgreSQL database.
Schema hints are taken from a post on Meta.StackExchange and from the StackExchange Data Explorer.
Install the requirements, create a new database (e.g. `beerSO` below), and use the `load_into_pg.py` script:

```
$ pip install -r requirements.txt
...
Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0
$ createdb beerSO
$ python load_into_pg.py -s beer -d beerSO
```
This will download the compressed files from archive.org and load all the tables at once.
You can use a custom database name as well; make sure to pass it explicitly when executing the script later.
Each table's data is archived in an XML file. The available tables vary across dump history. `load_into_pg.py` knows how to handle the following tables:
- `Badges`
- `Posts`
- `Tags` (not present in earliest dumps)
- `Users`
- `Votes`
- `PostLinks`
- `PostHistory`
- `Comments`
You can manually download the files to the folder from which the program is executed: `Badges.xml`, `Votes.xml`, `Posts.xml`, `Users.xml`, `Tags.xml`. In some old dumps, the letter case of the filenames differs.
Then load each file individually, e.g. `python load_into_pg.py -t Badges`.
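Each of these XML files is a flat sequence of `<row>` elements whose attributes map onto table columns. A minimal sketch of how such a file can be parsed row by row (the sample attributes below follow the `Badges.xml` layout of recent dumps; this is an illustration, not the loader's exact code, which streams large files with `lxml`):

```python
# Parse a StackExchange dump XML fragment: one dict per <row> element.
import xml.etree.ElementTree as ET

SAMPLE = """<?xml version="1.0" encoding="utf-8"?>
<badges>
  <row Id="1" UserId="3" Name="Autobiographer" Date="2014-01-21T20:58:54.800" />
  <row Id="2" UserId="5" Name="Student" Date="2014-01-21T21:11:10.323" />
</badges>"""

def iter_rows(xml_text):
    """Yield the attributes of each <row> element as a dict."""
    root = ET.fromstring(xml_text)
    for row in root.iter("row"):
        yield dict(row.attrib)

rows = list(iter_rows(SAMPLE))
print(rows[0]["Name"])  # -> Autobiographer
```

Because every value arrives as a string, the loader has to cast attributes to the right column types before inserting them.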
After all the initial tables have been created:
```
$ psql beerSO < ./sql/final_post.sql
```
For some additional indexes and tables, you can also execute the following:

```
$ psql beerSO < ./sql/optional_post.sql
```
If you give a schema name using the `-n` switch, all the tables will be moved to the given schema. The schema will be created by the script.

The paths are not changed in the final scripts `sql/final_post.sql` and `sql/optional_post.sql`. To run them, first set the `search_path` to your schema name: `SET search_path TO <myschema>;`
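Since the SQL scripts reference unqualified table names, the session must point `search_path` at the schema before they run. A sketch of the sequence inside `psql` (assuming the schema was created as `myschema`):

```
$ psql beerSO
beerSO=# SET search_path TO myschema;
beerSO=# \i sql/final_post.sql
```

Note that `SET search_path` only lasts for the current session, so it must be re-issued in every new connection before querying the tables.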
- It prepares some indexes and views which may not be necessary for your analysis.
- The `Body` field in the `Posts` table is NOT populated by default. You have to use the `--with-post-body` argument to include it.
- The `EmailHash` field in the `Users` table is NOT populated.
- The `tags.xml` is missing from the data dump. Hence, the `PostTag` and `UserTagQA` tables will be empty after `final_post.sql`.
- The `ViewCount` in `Posts` is sometimes an empty value. It is replaced by `NULL` in those cases.
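The `ViewCount` caveat boils down to a small conversion rule: an empty (or absent) attribute must become SQL `NULL` rather than `0`. A sketch of that rule in plain Python (this mirrors the behaviour described above, not the script's literal code; in `psycopg2`, a Python `None` is sent to PostgreSQL as `NULL`):

```python
# Map a row's ViewCount attribute to an int, or None (-> SQL NULL)
# when the attribute is empty or missing.
def view_count(attrs):
    """Return ViewCount as int, or None for empty/absent values."""
    raw = attrs.get("ViewCount", "")
    return int(raw) if raw else None

print(view_count({"ViewCount": "42"}))  # -> 42
print(view_count({"ViewCount": ""}))    # -> None
print(view_count({}))                   # -> None
```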