Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Python scripts to import StackExchange data dump into Postgres DB.

License

NotificationsYou must be signed in to change notification settings

Networks-Learning/stackexchange-dump-to-postgres

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

82 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

This is a quick script to move the Stackoverflow data from theStackExchangedata dump (Sept '14) to a PostgresSQL database.

Schema hints are taken froma post onMeta.StackExchangeand fromStackExchange Data Explorer.

Quickstart

Install requirements, create a new database (e.g.beerSO below), and useload_into_pg.py script:

$pip install -r requirements.txt...Successfully installed argparse-1.2.1 libarchive-c-2.9 lxml-4.5.2 psycopg2-binary-2.8.4 six-1.10.0$createdb beerSO$python load_into_pg.py -s beer -d beerSO

This will download compressed files fromarchive.org and loadall the tables at once.

Advanced Usage

You can use a custom database name as well. Make sure to explicitly give itwhile executing the script later.

Each table data is archived in an XML file. Available tables varies accrosshistory.load_into_pg.py knows how to handle the following tables:

  • Badges.
  • Posts.
  • Tags (not present in earliest dumps).
  • Users.
  • Votes.
  • PostLinks.
  • PostHistory.
  • Comments.

You can download manually the files to the folder from where the program isexecuted:Badges.xml,Votes.xml,Posts.xml,Users.xml,Tags.xml. Insome old dumps, the cases in the filenames are different.

Then load each file with e.g.python load_into_pg.py -t Badges.

After all the initial tables have been created:

$psql beerSO< ./sql/final_post.sql

For some additional indexes and tables, you can also execute the the following;

$psql beerSO< ./sql/optional_post.sql

If you give a schema name using the-n switch, all the tables will be movedto the given schema. This schema will be created in the script.

The paths are not changed in the final scriptssql/final_post.sql andsql/optional_post.sql. To run them, first set thesearch_path to yourschema name:SET search_path TO <myschema>;

Caveats and TODOs

  • It prepares some indexes and views which may not be necessary for your analysis.
  • TheBody field inPosts table is NOT populated by default. You have to use--with-post-body argument to include it.
  • TheEmailHash field inUsers table is NOT populated.

Sept 2011 data dump

  • Thetags.xml is missing from the data dump. Hence, thePostTag andUserTagQA tables will be empty afterfinal_post.sql.
  • TheViewCount inPosts is sometimes equal to anempty value. It is replaced byNULL in those cases.

Acknowledgement

  • @madtibo made significant contributions by addingjsonb and Foreign Key support.
  • @bersace brought the dependencies and theREADME.md instructions into 2020s.
  • @rdrg109 simplified handling of non-public schemas and fixed bugs associated with re-importing tables.

About

Python scripts to import StackExchange data dump into Postgres DB.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors10

Languages


[8]ページ先頭

©2009-2025 Movatter.jp