Stackexchange (e.g., stackoverflow) data dump converter from XML to CSV format.
SkobelevIgor/stackexchange-xml-converter
A CLI tool that converts Stack Exchange data dumps from XML to CSV or JSON, formats that are more suitable for importing into databases.
Here you can find the examples of the schema for the different databases:
Before you start, ensure that you have:
- A working Go environment with Go version >= 1.14. Run the go version command in the console; it should display the current version of the compiler.
- An archiver that can extract .7z files, for example 7z.
Choose and download the database dump that you are going to convert.
Important: the Stack Overflow dump is stored in 8 separate 7z archives:
- stackoverflow.com-Badges.7z ( ~70M compressed / 4G uncompressed / 37M rows )
- stackoverflow.com-Comments.7z ( ~4.5G compressed / 22G uncompressed / 76M rows )
- stackoverflow.com-PostHistory.7z ( ~28.0G compressed / 138G uncompressed / 133M rows )
- stackoverflow.com-PostLinks.7z ( ~100M compressed / 800M uncompressed / 7M rows )
- stackoverflow.com-Posts.7z ( ~16G compressed / 80G uncompressed / 50M rows )
- stackoverflow.com-Tags.7z ( ~900K compressed / 5.0M uncompressed / 60K rows )
- stackoverflow.com-Users.7z ( ~650M compressed / 4.0G uncompressed / 13M rows )
- stackoverflow.com-Votes.7z ( ~1.0G compressed / 20G uncompressed / 200M rows )
Extract the contents of the archive(s) into the directory from which you will convert the XML files.
Example with the academia.stackexchange.com.7z dump:
$ mkdir xml csv
$ 7z e academia.stackexchange.com.7z -oxml
$ ls xml/
Badges.xml Comments.xml PostHistory.xml PostLinks.xml Posts.xml Tags.xml Users.xml Votes.xml
Clone and build stackexchange-xml-converter:
$ git clone https://github.com/SkobelevIgor/stackexchange-xml-converter
$ cd stackexchange-xml-converter/
$ go build
Now you have the stackexchange-xml-converter executable file. Let’s convert the XML files to CSV format:
./stackexchange-xml-converter -result-format=csv -source-path=../xml -store-to-dir=../csv
- result-format (Required) Result format (csv or json).
- source-path (Required) Absolute or relative path to a directory containing XML files, or to a single XML file.
- store-to-dir (Optional) Absolute or relative path to the directory where the result files are stored.
- skip-html-decoding (Optional) Some of the files (e.g., Posts.xml) contain escaped HTML. By default, the converter decodes it. Use this flag to disable that behavior.
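The effect of HTML decoding can be illustrated with a short Go sketch. It assumes the decoding is comparable to the standard library's `html.UnescapeString`; the converter's own implementation may differ, and `decodeField` and its `skip` parameter are hypothetical names standing in for the skip-html-decoding flag.

```go
package main

import (
	"fmt"
	"html"
)

// decodeField returns s with HTML entities unescaped, unless skip is true.
// Hypothetical helper; skip mimics the effect of the skip-html-decoding flag.
func decodeField(s string, skip bool) string {
	if skip {
		return s
	}
	return html.UnescapeString(s)
}

func main() {
	// Made-up field value containing escaped HTML.
	body := "&lt;p&gt;Use &amp;&amp; in Go&lt;/p&gt;"

	fmt.Println(decodeField(body, false)) // prints: <p>Use && in Go</p>
	fmt.Println(decodeField(body, true))  // prints: &lt;p&gt;Use &amp;&amp; in Go&lt;/p&gt;
}
```

Skipping the decoding may be preferable when the target database or a later processing step expects the escaped form.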