A Python parser for the DBLP dataset
26hzhang/DBLPParser
This is a Python parser for the DBLP dataset. The XML dump file can be downloaded from the DBLP homepage.

The parser requires the `dtd` file, so make sure you have both `dblp-XXX.xml` (the dataset) and `dblp-XXX.dtd`. Both files must be in the same directory, and the name of the `dtd` file must match the name given in the `<!DOCTYPE>` tag of the `xml` file. You can check this with the `head dblp-XXX.xml` command, whose output looks like this:
    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
    <dblp>
    <phdthesis mdate="2016-05-04" key="phd/dk/Heine2010">
    <author>Carmen Heine</author>
    <title>Modell zur Produktion von Online-Hilfen.</title>
    ...
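The same check can be done from Python before starting a long parse. The following is a minimal sketch, not part of the parser itself: `find_dtd_name` and `dtd_is_in_place` are hypothetical helper names, and the regular expression assumes the `<!DOCTYPE ... SYSTEM "...">` form with double quotes, as in the header of the dump.

```python
import os
import re

# Matches e.g. <!DOCTYPE dblp SYSTEM "dblp-2017-08-29.dtd">
# (assumption: the system identifier is written in double quotes)
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+\w+\s+SYSTEM\s+"([^"]+)"')


def find_dtd_name(xml_path, max_chars=4096):
    """Return the DTD file name declared in the XML header, or None."""
    # Only the beginning of the file is read, much like `head` does.
    with open(xml_path, "r", encoding="ISO-8859-1") as f:
        head = f.read(max_chars)
    match = DOCTYPE_RE.search(head)
    return match.group(1) if match else None


def dtd_is_in_place(xml_path):
    """True if the declared DTD file sits in the same directory as the XML."""
    dtd_name = find_dtd_name(xml_path)
    if dtd_name is None:
        return False
    xml_dir = os.path.dirname(os.path.abspath(xml_path))
    return os.path.isfile(os.path.join(xml_dir, dtd_name))
```

Calling `dtd_is_in_place('dataset/dblp.xml')` returns `False` when the file has no matching `<!DOCTYPE>` declaration or when the declared `dtd` file is missing from the dataset directory.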
Sample usage of the parser:
    def main():
        dblp_path = 'dataset/dblp.xml'
        save_path = 'article.json'
        try:
            context_iter(dblp_path)
            log_msg("LOG: Successfully loaded \"{}\".".format(dblp_path))
        except IOError:
            log_msg("ERROR: Failed to load file \"{}\". Please check your XML and DTD files.".format(dblp_path))
            exit()
        parse_article(dblp_path, save_path, save_to_csv=False)  # default save as json format
Some extracted results:
Number of publications of each type:
Number of occurrences of each attribute across all publications:
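As a rough illustration of how such counts can be produced, here is a minimal sketch that streams a small in-memory sample with the standard library. It is not the parser's own implementation: `count_types_and_fields` and `SAMPLE` are hypothetical names, "attributes" is interpreted here as the child elements of each publication (`author`, `title`, and so on), and the standard-library parser does not load the external DTD, so this sketch only works on entity-free input. The real dump needs a DTD-aware parser.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny, entity-free stand-in for the DBLP dump (hypothetical records).
SAMPLE = b"""<dblp>
<article key="a/1"><author>A</author><title>T1</title><year>2001</year></article>
<article key="a/2"><author>B</author><author>C</author><title>T2</title></article>
<phdthesis key="p/1"><author>D</author><title>T3</title></phdthesis>
</dblp>"""


def count_types_and_fields(source):
    """Count publication types and their child-element tags in one pass."""
    type_counts = Counter()
    field_counts = Counter()
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        depth -= 1
        if depth == 1:
            # A direct child of the root just closed: one publication record.
            type_counts[elem.tag] += 1
            field_counts.update(child.tag for child in elem)
            elem.clear()  # drop the record's children to limit memory use
    return type_counts, field_counts


types, fields = count_types_and_fields(io.BytesIO(SAMPLE))
print(dict(types))
print(dict(fields))
```

Tracking the nesting depth keeps tags that appear inside a title or other field from being mistaken for publication records.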