Text analysis of all 163000+ theoretical high energy physics papers on arXiv.
Daniele-Gregori/ArXiv-Hepth-Data-Analysis
Text analysis of all 163 000+ theoretical high energy physics papers on arXiv (with hep-th as primary or cross-list category), from 1986 to 2023.
This project explores the following tasks: 1) counting; 2) feature extraction; 3) classification; 4) question answering; 5) summarization; 6) recommending papers / research directions. The results are the following:
- interesting temporal trends appear in the popularity of title words;
- two-word combinations of title words turn out to correspond to hep-th concepts and allow effective feature extraction and CONCEPT embedding of abstracts;
- classifiers of article categories are built as neural networks (NNs) based on either CONCEPT or SciBERT embedding;
- through a more sophisticated NN, the CONCEPT classifier also works for the subcategories within the hep-th category;
- question answering and summarization of article introductions work effectively through high-level AI functionality of the Wolfram Language (WL);
- a first basic recommendation algorithm is built, based on distance in feature space.
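To make the two-word CONCEPT features and the distance-based recommendation more concrete, here is a minimal Python sketch. It is not taken from the notebook: the function names, the `min_count` threshold, the binary embedding, the cosine distance and the toy titles and abstracts are all assumptions chosen for illustration, and the notebook's actual construction may differ.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Return the adjacent two-word combinations of a lowercased text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [f"{a} {b}" for a, b in zip(words, words[1:])]


def build_concepts(titles, min_count=2):
    """Keep two-word combinations that occur in at least min_count titles' bigram lists."""
    counts = Counter(bg for title in titles for bg in bigrams(title))
    return sorted(bg for bg, n in counts.items() if n >= min_count)


def embed(abstract, concepts):
    """Binary vector: 1.0 where a concept occurs in the abstract (assumed scheme)."""
    present = set(bigrams(abstract))
    return [1.0 if c in present else 0.0 for c in concepts]


def cosine_distance(u, v):
    """Cosine distance between two vectors; 1.0 if either vector is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 if norm == 0 else 1.0 - dot / norm


def recommend(query_index, vectors, k=1):
    """Indices of the k papers closest to the query paper in feature space."""
    query = vectors[query_index]
    ranked = sorted(
        (cosine_distance(query, v), i)
        for i, v in enumerate(vectors)
        if i != query_index
    )
    return [i for _, i in ranked[:k]]


# Toy placeholder data, invented for this sketch (not from the arXiv dataset).
titles = [
    "Black hole entropy in string theory",
    "String theory and black hole microstates",
    "Conformal field theory on the torus",
    "Notes on conformal field theory",
]
abstracts = [
    "We count black hole microstates using string theory.",
    "A conformal field theory approach to critical phenomena.",
    "Black hole evaporation and information.",
]

concepts = build_concepts(titles)
vectors = [embed(a, concepts) for a in abstracts]
nearest = recommend(0, vectors, k=1)
print(concepts)
print(nearest)
```

In this toy setting the recurring title bigrams play the role of concepts, and the paper recommended for the first abstract is the one sharing a concept with it. A real pipeline would presumably also filter out uninformative combinations and use a much larger concept vocabulary.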
Looking ahead, it seems sensible to relate papers in feature space and thus inspire new discoveries.
All of this can be found in the notebook named arXivDataAnalysisV1.3 (which must be unzipped first).
As a partial aside, the notebook Affiliation Countries shows the computation of the total number of papers over affiliated co-authors, for each country in 2023. This is done by querying the INSPIRE-HEP API directly. The results are given both as totals and as shares per 10^6 capita.
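The per-country computation can be sketched in Python as follows. This is only one possible reading of "papers over affiliated co-authors", namely fractional counting in which each paper contributes 1/n to the country of each of its n authors. That reading, the function names, the country labels "A" and "B" and all the numbers are assumptions made for illustration; no INSPIRE-HEP data is queried here.

```python
from collections import defaultdict


def fractional_counts(papers):
    """Sum, per country, a weight of 1/n for each of a paper's n authors.

    `papers` is a list of papers, each given as the list of its authors'
    affiliation countries (hypothetical input format).
    """
    totals = defaultdict(float)
    for author_countries in papers:
        if not author_countries:
            continue  # skip papers without affiliation information
        weight = 1.0 / len(author_countries)
        for country in author_countries:
            totals[country] += weight
    return dict(totals)


def per_million(totals, population):
    """Rescale per-capita shares to shares per 10^6 inhabitants."""
    return {
        country: totals[country] * 1e6 / population[country]
        for country in totals
        if country in population
    }


# Placeholder records and populations, invented for this sketch.
papers = [["A", "A"], ["A", "B"], ["B"], ["A", "B", "B", "B"]]
population = {"A": 2_000_000, "B": 500_000}

totals = fractional_counts(papers)
shares = per_million(totals, population)
print(totals)
print(shares)
```

With fractional counting the country totals add up to the number of papers, so a paper with many co-authors is not counted several times.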
