# Auto-Research

Generate a custom, detailed survey paper with topic-clustered sections and proper citations, from just a single query, in under 30 mins !!
Hugging Face Spaces app configuration:

| title | emoji | colorFrom | colorTo | sdk | sdk_version | app_file | pinned |
|---|---|---|---|---|---|---|---|
| Researcher | 🤓 | gray | pink | streamlit | 1.2.0 | app.py | false |
A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (in draft paper format) and other interesting artifacts from a single research query.
Data Provider: arXiv Open Archive Initiative (OAI)
Requirements:
- python 3.7 or above
- poppler-utils:
```shell
sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev
```
- the list of requirements in `requirements.txt`:
```shell
cat requirements.txt | xargs pip install
```
- 8GB disk space
- 13GB CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)
Video Demo: https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle Re-usable Demo: https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query

*[Tip]* : click 'edit and run' to run the demo for your custom queries on a free GPU.
Installation:
```shell
sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
pip install git+https://github.com/sidphbot/Auto-Research.git
```
Usage - via the command-line utility:
```shell
python survey.py [options] <your_research_query>
```
or via the Streamlit app:
```shell
streamlit run app.py
```
or from inside Python code:
```python
from survey import Surveyor

mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```
These are independent tools for your research or document text-handling needs.

*[Tip]* : models can be changed in the defaults or passed in during init along with `refresh_models=True`.
`abstractive_summary` - takes a long text document (string) and returns a one-paragraph abstract or "abstractive" summary (string).
- Input: `longtext` : string
- Returns: `summary` : string
`extractive_summary` - takes a long text document (string) and returns a one-paragraph "extractive" summary of extracted highlights (string).
- Input: `longtext` : string
- Returns: `summary` : string
`generate_title` - takes a long text document (string) and returns a generated title (string).
- Input: `longtext` : string
- Returns: `title` : string
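As a quick illustration, here is a minimal sketch exercising the three text-level utilities above - assuming, as the usage section suggests, that they are exposed as methods on a `Surveyor` instance, and that `paper.txt` is a hypothetical local text file:

```python
from survey import Surveyor

surveyor = Surveyor()

# any long text document as a plain string (file name is hypothetical)
longtext = open("paper.txt").read()

summary = surveyor.abstractive_summary(longtext)    # one-paragraph abstractive summary
ex_summary = surveyor.extractive_summary(longtext)  # one-paragraph extractive summary
title = surveyor.generate_title(longtext)           # generated title

print(title, summary, ex_summary, sep="\n\n")
```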
`extractive_highlights` - takes a long text document (string) and returns a list of extracted highlights ([string]), a list of keywords ([string]) and a list of key phrases ([string]).
- Input: `longtext` : string
- Returns: `highlights` : [string], `keywords` : [string], `keyphrases` : [string]
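A hedged sketch of unpacking the three documented return values, continuing with `surveyor` and `longtext` from the sketch above (the tuple return order follows the documentation):

```python
highlights, keywords, keyphrases = surveyor.extractive_highlights(longtext)

for h in highlights:
    print("-", h)
print("keywords:", ", ".join(keywords))
print("keyphrases:", ", ".join(keyphrases))
```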
`extract_images_from_file` - takes a pdf file name (string) and returns a list of image filenames ([string]).
- Input: `pdf_file` : string
- Returns: `images_files` : [string]
`extract_tables_from_file` - takes a pdf file name (string) and returns a list of csv filenames ([string]).
- Input: `pdf_file` : string
- Returns: `csv_files` : [string]
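Both PDF utilities can be sketched together, continuing with `surveyor` from above - the pdf path below is hypothetical, pointing into the default `arxiv_data/tarpdfs/` storage directory:

```python
pdf_file = "arxiv_data/tarpdfs/1234.56789.pdf"  # hypothetical downloaded paper

images_files = surveyor.extract_images_from_file(pdf_file)  # image filenames
csv_files = surveyor.extract_tables_from_file(pdf_file)     # csv filenames

print(len(images_files), "images,", len(csv_files), "tables extracted")
```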
`cluster_lines` - takes a list of lines ([string]) and returns the topic-clustered sections (dict(generated_title: [cluster_abstract])) and clustered lines (dict(cluster_id: [cluster_lines])).
- Input: `lines` : [string]
- Returns: `sections` : dict(generated_title: [cluster_abstract]), `clusters` : dict(cluster_id: [cluster_lines])
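For example, a minimal sketch clustering the lines of a long document, again assuming `surveyor` and `longtext` from the earlier sketch:

```python
lines = [l for l in longtext.splitlines() if l.strip()]

sections, clusters = surveyor.cluster_lines(lines)

# sections maps a generated title to its cluster abstract(s)
for generated_title, cluster_abstract in sections.items():
    print(generated_title)
```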
`extract_headings` - [for scientific texts - assumes an 'abstract' heading is present] takes a text file name (string) and returns a list of headings ([string]) and refined lines ([string]).
- [Tip 1]: use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (dict(heading: text)).
- [Tip 2]: write the word 'abstract' at the start of the file text to get an extraction for non-scientific texts as well !!
- Input: `text_file` : string
- Returns: `refined` : [string], `headings` : [string], `sectioned_doc` : dict(heading: text) (optional - wrapper case)
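Following Tip 1 above, a hedged sketch of the wrapper usage, continuing with `surveyor` from the earlier sketches (the file path is hypothetical):

```python
# heading-wise sectioned text with refined lines (wrapper case)
sectioned_doc = surveyor.extract_sections(surveyor.extract_headings("/path/to/textfile"))

for heading, text in sectioned_doc.items():
    print(heading, "->", len(text), "characters")
```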
Defaults can be inspected or modified:
- inside code:
```python
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)
```
or,
- by modifying the static config file `defaults.py`
or,
- at runtime (utility):
```shell
python survey.py --help
```
```text
usage: survey.py [-h] [--max_search max_metadata_papers] [--num_papers max_num_papers]
                 [--pdf_dir pdf_dir] [--txt_dir txt_dir] [--img_dir img_dir]
                 [--tab_dir tab_dir] [--dump_dir dump_dir] [--models_dir save_models_dir]
                 [--title_model_name title_model_name]
                 [--ex_summ_model_name extractive_summ_model_name]
                 [--ledmodel_name ledmodel_name] [--embedder_name sentence_embedder_name]
                 [--nlp_name spacy_model_name] [--similarity_nlp_name similarity_nlp_name]
                 [--kw_model_name kw_model_name] [--refresh_models refresh_models]
                 [--high_gpu high_gpu]
                 query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximium number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximium number of papers to download and analyse - defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to arxiv_data/tables/
  --dump_dir dump_dir   all_output_dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face, defaults to
                        'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model (for abstractive summary) name/tag in hugging-face,
                        defaults to 'allenai/led-large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults to
                        'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed - needs to be
                        spacy-installed prior), defaults to 'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model (for similarity) name/tag in hugging-face
                        (if changed - needs to be spacy-installed prior), defaults to
                        'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face, defaults to
                        'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        Refresh model downloads with given names (needs atleast one model
                        name param above), defaults to False
  --high_gpu high_gpu   High GPU usage permitted, defaults to False
```
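For example, a typical invocation overriding a few of the documented defaults might look like this (the query and directory are illustrative):

```shell
python survey.py --max_search 50 --num_papers 10 --dump_dir my_dumps/ "graph neural networks for drug discovery"
```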
At runtime (code), during Surveyor object initialization with:
```python
surveyor_obj = Surveyor()
```
- `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
- `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
- `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
- `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
- `dump_dir`: String, all output directory - defaults to `arxiv_dumps/`
- `models_dir`: String, directory to save the huge models - defaults to `saved_models/`
- `title_model_name`: String, title model name/tag in hugging-face - defaults to `Callidior/bert2bert-base-arxiv-titlegen`
- `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face - defaults to `allenai/scibert_scivocab_uncased`
- `ledmodel_name`: String, LED model (for abstractive summary) name/tag in hugging-face - defaults to `allenai/led-large-16384-arxiv`
- `embedder_name`: String, sentence embedder name/tag in hugging-face - defaults to `paraphrase-MiniLM-L6-v2`
- `nlp_name`: String, spacy model name/tag (if changed, needs to be spacy-installed first) - defaults to `en_core_sci_scibert`
- `similarity_nlp_name`: String, spacy downstream trained model (for similarity) name/tag (if changed, needs to be spacy-installed first) - defaults to `en_core_sci_lg`
- `kw_model_name`: String, keyword extraction model name/tag in hugging-face - defaults to `distilbert-base-nli-mean-tokens`
- `high_gpu`: Bool, high GPU usage permitted - defaults to `False`
- `refresh_models`: Bool, refresh model downloads with the given names (needs at least one model name param above) - defaults to `False`
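Putting a few of these together, a hedged initialization sketch - parameter names are as listed above, and every value shown is either a documented default or an illustrative override:

```python
from survey import Surveyor

surveyor_obj = Surveyor(
    models_dir="saved_models/",                     # documented default
    ledmodel_name="allenai/led-large-16384-arxiv",  # documented default
    high_gpu=True,                                  # illustrative override
)
```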
During survey generation with:
```python
surveyor_obj.survey(query="my_research_query")
```
- `max_search`: int, maximum number of papers to gaze at - defaults to `100`
- `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
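Assuming `survey()` accepts these as keyword arguments, as the parameter list above suggests, a call overriding both might look like:

```python
surveyor_obj.survey(
    query="my_research_query",
    max_search=50,   # gaze at fewer candidate papers
    num_papers=10,   # download and analyse fewer papers
)
```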
The pipeline produces the following artifacts:
- Detailed survey draft paper as a txt file
- A curated list of top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps, etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure-python joblib dump
- Tables extracted from the papers as csv files (optional)
- Corpus of metadata highlights/text of the top 100 papers as a re-usable pure-python joblib dump
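The joblib dumps can be reloaded in plain Python for downstream use - the filename below is hypothetical, since the actual dump names depend on your run and `dump_dir`:

```python
import joblib

# hypothetical dump inside the default dump_dir (arxiv_dumps/)
highlights = joblib.load("arxiv_dumps/highlights.dump")
print(type(highlights))
```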
This work builds upon these fantastic models (for various NLP sub-tasks) out there for researchers and devs like us:
- https://huggingface.co/Callidior/bert2bert-base-arxiv-titlegen
- https://huggingface.co/allenai/scibert_scivocab_uncased
- https://huggingface.co/allenai/led-large-16384-arxiv
- https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2
- https://huggingface.co/sentence-transformers/distilbert-base-nli-mean-tokens
- https://tabula.technology/
- https://spacy.io/
- https://allenai.github.io/scispacy/
Please cite this repo if it helped you :)