sidphbot/Auto-Research

Generate a custom, detailed survey paper with topic-clustered sections and proper citations, from just a single query, in under 30 mins !!
title | emoji | colorFrom | colorTo | sdk | sdk_version | app_file | pinned |
---|---|---|---|---|---|---|---|
Researcher | 🤓 | gray | pink | streamlit | 1.2.0 | app.py | false |
A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.

Data Provider: arXiv Open Archive Initiative (OAI)
Requirements:
- python 3.7 or above
- poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
- list of requirements in requirements.txt - `cat requirements.txt | xargs pip install`
- 8GB disk space
- 13GB CUDA (GPU) memory - for a survey of 100 searched papers (max_search) and 25 selected papers (num_papers)
Video Demo: https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle Re-usable Demo: https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query

[TIP] Click 'edit and run' to run the demo for your custom queries on a free GPU.
To install, run:

```shell
sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
pip install git+https://github.com/sidphbot/Auto-Research.git
```

Then generate a survey from the command line:

```shell
python survey.py [options] <your_research_query>
```

or launch the streamlit app:

```shell
streamlit run app.py
```

or use it from python code:

```python
from survey import Surveyor

mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')
```
The following are independent tools for your research or document text-handling needs.

*[Tip]*: models can be changed in defaults or passed on during init along with `refresh_models=True`.
abstractive_summary
- takes a long text document (string) and returns a 1-paragraph abstract or “abstractive” summary (string)
  - Input: `longtext` : string
  - Returns: `summary` : string
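A minimal usage sketch, assuming these utilities are exposed as methods on an initialized `Surveyor` object; the input file path is hypothetical:

```python
from survey import Surveyor

surveyor = Surveyor()
longtext = open("my_paper.txt").read()            # any long document string
summary = surveyor.abstractive_summary(longtext)  # 1-paragraph abstractive summary (string)
print(summary)
```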
extractive_summary
- takes a long text document (string) and returns a 1-paragraph of extracted highlights or “extractive” summary (string)
  - Input: `longtext` : string
  - Returns: `summary` : string
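A similar sketch for the extractive variant, under the same Surveyor-method assumption:

```python
from survey import Surveyor

surveyor = Surveyor()
longtext = open("my_paper.txt").read()           # any long document string
summary = surveyor.extractive_summary(longtext)  # 1-paragraph extractive summary (string)
```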
generate_title
- takes a long text document (string) and returns a generated title (string)
  - Input: `longtext` : string
  - Returns: `title` : string
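Sketch, under the same Surveyor-method assumption:

```python
from survey import Surveyor

surveyor = Surveyor()
longtext = open("my_paper.txt").read()     # any long document string
title = surveyor.generate_title(longtext)  # generated title (string)
```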
extractive_highlights
- takes a long text document (string) and returns a list of extracted highlights ([string]), a list of keywords ([string]) and key phrases ([string])
  - Input: `longtext` : string
  - Returns: `highlights` : [string], `keywords` : [string], `keyphrases` : [string]
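Sketch, under the same Surveyor-method assumption, plus the assumption that the three documented values come back as a tuple in the listed order:

```python
from survey import Surveyor

surveyor = Surveyor()
longtext = open("my_paper.txt").read()  # any long document string
highlights, keywords, keyphrases = surveyor.extractive_highlights(longtext)
```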
extract_images_from_file
- takes a pdf file name (string) and returns a list of image filenames ([string]).
  - Input: `pdf_file` : string
  - Returns: `images_files` : [string]
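Sketch, under the same Surveyor-method assumption; the pdf path is hypothetical:

```python
from survey import Surveyor

surveyor = Surveyor()
images_files = surveyor.extract_images_from_file("arxiv_data/tarpdfs/some_paper.pdf")
```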
extract_tables_from_file
- takes a pdf file name (string) and returns a list of csv filenames ([string]).
  - Input: `pdf_file` : string
  - Returns: `csv_files` : [string]
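Sketch, under the same Surveyor-method assumption; the pdf path is hypothetical:

```python
from survey import Surveyor

surveyor = Surveyor()
csv_files = surveyor.extract_tables_from_file("arxiv_data/tarpdfs/some_paper.pdf")
```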
cluster_lines
- takes a list of lines ([string]) and returns the topic-clustered sections (dict(generated_title: [cluster_abstract])) and clustered lines (dict(cluster_id: [cluster_lines]))
  - Input: `lines` : [string]
  - Returns: `sections` : dict(generated_title: [cluster_abstract]), `clusters` : dict(cluster_id: [cluster_lines])
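Sketch, under the same Surveyor-method assumption, plus the assumption that the two documented values come back as a tuple in the listed order:

```python
from survey import Surveyor

surveyor = Surveyor()
lines = [
    "Quantum entanglement is a physical phenomenon.",
    "Bell tests rule out local hidden-variable theories.",
    "Entanglement is a key resource for quantum teleportation.",
]
sections, clusters = surveyor.cluster_lines(lines)
# sections: dict(generated_title: [cluster_abstract]); clusters: dict(cluster_id: [cluster_lines])
```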
extract_headings
- [for scientific texts - assumes an ‘abstract’ heading is present] takes a text file name (string) and returns a list of headings ([string]) and refined lines ([string]).
  - [Tip 1]: Use extract_sections as a wrapper (e.g. extract_sections(extract_headings("/path/to/textfile"))) to get heading-wise sectioned text with refined lines instead (dict(heading: text))
  - [Tip 2]: write the word ‘abstract’ at the start of the file text to get an extraction for non-scientific texts as well !!
  - Input: `text_file` : string
  - Returns: `refined` : [string], `headings` : [string], `sectioned_doc` : dict(heading: text) (Optional - wrapper case)
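Sketch of the direct call and the Tip 1 wrapper, under the same Surveyor-method assumption plus the assumption that the two documented values come back as a tuple in the listed order; the text file path is hypothetical:

```python
from survey import Surveyor

surveyor = Surveyor()
text_file = "arxiv_data/fulltext/some_paper.txt"
refined, headings = surveyor.extract_headings(text_file)
# or, per Tip 1, wrap with extract_sections to get dict(heading: text) instead
sectioned_doc = surveyor.extract_sections(surveyor.extract_headings(text_file))
```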
Default arguments can be viewed or changed:

- inside code:

```python
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)
```

or,

- by modifying the static config file - `defaults.py`

or,

- at runtime (utility):

`python survey.py --help`
```
usage: survey.py [-h] [--max_search max_metadata_papers]
                 [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                 [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                 [--dump_dir dump_dir] [--models_dir save_models_dir]
                 [--title_model_name title_model_name]
                 [--ex_summ_model_name extractive_summ_model_name]
                 [--ledmodel_name ledmodel_name]
                 [--embedder_name sentence_embedder_name]
                 [--nlp_name spacy_model_name]
                 [--similarity_nlp_name similarity_nlp_name]
                 [--kw_model_name kw_model_name]
                 [--refresh_models refresh_models] [--high_gpu high_gpu]
                 query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximum number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximum number of papers to download and analyse - defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to arxiv_data/tables/
  --dump_dir dump_dir   all_output_dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to 'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face, defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model (for abstractive summary) name/tag in hugging-face, defaults to 'allenai/led-large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to 'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face, defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        Refresh model downloads with given names (needs at least one model name param above), defaults to False
  --high_gpu high_gpu   High GPU usage permitted, defaults to False
```
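For example, a run with both paper limits lowered, using the flags documented above (the query string is hypothetical):

```shell
python survey.py --max_search 50 --num_papers 10 'quantum entanglement'
```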
- at runtime (code), during surveyor object initialization with `surveyor_obj = Surveyor()` (see the init sketch after this list):
  - `pdf_dir`: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
  - `txt_dir`: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
  - `img_dir`: String, image storage directory - defaults to `arxiv_data/images/`
  - `tab_dir`: String, tables storage directory - defaults to `arxiv_data/tables/`
  - `dump_dir`: String, all_output_dir - defaults to `arxiv_dumps/`
  - `models_dir`: String, directory to save the huge models, defaults to `saved_models/`
  - `title_model_name`: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
  - `ex_summ_model_name`: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
  - `ledmodel_name`: String, led model (for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
  - `embedder_name`: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
  - `nlp_name`: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
  - `similarity_nlp_name`: String, spacy downstream trained model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
  - `kw_model_name`: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
  - `high_gpu`: Bool, High GPU usage permitted, defaults to `False`
  - `refresh_models`: Bool, Refresh model downloads with given names (needs at least one model name param above), defaults to `False`
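A hedged initialization sketch passing a few of the parameters above as keyword arguments (values shown are the documented defaults, except `high_gpu`):

```python
from survey import Surveyor

surveyor_obj = Surveyor(
    pdf_dir='arxiv_data/tarpdfs/',
    models_dir='saved_models/',
    ledmodel_name='allenai/led-large-16384-arxiv',
    refresh_models=False,
    high_gpu=True,  # permit high GPU usage
)
```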
- or during survey generation with `surveyor_obj.survey(query="my_research_query")` (see the call sketch after this list):
  - `max_search`: int, maximum number of papers to gaze at - defaults to `100`
  - `num_papers`: int, maximum number of papers to download and analyse - defaults to `25`
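Sketch of a survey call with both generation-time parameters, assuming they are accepted as keyword arguments of `survey` as the listing suggests:

```python
surveyor_obj.survey(query="my_research_query", max_search=100, num_papers=25)
```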
The generated artifacts include:
- Detailed survey draft paper as a txt file
- A curated list of top 25+ papers as pdfs and txts
- Images extracted from the above papers as jpegs, bmps etc.
- Heading/section-wise highlights extracted from the above papers as a re-usable pure python joblib dump
- Tables extracted from the papers (optional)
- Corpus of metadata highlights/text of top 100 papers as a re-usable pure python joblib dump
This work builds upon these fantastic models (for various nlp sub-tasks) out there for researchers and devs like us:
- https://huggingface.co/Callidior/bert2bert-base-arxiv-titlegen
- https://huggingface.co/allenai/scibert_scivocab_uncased
- https://huggingface.co/allenai/led-large-16384-arxiv
- https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2
- https://huggingface.co/sentence-transformers/distilbert-base-nli-mean-tokens
- https://tabula.technology/
- https://spacy.io/
- https://allenai.github.io/scispacy/
Please cite this repo if it helped you :)