
---
title: Researcher
emoji: 🤓
colorFrom: gray
colorTo: pink
sdk: streamlit
sdk_version: 1.2.0
app_file: app.py
pinned: false
---

Auto-Research

A no-code utility to generate a detailed, well-cited survey with topic-clustered sections (draft paper format) and other interesting artifacts from a single research query.

Data Provider: arXiv Open Archives Initiative (OAI)

Requirements:

  • python 3.7 or above
  • poppler-utils - `sudo apt-get install build-essential libpoppler-cpp-dev pkg-config python-dev`
  • the dependencies listed in requirements.txt - `cat requirements.txt | xargs pip install`
  • 8 GB disk space
  • 13 GB CUDA (GPU) memory - for a survey of 100 searched papers (`max_search`) and 25 selected papers (`num_papers`)

Demo:

Video Demo: https://drive.google.com/file/d/1-77J2L10lsW-bFDOGdTaPzSr_utY743g/view?usp=sharing

Kaggle Re-usable Demo: https://www.kaggle.com/sidharthpal/auto-research-generate-survey-from-query

([TIP] click 'edit and run' to run the demo for your custom queries on a free GPU)

Installation:

sudo apt-get install build-essential poppler-utils libpoppler-cpp-dev pkg-config python-dev
pip install git+https://github.com/sidphbot/Auto-Research.git

Run Survey (cli):

python survey.py [options] <your_research_query>
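
For example, a smaller survey can be requested by narrowing the search and selection sizes (`--max_search` and `--num_papers` are documented under "Access/Modify defaults" below; the query here is just an illustration):

python survey.py --max_search 50 --num_papers 10 'deep learning for molecular dynamics'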

Run Survey (Streamlit web-interface - new):

streamlit run app.py

Run Survey (Python API):

from survey import Surveyor

mysurveyor = Surveyor()
mysurveyor.survey('quantum entanglement')

Research tools:

These are independent tools for your research or document text-handling needs. A combined usage sketch follows the tool list below.

*[Tip]* : models can be changed in defaults or passed in during init along with `refresh_models=True`
  • abstractive_summary - takes a long text document (string) and returns a 1-paragraph abstract or “abstractive” summary (string)

    Input:

      `longtext` : string

    Returns:

      `summary` : string
  • extractive_summary - takes a long text document (string) and returns a 1-paragraph “extractive” summary of extracted highlights (string)

    Input:

      `longtext` : string

    Returns:

      `summary` : string
  • generate_title - takes a long text document (string) and returns a generated title (string)

    Input:

      `longtext` : string

    Returns:

      `title` : string
  • extractive_highlights - takes a long text document (string) and returns a list of extracted highlights ([string]), a list of keywords ([string]) and key phrases ([string])

    Input:

      `longtext` : string

    Returns:

      `highlights` : [string]  `keywords` : [string]  `keyphrases` : [string]
  • extract_images_from_file - takes a pdf file name (string) and returns a list of image filenames ([string]).

    Input:

      `pdf_file` : string

    Returns:

      `images_files` : [string]
  • extract_tables_from_file - takes a pdf file name (string) and returns a list of csv filenames ([string]).

    Input:

      `pdf_file` : string

    Returns:

      `csv_files` : [string]
  • cluster_lines - takes a list of lines ([string]) and returns the topic-clustered sections (dict(generated_title: [cluster_abstract])) and clustered lines (dict(cluster_id: [cluster_lines]))

    Input:

      `lines` : [string]

    Returns:

      `sections` : dict(generated_title: [cluster_abstract])  `clusters` : dict(cluster_id: [cluster_lines])
  • extract_headings - [for scientific texts - assumes an ‘abstract’ heading is present] takes a text file name (string) and returns a list of headings ([string]) and refined lines ([string]).

    [Tip 1] : use `extract_sections` as a wrapper (e.g. `extract_sections(extract_headings("/path/to/textfile"))`) to get heading-wise sectioned text with refined lines instead (dict(heading: text))

    [Tip 2] : write the word ‘abstract’ at the start of the file text to get an extraction for non-scientific texts as well !!

    Input:

      `text_file` : string

    Returns:

      `refined` : [string]  `headings` : [string]  `sectioned_doc` : dict(heading: text) (Optional - wrapper case)
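
A combined usage sketch for the tools above - this assumes, as the init tip implies, that the tools are exposed as methods of a `Surveyor` instance and that multiple return values come back in the documented order; the file paths are placeholders:

from survey import Surveyor

s = Surveyor()
longtext = open('/path/to/textfile').read()   # placeholder document

summary = s.abstractive_summary(longtext)     # 1-paragraph abstractive summary (string)
headline = s.generate_title(longtext)         # generated title (string)
highlights, keywords, keyphrases = s.extractive_highlights(longtext)
sections, clusters = s.cluster_lines(longtext.splitlines())
image_files = s.extract_images_from_file('/path/to/paper.pdf')   # list of image filenames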

Access/Modify defaults:

  • inside code
from survey.Surveyor import DEFAULTS
from pprint import pprint

pprint(DEFAULTS)

or,

  • Modify the static config file - `defaults.py`

or,

  • At runtime (utility)
python survey.py --help
usage: survey.py [-h] [--max_search max_metadata_papers]
                 [--num_papers max_num_papers] [--pdf_dir pdf_dir]
                 [--txt_dir txt_dir] [--img_dir img_dir] [--tab_dir tab_dir]
                 [--dump_dir dump_dir] [--models_dir save_models_dir]
                 [--title_model_name title_model_name]
                 [--ex_summ_model_name extractive_summ_model_name]
                 [--ledmodel_name ledmodel_name]
                 [--embedder_name sentence_embedder_name]
                 [--nlp_name spacy_model_name]
                 [--similarity_nlp_name similarity_nlp_name]
                 [--kw_model_name kw_model_name]
                 [--refresh_models refresh_models] [--high_gpu high_gpu]
                 query_string

Generate a survey just from a query !!

positional arguments:
  query_string          your research query/keywords

optional arguments:
  -h, --help            show this help message and exit
  --max_search max_metadata_papers
                        maximum number of papers to gaze at - defaults to 100
  --num_papers max_num_papers
                        maximum number of papers to download and analyse -
                        defaults to 25
  --pdf_dir pdf_dir     pdf paper storage directory - defaults to
                        arxiv_data/tarpdfs/
  --txt_dir txt_dir     text-converted paper storage directory - defaults to
                        arxiv_data/fulltext/
  --img_dir img_dir     image storage directory - defaults to
                        arxiv_data/images/
  --tab_dir tab_dir     tables storage directory - defaults to
                        arxiv_data/tables/
  --dump_dir dump_dir   all output dir - defaults to arxiv_dumps/
  --models_dir save_models_dir
                        directory to save models (> 5GB) - defaults to
                        saved_models/
  --title_model_name title_model_name
                        title model name/tag in hugging-face, defaults to
                        'Callidior/bert2bert-base-arxiv-titlegen'
  --ex_summ_model_name extractive_summ_model_name
                        extractive summary model name/tag in hugging-face,
                        defaults to 'allenai/scibert_scivocab_uncased'
  --ledmodel_name ledmodel_name
                        led model (for abstractive summary) name/tag in
                        hugging-face, defaults to
                        'allenai/led-large-16384-arxiv'
  --embedder_name sentence_embedder_name
                        sentence embedder name/tag in hugging-face, defaults
                        to 'paraphrase-MiniLM-L6-v2'
  --nlp_name spacy_model_name
                        spacy model name/tag in hugging-face (if changed -
                        needs to be spacy-installed prior), defaults to
                        'en_core_sci_scibert'
  --similarity_nlp_name similarity_nlp_name
                        spacy downstream model (for similarity) name/tag in
                        hugging-face (if changed - needs to be spacy-installed
                        prior), defaults to 'en_core_sci_lg'
  --kw_model_name kw_model_name
                        keyword extraction model name/tag in hugging-face,
                        defaults to 'distilbert-base-nli-mean-tokens'
  --refresh_models refresh_models
                        refresh model downloads with given names (needs at
                        least one model name param above), defaults to False
  --high_gpu high_gpu   high GPU usage permitted, defaults to False
  • At runtime (code)

    during surveyor object initialization with `surveyor_obj = Surveyor()`

    • pdf_dir: String, pdf paper storage directory - defaults to `arxiv_data/tarpdfs/`
    • txt_dir: String, text-converted paper storage directory - defaults to `arxiv_data/fulltext/`
    • img_dir: String, image storage directory - defaults to `arxiv_data/images/`
    • tab_dir: String, tables storage directory - defaults to `arxiv_data/tables/`
    • dump_dir: String, all-output directory - defaults to `arxiv_dumps/`
    • models_dir: String, directory to save the huge models (> 5 GB) - defaults to `saved_models/`
    • title_model_name: String, title model name/tag in hugging-face, defaults to `Callidior/bert2bert-base-arxiv-titlegen`
    • ex_summ_model_name: String, extractive summary model name/tag in hugging-face, defaults to `allenai/scibert_scivocab_uncased`
    • ledmodel_name: String, led model (for abstractive summary) name/tag in hugging-face, defaults to `allenai/led-large-16384-arxiv`
    • embedder_name: String, sentence embedder name/tag in hugging-face, defaults to `paraphrase-MiniLM-L6-v2`
    • nlp_name: String, spacy model name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_scibert`
    • similarity_nlp_name: String, spacy downstream trained model (for similarity) name/tag in hugging-face (if changed - needs to be spacy-installed prior), defaults to `en_core_sci_lg`
    • kw_model_name: String, keyword extraction model name/tag in hugging-face, defaults to `distilbert-base-nli-mean-tokens`
    • high_gpu: Bool, high GPU usage permitted, defaults to `False`
    • refresh_models: Bool, refresh model downloads with given names (needs at least one model name param above), defaults to `False`

    during survey generation with `surveyor_obj.survey(query="my_research_query")`

    • max_search: int, maximum number of papers to gaze at - defaults to `100`
    • num_papers: int, maximum number of papers to download and analyse - defaults to `25`
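
Putting the two together - a minimal sketch, assuming the constructor and `survey()` accept the keyword arguments documented above (the directories and query are placeholders):

from survey import Surveyor

surveyor_obj = Surveyor(
    pdf_dir='my_data/pdfs/',    # placeholder pdf storage directory
    models_dir='my_models/',    # placeholder directory for the huge (> 5 GB) models
    high_gpu=True,              # permit high GPU usage
)
surveyor_obj.survey('quantum entanglement', max_search=50, num_papers=10)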

Artifacts generated (zipped):

  • Detailed survey draft paper as a txt file
  • A curated list of top 25+ papers as pdfs and txts
  • Images extracted from the above papers as jpegs, bmps, etc.
  • Heading/section-wise highlights extracted from the above papers as a re-usable pure-python joblib dump (see the loading sketch after this list)
  • Tables extracted from papers (optional)
  • Corpus of metadata highlights/text of the top 100 papers as a re-usable pure-python joblib dump
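
The joblib dumps are plain python objects and can be re-loaded for downstream use; the file name below is a placeholder, since the exact names inside the zipped output are not listed here:

import joblib

# placeholder path - check the unzipped output (default dump_dir: arxiv_dumps/) for the actual file name
section_highlights = joblib.load('arxiv_dumps/highlights.dump')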

This work builds upon these fantastic models (for various nlp sub-tasks) out there for researchers and devs like us:

  • https://huggingface.co/Callidior/bert2bert-base-arxiv-titlegen
  • https://huggingface.co/allenai/scibert_scivocab_uncased
  • https://huggingface.co/allenai/led-large-16384-arxiv
  • https://huggingface.co/sentence-transformers/paraphrase-MiniLM-L6-v2
  • https://huggingface.co/sentence-transformers/distilbert-base-nli-mean-tokens
  • https://tabula.technology/
  • https://spacy.io/
  • https://allenai.github.io/scispacy/

Please cite this repo if it helped you :)

