Submission for HackDataKIBots 2018 - Web crawler combined with document analysis
manuel-lang/Autonomous-Semantic-Search-Engine
A search engine that autonomously crawls documents from a given domain, including its subdomains, analyzes them, and renders them into a search frontend. This implementation demonstrates the functionality with Stanford University's website. This project was implemented during the Next Iteration Hackathon 2018.
Python implementation with Scrapy here.
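As a rough illustration of what "a given domain including their subdomains" means for a crawler, the sketch below shows one possible scope check. It is not taken from this repository's crawler: the function names `in_scope` and `filter_links` are hypothetical, and only the standard library is used.

```python
from urllib.parse import urlparse


def in_scope(url: str, root_domain: str) -> bool:
    """Return True if the URL's host is root_domain or one of its subdomains.

    Hypothetical helper for illustration; not part of this repository.
    """
    host = (urlparse(url).hostname or "").lower()
    root = root_domain.lower().lstrip(".")
    # Require an exact match or a "." boundary, so that
    # "notstanford.edu" does not count as part of "stanford.edu".
    return host == root or host.endswith("." + root)


def filter_links(links, root_domain):
    """Keep only the links that stay inside the crawl scope."""
    return [link for link in links if in_scope(link, root_domain)]


if __name__ == "__main__":
    candidates = [
        "https://www.stanford.edu/about/",
        "https://cs.stanford.edu/people",
        "https://notstanford.edu/",
        "https://stanford.edu.example.com/",
    ]
    for link in filter_links(candidates, "stanford.edu"):
        print(link)
```

In a Scrapy spider the same effect is usually obtained by setting the spider's `allowed_domains` attribute, which also admits subdomains of the listed domains.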
Python implementation using Watson NLU (for Named Entities, Keywords), gensim (for Summarization and Semantic Representation) and a custom Document Type classifier (Random Forest, with sklearn). Title, a thumbnail and embedded images are also extracted from documents. See notebooks for specific implementations.
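The document type classifier could look roughly like the following sketch. It assumes TF-IDF text features feeding a Random Forest. The feature choice, the label names, the toy training texts, and the helper `build_classifier` are all assumptions made for illustration; the notebooks contain the actual implementation.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy training data, invented for illustration only.
TRAIN_TEXTS = [
    "course syllabus lecture schedule grading policy",
    "homework assignments lecture notes exam dates",
    "abstract introduction related work experiments conclusion",
    "we propose a method and evaluate it on benchmark datasets",
    "press release university announces new research center",
    "news event announcement campus community",
]
TRAIN_LABELS = ["syllabus", "syllabus", "paper", "paper", "news", "news"]


def build_classifier(texts, labels):
    """Fit a TF-IDF + Random Forest pipeline (assumed setup, not the repo's)."""
    model = make_pipeline(
        TfidfVectorizer(),
        RandomForestClassifier(n_estimators=50, random_state=0),
    )
    model.fit(texts, labels)
    return model


model = build_classifier(TRAIN_TEXTS, TRAIN_LABELS)
prediction = model.predict(["lecture schedule and grading policy"])[0]
print(prediction)
```

With so little training data the predicted label is not meaningful; the point is only the shape of the pipeline, where raw document text goes in and a document type label comes out.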
A React frontend that displays the information with additional image information using Bing Image Search here.
- Swig etc. for Textract: https://textract.readthedocs.io/en/stable/installation.html
- Ghostscript: https://wiki.scribus.net/canvas/Installation_and_Configuration_of_Ghostscript
- ImageMagick 6: ImageMagick/ImageMagick#953