Semantic Desktop Search - search for answers not the file names
tsureshkumar/semdesk
SemDesk is a desktop tool and service for searching files semantically. Instead of matching files on their names or on keywords in their contents, it tries to find answers in the contents of the files.
For instance, if you have saved a file deposits.txt with content like "I have $5000.", you can ask "How much do I have in my bank?". The tool answers "$5000", whereas a non-semantic search returns a result only if there is a keyword match in the file name.
Currently, the tool works with text files and has some level of support for PDF files. PDF files are hard to parse for text because the flow of the text may not be linear.
It uses Facebook's faiss vector index for document retrieval and Google's BERT model for question answering.
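To make the retrieval step concrete, here is a minimal sketch of what a vector index does conceptually: score every stored document vector against the query vector and keep the best matches. This is a brute-force stand-in written for illustration, not faiss and not SemDesk's code; the function names `cosine` and `top_k` are hypothetical, and in a real setup the vectors would come from an embedding model rather than being written by hand.

```rust
/// Cosine similarity between two equal-length vectors.
/// Returns 0.0 if either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Score every document vector against `query` and return the `k` best
/// `(document index, similarity)` pairs, most similar first.
/// A real vector index avoids this full scan; this is only the idea.
fn top_k(query: &[f32], docs: &[Vec<f32>], k: usize) -> Vec<(usize, f32)> {
    let mut scored: Vec<(usize, f32)> = docs
        .iter()
        .enumerate()
        .map(|(i, d)| (i, cosine(query, d)))
        .collect();
    // Sort by similarity, highest first.
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(k);
    scored
}
```

Calling `top_k(&query, &docs, 3)` would give the three candidate documents that a question-answering model could then read to extract an answer.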
There could be other use cases. For example, when you throw a set of files from a directory into the index, you can ask the tool questions about that set of files.
Currently, the tool works on macOS only.
You need to have Rust installed and at least 5 GB of disk space for downloading the models, which are large.
$ cargo build
The tool has two binaries. semdesk is a background daemon that crawls the configured directories once a day. It also contains the backend for document retrieval and for answering queries. semdesk-cli contacts the daemon and executes the query.
You also need pdf2ps and ps2ascii to extract text from PDF files. You can install these on macOS with $ brew install ghostscript.
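As an illustration of how the two Ghostscript utilities can be chained from Rust, here is a hedged sketch using only the standard library. How SemDesk actually invokes them is not described here, so the function names (`pdf2ps_command`, `ps2ascii_command`, `extract_pdf_text`) and the use of a scratch PostScript file are assumptions for the example.

```rust
use std::io;
use std::path::Path;
use std::process::Command;

/// Build the `pdf2ps <input.pdf> <output.ps>` invocation.
fn pdf2ps_command(pdf: &Path, ps: &Path) -> Command {
    let mut cmd = Command::new("pdf2ps");
    cmd.arg(pdf).arg(ps);
    cmd
}

/// Build the `ps2ascii <input.ps>` invocation; the text is written to stdout.
fn ps2ascii_command(ps: &Path) -> Command {
    let mut cmd = Command::new("ps2ascii");
    cmd.arg(ps);
    cmd
}

/// Convert `pdf` to PostScript at `scratch_ps`, then extract its text.
/// Requires both Ghostscript utilities to be on PATH.
fn extract_pdf_text(pdf: &Path, scratch_ps: &Path) -> io::Result<String> {
    let status = pdf2ps_command(pdf, scratch_ps).status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "pdf2ps failed"));
    }
    let out = ps2ascii_command(scratch_ps).output()?;
    if !out.status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "ps2ascii failed"));
    }
    // The extracted text may not be valid UTF-8; replace invalid bytes.
    Ok(String::from_utf8_lossy(&out.stdout).into_owned())
}
```

Because the text flow in a PDF may not be linear, the returned string can come out in a different order than the page reads visually.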
    $ cargo run --bin semdesk  # run in a background terminal
    $ semdesk-cli query "How much do I have in my bank?"
    File: /Users/$USER/personal_docs/deposit_details.txt
    Match Probability: 95.56
    Rs.5000
The configuration for the daemon lives in ~/.config/semdesk/config.toml
    # ~/.config/semdesk/config.toml
    [crawler]
    files = ["~/Downloads/newsgroups/", "~/Downloads/personal_docs/", ]
    max_depth = 3
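To show what a `max_depth` setting can mean for a crawler, here is a small standard-library sketch of a depth-limited directory walk. It is not SemDesk's crawler: the function name `crawl` is hypothetical, and the interpretation of depth (the configured directory is depth 0, and subdirectories are entered only while the current depth is below `max_depth`) is an assumption. It also does not expand `~`, which a real implementation would need to do for the paths shown above.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Collect regular files under `root`, descending at most `max_depth`
/// directory levels below it. Symlinks are skipped, because
/// `DirEntry::file_type` does not follow them.
fn crawl(root: &Path, max_depth: usize) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    // Each stack entry is a directory to visit and its depth below `root`.
    let mut stack = vec![(root.to_path_buf(), 0usize)];
    while let Some((dir, depth)) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            let path = entry.path();
            let kind = entry.file_type()?;
            if kind.is_file() {
                found.push(path);
            } else if kind.is_dir() && depth < max_depth {
                stack.push((path, depth + 1));
            }
        }
    }
    // Sort so the result does not depend on directory iteration order.
    found.sort();
    Ok(found)
}
```

Under this interpretation, `max_depth = 0` would index only the files directly inside each configured directory, and `max_depth = 3` would also reach files up to three subdirectory levels down.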
The daemon writes the status of scanned files to ~/.local/share/semdesk*.
This project uses faiss for storing the vector index. It also uses rust_bert, a Rust port of Hugging Face's transformers library, with a BERT model for the question-answering pipeline.
I am able to run the indexer and model comfortably on my 4-year-old MacBook Pro (16 GB RAM) with a scan of about 1000+ small text files. Thanks to Rust, it takes only about 600 MB of RAM. During initial loading and crawling, usage goes up to about 1.5 GB of RAM, but inference takes only 500+ MB.
This is still in the early stages of development, so there could be some unknown issues.
This is free and open source software. You can use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of it, under the terms of the Apache 2.0 License. See LICENSE.md for details.
This software is provided "AS IS", WITHOUT WARRANTY OF ANY KIND, express or implied. See LICENSE.md for details.
The author of this project hangs out at the following places online:
- Twitter: @tsureshkumar
- Mastodon: @tsureshkumar@mastodon.social
- GitHub: @tsureshkumar
You are welcome to subscribe to, follow, or join one or more of the above channels to receive updates from the author or ask questions about this project.