tsureshkumar/semdesk


Semantic Desktop Search - search for answers not the file names


SemDesk

SemDesk is a desktop tool and service for searching files semantically. Instead of matching on file names or keywords in the file, it tries to find answers in the contents of the files.

For instance, if you have saved a file deposits.txt with content like "I have $5000.", you can ask "How much do I have in my bank?". The tool will answer "$5000", whereas a non-semantic search returns a result only if a keyword matches the file name.

Currently, this tool works with text files and has limited support for PDF files. PDF files are hard to parse for text because the text flow may not be linear.

It uses Facebook's faiss vector index for document retrieval and Google's BERT model for question answering.

There could be other use cases: for example, when you throw a set of files from a directory into the index, you can ask the tool questions about that set of files.

Building

Currently, the tool works on macOS only.

You need to have Rust installed and at least 5 GB of disk space for downloading the models, which are large.

$ cargo build

Running

The tool has two binaries. semdesk is a background daemon that crawls the configured directories once a day. It also hosts the backend for document retrieval and answering queries. semdesk-cli contacts the daemon and executes the query.

You also need pdf2ps and ps2ascii to extract text from PDF files. You can install these on macOS using:

$ brew install ghostscript

$ cargo run --bin semdesk # run in a background terminal
$ semdesk-cli query "How much do I have in my bank?"
File: /Users/$USER/personal_docs/deposit_details.txt
Match Probability: 95.56
Rs.5000
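The PDF text extraction mentioned above can be sketched as two Ghostscript invocations chained through an intermediate PostScript file. This is only an illustration, not SemDesk's actual code: the function names `pdf_to_text_commands` and `extract_pdf_text`, the scratch file names, and the assumption that the extracted text is valid UTF-8 are all hypothetical. It relies only on the documented `pdf2ps input.pdf output.ps` and `ps2ascii input.ps output.txt` argument forms.

```rust
use std::path::Path;
use std::process::Command;

/// Build the two Ghostscript commands (PDF -> PS, PS -> text) without running them.
/// Hypothetical helper; the real daemon may invoke the tools differently.
fn pdf_to_text_commands(pdf: &Path, ps: &Path, txt: &Path) -> (Command, Command) {
    let mut to_ps = Command::new("pdf2ps");
    to_ps.arg(pdf).arg(ps);
    let mut to_txt = Command::new("ps2ascii");
    to_txt.arg(ps).arg(txt);
    (to_ps, to_txt)
}

/// Run both steps in order and return the extracted text.
/// `work_dir` is an assumed scratch directory for the intermediate files.
fn extract_pdf_text(pdf: &Path, work_dir: &Path) -> std::io::Result<String> {
    let ps = work_dir.join("out.ps");
    let txt = work_dir.join("out.txt");
    let (mut to_ps, mut to_txt) = pdf_to_text_commands(pdf, &ps, &txt);
    for cmd in [&mut to_ps, &mut to_txt] {
        let status = cmd.status()?;
        if !status.success() {
            // Surface a non-zero exit code as an I/O error.
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                format!("{:?} failed with {}", cmd.get_program(), status),
            ));
        }
    }
    // Assumes the output is valid UTF-8; otherwise this returns an error.
    std::fs::read_to_string(&txt)
}
```

Calling `extract_pdf_text(Path::new("statement.pdf"), &std::env::temp_dir())` would return the text of a hypothetical `statement.pdf`, provided Ghostscript is installed and on the `PATH`.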

Configuration

The configuration for the daemon lives in ~/.config/semdesk/config.toml

# ~/.config/semdesk/config.toml
[crawler]
files = [
    "~/Downloads/newsgroups/",
    "~/Downloads/personal_docs/",
]
max_depth = 3
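To make the `files` and `max_depth` settings concrete, here is a minimal sketch of a depth-limited directory walk using only the Rust standard library. It is not taken from SemDesk's source. The helpers `expand_tilde` and `crawl` are hypothetical, and the depth convention (files directly inside a configured directory count as depth 1) is an assumption, since the README does not define how `max_depth` is counted.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Replace a leading "~/" with the given home directory (hypothetical helper).
fn expand_tilde(path: &str, home: &Path) -> PathBuf {
    match path.strip_prefix("~/") {
        Some(rest) => home.join(rest),
        None => PathBuf::from(path),
    }
}

/// Collect regular files under `root`, descending at most `max_depth` levels.
/// Assumed convention: entries directly inside `root` are at depth 1.
fn crawl(root: &Path, max_depth: usize) -> io::Result<Vec<PathBuf>> {
    let mut found = Vec::new();
    // Explicit stack of (directory, depth of the entries it contains).
    let mut stack = vec![(root.to_path_buf(), 1usize)];
    while let Some((dir, depth)) = stack.pop() {
        if depth > max_depth {
            continue; // too deep: do not read this directory
        }
        for entry in fs::read_dir(&dir)? {
            let entry = entry?;
            // file_type() does not follow symlinks, so symlinks are skipped.
            let file_type = entry.file_type()?;
            if file_type.is_dir() {
                stack.push((entry.path(), depth + 1));
            } else if file_type.is_file() {
                found.push(entry.path());
            }
        }
    }
    found.sort(); // deterministic order for callers
    Ok(found)
}
```

With the configuration above, a caller would expand each entry of `files` against the user's home directory and pass the result to `crawl` with `max_depth` set to 3.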

Other files

The tool writes the status of scanned files to ~/.local/share/semdesk*.

Details

This project uses faiss for storing the vector index. For the question-answering pipeline, it uses rust_bert, a Rust port of Hugging Face's transformers library, together with a BERT model.
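As a rough illustration of what a vector index does during document retrieval, the sketch below ranks documents by cosine similarity using a brute-force scan. It is a stand-in written with the standard library only, not faiss and not SemDesk's code: `ToyIndex` is a hypothetical type, and the embeddings are hand-written toy vectors rather than model outputs. A real index such as faiss performs the same kind of nearest-neighbour lookup far more efficiently.

```rust
/// Cosine similarity of two equal-length vectors; 0.0 if either has zero norm.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have the same dimension");
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        0.0
    } else {
        dot / (norm_a * norm_b)
    }
}

/// Hypothetical in-memory index of (document path, embedding) pairs.
struct ToyIndex {
    entries: Vec<(String, Vec<f32>)>,
}

impl ToyIndex {
    fn new() -> Self {
        ToyIndex { entries: Vec::new() }
    }

    fn add(&mut self, path: &str, embedding: Vec<f32>) {
        self.entries.push((path.to_string(), embedding));
    }

    /// Return up to `k` (path, score) pairs, highest similarity first.
    fn search(&self, query: &[f32], k: usize) -> Vec<(&str, f32)> {
        let mut scored: Vec<(&str, f32)> = self
            .entries
            .iter()
            .map(|(path, emb)| (path.as_str(), cosine_similarity(query, emb)))
            .collect();
        scored.sort_by(|a, b| b.1.total_cmp(&a.1)); // descending by score
        scored.truncate(k);
        scored
    }
}
```

In a pipeline like the one described here, the top-ranked documents would then be passed as context to the question-answering model, which extracts the answer span.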

I am able to run the indexer and model comfortably on my 4-year-old MacBook Pro (16 GB RAM) with a scan of 1000+ small text files. Thanks to Rust, it takes only about 600 MB of RAM. During initial loading and crawling, usage goes up to about 1.5 GB of RAM, but inference takes only 500+ MB.

This is still in the early stages of development, so there could be some unknown issues.

License

This is free and open source software. You can use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of it, under the terms of the Apache 2.0 License. See LICENSE.md for details.

This software is provided "AS IS", WITHOUT WARRANTY OF ANY KIND, express or implied. See LICENSE.md for details.

Channels

The author of this project hangs out at the following places online:

You are welcome to subscribe to, follow, or join one or more of the above channels to receive updates from the author or ask questions about this project.
