run-llama/semtools

Semantic search and document parsing tools for the command line

A collection of high-performance CLI tools for document processing and semantic search, built with Rust for speed and reliability.

  • parse - Parse documents (PDF, DOCX, etc.) into markdown, using the LlamaParse API by default
  • search - Local semantic keyword search using multilingual embeddings, with cosine-similarity matching and per-line context
  • workspace - Workspace management for accelerating search over large collections

NOTE: By default, parse uses LlamaParse as a backend. Get your API key today for free at https://cloud.llamaindex.ai. The search tool remains local-only.

Key Features

  • Fast semantic search using model2vec embeddings from minishlab/potion-multilingual-128M
  • Reliable document parsing with caching and error handling
  • Unix-friendly design with proper stdin/stdout handling (see the example after this list)
  • Configurable distance thresholds and returned chunk sizes
  • Multi-format support for parsing documents (PDF, DOCX, PPTX, etc.)
  • Concurrent processing for better parsing performance
  • Workspace management for efficient document retrieval over large collections
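
For instance, because search reads piped text from stdin and writes matches to stdout, it drops into an ordinary shell pipeline. A small illustration (the file name and query are placeholders; the flags are documented in the CLI help below):

# Pipe any text into search, tuning the match threshold and context window
cat notes.txt | search "deadline" --max-distance 0.3 --n-lines 5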

Installation

Prerequisites:

  • For the parse tool: LlamaIndex Cloud API key

Install:

You can install semtools via npm:

npm i -g @llamaindex/semtools

Or via cargo:

# install entire crate
cargo install semtools

# install only parse
cargo install semtools --no-default-features --features=parse

# install only search
cargo install semtools --no-default-features --features=search

Note: Installing from npm builds the Rust binaries locally during install if a prebuilt binary is not available, which requires Rust and Cargo to be available in your environment. Install them via rustup if needed: https://www.rust-lang.org/tools/install.
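
To confirm the binaries are installed and on your PATH, each tool exposes a --version flag (see the CLI help further below):

parse --version
search --version
workspace --version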

Quick Start

Basic Usage:

# Parse some files
parse my_dir/*.pdf

# Search some (text-based) files
search "some keywords" *.txt --max-distance 0.3 --n-lines 5

# Combine parsing and search
parse my_docs/*.pdf | xargs search "API endpoints"

Advanced Usage:

# Combine with grep for exact-match pre-filtering and distance thresholding
parse *.pdf | xargs cat | grep -i "error" | search "network error" --max-distance 0.3

# Pipeline with content search (note the 'cat')
find . -name "*.md" | xargs parse | xargs search "installation"

# Combine with grep for filtering (grep could be before or after parse/search!)
parse docs/*.pdf | xargs search "API" | grep -A5 "authentication"

# Save search results
parse report.pdf | xargs cat | search "summary" > results.txt

Using Workspaces:

# Create or select a workspace
# Workspaces are stored in ~/.semtools/workspaces/
workspace use my-workspace
> Workspace 'my-workspace' configured.
> To activate it, run:
> export SEMTOOLS_WORKSPACE=my-workspace
>
> Or add this to your shell profile (.bashrc, .zshrc, etc.)

# Activate the workspace
export SEMTOOLS_WORKSPACE=my-workspace

# All search commands will now use the workspace for caching embeddings
# The initial command is used to initialize the workspace
search "some keywords" ./some_large_dir/*.txt --n-lines 5 --top-k 10

# If documents change, they are automatically re-embedded and cached
echo "some new content" > ./some_large_dir/some_file.txt
search "some keywords" ./some_large_dir/*.txt --n-lines 5 --top-k 10

# If documents are removed, you can run prune to clean up stale files
workspace prune

# You can see the stats of a workspace at any time
workspace status
> Active workspace: arxiv
> Root: /Users/loganmarkewich/.semtools/workspaces/arxiv
> Documents: 3000
> Index: Yes (IVF_PQ)
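
Since activation is just an environment variable, unsetting it should return search to its default, workspace-free behavior:

# Deactivate the workspace for the current shell
unset SEMTOOLS_WORKSPACE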

CLI Help

$ parse --help
A CLI tool for parsing documents using various backends

Usage: parse [OPTIONS] <FILES>...

Arguments:
  <FILES>...  Files to parse

Options:
  -c, --parse-config <PARSE_CONFIG>  Path to the config file. Defaults to ~/.parse_config.json
  -b, --backend <BACKEND>            The backend type to use for parsing. Defaults to `llama-parse` [default: llama-parse]
  -v, --verbose                      Verbose output while parsing
  -h, --help                         Print help
  -V, --version                      Print version
$ search --help
A CLI tool for fast semantic keyword search

Usage: search [OPTIONS] <QUERY> [FILES]...

Arguments:
  <QUERY>     Query to search for (positional argument)
  [FILES]...  Files or directories to search

Options:
  -n, --n-lines <N_LINES>            How many lines before/after to return as context [default: 3]
      --top-k <TOP_K>                The top-k files or texts to return (ignored if max_distance is set) [default: 3]
  -m, --max-distance <MAX_DISTANCE>  Return all results with distance below this threshold (0.0+)
  -i, --ignore-case                  Perform case-insensitive search (default is false)
  -h, --help                         Print help
  -V, --version                      Print version
$ workspace --help
Manage semtools workspaces

Usage: workspace <COMMAND>

Commands:
  use     Use or create a workspace (prints export command to run)
  status  Show active workspace and basic stats
  prune   Remove stale or missing files from store
  help    Print this message or the help of the given subcommand(s)

Options:
  -h, --help     Print help
  -V, --version  Print version

Configuration

Parse Tool Configuration

By default, the parse tool uses the LlamaParse API to parse documents.

It looks for a ~/.parse_config.json file to configure the API key and other parameters.

Otherwise, it falls back to the LLAMA_CLOUD_API_KEY environment variable and a set of default parameters.

To configure the parse tool, create a ~/.parse_config.json file with the following content (defaults are shown below):

{"api_key":"your_llama_cloud_api_key_here","num_ongoing_requests":10,"base_url":"https://api.cloud.llamaindex.ai","check_interval":5,"max_timeout":3600,"max_retries":10,"retry_delay_ms":1000,"backoff_multiplier":2.0,"parse_kwargs": {"parse_mode":"parse_page_with_agent","model":"openai-gpt-4-1-mini","high_res_ocr":"true","adaptive_long_table":"true","outlined_table_extraction":"true","output_tables_as_HTML":"true"  }}

Or just set the API key via an environment variable:

export LLAMA_CLOUD_API_KEY="your_api_key_here"
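
The config file does not have to live at the default location: parse also accepts a -c/--parse-config path (see the CLI help above). The project-local path below is only an illustration:

# Point parse at a project-specific config instead of ~/.parse_config.json
parse --parse-config ./my_project/parse_config.json my_docs/*.pdf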

Agent Use Case Examples
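
Because both tools are ordinary CLI programs that communicate over stdin/stdout, an agent (or a script it drives) can call them as shell tools. A minimal sketch; the directory, query, and flag values are placeholders:

# Hypothetical agent tool call: parse a folder of reports, then search the
# parsed markdown for the information the agent needs
parse ./reports/*.pdf | xargs search "quarterly revenue" --max-distance 0.3 --n-lines 5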

Future Work

  • More parsing backends (something local-only would be great!)
  • Improved search algorithms
  • (optional) Persistence for speedups on repeat searches on the same files

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

