Document transformers

📄️ AI21SemanticTextSplitter

This example goes over how to use AI21SemanticTextSplitter in LangChain.

📄️ Beautiful Soup

Beautiful Soup is a Python package for parsing HTML and XML documents.

📄️ Cross Encoder Reranker

This notebook shows how to implement a reranker in a retriever with your own cross encoder, using either Hugging Face cross encoder models or Hugging Face models that implement a cross encoder function (example: BAAI/bge-reranker-base). SagemakerEndpointCrossEncoder enables you to use these Hugging Face models loaded on SageMaker.

📄️ DashScope Reranker

This notebook shows how to use DashScope Reranker for document compression and retrieval. DashScope is the generative AI service from Alibaba Cloud (Aliyun).

📄️ Doctran: extract properties

We can extract useful features of documents using the Doctran library, which uses OpenAI's function calling feature to extract specific metadata.

📄️ Doctran: interrogate documents

Documents used in a vector store knowledge base are typically stored in a narrative or conversational format. However, most user queries are in question format. If we convert documents into Q&A format before vectorizing them, we can increase the likelihood of retrieving relevant documents, and decrease the likelihood of retrieving irrelevant documents.

📄️ Doctran: language translation

Comparing documents through embeddings has the benefit of working across multiple languages. "Harrison says hello" and "Harrison dice hola" will occupy similar positions in the vector space because they have the same meaning semantically.

📄️ Google Cloud Vertex AI Reranker

The Vertex Search Ranking API is one of the standalone APIs in Vertex AI Agent Builder. It takes a list of documents and reranks those documents based on how relevant the documents are to a query. Compared to embeddings, which look only at the semantic similarity of a document and a query, the ranking API can give you precise scores for how well a document answers a given query. The ranking API can be used to improve the quality of search results after retrieving an initial set of candidate documents.
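The retrieve-then-rerank pattern described above can be sketched in a few lines. This is a minimal illustration only: it does not call the Vertex Search Ranking API, and the `Document` class, `overlap_score`, and `rerank` are hypothetical names introduced here. The term-overlap score is a toy stand-in for a real ranking model.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical minimal document type for this sketch."""
    page_content: str


def overlap_score(query: str, doc: Document) -> float:
    """Toy relevance score: fraction of query terms found in the document.

    Assumption: a real reranker would call a ranking model here instead.
    """
    query_terms = set(query.lower().split())
    doc_terms = set(doc.page_content.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms)


def rerank(query: str, candidates: list[Document], top_n: int = 2) -> list[tuple[Document, float]]:
    """Score every candidate against the query and keep the top_n best."""
    scored = [(doc, overlap_score(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Candidates as they might come back from an initial retrieval step.
candidates = [
    Document("The capital of France is Paris"),
    Document("France is known for its cheese"),
    Document("Berlin is the capital of Germany"),
]
ranked = rerank("capital of France", candidates, top_n=2)
for doc, score in ranked:
    print(f"{score:.2f}  {doc.page_content}")
```

The second stage only reorders and truncates what the first stage returned, so it can improve precision but cannot recover documents the initial retrieval missed.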

📄️ Google Cloud Document AI

Document AI is a document understanding platform from Google Cloud to transform unstructured data from documents into structured data, making it easier to understand, analyze, and consume.

📄️ Google Translate

Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another.

📄️ HTML to text

html2text is a Python package that converts a page of HTML into clean, easy-to-read plain ASCII text.
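To show the kind of conversion involved, here is a rough sketch that uses only the standard library's `html.parser`. It is not the html2text package and does not reproduce its output format; `TextExtractor` and `html_to_text` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}
    BLOCK = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self) -> None:
        super().__init__()
        self._parts: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self._parts.append("\n")  # block elements start on a new line

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self._parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self) -> str:
        # Collapse runs of whitespace within each line and drop empty lines.
        lines = (" ".join(line.split()) for line in "".join(self._parts).splitlines())
        return "\n".join(line for line in lines if line)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>!</p>"
    "<script>var x = 1;</script></body></html>"
)
plain = html_to_text(sample)
print(plain)
```

A dedicated package handles many cases this sketch ignores, such as links, lists, tables, and character entities in unusual positions.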

📄️ Infinity Reranker

Infinity is a high-throughput, low-latency REST API for serving text embeddings, reranking models, and CLIP.

📄️ Jina Reranker

This notebook shows how to use Jina Reranker for document compression and retrieval.

📄️ Markdownify

markdownify is a Python package that converts HTML documents to Markdown, with customizable options for handling tags (links, images, ...), heading styles, and more.

📄️ Nuclia

Nuclia automatically indexes your unstructured data from any internal and external source, providing optimized search results and generative answers. It can handle video and audio transcription, image content extraction, and document parsing.

📄️ OpenAI metadata tagger

It can often be useful to tag ingested documents with structured metadata, such as the title, tone, or length of a document, to allow for a more targeted similarity search later. However, for large numbers of documents, performing this labelling process manually can be tedious.
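As a rough sketch of why such tags help, the example below attaches a length label to each document and then narrows the candidate set by metadata before any similarity search would run. It uses a simple word-count rule in place of an LLM-based tagger, and `Document`, `tag_length`, and `filter_by_metadata` are hypothetical names for this illustration, not the OpenAI metadata tagger API.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical minimal document type with free-form metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def tag_length(doc: Document, short_limit: int = 8) -> Document:
    """Return a copy of doc tagged with a word count and a length label.

    Assumption: a rule-based stand-in for a model that extracts metadata.
    """
    word_count = len(doc.page_content.split())
    label = "short" if word_count <= short_limit else "long"
    return Document(
        doc.page_content,
        {**doc.metadata, "length": label, "word_count": word_count},
    )


def filter_by_metadata(docs: list[Document], **required) -> list[Document]:
    """Keep only documents whose metadata matches every required key/value."""
    return [
        d for d in docs
        if all(d.metadata.get(key) == value for key, value in required.items())
    ]


docs = [
    Document("Quarterly revenue rose."),
    Document(
        "This review covers the plot, the cast, the soundtrack, "
        "and the pacing of the film in detail."
    ),
]
tagged = [tag_length(d) for d in docs]
short_docs = filter_by_metadata(tagged, length="short")
print([d.page_content for d in short_docs])
```

Filtering on metadata first shrinks the set of documents that the similarity search has to compare against the query.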

📄️ OpenVINO Reranker

OpenVINO™ is an open-source toolkit for optimizing and deploying AI inference. The OpenVINO™ Runtime supports various hardware devices including x86 and ARM CPUs, and Intel GPUs. It can help to boost deep learning performance in Computer Vision, Automatic Speech Recognition, Natural Language Processing and other common tasks.

📄️ RankLLM Reranker

RankLLM is a flexible reranking framework supporting listwise, pairwise, and pointwise ranking models. It includes RankVicuna, RankZephyr, MonoT5, DuoT5, LiT5, and FirstMistral, with integration for FastChat, vLLM, SGLang, and TensorRT-LLM for efficient inference. RankLLM is optimized for retrieval and ranking tasks, leveraging both open-source LLMs and proprietary rerankers like RankGPT and RankGemini. It supports batched inference, first-token reranking, and retrieval via BM25 and SPLADE.

📄️ Volcengine Reranker

This notebook shows how to use Volcengine Reranker for document compression and retrieval. Volcengine is a cloud service platform developed by ByteDance, the parent company of TikTok.

📄️ VoyageAI Reranker

Voyage AI provides cutting-edge embedding and vectorization models.
