Writer PDF Parser
This notebook provides a quick overview for getting started with the Writer PDFParser document loader.
Writer's PDF Parser converts PDF documents into other formats like text or Markdown. This is particularly useful when you need to extract and process text content from PDF files for further analysis or integration into your workflow. In langchain-writer, Writer's PDF Parser is exposed as a LangChain document parser.
Overview
Integration details
Class | Package | Local | Serializable | JS support |
---|---|---|---|---|
PDFParser | langchain-writer | ❌ | ❌ | ❌ |
Setup
The PDFParser is available in the langchain-writer package:
%pip install --quiet -U langchain-writer
Credentials
Sign up for Writer AI Studio to generate an API key (you can follow this Quickstart). Then, set the WRITER_API_KEY environment variable:
import getpass
import os
if not os.getenv("WRITER_API_KEY"):
    os.environ["WRITER_API_KEY"] = getpass.getpass("Enter your Writer API key: ")
It's also helpful (but not needed) to set up LangSmith for best-in-class observability. If you wish to do so, you can set the LANGSMITH_TRACING and LANGSMITH_API_KEY environment variables:
# os.environ["LANGSMITH_TRACING"] = "true"
# os.environ["LANGSMITH_API_KEY"] = getpass.getpass()
Instantiation
Next, create an instance of the Writer PDF Parser with the desired output format:
from langchain_writer.pdf_parser import PDFParser
parser = PDFParser(format="markdown")
Usage
There are two ways to use the PDF Parser: synchronously or asynchronously. In either case, the PDF Parser returns a list of Document objects, each containing the parsed content of a page from the PDF file.
Synchronous usage
To invoke the PDF Parser synchronously, pass a Blob object referencing the PDF file you want to parse to the parse method:
from langchain_core.documents.base import Blob
file = Blob.from_path("../example_data/layout-parser-paper.pdf")
parsed_pages = parser.parse(blob=file)
parsed_pages
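Since the parser returns a list of Document objects, a common next step is to merge the pages into a single string. The following is a minimal sketch, not part of langchain-writer: join_pages is a hypothetical helper, and the SimpleNamespace objects are stand-ins that only mimic the page_content and metadata attributes of a LangChain Document, so the example runs without an API key.

```python
from types import SimpleNamespace


def join_pages(pages, separator="\n\n"):
    """Concatenate the page_content of each page, skipping blank pages.

    Hypothetical helper: works with any objects exposing a `page_content`
    string attribute, as LangChain Document objects do.
    """
    cleaned = (page.page_content.strip() for page in pages)
    return separator.join(text for text in cleaned if text)


# Stand-in data (assumption): in real use, pass the list returned by
# parser.parse(blob=file) instead.
sample_pages = [
    SimpleNamespace(page_content="# Title\n", metadata={}),
    SimpleNamespace(page_content="   ", metadata={}),
    SimpleNamespace(page_content="Body text.", metadata={}),
]

combined = join_pages(sample_pages)
print(combined)
```

With real output, join_pages(parsed_pages) would produce one Markdown string that can be written to a file or passed to a text splitter.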
Asynchronous usage
To invoke the PDF Parser asynchronously, pass a Blob object referencing the PDF file you want to parse to the aparse method:
parsed_pages_async = await parser.aparse(blob=file)
parsed_pages_async
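Top-level await works in a notebook, but a regular Python script needs an event loop. The sketch below shows one way to drive aparse from a script and to parse several files concurrently with asyncio.gather. StubParser and parse_many are hypothetical names introduced for illustration: StubParser only imitates the aparse(blob=...) call shape so the example runs without an API key, and in real use you would pass a PDFParser instance and Blob objects instead.

```python
import asyncio
from types import SimpleNamespace


class StubParser:
    """Hypothetical stand-in for PDFParser (assumption, not the real class)."""

    async def aparse(self, blob):
        # Yield control once to simulate waiting on a network request.
        await asyncio.sleep(0)
        return [SimpleNamespace(page_content=f"parsed:{blob}", metadata={})]


async def parse_many(parser, blobs):
    """Run one aparse call per blob concurrently.

    asyncio.gather returns the results in the same order as the inputs.
    """
    return await asyncio.gather(*(parser.aparse(blob=b) for b in blobs))


# asyncio.run creates the event loop that a notebook would otherwise provide.
results = asyncio.run(parse_many(StubParser(), ["a.pdf", "b.pdf"]))
for pages in results:
    print(len(pages), pages[0].page_content)
```

Each entry in results is the list of pages for the corresponding input, so the per-file structure is preserved.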
API reference
For detailed documentation of all PDFParser features and configurations, head to the API reference.
Additional resources
You can find information about Writer's models (including costs, context windows, and supported input types) and tools in the Writer docs.