unjs/unpdf

📄 PDF extraction and rendering across all JavaScript runtimes

License

MIT license


unpdf

A collection of utilities for PDF extraction and rendering. It is designed specifically for serverless environments, but also works in Node.js, Deno, Bun and the browser. `unpdf` is particularly useful for serverless AI applications, especially for summarizing PDF documents in document analysis workflows.

This library ships with a serverless build/redistribution of Mozilla's PDF.js that is optimized for edge environments. Some string replacements, global mocks and inlining the PDF.js worker allow the browser code to become platform agnostic. See `pdfjs.rollup.config.ts` for the details.

This library is also intended as a modern alternative to the unmaintained but still popular pdf-parse.

Features

🏗️ Made for Node.js, browser and serverless environments
🪭 Includes serverless build of PDF.js (unpdf/pdfjs)
💬 Extract text, links, and images from PDF files
🧠 Perfect for AI applications and PDF summarization
🧱 Opt-in to legacy PDF.js build
💨 Zero dependencies

PDF.js Compatibility

Tip

The serverless PDF.js bundle provided by `unpdf` is built from PDF.js v5.4.394.

You can use an official PDF.js build by using the `definePDFJSModule` method. This is useful if you want to use a specific version or a custom build of PDF.js.

Installation

Run the following command to add `unpdf` to your project.

    # pnpm
    pnpm add -D unpdf

    # npm
    npm install -D unpdf

    # yarn
    yarn add -D unpdf

Usage

Extract Text From PDF

    import { extractText, getDocumentProxy } from 'unpdf'

    // Either fetch a PDF file from the web...
    const buffer = await fetch('https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf')
      .then(res => res.arrayBuffer())

    // ...or load it from the file system instead
    // (requires: import { readFile } from 'node:fs/promises')
    // const buffer = await readFile('./dummy.pdf')

    // Then, load the PDF file into a PDF.js document
    const pdf = await getDocumentProxy(new Uint8Array(buffer))

    // Finally, extract the text from the PDF file
    const { totalPages, text } = await extractText(pdf, { mergePages: true })

    console.log(`Total pages: ${totalPages}`)
    console.log(text)

Official or Legacy PDF.js Build

Usually you don't need to worry about the PDF.js build. `unpdf` ships with a serverless build of the latest PDF.js version. However, if you want to use the official PDF.js version or the legacy build, you can define a custom PDF.js module.

Warning

PDF.js v5.x uses `Promise.withResolvers`, which may not be supported in all environments, such as Node < 22. Consider using the bundled serverless build, which includes a polyfill, or an older version of PDF.js.
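If you have to stay on the official PDF.js v5 build in a runtime that lacks `Promise.withResolvers`, one option is to install a small polyfill before PDF.js is loaded. The following is a minimal sketch and not the polyfill shipped in the bundled serverless build; the names `withResolvers` and `installWithResolversPolyfill` are hypothetical.

```typescript
interface PromiseResolvers<T> {
  promise: Promise<T>
  resolve: (value: T | PromiseLike<T>) => void
  reject: (reason?: unknown) => void
}

// Standalone equivalent of Promise.withResolvers(): exposes the resolve and
// reject callbacks of a freshly created promise.
function withResolvers<T>(): PromiseResolvers<T> {
  let resolve!: PromiseResolvers<T>['resolve']
  let reject!: PromiseResolvers<T>['reject']
  const promise = new Promise<T>((res, rej) => {
    resolve = res
    reject = rej
  })
  return { promise, resolve, reject }
}

// Installs the fallback only when the runtime has no native implementation.
// Returns true if the polyfill was installed, false if nothing was changed.
function installWithResolversPolyfill(): boolean {
  const target = Promise as unknown as { withResolvers?: unknown }
  if (typeof target.withResolvers === 'function')
    return false
  target.withResolvers = withResolvers
  return true
}
```

Call `installWithResolversPolyfill()` once at startup, before the first dynamic `import('pdfjs-dist')`.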

For example, if you want to use the official PDF.js build, you can do the following:

    import { definePDFJSModule, extractText, getDocumentProxy } from 'unpdf'

    // Define the PDF.js build before using any other unpdf method
    await definePDFJSModule(() => import('pdfjs-dist'))

    // Now, you can use all unpdf methods with the official PDF.js build
    const pdf = await getDocumentProxy(/* … */)
    const { text } = await extractText(pdf)

PDF.js API

`unpdf` provides helpful methods to work with PDF files, such as `extractText` and `extractImages`, which should cover most use cases. However, if you need more control over the PDF.js API, you can use the `getResolvedPDFJS` method to get the resolved PDF.js module.

Access the PDF.js API directly by calling `getResolvedPDFJS`:

    import { getResolvedPDFJS } from 'unpdf'

    const { version } = await getResolvedPDFJS()

Note

If no other PDF.js build was defined, the serverless build will always be used.

For example, you can use the `getDocument` method to load a PDF file and then use the `getMetadata` method to get the metadata of the PDF file:

    import { readFile } from 'node:fs/promises'
    import { getResolvedPDFJS } from 'unpdf'

    const { getDocument } = await getResolvedPDFJS()
    const data = await readFile('./dummy.pdf')
    const document = await getDocument(new Uint8Array(data)).promise

    console.log(await document.getMetadata())

API

`definePDFJSModule`

Defines a custom PDF.js build. This method should be called before using any other method. If no custom build is defined, the serverless build will be used.

Type Declaration

    function definePDFJSModule(pdfjs: () => Promise<PDFJS>): Promise<void>

`getResolvedPDFJS`

Returns the resolved PDF.js module. If no other PDF.js build was defined, the serverless build will be used. This method is useful if you want to use the PDF.js API directly.

Type Declaration

    function getResolvedPDFJS(): Promise<PDFJS>
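The relationship between `definePDFJSModule` and `getResolvedPDFJS` can be pictured as a "define once, otherwise fall back to a default" resolver. The sketch below is a generic illustration of that pattern under stated assumptions; it is not unpdf's actual implementation, and `createModuleResolver`, `define` and `resolve` are hypothetical names.

```typescript
type Loader<T> = () => Promise<T>

// Holds at most one resolved module. `define` sets it explicitly,
// `resolve` lazily falls back to the default loader when nothing was defined.
function createModuleResolver<T>(defaultLoader: Loader<T>) {
  let resolved: T | undefined

  return {
    async define(loader: Loader<T>): Promise<void> {
      resolved = await loader()
    },
    async resolve(): Promise<T> {
      if (resolved === undefined)
        resolved = await defaultLoader()
      return resolved
    },
  }
}
```

In a design like this, code that resolved the module before `define` ran would already hold the default build, which is one plausible reason to define a custom build before calling any other method.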

`getMeta`

Extracts metadata from a PDF. If `parseDates` is set to `true`, the date properties will be parsed into `Date` objects.

Type Declaration

    function getMeta(
      data: DocumentInitParameters['data'] | PDFDocumentProxy,
      options?: { parseDates?: boolean },
    ): Promise<{
      info: Record<string, any>
      metadata: Record<string, any>
    }>
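To illustrate what parsing a date property involves, here is a standalone sketch that converts a PDF date string of the form `D:YYYYMMDDHHmmSS+HH'mm'` into a `Date`. It is not the parser used by `getMeta`; the function name `parsePdfDate` is hypothetical, and treating a missing time zone as UTC is an assumption made for this example.

```typescript
// Parses PDF date strings such as "D:20240131120000+01'00'".
// Every field after the year is optional. Returns null for unrecognized input.
function parsePdfDate(value: string): Date | null {
  const pattern = /^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?([Z+-])?(?:(\d{2})'?(?:(\d{2})'?)?)?$/
  const match = pattern.exec(value.trim())
  if (!match)
    return null

  // Missing fields fall back to the first month/day and to midnight UTC
  const [, year, month = '01', day = '01', hour = '00', minute = '00', second = '00', sign = 'Z', offsetHour = '00', offsetMinute = '00'] = match

  const localAsUtc = Date.UTC(+year, +month - 1, +day, +hour, +minute, +second)
  const offsetMinutes = sign === 'Z'
    ? 0
    : (sign === '+' ? 1 : -1) * (+offsetHour * 60 + +offsetMinute)

  // A "+01'00'" offset means local time is one hour ahead of UTC
  return new Date(localAsUtc - offsetMinutes * 60000)
}
```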

`extractText`

Extracts all text from a PDF. If `mergePages` is set to `true`, the text of all pages will be merged into a single string. Otherwise, an array of strings for each page will be returned.

Type Declaration

    function extractText(
      data: DocumentInitParameters['data'] | PDFDocumentProxy,
      options?: { mergePages?: false }
    ): Promise<{
      totalPages: number
      text: string[]
    }>
    function extractText(
      data: DocumentInitParameters['data'] | PDFDocumentProxy,
      options: { mergePages: true }
    ): Promise<{
      totalPages: number
      text: string
    }>
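For summarization workflows, the extracted text often has to be split into pieces that fit a model's input budget. The helper below is a minimal sketch that accepts either shape of `text` (a single string or one string per page) and groups it into chunks of at most `maxChars` characters. `chunkPages` and `maxChars` are hypothetical names, and counting characters rather than tokens is a simplifying assumption.

```typescript
// Groups page texts into chunks no longer than `maxChars` characters.
// Pages are joined with a newline. Oversized pages are sliced.
function chunkPages(text: string | string[], maxChars: number): string[] {
  if (maxChars <= 0)
    throw new RangeError('maxChars must be positive')

  const pages = Array.isArray(text) ? text : [text]
  const chunks: string[] = []
  let current = ''

  for (const page of pages) {
    for (let i = 0; i < page.length; i += maxChars) {
      const slice = page.slice(i, i + maxChars)
      const candidate = current ? `${current}\n${slice}` : slice
      if (candidate.length > maxChars && current) {
        // The slice does not fit: close the current chunk and start a new one
        chunks.push(current)
        current = slice
      }
      else {
        current = candidate
      }
    }
  }

  if (current)
    chunks.push(current)
  return chunks
}
```

With the result of `extractText`, a call such as `chunkPages(text, 4000)` works for both values of `mergePages`.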

`extractLinks`

Extracts all links from a PDF document, including hyperlinks and external URLs.

Type Declaration

    function extractLinks(
      data: DocumentInitParameters['data'] | PDFDocumentProxy,
    ): Promise<{
      totalPages: number
      links: string[]
    }>

Example

    import { readFile } from 'node:fs/promises'
    import { extractLinks, getDocumentProxy } from 'unpdf'

    // Load a PDF file
    const buffer = await readFile('./document.pdf')
    const pdf = await getDocumentProxy(new Uint8Array(buffer))

    // Extract all links from the PDF
    const { totalPages, links } = await extractLinks(pdf)

    console.log(`Total pages: ${totalPages}`)
    console.log(`Found ${links.length} links:`)

    for (const link of links)
      console.log(link)
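The returned `links` array can contain duplicates or non-web targets, depending on the document. As a post-processing sketch, the helper below keeps only unique `http:` and `https:` URLs. The name `externalLinks` is hypothetical, and whether `extractLinks` ever returns other schemes is not stated here, so the filter is defensive.

```typescript
// Keeps unique http(s) links in their original order and drops everything
// that does not parse as an absolute URL.
function externalLinks(links: readonly string[]): string[] {
  const seen = new Set<string>()
  const result: string[] = []

  for (const link of links) {
    let url: URL
    try {
      url = new URL(link)
    }
    catch {
      continue // not an absolute URL
    }
    if (url.protocol !== 'http:' && url.protocol !== 'https:')
      continue
    if (seen.has(url.href))
      continue // same target after normalization
    seen.add(url.href)
    result.push(link)
  }

  return result
}
```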

`extractImages`

Extracts images from a specific page of a PDF document, including necessary metadata such as width, height, and calculated color channels.

Note

This method will only work in Node.js and browser environments.

In order to use this method, make sure to meet the following requirements:

Use the official PDF.js build (see the "Official or Legacy PDF.js Build" section for details).
Install the `@napi-rs/canvas` package if you are using Node.js. This package is required to render the PDF page as an image.

Type Declaration

    interface ExtractedImageObject {
      data: Uint8ClampedArray
      width: number
      height: number
      channels: 1 | 3 | 4
      key: string
    }

    function extractImages(
      data: DocumentInitParameters['data'] | PDFDocumentProxy,
      pageNumber: number,
    ): Promise<ExtractedImageObject[]>

Example

Note

The following example uses the `sharp` library to process and save the extracted images. You will need to install it with your preferred package manager.

    import { readFile } from 'node:fs/promises'
    import sharp from 'sharp'
    import { extractImages, getDocumentProxy } from 'unpdf'

    async function extractPdfImages() {
      const buffer = await readFile('./document.pdf')
      const pdf = await getDocumentProxy(new Uint8Array(buffer))

      // Extract images from page 1
      const imagesData = await extractImages(pdf, 1)
      console.log(`Found ${imagesData.length} images on page 1`)

      // Process each image with sharp (optional)
      let totalImagesProcessed = 0
      for (const imgData of imagesData) {
        const imageIndex = ++totalImagesProcessed

        // The image data is raw pixels, so sharp needs the dimensions and channel count
        await sharp(imgData.data, {
          raw: { width: imgData.width, height: imgData.height, channels: imgData.channels }
        })
          .png()
          .toFile(`image-${imageIndex}.png`)

        console.log(`Saved image ${imageIndex} (${imgData.width}x${imgData.height}, ${imgData.channels} channels)`)
      }
    }

    extractPdfImages().catch(console.error)
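If you want to pass the raw pixels to an API that expects RGBA data (for example a canvas `ImageData`), the channel count has to be normalized first. The following sketch assumes that 1 channel means grayscale, 3 means RGB and 4 means RGBA, which is an interpretation of the `channels` field rather than something stated above; `toRGBA` is a hypothetical helper.

```typescript
interface ExtractedImageObject {
  data: Uint8ClampedArray
  width: number
  height: number
  channels: 1 | 3 | 4
  key: string
}

// Expands grayscale or RGB pixel data to RGBA with a fully opaque alpha channel.
function toRGBA(image: ExtractedImageObject): Uint8ClampedArray {
  const { data, width, height, channels } = image
  const pixels = width * height

  if (data.length !== pixels * channels)
    throw new RangeError(`Expected ${pixels * channels} bytes, got ${data.length}`)
  if (channels === 4)
    return data

  const out = new Uint8ClampedArray(pixels * 4)
  for (let i = 0; i < pixels; i++) {
    const src = i * channels
    const dst = i * 4
    const red = data[src]
    out[dst] = red
    // Grayscale repeats the single value for all three color channels
    out[dst + 1] = channels === 1 ? red : data[src + 1]
    out[dst + 2] = channels === 1 ? red : data[src + 2]
    out[dst + 3] = 255
  }
  return out
}
```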

`renderPageAsImage`

To render a PDF page as an image, you can use the `renderPageAsImage` method. This method will return an `ArrayBuffer` of the rendered image. It can also return a data URL (string) if the `toDataURL` option is set to `true`.

Note

This method will only work in Node.js and browser environments.

In order to use this method, make sure to meet the following requirements:

Use the official PDF.js build (see the examples below).
Install the `@napi-rs/canvas` package if you are using Node.js. This package is required to render the PDF page as an image.

Type Declaration

    function renderPageAsImage(
      data: DocumentInitParameters['data'] | PDFDocumentProxy,
      pageNumber: number,
      options?: {
        canvasImport?: () => Promise<typeof import('@napi-rs/canvas')>
        /** @default 1.0 */
        scale?: number
        width?: number
        height?: number
        toDataURL?: false
      },
    ): Promise<ArrayBuffer>
    function renderPageAsImage(
      data: DocumentInitParameters['data'] | PDFDocumentProxy,
      pageNumber: number,
      options: {
        canvasImport?: () => Promise<typeof import('@napi-rs/canvas')>
        /** @default 1.0 */
        scale?: number
        width?: number
        height?: number
        toDataURL: true
      },
    ): Promise<string>

Examples

    import { readFile, writeFile } from 'node:fs/promises'
    import { definePDFJSModule, renderPageAsImage } from 'unpdf'

    // Use the official PDF.js build
    await definePDFJSModule(() => import('pdfjs-dist'))

    const pdf = await readFile('./dummy.pdf')
    const buffer = new Uint8Array(pdf)
    const pageNumber = 1

    // Render the first page at twice the default scale
    const result = await renderPageAsImage(buffer, pageNumber, {
      canvasImport: () => import('@napi-rs/canvas'),
      scale: 2,
    })

    await writeFile('dummy-page-1.png', new Uint8Array(result))

    import { readFile, writeFile } from 'node:fs/promises'
    import { definePDFJSModule, renderPageAsImage } from 'unpdf'

    await definePDFJSModule(() => import('pdfjs-dist'))

    const pdf = await readFile('./dummy.pdf')
    const buffer = new Uint8Array(pdf)
    const pageNumber = 1

    // Request a data URL instead of an ArrayBuffer
    const result = await renderPageAsImage(buffer, pageNumber, {
      canvasImport: () => import('@napi-rs/canvas'),
      scale: 2,
      toDataURL: true,
    })

    // Embed the rendered page directly in an HTML document
    const html = `<!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="UTF-8">
        <meta name="viewport" content="width=device-width, initial-scale=1.0">
        <title>Dummy Page</title>
      </head>
      <body>
        <img alt="Example Page" src="${result}">
      </body>
    </html>`

    await writeFile('dummy-page-1.html', html)
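If you already hold a data URL and later need the raw bytes (for example to upload the image), it can be decoded without rendering again. The sketch below assumes the string is a base64-encoded data URL such as `data:image/png;base64,…`; the exact MIME type produced by `renderPageAsImage` is not stated above, so the helper reads it from the string. `decodeBase64DataUrl` is a hypothetical name.

```typescript
interface DecodedDataUrl {
  mimeType: string
  bytes: Uint8Array
}

// Splits a base64 data URL into its MIME type and decoded bytes.
function decodeBase64DataUrl(dataUrl: string): DecodedDataUrl {
  const match = /^data:([^;,]+);base64,([A-Za-z0-9+/=]*)$/.exec(dataUrl)
  if (!match)
    throw new TypeError('Expected a base64-encoded data URL')

  const [, mimeType, payload] = match
  // atob yields a binary string with one character per byte
  const binary = atob(payload)
  const bytes = new Uint8Array(binary.length)
  for (let i = 0; i < binary.length; i++)
    bytes[i] = binary.charCodeAt(i)

  return { mimeType, bytes }
}
```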