🪭 Serverless redistribution of PDF.js for edge environments
johannschopplich/pdfjs-serverless
A redistribution of Mozilla's PDF.js for edge environments, like Cloudflare Workers. It is especially useful for serverless AI applications, where you want to parse PDF documents and extract text content.
This package comes with zero dependencies. The whole export is about 1.4 MB (minified).
Note: `pdfjs-serverless` is currently built from PDF.js v5.2.133.
Run the following command to add `pdfjs-serverless` to your project.

```bash
# pnpm
pnpm add -D pdfjs-serverless

# npm
npm install -D pdfjs-serverless

# yarn
yarn add -D pdfjs-serverless
```
Tip: For common operations, such as extracting text content or images from PDF files, you can use the `unpdf` package. It is a wrapper around `pdfjs-serverless` and provides a simple API for common use cases.
`pdfjs-serverless` provides the same API as the original PDF.js library. To use any of the PDF.js exports, change the import source from `pdfjs-dist` to `pdfjs-serverless`:

```diff
- import { getDocument } from 'pdfjs-dist'
+ import { getDocument } from 'pdfjs-serverless'
```
```js
export default {
  async fetch(request) {
    if (request.method !== 'POST')
      return new Response('Method Not Allowed', { status: 405 })

    const { getDocument } = await import('pdfjs-serverless')

    // Get the PDF file from the POST request body as a buffer
    const data = await request.arrayBuffer()

    const document = await getDocument({
      data: new Uint8Array(data),
      useSystemFonts: true,
    }).promise

    // Get metadata and initialize output object
    const metadata = await document.getMetadata()
    const output = { metadata, pages: [] }

    // Iterate through each page and fetch the text content
    for (let i = 1; i <= document.numPages; i++) {
      const page = await document.getPage(i)
      const textContent = await page.getTextContent()
      const contents = textContent.items.map(item => item.str).join(' ')

      // Add page content to output
      output.pages.push({ pageNumber: i, content: contents })
    }

    // Return the results as JSON
    return new Response(JSON.stringify(output), {
      headers: { 'Content-Type': 'application/json' },
    })
  },
}
```
```js
import { getDocument } from 'https://esm.sh/pdfjs-serverless'

const data = Deno.readFileSync('sample.pdf')
const document = await getDocument({
  data,
  useSystemFonts: true,
}).promise

console.log(await document.getMetadata())

// Iterate through each page and fetch the text content
for (let i = 1; i <= document.numPages; i++) {
  const page = await document.getPage(i)
  const textContent = await page.getTextContent()
  const contents = textContent.items.map(item => item.str).join(' ')
  console.log(contents)
}
```
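Joining every item with a space, as the examples here do, discards line breaks. As a rough sketch, assuming each text item carries the `str` and `hasEOL` fields that PDF.js text items expose, a small helper could keep them. The name `itemsToText` is hypothetical and not part of the package, and the sample items are plain objects standing in for `textContent.items`.

```javascript
// Hypothetical helper: turns PDF.js-style text items into a string,
// starting a new line wherever an item is flagged with `hasEOL`.
function itemsToText(items) {
  let text = ''
  for (const item of items) {
    // Skip entries without a string `str`, such as marked-content entries
    if (typeof item.str !== 'string') continue
    text += item.str
    text += item.hasEOL ? '\n' : ' '
  }
  return text.trim()
}

// Usage with plain objects standing in for `textContent.items`
const sample = [
  { str: 'Hello', hasEOL: false },
  { str: 'world', hasEOL: true },
  { str: 'Second line', hasEOL: false },
]
console.log(itemsToText(sample))
```

Real documents often contain whitespace-only items, so the spacing may need tuning for a given PDF.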
The heart and soul of this package is the `rollup.config.ts` file. It uses `rollup` to bundle the PDF.js library into a single file that can be used in serverless environments.
The heavy lifting comes from string replacements in the PDF.js library, i.e. removing browser context references and checks such as `typeof window`. Additionally, we enforce Node.js compatibility (this may sound paradoxical at first, but bear with me), i.e. we mock the `@napi-rs/canvas` module and set the `isNodeJS` flag to `true`.
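To illustrate the string-replacement idea, here is a minimal sketch of a Rollup-style plugin with a `transform` hook. It is not the package's actual configuration: the function name `browserCheckReplacements` and the replacement table are assumptions made for this example.

```javascript
// Hypothetical Rollup-style plugin: rewrites browser checks in source code.
// The replacement table is illustrative, not the package's actual list.
function browserCheckReplacements(replacements) {
  return {
    name: 'browser-check-replacements',
    transform(code) {
      let result = code
      for (const [search, replacement] of Object.entries(replacements)) {
        result = result.replaceAll(search, replacement)
      }
      // Returning null tells Rollup the module was left untouched.
      // No sourcemap is generated in this sketch.
      return result === code ? null : { code: result, map: null }
    },
  }
}

const plugin = browserCheckReplacements({
  'typeof window': '"undefined"',
})

const transformed = plugin.transform('if (typeof window !== "undefined") { init() }')
console.log(transformed.code)
// if ("undefined" !== "undefined") { init() }
```

With the check reduced to a comparison of two identical string literals, a minifier can drop the browser-only branch entirely.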
PDF.js uses a worker to parse PDF documents. This worker is a separate file that is loaded by the main library. For the serverless build, we need to inline the worker code into the main library.
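One possible way to inline a worker at bundle time is a virtual module, sketched here with Rollup's `resolveId` and `load` hooks. This is an assumption about how such inlining could work, not a description of this package's build: the module name `virtual:pdf-worker`, the function `inlineWorker`, and the stand-in worker source are all hypothetical.

```javascript
// Hypothetical Rollup-style plugin exposing the worker source as a virtual
// module, so the main bundle can import it instead of loading a separate file.
const VIRTUAL_ID = 'virtual:pdf-worker' // illustrative module name
const RESOLVED_ID = '\0' + VIRTUAL_ID // "\0" prefix marks virtual modules by Rollup convention

function inlineWorker(workerSource) {
  return {
    name: 'inline-worker',
    resolveId(source) {
      // Claim only the virtual module; defer everything else to other plugins
      return source === VIRTUAL_ID ? RESOLVED_ID : null
    },
    load(id) {
      if (id !== RESOLVED_ID) return null
      // Serve the worker code as the content of the virtual module
      return workerSource
    },
  }
}

// Usage with a one-line stand-in for the real worker source
const workerPlugin = inlineWorker('export const WorkerMessageHandler = {}')
const resolved = workerPlugin.resolveId('virtual:pdf-worker')
console.log(workerPlugin.load(resolved))
```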
Finally, some globals that are missing in serverless environments are mocked, such as `FinalizationRegistry`, which is not available in Cloudflare Workers.
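A mock of this kind can be very small. The following sketch installs a no-op `FinalizationRegistry` only when the runtime lacks one. The function name `installFinalizationRegistryMock` is hypothetical, and the package's actual mock may look different.

```javascript
// Hypothetical mock: installs a no-op FinalizationRegistry on the given
// global-like object if none exists. Cleanup callbacks are never invoked.
function installFinalizationRegistryMock(target = globalThis) {
  if (typeof target.FinalizationRegistry !== 'undefined') return false

  target.FinalizationRegistry = class FinalizationRegistry {
    constructor(cleanupCallback) {
      this.cleanupCallback = cleanupCallback
    }

    // Accept registrations but do nothing with them
    register() {}

    // Mirror the native return type: false means nothing was unregistered
    unregister() {
      return false
    }
  }
  return true
}

// Usage with a plain object standing in for a runtime's global scope
const fakeGlobal = {}
const installed = installFinalizationRegistryMock(fakeGlobal)
const registry = new fakeGlobal.FinalizationRegistry(() => {})
registry.register({}, 'token')
```

The trade-off is that cleanup callbacks never run, so code relying on them to release resources will not do so under the mock.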
`pdf.mjs`, a nodeless build of PDF.js v2.
MIT License © 2023-PRESENT Johann Schopplich