DEV Community

Derek

Extract Words from PDF using PHP

Source: Extract Text from PDF

Step1: Get a License for the PHP PDF API

ComPDFKit API provides users with 1000 free PDF API requests. Follow the steps below to get a license and start making API requests.

  1. Register for ComPDFKit API to access the dashboard. There you can see your API keys, the progress of your API plan, and the status of your API requests.

  2. Create a project and get its Public Key and Secret Key.

After your account is created, a default project is created for you. You can create more projects to call ComPDFKit API. All supported PDF APIs are listed on the documentation pages.

Each project has its own Public Key and Secret Key. Remember to use the keys that belong to the project you are calling.
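
Because each project has its own key pair, it can help to keep the keys out of the source code and look them up by project. The following is a minimal sketch, not something ComPDFKit prescribes: the helper name `loadProjectKeys` and the environment variable naming scheme are assumptions made for illustration.

```php
<?php
// Hypothetical helper: read one project's key pair from environment variables.
// The naming scheme COMPDFKIT_<PROJECT>_PUBLIC_KEY / _SECRET_KEY is an
// assumption for this sketch, not something ComPDFKit defines.
function loadProjectKeys(string $project): array
{
    $prefix    = 'COMPDFKIT_' . strtoupper($project) . '_';
    $publicKey = getenv($prefix . 'PUBLIC_KEY');
    $secretKey = getenv($prefix . 'SECRET_KEY');

    // getenv() returns false when the variable is not set.
    if ($publicKey === false || $secretKey === false) {
        throw new RuntimeException("Missing keys for project '$project'.");
    }

    return ['publicKey' => $publicKey, 'secretKey' => $secretKey];
}

// Demo only: set the variables in-process so the lookup has something to find.
putenv('COMPDFKIT_INVOICES_PUBLIC_KEY=pk_example');
putenv('COMPDFKIT_INVOICES_SECRET_KEY=sk_example');

$keys = loadProjectKeys('invoices');
echo $keys['publicKey'], PHP_EOL; // prints "pk_example"
```

Asking for a project whose variables are not set throws immediately, which is easier to diagnose than an authentication failure caused by the wrong key pair.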

Step2: Authenticate with the PDF API for PDF Text Extraction

Replace publicKey and secretKey with your real keys to get the accessToken. Then use the accessToken to create a task, upload files, extract the PDF words, and get the extracted PDF text as a JSON file.

PHP code example to authenticate with the ComPDFKit PDF text extraction API:

```php
// Replace with the Public Key and Secret Key of your project.
$publicKey = 'YOUR_PUBLIC_KEY';
$secretKey = 'YOUR_SECRET_KEY';

$params = [
    'publicKey' => $publicKey,
    'secretKey' => $secretKey
];
$headers = ['Content-Type: application/json'];

$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/oauth/token',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => json_encode($params)
));
$response = curl_exec($curl);
curl_close($curl);

// The token is read from data.accessToken in the JSON response.
$result = json_decode($response, true);
$accessToken = $result['data']['accessToken'];
$bearerToken = "Bearer $accessToken";
```
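
The authentication example reads `data.accessToken` directly, which raises warnings if the request failed or the keys were rejected. As a hedged sketch, the decoding step could be wrapped in a small validating function. `extractAccessToken` is a hypothetical name, and the sample response body is made up for illustration.

```php
<?php
// Hypothetical helper (not part of ComPDFKit): pull the access token out of
// the raw JSON body returned by the token endpoint, failing loudly otherwise.
function extractAccessToken(string $responseBody): string
{
    $result = json_decode($responseBody, true);

    // json_decode() returns null when the body is not valid JSON.
    if (!is_array($result)) {
        throw new RuntimeException('Token response is not valid JSON.');
    }

    // The article reads the token from data.accessToken.
    $token = $result['data']['accessToken'] ?? null;
    if (!is_string($token) || $token === '') {
        throw new RuntimeException('Token response has no data.accessToken.');
    }

    return $token;
}

// Usage with a made-up response body:
$sample      = '{"data":{"accessToken":"abc123"}}';
$bearerToken = 'Bearer ' . extractAccessToken($sample);
echo $bearerToken, PHP_EOL; // prints "Bearer abc123"
```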

Step3: Create Task - Extract PDF Text

Use the accessToken obtained in the previous step. Set language to the language in which error information should be displayed (1: English, 2: Chinese). ComPDFKit PDF API parameters can be found on the Quick Start --> Request Description page.

The response data contains the taskId. PHP code example to create a PDF text extraction task:

```php
// Language for error information: 1 = English, 2 = Chinese.
$language = 1;

$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];

$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/task/pdf/json?language=' . $language,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);

// The task ID is read from data.taskId in the JSON response.
$result = json_decode($response, true);
$taskId = $result['data']['taskId'];
```
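
Since only two language codes are mentioned for this request, one option is to validate the value before it goes into the URL. The sketch below assumes a hypothetical helper name, `buildCreateTaskUrl`, and accepts only the two codes named in this step.

```php
<?php
// Hypothetical helper: build the create-task URL for the pdf/json tool.
// Only the language codes named in the article (1: English, 2: Chinese)
// are accepted.
function buildCreateTaskUrl(int $language = 1): string
{
    if (!in_array($language, [1, 2], true)) {
        throw new InvalidArgumentException('language must be 1 (English) or 2 (Chinese).');
    }

    return 'https://api-server.compdf.com/server/v1/task/pdf/json?'
        . http_build_query(['language' => $language]);
}

echo buildCreateTaskUrl(2), PHP_EOL;
// prints "https://api-server.compdf.com/server/v1/task/pdf/json?language=2"
```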

 

Step4: Upload Files for PDF Parser

Replace the following values in the PHP code:

  • PDF file: The PDF you want to extract text from.
  • taskId: Obtained in the task creation step.
  • language: The language in which error information is displayed.
  • accessToken: Obtained in the authentication step.

ComPDFKit API also provides features such as AI and OCR. You can set the following parameters in this step:

  • type: Content to extract (0: text, 1: table). Default: 0.
  • isAllowOcr: Whether to enable OCR (1: yes, 0: no). Default: 0.
  • isOnlyAiTable: Whether to enable AI table recognition (1: yes, 0: no). Default: 0.
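
The options above are sent as a JSON string in the upload request's parameter field. As an illustrative sketch, a small function can apply the listed defaults and encode the result. `buildExtractParameter` is a hypothetical name, and any extra keys the caller supplies are passed through unchanged.

```php
<?php
// Hypothetical helper: merge caller options over the defaults listed above
// and encode them as JSON for the upload request's "parameter" field.
function buildExtractParameter(array $options = []): string
{
    $defaults = [
        'type'          => 0, // 0: text, 1: table
        'isAllowOcr'    => 0, // 1: yes, 0: no
        'isOnlyAiTable' => 0, // 1: yes, 0: no
    ];

    // Caller values override the defaults; unknown keys are kept as given.
    return json_encode(array_merge($defaults, $options));
}

echo buildExtractParameter(['isAllowOcr' => 1]), PHP_EOL;
// prints {"type":0,"isAllowOcr":1,"isOnlyAiTable":0}
```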

PHP code example to upload a PDF for parsing:

```php
$params = [
    'taskId' => $taskId, // ID of your task
    'file' => new CURLFile($pdfPath), // Path of the PDF you need to process
    'language' => $language,
    'password' => '',
    // type: 0 = text, 1 = table
    'parameter' => json_encode(['type' => 1, 'isAllowOcr' => 1, 'isContainOcrBg' => 0])
];
$headers = [
    'Authorization: ' . $bearerToken
];

$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/file/upload',
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'POST',
    CURLOPT_HTTPHEADER => $headers,
    CURLOPT_POSTFIELDS => $params
));
$response = curl_exec($curl);
curl_close($curl);

// The file key is read from data.fileKey in the JSON response.
$result = json_decode($response, true);
$fileKey = $result['data']['fileKey'];
```

Step5: Process and Extract Text From Uploaded PDF Files

Execute the task to extract words from the PDF you uploaded. Here is the PHP code example:

```php
$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];

$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/execute/start?language=' . $language . '&taskId=' . $taskId,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);
```
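
The GET requests in this walkthrough repeat the same cURL setup and build their query strings by hand. One possible way to cut the repetition is sketched below. The names `API_BASE`, `apiUrl`, and `apiGet` are hypothetical, the option set is trimmed to the essentials, and `apiGet` assumes the cURL extension is loaded.

```php
<?php
const API_BASE = 'https://api-server.compdf.com/server/v1';

// Hypothetical helper: join a path and query parameters into a full URL.
// http_build_query() also URL-encodes the values.
function apiUrl(string $path, array $query = []): string
{
    $url = API_BASE . '/' . ltrim($path, '/');
    return $query === [] ? $url : $url . '?' . http_build_query($query);
}

// Hypothetical helper: authenticated GET that returns the decoded JSON body.
function apiGet(string $path, array $query, string $bearerToken): array
{
    $curl = curl_init(apiUrl($path, $query));
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_HTTPHEADER     => ['Authorization: ' . $bearerToken],
    ]);
    $response = curl_exec($curl);
    $error    = curl_error($curl);
    curl_close($curl);

    // curl_exec() returns false on a transport failure.
    if ($response === false) {
        throw new RuntimeException('Request failed: ' . $error);
    }

    $result = json_decode($response, true);
    if (!is_array($result)) {
        throw new RuntimeException('Response is not valid JSON.');
    }
    return $result;
}

echo apiUrl('execute/start', ['language' => 1, 'taskId' => 'abc']), PHP_EOL;
// prints "https://api-server.compdf.com/server/v1/execute/start?language=1&taskId=abc"
```

With these in place, starting the task would reduce to a call such as `apiGet('execute/start', ['language' => $language, 'taskId' => $taskId], $bearerToken)`.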

Step6: Get Task Information of PDF Text Extraction

Follow the PHP code example below to obtain the task information. Replace the required values, such as taskId and accessToken. The result of the PDF parser is returned as a JSON file, a structured data format that makes the extracted PDF text easy to reuse.

```php
$headers = [
    'Content-Type: application/json',
    'Authorization: ' . $bearerToken
];

$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://api-server.compdf.com/server/v1/task/taskInfo' . '?taskId=' . $taskId,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_ENCODING => '',
    CURLOPT_MAXREDIRS => 10,
    CURLOPT_TIMEOUT => 0,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_HTTP_VERSION => CURL_HTTP_VERSION_1_1,
    CURLOPT_CUSTOMREQUEST => 'GET',
    CURLOPT_HTTPHEADER => $headers,
));
$response = curl_exec($curl);
curl_close($curl);

// Decoded task information as an associative array.
$result = json_decode($response, true);
```
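
If the task has not finished when its information is requested, the request may need to be repeated. The sketch below shows one generic way to poll. It is an assumption that polling is needed at all, `pollUntil` is a hypothetical helper, and the `taskStatus` field and its values in the demo are placeholders, not the documented response shape. Check the task information response in the ComPDFKit documentation for the real fields.

```php
<?php
// Hypothetical helper: call $fetchInfo until $isFinished accepts its result,
// sleeping between attempts. Returns the accepted result, or null on give-up.
function pollUntil(
    callable $fetchInfo,
    callable $isFinished,
    int $maxAttempts = 10,
    int $delaySeconds = 2
): ?array {
    for ($attempt = 1; $attempt <= $maxAttempts; $attempt++) {
        $info = $fetchInfo();
        if ($isFinished($info)) {
            return $info;
        }
        // No need to wait after the final attempt.
        if ($attempt < $maxAttempts) {
            sleep($delaySeconds);
        }
    }
    return null;
}

// Demo with a fake fetcher instead of a real request. The 'taskStatus' field
// and its values are assumptions for illustration only.
$calls = 0;
$fakeFetch = function () use (&$calls): array {
    $calls++;
    return ['data' => ['taskStatus' => $calls < 3 ? 'processing' : 'finished']];
};
$isFinished = function (array $info): bool {
    return ($info['data']['taskStatus'] ?? '') === 'finished';
};

$info = pollUntil($fakeFetch, $isFinished, 5, 0);
echo $calls, PHP_EOL; // prints 3
```

In real use, `$fetchInfo` would wrap the task information request shown above, and `$isFinished` would test whichever field the API actually uses to report completion.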
