PDF Data Extraction API for Developers

Authentication and getting started

Every API request requires an API key passed in the Authorization header as a Bearer token. After signing up, you receive a primary and secondary key from your dashboard. The secondary key exists for zero-downtime rotation: activate the new key, update your application, then revoke the old one. Keys can be scoped to specific operations and restricted to a list of IP addresses for defense-in-depth security. All requests must use HTTPS, and the API rejects plain HTTP connections at the network layer.

The base URL for all endpoints is https://api.extractdatafrompdf.com/v1. Versioning is path-based, so breaking changes will ship under /v2 while /v1 remains stable. Non-breaking changes, such as new response fields, ship in the current version without a version bump. The API returns standard HTTP status codes: 200 for success, 400 for malformed requests, 401 for authentication failures, 429 for rate limit violations, and 500 for server errors. Every error response includes a machine-readable error code and a human-readable message.
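As a rough illustration of the points above, the following sketch builds an authenticated request with Python's standard library and maps the documented status codes to a suggested client reaction. It only constructs the request object and never sends it. The helper names (build_request, action_for_status) and the action labels are assumptions made for this example, not part of the API.

```python
import urllib.request

BASE_URL = "https://api.extractdatafrompdf.com/v1"

# Status codes listed in the text, mapped to a suggested client reaction.
# The action labels are illustrative names, not values defined by the API.
STATUS_ACTIONS = {
    200: "ok",
    400: "fix_request",
    401: "check_api_key",
    429: "wait_and_retry",
    500: "retry_later",
}


def build_request(path: str, api_key: str, method: str = "GET") -> urllib.request.Request:
    """Create (but do not send) a request carrying the Bearer token."""
    return urllib.request.Request(
        BASE_URL + path,
        method=method,
        headers={"Authorization": f"Bearer {api_key}"},
    )


def action_for_status(status: int) -> str:
    """Return the suggested reaction for a status code, if it is a documented one."""
    return STATUS_ACTIONS.get(status, "unexpected_status")
```

Passing the result of build_request to urllib.request.urlopen would perform the call; a real client would also read the error code and message from the response body on failure.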

Quick start with cURL

The fastest way to test the API is to upload a single PDF with cURL. Send a POST request to the /extract endpoint with the PDF as a multipart form upload and your API key in the header. The response arrives synchronously for documents under 10 pages, typically in 2 to 5 seconds. For larger documents or when you need to process multiple files, the batch endpoint is more efficient. The extract endpoint accepts an optional document_type parameter to skip auto-classification if you already know the document type, which reduces processing time by roughly 20 percent.
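For readers who prefer to see what the multipart upload looks like on the wire, here is a minimal sketch that assembles the request body by hand in Python. The form field name "file" and the choice to send document_type as a form field are assumptions; the text above does not specify either, so check the API reference before relying on them.

```python
import uuid
from typing import Optional, Tuple


def build_multipart(
    pdf_bytes: bytes,
    filename: str,
    document_type: Optional[str] = None,
    boundary: Optional[str] = None,
) -> Tuple[bytes, str]:
    """Return (body, content_type) for a multipart/form-data PDF upload."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    if document_type is not None:
        # Assumed: document_type travels as an ordinary form field.
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="document_type"\r\n\r\n'
                f"{document_type}\r\n"
            ).encode()
        )
    # Assumed: the PDF is sent under the form field name "file".
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/pdf\r\n\r\n"
    ).encode()
    parts.append(file_header + pdf_bytes + b"\r\n")
    # The closing delimiter ends the multipart body.
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"
```

The returned body would be POSTed to the /extract endpoint with the returned value as the Content-Type header and the API key in the Authorization header.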

Python SDK example

The Python SDK wraps the REST API with convenience methods for common workflows. Install it with pip install extractdatafrompdf and initialize the client with your API key. The client.extract() method accepts a file path or bytes object and returns a structured result with fields, tables, and metadata. For batch processing, client.extract_batch() accepts a list of file paths and returns a batch ID that you can poll with client.get_batch_result() or handle via webhook. The SDK handles retry logic, rate limiting, and multipart upload chunking automatically.
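The polling side of the batch workflow can be sketched without depending on the SDK itself. The helper below takes any zero-argument callable that fetches the current batch state, for example a lambda wrapping client.get_batch_result() with a batch ID. The status values "completed" and "failed" and the dictionary shape of the result are assumptions for this example.

```python
import time
from typing import Any, Callable, Dict


def wait_for_batch(
    fetch: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch() until the batch reaches a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        # Assumed terminal status values; adjust to what the API actually returns.
        if result.get("status") in ("completed", "failed"):
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("batch did not finish before the timeout")
        sleep(interval)
```

Injecting the sleep function keeps the helper easy to test. In production, a webhook avoids polling altogether.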

Core endpoints and response format

The API exposes three primary endpoints. POST /v1/extract processes a single PDF synchronously and returns the extracted data as JSON. POST /v1/batch accepts up to 50 PDFs in a single request and processes them asynchronously, returning a batch ID. GET /v1/batch/{id} retrieves the results of a batch job once processing completes. Each endpoint returns a consistent JSON schema with a top-level status field, a document object containing the classification result, and an array of extracted fields with names, values, confidence scores, and bounding box coordinates.
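To show how the per-field confidence scores might be used, the sketch below splits a response into fields that can be accepted automatically and fields that need review. The sample payload is invented for illustration, and the key names ("fields", "name", "value", "confidence") are assumptions based on the description above rather than a confirmed schema.

```python
import json
from typing import Any, Dict, List, Tuple

# Invented sample payload; real key names and values may differ.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "document": {"type": "invoice"},
  "fields": [
    {"name": "invoice_number", "value": "INV-1042", "confidence": 0.98},
    {"name": "total", "value": "1250.00", "confidence": 0.71}
  ]
}
"""


def split_by_confidence(
    response: Dict[str, Any], threshold: float = 0.9
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (accepted name-to-value map, names of fields below the threshold)."""
    accepted: Dict[str, Any] = {}
    needs_review: List[str] = []
    for field in response.get("fields", []):
        if field["confidence"] >= threshold:
            accepted[field["name"]] = field["value"]
        else:
            needs_review.append(field["name"])
    return accepted, needs_review


accepted, needs_review = split_by_confidence(json.loads(SAMPLE_RESPONSE))
```

The 0.9 threshold is arbitrary; a sensible value depends on the document type and the cost of an error.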

Table extraction deserves special attention because PDFs encode tables in ways that are notoriously difficult to parse. The API detects table boundaries using layout analysis, then extracts cell values into a structured array with row and column indices. Each cell includes its raw text value, detected data type (string, number, date, currency), and a normalized value. Currency values are returned as numeric amounts with a separate currency code field. Dates are normalized to ISO 8601 format regardless of the input format. This normalization layer eliminates the parsing work that would otherwise fall on your application code.
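A short sketch of how the cell array could be turned into spreadsheet-style rows follows. It assumes each cell is a dictionary with zero-based "row" and "column" indices, a "raw_text" value, and an optional "normalized_value"; these key names are guesses made for the example.

```python
from typing import Any, Dict, List


def cells_to_rows(cells: List[Dict[str, Any]]) -> List[List[Any]]:
    """Arrange a flat list of table cells into a row-major grid."""
    if not cells:
        return []
    n_rows = max(cell["row"] for cell in cells) + 1
    n_cols = max(cell["column"] for cell in cells) + 1
    # Positions with no detected cell stay None.
    grid: List[List[Any]] = [[None] * n_cols for _ in range(n_rows)]
    for cell in cells:
        # Prefer the normalized value and fall back to the raw text.
        if cell.get("normalized_value") is not None:
            value = cell["normalized_value"]
        else:
            value = cell.get("raw_text")
        grid[cell["row"]][cell["column"]] = value
    return grid
```

Each inner list can then be written as one spreadsheet row or one database record.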

Webhook callbacks

For asynchronous processing, you can register a webhook URL that receives a POST request when extraction completes. The webhook payload contains the full extraction result, identical to what you would receive from polling the batch endpoint. Each webhook request includes an X-Signature header containing an HMAC-SHA256 signature computed with your webhook secret, so your server can verify the payload was not tampered with. If your server returns a non-2xx status code, the API retries delivery with exponential backoff: 1 minute, 5 minutes, 30 minutes, 2 hours, and 12 hours. After 5 failed attempts, the delivery is marked as failed and visible in your dashboard.
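Signature verification on the receiving server could look roughly like this. The sketch assumes the X-Signature header carries a hex-encoded digest of the raw request body; the encoding is not stated above, so confirm it before deploying.

```python
import hashlib
import hmac


def verify_signature(body: bytes, signature_header: str, secret: str) -> bool:
    """Check an HMAC-SHA256 signature over the raw webhook body."""
    # Assumed: the header value is the hex digest of the unmodified body bytes.
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    received = signature_header.strip().lower()
    # compare_digest runs in constant time, which avoids leaking timing information.
    return hmac.compare_digest(expected.encode(), received.encode())
```

Verify against the exact bytes received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and invalidate the signature.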

Rate limits, batch processing, and performance

The standard plan allows 60 API requests per minute with a burst allowance of 10 additional requests. Rate limits are applied per API key, not per IP address, so multiple servers sharing a key share the same limit. When you exceed the rate limit, the API returns a 429 status code with a Retry-After header indicating how many seconds to wait. The Python and Node.js SDKs handle rate limiting automatically with built-in retries and exponential backoff, so your application code does not need to implement them.
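If you call the REST API directly instead of using an SDK, the wait time after a 429 can be computed along these lines. This is a sketch of one reasonable policy, not the SDKs' actual logic: it honors Retry-After when the header is present and numeric, and otherwise falls back to capped exponential backoff. The base and cap values are arbitrary choices.

```python
from typing import Optional


def retry_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        try:
            # The text says Retry-After gives the wait in seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # Unparseable header: fall through to backoff.
    # Exponential backoff: base, 2*base, 4*base, ... limited by cap.
    return min(cap, base * 2 ** attempt)
```

Adding a small random jitter to the fallback delay helps when several workers share one key and would otherwise retry at the same moment.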

Batch processing is the most efficient way to handle high-volume extraction. Instead of making 50 individual API calls, each with its own overhead, you upload all 50 PDFs in a single batch request and the server processes them in parallel. The system optimizes batch execution by grouping documents of the same type together, which improves classification accuracy and reduces total processing time by 30 to 40 percent compared to sequential single-document calls. Batch results are available for 7 days after completion and can be retrieved in full or filtered by document index.
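Because a batch request accepts at most 50 PDFs, a larger backlog has to be split before submission. A minimal sketch of that step follows; the function name is made up for this example.

```python
from typing import Iterator, List, Sequence

MAX_BATCH_SIZE = 50  # per-request limit stated in the text


def chunk_for_batches(
    paths: Sequence[str], size: int = MAX_BATCH_SIZE
) -> Iterator[List[str]]:
    """Yield consecutive groups of at most `size` file paths."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(paths), size):
        yield list(paths[start:start + size])
```

Each yielded list can be submitted as one batch, keeping in mind that every submission counts against the per-minute rate limit.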

Error handling and idempotency

Every API request accepts an optional idempotency_key header. If you send two requests with the same idempotency key within 24 hours, the API returns the result of the first request without reprocessing the document. This is critical for reliability in distributed systems where network failures can cause duplicate requests. The API also returns a request_id in the headers of every response, which you should log for debugging and support purposes. When contacting support, providing the request ID allows the team to trace the exact processing path your document followed.
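One way to make retries safe is to derive the idempotency key from the request content, so that a resend of the same document always carries the same key. The sketch below hashes the operation name together with the PDF bytes. This derivation is a suggestion for illustration, not something the API prescribes; a random UUID stored alongside the job works equally well.

```python
import hashlib


def idempotency_key(pdf_bytes: bytes, operation: str = "extract") -> str:
    """Derive a stable key from the operation name and document content."""
    digest = hashlib.sha256()
    digest.update(operation.encode())
    digest.update(b"\0")  # separator so the operation and content cannot blur together
    digest.update(pdf_bytes)
    return digest.hexdigest()
```

Note that a content-derived key also deduplicates intentional resubmissions of the same file within the 24-hour window, which may or may not be what you want.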

Node.js integration and advanced workflows

The Node.js SDK is available through npm with npm install extractdatafrompdf. It provides both callback and promise-based interfaces, with full TypeScript type definitions for all request and response objects. The SDK handles authentication, multipart uploads, rate limiting, and retry logic, so you can focus on your application logic rather than HTTP plumbing. For serverless environments like AWS Lambda or Vercel Functions, the SDK supports a lightweight mode that reduces cold start times by lazy-loading optional dependencies.

Advanced workflows often combine extraction with post-processing logic. For example, an accounts payable automation might extract invoice fields, validate the vendor against an internal database, check for duplicate invoice numbers, and then create a record in the ERP system. The API supports custom extraction schemas that let you define the exact fields you expect for each document type, along with validation rules like required fields, value ranges, and regex patterns. Documents that fail validation are flagged in the response so your application can route them to a human reviewer instead of processing them automatically.
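The validation rules mentioned above (required fields, value ranges, regex patterns) can be mirrored on the client side for local checks. The rule format below is hypothetical and is not the API's custom schema syntax; it only illustrates how the three kinds of rule might be applied to extracted fields.

```python
import re
from typing import Any, Dict, List


def validate(fields: Dict[str, Any], schema: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means the document passed."""
    problems: List[str] = []
    for name, rule in schema.items():
        value = fields.get(name)
        if value is None:
            if rule.get("required"):
                problems.append(f"{name}: missing required field")
            continue  # nothing further to check for an absent field
        if "min" in rule and value < rule["min"]:
            problems.append(f"{name}: below minimum {rule['min']}")
        if "max" in rule and value > rule["max"]:
            problems.append(f"{name}: above maximum {rule['max']}")
        if "pattern" in rule and not re.fullmatch(rule["pattern"], str(value)):
            problems.append(f"{name}: does not match pattern")
    return problems
```

A document with a non-empty problem list would be routed to a human reviewer rather than written to the ERP system.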

API frequently asked questions

What format does the PDF extraction API return?

The API returns structured JSON with extracted fields organized by document type. Each response includes a confidence score per field, the raw extracted value, and a normalized value where applicable (for example, dates are normalized to ISO 8601 format). For table data, the response includes row and column arrays that can be directly mapped to spreadsheet cells or database rows.

How does API authentication work?

Authentication uses API keys passed in the Authorization header as a Bearer token. Each account receives a primary and secondary key for rotation purposes. Keys can be scoped to specific operations (read-only, extract, batch) and IP-restricted for additional security. All API traffic is encrypted over TLS 1.2 or higher, and keys can be revoked instantly from the dashboard.

What are the API rate limits and how does batch processing work?

The standard plan allows 60 requests per minute and 1,000 pages per day. Batch endpoints accept up to 50 PDFs in a single request and process them asynchronously, returning a batch ID that you poll for results or receive via webhook callback. Batch processing is more efficient than individual requests because the system optimizes document classification across the batch, reducing total processing time by 30 to 40 percent compared to sequential single-document calls.

Does the API support webhook callbacks for async processing?

Yes. You can register webhook URLs in your account settings or pass a callback URL with each request. When extraction completes, the API sends a POST request to your webhook with the full extraction result as the JSON body. Webhooks include an HMAC signature header so your server can verify the payload originated from ExtractDataFromPDF. Failed webhook deliveries are retried with exponential backoff up to 5 times over 24 hours.
