Getting started
Documentation
Everything you need to parse PDFs into structured markdown and JSON.
Docule turns PDFs into clean markdown and richly-structured JSON. The API is asynchronous: you submit a file, poll a status endpoint, then download the result when it's ready.
Base URL
All API requests use the following base URL:
https://docule.dev/api/v1
How it works
- Submit a PDF to POST /parse. You get a job_id back immediately.
- Poll GET /status/{job_id} until status becomes "completed".
- Download the parsed output from GET /result/{job_id} in JSON, markdown, or plain text.
Quick start
Parse your first PDF in three steps.
1. Get an API key
Sign up and grab a key from the API keys dashboard. Free accounts get 100 pages per month.
2. Submit a PDF
POST your file to /parse. The API returns a job_id immediately — parsing happens in the background.
curl -X POST https://docule.dev/api/v1/parse \
-H "X-API-Key: YOUR_KEY" \
-F "file=@report.pdf"
{
"job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
"status": "queued",
"created_at": "2026-04-07T12:05:30.123Z",
"message": "Job submitted. Poll GET /api/v1/status/{job_id}"
}
3. Poll status, then download the result
Poll the status endpoint until status is "completed", then fetch the parsed output.
curl https://docule.dev/api/v1/status/JOB_ID \
-H "X-API-Key: YOUR_KEY"
curl "https://docule.dev/api/v1/result/JOB_ID?format=json" \
-H "X-API-Key: YOUR_KEY"
That's it. Continue to the Python or cURL examples for a full end-to-end script.
Authentication
Authenticate every request with your API key.
All API requests require an API key passed via the X-API-Key HTTP header.
curl https://docule.dev/api/v1/health \
-H "X-API-Key: pp_live_your_key_here"
Getting your key
After signing up, generate a key from the API keys dashboard. You can create multiple keys and revoke them at any time.
Keep your key secret
Your API key grants full access to your account's parsing quota. Treat it like a password:
- Never commit keys to git or paste them into client-side code.
- Use environment variables in production.
- Rotate keys if you suspect compromise.
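As a minimal sketch of the environment-variable advice above, the helper below builds the X-API-Key header from the process environment. The variable name DOCULE_API_KEY and both function names are illustrative assumptions, not something the API defines.

```python
import os

# Hypothetical variable name; pick whatever fits your deployment.
ENV_VAR = "DOCULE_API_KEY"


def load_api_key(env=os.environ):
    """Return the API key from the environment, or raise if it is missing."""
    key = env.get(ENV_VAR, "").strip()
    if not key:
        raise RuntimeError(f"Set {ENV_VAR} before calling the API")
    return key


def auth_headers(env=os.environ):
    """Build the X-API-Key header that authenticated requests need."""
    return {"X-API-Key": load_api_key(env)}
```

Passing the mapping as an argument keeps the helper easy to test without touching the real environment.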
Errors
| Code | Meaning |
| 401 | Missing or invalid X-API-Key header |
| 403 | No active subscription |
API Reference
Parse a document
POST
/api/v1/parse
Upload a PDF for parsing. Returns a job ID for status polling.
Processing is asynchronous — you'll get a job_id back immediately, then poll /status until the job completes.
Request
Send the PDF as multipart/form-data. All other parameters are optional query strings:
| Parameter | Type | Default | Description |
| file | file | — | Required. PDF file to parse. Max 50 MB. |
| formats | query string | json,md | Comma-separated output formats: json, md, txt |
| no_vision | query bool | false | Disable vision API calls (faster, lower quality for complex pages) |
| force_vision | query bool | false | Force vision processing for every page (higher quality, higher cost) |
| document_type | query string | auto | Document type: auto, report. Legacy aliases invoice/contract are accepted. |
| include_page_markers | query bool | false | Add explicit page markers to markdown/text outputs (e.g. START OF PAGE / END OF PAGE). |
| include_source_metadata | query bool | false | Include source_path in JSON output. |
| include_document_metadata | query bool | false | Include extracted document_metadata in JSON output. |
| include_page_metadata | query bool | false | Include page-level metadata in JSON (page number/type, links, dimensions, header/footer). |
| include_item_metadata | query bool | false | Include item-level metadata in JSON (bbox, section fields, table quality fields). |
| include_processing_metadata | query bool | false | Include processing metadata in JSON (total_processing_time_ms, api_calls_made, estimated_cost_usd). |
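Because every option except file travels in the query string, it can help to assemble the URL in one place. The sketch below is one possible way to do that; parse_url is a hypothetical helper, and sending booleans as lowercase true/false is an assumption based on the include_page_markers=true style used elsewhere in these docs.

```python
from urllib.parse import urlencode

BASE = "https://docule.dev/api/v1"
ALLOWED_FORMATS = ("json", "md", "txt")


def parse_url(formats=("json", "md"), **flags):
    """Build the POST /parse URL with formats and option flags as query strings."""
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"Unsupported formats: {unknown}")
    params = {"formats": ",".join(formats)}
    for name, value in flags.items():
        # Booleans become lowercase true/false; other values pass through.
        params[name] = str(value).lower() if isinstance(value, bool) else value
    return f"{BASE}/parse?{urlencode(params)}"


url = parse_url(("json", "txt"), force_vision=True, document_type="report")
print(url)
```

The comma in the formats list is percent-encoded as %2C, which a standards-compliant server decodes back to a comma.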
Response (HTTP 202)
{
"job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
"status": "queued",
"created_at": "2026-04-07T12:05:30.123Z",
"message": "Job submitted. Poll GET /api/v1/status/{job_id}"
}
Limits
- Maximum file size: 50 MB
- Maximum concurrent jobs: 1 per user (returns HTTP 429 if exceeded)
- Monthly page quota varies by plan — see rate limits
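A client can catch the most obvious rejections before uploading. The following sketch checks the size limit and the standard %PDF- file header locally; the function names are hypothetical, and treating "50 MB" as 50 × 1024 × 1024 bytes is an assumption, since the docs do not say which definition the server uses.

```python
import os

# Assumption: "50 MB" is interpreted as 50 MiB.
MAX_BYTES = 50 * 1024 * 1024


def preflight(head: bytes, size: int) -> list:
    """Return a list of problems that would likely lead to a rejected upload."""
    problems = []
    if not head.startswith(b"%PDF-"):
        problems.append("not a PDF (missing %PDF- header)")
    if size > MAX_BYTES:
        problems.append("file larger than 50 MB")
    return problems


def preflight_file(path: str) -> list:
    """Run the same checks against a file on disk, reading only its first bytes."""
    with open(path, "rb") as f:
        head = f.read(5)
    return preflight(head, os.path.getsize(path))
```

An empty list means the file passed both local checks; the server remains the final authority.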
Check job status
GET
/api/v1/status/{job_id}
Poll for job progress and quality metrics.
Returns the current job status, progress, and (when complete) quality and cost metrics.
Response
{
"job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
"status": "completed",
"created_at": "2026-04-07T12:05:30.123Z",
"started_at": "2026-04-07T12:05:31.456Z",
"completed_at": "2026-04-07T12:05:45.789Z",
"progress": 1.0,
"total_pages": 42,
"pages_processed": 42,
"quality_score": 97.5,
"quality_warning": false,
"cost_usd": 0.0234,
"api_calls": 8,
"document_type": "report",
"error": null
}
Fields
| Field | Type | Description |
| status | string | queued, processing, completed, or failed |
| progress | float | 0.0 to 1.0: fraction of pages processed |
| total_pages | int | Total page count of the PDF |
| pages_processed | int | Pages completed so far |
| quality_score | float | 0–100 quality rating (available when completed) |
| quality_warning | bool | true if quality score is below 85 |
| cost_usd | float | API cost for this job in USD |
| api_calls | int | Number of vision API calls made |
| document_type | string | Detected document type (e.g. report) |
| error | string | Error message if status is failed |
Tip: Polling once every 2 seconds is a sensible default. Most documents complete in under 30 seconds.
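A polling loop is safer with an upper bound on total wait time. The sketch below polls at the suggested 2-second interval and gives up after a deadline; wait_for_job, the injected fetch_status callable, and the 300-second timeout are illustrative assumptions rather than part of the API.

```python
import time


def wait_for_job(fetch_status, interval=2.0, timeout=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_status() until the job completes, fails, or the timeout passes.

    fetch_status is any callable returning the decoded /status/{job_id} body.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"Job failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError("Job did not finish before the timeout")
        sleep(interval)
```

In real use, fetch_status would wrap the HTTP call, for example a lambda that performs the GET and returns the parsed JSON.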
Get results
GET
/api/v1/result/{job_id}
Retrieve the parsed output of a completed job.
Returns the structured result wrapped in a response envelope with metadata.
Parameters
| Parameter | Type | Default | Description |
| format | query string | json | Output format: json, md, or txt |
| include_page_markers | query bool | false | For format=md/txt: add explicit page boundary markers to the response. |
| include_source_path | query bool | false | For format=json: include top-level source_path. |
| include_document_metadata | query bool | false | For format=json: include top-level document_metadata. |
| include_page_metadata | query bool | false | For format=json: include page-level metadata fields (e.g. page_number, page type, dimensions, links). |
| include_item_metadata | query bool | false | For format=json: include item-level metadata fields (e.g. bbox, section metadata, table quality fields). |
| include_processing_metadata | query bool | false | For format=json: include top-level processing metadata. |
Response (format=json)
{
"job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
"status": "completed",
"result": { },
"markdown": null,
"text": null,
"quality_score": 97.5,
"metadata": {
"filename": "report.pdf",
"pages": 42,
"cost_usd": 0.0234,
"api_calls": 8
},
"document_type": "report"
}
Only the field that matches the requested format is populated. With format=md, the markdown field contains the full document text and result is null. With format=txt, the text field contains it instead, and result is again null.
Returns HTTP 409 if the job is not yet completed, or if it has failed. Always check the status endpoint first.
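Since the populated field depends on the requested format, a small lookup keeps client code from reading the wrong one. This is a sketch only; extract_payload and the mapping name are hypothetical helpers built from the envelope shown above.

```python
# Which envelope field carries the content for each format value.
FIELD_FOR_FORMAT = {"json": "result", "md": "markdown", "txt": "text"}


def extract_payload(envelope: dict, fmt: str = "json"):
    """Return the populated field of a /result envelope for the given format."""
    if envelope.get("status") != "completed":
        raise ValueError(f"Job is {envelope.get('status')!r}, not completed")
    payload = envelope.get(FIELD_FOR_FORMAT[fmt])
    if payload is None:
        raise ValueError(f"Envelope has no content for format {fmt!r}")
    return payload


envelope = {"status": "completed", "result": None, "markdown": "# Report", "text": None}
print(extract_payload(envelope, "md"))
```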
Download raw file
GET
/api/v1/result/{job_id}/raw
Download the raw output file directly, without the JSON envelope.
Returns the file content with the appropriate Content-Type — useful for piping straight to disk.
Parameters
| Parameter | Type | Default | Description |
| format | query string | md | json, md, or txt |
| include_page_markers | query bool | false | For format=md/txt: add explicit page boundary markers to raw output. |
| include_source_path | query bool | false | For format=json: include top-level source_path. |
| include_document_metadata | query bool | false | For format=json: include top-level document_metadata. |
| include_page_metadata | query bool | false | For format=json: include page-level metadata fields. |
| include_item_metadata | query bool | false | For format=json: include item-level metadata fields. |
| include_processing_metadata | query bool | false | For format=json: include top-level processing metadata. |
Example
curl "https://docule.dev/api/v1/result/JOB_ID/raw?format=md&include_page_markers=true" \
-H "X-API-Key: YOUR_KEY" \
-o report.md
Content types
format=json → application/json
format=md → text/plain
format=txt → text/plain
List jobs
GET
/api/v1/jobs
List your recent parsing jobs, newest first.
Parameters
| Parameter | Type | Default | Description |
| status | query string | — | Filter by status: queued, processing, completed, failed |
| limit | query int | 50 | Max results to return (max 200) |
Response
[
{
"job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
"status": "completed",
"filename": "report.pdf",
"created_at": "2026-04-07T12:05:30.123Z",
"completed_at": "2026-04-07T12:05:45.789Z",
"pages": 42,
"quality_score": 97.5,
"cost_usd": 0.0234
}
]
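The list response lends itself to quick usage summaries. The sketch below totals pages and cost over completed jobs; summarize is a hypothetical helper, and the handling of null pages or cost_usd on unfinished jobs is an assumption, since the docs only show a completed entry.

```python
def summarize(jobs: list) -> dict:
    """Aggregate page counts and cost over the completed jobs in a /jobs response."""
    completed = [j for j in jobs if j["status"] == "completed"]
    return {
        "jobs": len(jobs),
        "completed": len(completed),
        # "or 0" guards against null values (assumed possible for some jobs).
        "pages": sum(j.get("pages") or 0 for j in completed),
        "cost_usd": round(sum(j.get("cost_usd") or 0.0 for j in completed), 4),
    }


jobs = [
    {"job_id": "a", "status": "completed", "pages": 42, "cost_usd": 0.0234},
    {"job_id": "b", "status": "failed", "pages": None, "cost_usd": None},
]
print(summarize(jobs))
```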
Health check
GET
/api/v1/health
Check API status and queue depth. No authentication required.
Response
{
"status": "ok",
"version": "0.1.0",
"active_jobs": 3,
"queued_jobs": 0
}
Fields
| Field | Type | Description |
| status | string | Always "ok" if the service is running |
| version | string | Current API version |
| active_jobs | int | Number of jobs currently being processed |
| queued_jobs | int | Number of jobs waiting in queue |
Output
Output formats
Three formats for three different use cases.
Every parsed document can be retrieved in three formats. Specify which to generate using the formats parameter on /parse:
POST /api/v1/parse?formats=json,md,txt
JSON (json)
Structured data that defaults to strict content export (pages, text, markdown, items) with optional metadata toggles when you need extra context.
Best for: programmatic access, RAG pipelines, search indexing with rich metadata.
See the full schema at JSON structure.
Markdown (md)
Clean, readable text with tables in pipe format. Page markers are optional via query parameter.
Best for: human review, LLM context windows, documentation pipelines.
Plain text (txt)
Raw text without formatting or markers.
Best for: full-text search indexing, simple analytics.
JSON structure
The full schema returned by the parser.
The JSON output (the result field of the /result response) defaults to strict content mode:
{
"pages": [
{
"text": "Income Statement\nItem 2025 2024...",
"markdown": "### Income Statement\n| Item | 2025 | 2024 |\n...",
"items": [
{
"type": "heading",
"markdown": "### Income Statement",
"text": "Income Statement",
"level": 3
},
{
"type": "table",
"markdown": "| Item | 2025 | 2024 |\n|---|---|---|\n...",
"text": "Item 2025 2024...",
"table": {
"rows": [["Revenue", "1,200", "1,050"]]
}
}
]
}
]
}
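To illustrate how the pages and items arrays nest, the sketch below walks a result and exports every table's rows as CSV. The helper names are hypothetical, and page_index is simply the position in the pages array, not the optional page_number metadata field.

```python
import csv
import io


def iter_tables(result: dict):
    """Yield (page_index, rows) for every table item in a parsed result."""
    for page_index, page in enumerate(result.get("pages", [])):
        for item in page.get("items", []):
            if item.get("type") == "table":
                yield page_index, item["table"]["rows"]


def tables_to_csv(result: dict) -> str:
    """Concatenate the rows of all tables into a single CSV string."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for _, rows in iter_tables(result):
        writer.writerows(rows)
    return buf.getvalue()


sample = {"pages": [{"items": [
    {"type": "heading", "text": "Income Statement", "level": 3},
    {"type": "table", "table": {"rows": [["Revenue", "1,200", "1,050"]]}},
]}]}
print(tables_to_csv(sample))
```

Cells containing commas, such as "1,200", are quoted by the csv module so the columns stay intact.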
Optional metadata can be added with query parameters on /result/{job_id} or /result/{job_id}/raw:
GET /api/v1/result/{job_id}?format=json&include_page_metadata=true&include_item_metadata=true
GET /api/v1/result/{job_id}?format=json&include_source_path=true&include_document_metadata=true
GET /api/v1/result/{job_id}?format=json&include_processing_metadata=true
To include page boundary markers in text outputs:
GET /api/v1/result/{job_id}?format=md&include_page_markers=true
GET /api/v1/result/{job_id}/raw?format=txt&include_page_markers=true
Page metadata fields (optional)
With include_page_metadata=true, each page carries the fields below, including a semantic page_type classification:
| Field | Description |
| page_number | Original 1-based page index in the PDF |
| page_type | Detected semantic class (for example narrative, financial_table, glossary) |
| width, height | Page dimensions |
| links, images, charts | Detected page assets/references |
| page_header_markdown, page_footer_markdown | Detected header/footer content when available |
| normalized_text | Normalized text representation for matching/search |
| definitions | Structured glossary definitions when page type is glossary |
Top-level metadata fields (optional)
| Field | Description |
| source_path | Original file path in parser environment (include_source_path=true) |
| document_metadata | Extracted document-level metadata (include_document_metadata=true) |
| processing | Processing stats (total_processing_time_ms, api_calls_made, estimated_cost_usd) when include_processing_metadata=true |
Item types
The items array contains structured content blocks:
| Type | Fields | Description |
| heading | markdown, text, level | Section heading with level (1–6) |
| text | markdown, text | Paragraph or text block |
| table | markdown, text, table.rows | Table with structured row data. Optional table quality fields are included with include_item_metadata=true. |
| image | markdown, text, image.url, image.alt | Embedded image reference |
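Branching on the type field is the usual way to consume items. As a small sketch, the hypothetical helpers below count items per type and build an indented outline from heading items, using only the fields listed in the table above.

```python
from collections import Counter


def count_item_types(result: dict) -> Counter:
    """Count how many items of each type appear across all pages."""
    return Counter(
        item.get("type")
        for page in result.get("pages", [])
        for item in page.get("items", [])
    )


def outline(result: dict) -> list:
    """Return heading texts indented by two spaces per level below 1."""
    lines = []
    for page in result.get("pages", []):
        for item in page.get("items", []):
            if item.get("type") == "heading":
                indent = "  " * (item.get("level", 1) - 1)
                lines.append(f"{indent}{item['text']}")
    return lines


doc = {"pages": [{"items": [
    {"type": "heading", "text": "Annual Report", "level": 1},
    {"type": "text", "text": "Overview paragraph."},
    {"type": "heading", "text": "Income Statement", "level": 3},
]}]}
print("\n".join(outline(doc)))
```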
Guides
Python example
Upload a PDF, wait for parsing, and save the result.
This complete example uses only the requests library — no SDK required.
Python
import time

import requests

API_KEY = "pp_live_your_key_here"
BASE = "https://docule.dev/api/v1"
HEADERS = {"X-API-Key": API_KEY}

# 1. Submit the PDF; the API responds immediately with a job_id.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE}/parse",
        headers=HEADERS,
        files={"file": f},
        params={"formats": "json,md"},
    )
resp.raise_for_status()
job_id = resp.json()["job_id"]
print(f"Job submitted: {job_id}")

# 2. Poll every 2 seconds until the job completes or fails.
while True:
    resp = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS)
    resp.raise_for_status()
    status = resp.json()
    print(f"  {status['status']} - {status['progress']:.0%}")
    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise RuntimeError(f"Job failed: {status['error']}")
    time.sleep(2)

# 3. Fetch the structured JSON result.
resp = requests.get(
    f"{BASE}/result/{job_id}",
    headers=HEADERS,
    params={"format": "json"},
)
resp.raise_for_status()
result = resp.json()
print(f"Pages: {result['metadata']['pages']}")
print(f"Quality: {result['quality_score']}")
for page in result["result"]["pages"]:
    print(page["markdown"][:200])

# 4. Download the raw markdown with page markers and save it to disk.
resp = requests.get(
    f"{BASE}/result/{job_id}/raw",
    headers=HEADERS,
    params={"format": "md", "include_page_markers": "true"},
)
resp.raise_for_status()
with open("report.md", "w", encoding="utf-8") as f:
    f.write(resp.text)
print("Saved report.md")
cURL example
End-to-end shell script using cURL and jq.
Bash
API_KEY="YOUR_KEY"
BASE="https://docule.dev/api/v1"

# Submit the PDF and capture the job ID.
JOB=$(curl -s -X POST "$BASE/parse" \
  -H "X-API-Key: $API_KEY" \
  -F "file=@report.pdf" | jq -r '.job_id')
echo "Job: $JOB"

# Poll every 2 seconds until the job completes or fails.
while true; do
  STATUS=$(curl -s "$BASE/status/$JOB" \
    -H "X-API-Key: $API_KEY" | jq -r '.status')
  echo "Status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && echo "Failed!" && exit 1
  sleep 2
done

# Download the raw markdown with page markers.
curl -s "$BASE/result/$JOB/raw?format=md&include_page_markers=true" \
  -H "X-API-Key: $API_KEY" -o report.md
echo "Saved report.md"
Requires: jq for JSON parsing. Install via apt install jq, brew install jq, or choco install jq.
Reference
Rate limits
Plan tiers, quotas, and overage pricing.
| Plan | Pages / month | Requests / sec | Overage |
| Free | 100 | 1 | — |
| Starter (€29/mo) | 1,500 | 3 | €0.018 / page |
| Pro (€49/mo) | 4,000 | 10 | €0.014 / page |
| Business (€199/mo) | 20,000 | 20 | €0.009 / page |
How counting works
- One page equals one credit. A 42-page PDF consumes 42 credits.
- Quotas reset on the first day of each calendar month.
- Free accounts have no overage — requests are rejected once the quota is hit.
- Paid plans can opt in to on-demand usage from Settings → Billing. When enabled, any pages beyond the plan's monthly allowance are added to your next Stripe invoice automatically at the overage rate above. We email you once per month, the first time a job crosses into on-demand usage.
- On-demand usage is off by default, so you can never be billed beyond your plan without explicitly turning it on.
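The overage arithmetic can be worked through in a few lines. The sketch below uses the allowances and per-page rates from the table above; overage_eur is a hypothetical helper, and it models only what gets billed, not how the API treats requests once a quota is exhausted.

```python
from decimal import Decimal

# (pages per month, overage rate in EUR per page), copied from the plan table.
PLANS = {
    "starter": (1500, Decimal("0.018")),
    "pro": (4000, Decimal("0.014")),
    "business": (20000, Decimal("0.009")),
}


def overage_eur(plan: str, pages_used: int, on_demand: bool = False) -> Decimal:
    """Return the on-demand charge for a month, in EUR."""
    allowance, rate = PLANS[plan]
    if not on_demand:
        # On-demand usage is off by default, so nothing extra is billed.
        return Decimal("0")
    extra_pages = max(0, pages_used - allowance)
    return extra_pages * rate


# Starter plan, 1,700 pages: 200 extra pages at 0.018 EUR each.
print(overage_eur("starter", 1700, on_demand=True))
```

Decimal avoids the rounding surprises that binary floats can introduce in currency calculations.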
Concurrency
Maximum 1 concurrent job per user across all plans. Submitting a second while one is in progress returns HTTP 429.
What triggers HTTP 429
- Exceeding your requests-per-second limit
- Exceeding your monthly page quota (free plan only)
- Submitting more than 1 job concurrently
Error codes
HTTP status codes and how to handle them.
| Code | Meaning |
| 400 | Invalid request: not a PDF, file too large (max 50 MB), or malformed request |
| 401 | Missing or invalid API key |
| 403 | No active subscription, or not authorized to access this job |
| 404 | Job not found, or requested format not available |
| 409 | Job not yet completed, or has failed; check status first |
| 429 | Rate limit exceeded, monthly quota exhausted, or too many concurrent jobs |
| 503 | Service unavailable: the parsing engine is starting up |
Error response format
All error responses share this shape:
{
"error": "Rate limit exceeded",
"detail": "Rate limit exceeded (3 req/sec). Upgrade your plan for higher limits."
}
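Because every error body has the same two fields, a client can map them onto a single exception type. The class below is a sketch; ApiError and raise_for_error are hypothetical names, and the fallbacks for missing fields are assumptions.

```python
class ApiError(Exception):
    """Carries the HTTP status plus the error and detail fields of an error body."""

    def __init__(self, status: int, body: dict):
        self.status = status
        self.error = body.get("error", "Unknown error")
        self.detail = body.get("detail", "")
        super().__init__(f"HTTP {status}: {self.error}. {self.detail}".strip())


def raise_for_error(status: int, body: dict) -> dict:
    """Return body unchanged for 2xx responses, otherwise raise ApiError."""
    if 200 <= status < 300:
        return body
    raise ApiError(status, body)
```

Callers can then catch ApiError once and inspect status, error, and detail instead of re-parsing the body at each call site.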
Recommended handling
- 429: back off and retry. Use exponential backoff starting at 1 second.
- 409 on /result: the job isn't ready — keep polling /status.
- 5xx: retry with backoff. Check /health if errors persist.
- 4xx (other): fix your request — these will not succeed on retry.
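The retry advice above can be sketched as a small wrapper that doubles its delay after each retryable response. call_with_backoff, the injected send callable, and the five-attempt cap are illustrative assumptions; only the 1-second starting delay comes from the guidance above.

```python
import time


def is_retryable(status: int) -> bool:
    """429 and 5xx may succeed on retry; other 4xx codes will not."""
    return status == 429 or 500 <= status < 600


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() and retry retryable statuses with exponential backoff.

    send is any callable returning (status_code, decoded_body).
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if not is_retryable(status) or attempt == max_attempts:
            return status, body
        sleep(delay)
        delay *= 2  # 1s, 2s, 4s, ...
```

A 409 from /result is deliberately not retried here, because the recommended response is to go back to polling /status.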