Documentation

Everything you need to parse PDFs into structured markdown and JSON.

Docule turns PDFs into clean markdown and richly structured JSON. The API is asynchronous: you submit a file, poll a status endpoint, then download the result when it's ready.

What you'll find here

Quick start: parse your first PDF in three steps
API reference: every endpoint, parameter, and response field
JSON structure: page types, items, tables, and metadata
Code examples: Python and cURL end-to-end walkthroughs

Base URL

All API requests use the following base URL:

https://docule.dev/api/v1
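As a minimal sketch, endpoint URLs can be built by joining a path and optional query parameters onto the base URL. The helper name `api_url` is an assumption made for illustration; it is not part of the API.

```python
from urllib.parse import urlencode

BASE_URL = "https://docule.dev/api/v1"


def api_url(path: str, **params: str) -> str:
    """Join an endpoint path (and optional query parameters) onto the base URL."""
    url = f"{BASE_URL}/{path.lstrip('/')}"
    if params:
        # urlencode takes care of escaping parameter values
        url += "?" + urlencode(params)
    return url
```

For example, `api_url("result/JOB_ID", format="md")` yields `https://docule.dev/api/v1/result/JOB_ID?format=md`.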

How it works

1. Submit a PDF to POST /parse. You get a job_id back immediately.
2. Poll GET /status/{job_id} until status becomes "completed".
3. Download the parsed output from GET /result/{job_id} in JSON, markdown, or plain text.


Quick start

Parse your first PDF in three steps.

1. Get an API key

Sign up and grab a key from the API keys dashboard. Free accounts get 100 pages per month.

2. Submit a PDF

POST your file to /parse. The API returns a job_id immediately — parsing happens in the background.

# Upload your PDF
curl -X POST https://docule.dev/api/v1/parse \
  -H "X-API-Key: YOUR_KEY" \
  -F "file=@report.pdf"

# Response (HTTP 202)
{
  "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
  "status": "queued",
  "created_at": "2026-04-07T12:05:30.123Z",
  "message": "Job submitted. Poll GET /api/v1/status/{job_id}"
}

3. Poll status, then download the result

Poll the status endpoint until status is "completed", then fetch the parsed output.

# Poll until completed
curl https://docule.dev/api/v1/status/JOB_ID \
  -H "X-API-Key: YOUR_KEY"

# Download the parsed result
curl "https://docule.dev/api/v1/result/JOB_ID?format=json" \
  -H "X-API-Key: YOUR_KEY"

That's it. Continue to the Python or cURL examples for a full end-to-end script.


Authentication

Authenticate every request with your API key.

Every endpoint except the health check requires an API key, passed in the X-API-Key HTTP header.

curl https://docule.dev/api/v1/jobs \
  -H "X-API-Key: pp_live_your_key_here"

Getting your key

After signing up, generate a key from the API keys dashboard. You can create multiple keys and revoke them at any time.

Keep your key secret

Your API key grants full access to your account's parsing quota. Treat it like a password:

Never commit keys to git or paste them into client-side code.
Use environment variables in production.
Rotate keys if you suspect compromise.
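One way to follow the advice above is to read the key from an environment variable at startup. This is a minimal sketch: the variable name `DOCULE_API_KEY` and both helper names are assumptions chosen for illustration, not names the API prescribes.

```python
import os


def load_api_key(env=os.environ, var_name="DOCULE_API_KEY"):
    """Return the API key from the environment, failing loudly if it is unset.

    var_name is a hypothetical variable name; use whatever your deployment defines.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable to your API key")
    return key


def auth_headers(api_key):
    """Build the header dict sent with authenticated requests."""
    return {"X-API-Key": api_key}
```

Failing at startup when the variable is missing tends to be easier to diagnose than a 401 from the first request.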

Errors

Code	Meaning
`401`	Missing or invalid `X-API-Key` header
`403`	No active subscription


Parse a document

POST /api/v1/parse

Upload a PDF for parsing. Returns a job ID for status polling.

Processing is asynchronous — you'll get a job_id back immediately, then poll /status until the job completes.

Request

Send the PDF as multipart/form-data. All other parameters are optional query strings:

Parameter	Type	Default	Description
`file`	file	—	Required. PDF file to parse. Max 50 MB.
`formats`	query string	`json,md`	Comma-separated output formats: `json`, `md`, `txt`
`no_vision`	query bool	`false`	Disable vision API calls (faster, lower quality for complex pages)
`force_vision`	query bool	`false`	Force vision processing for every page (higher quality, higher cost)
`document_type`	query string	`auto`	Document type: `auto`, `report`. Legacy aliases `invoice`/`contract` are accepted.
`include_page_markers`	query bool	`false`	Add explicit page markers to markdown/text outputs (e.g. `START OF PAGE` / `END OF PAGE`).
`include_source_metadata`	query bool	`false`	Include `source_path` in JSON output.
`include_document_metadata`	query bool	`false`	Include extracted `document_metadata` in JSON output.
`include_page_metadata`	query bool	`false`	Include page-level metadata in JSON (page number/type, links, dimensions, header/footer).
`include_item_metadata`	query bool	`false`	Include item-level metadata in JSON (bbox, section fields, table quality fields).
`include_processing_metadata`	query bool	`false`	Include processing metadata in JSON (`total_processing_time_ms`, `api_calls_made`, `estimated_cost_usd`).
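The parameters in the table can be assembled client-side before the upload. The sketch below is illustrative only: the function name is hypothetical, booleans are serialized as lowercase `true`/`false` to match the query strings shown elsewhere in these docs, and the check that rejects `no_vision` together with `force_vision` is a client-side assumption (the docs do not say how the server treats that combination).

```python
ALLOWED_FORMATS = ("json", "md", "txt")


def build_parse_params(formats=("json", "md"), **flags):
    """Turn Python values into query-string parameters for POST /parse.

    Boolean flags (for example no_vision=True) become "true" / "false".
    """
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"Unsupported formats: {unknown}")
    # Assumption: the two vision flags contradict each other, so reject both at once
    if flags.get("no_vision") and flags.get("force_vision"):
        raise ValueError("no_vision and force_vision contradict each other")
    params = {"formats": ",".join(formats)}
    for name, value in flags.items():
        params[name] = str(value).lower() if isinstance(value, bool) else str(value)
    return params
```

The returned dict can be passed as `params=` to an HTTP client alongside the multipart file upload.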

Response (HTTP 202)

{
  "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
  "status": "queued",
  "created_at": "2026-04-07T12:05:30.123Z",
  "message": "Job submitted. Poll GET /api/v1/status/{job_id}"
}

Limits

Maximum file size: 50 MB
Maximum concurrent jobs: 1 per user (returns HTTP 429 if exceeded)
Monthly page quota varies by plan — see rate limits
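A pre-flight check can catch the most common rejections before any bytes are uploaded. This is a sketch under stated assumptions: "50 MB" is interpreted as 50 MiB, the `%PDF-` signature test is a generic PDF check rather than the server's actual validation, and the function name is hypothetical.

```python
MAX_BYTES = 50 * 1024 * 1024  # assumption: "50 MB" is read as 50 MiB


def upload_problems(size_bytes, first_bytes):
    """Return a list of likely rejection reasons; an empty list means none were found."""
    problems = []
    if size_bytes > MAX_BYTES:
        problems.append("file exceeds 50 MB")
    # PDF files conventionally begin with the "%PDF-" signature
    if not first_bytes.startswith(b"%PDF-"):
        problems.append("file does not start with the %PDF- signature")
    return problems
```

For a file on disk, pass `os.path.getsize(path)` and the first five bytes read in binary mode.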


Check job status

GET /api/v1/status/{job_id}

Poll for job progress and quality metrics.

Returns the current job status, progress, and (when complete) quality and cost metrics.

Response

{
  "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
  "status": "completed",
  "created_at": "2026-04-07T12:05:30.123Z",
  "started_at": "2026-04-07T12:05:31.456Z",
  "completed_at": "2026-04-07T12:05:45.789Z",
  "progress": 1.0,
  "total_pages": 42,
  "pages_processed": 42,
  "quality_score": 97.5,
  "quality_warning": false,
  "cost_usd": 0.0234,
  "api_calls": 8,
  "document_type": "report",
  "error": null
}

Fields

Field	Type	Description
`status`	string	`queued`, `processing`, `completed`, or `failed`
`progress`	float	0.0 to 1.0 — fraction of pages processed
`total_pages`	int	Total page count of the PDF
`pages_processed`	int	Pages completed so far
`quality_score`	float	0–100 quality rating (available when completed)
`quality_warning`	bool	`true` if quality score is below 85
`cost_usd`	float	API cost for this job in USD
`api_calls`	int	Number of vision API calls made
`document_type`	string	Detected document type (e.g. `report`)
`error`	string	Error message if status is `failed`

Tip: Polling once every 2 seconds is a sensible default. Most documents complete in under 30 seconds.
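The polling advice above can be sketched as a loop with an overall timeout, so that a stuck job does not block a script forever. The function name, the 300-second default timeout, and the injectable `fetch_status`, `sleep`, and `clock` parameters are assumptions for illustration; `fetch_status(job_id)` is expected to return the decoded JSON body of GET /status/{job_id}.

```python
import time


def wait_for_job(fetch_status, job_id, interval=2.0, timeout=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job completes, fails, or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        status = fetch_status(job_id)
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"Job {job_id} failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError(f"Job {job_id} still {status['status']} after {timeout}s")
        sleep(interval)
```

Passing the HTTP call in as a function keeps the loop independent of any particular client library and makes it straightforward to test. After it returns, `quality_warning` in the result indicates whether the quality score fell below 85.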


Get results

GET /api/v1/result/{job_id}

Retrieve the parsed output of a completed job.

Returns the structured result wrapped in a response envelope with metadata.

Parameters

Parameter	Type	Default	Description
`format`	query string	`json`	Output format: `json`, `md`, or `txt`
`include_page_markers`	query bool	`false`	For `format=md`/`txt`: add explicit page boundary markers to the response.
`include_source_path`	query bool	`false`	For `format=json`: include top-level `source_path`.
`include_document_metadata`	query bool	`false`	For `format=json`: include top-level `document_metadata`.
`include_page_metadata`	query bool	`false`	For `format=json`: include page-level metadata fields (e.g. `page_number`, page type, dimensions, links).
`include_item_metadata`	query bool	`false`	For `format=json`: include item-level metadata fields (e.g. `bbox`, section metadata, table quality fields).
`include_processing_metadata`	query bool	`false`	For `format=json`: include top-level `processing` metadata.

Response (format=json)

{
  "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
  "status": "completed",
  "result": { /* full parsed JSON — see JSON structure */ },
  "markdown": null,
  "text": null,
  "quality_score": 97.5,
  "metadata": {
    "filename": "report.pdf",
    "pages": 42,
    "cost_usd": 0.0234,
    "api_calls": 8
  },
  "document_type": "report"
}

When you request format=md, the markdown field contains the full document text and result is null. Likewise, format=txt fills the text field and leaves result null.

Returns HTTP 409 if the job is not yet completed, or if it has failed. Always check the status endpoint first.
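Because only one of `result`, `markdown`, and `text` is populated per request, client code can select the right field from the requested format. The helper below is a sketch with a hypothetical name; it assumes the envelope has already been decoded from JSON into a dict.

```python
FIELD_FOR_FORMAT = {"json": "result", "md": "markdown", "txt": "text"}


def extract_payload(envelope, fmt="json"):
    """Return the populated field of a /result envelope for the requested format."""
    if envelope.get("status") != "completed":
        raise ValueError(f"Job is {envelope.get('status')!r}, not completed")
    field = FIELD_FOR_FORMAT[fmt]
    payload = envelope.get(field)
    if payload is None:
        raise ValueError(f"Envelope has no {field!r} content for format={fmt}")
    return payload
```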


Download raw file

GET /api/v1/result/{job_id}/raw

Download the raw output file directly, without the JSON envelope.

Returns the file content with the appropriate Content-Type — useful for piping straight to disk.

Parameters

Parameter	Type	Default	Description
`format`	query string	`md`	`json`, `md`, or `txt`
`include_page_markers`	query bool	`false`	For `format=md`/`txt`: add explicit page boundary markers to raw output.
`include_source_path`	query bool	`false`	For `format=json`: include top-level `source_path`.
`include_document_metadata`	query bool	`false`	For `format=json`: include top-level `document_metadata`.
`include_page_metadata`	query bool	`false`	For `format=json`: include page-level metadata fields.
`include_item_metadata`	query bool	`false`	For `format=json`: include item-level metadata fields.
`include_processing_metadata`	query bool	`false`	For `format=json`: include top-level processing metadata.

Example

# Download markdown directly to a file with page markers
curl "https://docule.dev/api/v1/result/JOB_ID/raw?format=md&include_page_markers=true" \
  -H "X-API-Key: YOUR_KEY" \
  -o report.md

Content types

format=json → application/json
format=md → text/plain
format=txt → text/plain


List jobs

GET /api/v1/jobs

List your recent parsing jobs, newest first.

Parameters

Parameter	Type	Default	Description
`status`	query string	—	Filter by status: `queued`, `processing`, `completed`, `failed`
`limit`	query int	`50`	Max results to return (max 200)

Response

[
  {
    "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
    "status": "completed",
    "filename": "report.pdf",
    "created_at": "2026-04-07T12:05:30.123Z",
    "completed_at": "2026-04-07T12:05:45.789Z",
    "pages": 42,
    "quality_score": 97.5,
    "cost_usd": 0.0234
  }
]
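As an illustration of working with this response, the list can be reduced to per-status counts plus the total pages and cost of completed jobs. The function name is hypothetical, and the sketch assumes that jobs which have not completed may carry null `pages` or `cost_usd` values, which the docs do not confirm.

```python
from collections import Counter


def summarize_jobs(jobs):
    """Aggregate the list returned by GET /jobs into counts, pages, and cost."""
    completed = [j for j in jobs if j["status"] == "completed"]
    return {
        "by_status": dict(Counter(j["status"] for j in jobs)),
        # "or 0" guards against null values (an assumption about unfinished jobs)
        "pages": sum(j.get("pages") or 0 for j in completed),
        "cost_usd": round(sum(j.get("cost_usd") or 0.0 for j in completed), 4),
    }
```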


Health check

GET /api/v1/health

Check API status and queue depth. No authentication required.

Response

{
  "status": "ok",
  "version": "0.1.0",
  "active_jobs": 3,
  "queued_jobs": 0
}

Fields

Field	Type	Description
`status`	string	Always `"ok"` if the service is running
`version`	string	Current API version
`active_jobs`	int	Number of jobs currently being processed
`queued_jobs`	int	Number of jobs waiting in queue


Output formats

Three formats for three different use cases.

Every parsed document can be retrieved in three formats. Specify which to generate using the formats parameter on /parse:

POST /api/v1/parse?formats=json,md,txt

JSON (`json`)

Structured data. By default the export contains content only (pages, text, markdown, items); optional metadata toggles add extra context when you need it.

Best for: programmatic access, RAG pipelines, search indexing with rich metadata.

See the full schema at JSON structure.

Markdown (`md`)

Clean, readable text with tables in pipe format. Page markers are optional via query parameter.

Best for: human review, LLM context windows, documentation pipelines.

Plain text (`txt`)

Raw text without formatting or markers.

Best for: full-text search indexing, simple analytics.


JSON structure

The full schema returned by the parser.

The JSON output (the result field of the /result response) defaults to strict content mode:

{
  "pages": [
    {
      "text": "Income Statement\nItem 2025 2024...",
      "markdown": "### Income Statement\n| Item | 2025 | 2024 |\n...",
      "items": [
        {
          "type": "heading",
          "markdown": "### Income Statement",
          "text": "Income Statement",
          "level": 3
        },
        {
          "type": "table",
          "markdown": "| Item | 2025 | 2024 |\n|---|---|---|\n...",
          "text": "Item 2025 2024...",
          "table": {
            "rows": [["Revenue", "1,200", "1,050"]]
          }
        }
      ]
    }
  ]
}

Optional metadata can be added with query parameters on /result/{job_id} or /result/{job_id}/raw:

GET /api/v1/result/{job_id}?format=json&include_page_metadata=true&include_item_metadata=true
GET /api/v1/result/{job_id}?format=json&include_source_path=true&include_document_metadata=true
GET /api/v1/result/{job_id}?format=json&include_processing_metadata=true

To include page boundary markers in text outputs:

GET /api/v1/result/{job_id}?format=md&include_page_markers=true
GET /api/v1/result/{job_id}/raw?format=txt&include_page_markers=true

Page metadata fields (optional)

These fields appear on each page when include_page_metadata=true. Among them, page_type records the semantic class the parser assigned to the page:

Field	Description
`page_number`	Original 1-based page index in the PDF
`page_type`	Detected semantic class (for example `narrative`, `financial_table`, `glossary`)
`width`, `height`	Page dimensions
`links`, `images`, `charts`	Detected page assets/references
`page_header_markdown`, `page_footer_markdown`	Detected header/footer content when available
`normalized_text`	Normalized text representation for matching/search
`definitions`	Structured glossary definitions when page type is glossary

Top-level metadata fields (optional)

Field	Description
`source_path`	Original file path in parser environment (`include_source_path=true`)
`document_metadata`	Extracted document-level metadata (`include_document_metadata=true`)
`processing`	Processing stats (`total_processing_time_ms`, `api_calls_made`, `estimated_cost_usd`) when `include_processing_metadata=true`

Item types

The items array contains structured content blocks:

Type	Fields	Description
`heading`	`markdown`, `text`, `level`	Section heading with level (1–6)
`text`	`markdown`, `text`	Paragraph or text block
`table`	`markdown`, `text`, `table.rows`	Table with structured row data. Optional table quality fields are included with `include_item_metadata=true`.
`image`	`markdown`, `text`, `image.url`, `image.alt`	Embedded image reference
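To show how the items array might be consumed, the sketch below walks every page and collects structured table rows. Both function names are hypothetical; the field names (`pages`, `items`, `type`, `table.rows`) are the ones documented above.

```python
def iter_items(result, item_type=None):
    """Yield (page_index, item) pairs, optionally filtered by item type.

    page_index is the 0-based position in the pages array, not page_number.
    """
    for page_index, page in enumerate(result.get("pages", [])):
        for item in page.get("items", []):
            if item_type is None or item.get("type") == item_type:
                yield page_index, item


def table_rows(result):
    """Collect the structured rows of every table in the document."""
    return [row
            for _, item in iter_items(result, "table")
            for row in item["table"]["rows"]]
```

Here `result` is the parsed JSON document, that is, the `result` field of the /result response.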


Python example

Upload a PDF, wait for parsing, and save the result.

This complete example uses only the requests library — no SDK required.

import requests
import time

API_KEY = "pp_live_your_key_here"
BASE = "https://docule.dev/api/v1"
HEADERS = {"X-API-Key": API_KEY}

# 1. Upload PDF
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE}/parse",
        headers=HEADERS,
        files={"file": f},
        params={"formats": "json,md"},
    )
    resp.raise_for_status()

job_id = resp.json()["job_id"]
print(f"Job submitted: {job_id}")

# 2. Poll until complete
while True:
    resp = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS)
    resp.raise_for_status()  # stop on HTTP errors rather than parsing an error body
    status = resp.json()

    print(f"  {status['status']} — {status['progress']:.0%}")

    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise RuntimeError(f"Job failed: {status['error']}")

    time.sleep(2)

# 3. Get JSON result
resp = requests.get(
    f"{BASE}/result/{job_id}",
    headers=HEADERS,
    params={"format": "json"},
)
resp.raise_for_status()
result = resp.json()

print(f"Pages: {result['metadata']['pages']}")
print(f"Quality: {result['quality_score']}")

for page in result["result"]["pages"]:
    print(page["markdown"][:200])

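# 4. Download markdown to file (with page markers)
resp = requests.get(
    f"{BASE}/result/{job_id}/raw",
    headers=HEADERS,
    params={"format": "md", "include_page_markers": "true"},
)
resp.raise_for_status()

# Write the raw bytes so the text encoding is preserved exactly as sent
with open("report.md", "wb") as f:
    f.write(resp.content)
print("Saved report.md")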


cURL example

End-to-end shell script using cURL and jq.

#!/bin/bash
# Submit a PDF and download the parsed markdown

API_KEY="YOUR_KEY"
BASE="https://docule.dev/api/v1"

# Submit the PDF
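JOB=$(curl -s -X POST "$BASE/parse" \
  -H "X-API-Key: $API_KEY" \
  -F "file=@report.pdf" | jq -r '.job_id')

# Stop early if the submission failed (jq prints "null" when job_id is absent)
[ -z "$JOB" ] || [ "$JOB" = "null" ] && { echo "Submit failed"; exit 1; }

echo "Job: $JOB"

# Poll until done
while true; do
  STATUS=$(curl -s "$BASE/status/$JOB" \
    -H "X-API-Key: $API_KEY" | jq -r '.status')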
  echo "Status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && echo "Failed!" && exit 1
  sleep 2
done

# Download the markdown with optional page markers
curl -s "$BASE/result/$JOB/raw?format=md&include_page_markers=true" \
  -H "X-API-Key: $API_KEY" -o report.md

echo "Saved report.md"

Requires: jq for JSON parsing. Install via apt install jq, brew install jq, or choco install jq.


Rate limits

Plan tiers, quotas, and overage pricing.

Plan	Pages / month	Requests / sec	Overage
Free	100	1	—
Starter (€29/mo)	1,500	3	€0.018 / page
Pro (€49/mo)	4,000	10	€0.014 / page
Business (€199/mo)	20,000	20	€0.009 / page

How counting works

One page equals one credit. A 42-page PDF consumes 42 credits.
Quotas reset on the first day of each calendar month.
Free accounts have no overage — requests are rejected once the quota is hit.
Paid plans can opt in to on-demand usage from Settings → Billing. When enabled, any pages beyond the plan's monthly allowance are added to your next Stripe invoice automatically at the overage rate above. We email you the first time in each month that a job crosses into on-demand usage.
On-demand usage is off by default, so you are never billed beyond your plan unless you explicitly turn it on.
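The counting rules can be expressed as a small calculation. This is a sketch, not the billing implementation: the plan keys and function name are hypothetical, the allowances and rates are copied from the table above, and `Decimal` is used to avoid floating-point rounding in currency amounts.

```python
from decimal import Decimal

# plan -> (included pages per month, overage rate in EUR per page, or None)
PLANS = {
    "free": (100, None),
    "starter": (1500, Decimal("0.018")),
    "pro": (4000, Decimal("0.014")),
    "business": (20000, Decimal("0.009")),
}


def overage_eur(plan, pages_used, on_demand_enabled=False):
    """Return the on-demand charge in EUR for one month's usage."""
    included, rate = PLANS[plan]
    extra = max(0, pages_used - included)
    # No charge within the allowance, on the free plan, or without opt-in
    if extra == 0 or rate is None or not on_demand_enabled:
        return Decimal("0")
    return extra * rate
```

For example, a Starter account that parses 1,700 pages with on-demand usage enabled exceeds its allowance by 200 pages, so the charge is 200 × €0.018 = €3.60.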

Concurrency

Maximum 1 concurrent job per user across all plans. Submitting a second while one is in progress returns HTTP 429.

What triggers HTTP 429

Exceeding your requests-per-second limit
Exceeding your monthly page quota (free plan only)
Submitting more than 1 job concurrently


Error codes

HTTP status codes and how to handle them.

Code	Meaning
`400`	Invalid request — not a PDF, file too large (max 50 MB), or malformed request
`401`	Missing or invalid API key
`403`	No active subscription, or not authorized to access this job
`404`	Job not found, or requested format not available
`409`	Job not yet completed or has failed — check status first
`429`	Rate limit exceeded, monthly quota exhausted, or too many concurrent jobs
`503`	Service unavailable — the parsing engine is starting up

Error response format

All error responses share this shape:

{
  "error": "Rate limit exceeded",
  "detail": "Rate limit exceeded (3 req/sec). Upgrade your plan for higher limits."
}

Recommended handling

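429: back off and retry, using exponential backoff starting at 1 second. If the detail field reports an exhausted monthly quota, retrying will not help until the quota resets.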
409 on /result: the job isn't ready — keep polling /status.
5xx: retry with backoff. Check /health if errors persist.
4xx (other): fix your request — these will not succeed on retry.
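The handling rules above can be sketched as a status-code classifier plus a backoff schedule. The function names, the doubling factor, and the 30-second cap are assumptions for illustration; only the 1-second starting delay comes from the recommendation itself.

```python
def next_action(status_code):
    """Map an HTTP status code to the handling recommended above."""
    if status_code == 429 or status_code >= 500:
        return "retry"   # back off, then try again
    if status_code == 409:
        return "poll"    # job not ready: keep polling /status
    if status_code >= 400:
        return "fail"    # fix the request; retrying will not help
    return "ok"


def backoff_delays(attempts, base=1.0, factor=2.0, cap=30.0):
    """Exponential backoff schedule in seconds, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]
```

With the defaults, `backoff_delays(4)` gives waits of 1, 2, 4, and 8 seconds. Adding random jitter to each delay is a common refinement when many clients retry at once.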

Need help? If an error persists, check /api/v1/health for service status or contact support@docule.dev.

