Documentation

Everything you need to parse PDFs into structured markdown and JSON.

Docule turns PDFs into clean markdown and richly-structured JSON. The API is asynchronous: you submit a file, poll a status endpoint, then download the result when it's ready.

Base URL

All API requests use the following base URL:

https://docule.dev/api/v1

How it works

  1. Submit a PDF to POST /parse. You get a job_id back immediately.
  2. Poll GET /status/{job_id} until status becomes "completed".
  3. Download the parsed output from GET /result/{job_id} in JSON, markdown, or plain text.

Quick start

Parse your first PDF in three steps.

1. Get an API key

Sign up and grab a key from the API keys dashboard. Free accounts get 100 pages per month.

2. Submit a PDF

POST your file to /parse. The API returns a job_id immediately — parsing happens in the background.

# Upload your PDF
curl -X POST https://docule.dev/api/v1/parse \
  -H "X-API-Key: YOUR_KEY" \
  -F "file=@report.pdf"
# Response (HTTP 202)
{
  "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
  "status": "queued",
  "created_at": "2026-04-07T12:05:30.123Z",
  "message": "Job submitted. Poll GET /api/v1/status/{job_id}"
}

3. Poll status, then download the result

Poll the status endpoint until status is "completed", then fetch the parsed output.

# Poll until completed
curl https://docule.dev/api/v1/status/JOB_ID \
  -H "X-API-Key: YOUR_KEY"

# Download the parsed result
curl "https://docule.dev/api/v1/result/JOB_ID?format=json" \
  -H "X-API-Key: YOUR_KEY"

That's it. Continue to the Python or cURL examples for a full end-to-end script.

Authentication

Authenticate every request with your API key.

Every endpoint except the health check requires an API key, passed in the X-API-Key HTTP header.

curl https://docule.dev/api/v1/health \
  -H "X-API-Key: pp_live_your_key_here"

Getting your key

After signing up, generate a key from the API keys dashboard. You can create multiple keys and revoke them at any time.

Keep your key secret

Your API key grants full access to your account's parsing quota. Treat it like a password:

  • Never commit keys to git or paste them into client-side code.
  • Use environment variables in production.
  • Rotate keys if you suspect compromise.
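
A minimal sketch of the environment-variable approach. The variable name DOCULE_API_KEY and both helper functions are hypothetical, chosen only for this illustration; the API itself only defines the X-API-Key header.

```python
import os


def load_api_key(env_var: str = "DOCULE_API_KEY") -> str:
    """Read the API key from the environment instead of hard-coding it.

    DOCULE_API_KEY is a hypothetical variable name used for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first")
    return key


def auth_headers(api_key: str) -> dict:
    """Build the X-API-Key header sent with authenticated requests."""
    return {"X-API-Key": api_key}
```

With this in place, a script can call auth_headers(load_api_key()) once at startup and reuse the resulting dictionary for every request, so the key never appears in source control.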

Errors

| Code | Meaning |
|---|---|
| 401 | Missing or invalid X-API-Key header |
| 403 | No active subscription |

Parse a document

POST /api/v1/parse

Upload a PDF for parsing. Returns a job ID for status polling.

Processing is asynchronous — you'll get a job_id back immediately, then poll /status until the job completes.

Request

Send the PDF as multipart/form-data in the file field. All other parameters are optional and are passed in the query string:

| Parameter | Type | Default | Description |
|---|---|---|---|
| file | file | | Required. PDF file to parse. Max 50 MB. |
| formats | query string | json,md | Comma-separated output formats: json, md, txt |
| no_vision | query bool | false | Disable vision API calls (faster, lower quality for complex pages) |
| force_vision | query bool | false | Force vision processing for every page (higher quality, higher cost) |
| document_type | query string | auto | Document type: auto, report. Legacy aliases invoice/contract are accepted. |
| include_page_markers | query bool | false | Add explicit page markers to markdown/text outputs (e.g. START OF PAGE / END OF PAGE). |
| include_source_metadata | query bool | false | Include source_path in JSON output. |
| include_document_metadata | query bool | false | Include extracted document_metadata in JSON output. |
| include_page_metadata | query bool | false | Include page-level metadata in JSON (page number/type, links, dimensions, header/footer). |
| include_item_metadata | query bool | false | Include item-level metadata in JSON (bbox, section fields, table quality fields). |
| include_processing_metadata | query bool | false | Include processing metadata in JSON (total_processing_time_ms, api_calls_made, estimated_cost_usd). |
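
As a rough illustration of how these parameters combine, the sketch below assembles the query string for POST /parse. build_parse_params is a hypothetical helper, not part of any Docule SDK, and sending booleans as lowercase "true" is an assumption based on the spelling used in this documentation's URL examples.

```python
# Boolean flags listed in the /parse parameter table.
PARSE_FLAGS = frozenset({
    "no_vision",
    "force_vision",
    "include_page_markers",
    "include_source_metadata",
    "include_document_metadata",
    "include_page_metadata",
    "include_item_metadata",
    "include_processing_metadata",
})


def build_parse_params(formats=("json", "md"), document_type="auto", **flags):
    """Return a query-parameter dict for POST /parse (hypothetical helper)."""
    unknown = set(flags) - PARSE_FLAGS
    if unknown:
        raise ValueError(f"Unknown parse flags: {sorted(unknown)}")

    params = {"formats": ",".join(formats), "document_type": document_type}
    for name, enabled in flags.items():
        if enabled:
            # Every flag defaults to false, so only enabled ones are sent.
            params[name] = "true"
    return params
```

The returned dictionary can be passed as the params argument of requests.post alongside files={"file": f}. Rejecting unknown flag names locally catches typos before a request is spent on them.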

Response (HTTP 202)

{
  "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
  "status": "queued",
  "created_at": "2026-04-07T12:05:30.123Z",
  "message": "Job submitted. Poll GET /api/v1/status/{job_id}"
}

Limits

  • Maximum file size: 50 MB
  • Maximum concurrent jobs: 1 per user (returns HTTP 429 if exceeded)
  • Monthly page quota varies by plan — see rate limits

Check job status

GET /api/v1/status/{job_id}

Poll for job progress and quality metrics.

Returns the current job status, progress, and (when complete) quality and cost metrics.

Response

{
  "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
  "status": "completed",
  "created_at": "2026-04-07T12:05:30.123Z",
  "started_at": "2026-04-07T12:05:31.456Z",
  "completed_at": "2026-04-07T12:05:45.789Z",
  "progress": 1.0,
  "total_pages": 42,
  "pages_processed": 42,
  "quality_score": 97.5,
  "quality_warning": false,
  "cost_usd": 0.0234,
  "api_calls": 8,
  "document_type": "report",
  "error": null
}

Fields

| Field | Type | Description |
|---|---|---|
| status | string | queued, processing, completed, or failed |
| progress | float | Fraction of pages processed, from 0.0 to 1.0 |
| total_pages | int | Total page count of the PDF |
| pages_processed | int | Pages completed so far |
| quality_score | float | 0–100 quality rating (available when completed) |
| quality_warning | bool | true if quality score is below 85 |
| cost_usd | float | API cost for this job in USD |
| api_calls | int | Number of vision API calls made |
| document_type | string | Detected document type (e.g. report) |
| error | string | Error message if status is failed |

Tip: Polling once every 2 seconds is a sensible default. Most documents complete in under 30 seconds.
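
The polling advice above can be sketched as a small helper with an overall timeout. wait_for_job and its fetch_status argument are hypothetical names, and the 300-second default timeout is an assumption rather than a documented limit; only the 2-second interval and the status values come from this page.

```python
import time


def wait_for_job(fetch_status, interval=2.0, timeout=300.0,
                 sleep=time.sleep, clock=time.monotonic):
    """Poll until the job completes, fails, or the timeout passes.

    fetch_status is any zero-argument callable that returns the parsed
    JSON body of GET /status/{job_id} as a dict.
    """
    deadline = clock() + timeout
    while True:
        status = fetch_status()
        if status["status"] == "completed":
            return status
        if status["status"] == "failed":
            raise RuntimeError(f"Job failed: {status.get('error')}")
        if clock() >= deadline:
            raise TimeoutError("Job did not finish before the timeout")
        sleep(interval)
```

With the requests library, fetch_status could be lambda: requests.get(f"{BASE}/status/{job_id}", headers=HEADERS).json(). Passing the sleep and clock functions as arguments keeps the loop easy to test without real waiting.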

Get results

GET /api/v1/result/{job_id}

Retrieve the parsed output of a completed job.

Returns the structured result wrapped in a response envelope with metadata.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| format | query string | json | Output format: json, md, or txt |
| include_page_markers | query bool | false | For format=md/txt: add explicit page boundary markers to the response. |
| include_source_path | query bool | false | For format=json: include top-level source_path. |
| include_document_metadata | query bool | false | For format=json: include top-level document_metadata. |
| include_page_metadata | query bool | false | For format=json: include page-level metadata fields (e.g. page_number, page type, dimensions, links). |
| include_item_metadata | query bool | false | For format=json: include item-level metadata fields (e.g. bbox, section metadata, table quality fields). |
| include_processing_metadata | query bool | false | For format=json: include top-level processing metadata. |

Response (format=json)

{
  "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
  "status": "completed",
  "result": { /* full parsed JSON — see JSON structure */ },
  "markdown": null,
  "text": null,
  "quality_score": 97.5,
  "metadata": {
    "filename": "report.pdf",
    "pages": 42,
    "cost_usd": 0.0234,
    "api_calls": 8
  },
  "document_type": "report"
}

When requesting format=md, the markdown field contains the full document text and result is null. Same for format=txt with the text field.

Returns HTTP 409 if the job is not yet completed, or if it has failed. Always check the status endpoint first.
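
Because the populated envelope field depends on the requested format, a small lookup can keep client code tidy. The following is only a sketch: extract_payload is a hypothetical helper, and it assumes the envelope has already been decoded from a successful (HTTP 200) response.

```python
# Which envelope field carries the output for each requested format.
FIELD_FOR_FORMAT = {"json": "result", "md": "markdown", "txt": "text"}


def extract_payload(envelope: dict, fmt: str = "json"):
    """Return the populated field of a /result envelope for the given format."""
    if fmt not in FIELD_FOR_FORMAT:
        raise ValueError(f"Unsupported format: {fmt!r}")
    payload = envelope.get(FIELD_FOR_FORMAT[fmt])
    if payload is None:
        # The other two fields are null when a different format was requested.
        raise LookupError(f"Envelope has no {FIELD_FOR_FORMAT[fmt]!r} content")
    return payload
```

Calling extract_payload(envelope, "md") on an envelope fetched with format=json raises LookupError, which surfaces a format mismatch early instead of passing None further down a pipeline.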

Download raw file

GET /api/v1/result/{job_id}/raw

Download the raw output file directly, without the JSON envelope.

Returns the file content with the appropriate Content-Type — useful for piping straight to disk.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| format | query string | md | json, md, or txt |
| include_page_markers | query bool | false | For format=md/txt: add explicit page boundary markers to raw output. |
| include_source_path | query bool | false | For format=json: include top-level source_path. |
| include_document_metadata | query bool | false | For format=json: include top-level document_metadata. |
| include_page_metadata | query bool | false | For format=json: include page-level metadata fields. |
| include_item_metadata | query bool | false | For format=json: include item-level metadata fields. |
| include_processing_metadata | query bool | false | For format=json: include top-level processing metadata. |

Example

# Download markdown directly to a file with page markers
curl "https://docule.dev/api/v1/result/JOB_ID/raw?format=md&include_page_markers=true" \
  -H "X-API-Key: YOUR_KEY" \
  -o report.md

Content types

  • format=json: application/json
  • format=md: text/plain
  • format=txt: text/plain

List jobs

GET /api/v1/jobs

List your recent parsing jobs, newest first.

Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| status | query string | | Filter by status: queued, processing, completed, failed |
| limit | query int | 50 | Max results to return (max 200) |

Response

[
  {
    "job_id": "20260407_120530_a1b2c3d4e5f6g7h8",
    "status": "completed",
    "filename": "report.pdf",
    "created_at": "2026-04-07T12:05:30.123Z",
    "completed_at": "2026-04-07T12:05:45.789Z",
    "pages": 42,
    "quality_score": 97.5,
    "cost_usd": 0.0234
  }
]
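
One possible use of this list is a quick usage summary. The sketch below assumes that jobs which have not completed may carry null pages and cost_usd values (this page does not say either way); summarize_jobs is a hypothetical helper.

```python
from collections import Counter


def summarize_jobs(jobs):
    """Count jobs per status and total the pages and cost of completed ones."""
    by_status = Counter(job["status"] for job in jobs)
    completed = [job for job in jobs if job["status"] == "completed"]
    return {
        "by_status": dict(by_status),
        # "or 0" guards against null values (an assumption, see lead-in).
        "pages": sum(job.get("pages") or 0 for job in completed),
        "cost_usd": round(sum(job.get("cost_usd") or 0.0 for job in completed), 4),
    }
```

The input is the decoded JSON array returned by GET /jobs, for example requests.get(f"{BASE}/jobs", headers=HEADERS).json().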

Health check

GET /api/v1/health

Check API status and queue depth. No authentication required.

Response

{
  "status": "ok",
  "version": "0.1.0",
  "active_jobs": 3,
  "queued_jobs": 0
}

Fields

| Field | Type | Description |
|---|---|---|
| status | string | Always "ok" if the service is running |
| version | string | Current API version |
| active_jobs | int | Number of jobs currently being processed |
| queued_jobs | int | Number of jobs waiting in queue |

Output formats

Three formats for three different use cases.

Every parsed document can be retrieved in three formats. Specify which to generate using the formats parameter on /parse:

POST /api/v1/parse?formats=json,md,txt

JSON (json)

Structured data that defaults to strict content export (pages, text, markdown, items) with optional metadata toggles when you need extra context.

Best for: programmatic access, RAG pipelines, search indexing with rich metadata.

See the full schema at JSON structure.

Markdown (md)

Clean, readable text with tables in pipe format. Page markers are optional via query parameter.

Best for: human review, LLM context windows, documentation pipelines.

Plain text (txt)

Raw text without formatting or markers.

Best for: full-text search indexing, simple analytics.

JSON structure

The full schema returned by the parser.

The JSON output (the result field of the /result response) defaults to strict content mode:

{
  "pages": [
    {
      "text": "Income Statement\nItem 2025 2024...",
      "markdown": "### Income Statement\n| Item | 2025 | 2024 |\n...",
      "items": [
        {
          "type": "heading",
          "markdown": "### Income Statement",
          "text": "Income Statement",
          "level": 3
        },
        {
          "type": "table",
          "markdown": "| Item | 2025 | 2024 |\n|---|---|---|\n...",
          "text": "Item 2025 2024...",
          "table": {
            "rows": [["Revenue", "1,200", "1,050"]]
          }
        }
      ]
    }
  ]
}

Optional metadata can be added with query parameters on /result/{job_id} or /result/{job_id}/raw:

GET /api/v1/result/{job_id}?format=json&include_page_metadata=true&include_item_metadata=true
GET /api/v1/result/{job_id}?format=json&include_source_path=true&include_document_metadata=true
GET /api/v1/result/{job_id}?format=json&include_processing_metadata=true

To include page boundary markers in text outputs:

GET /api/v1/result/{job_id}?format=md&include_page_markers=true
GET /api/v1/result/{job_id}/raw?format=txt&include_page_markers=true

Page metadata fields (optional)

Each page is classified into a semantic type for smarter processing:

| Field | Description |
|---|---|
| page_number | Original 1-based page index in the PDF |
| page_type | Detected semantic class (for example narrative, financial_table, glossary) |
| width, height | Page dimensions |
| links, images, charts | Detected page assets/references |
| page_header_markdown, page_footer_markdown | Detected header/footer content when available |
| normalized_text | Normalized text representation for matching/search |
| definitions | Structured glossary definitions when page type is glossary |

Top-level metadata fields (optional)

| Field | Description |
|---|---|
| source_path | Original file path in parser environment (include_source_path=true) |
| document_metadata | Extracted document-level metadata (include_document_metadata=true) |
| processing | Processing stats (total_processing_time_ms, api_calls_made, estimated_cost_usd) when include_processing_metadata=true |

Item types

The items array contains structured content blocks:

| Type | Fields | Description |
|---|---|---|
| heading | markdown, text, level | Section heading with level (1–6) |
| text | markdown, text | Paragraph or text block |
| table | markdown, text, table.rows | Table with structured row data. Optional table quality fields are included with include_item_metadata=true. |
| image | markdown, text, image.url, image.alt | Embedded image reference |
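
As a rough illustration of walking this structure, the sketch below pulls a heading outline and all table rows out of a parsed result. Both function names are hypothetical. The page position is the index within the pages array, which is an assumption of convenience: the original page_number is only present when include_page_metadata=true.

```python
def heading_outline(result: dict):
    """Return (level, text) for every heading item, in document order."""
    return [
        (item["level"], item["text"])
        for page in result.get("pages", [])
        for item in page.get("items", [])
        if item.get("type") == "heading"
    ]


def collect_tables(result: dict):
    """Return (page_position, rows) for every table item.

    page_position is the 1-based position in the pages array, not
    necessarily the page number in the source PDF.
    """
    tables = []
    for position, page in enumerate(result.get("pages", []), start=1):
        for item in page.get("items", []):
            if item.get("type") == "table":
                tables.append((position, item["table"]["rows"]))
    return tables
```

Both functions take the value of the result field from the /result envelope, or the decoded body of /result/{job_id}/raw?format=json.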

Python example

Upload a PDF, wait for parsing, and save the result.

This complete example uses only the requests library — no SDK required.

import requests
import time

API_KEY = "pp_live_your_key_here"
BASE = "https://docule.dev/api/v1"
HEADERS = {"X-API-Key": API_KEY}

# 1. Upload PDF
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{BASE}/parse",
        headers=HEADERS,
        files={"file": f},
        params={"formats": "json,md"},
    )
    resp.raise_for_status()

job_id = resp.json()["job_id"]
print(f"Job submitted: {job_id}")

# 2. Poll until complete
while True:
    resp = requests.get(f"{BASE}/status/{job_id}", headers=HEADERS)
    resp.raise_for_status()  # surface 401/429/5xx instead of a KeyError below
    status = resp.json()

    print(f"  {status['status']} — {status['progress']:.0%}")

    if status["status"] == "completed":
        break
    if status["status"] == "failed":
        raise RuntimeError(f"Job failed: {status['error']}")

    time.sleep(2)

# 3. Get JSON result
result = requests.get(
    f"{BASE}/result/{job_id}",
    headers=HEADERS,
    params={"format": "json"},
).json()

print(f"Pages: {result['metadata']['pages']}")
print(f"Quality: {result['quality_score']}")

for page in result["result"]["pages"]:
    print(page["markdown"][:200])

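# 4. Download markdown to file (with page markers)
resp = requests.get(
    f"{BASE}/result/{job_id}/raw",
    headers=HEADERS,
    params={"format": "md", "include_page_markers": "true"},
)
resp.raise_for_status()

# Write the raw bytes so the file keeps the server's encoding unchanged
with open("report.md", "wb") as f:
    f.write(resp.content)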
print("Saved report.md")

cURL example

End-to-end shell script using cURL and jq.

#!/bin/bash
# Submit a PDF and download the parsed markdown

API_KEY="YOUR_KEY"
BASE="https://docule.dev/api/v1"

# Submit the PDF
JOB=$(curl -s -X POST "$BASE/parse" \
  -H "X-API-Key: $API_KEY" \
  -F "file=@report.pdf" | jq -r '.job_id')

echo "Job: $JOB"

# Poll until done
while true; do
  STATUS=$(curl -s "$BASE/status/$JOB" \
    -H "X-API-Key: $API_KEY" | jq -r '.status')
  echo "Status: $STATUS"
  [ "$STATUS" = "completed" ] && break
  [ "$STATUS" = "failed" ] && echo "Failed!" && exit 1
  sleep 2
done

# Download the markdown with optional page markers
curl -s "$BASE/result/$JOB/raw?format=md&include_page_markers=true" \
  -H "X-API-Key: $API_KEY" -o report.md

echo "Saved report.md"

Requires: jq for JSON parsing. Install via apt install jq, brew install jq, or choco install jq.

Rate limits

Plan tiers, quotas, and overage pricing.

| Plan | Pages / month | Requests / sec | Overage |
|---|---|---|---|
| Free | 100 | 1 | |
| Starter (€29/mo) | 1,500 | 3 | €0.018 / page |
| Pro (€49/mo) | 4,000 | 10 | €0.014 / page |
| Business (€199/mo) | 20,000 | 20 | €0.009 / page |

How counting works

  • One page equals one credit. A 42-page PDF consumes 42 credits.
  • Quotas reset on the first day of each calendar month.
  • Free accounts have no overage — requests are rejected once the quota is hit.
  • Paid plans can opt in to on-demand usage from Settings → Billing. When enabled, any pages beyond the plan's monthly allowance are added to your next Stripe invoice automatically at the overage rate above. We email you once per month the first time a job crosses into on-demand usage.
  • Opt-in is off by default, so you can never be billed beyond your plan without explicitly turning it on.
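
To make the overage arithmetic concrete, here is a small worked sketch. It assumes on-demand usage has been enabled; the PLANS dictionary simply restates the allowances and rates from the plan table, and overage_cost_eur is a hypothetical helper, not a billing API.

```python
from decimal import Decimal

# (monthly page allowance, overage rate in EUR per page), from the plan table.
PLANS = {
    "starter": (1500, Decimal("0.018")),
    "pro": (4000, Decimal("0.014")),
    "business": (20000, Decimal("0.009")),
}


def overage_cost_eur(plan: str, pages_used: int) -> Decimal:
    """Estimate the on-demand charge for pages beyond the monthly allowance."""
    allowance, rate = PLANS[plan]
    extra_pages = max(0, pages_used - allowance)
    return extra_pages * rate
```

For example, a Starter account that parses 2,000 pages in a month is 500 pages over its allowance, so the estimate is 500 × €0.018 = €9.00 on top of the plan price. Decimal is used instead of float so that the currency arithmetic stays exact.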

Concurrency

Maximum 1 concurrent job per user across all plans. Submitting a second while one is in progress returns HTTP 429.
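
Given the one-job limit, a batch of files has to be processed in series. The sketch below shows that pattern with two caller-supplied functions, submit and wait, which are hypothetical placeholders for "POST /parse and return the job_id" and "poll /status until the job finishes".

```python
def parse_sequentially(paths, submit, wait):
    """Process files one at a time so only one job is ever in flight.

    submit(path) -> job_id and wait(job_id) -> final status dict are
    supplied by the caller; both names are placeholders for this sketch.
    """
    results = {}
    for path in paths:
        job_id = submit(path)
        # Block until this job finishes before submitting the next file,
        # which avoids the HTTP 429 returned for a second concurrent job.
        results[path] = wait(job_id)
    return results
```

Running uploads from several threads or processes under the same account would defeat this ordering, so a shared queue or lock would be needed in that case.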

What triggers HTTP 429

  • Exceeding your requests-per-second limit
  • Exceeding your monthly page quota (free plan only)
  • Submitting more than 1 job concurrently

Error codes

HTTP status codes and how to handle them.

| Code | Meaning |
|---|---|
| 400 | Invalid request: not a PDF, file too large (max 50 MB), or malformed request |
| 401 | Missing or invalid API key |
| 403 | No active subscription, or not authorized to access this job |
| 404 | Job not found, or requested format not available |
| 409 | Job not yet completed or has failed; check status first |
| 429 | Rate limit exceeded, monthly quota exhausted, or too many concurrent jobs |
| 503 | Service unavailable; the parsing engine is starting up |

Error response format

All error responses share this shape:

{
  "error": "Rate limit exceeded",
  "detail": "Rate limit exceeded (3 req/sec). Upgrade your plan for higher limits."
}

Recommended handling

  • 429: back off and retry. Use exponential backoff starting at 1 second.
  • 409 on /result: the job isn't ready — keep polling /status.
  • 5xx: retry with backoff. Check /health if errors persist.
  • 4xx (other): fix your request — these will not succeed on retry.
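
The retry guidance above could be sketched as follows. call_with_backoff is a hypothetical helper; the cap of five attempts and the doubling factor are assumptions, while the 1-second starting delay and the set of retryable codes (429 and 5xx) follow the list above.

```python
import time


def call_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry a request on HTTP 429 or 5xx with exponential backoff.

    send is a zero-argument callable returning an object with a
    status_code attribute, such as a requests.Response.
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        retryable = response.status_code == 429 or response.status_code >= 500
        if not retryable:
            # Success, or a 4xx that will not succeed on retry.
            return response
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 1 s, 2 s, 4 s, ...
    return response  # still failing after the last attempt
```

A call might look like call_with_backoff(lambda: requests.get(f"{BASE}/status/{job_id}", headers=HEADERS)). The caller still has to inspect the final response, since the helper returns the last failure rather than raising.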

Need help? If an error persists, check /api/v1/health for service status or contact support@docule.dev.