# Docule — Complete Technical Reference

> PDF parsing API that converts complex documents into structured Markdown, JSON, and plain text. Specializes in financial tables, annual reports, invoices, and multi-column layouts.

Docule is a document intelligence API at [docule.dev](https://docule.dev). It provides production-quality conversion of PDFs to structured data via a REST API.

## Overview

Docule extracts structured data from PDF documents, particularly complex financial documents where traditional OCR and text extraction tools fail. Its multi-stage pipeline combines layout detection, table extraction, and quality validation to produce accurate, structured output.

### Why Docule?

1. **Accuracy on complex tables**: Financial documents contain side-by-side layouts, merged cells, nested hierarchies, and multi-column reports. Docule handles these correctly where other parsers produce garbled output.
2. **Adaptive cost**: Simple text pages use fast extraction. Only complex pages escalate to higher-fidelity processing, so you pay for what each page needs.
3. **RAG-optimized output**: Each page includes keyword metadata and context headers so downstream applications always know what they're looking at.
4. **Built-in quality assurance**: Each table goes through column consistency checks, header detection, and sum verification. Pages that fail these checks are automatically re-processed at higher fidelity.
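
To make the quality checks named above more concrete, here is a minimal client-side sketch of a column consistency check and a sum verification over table rows. It is not Docule's internal implementation; the function names (`columns_consistent`, `parse_amount`, `total_matches`) and the number-parsing rules are assumptions chosen for illustration.

```python
from decimal import Decimal


def columns_consistent(rows):
    """Return True when every row has the same number of cells."""
    return len({len(row) for row in rows}) <= 1


def parse_amount(cell):
    """Parse a figure such as "1,200" or "(350)" (accounting-style negative).

    Assumption: comma is a thousands separator and parentheses mean negative.
    """
    text = cell.strip().replace(",", "")
    if text.startswith("(") and text.endswith(")"):
        text = "-" + text[1:-1]
    return Decimal(text)


def total_matches(rows, column, total_row=-1):
    """Check that one row equals the sum of all other rows in a column."""
    values = [parse_amount(row[column]) for row in rows]
    total = values.pop(total_row)
    return sum(values) == total
```

A table whose last row is a total can then be checked with `columns_consistent(rows) and total_matches(rows, column=1)` before it is passed downstream.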

## API Reference

Base URL: `https://docule.dev/api/v1`

Authentication: API key via `X-API-Key: YOUR_API_KEY` header. Generate keys from the [API keys dashboard](https://docule.dev/dashboard/api-keys) after signing up.

### Endpoints

#### Submit a PDF for parsing

```
POST /api/v1/parse
Content-Type: multipart/form-data
X-API-Key: YOUR_API_KEY

Parameters:
- file (required, form field): PDF file to parse (max 50 MB)
- formats (query, optional): Comma-separated output formats: json, md, txt (default: "json,md")
- no_vision (query, optional): Disable vision API calls (default: false)
- force_vision (query, optional): Force vision for all pages (default: false)
- agentic (query, optional): Enable agentic quality scan (default: false)
- document_type (query, optional): auto or report (default: "auto")
```

Response: `202 Accepted` with `{"job_id": "...", "status": "queued", "created_at": "..."}`.
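
The query parameters listed above can be validated on the client before a request is sent. The sketch below is one way to do that; `build_parse_params` is a hypothetical helper, not part of any Docule SDK. It assumes booleans are sent as lowercase `true`/`false` strings and that combining `no_vision` with `force_vision` is a mistake worth rejecting locally (how the server treats that combination is not documented here).

```python
ALLOWED_FORMATS = ("json", "md", "txt")
ALLOWED_DOCUMENT_TYPES = ("auto", "report")


def build_parse_params(
    formats=("json", "md"),
    no_vision=False,
    force_vision=False,
    agentic=False,
    document_type="auto",
):
    """Validate parse options and return them as a query-parameter dict."""
    formats = list(formats)
    unknown = [name for name in formats if name not in ALLOWED_FORMATS]
    if not formats or unknown:
        raise ValueError(f"formats must be a non-empty subset of {ALLOWED_FORMATS}")
    if document_type not in ALLOWED_DOCUMENT_TYPES:
        raise ValueError(f"document_type must be one of {ALLOWED_DOCUMENT_TYPES}")
    if no_vision and force_vision:
        # Assumption: rejected client-side; server behavior is not documented.
        raise ValueError("no_vision and force_vision contradict each other")

    def flag(value):
        # Assumption: the API expects lowercase "true"/"false".
        return "true" if value else "false"

    return {
        "formats": ",".join(formats),
        "no_vision": flag(no_vision),
        "force_vision": flag(force_vision),
        "agentic": flag(agentic),
        "document_type": document_type,
    }
```

The returned dict can be passed as `params=` to `requests.post` alongside the multipart `file` field.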

#### Check job status

```
GET /api/v1/status/{job_id}
X-API-Key: YOUR_API_KEY

Response: Job state, progress, page counts, cost, quality score.
```

#### Get the parsed result

```
GET /api/v1/result/{job_id}?format=json
X-API-Key: YOUR_API_KEY

format query parameter: json (default), md, or txt
Response: Structured result with pages, tables, metadata.
```

#### Download raw output

```
GET /api/v1/result/{job_id}/raw?format=md
X-API-Key: YOUR_API_KEY

format query parameter: md (default), json, or txt
Response: The file contents as text/plain or application/json.
```

#### List recent jobs

```
GET /api/v1/jobs?limit=50&status=completed
X-API-Key: YOUR_API_KEY

limit: 1-200 (default 50)
status: optional filter by job status
Response: Array of recent jobs scoped to the authenticated account.
```

#### Health check

```
GET /api/v1/health

(No auth required.)
Response: {"active_jobs": N, "queued_jobs": N}
```

### Example Usage (Python)

```python
import time

import requests

API = "https://docule.dev/api/v1"
headers = {"X-API-Key": "YOUR_API_KEY"}

# 1. Submit a PDF for parsing (the context manager closes the file afterwards)
with open("report.pdf", "rb") as pdf:
    response = requests.post(
        f"{API}/parse",
        headers=headers,
        files={"file": pdf},
        params={"formats": "json,md"},
    )
response.raise_for_status()  # surface 4xx/5xx errors instead of a KeyError below
job_id = response.json()["job_id"]

# 2. Poll until the job reaches a terminal state
while True:
    status = requests.get(f"{API}/status/{job_id}", headers=headers).json()
    if status["status"] in ("completed", "failed"):
        break
    time.sleep(2)
if status["status"] == "failed":
    raise RuntimeError(f"Job {job_id} failed")

# 3. Fetch the structured result
response = requests.get(
    f"{API}/result/{job_id}",
    headers=headers,
    params={"format": "json"},
)
response.raise_for_status()
for page in response.json()["result"]["pages"]:
    print(page["markdown"])
```
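
The polling loop above runs forever if a job never reaches a terminal state. A bounded variant with a timeout and a growing delay might look like the sketch below. `wait_for_job` and its parameters are hypothetical; `fetch_status` is any callable that returns the parsed JSON of `GET /status/{job_id}`, which keeps the helper testable without network access.

```python
import time


def wait_for_job(
    fetch_status,
    job_id,
    timeout=300.0,
    interval=2.0,
    max_interval=30.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Poll until the job is completed or failed, or raise TimeoutError."""
    deadline = clock() + timeout
    while True:
        status = fetch_status(job_id)
        if status["status"] in ("completed", "failed"):
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job {job_id} not finished after {timeout} s")
        sleep(interval)
        # Back off gradually so long jobs do not consume the request rate.
        interval = min(interval * 1.5, max_interval)
```

With `requests`, `fetch_status` could be `lambda job_id: requests.get(f"{API}/status/{job_id}", headers=headers).json()`.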

### Example Usage (cURL)

```bash
# Submit
curl -X POST https://docule.dev/api/v1/parse \
  -H "X-API-Key: YOUR_API_KEY" \
  -F "file=@report.pdf"

# Check status
curl https://docule.dev/api/v1/status/JOB_ID \
  -H "X-API-Key: YOUR_API_KEY"

# Fetch result
curl https://docule.dev/api/v1/result/JOB_ID \
  -H "X-API-Key: YOUR_API_KEY"

# Fetch markdown with page markers
curl "https://docule.dev/api/v1/result/JOB_ID/raw?format=md&include_page_markers=true" \
  -H "X-API-Key: YOUR_API_KEY"
```

### Error codes

| Code | Meaning |
|---|---|
| 400 | Bad request (e.g. non-PDF file, file too large) |
| 401 | Missing or invalid `X-API-Key` header |
| 403 | No active subscription, or job does not belong to your account |
| 404 | Job not found |
| 409 | Result not available: the job has not completed yet, or it failed |
| 429 | Rate limit exceeded, concurrent job limit, or monthly quota exceeded |
| 503 | Service not ready |
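
One way a client might act on these codes is sketched below. The mapping is an assumption about sensible client behavior, not documented server guidance: `429` and `503` are treated as transient and retried with exponential backoff, `409` sends the caller back to the status endpoint, and every other error is treated as final. The names `next_action` and `backoff_delays` are hypothetical.

```python
def next_action(status_code):
    """Map an HTTP status code to a coarse client action."""
    if 200 <= status_code < 300:
        return "ok"
    if status_code in (429, 503):
        return "retry"  # transient: wait, then repeat the same request
    if status_code == 409:
        return "poll_status"  # result not ready: check /status/{job_id}
    return "fail"  # 400, 401, 403, 404, ...: retrying will not help


def backoff_delays(attempts, base=1.0, cap=30.0):
    """Return exponential backoff delays in seconds, capped at `cap`."""
    return [min(cap, base * 2**attempt) for attempt in range(attempts)]
```

A caller would sleep for each delay in `backoff_delays(5)` in turn while `next_action` keeps returning `"retry"`, and give up once the list is exhausted.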

### Rate limits

Per-plan request rates (applied to `POST /parse` and read endpoints):

| Plan | Requests/second | Monthly pages |
|---|---|---|
| Free | 1 | 100 |
| Starter | 3 | 2,000 |
| Pro | 10 | 6,000 |
| Business | 20 | 15,000 |

Max 1 concurrent parsing job per user. Max file size: 50 MB per PDF. Max files per job submission: 5,000.
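
To stay under the per-plan request rate, a client can space its own calls. The sketch below copies the limits from the table above into constants and implements a simple throttle; the `Throttle` class and `fits_monthly_quota` helper are hypothetical, and the server's actual rate-limiting algorithm (fixed window, token bucket, or other) is not documented here.

```python
import time

REQUESTS_PER_SECOND = {"free": 1, "starter": 3, "pro": 10, "business": 20}
MONTHLY_PAGES = {"free": 100, "starter": 2_000, "pro": 6_000, "business": 15_000}


class Throttle:
    """Space calls evenly so they never exceed the plan's requests/second."""

    def __init__(self, plan, clock=time.monotonic, sleep=time.sleep):
        self.interval = 1.0 / REQUESTS_PER_SECOND[plan]
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = float("-inf")

    def wait(self):
        """Block until the next request is allowed; return the delay used."""
        now = self._clock()
        delay = max(0.0, self._next_allowed - now)
        if delay > 0:
            self._sleep(delay)
        self._next_allowed = max(now, self._next_allowed) + self.interval
        return delay


def fits_monthly_quota(plan, pages_used, pages_in_job):
    """Return True when the job would not exceed the plan's monthly pages."""
    return pages_used + pages_in_job <= MONTHLY_PAGES[plan]
```

Calling `throttle.wait()` before each request keeps a single-process client within its plan; multiple processes sharing one API key would need a shared limiter instead.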

## Document Types

Docule handles all PDF document types, with particular strength in:

| Document Type | Key Capabilities |
|---|---|
| Annual reports | Financial tables, segment reports, notes with multi-column layouts |
| Invoices | Line items, totals, VAT calculation, vendor/buyer metadata |
| Bank statements | Transaction tables, running balances, multi-period statements |
| Contracts | Clause extraction, structured sections, appendix handling |
| Sustainability reports | ESRS/CSRD tables, KPI extraction, compliance data |
| SEC filings | 10-K/10-Q tables, XBRL-style structured data |
| Research papers | Academic tables, figure captions, citation sections |
| Tax filings | Form field extraction, schedule tables |

## Output Format

### JSON Response Structure (strict default)

```json
{
  "job_id": "abc123",
  "status": "completed",
  "result": {
    "pages": [
      {
        "text": "Income Statement\nItem 2025 2024...",
        "markdown": "### Income Statement\n| Item | 2025 | 2024 |\n...",
        "items": [
          {
            "type": "heading",
            "markdown": "### Income Statement",
            "text": "Income Statement",
            "level": 3
          },
          {
            "type": "table",
            "markdown": "| Item | 2025 | 2024 |\n|---|---|---|\n...",
            "text": "Item 2025 2024...",
            "table": {
              "rows": [["Revenue", "1,200", "1,050"]]
            }
          }
        ]
      }
    ]
  },
  "metadata": {
    "filename": "report.pdf",
    "pages": 12,
    "cost_usd": 0.08,
    "api_calls": 3
  }
}
```
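
Given the structure above, tables and headings can be pulled out of a response with a short traversal. The sketch below assumes only the fields shown in the example (`result.pages[].items[]` with `type`, `text`, `level`, and `table.rows`); the 1-based page numbering is an assumption that `pages` is ordered as in the source document, and both function names are hypothetical.

```python
def iter_tables(response):
    """Yield (page_number, rows) for every table item in a parsed result."""
    pages = response["result"]["pages"]
    for page_number, page in enumerate(pages, start=1):
        for item in page.get("items", []):
            if item.get("type") == "table":
                yield page_number, item["table"]["rows"]


def iter_headings(response):
    """Yield (page_number, level, text) for every heading item."""
    pages = response["result"]["pages"]
    for page_number, page in enumerate(pages, start=1):
        for item in page.get("items", []):
            if item.get("type") == "heading":
                yield page_number, item["level"], item["text"]
```

For example, `dict(iter_tables(result))` gives one table per page when pages contain at most one table each; use `list(...)` to keep all of them.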

Optional metadata can be enabled per request:

- `include_source_path=true`
- `include_document_metadata=true`
- `include_page_metadata=true`
- `include_item_metadata=true`
- `include_processing_metadata=true`

### Markdown Output

Clean GFM (GitHub Flavored Markdown) with proper table formatting, headers preserved from the source document, and hierarchical structure maintained.

## Pricing

Credit-based pricing with a free tier and paid plans for higher volume. One page equals one credit. See [docule.dev](https://docule.dev) for current plans and pricing.

## Links

- Website: https://docule.dev
- Documentation: https://docule.dev/docs
- API Base: https://docule.dev/api/v1
- Privacy Policy: https://docule.dev/privacy
- Terms of Service: https://docule.dev/terms
- Contact: support@docule.dev