autoworks-ai / extract-pdf

Signed · v1.4.0 · MIT · 2,847 bytes · updated 2d ago · maintained by 2 contributors

Extract structured text from PDF files while preserving heading hierarchy, list structure, and table layout where possible. Wraps a local parsing library — never reaches the network. Pairs naturally with ocr-image for scanned documents and extract-table for structured data.

Install via CLI

$ autovault add extract-pdf

Installs: 2,840 (+312 this week)
Active vaults: 1,920 (+204 this week)
Gate runs: all passed
Issues open: 3 / 47 closed (median 2d)
Compatibility: 4 / 4 agents (universal)

SKILL.md

---
name: extract-pdf
version: 1.4.0
description: Extract structured text from PDF files. Preserves headings, lists, tables.
author: autoworks-ai
license: MIT
tools_required:
  - read   # reads .pdf bytes from disk
  - write  # writes .txt or .json sidecar
network: none
scope:
  paths: ["./*.pdf", "./docs/**/*.pdf"]
transformations:
  claude-code: CLAUDE.md
  codex: AGENTS.md
  cursor: .cursorrules
---

Extract PDF

Extract structured text from PDF files while preserving heading hierarchy, list structure, and table layout where possible. Wraps a local pdf-parsing library — never sends bytes off-device.
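
The scope.paths globs in the frontmatter limit which files the skill may touch. The sketch below shows one way such a check could work. It is not the registry's actual enforcement logic: the helper names (glob_to_regex, in_scope) are hypothetical, and it assumes `*` matches within a single path segment while `**/` matches zero or more directories.

```python
import posixpath
import re

# Copied from the frontmatter's scope.paths entry.
SCOPE_PATHS = ["./*.pdf", "./docs/**/*.pdf"]


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a scope glob into a regex (assumed glob semantics)."""
    pattern = pattern.removeprefix("./")
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            parts.append(r"(?:[^/]+/)*")  # zero or more directories
            i += 3
        elif pattern[i] == "*":
            parts.append(r"[^/]*")  # anything except a path separator
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def in_scope(path: str, patterns=SCOPE_PATHS) -> bool:
    """Return True if a working-directory-relative path matches any scope glob."""
    # Normalize first so "docs/../../x.pdf" cannot slip past the docs/ pattern.
    rel = posixpath.normpath(path)
    return any(glob_to_regex(p).fullmatch(rel) for p in patterns)
```

Under these assumptions, ./spec-v2.pdf and ./docs/guides/intro.pdf are in scope, while ./src/report.pdf and ../secret.pdf are not.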

When to use this skill

Reach for extract-pdf when the user asks you to read, summarize, or extract content from a PDF file. It is not meant for structured data extraction on its own; for that, pair it with extract-table downstream.

Inputs

A path to a .pdf file under the user's working directory
Optional: --format=json to emit structured output
Optional: --pages=1-5 to scope extraction to a page range
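
To make the --pages value concrete, here is a minimal sketch of how a range such as 1-5 could be validated before it is passed along. The function name parse_pages is hypothetical, and the sketch assumes page numbers are 1-based and that the range is inclusive on both ends, which the skill description does not state explicitly.

```python
def parse_pages(spec: str) -> range:
    """Parse a '--pages' value like '1-5' into a range of page numbers.

    Assumes 1-based, inclusive bounds (an assumption, not documented behavior).
    """
    start_text, sep, end_text = spec.partition("-")
    if not sep:
        raise ValueError(f"expected START-END, got {spec!r}")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    return range(start, end + 1)
```

Validating the range up front gives a clear error for inputs like 5-1 rather than relying on whatever the underlying parser does with them.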

Outputs

Plain text by default, with headings preserved as # markers and lists as - bullets
JSON when --format=json is passed; includes pages, headings, paragraphs, and tables

Examples

Basic extraction

User: "What's in spec-v2.pdf?" → run extract-pdf ./spec-v2.pdf, then summarize the result.

Page range

User: "Read the first chapter of book.pdf" → estimate page range from the table of contents, run extract-pdf ./book.pdf --pages=1-22.
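
The two invocations above can be assembled programmatically. The sketch below only builds the argument list and uses the flags documented under Inputs. It assumes extract-pdf is callable as a command by that name, and the helper build_command is hypothetical. It does not run the command, since the description does not say whether results arrive on stdout or in the .txt/.json sidecar.

```python
import shlex


def build_command(pdf_path, fmt=None, pages=None):
    """Build an argv list for extract-pdf from the documented inputs."""
    cmd = ["extract-pdf", pdf_path]
    if fmt is not None:
        cmd.append(f"--format={fmt}")
    if pages is not None:
        cmd.append(f"--pages={pages}")
    return cmd


# Reproduce the "Page range" example as a shell-ready string.
print(shlex.join(build_command("./book.pdf", pages="1-22")))
# prints: extract-pdf ./book.pdf --pages=1-22
```

Keeping the command as a list rather than a single string avoids quoting problems when a file name contains spaces.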

Caveats

Scanned PDFs without an embedded text layer return empty output. Pair with ocr-image for image-based PDFs.
Tables in irregular layouts may extract as flat text.
Encrypted PDFs require the password as a third positional argument.
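
The first caveat suggests a simple decision rule for an agent: if extraction comes back empty, treat the file as image-based and hand it to ocr-image. A minimal sketch of that rule follows. The function name next_step and its return labels are hypothetical, and treating whitespace-only output as empty is an assumption.

```python
def next_step(extracted_text: str) -> str:
    """Pick a follow-up action based on whether extraction produced any text."""
    if extracted_text.strip():
        return "summarize"  # a text layer was found; continue as usual
    return "ocr-image"      # likely a scanned PDF; fall back to OCR
```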