autoworks-ai / extract-pdf

Signed · v1.4.0 · MIT · 2,847 bytes · updated 2d ago · maintained by 2 contributors

Extract structured text from PDF files while preserving heading hierarchy, list structure, and table layout where possible. Wraps a local parsing library — never reaches the network. Pairs naturally with ocr-image for scanned documents and extract-table for structured data.

Install via CLI:
$ autovault add extract-pdf
Installs: 2,840 (+312 this week)
Active vaults: 1,920 (+204 this week)
Gate runs: 12 (all passed)
Issues: 3 open / 47 closed (median 2d)
Compatibility: 4/4 agents (universal)

SKILL.md
---
name: extract-pdf
version: 1.4.0
description: Extract structured text from PDF files. Preserves headings, lists, tables.
author: autoworks-ai
license: MIT
tools_required:
  - read   # reads .pdf bytes from disk
  - write  # writes .txt or .json sidecar
network: none
scope:
  paths: ["./*.pdf", "./docs/**/*.pdf"]
transformations:
  claude-code: CLAUDE.md
  codex: AGENTS.md
  cursor: .cursorrules
---

Extract PDF

Extract structured text from PDF files while preserving heading hierarchy, list structure, and table layout where possible. Wraps a local pdf-parsing library — never sends bytes off-device.

When to use this skill

Reach for extract-pdf when the user asks you to read, summarize, or extract content from a PDF file. Don't use it on its own for structured data extraction; pair it with extract-table downstream for that.

Inputs

  • A path to a .pdf file under the user's working directory
  • Optional: --format=json to emit structured output
  • Optional: --pages=1-5 to scope extraction to a page range
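
To make the --pages value concrete, here is a minimal Python sketch of how a spec such as 1-5 could be read as a 1-based, inclusive page range. The helper name parse_page_range is hypothetical, and accepting a single page such as "7" is an assumption; the skill's actual parsing rules are not documented here.

```python
def parse_page_range(spec: str) -> range:
    """Parse a 1-based inclusive page spec such as "1-5" into a range.

    Hypothetical helper: accepting a single page ("7") is an assumption,
    not documented behavior of extract-pdf.
    """
    first, sep, last = spec.partition("-")
    start = int(first)
    end = int(last) if sep else start
    if start < 1 or end < start:
        raise ValueError(f"invalid page range: {spec!r}")
    # range() excludes its upper bound, so add 1 to keep the last page.
    return range(start, end + 1)
```

Under these assumptions, list(parse_page_range("1-5")) covers pages 1 through 5, and a reversed spec such as "5-1" raises ValueError.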

Outputs

  • Plain text by default. Headings preserved as # markers, lists as - bullets
  • JSON when --format=json — includes pages, headings, paragraphs, and tables
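
The relationship between the two output modes can be sketched as follows. The block structure used here (a list of dicts with type, level, text, and items keys) is purely an assumption for illustration; the source only states that the JSON includes pages, headings, paragraphs, and tables, and does not specify a schema. Mapping heading depth to the number of # characters is likewise an assumption.

```python
def render_plain_text(blocks: list[dict]) -> str:
    """Render assumed-shape blocks as text with '#' headings and '-' bullets.

    The block schema is hypothetical, not the skill's documented JSON format.
    """
    lines = []
    for block in blocks:
        kind = block["type"]
        if kind == "heading":
            # Assumption: heading level maps to the number of '#' characters.
            lines.append("#" * block["level"] + " " + block["text"])
        elif kind == "list":
            lines.extend("- " + item for item in block["items"])
        else:
            # Paragraphs (and anything unrecognized) pass through as plain text.
            lines.append(block["text"])
    return "\n".join(lines)
```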

Examples

Basic extraction

User: "What's in spec-v2.pdf?" → run extract-pdf ./spec-v2.pdf, then summarize the result.

Page range

User: "Read the first chapter of book.pdf" → estimate page range from the table of contents, run extract-pdf ./book.pdf --pages=1-22.

Caveats

  • Scanned PDFs without an embedded text layer return empty output. Pair with ocr-image for image-based PDFs.
  • Tables in irregular layouts may extract as flat text.
  • Encrypted PDFs require the password as a third positional argument.
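
The first caveat suggests a simple check before summarizing: if extraction comes back empty, the file is probably image-based and should go to ocr-image instead. A minimal sketch, where needs_ocr is a hypothetical helper and treating whitespace-only output as empty is an assumption:

```python
def needs_ocr(extracted_text: str) -> bool:
    """Return True when extraction produced no usable text.

    Hypothetical helper: whitespace-only output is treated as empty, which
    is an assumption about how a missing text layer shows up.
    """
    return not extracted_text.strip()
```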