Extract structured text from PDF files while preserving heading hierarchy, list structure, and table layout where possible. Wraps a local parsing library — never reaches the network. Pairs naturally with ocr-image for scanned documents and extract-table for structured data.
Extract PDF
Extract structured text from PDF files while preserving heading hierarchy, list structure, and table layout where possible. Wraps a local pdf-parsing library — never sends bytes off-device.
When to use this skill
Reach for extract-pdf when the user asks you to read, summarize, or extract content from a PDF file. Don't use this skill for structured data extraction — pair with extract-table downstream for that.
Inputs
- A path to a .pdf file under the user's working directory
- Optional: --format=json to emit structured output
- Optional: --pages=1-5 to scope extraction to a page range
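The inputs above map directly onto a command line. Here is a minimal Python sketch of assembling that invocation, assuming extract-pdf is an executable on PATH that takes the file path first and flags after it, as in the examples in this document. The helper name build_extract_pdf_argv and its parameters are hypothetical, not part of the skill.

```python
from pathlib import Path
from typing import List, Optional, Tuple


def build_extract_pdf_argv(
    pdf_path: str,
    as_json: bool = False,
    pages: Optional[Tuple[int, int]] = None,
) -> List[str]:
    """Assemble an argv list for extract-pdf (hypothetical helper)."""
    # The skill expects a .pdf file; reject anything else early.
    if Path(pdf_path).suffix.lower() != ".pdf":
        raise ValueError(f"expected a .pdf file, got: {pdf_path}")

    argv = ["extract-pdf", pdf_path]
    if as_json:
        argv.append("--format=json")
    if pages is not None:
        first, last = pages
        # Assumption: pages are 1-based and the range is inclusive.
        if first < 1 or last < first:
            raise ValueError(f"invalid page range: {first}-{last}")
        argv.append(f"--pages={first}-{last}")
    return argv
```

The returned list can be handed to subprocess.run without going through a shell, which avoids quoting problems with paths that contain spaces.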
Outputs
- Plain text by default. Headings preserved as # markers, lists as - bullets
- JSON when --format=json is passed: includes pages, headings, paragraphs, and tables
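Because the plain-text output marks headings with # and list items with -, a caller can recover a rough outline without switching to JSON mode. Here is a minimal Python sketch, assuming one heading or bullet per line and that the number of leading # characters reflects heading depth (this document does not confirm the depth convention). The helper outline_from_text is hypothetical.

```python
from typing import List, Tuple


def outline_from_text(text: str) -> List[Tuple[str, int, str]]:
    """Return (kind, level, content) tuples for heading and bullet lines.

    Assumed conventions: headings start with one or more '#',
    bullets start with '- '. All other lines are ignored.
    """
    items: List[Tuple[str, int, str]] = []
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            # Count the leading '#' characters to get the heading level.
            level = len(line) - len(line.lstrip("#"))
            title = line[level:].strip()
            if title:
                items.append(("heading", level, title))
        elif line.startswith("- "):
            # Bullets carry no level in this sketch, so 0 is a placeholder.
            items.append(("bullet", 0, line[2:].strip()))
    return items
```

An outline like this is usually enough to pick which sections to summarize. For tables or page boundaries, the JSON output is the better starting point.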
Examples
Basic extraction
User: "What's in spec-v2.pdf?" → run extract-pdf ./spec-v2.pdf, then summarize the result.
Page range
User: "Read the first chapter of book.pdf" → estimate page range from the table of contents, run extract-pdf ./book.pdf --pages=1-22.
Caveats
- Scanned PDFs without an embedded text layer return empty. Pair with ocr-image for image-based PDFs.
- Tables in irregular layouts may extract as flat text.
- Encrypted PDFs require the password as a third positional argument.
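The first caveat suggests a simple check before summarizing: if extraction yields no text, the file is probably image-only and should go to ocr-image. Here is a minimal Python sketch, assuming extract-pdf writes its result to stdout and that ocr-image accepts the same path as its only argument (neither is confirmed by this document). The runner parameter is a hypothetical hook so the logic can be exercised without either tool installed.

```python
import subprocess
from typing import Callable, List


def run_tool(argv: List[str]) -> str:
    """Run a command and return its stdout as text."""
    result = subprocess.run(argv, capture_output=True, text=True, check=True)
    return result.stdout


def extract_with_ocr_fallback(
    pdf_path: str,
    runner: Callable[[List[str]], str] = run_tool,
) -> str:
    """Try extract-pdf first; fall back to ocr-image on empty output."""
    text = runner(["extract-pdf", pdf_path])
    if text.strip():
        return text
    # Empty output: likely a scanned PDF with no embedded text layer.
    # The ocr-image invocation below is an assumption, not a documented CLI.
    return runner(["ocr-image", pdf_path])
```

Whitespace-only output counts as empty here, since a scanned PDF may still produce stray newlines.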