PDF

Reading and writing PDF files in Savant

PDFs are used for document-based workflows – reports, statements, and customer-facing outputs. Unlike structured data formats, PDFs are treated as complete documents rather than row-based datasets. Savant supports both passing PDFs through workflows as document artifacts and extracting their text content for use as data.


Supported extension

Extension    Read limit    Write limit
.pdf         1 GB          1 GB


Read features

Reading a single file

When you configure a dataset with a File URL, Savant reads exactly that one file every time the dataset runs. Savant does not scan the surrounding folder or consider other files at that location. Each run fetches the latest version of the file, so the dataset always reflects the most current content available. If the file is moved, deleted, or access is revoked, the run fails with an explicit error.

File replacement and stale reads: Savant tracks source files by file ID, not filename. If you update a file by uploading a new version rather than editing the original in place, the original moves to the trash with its ID intact and the upload receives a new ID. Savant then keeps reading the trashed original, without raising an error, until that file is permanently deleted.

As a best practice, always edit source files in place. If you do replace a file, update the dataset with the new file URL before the next run.
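The stale-read behavior described above can be illustrated with a small sketch. It is a toy model only: the `drive` dictionary, the `StoredFile` class, and the `id-001`/`id-002` identifiers are hypothetical stand-ins, not Savant or storage-provider APIs.

```python
from dataclasses import dataclass


@dataclass
class StoredFile:
    file_id: str
    name: str
    content: str
    trashed: bool = False


# Hypothetical in-memory stand-in for a cloud drive, keyed by file ID.
drive = {
    "id-001": StoredFile("id-001", "report.pdf", "January figures"),
}

# The dataset remembers the ID it was configured with, not the file name.
dataset_file_id = "id-001"

# Replacing the file by upload: the old entry is trashed but keeps its ID,
# and the new upload gets a fresh ID under the same name.
drive["id-001"].trashed = True
drive["id-002"] = StoredFile("id-002", "report.pdf", "February figures")


def read_by_id(file_id: str) -> str:
    """Resolve by ID only; a trashed file still resolves until it is purged."""
    return drive[file_id].content


# Without a configuration change, the read still returns the old content.
stale = read_by_id(dataset_file_id)

# Pointing the dataset at the new file's ID resolves the stale read.
dataset_file_id = "id-002"
fresh = read_by_id(dataset_file_id)
```

In this model both files share the name `report.pdf`, which is why a name-based check would not reveal the problem.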

Native PDF

Native PDF treats the file as a document artifact. Savant passes the PDF through the workflow without attempting to parse its contents. Use this when the goal is to store, route, archive, or forward the file to a downstream system.

Text extract

Text Extract reads the embedded text content of a PDF and makes it available as data within Savant. This works reliably for PDFs that contain real text (for example, exported reports or digital statements). Scanned PDFs – where each page is an image rather than selectable text – will not produce meaningful output with this mode unless the file has been processed with OCR separately.

Include metadata

When enabled, Savant appends file-level context columns to the extracted content, such as file name, full path, and last-modified timestamp. This is particularly useful when reading from a folder with multiple PDFs, as it makes each extracted record traceable back to its source file.
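As a rough illustration of what an extracted record with metadata might look like, the sketch below attaches file-level fields to a piece of extracted text. The column names (`file_name`, `full_path`, `last_modified`), the helper `with_metadata`, and the sample path are assumptions for illustration; the actual column names Savant produces may differ.

```python
from datetime import datetime, timezone
from pathlib import PurePosixPath


def with_metadata(text: str, full_path: str, last_modified: datetime) -> dict:
    """Attach hypothetical file-level context columns to extracted text."""
    path = PurePosixPath(full_path)
    return {
        "text": text,
        "file_name": path.name,                       # final path component
        "full_path": str(path),
        "last_modified": last_modified.isoformat(),   # ISO 8601 timestamp
    }


record = with_metadata(
    "Total due: 120.00",
    "/inbox/statements/statements_jan.pdf",
    datetime(2024, 1, 31, 9, 0, tzinfo=timezone.utc),
)
```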


Reading from a folder

When you configure a dataset with a Folder URL, Savant treats the folder as the input source and applies the selection rules below at run time.

File pattern

File Pattern restricts ingestion to files whose names match a specified pattern. For example, a pattern like statements_*.pdf ingests only matching files and ignores everything else in the folder.
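To preview which names a wildcard pattern would select, a quick local check can help. The sketch below uses Python's shell-style matching and assumes Savant's File Pattern follows similar `*` wildcard semantics, which this page does not specify. The file names are made up.

```python
from fnmatch import fnmatchcase

# Hypothetical folder listing.
folder_listing = [
    "statements_2024-01.pdf",
    "statements_2024-02.pdf",
    "summary.pdf",
    "statements_notes.txt",
]

pattern = "statements_*.pdf"

# fnmatchcase is case-sensitive on every platform, so results are predictable.
matched = [name for name in folder_listing if fnmatchcase(name, pattern)]
```

Here `summary.pdf` is skipped because it lacks the prefix, and `statements_notes.txt` is skipped because of its extension.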

Scan subfolders

When enabled, Scan Subfolders extends the read to include files in nested subdirectories. This is useful when PDFs are organized by date, region, or business unit across multiple folders. Savant supports reading from up to 10 subfolders within any given folder.
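The difference between a top-level read and a read that includes subfolders can be sketched locally with `pathlib`. This is an analogy built on a temporary directory, not Savant's implementation. The helper `list_pdfs` and the folder names are hypothetical, and the 10-subfolder limit is not modeled.

```python
import tempfile
from pathlib import Path


def list_pdfs(folder: Path, scan_subfolders: bool) -> list[str]:
    """Return sorted relative paths of PDFs, optionally descending into subfolders."""
    finder = folder.rglob if scan_subfolders else folder.glob
    return sorted(p.relative_to(folder).as_posix() for p in finder("*.pdf"))


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "emea").mkdir()
    (root / "top.pdf").touch()            # file at the top level
    (root / "emea" / "q1.pdf").touch()    # file in a nested subfolder

    top_level_only = list_pdfs(root, scan_subfolders=False)
    with_subfolders = list_pdfs(root, scan_subfolders=True)
```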

Use only most recent file

When enabled, Savant ingests a single file: the newest matching file based on last-modified time. Use this for recurring workflows where a fresh PDF is delivered to the folder on a schedule and you always want the latest version without updating the dataset configuration.
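The selection rule (filter by pattern, then keep the newest by last-modified time) can be sketched as follows. The listing, timestamps, and the helper `most_recent` are hypothetical, and shell-style wildcard matching is assumed.

```python
from datetime import datetime
from fnmatch import fnmatchcase
from typing import Optional

# Hypothetical folder listing as (file name, last-modified time) pairs.
listing = [
    ("statements_jan.pdf", datetime(2024, 1, 31, 9, 0)),
    ("statements_feb.pdf", datetime(2024, 2, 29, 9, 0)),
    ("notes.pdf", datetime(2024, 3, 5, 9, 0)),
]


def most_recent(files, pattern: str) -> Optional[str]:
    """Return the newest file name matching the pattern, or None if none match."""
    candidates = [(name, mtime) for name, mtime in files if fnmatchcase(name, pattern)]
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[1])[0]


latest = most_recent(listing, "statements_*.pdf")
```

Note that `notes.pdf` has the latest timestamp overall but is excluded because it does not match the pattern.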


Write features

Folder path

Folder Path defines the destination where Savant writes the output file. For recurring workflows, keeping this path consistent ensures downstream systems and users always know where to find the latest output.

Subfolder name

Subfolder Name controls how output files are organized within the destination path. Subfolders can be static (a fixed path) or dynamic, where Savant generates a separate subfolder for each unique value of one or more dataset fields – for example, one folder per region or one folder per day.
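A minimal sketch of dynamic subfolders, assuming each listed field contributes one nested path segment. The base path, field names, and the nesting order are illustrative assumptions; how Savant combines multiple fields is not stated here.

```python
from pathlib import PurePosixPath

# Hypothetical Folder Path for the output.
base = PurePosixPath("/outputs/statements")

# Hypothetical dataset rows.
rows = [
    {"region": "EMEA", "day": "2024-03-01", "customer": "acme"},
    {"region": "APAC", "day": "2024-03-01", "customer": "globex"},
    {"region": "EMEA", "day": "2024-03-01", "customer": "initech"},
]


def subfolder_for(row: dict, fields: list) -> PurePosixPath:
    """Build a destination folder with one nested segment per field value."""
    return base.joinpath(*(str(row[field]) for field in fields))


# One folder per unique (region, day) combination.
dynamic_folders = sorted({str(subfolder_for(row, ["region", "day"])) for row in rows})
```

Three rows produce two folders because two of them share the same region and day.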

File name

File Name determines how each output file is named within its folder. Names can be static or dynamic. Dynamic naming generates one file per unique field value – for example, one file per customer – within a single run.
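The one-file-per-value behavior amounts to grouping rows by a field and deriving a name from each value. The sketch below shows that grouping. The `{value}` template syntax, the helper `group_by_file`, and the sample rows are hypothetical and do not reflect Savant's actual naming syntax.

```python
from collections import defaultdict

# Hypothetical dataset rows.
rows = [
    {"customer": "acme", "amount": 120},
    {"customer": "globex", "amount": 75},
    {"customer": "acme", "amount": 30},
]


def group_by_file(rows: list, field: str, template: str) -> dict:
    """Map each generated file name to the rows that would be written to it."""
    files = defaultdict(list)
    for row in rows:
        files[template.format(value=row[field])].append(row)
    return dict(files)


outputs = group_by_file(rows, "customer", "statement_{value}.pdf")
```

A static name corresponds to a template with no placeholder, in which case every row maps to the same single file.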

Sorting

Sorting controls the order in which output files are written when dynamic naming produces multiple files in a single run. Specifying a sort order ensures consistent, predictable output across runs.


Next steps

Review the file type guides for detailed configuration options, parsing behavior, and known limitations.