Use an LLM to Convert Table Text to a Dataframe
convert_table_text_to_dataframe.RdIt is common to encounter valuable tabular data that is stored in a file type that does not codify tabular data as such, e.g., a table in a PDF or .docx file. This function uses a user-specified LLM (from OpenAI or Anthropic) to convert the text of a table into a dataframe. Note that users must have an API key with credits for the specified LLM. For a typical full-page PDF table, the LLM costs are roughly $.02-.05 USD per page.
Usage
convert_table_text_to_dataframe(
text,
column_types,
llm_company_name = "openai",
preprocess = TRUE,
read_warning = FALSE,
short_document = FALSE,
required = FALSE
)Arguments
- text
The table-as-text input. Each page of input should be its own vector or list item, as is the default when using
pdftools::pdf_text(). Non-table text in the inputtextshould be minimized.- column_types
The column types of the output dataframe. This is a
type_objectobject from theellmerpackage. Including descriptions of what each column represents improves accuracy.- llm_company_name
One of c("openai", "anthropic"). Default is "openai".
- preprocess
Should the text be preprocessed before it is passed to the LLM? Default is
TRUE. This removes unneeded spaces, line breaks, and page numbers.- read_warning
Did the user read the function documentation?
- short_document
Boolean; default is FALSE. If TRUE, it is assumed that the document is short enough that it can be processed in a single API call. If FALSE and the inputted
textis a single item, the function throws an error. Note that multi-page documents should be broken into multi-item vectors/lists before being passed totext.- required
Boolean; default is FALSE. If TRUE, the LLM will be instructed to return values for all columns. If FALSE,
NULLvalues are allowed. Generally, NULL values should be allowed unless you are certain that every value in the inputted text-table has a non-NULL value.
Value
A list of tibbles, where each list element corresponds to one item (typically one page) in the input text vector/list. Each tibble contains:
- Structure
Columns match the names and types defined in
column_types. Each row represents one record extracted from the table text by the LLM.- NULL values
When
required = FALSE(default), columns may contain NULL/NA values if the LLM could not extract a value for that cell.- Empty dataframes
If the LLM encounters an error processing a page, that list element will be an empty
data.frame().
Use purrr::list_rbind() or dplyr::bind_rows() to consolidate results into a single dataframe. A warning is issued reminding users to review AI-generated results for accuracy.
Examples
if (FALSE) { # \dontrun{
column_types = type_object(
col1 = type_string("Zip code"),
col2 = type_integer("Number of buildings"))
convert_table_text_to_dataframe(
text = example_text,
column_types = column_types,
preprocess = TRUE,
read_warning = TRUE,
llm_company_name = "openai",
short_document = TRUE,
required = FALSE)
} # }