The volumes of historical data locked in unstructured formats have long posed a challenge for researchers in the computational humanities. While optical character recognition (OCR) and natural language processing have enabled large-scale text mining, irregular formatting, inconsistent terminology, and evolving printing practices still complicate automated parsing and information extraction for historical documents. This study explores the potential of large language models (LLMs) for processing and structuring irregular, non-standardized historical materials, using the U.S. Department of Agriculture’s Plant Inventory books (1898–2008) as a test case. Because the format of these records changed repeatedly over their publication history, we implemented a pipeline combining OCR, custom segmentation rules, and LLMs to extract structured data from the scanned texts. The resulting pipeline illustrates how incorporating LLMs into data-processing workflows can enhance the accessibility and usability of historical and archival materials for scholars.
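The three-stage pipeline named above (OCR output, rule-based segmentation, LLM extraction) can be sketched minimally as follows. The segmentation rule, the sample entries, and the `extract_fields` heuristic standing in for the LLM call are all illustrative assumptions, not the study's actual rules, data, or schema.

```python
import re

def segment_entries(ocr_text: str) -> list[str]:
    """Split raw OCR output into per-accession entries.

    Assumed custom rule (hypothetical): each entry begins at the start
    of a line with a numeric accession ID followed by a period.
    """
    # Zero-width lookahead split keeps the ID with its entry.
    parts = re.split(r"(?m)^(?=\d+\.\s)", ocr_text)
    return [p.strip() for p in parts if p.strip()]

def extract_fields(entry: str) -> dict:
    """Placeholder for the LLM step that maps an entry to structured fields.

    A real implementation would prompt an LLM with the entry text and a
    target schema; a simple regex heuristic stands in here so the sketch
    is self-contained and runnable.
    """
    match = re.match(r"(\d+)\.\s+(.*?)\.\s+(.*)", entry, re.DOTALL)
    if not match:
        return {"raw": entry}  # fall back to unparsed text
    accession, name, description = match.groups()
    return {
        "accession": accession,
        "name": name,
        "description": description.strip(),
    }

# Invented example entries in the general style of plant-inventory records.
ocr_text = """\
41234. MALUS DOMESTICA Borkh. From Kashgar, China. Seeds collected 1911.
41235. TRITICUM AESTIVUM L. From Odessa, Russia. Hardy winter wheat.
"""

records = [extract_fields(e) for e in segment_entries(ocr_text)]
```

In practice the segmentation rules would vary by volume and era, and the extraction step would validate the LLM's output against a fixed schema before loading it into a database.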