Name: Exploring the AI kaleidoscope with VoucherVision: a spectrum of techniques for specimen label transcription
Start: 2024-05-30T16:00:00-0500
End: 2024-05-30T16:15:00-0500

Please join us for the 8th Annual Digital Data in Biodiversity Research Conference: Synthesizing & Harmonizing Data for Integrated Biodiversity Research.

Facing considerable digitization backlogs and limited resources, natural history collections stand to benefit from recent advances in computer vision and natural language processing. Our project investigates the use of Large Language Models (LLMs) to transcribe specimen labels into digitally searchable formats using the VoucherVision platform. First, we evaluate four distinct optical character recognition (OCR) techniques for extracting text from specimen images. Then, we leverage the adaptability of LLMs to transform the extracted unstructured text into structured JSON dictionaries, which can be ingested into existing collections software such as Symbiota, Specify, and BRAHMS. To do this, we test three separate AI-powered workflows to determine the most effective transcription methodologies: (1) prompt engineering coupled with API calls to more than 15 well-known foundational LLMs (such as ChatGPT, Mistral, and Gemini); (2) an agent-based strategy employing multiple LLM bots for transcription and subsequent recursive validation of each LLM-generated output; and (3) LLMs fine-tuned on natural history transcription datasets. We demonstrate that each of the three AI methods can produce a properly formatted JSON dictionary, but that content accuracy varies with the LLM version, hyperparameter settings, and prompting style. To complement VoucherVision, we also worked with 15 partner institutions to develop an easy-to-use editing tool for correcting transcription errors. This collaborative effort highlights the potential of AI to streamline the digitization of natural history collections, and it underscores the importance of iterative testing and customization in achieving high transcription accuracy and efficiency.
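As a rough illustration of the OCR-text-to-JSON step described in the abstract, the sketch below builds a prompt and checks that a model's reply is a well-formed JSON dictionary. It is not VoucherVision's actual code: the field names (Darwin Core-style terms chosen for illustration), the helper names build_prompt and parse_response, and the stand-in reply are all assumptions, and no real LLM API is called.

```python
import json

# Hypothetical field list for illustration; VoucherVision's real schema may differ.
EXPECTED_FIELDS = ["scientificName", "recordedBy", "eventDate", "country"]


def build_prompt(ocr_text: str) -> str:
    """Ask the model to return only a JSON object with the expected keys."""
    keys = ", ".join(EXPECTED_FIELDS)
    return (
        "Transcribe the specimen label text below into a JSON object "
        f"with exactly these keys: {keys}. "
        "Use an empty string for anything not on the label.\n\n"
        f"Label text:\n{ocr_text}"
    )


def parse_response(raw: str) -> dict:
    """Parse an LLM reply and normalize it to the expected keys.

    Raises ValueError if the reply is not a JSON object.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reply is valid JSON but not an object")
    # Keep only the expected keys; fill missing or null values with "".
    record = {}
    for key in EXPECTED_FIELDS:
        value = data.get(key)
        record[key] = "" if value is None else str(value)
    return record


# Stand-in reply, as if returned by some LLM API call (no real call is made).
fake_reply = '{"scientificName": "Quercus alba", "recordedBy": "J. Smith", "extra": 1}'
record = parse_response(fake_reply)
print(record)
```

A check like this only confirms that the output is properly formatted. As the abstract notes, content accuracy is a separate question that depends on the model, its settings, and the prompt.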

Speakers

William Weaver

Thursday May 30, 2024 4:00pm - 4:15pm CDT
Burge Union

Concurrent 2: Facilitating ecological discovery and understanding
