Structured Entity Extraction From Product Images Using Fine-Tuned Vision-Language Models For Digital Marketplaces

Dr. T. Jalaja; Dr. T. Adilakshmi; Vamsi Krishna Desineedi; Spoorthi Vadlakonda

doi:10.64252/sc0jsw64

Keywords:

Vision-Language Models, Entity Extraction, PaLI-Gemma, LoRA, E-commerce Automation

Abstract

E-commerce product listings often lack consistent, structured metadata such as weight, dimensions, or voltage, all of which are critical for accurate cataloging and comparison. This project addresses the problem with an AI-based system that extracts such entity values directly from product images. The system uses the PaliGemma vision-language model, fine-tuned with Low-Rank Adaptation (LoRA) on 5,000 annotated product images made publicly available by Amazon. The model receives an image together with a prompt specifying the desired attribute (e.g., “What is the weight?”) and returns a structured output. With entity-specific prompts and parameter-efficient fine-tuning, the system improves significantly on the base model, achieving an F1 score of 0.70 on a held-out test set. This approach automates metadata extraction, offering a scalable and precise alternative to manual annotation in digital marketplaces.
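
The abstract does not say how prompts are built, how model answers are parsed, or how the F1 score is computed. The following is a minimal sketch of one plausible setup, not the authors' implementation. Only the weight prompt appears in the abstract; the `PROMPTS` table, the "value unit" answer format, the `parse_answer` helper, and the scoring convention (a non-empty wrong answer counts as a false positive, an unparseable answer against a non-empty label as a false negative) are all assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

Answer = Optional[Tuple[float, str]]

# Hypothetical prompt table: only the weight prompt is quoted in the abstract.
PROMPTS = {
    "weight": "What is the weight?",
    "voltage": "What is the voltage?",  # assumed wording
}

# Assumed answer format: a number followed by a unit, e.g. "500 gram".
_VALUE_UNIT = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z ]+?)\s*$")


def build_prompt(entity: str) -> str:
    """Return the entity-specific question sent alongside the image."""
    return PROMPTS[entity]


def parse_answer(text: str) -> Answer:
    """Turn raw model text into (value, unit), or None if it does not parse."""
    match = _VALUE_UNIT.match(text)
    if match is None:
        return None
    return float(match.group(1)), match.group(2).lower()


def f1_score(predictions: List[Answer], targets: List[Answer]) -> float:
    """F1 under the assumed convention described above."""
    tp = fp = fn = 0
    for pred, gold in zip(predictions, targets):
        if pred is not None and pred == gold:
            tp += 1  # correct, non-empty prediction
        elif pred is not None:
            fp += 1  # non-empty but wrong prediction
        elif gold is not None:
            fn += 1  # nothing usable predicted, label exists
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Scoring on exact (value, unit) matches is strict: a prediction of "0.5 kilogram" against a label of "500 gram" counts as wrong. Unit normalization before comparison would be a natural extension, but nothing in the abstract indicates whether it was applied.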