Structured Entity Extraction From Product Images Using Fine-Tuned Vision-Language Models For Digital Marketplaces

Authors

  • Dr. T. Jalaja
  • Dr. T. Adilakshmi
  • Vamsi Krishna Desineedi
  • Spoorthi Vadlakonda

DOI:

https://doi.org/10.64252/sc0jsw64

Keywords:

Vision-Language Models, Entity Extraction, PaliGemma, LoRA, E-commerce Automation

Abstract

E-commerce product listings often lack consistent, structured metadata such as weight, dimensions, or voltage, all of which are critical for accurate cataloging and comparison. This project addresses the challenge with an AI-based system that extracts such entity values directly from product images. The system uses the PaliGemma vision-language model, fine-tuned with the Low-Rank Adaptation (LoRA) method on 5,000 annotated product images made publicly available by Amazon. The model receives both the image and a prompt specifying the desired attribute (e.g., “What is the weight?”) and returns a structured output. With entity-specific prompts and efficient fine-tuning, the system demonstrates a significant performance improvement over the base model, achieving a 0.70 F1 score on a held-out test set. This solution automates the metadata extraction process, offering a scalable and precise alternative to manual annotation in digital marketplaces.
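The prompt-and-score loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the unit vocabulary, and the scoring convention (a wrong non-empty answer counts as a false positive, a missing answer as a false negative) are all assumptions made for illustration, and the model call itself is left out.

```python
import re

# Hypothetical unit vocabulary; the abstract does not list the allowed units.
UNIT_ALIASES = {
    "g": "gram", "gram": "gram", "grams": "gram",
    "kg": "kilogram", "kilogram": "kilogram", "kilograms": "kilogram",
    "v": "volt", "volt": "volt", "volts": "volt",
    "cm": "centimetre", "centimetre": "centimetre", "centimetres": "centimetre",
}


def build_prompt(entity_name):
    """Build an entity-specific prompt such as 'What is the weight?'."""
    return f"What is the {entity_name.replace('_', ' ')}?"


def parse_answer(text):
    """Normalise free-form model text like '500 g' to (value, unit).

    Returns None when the text is not a number followed by a known unit.
    """
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([a-zA-Z]+)\s*", text)
    if match is None:
        return None
    unit = UNIT_ALIASES.get(match.group(2).lower())
    if unit is None:
        return None
    return float(match.group(1)), unit


def f1_score(predictions, references):
    """Exact-match F1 over aligned lists of (value, unit) tuples or None.

    Assumed convention: a correct answer is a true positive, any other
    non-empty answer is a false positive, and an empty answer where a
    reference exists is a false negative.
    """
    tp = fp = fn = 0
    for pred, ref in zip(predictions, references):
        if pred is not None:
            if pred == ref:
                tp += 1
            else:
                fp += 1
        elif ref is not None:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

In use, each raw model answer would be passed through `parse_answer` before scoring, so that surface variants such as "500 g" and "500 grams" compare equal to the same reference.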

Published

2025-06-02

Section

Articles

How to Cite

Structured Entity Extraction From Product Images Using Fine-Tuned Vision-Language Models For Digital Marketplaces. (2025). International Journal of Environmental Sciences, 1456-1462. https://doi.org/10.64252/sc0jsw64