Structured Entity Extraction From Product Images Using Fine-Tuned Vision-Language Models For Digital Marketplaces
DOI:
https://doi.org/10.64252/sc0jsw64
Keywords:
Vision-Language Models, Entity Extraction, PaLI-Gemma, LoRA, E-commerce Automation
Abstract
In e-commerce, product listings often lack consistent, structured metadata such as weight, dimensions, or voltage, all of which are critical for accurate cataloging and comparison. This project addresses that gap with an AI-based system that extracts such entity values directly from product images. The system uses the PaliGemma vision-language model, fine-tuned with the Low-Rank Adaptation (LoRA) method on 5,000 annotated product images made publicly available by Amazon. The model receives both the image and a prompt specifying the desired attribute (e.g., “What is the weight?”) and returns a structured output. With entity-specific prompts and parameter-efficient fine-tuning, the system improves significantly on the base model, achieving an F1 score of 0.70 on a held-out test set. The approach automates metadata extraction, offering a scalable and precise alternative to manual annotation in digital marketplaces.
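To make the prompt-in, structured-output-out interface concrete, the following is a minimal Python sketch of the pieces around the model: building an entity-specific prompt, parsing a text answer into a value and unit, and scoring predictions with F1. The entity names, the prompt templates other than “What is the weight?”, the "value unit" answer format, and the exact true-positive/false-positive/false-negative rules are assumptions for illustration; the abstract does not specify them. The model call itself is omitted.

```python
import re
from typing import NamedTuple, Optional, Sequence


class Entity(NamedTuple):
    value: float
    unit: str


# Hypothetical prompt templates; only "What is the weight?" appears in the abstract.
PROMPTS = {
    "item_weight": "What is the weight?",
    "voltage": "What is the voltage?",
}


def build_prompt(entity_name: str) -> str:
    """Return the question sent to the model alongside the image."""
    fallback = f"What is the {entity_name.replace('_', ' ')}?"
    return PROMPTS.get(entity_name, fallback)


# Assumed answer format: a number followed by a unit, e.g. "500 gram".
_ANSWER = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([a-zA-Z][a-zA-Z ]*?)\s*$")


def parse_answer(text: str) -> Optional[Entity]:
    """Parse a model answer into (value, unit); None if it does not fit."""
    match = _ANSWER.match(text)
    if match is None:
        return None
    return Entity(float(match.group(1)), match.group(2).lower())


def f1_score(predictions: Sequence[Optional[Entity]],
             targets: Sequence[Optional[Entity]]) -> float:
    """Exact-match F1 (assumed scoring rule).

    A non-empty prediction equal to the target is a true positive, any other
    non-empty prediction is a false positive, and an empty prediction where a
    target exists is a false negative.
    """
    tp = fp = fn = 0
    for pred, gold in zip(predictions, targets):
        if pred is not None and pred == gold:
            tp += 1
        elif pred is not None:
            fp += 1
        elif gold is not None:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

In use, `build_prompt("item_weight")` would be paired with the product image as model input, and the decoded text would be passed through `parse_answer` before scoring; answers that do not fit the assumed format are treated as empty predictions.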