Product Matching using Sentence-BERT: A Deep Learning Approach to E-Commerce Product Deduplication
Product matching on e-commerce platforms is difficult because product titles, descriptions, and categorizations vary across vendors. This paper presents a lightweight approach to product matching using Sentence-BERT (SBERT), specifically the all-MiniLM-L6-v2 variant. Our method combines efficient text preprocessing, strategic training pair generation, and threshold-based similarity matching to achieve high-accuracy product matching while maintaining computational efficiency. Evaluated on the Pricerunner dataset, the system achieved 98.10% accuracy, 100% precision, and 91.84% recall. The implementation uses a modular architecture that facilitates maintenance and updates, and the threshold-based matching strategy gives fine-grained control over the precision-recall trade-off. Our results suggest that carefully designed preprocessing and training strategies, combined with lightweight transformer models, can achieve state-of-the-art performance in product matching without requiring complex model architectures or extensive computational resources.
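To illustrate how threshold-based similarity matching trades recall for precision, the following is a minimal sketch and not the paper's implementation. It assumes each product title has already been encoded into an embedding vector (in the described system these would come from the all-MiniLM-L6-v2 model). The toy 3-dimensional vectors, the labels, the threshold values, and the function names `cosine_similarity`, `predict_matches`, and `precision_recall` are all hypothetical and chosen only for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def predict_matches(pairs, threshold):
    """Label a pair as a match when its similarity reaches the threshold."""
    return [cosine_similarity(u, v) >= threshold for u, v in pairs]


def precision_recall(predicted, actual):
    """Compute precision and recall from boolean predictions and labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Hypothetical toy vectors standing in for SBERT embeddings of product titles.
pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.0, 0.0]),  # identical offers
    ([1.0, 0.0, 0.0], [0.8, 0.6, 0.0]),  # same product, different wording
    ([1.0, 0.0, 0.0], [0.6, 0.8, 0.0]),  # related but different product
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # unrelated product
]
labels = [True, True, False, False]

# Assumed thresholds; a real system would tune this on held-out data.
for threshold in (0.5, 0.7, 0.9):
    predicted = predict_matches(pairs, threshold)
    precision, recall = precision_recall(predicted, labels)
    print(f"threshold={threshold:.1f} precision={precision:.2f} recall={recall:.2f}")
```

On this toy data, raising the threshold removes false positives first and then starts rejecting true matches, which is the same kind of trade-off as the reported combination of 100% precision with 91.84% recall.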