- Alexandros Zeakis ORCID:orcid.org/0000-0001-5336-81571,2,
- George Papadakis1,
- Dimitrios Skoutas2 &
- …
- Manolis Koubarakis1
265Accesses
Abstract
Recent works on entity resolution (ER) leverage deep learning techniques that rely on language models to improve effectiveness. These techniques are used both for blocking and matching, the two main steps of ER. Several language models have been tested in the literature, with fastText and BERT variants being most popular. However, there is no detailed analysis of their strengths and weaknesses. We cover this gap through a thorough experimental analysis of 12 popular pre-trained language models over 17 established benchmark datasets. First, we examine their relative effectiveness in blocking, unsupervised matching and supervised matching. We enhance our analysis by also investigating the complementarity and transferability of the language models and we further justify their relative performance by looking into the similarity scores and ranking positions each model yields. In each task, we compare them with several state-of-the-art techniques in the literature. Then, we investigate their relative time efficiency with respect to vectorization overhead, blocking scalability and matching run-time. The experiments are carried out both in schema-agnostic and schema-aware settings. In the former, all attribute values per entity are concatenated into a representative sentence, whereas in the latter the values of individual attributes are considered. Our results provide novel insights into the pros and cons of the main language models, facilitating their use in ER applications.
This is a preview of subscription content,log in via an institution to check access.
Access this article
Subscribe and save
- Get 10 units per month
- Download Article/Chapter or eBook
- 1 Unit = 1 Article or 1 Chapter
- Cancel anytime
Buy Now
Price includes VAT (Japan)
Instant access to the full article PDF.



















Similar content being viewed by others
Notes
Note that this case does not apply to Clean-Clean ER, where the query entities are disjoint from the indexed and retrieved ones. Thus, all candidate pairs are unique by default.
We used the implementation provided by pyJedAI athttps://github.com/AI-team-UoA/pyJedAI.
References
Christophides, V., Efthymiou, V., Palpanas, T., Papadakis, G., Stefanidis, K.: An overview of end-to-end entity resolution for big data. ACM CSUR53(6), 1–42 (2021)
Dong, X.L., Srivastava, D.: Big data integration. PVLDB6(11), 1188–1189 (2013)
Christophides, V., Efthymiou, V., Stefanidis, K.: Entity Resolution in the Web of Data. Morgan & Claypool (2015)
Christen, P.: Data Matching. Springer, Berlin (2012)
Getoor, L., Machanavajjhala, A.: Entity resolution: theory, practice & open challenges. PVLDB5(12), 2018–2019 (2012)
Papadakis, G., Skoutas, D., Thanos, E., Palpanas, T.: Blocking and filtering techniques for entity resolution: a survey. ACM CSUR53(2), 1–42 (2021)
Papadakis, G., Ioannou, E., Thanos, E., Palpanas, T.: The Four Generations of Entity Resolution. Morgan & Claypool (2021)
Pilehvar, M.T., Camacho-Collados, J.: Embeddings in Natural Language Processing. Morgan & Claypool (2020)
Thirumuruganathan, S., Li, H., Tang, N., Ouzzani, M., Govind, Y., Paulsen, D., Fung, G., Doan, A.: Deep learning for blocking in entity matching: a design space exploration. PVLDB14(11), 2459–2472 (2021)
Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., Raghavendra, V.: Deep learning for entity matching: a design space exploration. In: SIGMOD, pp. 19–34 (2018)
Brunner, U., Stockinger, K.: Entity matching with transformer architectures—a step forward in data integration. In: EDBT, pp. 463–473 (2020)
Ebraheem, M., Thirumuruganathan, S., Joty, S.R., Ouzzani, M., Tang, N.: Distributed representations of tuples for entity resolution. PVLDB11(11), 1454–1467 (2018)
Johnson, J., Douze, M., Jégou, H.: Billion-scale similarity search with gpus. IEEE Trans. Big Data7(3), 535–547 (2021)
Tu, J., Fan, J., Tang, N., Wang, P., Li, G., Du, X., Jia, X., Gao, S.: Unicorn: a unified multi-tasking model for supporting matching tasks in data integration. SIGMOD1(1), 1–26 (2023)
Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.: Deep entity matching with pre-trained language models. Proc. VLDB Endow.14(1), 50–60 (2020)
Papadakis, G., Efthymiou, V., Thanos, E., Hassanzadeh, O., Christen, P.: An analysis of one-to-one matching algorithms for entity resolution. VLDB J.32(6), 1369–1400 (2023)
Zhang, W., Wei, H., Sisman, B., Dong, X.L., Faloutsos, C., Page, D.: Autoblock: a hands-off blocking framework for entity matching. In: WSDM, pp. 744–752 (2020)
Nie, H., Han, X., He, B., Sun, L., Chen, B., Zhang, W., Wu, S., Kong, H.: Deep sequence-to-sequence entity matching for heterogeneous entity resolution. In: CIKM, pp. 629–638 (2019)
Li, B., Wang, W., Sun, Y., Zhang, L., Ali, M.A., Wang, Y.: Grapher: token-centric entity resolution with graph convolutional neural networks. In: IAAI, pp. 8172–8179 (2020)
Wang, Z., Sisman, B., Wei, H., Dong, X.L., Ji, S.: Cordel: a contrastive deep learning approach for entity linkage. In: ICDM, pp. 1322–1327 (2020)
Zhang, D., Nie, Y., Wu, S., Shen, Y., Tan, K.: Multi-context attention for entity matching. In: WWW, pp. 2634–2640 (2020)
Fu, C., Han, X., He, J., Sun, L.: Hierarchical matching network for heterogeneous entity resolution. In: IJCAI, pp. 3665–3671 (2020)
Yao, Z., Li, C., Dong, T., Lv, X., Yu, J., Hou, L., Li, J., Zhang, Y., Dai, Z.: Interpretable and low-resource entity matching via decoupling feature learning from decision making. In: ACL/IJCNLP, pp. 2770–2781 (2021)
Peeters, R., Bizer, C.: Dual-objective fine-tuning of BERT for entity matching. PVLDB14, 1913–1921 (2021)
Paganelli, M., Del Buono, F., Marco, P., Guerra, F., Vincini, M.: Automated machine learning for entity matching tasks. In: EDBT, pp. 325–330 (2021)
Chen, R., Shen, Y., Zhang, Y.: GNEM: a generic one-to-set neural entity matching framework. In: WWW, pp. 1686–1694 (2020)
Li, Y., Li, J., Suhara, Y., Doan, A., Tan, W.: Deep entity matching with pre-trained language models. PVLDB14(1), 50–60 (2020)
Pennington, J., Socher, R., Manning, C.D.: Glove: global vectors for word representation. In: EMNLP, pp. 1532–1543 (2014)
Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: ICLR (Workshop Poster) (2013)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. NeurIPS, vol. 26 (2013)
Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. TACL5, 135–146 (2017)
Devlin, J., Chang, M., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (1). Association for Computational Linguistics, pp. 4171–4186 (2019)
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. In: ICLR. OpenReview.net (2020)
Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: Roberta: a robustly optimized bert pretraining approach. arXiv preprintarXiv:1907.11692 (2019)
Sanh, V., Debut, L., Chaumond, J., Wolf, T.: Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprintarXiv:1910.01108 (2019)
Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R.R., Le, Q.V.: Xlnet: generalized autoregressive pretraining for language understanding. NeurIPS, vol. 32 (2019)
Song, K., Tan, X., Qin, T., Lu, J., Liu, T.-Y.: Mpnet: masked and permuted pre-training for language understanding. NeurIPS33, 16857–16867 (2020)
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., Liu, P.J.: Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res.21(140), 1–67 (2020)
Wang, W., Wei, F., Dong, L., Bao, H., Yang, N., Zhou, M.: Minilm: deep self-attention distillation for task-agnostic compression of pre-trained transformers. NeurIPS33, 5776–5788 (2020)
Peeters, R., Bizer, C.: Entity matching using large language models. arXiv preprintarXiv:2310.11244 (2023)
Zeakis, A., Papadakis, G., Skoutas, D., Koubarakis, M.: Pre-trained embeddings for entity resolution: an experimental analysis. Proc. VLDB Endow.16(9), 2225–2238 (2023)
Mugeni, J.B., Amagasa, T.: A graph-based blocking approach for entity matching using contrastively learned embeddings. ACM SIGAPP Appl. Comput. Rev.22(4), 37–46 (2023)
Paulsen, D., Govind, Y., Doan, A.: Sparkly: a simple yet surprisingly strong tf/idf blocker for entity matching. PVLDB16(6), 1507–1519 (2023)
Papadakis, G., Fisichella, M., Schoger, F., Mandilaras, G., Augsten, N., Nejdl, W.: Benchmarking filtering techniques for entity resolution. In: ICDE, pp. 653–666 (2023)
Brinkmann, A., Shraga, R., Bizer, C.: Sc-block: supervised contrastive blocking within entity resolution pipelines. In: ESWC, pp. 121–142 (2024)
Wu, R., Chaba, S., Sawlani, S., Chu, X., Thirumuruganathan, S.: Zeroer: entity resolution using zero labeled examples. In: SIGMOD, pp. 1149–1164 (2020)
Ge, C., Wang, P., Chen, L., Liu, X., Zheng, B., Gao, Y.: Collaborem: a self-supervised entity matching framework using multi-features collaboration. TKDE35(12), 12139–12152 (2021)
Peeters, R., Bizer, C.: Using chatgpt for entity matching. In: European Conference on Advances in Databases and Information Systems, pp. 221–230 (2023)
Narayan, A., Chami, I., Orr, L.J., Ré, C.: Can foundation models wrangle your data? Proc. VLDB Endow.16(4), 738–746 (2022)
Zhang, H., Dong, Y., Xiao, C., Oyamada, M.: Jellyfish: a large language model for data preprocessing. arXiv preprintarXiv:2312.01678 (2023)
Peeters, R., Bizer, C.: Supervised contrastive learning for product matching. Companion Proc. Web Conf.2022, 248–251 (2022)
Wang, R., Li, Y., Wang, J.: Sudowoodo: Contrastive self-supervised learning for multi-purpose data integration and preparation. In ICDE , pp. 1502–1515 (2023)
Yao, D., Gu, Y., Cong, G., Jin, H., Lv, X.: Entity resolution with hierarchical graph attention networks. In: SIGMOD, pp. 429–442 (2022)
Ni, J., Qu, C., Lu, J., Dai, Z., Ábrego, G.H., Ma, J., Zhao, V.Y., Luan, Y., Hall, K.B., Chang, M., Yang, Y.: Large dual encoders are generalizable retrievers. In: EMNLP. Association for Computational Linguistics, pp. 9844–9855 (2022)
Paganelli, M., Buono, F.D., Baraldi, A., Guerra, F.: Analyzing how BERT performs entity matching. PVLDB15(8), 1726–1738 (2022)
Liu, Q., Kusner, M.J., Blunsom, P.: A survey on contextual embeddings. CoRR, vol. abs/2003.07278 (2020)
Trummer, I.: From BERT to GPT-3 codex: harnessing the potential of very large language models for data management. PVLDB15(12), 3770–3773 (2022)
Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., Specia, L.: Semeval-2017 task 1: semantic textual similarity-multilingual and cross-lingual focused evaluation. arXiv preprintarXiv:1708.00055 (2017)
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: GLUE: a multi-task benchmark and analysis platform for natural language understanding. In: BlackboxNLP@EMNLP. Association for Computational Linguistics, pp. 353–355 (2018)
Akbarian Rastaghi, M., Kamalloo, E., Rafiei, D.: Probing the robustness of pre-trained language models for entity matching. In: CIKM, pp. 3786–3790 (2022)
Peeters, R., Der, R.C., Bizer, C.: WDC products: a multi-dimensional entity matching benchmark. In: EDBT. OpenProceedings.org, pp. 22–33 (2024)
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. NeurIPS, vol. 30 (2017)
Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. NeurIPS, vol. 27 (2014)
Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: ICLR (2015)
Hinton, G., Vinyals, O., Dean, J., et al.: Distilling the knowledge in a neural network, vol. 2, no. 7, arXiv preprintarXiv:1503.02531 (2015)
Romero, A., Ballas, N., Kahou, S.E., Chassang, A., Gatta, C., Bengio, Y.: Fitnets: Hints for thin deep nets. In: ICLR (Poster) (2015)
Reimers, N., Gurevych, I.: Sentence-bert: sentence embeddings using siamese bert-networks. In: EMNLP/IJCNLP (1). Association for Computational Linguistics, pp. 3980–3990 (2019)
Jiao, X., Yin, Y., Shang, L., Jiang, X., Chen, X., Li, L., Wang, F., Liu, Q.: Tinybert: distilling BERT for natural language understanding. In: EMNLP (Findings), ser. Findings of ACL, vol. EMNLP 2020. Association for Computational Linguistics, pp. 4163–4174 (2020)
Sun, Z., Yu, H., Song, X., Liu, R., Yang, Y., Zhou, D.: Mobilebert: a compact task-agnostic BERT for resource-limited devices, pp. 2158–2170 (2020)
Köpcke, H., Thor, A., Rahm, E.: Evaluation of entity resolution approaches on real-world match problems. PVLDB3(1), 484–493 (2010)
Obraczka, D., Schuchart, J., Rahm, E.: EAGER: embedding-assisted entity resolution for knowledge graphs. CoRR, vol. abs/2101.06126 (2021)
Papadakis, G., Ioannou, E., Niederée, C., Fankhauser, P.: Efficient entity resolution for large heterogeneous information spaces. In: WSDM, pp. 535–544 (2011)
Papadakis, G., Svirsky, J., Gal, A., Palpanas, T.: Comparative analysis of approximate blocking techniques for entity resolution. PVLDB9(9), 684–695 (2016)
Kenig, B., Gal, A.: Mfiblocks: an effective blocking algorithm for entity resolution. Inf. Syst.38(6), 908–926 (2013)
Christen, P.: A survey of indexing techniques for scalable record linkage and deduplication. TKDE24(9), 1537–1555 (2012)
Christen, P.: “Febrl -: an open source data cleaning, deduplication and record linkage system with a graphical user interface,” in SIGKDD, pp. 1065–1068
Papadakis, G., Kirielle, N., Christen, P., Palpanas, T.: A critical re-evaluation of benchmark datasets for (deep) learning-based matching algorithms. In: ICDE, pp. 3435–3448 (2024)
Papadakis, G., Alexiou, G., Papastefanatos, G., Koutrika, G.: Schema-agnostic vs schema-based configurations for blocking methods on homogeneous data. PVLDB9(4), 312–323 (2015)
Li, W., Zhang, Y., Sun, Y., Wang, W., Li, M., Zhang, W., Lin, X.: Approximate nearest neighbor search on high dimensional data-experiments, analyses, and improvement. TKDE32(8), 1475–1488 (2019)
Malkov, Y.A., Yashunin, D.A.: Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Trans. Pattern Anal. Mach. Intell.42(4), 824–836 (2020)
Lacoste-Julien, S., Palla, K., Davies, A., Kasneci, G., Graepel, T., Ghahramani, Z.: Sigma: simple greedy matching for aligning large knowledge bases. In: KDD, pp. 572–580 (2013)
Konda, P., et al.: Magellan: toward building entity matching management systems. Proc. VLDB Endow.9(12), 1197–1208 (2016)
Zeakis, A., Skoutas, D., Sacharidis, D., Papapetrou, O., Koubarakis, M.: TokenJoin: efficient filtering for set similarity join with maximumweighted bipartite matching. PVLDB16(4), 790–802 (2022)
Acknowledgements
This work was partially funded by the EU project STELAR (Horizon Europe - Grant No. 101070122).
Author information
Authors and Affiliations
National and Kapodistrian University of Athens, Athens, Greece
Alexandros Zeakis, George Papadakis & Manolis Koubarakis
Athena Research Center, Athens, Greece
Alexandros Zeakis & Dimitrios Skoutas
- Alexandros Zeakis
You can also search for this author inPubMed Google Scholar
- George Papadakis
You can also search for this author inPubMed Google Scholar
- Dimitrios Skoutas
You can also search for this author inPubMed Google Scholar
- Manolis Koubarakis
You can also search for this author inPubMed Google Scholar
Corresponding author
Correspondence toAlexandros Zeakis.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zeakis, A., Papadakis, G., Skoutas, D.et al. An in-depth analysis of pre-trained embeddings for entity resolution.The VLDB Journal34, 5 (2025). https://doi.org/10.1007/s00778-024-00879-4
Received:
Revised:
Accepted:
Published:
Share this article
Anyone you share the following link with will be able to read this content:
Sorry, a shareable link is not currently available for this article.
Provided by the Springer Nature SharedIt content-sharing initiative