An in-depth analysis of pre-trained embeddings for entity resolution

Abstract

Recent works on entity resolution (ER) leverage deep learning techniques that rely on language models to improve effectiveness. These techniques are used both for blocking and matching, the two main steps of ER. Several language models have been tested in the literature, with fastText and BERT variants being most popular. However, there is no detailed analysis of their strengths and weaknesses. We cover this gap through a thorough experimental analysis of 12 popular pre-trained language models over 17 established benchmark datasets. First, we examine their relative effectiveness in blocking, unsupervised matching and supervised matching. We enhance our analysis by also investigating the complementarity and transferability of the language models and we further justify their relative performance by looking into the similarity scores and ranking positions each model yields. In each task, we compare them with several state-of-the-art techniques in the literature. Then, we investigate their relative time efficiency with respect to vectorization overhead, blocking scalability and matching run-time. The experiments are carried out both in schema-agnostic and schema-aware settings. In the former, all attribute values per entity are concatenated into a representative sentence, whereas in the latter the values of individual attributes are considered. Our results provide novel insights into the pros and cons of the main language models, facilitating their use in ER applications.
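The difference between the two settings described above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the `Entity` type and the sample record are hypothetical, and the whitespace-joined concatenation is only one plausible way to build the "representative sentence".

```python
from typing import Dict, Optional

# Hypothetical entity representation: attribute name -> value (None if missing).
Entity = Dict[str, Optional[str]]


def schema_agnostic_text(entity: Entity) -> str:
    """Concatenate all non-empty attribute values into a single sentence."""
    values = (str(v).strip() for v in entity.values() if v is not None)
    return " ".join(v for v in values if v)


def schema_aware_text(entity: Entity, attribute: str) -> str:
    """Return the value of one chosen attribute (empty string if missing)."""
    value = entity.get(attribute)
    return "" if value is None else str(value).strip()


# Illustrative record; the attribute names are assumptions, not a benchmark schema.
record = {
    "title": "Deep entity matching",
    "venue": "PVLDB",
    "year": "2020",
    "pages": None,
}

agnostic = schema_agnostic_text(record)   # input text for a schema-agnostic run
aware = schema_aware_text(record, "title")  # input text for a schema-aware run
```

In either setting, the resulting string is what would be passed to a language model for vectorization; the schema-aware variant simply repeats this per attribute of interest.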

Notes

https://www.sbert.net/docs/pretrained_models.html.
http://oaei.ontologymatching.org/2010/im.
https://zenodo.org/record/6950980.
https://github.com/alexZeakis/Embeddings4ER.
https://radimrehurek.com/gensim.
https://huggingface.co.
https://faiss.ai/cpp_api/struct/structfaiss_1_1IndexHNSW.html.
Note that this case does not apply to Clean-Clean ER, where the query entities are disjoint from the indexed and retrieved ones. Thus, all candidate pairs are unique by default.
https://github.com/qcri/DeepBlocker.
https://github.com/anhaidgroup/sparkly.
https://spark.apache.org.
https://lucene.apache.org/pylucene.
https://github.com/alexZeakis/TokenJoin.
We used the implementation provided by pyJedAI at https://github.com/AI-team-UoA/pyJedAI.

Acknowledgements

This work was partially funded by the EU project STELAR (Horizon Europe - Grant No. 101070122).

Author information

Authors and Affiliations

National and Kapodistrian University of Athens, Athens, Greece
Alexandros Zeakis, George Papadakis & Manolis Koubarakis
Athena Research Center, Athens, Greece
Alexandros Zeakis & Dimitrios Skoutas

Authors

Alexandros Zeakis
George Papadakis
Dimitrios Skoutas
Manolis Koubarakis

Corresponding author

Correspondence to Alexandros Zeakis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zeakis, A., Papadakis, G., Skoutas, D. et al. An in-depth analysis of pre-trained embeddings for entity resolution. The VLDB Journal 34, 5 (2025). https://doi.org/10.1007/s00778-024-00879-4

Received: 10 January 2024
Revised: 26 September 2024
Accepted: 07 October 2024
Published: 04 December 2024
DOI: https://doi.org/10.1007/s00778-024-00879-4
