
An in-depth analysis of pre-trained embeddings for entity resolution

  • Regular Paper
  • The VLDB Journal

Abstract

Recent works on entity resolution (ER) leverage deep learning techniques that rely on language models to improve effectiveness. These techniques are used for both blocking and matching, the two main steps of ER. Several language models have been tested in the literature, with fastText and BERT variants being the most popular. However, there is no detailed analysis of their strengths and weaknesses. We fill this gap through a thorough experimental analysis of 12 popular pre-trained language models over 17 established benchmark datasets. First, we examine their relative effectiveness in blocking, unsupervised matching and supervised matching. We extend this analysis by investigating the complementarity and transferability of the language models, and we further explain their relative performance by examining the similarity scores and ranking positions each model yields. In each task, we compare them with several state-of-the-art techniques from the literature. Then, we investigate their relative time efficiency with respect to vectorization overhead, blocking scalability and matching run-time. The experiments are carried out in both schema-agnostic and schema-aware settings. In the former, all attribute values per entity are concatenated into a representative sentence, whereas in the latter the values of individual attributes are considered. Our results provide novel insights into the pros and cons of the main language models, facilitating their use in ER applications.
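
The difference between the schema-agnostic and schema-aware settings can be illustrated with a minimal sketch. It is not the authors' code: the records, the attribute names and the helper functions (schema_agnostic, schema_aware, toy_embed, cosine) are hypothetical, and toy_embed is a character n-gram stand-in for a pre-trained language model, used here only so that the example runs without external dependencies.

```python
import math
from collections import Counter

# Hypothetical toy records; attribute names and values are illustrative only.
record_a = {"title": "Apple iPhone 13 128GB", "brand": "Apple", "price": "799"}
record_b = {"title": "iPhone 13 (128 GB)", "brand": "Apple Inc.", "price": "799.00"}


def schema_agnostic(record):
    """Concatenate all attribute values into one representative sentence."""
    return " ".join(str(v) for v in record.values() if v)


def schema_aware(record, attribute):
    """Return the value of a single attribute (empty string if missing)."""
    return str(record.get(attribute, ""))


def toy_embed(text, n=3):
    """Stand-in for a pre-trained model: a sparse bag of character n-grams."""
    s = text.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Schema-agnostic: one vector, and hence one similarity score, per record pair.
agnostic_sim = cosine(
    toy_embed(schema_agnostic(record_a)),
    toy_embed(schema_agnostic(record_b)),
)

# Schema-aware: one vector per attribute, and hence one score per attribute.
aware_sims = {
    attr: cosine(
        toy_embed(schema_aware(record_a, attr)),
        toy_embed(schema_aware(record_b, attr)),
    )
    for attr in record_a
}

print("schema-agnostic similarity:", round(agnostic_sim, 3))
print("schema-aware similarities:", {k: round(v, 3) for k, v in aware_sims.items()})
```

In practice, toy_embed would be replaced by the vectorization step of whichever pre-trained model is being evaluated; the serialization logic stays the same in either setting.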



Acknowledgements

This work was partially funded by the EU project STELAR (Horizon Europe - Grant No. 101070122).

Author information

Authors and Affiliations

  1. National and Kapodistrian University of Athens, Athens, Greece

    Alexandros Zeakis, George Papadakis & Manolis Koubarakis

  2. Athena Research Center, Athens, Greece

    Alexandros Zeakis & Dimitrios Skoutas

Authors
  1. Alexandros Zeakis

  2. George Papadakis

  3. Dimitrios Skoutas

  4. Manolis Koubarakis

Corresponding author

Correspondence to Alexandros Zeakis.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zeakis, A., Papadakis, G., Skoutas, D. et al. An in-depth analysis of pre-trained embeddings for entity resolution. The VLDB Journal 34, 5 (2025). https://doi.org/10.1007/s00778-024-00879-4
