Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9672)
Abstract
In this paper, we revisit our method for reconstructing the primary sources of documents, which make up an important part of their provenance. The method rests on the assumption that two semantically similar documents are likely to share a common source. We previously evaluated this assumption on an excerpt from a news archive, achieving 68.2 % precision and 73 % recall when reconstructing the primary sources of all articles. However, because we could not release that dataset publicly, our results were hard to compare with those of others. In this work, we extend the flexibility of our method by adding a new parameter, and re-evaluate it on the human-generated dataset created for the 2014 Provenance Reconstruction Challenge. The extended method achieves up to 86 % precision and 59 % recall, and is now directly comparable to any approach that uses the same dataset.
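The similarity assumption and the precision/recall evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words cosine similarity, the pairwise thresholding (used here in place of clustering), the `threshold` parameter, and all function names and toy documents are assumptions made purely for illustration.

```python
import math
from collections import Counter
from typing import Dict, FrozenSet, Set, Tuple


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words term-frequency vectors.

    Assumption: a simple lexical stand-in for a semantic similarity measure.
    """
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_shared_source_pairs(
    docs: Dict[str, str], threshold: float
) -> Set[FrozenSet[str]]:
    """Predict that two documents share a source if their similarity
    reaches the (hypothetical) threshold."""
    ids = sorted(docs)
    return {
        frozenset((x, y))
        for i, x in enumerate(ids)
        for y in ids[i + 1:]
        if cosine_similarity(docs[x], docs[y]) >= threshold
    }


def precision_recall(
    predicted: Set[FrozenSet[str]], truth: Set[FrozenSet[str]]
) -> Tuple[float, float]:
    """Precision and recall of predicted pairs against ground-truth pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall


# Toy data, invented for illustration only.
docs = {
    "a": "flood hits city centre",
    "b": "flood hits city suburbs",
    "c": "stock market rallies",
}
truth = {frozenset(("a", "b")), frozenset(("a", "c"))}

predicted = predict_shared_source_pairs(docs, threshold=0.5)
precision, recall = precision_recall(predicted, truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setting, raising the threshold tends to favour precision over recall, which is the kind of trade-off a tunable parameter exposes.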
Author information
Authors and Affiliations
Ghent University – iMinds – Data Science Lab, Ghent, Belgium
Tom De Nies, Erik Mannens & Rik Van de Walle
Corresponding author
Correspondence to Tom De Nies.
Editor information
Editors and Affiliations
COPPE/UFRJ, Rio de Janeiro, Brazil
Marta Mattoso
Illinois Institute of Technology, Chicago, Illinois, USA
Boris Glavic
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
De Nies, T., Mannens, E., Van de Walle, R. (2016). Reconstructing Human-Generated Provenance Through Similarity-Based Clustering. In: Mattoso, M., Glavic, B. (eds) Provenance and Annotation of Data and Processes. IPAW 2016. Lecture Notes in Computer Science, vol 9672. Springer, Cham. https://doi.org/10.1007/978-3-319-40593-3_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-40592-6
Online ISBN: 978-3-319-40593-3
eBook Packages: Computer Science, Computer Science (R0)