Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8485)
Abstract
Authorship attribution is the task of identifying the authors of a set of documents. Early studies in this area either used book-length texts or assumed that a large number of training documents were available. The focus of modern authorship attribution has shifted to the analysis of short online texts. This setting is realistic, since in real life it is hard to collect training texts. However, the small size of the training data makes authorship attribution much more difficult. In this paper, we present a novel co-training method that iteratively labels a few unlabeled documents and adds them to the training set. Specifically, each document is first partitioned into two distinct views, a lexical view and a syntactic view. Then a two-view semi-supervised method, co-training, is adopted to exploit the large amount of unlabeled documents. Our experimental results on real data show that the proposed method can effectively exploit unlabeled data to improve classification performance.
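The two-view loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the classifier, the feature extraction, or how many documents are added per round, so a nearest-centroid classifier stands in for the real one, each document is assumed to arrive as two precomputed feature vectors (one per view), and the names `centroid_fit`, `centroid_predict`, `co_train`, `rounds`, and `per_round` are hypothetical.

```python
import math
from collections import defaultdict


def centroid_fit(X, y):
    """Fit a nearest-centroid model (a stand-in classifier, assumed here)."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def centroid_predict(model, vec):
    """Return (label, confidence); confidence is the distance margin
    between the nearest and second-nearest centroid (an assumption)."""
    dists = sorted((math.dist(vec, c), lab) for lab, c in model.items())
    best_dist, best_label = dists[0]
    margin = dists[1][0] - best_dist if len(dists) > 1 else 0.0
    return best_label, margin


def co_train(labeled, unlabeled, rounds=5, per_round=1):
    """labeled: list of (view_a, view_b, label); unlabeled: list of (view_a, view_b).

    In each round, one model is trained per view, and each model moves its
    `per_round` most confident unlabeled documents into the training set.
    """
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        y = [lab for _, _, lab in labeled]
        models = [centroid_fit([d[v] for d in labeled], y) for v in (0, 1)]
        picked = {}  # pool index -> label assigned by one of the views
        for v, model in enumerate(models):
            scored = sorted(
                ((centroid_predict(model, doc[v]), i)
                 for i, doc in enumerate(pool) if i not in picked),
                key=lambda item: item[0][1],
                reverse=True,
            )
            for (label, _), i in scored[:per_round]:
                picked[i] = label
        for i, label in picked.items():
            labeled.append((pool[i][0], pool[i][1], label))
        pool = [doc for i, doc in enumerate(pool) if i not in picked]
    # Retrain both view models on the augmented training set.
    y = [lab for _, _, lab in labeled]
    models = [centroid_fit([d[v] for d in labeled], y) for v in (0, 1)]
    return models, labeled
```

Calling `co_train(labeled, unlabeled)` returns one model per view together with the augmented training set. Adding only the most confident predictions per round is a common way to limit label noise, but the selection rule used in the paper may differ.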
Author information
Authors and Affiliations
State Key Laboratory of Software Engineering, Wuhan University, Wuhan, China
Mengdi Fan, Tieyun Qian, Li Chen, Bin Liu, Ming Zhong & Guoliang He
Editor information
Editors and Affiliations
School of Computing, University of Utah, 50 S. Central Campus Drive, 84112, Salt Lake City, UT, USA
Feifei Li
Department of Computer Science, Tsinghua University, 100084, Beijing, China
Guoliang Li
POSTECH, Republic of Korea
Seung-won Hwang
Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
Bin Yao
Advanced Digital Sciences Center (ADSC), 138632, Singapore, Singapore
Zhenjie Zhang
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Fan, M., Qian, T., Chen, L., Liu, B., Zhong, M., He, G. (2014). Authorship Attribution with Very Few Labeled Data: A Co-training Approach. In: Li, F., Li, G., Hwang, Sw., Yao, B., Zhang, Z. (eds) Web-Age Information Management. WAIM 2014. Lecture Notes in Computer Science, vol 8485. Springer, Cham. https://doi.org/10.1007/978-3-319-08010-9_70
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08009-3
Online ISBN: 978-3-319-08010-9
eBook Packages: Computer Science, Computer Science (R0)