Citation analysis is the examination of the frequency, patterns, and graphs of citations in documents. It uses the directed graph of citations – links from one document to another document – to reveal properties of the documents. A typical aim would be to identify the most important documents in a collection. A classic example is that of the citations between academic articles and books.[1][2] For another example, judges of law support their judgements by referring back to judgements made in earlier cases (see citation analysis in a legal context). An additional example is provided by patents, which contain prior art: citations of earlier patents relevant to the current claim. The digitization of patent data and increasing computing power have led to a community of practice that uses these citation data to measure innovation attributes, trace knowledge flows, and map innovation networks.[3]
Documents can be associated with many other features in addition to citations, such as authors, publishers, and journals, as well as their actual texts. The general analysis of collections of documents is known as bibliometrics, and citation analysis is a key part of that field. For example, bibliographic coupling and co-citation are association measures based on citation analysis (shared citations or shared references). The citations in a collection of documents can also be represented in forms such as a citation graph, as pointed out by Derek J. de Solla Price in his 1965 article "Networks of Scientific Papers".[4] This means that citation analysis draws on aspects of social network analysis and network science.
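As an illustration of these association measures, the following minimal Python sketch (not part of the original article; the document names and reference lists are invented) computes bibliographic coupling strength, the number of references two documents share, and co-citation strength, the number of documents that cite both members of a pair, from a small directed citation graph.

    from itertools import combinations

    # cites[d] is the set of documents that d cites (its reference list).
    cites = {
        "A": {"X", "Y", "Z"},
        "B": {"Y", "Z"},
        "C": {"X", "Z"},
    }

    def bibliographic_coupling(d1, d2):
        # Two documents are coupled to the extent that they cite the same works.
        return len(cites[d1] & cites[d2])

    def co_citation(c1, c2):
        # Two documents are co-cited to the extent that later works cite both.
        return sum(1 for refs in cites.values() if c1 in refs and c2 in refs)

    for d1, d2 in combinations(sorted(cites), 2):
        print(d1, d2, "coupling strength:", bibliographic_coupling(d1, d2))
    print("co-citation strength of Y and Z:", co_citation("Y", "Z"))

The same counts underlie the large-scale analyses described below; here the graph is kept to three citing documents for readability.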
An early example of automated citation indexing was CiteSeer, which was used for citations between academic papers, while Web of Science is an example of a modern system which includes more than just academic books and articles, reflecting a wider range of information sources. Today, automated citation indexing[5] has changed the nature of citation analysis research, allowing millions of citations to be analyzed for large-scale patterns and knowledge discovery. Citation analysis tools can be used to compute various impact measures for scholars based on data from citation indices.[6][7][note 1] These have various applications, from the identification of expert referees to review papers and grant proposals, to providing transparent data in support of academic merit review, tenure, and promotion decisions. This competition for limited resources may lead to ethically questionable behavior to increase citations.[8][9]
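One widely used impact measure of this kind is the h-index (this article touches on it only via the Hirsch-index reference; the citation counts below are invented for illustration). A minimal Python sketch of its computation, assuming the input is the list of citation counts of one author's papers:

    def h_index(citation_counts):
        # The h-index is the largest h such that the author has at least
        # h papers with at least h citations each.
        counts = sorted(citation_counts, reverse=True)
        h = 0
        for rank, c in enumerate(counts, start=1):
            if c >= rank:
                h = rank
            else:
                break
        return h

    print(h_index([10, 8, 5, 4, 3]))   # 4
    print(h_index([25, 8, 5, 3, 3]))   # 3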
Considerable criticism has been directed at the practice of naively using citation analysis to compare the impact of different scholarly articles without taking into account other factors that may affect citation patterns.[10] Among these criticisms, a recurrent one concerns "field-dependent factors": citation practices vary from one area of science to another, and even between fields of research within a discipline.[11]
While citation indexes were originally designed for information retrieval, they are increasingly used for bibliometrics and other studies involving research evaluation. Citation data is also the basis of the popular journal impact factor.
There is a large body of literature on citation analysis, sometimes called scientometrics, a term coined by Vasily Nalimov, or more specifically bibliometrics. The field blossomed with the advent of the Science Citation Index, which now covers source literature from 1900 onward. The leading journals of the field are Scientometrics, the Journal of Informetrics, and the Journal of the Association for Information Science and Technology. ASIST also hosts an electronic mailing list called SIGMETRICS.[12] The method has undergone a resurgence based on the wide dissemination of the Web of Science and Scopus subscription databases in many universities, and on freely available citation tools such as CiteBase, CiteSeerX, Google Scholar, and the former Windows Live Academic (later relaunched with extra features as Microsoft Academic). Methods of citation analysis research include qualitative, quantitative, and computational approaches. The main foci of such scientometric studies have included productivity comparisons, institutional research rankings, journal rankings,[13] establishing faculty productivity and tenure standards,[14] assessing the influence of top scholarly articles,[15] tracing the development trajectory of a science or technology field,[16] and developing profiles of top authors and institutions in terms of research performance.[17]
In a 1965 paper, Derek J. de Solla Price described the inherent linking characteristic of the SCI as "Networks of Scientific Papers".[4] The links between citing and cited papers became dynamic when the SCI began to be published online. The Social Sciences Citation Index became one of the first databases to be mounted on the Dialog system[22] in 1972. With the advent of the CD-ROM edition, linking became even easier and enabled the use of bibliographic coupling for finding related records. In 1973, Henry Small published his classic work on co-citation analysis, which became a self-organizing classification system that led to document clustering experiments and eventually an "Atlas of Science", later called "Research Reviews".
The use of citation counts to rank journals was a technique used in the early part of the twentieth century, but the systematic ongoing measurement of these counts for scientific journals was initiated by Eugene Garfield at the Institute for Scientific Information, who also pioneered the use of these counts to rank authors and papers. In a landmark 1965 paper, he and Irving Sher showed the correlation between citation frequency and eminence by demonstrating that Nobel Prize winners published five times the average number of papers, while their work was cited 30 to 50 times the average. Garfield reported this phenomenon in a long series of essays on the Nobel and other prizes. The usual summary measure is known as the impact factor: the number of citations to a journal for the previous two years, divided by the number of articles published in those years. It is widely used, for both appropriate and inappropriate purposes – in particular, the use of this measure alone for ranking authors and papers is quite controversial.
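As a worked illustration (the year and the counts below are invented, not taken from the article), the two-year impact factor can be written as:

    \mathrm{IF}_{2024} = \frac{C_{2024}}{N_{2022} + N_{2023}}
                       = \frac{450}{150 + 150} = 3.0

where C_{2024} is the number of citations received in 2024 by items the journal published in 2022 and 2023, and N_{2022}, N_{2023} are the numbers of articles it published in those two years.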
Automatic citation indexing was introduced in 1998 by Lee Giles, Steve Lawrence and Kurt Bollacker[25] and enabled automatic algorithmic extraction and grouping of citations for any digital academic and scientific document. Where previous citation extraction was a manual process, citation measures could now scale up and be computed for any scholarly and scientific field and document venue, not just those selected by organizations such as ISI. This led to the creation of new systems for public and automated citation indexing, the first being CiteSeer (now CiteSeerX), soon followed by Cora, which focused primarily on the fields of computer science and information science. These were later followed by large-scale academic domain citation systems such as Google Scholar and Microsoft Academic. Such autonomous citation indexing is not yet perfect in citation extraction or citation clustering, with an error rate estimated by some at 10%, though a careful statistical sampling has yet to be done. This has resulted in such authors as Ann Arbor, Milton Keynes, and Walton Hall being credited with extensive academic output.[26] SCI claims to create automatic citation indexing through purely programmatic methods. Even the older records have a similar magnitude of error.
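As a toy illustration of the grouping (clustering) step, and not the actual CiteSeer algorithm, the following sketch merges differently formatted reference strings that appear to denote the same work by comparing their word sets; the reference strings and the threshold are invented:

    import re

    references = [
        "Giles, C.L., Bollacker, K., Lawrence, S. CiteSeer: An Automatic Citation Indexing System, 1998.",
        'C. L. Giles et al., "CiteSeer: an automatic citation indexing system" (1998)',
        "Small, H. Co-citation in the scientific literature, 1973.",
    ]

    def tokens(ref):
        # Lowercased word set, so punctuation and ordering differences vanish.
        return set(re.findall(r"[a-z0-9]+", ref.lower()))

    def similar(r1, r2, threshold=0.5):
        # Jaccard similarity of the two word sets.
        t1, t2 = tokens(r1), tokens(r2)
        return len(t1 & t2) / len(t1 | t2) >= threshold

    # Greedy clustering: join the first group whose representative is similar,
    # otherwise start a new group.
    groups = []
    for ref in references:
        for group in groups:
            if similar(ref, group[0]):
                group.append(ref)
                break
        else:
            groups.append([ref])

    for i, group in enumerate(groups):
        print(f"group {i}: {len(group)} variant(s)")  # group 0 has 2 variants, group 1 has 1

Real systems use far more robust matching on fields and metadata; the point here is only that variant strings must be resolved to a single cited work before citation counts are meaningful.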
Citation impact or citation metric is a measure of how often an academic article, journal, book, author, or institution is cited.[27][28][29][30][31][32][33] Citation count is a raw score equal to the number of citations received (considered in a given citation index), while citation frequency or citation rate is a normalized value given by the ratio of the citation count to the number of articles published by the journal or author group during a given time period; for example, 5 citations received by 10 articles would result in a citation frequency of 5/10 = 0.5.
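Restated in symbols (the notation is chosen here for illustration, not taken from the article):

    \text{citation count} = C, \qquad
    \text{citation frequency} = \frac{C}{N} = \frac{5}{10} = 0.5

where C is the number of citations received in the period considered and N is the number of articles published by the journal or author group in that period.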
Citation analysis for legal documents is an approach to facilitate the understanding and analysis of inter-related regulatory compliance documents by exploring the citations that connect provisions to other provisions within the same document or between different documents. It uses a citation graph extracted from a regulatory document, which can supplement E-discovery, a process that leverages technological innovations in big data analytics.[18][19][20][21]
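A minimal sketch of what extracting such a citation graph might look like, assuming a toy cross-reference pattern of the form "Sec. N" (the pattern and the sample provisions are invented; this is not the method of the cited papers):

    import re

    # Toy regulatory text; real documents would first be segmented into provisions.
    provisions = {
        "Sec. 1": "Operators shall keep records as required under Sec. 3.",
        "Sec. 2": "Exemptions in Sec. 1 and Sec. 3 do not apply to minors.",
        "Sec. 3": "Records shall be retained for five years.",
    }

    # Edge (a, b) means provision a cites provision b.
    edges = []
    for source, text in provisions.items():
        for target in re.findall(r"Sec\.\s*\d+", text):
            if target != source:
                edges.append((source, target))

    print(edges)
    # [('Sec. 1', 'Sec. 3'), ('Sec. 2', 'Sec. 1'), ('Sec. 2', 'Sec. 3')]

The resulting edge list is the citation graph over provisions; the same structure extracted across documents links provisions in one regulation to those in another.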
Citation-based plagiarism detection (CbPD)[37] relies on citation analysis and is the only approach to plagiarism detection that does not rely on textual similarity.[38] CbPD examines the citation and reference information in texts to identify similar patterns in the citation sequences. As such, this approach is suitable for scientific texts or other academic documents that contain citations. Citation analysis to detect plagiarism is a relatively young concept. It has not been adopted by commercial software, but a first prototype of a citation-based plagiarism detection system exists.[39] Similar order and proximity of citations in the examined documents are the main criteria used to compute citation pattern similarities. Citation patterns represent subsequences non-exclusively containing citations shared by the documents compared.[38][40] Factors including the absolute number or relative fraction of shared citations in the pattern, as well as the probability that citations co-occur in a document, are also considered to quantify the patterns' degree of similarity.[38][40][41][42]
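A highly simplified sketch of the underlying idea, not the CbPD algorithm itself (the citation sequences below are invented): compare two documents by counting shared citations and scoring how much of their citation ordering they have in common via a longest common subsequence.

    def lcs_length(a, b):
        # Length of the longest common subsequence of two citation sequences.
        dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i, x in enumerate(a, 1):
            for j, y in enumerate(b, 1):
                dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
        return dp[len(a)][len(b)]

    # Citations listed in the order they appear in each document.
    doc1 = ["Smith2001", "Lee2005", "Chen2010", "Rao2012"]
    doc2 = ["Lee2005", "Chen2010", "Park2015", "Rao2012"]

    shared = set(doc1) & set(doc2)
    order_score = lcs_length(doc1, doc2) / max(len(doc1), len(doc2))
    print(len(shared), round(order_score, 2))  # 3 shared citations, order score 0.75

CbPD as described above also weights proximity and the probability that citations co-occur; this sketch captures only the shared-citation and ordering components.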
Natural language processing (NLP), a field at the intersection of artificial intelligence and linguistics, is poised to substantially impact society through various innovations such as large language models. The impact on and of NLP has been extensively studied through citations. Researchers have analyzed various factors such as the cross-field influence between different fields,[43][44] industry impact,[45] temporal citation patterns,[46] plagiarism,[47] geographic location,[48] and gender.[49] Many studies show the field is becoming more insular, with a narrowing focus, reduced interdisciplinarity, and a concentration of funding among a few industry actors.
E-publishing: due to the unprecedented growth of electronic resource (e-resource) availability, one of the questions currently being explored is, "how often are e-resources being cited in my field?"[50] For instance, there are claims that online access to computer science literature leads to higher citation rates,[51] whereas humanities articles may suffer if they are not in print.
Self-citations: authors have been criticized for gaming the system by accumulating citations through excessive self-citation.[52] For instance, it has been found that men tend to cite themselves more often than women.[53]
Citation pollution: the infiltration of retracted or fake research into the citations of legitimate research, negatively impacting the validity of that research.[54] Owing to various factors, including the publication race and the concerning rise in unscrupulous business practices by so-called predatory or deceptive publishers, research quality in general is facing different types of threats.
Citation justice and citation bias: Because having others cite a publication helps the original author's career prospects, and because the key works in some fields were published by men, by older scholars, and by white people, there have been calls to promote social justice by deliberately citing publications by people from marginalized backgrounds, or by checking citations for bias before publication.[55]
^ Jaffe, Adam; de Rassenfosse, Gaétan (2017). "Patent citation data in social science research: Overview and best practices". Journal of the Association for Information Science and Technology. 68 (6): 1360–1374. doi:10.1002/asi.23731.
^ Giles, C. Lee; Bollacker, Kurt D.; Lawrence, Steve (1998). "CiteSeer". Proceedings of the Third ACM Conference on Digital Libraries – DL '98. New York: Association for Computing Machinery. pp. 89–98. doi:10.1145/276675.276685. ISBN 978-0-89791-965-4. S2CID 514080.
^ Hoang, D.; Kaur, J.; Menczer, F. (2010). "Crowdsourcing Scholarly Data". Proceedings of the WebSci10: Extending the Frontiers of Society On-Line, April 26–27, 2010, Raleigh, NC, US. Archived from the original on 2015-04-17. Retrieved 2015-08-09.
^ Anderson, M.S.; Ronning, E.A.; de Vries, R.; Martinson, B.C. (2007). "The perverse effects of competition on scientists' work and relationships". Science and Engineering Ethics. 13 (4): 437–461. doi:10.1007/s11948-007-9042-5. PMID 18030595. S2CID 2994701.
^ Anauati, Maria Victoria; Galiani, Sebastian; Gálvez, Ramiro H. (November 11, 2014). "Quantifying the Life Cycle of Scholarly Articles Across Fields of Economic Research". Available at SSRN: https://ssrn.com/abstract=2523078
^ Liu, John S.; Lu, Louis Y.Y. (2012-03-01). "An integrated approach for main path analysis: Development of the Hirsch index as an example". Journal of the American Society for Information Science and Technology. 63 (3): 528–542. doi:10.1002/asi.21692. ISSN 1532-2890.
^ Hamou-Lhadj, Abdelwahab; Hamdaqa, Mohammad (2009). "Citation Analysis: An Approach for Facilitating the Understanding and the Analysis of Regulatory Compliance Documents". 2009 Sixth International Conference on Information Technology: New Generations. Las Vegas, NV: IEEE. pp. 278–283. doi:10.1109/ITNG.2009.161. ISBN 978-1-4244-3770-2. S2CID 10083351.
^ Hamdaqa, Mohammad; Hamou-Lhadj, Abdelwahab. "Citation Analysis: An Approach for Facilitating the Understanding and the Analysis of Regulatory Compliance Documents". In Proceedings of the 6th International Conference on Information Technology, Las Vegas, US.
^ Giles, C.L.; Bollacker, K.; Lawrence, S. (1998). "CiteSeer: An Automatic Citation Indexing System". DL '98: Proceedings of the 3rd ACM Conference on Digital Libraries. pp. 89–98.
^ Rungta, Mukund; Singh, Janvijay; Mohammad, Saif M.; Yang, Diyi (December 2022). "Geographic Citation Gaps in NLP Research". In Goldberg, Yoav; Kozareva, Zornitsa; Zhang, Yue (eds.). Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi, United Arab Emirates: Association for Computational Linguistics. pp. 1371–1383. doi:10.18653/v1/2022.emnlp-main.89.
^ Lawrence, Steve (2001). "Free online availability substantially increases a paper's impact". Nature. 411 (6837): 521. Also online at http://citeseer.ist.psu.edu/online-nature01/
^ Gálvez, R.H. (March 2017). "Assessing author self-citation as a mechanism of relevant knowledge diffusion". Scientometrics. 111 (3): 1801–1812. doi:10.1007/s11192-017-2330-1. S2CID 6863843.