Big Data promises to revolutionise the production of knowledge within and beyond science, by enabling novel, highly efficient ways to plan, conduct, disseminate and assess research. The last few decades have witnessed the creation of novel ways to produce, store, and analyse data, culminating in the emergence of the field of data science, which brings together computational, algorithmic, statistical and mathematical techniques towards extrapolating knowledge from big data. At the same time, the Open Data movement—emerging from policy trends such as the push for Open Government and Open Science—has encouraged the sharing and interlinking of heterogeneous research data via large digital infrastructures. The availability of vast amounts of data in machine-readable formats provides an incentive to create efficient procedures to collect, organise, visualise and model these data. These infrastructures, in turn, serve as platforms for the development of artificial intelligence, with an eye to increasing the reliability, speed and transparency of processes of knowledge creation. Researchers across all disciplines see the newfound ability to link and cross-reference data from diverse sources as improving the accuracy and predictive power of scientific findings and helping to identify future directions of inquiry, thus ultimately providing a novel starting point for empirical investigation. As exemplified by the rise of dedicated funding, training programmes and publication venues, big data are widely viewed as ushering in a new way of performing research and challenging existing understandings of what counts as scientific knowledge.
This entry explores these claims in relation to the use of big data within scientific research, and with an emphasis on the philosophical issues emerging from such use. To this aim, the entry discusses how the emergence of big data—and related technologies, institutions and norms—informs the analysis of the following themes:
These are areas where attention to research practices revolving around big data can benefit philosophy, and particularly work in the epistemology and methodology of science. This entry doesn’t cover the vast scholarship in the history and social studies of science that has emerged in recent years on this topic, though references to some of that literature can be found when conceptually relevant. Complementing historical and social scientific work in data studies, the philosophical analysis of data practices can also elicit significant challenges to the hype surrounding data science and foster a critical understanding of the role of data-fuelled artificial intelligence in research.
We are witnessing a progressive “datafication” of social life. Human activities and interactions with the environment are being monitored and recorded with increasing effectiveness, generating an enormous digital footprint. The resulting “big data” are a treasure trove for research, with ever more sophisticated computational tools being developed to extract knowledge from such data. One example is the use of various types of data acquired from cancer patients, including genomic sequences, physiological measurements and individual responses to treatment, to improve diagnosis and treatment. Another example is the integration of data on traffic flow, environmental and geographical conditions, and human behaviour to produce safety measures for driverless vehicles, so that when confronted with unforeseen events (such as a child suddenly darting into the street on a very cold day), the data can be promptly analysed to identify and generate an appropriate response (the car swerving enough to avoid the child while also minimising the risk of skidding on ice and damaging other vehicles). Yet another instance is the understanding of the nutritional status and needs of a particular population that can be extracted from combining data on food consumption generated by commercial services (e.g., supermarkets, social media and restaurants) with data coming from public health and social services, such as blood test results and hospital intakes linked to malnutrition. In each of these cases, the availability of data and related analytic tools is creating novel opportunities for research and for the development of new forms of inquiry, which are widely perceived as having a transformative effect on science as a whole.
A useful starting point in reflecting on the significance of such cases for a philosophical understanding of research is to consider what the term “big data” actually refers to within contemporary scientific discourse. There are multiple ways to define big data (Kitchin 2014, Kitchin & McArdle 2016). Perhaps the most straightforward characterisation is as large datasets that are produced in a digital form and can be analysed through computational tools. Hence the two features most commonly associated with big data are volume and velocity. Volume refers to the size of the files used to archive and spread data. Velocity refers to the pressing speed with which data is generated and processed. The body of digital data created by research is growing at breakneck pace and in ways that are arguably impossible for the human cognitive system to grasp and thus require some form of automated analysis.
Volume and velocity are also, however, the most disputed features of big data. What may be perceived as “large volume” or “high velocity” depends on rapidly evolving technologies to generate, store, disseminate and visualise the data. This is exemplified by the high-throughput production, storage and dissemination of genomic sequencing and gene expression data, where both data volume and velocity have dramatically increased within the last two decades. Similarly, current understandings of big data as “anything that cannot be easily captured in an Excel spreadsheet” are bound to shift rapidly as new analytic software becomes established, and the very idea of using spreadsheets to capture data becomes a thing of the past. Moreover, data size and speed do not take account of the diversity of data types used by researchers, which may include data that are not generated in digital formats or whose format is not computationally tractable, and which underscores the importance of data provenance (that is, the conditions under which data were generated and disseminated) to processes of inference and interpretation. And as discussed below, the emphasis on physical features of data obscures the continuing dependence of data interpretation on circumstances of data use, including specific queries, values, skills and research situations.
An alternative is to define big data not by reference to their physical attributes, but rather by virtue of what can and cannot be done with them. In this view, big data is a heterogeneous ensemble of data collected from a variety of different sources, typically (but not always) in digital formats suitable for algorithmic processing, in order to generate new knowledge. For example boyd and Crawford (2012: 663) identify big data with “the capacity to search, aggregate and cross-reference large datasets”, while O’Malley and Soyer (2012) focus on the ability to interrogate and interrelate diverse types of data, with the aim to be able to consult them as a single body of evidence. The examples of transformative “big data research” given above are all easily fitted into this view: it is not the mere fact that lots of data are available that makes a difference in those cases, but rather the fact that lots of data can be mobilised from a wide variety of sources (medical records, environmental surveys, weather measurements, consumer behaviour). This account makes sense of other characteristic “v-words” that have been associated with big data, including:
This list of features, though not exhaustive, highlights how big data is not simply “a lot of data”. The epistemic power of big data lies in their capacity to bridge between different research communities, methodological approaches and theoretical frameworks that are difficult to link due to conceptual fragmentation, social barriers and technical difficulties (Leonelli 2019a). And indeed, appeals to big data often emerge from situations of inquiry that are at once technically, conceptually and socially challenging, and where existing methods and resources have proved insufficient or inadequate (Sterner & Franz 2017; Sterner, Franz, & Witteveen 2020).
This understanding of big data is rooted in a long history of researchers grappling with large and complex datasets, as exemplified by fields like astronomy, meteorology, taxonomy and demography (see the collections assembled by Daston 2017; Aronova et al. 2017; Porter & Chadarevian 2018; as well as Aronova et al. 2010, Sepkoski 2013, Stevens 2016, Strasser 2019 among others). Similarly, biomedical research—and particularly subfields such as epidemiology, pharmacology and public health—has an extensive tradition of tackling data of high volume, velocity, variety and volatility, and whose validity, veracity and value are regularly negotiated and contested by patients, governments, funders, pharmaceutical companies, insurers and public institutions (Bauer 2008). Throughout the twentieth century, these efforts spurred the development of techniques, institutions and instruments to collect, order, visualise and analyse data, such as: standard classification systems and formats; guidelines, tools and legislation for the management and security of sensitive data; and infrastructures to integrate and sustain data collections over long periods of time (Daston 2017).
This work culminated in the application of computational technologies, modelling tools and statistical methods to big data (Porter 1995; Humphreys 2004; Edwards 2010), increasingly pushing the boundaries of data analytics thanks to supervised learning, model fitting, deep neural networks, search and optimisation methods, complex data visualisations and various other tools now associated with artificial intelligence. Many of these tools are based on algorithms whose functioning and results are tested against specific data samples (a process called “training”). These algorithms are programmed to “learn” from each interaction with novel data: in other words, they have the capacity to change themselves in response to new information being inputted into the system, thus becoming more attuned to the phenomena they are analysing and improving their ability to predict future behaviour. The scope and extent of such changes is shaped by the assumptions used to build the algorithms and the capability of related software and hardware to identify, access and process information of relevance to the learning in question. There is however a degree of unpredictability and opacity to these systems, which can evolve to the point of defying human understanding (more on this below).
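This notion of an algorithm that changes itself in response to each new piece of data can be made concrete with a minimal sketch. The following toy (all names, values and the learning rule are illustrative assumptions, not any system discussed here) implements a one-weight predictor updated by stochastic gradient descent:

```python
# A minimal sketch of an algorithm that "learns" from each interaction with
# novel data: a one-weight linear model updated by stochastic gradient
# descent. All names and values here are illustrative assumptions.

class OnlineLearner:
    """Predicts y from x as w*x, nudging w after every new observation."""

    def __init__(self, learning_rate=0.005):
        self.w = 0.0            # the model's entire "knowledge"
        self.lr = learning_rate

    def predict(self, x):
        return self.w * x

    def update(self, x, y):
        # Each new (x, y) pair changes the model itself: the weight moves
        # in the direction that reduces the squared prediction error.
        error = self.predict(x) - y
        self.w -= self.lr * error * x

if __name__ == "__main__":
    model = OnlineLearner()
    # A stream of observations generated by the "true" relation y = 2x.
    for step in range(200):
        x = (step % 10) + 1
        model.update(x, 2 * x)
    print(round(model.w, 2))  # the weight converges towards 2.0
```

Each call to `update` alters the model's internal state, so its predictions depend on the entire history of data it has encountered, a property shared, at vastly greater scale and opacity, by the machine learning systems described above.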
New institutions, communication platforms and regulatory frameworks also emerged to assemble, prepare and maintain data for such uses (Kitchin 2014), such as various forms of digital data infrastructures, organisations aiming to coordinate and improve the global data landscape (e.g., the Research Data Alliance), and novel measures for data protection, like the General Data Protection Regulation adopted in 2016 by the European Union. Together, these techniques and institutions afford the opportunity to assemble and interpret data at a much broader scale, while also promising to deliver finer levels of granularity in data analysis.[1] They increase the scope of any investigation by making it possible for researchers to link their own findings to those of countless others across the world, both within and beyond the academic sphere. By enhancing the mobility of data, they facilitate their repurposing for a variety of goals that may have been unforeseeable when the data were originally generated. And by transforming the role of data within research, they heighten their status as valuable research outputs in and of themselves. These technological and methodological developments have significant implications for philosophical conceptualisations of data, inferential processes and scientific knowledge, as well as for how research is conducted, organised, governed and assessed. It is to these philosophical concerns that I now turn.
Big data are often associated with the idea of data-driven research, where learning happens through the accumulation of data and the application of methods to extract meaningful patterns from those data. Within data-driven inquiry, researchers are expected to use data as their starting point for inductive inference, without relying on theoretical preconceptions—a situation described by advocates as “the end of theory”, in contrast to theory-driven approaches where research consists of testing a hypothesis (Anderson 2008, Hey et al. 2009). In principle at least, big data constitute the largest pool of data ever assembled and thus a strong starting point to search for correlations (Mayer-Schönberger & Cukier 2013). Crucial to the credibility of the data-driven approach is the efficacy of the methods used to extrapolate patterns from data and evaluate whether or not such patterns are meaningful, and what “meaning” may involve in the first place. Hence, some philosophers and data scholars have argued that
the most important and distinctive characteristic of Big Data [is] its use of statistical methods and computational means of analysis, (Symons & Alvarado 2016: 4)
such as for instance machine learning tools, deep neural networks and other “intelligent” practices of data handling.
The emphasis on statistics as key adjudicator of validity and reliability of patterns extracted from data is not novel. Exponents of logical empiricism looked for logically watertight methods to secure and justify inference from data, and their efforts to develop a theory of probability proceeded in parallel with the entrenchment of statistical reasoning in the sciences in the first half of the twentieth century (Romeijn 2017). In the early 1960s, Patrick Suppes offered a seminal link between statistical methods and the philosophy of science through his work on the production and interpretation of data models. As a philosopher deeply embedded in experimental practice, Suppes was interested in the means and motivations of key statistical procedures for data analysis such as data reduction and curve fitting. He argued that once data are adequately prepared for statistical modelling, all the concerns and choices that motivated data processing become irrelevant to their analysis and interpretation. This inspired him to differentiate between models of theory, models of experiment and models of data, noting that such different components of inquiry are governed by different logics and cannot be compared in a straightforward way. For instance,
the precise definition of models of the data for any given experiment requires that there be a theory of the data in the sense of the experimental procedure, as well as in the ordinary sense of the empirical theory of the phenomena being studied. (Suppes 1962: 253)
Suppes viewed data models as necessarily statistical: that is, as objects
designed to incorporate all the information about the experiment which can be used in statistical tests of the adequacy of the theory. (Suppes 1962: 258)
His formal definition of data models reflects this decision, with statistical requirements such as homogeneity, stationarity and order identified as the ultimate criteria to identify a data model Z and evaluate its adequacy:
Z is an N-fold model of the data for experiment \(\mathcal{Y}\) if and only if there is a set Y and a probability measure P on subsets of Y such that \(\mathcal{Y} = \langle Y, P\rangle\) is a model of the theory of the experiment, Z is an N-tuple of elements of Y, and Z satisfies the statistical tests of homogeneity, stationarity and order. (1962: 259)
This analysis of data models portrayed statistical methods as key conduits between data and theory, and hence as crucial components of inferential reasoning.
The focus on statistics as entry point to discussions of inference from data was widely promoted in subsequent philosophical work. Prominent examples include Deborah Mayo, who in her book Error and the Growth of Experimental Knowledge asked:
What should be included in data models? The overriding constraint is the need for data models that permit the statistical assessment of fit (between prediction and actual data); (Mayo 1996: 136)
and Bas van Fraassen, who also embraced the idea of data models as “summarizing relative frequencies found in data” (van Fraassen 2008: 167). Closely related is the emphasis on statistics as means to detect error within datasets in relation to specific hypotheses, most prominently endorsed by the error-statistical approach to inference championed by Mayo and Aris Spanos (Mayo & Spanos 2009a). This approach aligns with the emphasis on computational methods for data analysis within big data research, and supports the idea that the better the inferential tools and methods, the better the chance to extract reliable knowledge from data.
When it comes to addressing methodological challenges arising from the computational analysis of big data, however, statistical expertise needs to be complemented by computational savvy in the training and application of algorithms associated with artificial intelligence, including machine learning but also other mathematical procedures for operating upon data (Bringsjord & Govindarajulu 2018). Consider for instance the problem of overfitting, i.e., the mistaken identification of patterns in a dataset, which can be greatly amplified by the training techniques employed by machine learning algorithms. There is no guarantee that an algorithm trained to successfully extrapolate patterns from a given dataset will be as successful when applied to other data. Common approaches to this problem involve the re-ordering and partitioning of both data and training methods, so that it is possible to compare the application of the same algorithms to different subsets of the data (“cross-validation”), combine predictions arising from differently trained algorithms (“ensembling”) or use hyperparameters (parameters whose value is set prior to data training) to prepare the data for analysis.
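The logic of overfitting and cross-validation can be illustrated with a deliberately simple sketch. The two “models” below are hypothetical extremes of my own construction, not techniques endorsed by the authors cited: one merely memorises its training data and so achieves perfect training error, yet held-out validation folds expose its failure to generalise:

```python
# A toy illustration of overfitting and k-fold cross-validation. The two
# "models" are deliberately extreme and hypothetical: one memorises its
# training data, the other always predicts the training mean.

def k_fold_splits(data, k):
    """Yield k (train, validation) partitions of the data."""
    folds = [data[i::k] for i in range(k)]
    for i in range(k):
        train = [pt for j, fold in enumerate(folds) if j != i for pt in fold]
        yield train, folds[i]

def mse(model, points):
    """Mean squared error of a model over a list of (x, y) points."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)

def fit_memoriser(train):
    """Overfitting extreme: exact recall for seen x, a default of 0 otherwise."""
    table = dict(train)
    return lambda x: table.get(x, 0.0)

def fit_mean(train):
    """Underfitting extreme: always predict the mean of the training outputs."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

if __name__ == "__main__":
    # Deterministic noisy observations of (roughly) y = x.
    data = [(x, x + ((x * 7) % 3 - 1) * 0.5) for x in range(20)]
    # The memoriser is flawless on data it has already seen...
    print("training error:", mse(fit_memoriser(data), data))
    # ...but cross-validation reveals that it generalises far worse
    # than even the crudest alternative.
    for name, fit in [("memoriser", fit_memoriser), ("mean", fit_mean)]:
        scores = [mse(fit(tr), va) for tr, va in k_fold_splits(data, 4)]
        print(name, "validation error:", round(sum(scores) / len(scores), 1))
```

The point of partitioning is precisely that success on the data used for training is no evidence of success on data the algorithm has never encountered.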
Handling these issues, in turn, requires
familiarity with the mathematical operations in question, their implementations in code, and the hardware architectures underlying such implementations. (Lowrie 2017: 3)
For instance, machine learning
aims to build programs that develop their own analytic or descriptive approaches to a body of data, rather than employing ready-made solutions such as rule-based deduction or the regressions of more traditional statistics. (Lowrie 2017: 4)
In other words, statistics and mathematics need to be complemented by expertise in programming and computer engineering. The ensemble of skills thus construed results in a specific epistemological approach to research, which is broadly characterised by an emphasis on the means of inquiry as the most significant driver of research goals and outputs. This approach, which Sabina Leonelli characterised as data-centric, involves “focusing more on the processes through which research is carried out than on its ultimate outcomes” (Leonelli 2016: 170). In this view, procedures, techniques, methods, software and hardware are the prime motors of inquiry and the chief influence on its outcomes. Focusing more specifically on computational systems, John Symons and Jack Horner argued that much of big data research consists of software-intensive science rather than data-driven research: that is, science that depends on software for its design, development, deployment and use, and thus encompasses procedures, types of reasoning and errors that are unique to software, such as for example the problems generated by attempts to map real-world quantities to discrete-state machines, or approximating numerical operations (Symons & Horner 2014: 473). Software-intensive science is arguably supported by an algorithmic rationality focused on the feasibility, practicality and efficiency of algorithms, which is typically assessed by reference to concrete situations of inquiry (Lowrie 2017).
Algorithms are enormously varied in their mathematical structures and underpinning conceptual commitments, and more philosophical work needs to be carried out on the specifics of computational tools and software used in data science and related applications—with emerging work in philosophy of computer science providing an excellent way forward (Turner & Angius 2019). Nevertheless, it is clear that whether or not a given algorithm successfully applies to the data at hand depends on factors that cannot be controlled through statistical or even computational methods: for instance, the size, structure and format of the data, the nature of the classifiers used to partition the data, the complexity of decision boundaries and the very goals of the investigation.
In a forceful critique informed by the philosophy of mathematics, Christian Calude and Giuseppe Longo argued that there is a fundamental problem with the assumption that more data will necessarily yield more information:
very large databases have to contain arbitrary correlations. These correlations appear only due to the size, not the nature, of data. (Calude & Longo 2017: 595)
They conclude that big data analysis is by definition unable to distinguish spurious from meaningful correlations and is therefore a threat to scientific research. A related worry, sometimes dubbed “the curse of dimensionality” by data scientists, concerns the extent to which the analysis of a given dataset can be scaled up in complexity and in the number of variables being considered. It is well known that the more dimensions one considers in classifying samples, for example, the larger the dataset required for generalisations across those dimensions to remain accurate. This demonstrates the continuing, tight dependence between the volume and quality of data on the one hand, and the type and breadth of research questions for which data need to serve as evidence on the other hand.
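Calude and Longo's point can be glimpsed in a small simulation. The toy below is an illustration of my own construction, not a rendering of their formal argument: every variable is generated independently at random, yet the strongest pairwise correlation found in the dataset grows as more variables are searched, a “pattern” owed entirely to the size of the search space:

```python
# A toy simulation of spurious correlation: all variables below are generated
# independently at random, yet the strongest pairwise correlation found in
# the dataset grows steadily as more variables are added to the search.

import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def max_abs_correlation(n_vars, n_obs, rng):
    """Strongest pairwise |r| among n_vars independent random variables."""
    cols = [[rng.random() for _ in range(n_obs)] for _ in range(n_vars)]
    return max(abs(pearson(cols[i], cols[j]))
               for i in range(n_vars) for j in range(i + 1, n_vars))

if __name__ == "__main__":
    for n_vars in (5, 50, 200):
        r = max_abs_correlation(n_vars, 30, random.Random(n_vars))
        # The "pattern" strengthens with the number of variables searched,
        # purely as an artefact of dataset size, not of any real dependence.
        print(n_vars, "variables -> strongest correlation", round(r, 2))
```

No statistical test applied after the search can tell, from the data alone, whether the strongest correlation so found reflects a real dependence or the mere size of the database.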
Determining the fit between inferential methods and data requires high levels of expertise and contextual judgement (a situation known within machine learning as the “no free lunch theorem”). Indeed, overreliance on software for inference and data modelling can yield highly problematic results. Symons and Horner note that the use of complex software in big data analysis makes margins of error unknowable, because there is no clear way to test them statistically (Symons & Horner 2014: 473). The path complexity of programs with high conditionality imposes limits on standard error correction techniques. As a consequence, there is no effective method for characterising the error distribution in the software except by testing all paths in the code, which is unrealistic and intractable in the vast majority of cases due to the complexity of the code.
Rather than acting as a substitute, the effective and responsible use of artificial intelligence tools in big data analysis requires the strategic exercise of human intelligence—but for this to happen, AI systems applied to big data need to be accessible to scrutiny and modification. Whether or not this is the case, and who is best qualified to exercise such scrutiny, is under dispute. Thomas Nickles argued that the increasingly complex and distributed algorithms used for data analysis follow in the footsteps of long-standing scientific attempts to transcend the limits of human cognition. The resulting epistemic systems may no longer be intelligible to humans: an “alien intelligence” within which “human abilities are no longer the ultimate criteria of epistemic success” (Nickles forthcoming). Such unbound cognition holds the promise of enabling powerful inferential reasoning from previously unimaginable volumes of data. The difficulties in contextualising and scrutinising such reasoning, however, cast doubt on the reliability of the results. It is not only machine learning algorithms that are becoming increasingly inaccessible to evaluation: beyond the complexities of programming code, computational data analysis requires a whole ecosystem of classifications, models, networks and inference tools which typically have different histories and purposes, and whose relation to each other—and effects when they are used together—are far from understood and may well be untraceable.
This raises the question of whether the knowledge produced by such data analytic systems is at all intelligible to humans, and if so, what forms of intelligibility it yields. It is certainly the case that deriving knowledge from big data may not involve an increase in human understanding, especially if understanding is understood as an epistemic skill (de Regt 2017). This may not be a problem to those who await the rise of a new species of intelligent machines, who may master new cognitive tools in a way that humans cannot. But as Nickles, Nicholas Rescher (1984), Werner Callebaut (2012) and others pointed out, even in that case “we would not have arrived at perspective-free science” (Nickles forthcoming). While the human histories and assumptions interwoven into these systems may be hard to disentangle, they still affect their outcomes; and whether or not these processes of inquiry are open to critical scrutiny, their telos, implications and significance for life on the planet arguably should be. As argued by Dan McQuillan (2018), the increasing automation of big data analytics may foster acceptance of a Neoplatonist machinic metaphysics, within which mathematical structures “uncovered” by AI would trump any appeal to human experience. Luciano Floridi echoes this intuition in his analysis of what he calls the infosphere:
The great opportunities offered by Information and Communication Technologies come with a huge intellectual responsibility to understand them and take advantage of them in the right way. (2014: vii)
These considerations parallel Paul Humphreys’s long-standing critique of computer simulations as epistemically opaque (Humphreys 2004, 2009)—and particularly his definition of what he calls essential epistemic opacity:
A process is essentially epistemically opaque to X if and only if it is impossible, given the nature of X, for X to know all of the epistemically relevant elements of the process. (Humphreys 2009: 618)
Different facets of the general problem of epistemic opacity are stressed within the vast philosophical scholarship on the role of modelling, computing and simulations in the sciences: the implications of lacking experimental access to the concrete parts of the world being modelled, for instance (Morgan 2005; Parker 2009; Radder 2009); the difficulties in testing the reliability of computational methods used within simulations (Winsberg 2010; Morrison 2015); the relation between opacity and justification (Durán & Formanek 2018); the forms of black-boxing associated with mechanistic reasoning implemented in computational analysis (Craver and Darden 2013; Bechtel 2016); and the debate over the intrinsic limits of computational approaches and related expertise (Collins 1990; Dreyfus 1992). Roman Frigg and Julian Reiss argued that such issues do not constitute fundamental challenges to the nature of inquiry and modelling, and in fact exist in a continuum with traditional methodological issues well-known within the sciences (Frigg & Reiss 2009). Whether or not one agrees with this position (Humphreys 2009; Beisbart 2012), big data analysis is clearly pushing computational and statistical methods to their limit, thus highlighting the boundaries to what even technologically augmented human beings are capable of knowing and understanding.
Research on big data analysis thus sheds light on elements of the research process that cannot be fully controlled, rationalised or even considered through recourse to formal tools.
One such element is the work required to present empirical data in a machine-readable format that is compatible with the software and analytic tools at hand. Data need to be selected, cleaned and prepared to be subjected to statistical and computational analysis. The processes involved in separating data from noise, clustering data so that it is tractable, and integrating data of different formats turn out to be highly sophisticated and theoretically structured, as demonstrated for instance by James McAllister’s (1997, 2007, 2011) and Uljana Feest’s (2011) work on data patterns, Marcel Boumans’s and Leonelli’s comparison of clustering principles across fields (forthcoming), and James Griesemer’s (forthcoming) and Mary Morgan’s (forthcoming) analyses of the peculiarities of datasets. Suppes was so concerned by what he called the “bewildering complexity” of data production and processing activities that he worried that philosophers would not appreciate the ways in which statistics can and does help scientists to abstract data away from such complexity. He described the large group of research components and activities used to prepare data for modelling as “pragmatic aspects” encompassing “every intuitive consideration of experimental design that involved no formal statistics” (Suppes 1962: 258), and positioned them at the lowest step of his hierarchy of models—the opposite end from its pinnacle, models of theory. Despite recent efforts to rehabilitate the methodology of inductive-statistical modelling and inference (Mayo & Spanos 2009b), this approach has been shared by many philosophers who regard processes of data production and processing as so chaotic as to defy systematic analysis. This explains why data have received so little consideration in philosophy of science when compared to models and theory.
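Even a trivial sketch conveys the kind of preparatory work at issue (the field names and cleaning rules below are hypothetical): records arriving in heterogeneous formats must be selected, normalised and filtered before any statistical analysis can run, and each step embodies a substantive judgement about what counts as signal and what as noise:

```python
# A trivial sketch of data "cleaning and preparation". The field names and
# filtering rules are hypothetical; the point is that every step embodies a
# judgement about what counts as usable data and what counts as noise.

def clean_record(raw):
    """Normalise one record from a heterogeneous source, or return None."""
    value = raw.get("value", raw.get("val"))     # integrate two naming formats
    if value is None:
        return None                              # select: drop incomplete rows
    if isinstance(value, str):
        value = value.replace(",", ".").strip()  # clean: "3,2" -> "3.2"
    try:
        value = float(value)
    except ValueError:
        return None                              # drop unparseable entries
    if not 0.0 <= value <= 100.0:
        return None                              # crude out-of-range noise filter
    return {"value": value}

def prepare(raw_records):
    """Return only the records that survive selection and cleaning."""
    return [r for r in map(clean_record, raw_records) if r is not None]

if __name__ == "__main__":
    raw = [{"value": "3,2"}, {"val": 5}, {"value": "n/a"}, {"value": 9999}]
    print(prepare(raw))  # only the first two records survive, normalised
```

None of these choices (which fields to merge, which range counts as plausible, which entries to discard) is dictated by the data themselves; they are exactly the “pragmatic aspects” that Suppes set aside and that the scholarship cited above seeks to recover.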
The question of how data are defined and identified, however, is crucial for understanding the role of big data in scientific research. Let us now consider two philosophical views—the representational view and the relational view—that are both compatible with the emergence of big data, and yet place emphasis on different aspects of that phenomenon, with significant implications for understanding the role of data within inferential reasoning and, as we shall see in the next section, as evidence. The representational view construes data as reliable representations of reality which are produced via the interaction between humans and the world. The interactions that generate data can take place in any social setting regardless of research purposes. Examples range from a biologist measuring the circumference of a cell in the lab and noting the result in an Excel file, to a teacher counting the number of students in her class and transcribing it in the class register. What counts as data in these interactions are the objects created in the process of description and/or measurement of the world. These objects can be digital (the Excel file) or physical (the class register) and form a footprint of a specific interaction with the natural world. This footprint—“trace” or “mark”, in the words of Ian Hacking (1992) and Hans-Jörg Rheinberger (2011), respectively—constitutes a crucial reference point for analytic study and for the extraction of new insights. This is the reason why data form a legitimate foundation for empirical knowledge: the production of data is equivalent to “capturing” features of the world that can be used for systematic study. According to the representational approach, data are objects with fixed and unchangeable content, whose meaning, in virtue of being representations of reality, needs to be investigated and revealed step-by-step through adequate inferential methods.
The data documenting cell shape can be modelled to test the relevance of shape to the elasticity, permeability and resilience of cells, producing an evidence base to understand cell-to-cell signalling and development. The data produced counting students in class can be aggregated with similar data collected in other schools, producing an evidence base to evaluate the density of students in the area and their school attendance frequency.
This reflects the intuition that data, especially when they come in the form of numerical measurements or images such as photographs, somehow mirror the phenomena that they are created to document, thus providing a snapshot of those phenomena that is amenable to study under the controlled conditions of research. It also reflects the idea of data as “raw” products of research, which are as close as it gets to unmediated knowledge of reality. This makes sense of the truth-value sometimes assigned to data as irrefutable sources of evidence—the Popperian idea that if data are found to support a given claim, then that claim is corroborated as true at least as long as no other data are found to disprove it. Data in this view represent an objective foundation for the acquisition of knowledge and this very objectivity—the ability to derive knowledge from human experience while transcending it—is what makes knowledge empirical. This position is well-aligned with the idea that big data is valuable to science because it facilitates the (broadly understood) inductive accumulation of knowledge: gathering data collected via reliable methods produces a mountain of facts ready to be analysed and, the more facts are produced and connected with each other, the more knowledge can be extracted.
Philosophers have long acknowledged that data do not speak for themselves and that different types of data require different tools for analysis and preparation to be interpreted (Bogen 2009 [2013]). According to the representational view, there are correct and incorrect ways of interpreting data, which those responsible for data analysis need to uncover. But what is a “correct” interpretation in the realm of big data, where data are consistently treated as mobile entities that can, at least in principle, be reused in countless different ways and towards different objectives? Perhaps more than at any other time in the history of science, the current mobilisation and re-use of big data highlights the degree to which data interpretation—and with it, whatever data are taken to represent—may differ depending on the conceptual, material and social conditions of inquiry. The analysis of how big data travels across contexts shows that the expectations and abilities of those involved determine not only the way data are interpreted, but also what is regarded as “data” in the first place (Leonelli & Tempini forthcoming). The representational view of data as objects with fixed and contextually independent meaning is at odds with these observations.
An alternative approach is to embrace these findings and abandon the idea of data as fixed representations of reality altogether. Within the relational view, data are objects that are treated as potential or actual evidence for scientific claims in ways that can, at least in principle, be scrutinised and accounted for (Leonelli 2016). The meaning assigned to data depends on their provenance, their physical features and what these features are taken to represent, and the motivations and instruments used to visualise them and to defend specific interpretations. The reliability of data thus depends on the credibility and strictness of the processes used to produce and analyse them. The presentation of data; the way they are identified, selected, and included (or excluded) in databases; and the information provided to users to re-contextualise them are fundamental to producing knowledge and significantly influence its content. For instance, changes in data format—as most obviously involved in digitisation, data compression or archival procedures—can have a significant impact on where, when, and by whom the data are used as a source of knowledge.
This framework acknowledges that any object can be used as a datum, or stop being used as such, depending on the circumstances—a consideration familiar to big data analysts, who are used to picking and mixing data coming from a vast variety of sources. The relational view also explains how, depending on the research perspective interpreting it, the same dataset may be used to represent different aspects of the world (“phenomena”, as famously characterised by James Bogen and James Woodward, 1988). When considering the full cycle of scientific inquiry from the viewpoint of data production and analysis, it is at the stage of data modelling that a specific representational value is attributed to data (Leonelli 2019b).
The relational view of data encourages attention to the history of data, highlighting their continual evolution and sometimes radical alteration, and the impact of this feature on the power of data to confirm or refute hypotheses. It explains the critical importance of documenting data management and transformation processes, especially with big data that transit far and wide over digital channels and are grouped and interpreted in different ways and formats. It also explains the increasing recognition of the expertise of those who produce, curate, and analyse data as indispensable to the effective interpretation of big data within and beyond the sciences; and the inextricable link between social and ethical concerns around the potential impact of data sharing and scientific concerns around the quality, validity, and security of data (boyd & Crawford 2012; Tempini & Leonelli 2018).
Depending on which view of data one takes, expectations around what big data can do for science will vary dramatically. The representational view accommodates the idea of big data as providing the most comprehensive, reliable and generative knowledge base ever witnessed in the history of science, by virtue of its sheer size and heterogeneity. The relational view makes no such commitment, focusing instead on what inferences are being drawn from such data at any given point, how, and why.
One thing that the representational and relational views agree on is the key epistemic role of data as empirical evidence for knowledge claims or interventions. While there is a large philosophical literature on the nature of evidence (e.g., Achinstein 2001; Reiss 2015; Kelly 2016), however, the relation between data and evidence has received less attention. This is arguably due to an implicit acceptance, by many philosophers, of the representational view of data. Within the representational view, the identification of what counts as data is prior to the study of what those data can be evidence for: in other words, data are “givens”, as the etymology of the word indicates, and inferential methods are responsible for determining whether and how the data available to investigators can be used as evidence, and for what. The focus of philosophical attention is thus on formal methods to single out errors and misleading interpretations, and on the probabilistic and/or explanatory relation between what is unproblematically taken to be a body of evidence and a given hypothesis. Hence much of the expansive philosophical work on evidence avoids the term “data” altogether. Peter Achinstein’s seminal work is a case in point: it discusses observed facts and experimental results, and whether and under which conditions scientists would have reasons to believe such facts, but it makes no mention of data and related processing practices (Achinstein 2001).
By contrast, within the relational view an object can only be identified as a datum when it is viewed as having value as evidence. Evidence becomes a category of data identification, rather than a category of data use as in the representational view (Canali 2019). Evidence is thus constitutive of the very notion of data and cannot be disentangled from it. This involves accepting that the conditions under which a given object can serve as evidence—and thus be viewed as a datum—may change; and that should this evidential role stop altogether, the object would revert back into an ordinary, non-datum item. For example, the photograph of a plant taken by a tourist in a remote region may become relevant as evidence for an inquiry into the morphology of plants from that particular locality; yet most photographs of plants are never considered as evidence for an inquiry into the features and functioning of the world, and of those that are, many may subsequently be discarded as uninteresting or no longer pertinent to the questions being asked.
This view accounts for the mobility and repurposing that characterise big data use, and for the possibility that objects that were not originally generated in order to serve as evidence may be subsequently adopted as such. Consider Mayo and Spanos’s “minimal scientific principle for evidence”, which they define as follows:
Data x0 provide poor evidence for H if they result from a method or procedure that has little or no ability of finding flaws in H, even if H is false. (Mayo & Spanos 2009b)
This principle is compatible with the relational view of data since it incorporates cases where the methods used to generate and process data may not have been geared towards the testing of a hypothesis H: all it asks is that such methods can be made relevant to the testing of H, at the point at which data are used as evidence for H (I shall come back to the role of hypotheses in the handling of evidence in the next section).
The relational view also highlights the relevance of practices of data formatting and manipulation to the treatment of data as evidence, thus taking attention away from the characteristics of the data objects alone and focusing instead on the agency attached to and enabled by those characteristics. Nora Boyd has provided a way to conceptualise data processing as an integral part of inferential processes, and thus of how we should understand evidence. To this aim she introduced the notion of “line of evidence”, which she defines as:
a sequence of empirical results including the records of data collection and all subsequent products of data processing generated on the way to some final empirical constraint. (Boyd 2018: 406)
She thus proposes a conception of evidence that embraces both data and the way in which data are handled, and indeed emphasises the importance of auxiliary information used when assessing data for interpretation, which includes
the metadata regarding the provenance of the data records and the processing workflow that transforms them. (2018: 407)
As she concludes,
together, a line of evidence and its associated metadata compose what I am calling an “enriched line of evidence”. The evidential corpus is then to be made up of many such enriched lines of evidence. (2018: 407)
The relational view thus fosters a functional and contextualist approach to evidence as the manner through which one or more objects are used as warrant for particular knowledge items (which can be propositional claims, but also actions such as specific decisions or modes of conduct/ways of operating). This chimes with the contextual view of evidence defended by Reiss (2015), John Norton’s work on the multiple, tangled lines of inferential reasoning underpinning appeals to induction (2003), and Hasok Chang’s emphasis on the epistemic activities required to ground evidential claims (2012). Building on these ideas and on Stephen Toulmin’s seminal work on research schemas (1958), Alison Wylie has gone one step further in evaluating the inferential scaffolding that researchers (and particularly archaeologists, who so often are called to re-evaluate the same data as evidence for new claims; Wylie 2017) need to make sense of their data, interpret them in ways that are robust to potential challenges, and modify interpretations in the face of new findings. This analysis enabled Wylie to formulate a set of conditions for robust evidential reasoning, which include epistemic security in the chain of evidence, causal anchoring and causal independence of the data used as evidence, as well as the explicit articulation of the grounds for calibration of the instruments and methods involved (Chapman & Wylie 2016; Wylie forthcoming). A similar conclusion is reached by Jessey Wright’s evaluation of the diverse data analysis techniques that neuroscientists use to make sense of functional magnetic resonance imaging of the brain (fMRI scans):
different data analysis techniques reveal different patterns in the data. Through the use of multiple data analysis techniques, researchers can produce results that are locally robust. (Wright 2017: 1179)
Wylie’s and Wright’s analyses exemplify how a relational approach to data fosters a normative understanding of “good evidence” which is anchored in situated judgement—the arguably human prerogative to contextualise and assess the significance of evidential claims. The advantages of this view of evidence are eloquently expressed by Nancy Cartwright’s critique of both philosophical theories and policy approaches that do not recognise the local and contextual nature of evidential reasoning. As she notes,
we need a concept that can give guidance about what is relevant to consider in deciding on the probability of the hypothesis, not one that requires that we already know significant facts about the probability of the hypothesis on various pieces of evidence. (Cartwright 2013: 6)
Thus she argues for a notion of evidence that is not too restrictive, takes account of the difficulties in combining and selecting evidence, and allows for contextual judgement on what types of evidence are best suited to the inquiry at hand (Cartwright 2013, 2019). Reiss’s proposal of a pragmatic theory of evidence similarly aims to
takes scientific practice […] seriously, both in terms of its greater use of knowledge about the conditions under which science is practised and in terms of its goal to develop insights that are relevant to practising scientists. (Reiss 2015: 361)
A better characterisation of the relation between data and evidence, predicated on the study of how data are processed and aggregated, may go a long way towards addressing these demands. As aptly argued by James Woodward, the evidential relationship between data and claims is not “a purely formal, logical, or a priori matter” (Woodward 2000: S172–173). This again sits uneasily with the expectation that big data analysis may automate scientific discovery and make human judgement redundant.
Let us now return to the idea of data-driven inquiry, often suggested as a counterpoint to hypothesis-driven science (e.g., Hey et al. 2009). Kevin Elliott and colleagues have offered a brief history of hypothesis-driven inquiry (Elliott et al. 2016), emphasising how scientific institutions (including funding programmes and publication venues) have pushed researchers towards a Popperian conceptualisation of inquiry as the formulation and testing of a strong hypothesis. Big data analysis clearly points to a different and arguably Baconian understanding of the role of hypothesis in science. Theoretical expectations are no longer seen as driving the process of inquiry, and empirical input is recognised as primary in determining the direction of research and the phenomena—and related hypotheses—considered by researchers.
The emphasis on data as a central component of research poses a significant challenge to one of the best-established philosophical views on scientific knowledge. According to this view, which I shall label the theory-centric view of science, scientific knowledge consists of justified true beliefs about the world. These beliefs are obtained through empirical methods aiming to test the validity and reliability of statements that describe or explain aspects of reality. Hence scientific knowledge is conceptualised as inherently propositional: what counts as an output are claims published in books and journals, which are also typically presented as solutions to hypothesis-driven inquiry. This view acknowledges the significance of methods, data, models, instruments and materials within scientific investigations, but ultimately regards them as means towards one end: the achievement of true claims about the world. Reichenbach’s seminal distinction between contexts of discovery and justification exemplifies this position (Reichenbach 1938). Theory-centrism recognises research components such as data and related practical skills as essential to discovery, and more specifically to the messy, irrational part of scientific work that involves value judgements, trial-and-error, intuition and exploration, and within which the very phenomena to be investigated may not have been stabilised. The justification of claims, by contrast, involves the rational reconstruction of the research that has been performed, so that it conforms to established norms of inferential reasoning. Importantly, within the context of justification, only data that support the claims of interest are explicitly reported and discussed: everything else—including the vast majority of data produced in the course of inquiry—is lost to the chaotic context of discovery.[2]
Much recent philosophy of science, and particularly work on modelling and experimentation, has challenged theory-centrism by highlighting the role of models, methods and modes of intervention as research outputs rather than simple tools, and by stressing the importance of expanding philosophical understandings of scientific knowledge to include these elements alongside propositional claims. The rise of big data offers another opportunity to reframe understandings of scientific knowledge as not necessarily centred on theories and to include non-propositional components—thus, in Cartwright’s paraphrase of Gilbert Ryle’s famous distinction, refocusing on knowing-how over knowing-that (Cartwright 2019). One way to construe data-centric methods is indeed to embrace a conception of knowledge as ability, such as promoted by early pragmatists like John Dewey and more recently reprised by Chang, who specifically highlighted it as the broader category within which the understanding of knowledge-as-information needs to be placed (Chang 2017).
Another way to interpret the rise of big data is as a vindication of inductivism in the face of the barrage of philosophical criticism levelled against theory-free reasoning over the centuries. For instance, Jon Williamson (2004: 88) has argued that advances in automation, combined with the emergence of big data, lend plausibility to inductivist philosophy of science. Wolfgang Pietsch agrees with this view and has provided a sophisticated framework to understand just what kind of inductive reasoning is instigated by big data and related machine learning methods such as decision trees (Pietsch 2015). Following John Stuart Mill, he calls this approach variational induction and presents it as common to both big data approaches and exploratory experimentation, though the former can handle a much larger number of variables (Pietsch 2015: 913). Pietsch concludes that the problem of theory-ladenness in machine learning can be addressed by determining under which theoretical assumptions variational induction works (2015: 910ff).
Others are less inclined to see theory-ladenness as a problem that can be mitigated by data-intensive methods, and rather see it as a constitutive part of the process of empirical inquiry. Harking back to the extensive literature on perspectivism and experimentation (Gooding 1990; Giere 2006; Radder 2006; Massimi 2012), Werner Callebaut has forcefully argued that the most sophisticated and standardised measurements embody a specific theoretical perspective, and this is no less true of big data (Callebaut 2012). Elliott and colleagues emphasise that conceptualising big data analysis as atheoretical risks encouraging unsophisticated attitudes to empirical investigation as a
“fishing expedition”, having a high probability of leading to nonsense results or spurious correlations, being reliant on scientists who do not have adequate expertise in data analysis, and yielding data biased by the mode of collection. (Elliott et al. 2016: 880)
To address related worries in genetic analysis, Ken Waters has provided the useful characterisation of “theory-informed” inquiry (Waters 2007), which can be invoked to stress how theory informs the methods used to extract meaningful patterns from big data, and yet does not necessarily determine either the starting point or the outcomes of data-intensive science. This does not resolve the question of what role theory actually plays. Rob Kitchin (2014) has proposed to see big data as linked to a new mode of hypothesis generation within a hypothetico-deductive framework. Leonelli is more sceptical of attempts to match big data approaches, which are many and diverse, with a specific type of inferential logic. She has focused instead on the extent to which the theoretical apparatus at work within big data analysis rests on conceptual decisions about how to order and classify data—and proposed that such decisions can give rise to a particular form of theorization, which she calls classificatory theory (Leonelli 2016).
These disagreements point to big data as eliciting diverse understandings of the nature of knowledge and inquiry, and of the complex iterations through which different inferential methods build on each other. Again, in the words of Elliott and colleagues,
attempting to draw a sharp distinction between hypothesis-driven and data-intensive science is misleading; these modes of research are not in fact orthogonal and often intertwine in actual scientific practice. (Elliott et al. 2016: 881; see also O’Malley et al. 2009; Elliott 2012)
Another epistemological debate strongly linked to reflection on big data concerns the specific kinds of knowledge emerging from data-centric forms of inquiry, and particularly the relation between predictive and causal knowledge.
Big data science is widely seen as revolutionary in the scale and power of the predictions that it can support. Unsurprisingly perhaps, a philosophically sophisticated defence of this position comes from the philosophy of mathematics, where Marco Panza, Domenico Napoletani and Daniele Struppa argued for big data science as occasioning a momentous shift in the predictive knowledge that mathematical analysis can yield, and thus in its role within broader processes of knowledge production. The whole point of big data analysis, they posit, is its disregard for causal knowledge:
answers are found through a process of automatic fitting of the data to models that do not carry any structural understanding beyond the actual solution of the problem itself. (Napoletani, Panza, & Struppa 2014: 486)
This view differs from simplistic popular discourse on “the death of theory” (Anderson 2008) and the “power of correlations” (Mayer-Schönberger and Cukier 2013) insofar as it does not side-step the constraints associated with the knowledge and generalisations that can be extracted from big data analysis. Napoletani, Panza and Struppa recognise that there are inescapable tensions around the ability of mathematical reasoning to overdetermine empirical input, to the point of providing a justification for any and every possible interpretation of the data. In their words,
the problem arises of how we can gain meaningful understanding of historical phenomena, given the tremendous potential variability of their developmental processes. (Napoletani et al. 2014: 487)
Their solution is to clarify that understanding phenomena is not the goal of predictive reasoning, which is rather a form of agnostic science: “the possibility of forecasting and analysing without a structured and general understanding” (Napoletani et al. 2011: 12). The opacity of algorithmic rationality thus becomes its key virtue and the reason for the extraordinary epistemic success of forecasting grounded on big data. While “the phenomenon may forever remain hidden to our understanding” (ibid.: 5), the application of mathematical models and algorithms to big data can still provide meaningful and reliable answers to well-specified problems—similarly to what has been argued in the case of false models (Wimsatt 2007). Examples include the use of “forcing” methods such as regularisation or diffusion geometry to facilitate the extraction of useful insights from messy datasets.
This view is at odds with accounts that posit scientific understanding as a key aim of science (de Regt 2017), and with the intuition that what researchers are ultimately interested in is
whether the opaque data-model generated by machine-learning technologies count as explanations for the relationships found between input and output. (Boon 2020: 44)
Within the philosophy of biology, for example, it is well recognised that big data facilitates effective extraction of patterns and trends, and that being able to model and predict how an organism or ecosystem may behave in the future is of great importance, particularly within more applied fields such as biomedicine or conservation science. At the same time, researchers are interested in understanding the reasons for observed correlations, and typically use predictive patterns as heuristics to explore, develop and verify causal claims about the structure and functioning of entities and processes. Emanuele Ratti (2015) has argued that big data mining within the genome-wide association studies often used in cancer genomics can actually underpin mechanistic reasoning, for instance by supporting eliminative inference to develop mechanistic hypotheses and by helping to explore and evaluate generalisations used to analyse the data. In a similar vein, Pietsch (2016) proposed to use variational induction as a method to establish what counts as causal relationships among big data patterns, by focusing on which analytic strategies allow for reliable prediction and effective manipulation of a phenomenon.
Through the study of data sourcing and processing in epidemiology, Stefano Canali has instead highlighted the difficulties of deriving mechanistic claims from big data analysis, particularly where data are varied and embody incompatible perspectives and methodological approaches (Canali 2016, 2019). Relatedly, the semantic and logistical challenges of organising big data give reason to doubt the reliability of causal claims extracted from such data. In terms of logistics, having a lot of data is not the same as having all of them, and cultivating illusions of comprehensiveness is a risky and potentially misleading strategy, particularly given the challenges encountered in developing and applying curatorial standards for data other than the high-throughput results of “omics” approaches (see also the next section). The constant worry about the partiality and reliability of data is reflected in the care put by database curators into enabling database users to assess such properties, and in the importance given by researchers themselves, particularly in the biological and environmental sciences, to evaluating the quality of data found on the internet (Leonelli 2014; Fleming et al. 2017). In terms of semantics, we are back to the role of data classifications as theoretical scaffolding for big data analysis that we discussed in the previous section. Taxonomic efforts to order and visualise data inform causal reasoning extracted from such data (Sterner & Franz 2017), and can themselves constitute a bottom-up method—grounded in comparative reasoning—for assigning meaning to data models, particularly in situations where a full-blown theory or explanation for the phenomenon under investigation is not available (Sterner 2014).
It is no coincidence that much philosophical work on the relation between causal and predictive knowledge extracted from big data comes from the philosophy of the life sciences, where the absence of axiomatized theories has elicited sophisticated views on the diversity of forms and functions of theory within inferential reasoning. Moreover, biological data are heterogeneous both in their content and in their format; they are curated and re-purposed to address the needs of highly disparate and fragmented epistemic communities; and they present curators with specific challenges to do with tracking complex, diverse and evolving organismal structures and behaviours, whose relation to an ever-changing environment is hard to pinpoint with any stability (e.g., Shavit & Griesemer 2009). Hence in this domain, some of the core methods and epistemic concerns of experimental research—including exploratory experimentation, sampling and the search for causal mechanisms—remain crucial parts of data-centric inquiry.
At the start of this entry I listed “value” as a major characteristic of big data and pointed to the crucial role of valuing procedures in identifying, processing, modelling and interpreting data as evidence. Identifying and negotiating different forms of data value is an unavoidable part of big data analysis, since these valuation practices determine which data is made available to whom, under which conditions and for which purposes. What researchers choose to consider as reliable data (and data sources) is closely intertwined not only with their research goals and interpretive methods, but also with their approach to data production, packaging, storage and sharing. Thus, researchers need to consider what value their data may have for future research by themselves and others, and how to enhance that value—such as through decisions around which data to make public, how, when and in which format; or, whenever dealing with data already in the public domain (such as personal data on social media), decisions around whether the data should be shared and used at all, and how.
No matter how one conceptualises value practices, it is clear that their key role in data management and analysis prevents facile distinctions between values and “facts” (understood as propositional claims for which data provide evidential warrant). For example, consider a researcher who values both openness—and related practices of widespread data sharing—and scientific rigour—which requires a strict monitoring of the credibility and validity of conditions under which data are interpreted. The scale and manner of big data mobilisation and analysis create tensions between these two values. While the commitment to openness may prompt interest in data sharing, the commitment to rigour may hamper it, since once data are freely circulated online it becomes very difficult to retain control over how they are interpreted, by whom and with which knowledge, skills and tools. How a researcher responds to this conflict affects which data are made available for big data analysis, and under which conditions. Similarly, the extent to which diverse datasets may be triangulated and compared depends on the intellectual property regimes under which the data—and related analytic tools—have been produced. Privately owned data are often unavailable to publicly funded researchers; and many algorithms, cloud systems and computing facilities used in big data analytics are only accessible to those with enough resources to buy relevant access and training. Whatever claims result from big data analysis are, therefore, strongly dependent on the social, financial and cultural constraints that condition the data pool and its analysis.
This prominent role of values in shaping data-related epistemic practices is not surprising given existing philosophical critiques of the fact/value distinction (e.g., Douglas 2009), and the existing literature on values in science—such as Helen Longino’s seminal distinction between constitutive and contextual values, as presented in her 1990 book Science as Social Knowledge—may well apply in this case too. Similarly, it is well-established that the technological and social conditions of research strongly condition its design and outcomes. What is particularly worrying in the case of big data is the temptation, prompted by hyped expectations around the power of data analytics, to hide or side-line the valuing choices that underpin the methods, infrastructures and algorithms used for big data extraction.
Consider the use of high-throughput data production tools, which enable researchers to easily generate a large volume of data in formats already geared to computational analysis. Just as in the case of other technologies, researchers have a strong incentive to adopt such tools for data generation; and may do so even in cases where such tools are not good or even appropriate means to pursue the investigation. Ulrich Krohs uses the term convenience experimentation to refer to experimental designs that are adopted not because they are the most appropriate ways of pursuing a given investigation, but because they are easily and widely available and usable, and thus “convenient” means for researchers to pursue their goals (Krohs 2012).
Appeals to convenience can extend to other aspects of data-intensive analysis. Not all data are equally easy to digitally collect, disseminate and link through existing algorithms, which makes some data types and formats more convenient than others for computational analysis. For example, research databases often display the outputs of well-resourced labs within research traditions which deal with “tractable” data formats (such as “omics”). And indeed, the existing distribution of resources, infrastructure and skills determines high levels of inequality in the production, dissemination and use of big data for research. Big players with large financial and technical resources are leading the development and uptake of data analytics tools, leaving much publicly funded research around the world at the receiving end of innovation in this area. Contrary to popular depictions of the data revolution as a harbinger of transparency, democracy and social equality, the digital divide between those who can access and use data technologies, and those who cannot, continues to widen. A result of such divides is the scarcity of data relating to certain subgroups and geographical locations, which again limits the comprehensiveness of available data resources.
In the vast ecosystem of big data infrastructures, it is difficult to keep track of such distortions and assess their significance for data interpretation, especially in situations where heterogeneous data sources structured through appeal to different values are mashed together. Thus, the systematic aggregation of convenient datasets and analytic tools over others often results in a big data pool where the relevant sources and forms of bias are impossible to locate and account for (Pasquale 2015; O’Neil 2016; Zuboff 2017; Leonelli 2019a). In such a landscape, arguments for a separation between fact and value—and even a clear distinction between the role of epistemic and non-epistemic values in knowledge production—become very difficult to maintain without discrediting the whole edifice of big data science. Given the extent to which this approach has penetrated research in all domains, however, it is arguably impossible to critique the value-laden structure of big data science without calling into question the legitimacy of science itself. A more constructive approach is to embrace the extent to which big data science is anchored in human choices, interests and values, and ascertain how this affects philosophical views on knowledge, truth and method.
In closing, it is important to consider at least some of the risks and related ethical questions raised by research with big data. As already mentioned in the previous section, reliance on big data collected by powerful institutions or corporations raises significant social concerns. Contrary to the view that sees big and open data as harbingers of democratic social participation in research, the way that scientific research is governed and financed is not challenged by big data. Rather, the increasing commodification of, and large value attributed to, certain kinds of data (e.g., personal data) is associated with an increase in inequality of power and visibility between different nations, segments of the population and scientific communities (O’Neil 2016; Zuboff 2017; D’Ignazio and Klein 2020). The gap between those who can not only access but also use data, and those who cannot, is widening, leading from a state of digital divide to a condition of “data divide” (Bezuidenhout et al. 2017).
Moreover, the privatisation of data has serious implications for the world of research and the knowledge it produces. Firstly, it affects which data are disseminated, and with which expectations. Corporations usually release only data that they regard as having lesser commercial value and that they need public sector assistance to interpret. This introduces a further distortion in the sources and types of data that are accessible online, while more expensive and complex data are kept secret. Indeed, many of the ways in which citizens (researchers included) are encouraged to interact with databases and data interpretation sites tend to foster participation that generates further commercial value. Sociologists have recently described this type of social participation as a form of exploitation (Prainsack & Buyx 2017; Srnicek 2017). In turn, these ways of exploiting data strengthen their economic value over their scientific value. In the trade of personal data between analytics companies, the value of data as commercial products (which includes an evaluation of how quickly and efficiently access to certain data can help develop new products) often takes priority over scientific concerns such as the representativeness and reliability of the data and of the ways they were analysed. This can result in decisions that are scientifically problematic, or in a lack of interest in investigating the consequences of the assumptions made and the processes used. Such lack of interest easily translates into ignorance of discrimination, inequality and potential errors in the data under consideration. This type of ignorance is highly strategic and economically productive, since it enables the use of data without concern for their social and scientific implications. In this scenario, the evaluation of data quality shrinks to an evaluation of their usefulness for the short-term analyses or forecasts required by the client.
There are no incentives in this system to encourage evaluation of the long-term implications of data analysis. The risk is that the trade in data is accompanied by an increasing divergence between data and their context. Interest in the history of data as they travel across sites, in the plurality of their emotional or scientific value, and in the re-evaluation of their origins tends to disappear over time, to be replaced by the increasing hold of the financial value of data.
The multiplicity of data sources and tools for aggregation also creates risks. The complexity of the data landscape is making it harder to identify which parts of the infrastructure require updating or have been put in doubt by new scientific developments. The situation worsens when considering the number of databases that populate every area of scientific research, each containing assumptions that influence the circulation and interoperability of data and that often are not updated in a reliable and regular way. Just to provide an idea of the numbers involved, the prestigious scientific publication Nucleic Acids Research publishes a special issue every year on new databases relevant to molecular biology, which included 56 new infrastructures in 2015, 62 in 2016, 54 in 2017 and 82 in 2018. These are just a small proportion of the hundreds of databases that are developed each year in the life sciences alone. Because these databases rely on short-term funding, a growing percentage of such resources remains available to consult online even though they are long dead. This condition is not always visible to users, who may trust databases without checking whether they are actively maintained. At what point do these infrastructures become obsolete? What are the risks involved in weaving an ever more extensive tapestry of infrastructures that depend on each other, given the disparity in the ways they are managed and the challenges in identifying and comparing their prerequisite conditions, the theories and the scaffolding used to build them? One such risk is rampant conservatism: the insistence on recycling old data, whose features and management become increasingly murky as time goes by, instead of encouraging the production of new data with features that specifically respond to the requirements and circumstances of their users.
In disciplines such as biology and medicine, which study living beings that are by definition continually evolving and developing, such trust in old data is particularly alarming. It cannot be assumed, for example, that data collected on fungi ten, twenty or even a hundred years ago remain reliable guides to the behaviour of the same species of fungi now or in the future (Leonelli 2018).
Researchers of what Luciano Floridi calls the infosphere (the way in which the introduction of digital technologies is changing the world) are becoming aware of the destructive potential of big data and of the urgent need to focus efforts on managing and using data in active and thoughtful ways towards the improvement of the human condition. In Floridi’s own words:
ICT yields great opportunity which, however, entails the enormous intellectual responsibility of understanding this technology to use it in the most appropriate way. (Floridi 2014: vii; see also British Academy & Royal Society 2017)
In light of these findings, it is essential that ethical and social issues are treated as a core part of the technical and scientific requirements associated with data management and analysis. The ethical management of data is not achieved solely by regulating the commerce of research data and the management of personal data, nor by introducing monitoring of research financing, even though these are important strategies. To guarantee that big data are used in the most scientifically and socially forward-thinking way, it is necessary to transcend the conception of ethics as something external and alien to research. An analysis of the ethical implications of data science should become a basic component of the background and activity of those who take care of data and of the methods used to visualise and analyse them. Ethical evaluations and choices are hidden in every aspect of data management, including choices that may seem purely technical.
This entry stressed how the emerging emphasis on big data signals the rise of a data-centric approach to research, in which efforts to mobilise, integrate, disseminate and visualise data are viewed as central contributions to discovery. The emergence of data-centrism highlights the challenges involved in gathering, classifying and interpreting data, and the concepts, technologies and institutions that surround these processes. Tools such as high-throughput measurement instruments and apps for smartphones are fast generating large volumes of data in digital formats. In principle, these data are immediately available for dissemination through internet platforms, which can make them accessible to anybody with a broadband connection in a matter of seconds. In practice, however, access to data is fraught with conceptual, technical, legal and ethical implications; and even when access can be granted, it does not guarantee that the data can be fruitfully used to spur further research. Furthermore, the mathematical and computational tools developed to analyse big data are often opaque in their functioning and assumptions, leading to results whose scientific meaning and credibility may be difficult to assess. This increases the worry that big data science may be grounded upon, and ultimately supporting, the process of making human ingenuity hostage to an alien, artificial and ultimately unintelligible intelligence.
Perhaps the most confronting aspect of big data science as discussed in this entry is the extent to which it deviates from understandings of rationality grounded on individual agency and cognitive abilities (on which much of contemporary philosophy of science is predicated). The power of any one dataset to yield knowledge lies in the extent to which it can be linked with others: this is what lends high epistemic value to digital objects such as GPS locations or sequencing data, and what makes extensive data aggregation from a variety of sources into a highly effective surveillance tool. Data production and dissemination channels such as social media, governmental databases and research repositories operate in a globalised, interlinked and distributed network, whose functioning requires a wide variety of skills and expertise. The distributed nature of decision-making involved in developing big data infrastructures and analytics makes it impossible for any one individual to retain oversight over the quality, scientific significance and potential social impact of the knowledge being produced.
Big data analysis may therefore constitute the ultimate instance of a distributed cognitive system. Where does this leave accountability questions? Many individuals, groups and institutions end up sharing responsibility for the conceptual interpretation and social outcomes of specific data uses. A key challenge for big data governance is to find mechanisms for allocating responsibilities across this complex network, so that erroneous and unwarranted decisions—as well as outright fraudulent, unethical, abusive, discriminatory or misguided actions—can be singled out, corrected and appropriately sanctioned. Thinking about the complex history, processing and use of data can encourage philosophers to avoid ahistorical, uncontextualized approaches to questions of evidence, and instead consider the methods, skills, technologies and practices involved in handling data—and particularly big data—as crucial to understanding empirical knowledge-making.
artificial intelligence | Bacon, Francis | biology: experiment in | computer science, philosophy of | empiricism: logical | evidence | human genome project | models in science | Popper, Karl | science: theory and observation in | scientific explanation | scientific method | scientific theories: structure of | statistics, philosophy of
The research underpinning this entry was funded by the European Research Council (grant award 335925) and the Alan Turing Institute (EPSRC Grant EP/N510129/1).
The Stanford Encyclopedia of Philosophy is copyright © 2023 by The Metaphysics Research Lab, Department of Philosophy, Stanford University
Library of Congress Catalog Data: ISSN 1095-5054