Data 2025, 10(3), 28; https://doi.org/10.3390/data10030028
Article

A Directory of Datasets for Mining Software Repositories

Themistoklis Diamantopoulos and Andreas L. Symeonidis
Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, 541 24 Thessaloniki, Greece
*
Author to whom correspondence should be addressed.
Submission received: 29 November 2024 / Revised: 27 January 2025 / Accepted: 18 February 2025 / Published: 20 February 2025
(This article belongs to the Section Information Systems and Data Management)

Abstract

The amount of software engineering data is constantly growing, as more and more developers employ online services to store their code, keep track of bugs, or even discuss issues. The data residing in these services can be mined to address different research challenges; therefore, certain initiatives have been established to encourage the sharing of research datasets by collecting them. In this work, we investigate the effect of such an initiative; we create a directory that includes the papers and the corresponding datasets of the data track of the Mining Software Repositories (MSR) conference. Specifically, our directory includes metadata and citation information for the papers of all data tracks throughout the last twelve years. We also annotate the datasets according to their data source and further assess their compliance with the FAIR principles. Using our directory, researchers can find useful datasets for their research, or even design methodologies for assessing dataset quality, especially in the software engineering domain. Moreover, the directory can be used for analyzing the citations of data papers, especially with regard to different data categories, as well as for examining their FAIRness scores throughout the years, along with the effect of FAIRness on the usage/citation of the datasets.

    1. Introduction

    Lately, the introduction of the open source initiative and the evolution of online services have led to a collaborative way of developing software. As a result, today there is an abundance of software engineering data, spanning different sources and formats, including, e.g., source code stored in online hosting facilities like GitHub1 or GitLab2, task descriptions/bug reports stored in issue tracking systems like Jira3 or Bugzilla4, mailing list discussions, stack traces, Stack Overflow posts, etc. Apart from the well-known benefits of this collaborative paradigm for software teams, this deluge of data provides further opportunities for researchers in the software engineering domain.
    Indeed, the importance of available data for producing high-quality research has been underlined by several researchers [1,2]. For this reason, several initiatives have been developed in the last two decades to encourage the public sharing of research datasets, either accompanying scientific publications (usually also including the relevant source code to allow reproducibility) or even as standalone artifacts. These include conference data tracks, such as the data showcase track of the International Conference on Mining Software Repositories (MSR)5, data-oriented journals, such as ScienceDirect Data in Brief6 or MDPI Data7, or even dataset repositories, such as PROMISE [3] and Zenodo [4]. Regardless of the domain of these initiatives (e.g., the MSR data track or the PROMISE repository are specific to software engineering, while the others are general-purpose), their goal is the same: to collect datasets (including data descriptors) that can be useful for other researchers.
    To assess the effect of these data sharing/descriptor initiatives, it would be interesting to investigate whether the collected datasets are available today and whether they are indeed used by researchers. Current efforts in this direction have mainly focused on bibliometric analysis of publications and datasets. Several of these approaches analyze publications and measure their reputation using the frequency of their citations [5,6,7], while others focus more specifically on research datasets, analyzing their usage by the community [2,8] or even their quality [9,10]. These approaches aspire to practically assess the overall potential for reuse of the publications and/or the datasets. The reuse potential of a dataset, however, is actually an attribute that can be measured not only based on its citations, but also based on certain properties of the dataset itself.
    For a dataset to be considered (re)usable, researchers must be able to find it easily, understand its format and variables, and integrate and use it in their methodologies. All of these actions, however, may be hindered by significant challenges; e.g., a research dataset may have no metadata, like readme files or variable descriptions, making it very hard to reuse, or it may be stored in a personal/institutional website, and thus, it may easily disappear after a website update. Lately, the research community has been driven towards organizing all these requirements as a set of FAIR principles [11]. Ever since the introduction of the FAIR principles in 2016, there has been an effort to make research data Findable, Accessible, Interoperable, and Reusable, and certain tools have also been developed to assess the FAIRness of research datasets [12].
    Against this background, in this paper, we craft a directory of datasets in the domain of mining software repositories, extracted from the data track of the MSR conference. Our goal is twofold: (a) to create an index of datasets that can be useful for researchers in the software engineering and, more generally, the data engineering/data quality domains, and (b) to assess the datasets of the MSR data track in terms of their use by other researchers (citations) and their FAIRness. To achieve these objectives, our directory covers all data papers from all years of the data track, including their relevant metadata and citation information, thus allowing researchers to easily find popular datasets. Moreover, all datasets are accompanied by annotations relevant to their specific research data sources as well as by a FAIRness assessment report, to illustrate their compliance with the FAIR principles.
    The rest of this paper is organized as follows. Section 2 describes our methodology for creating the directory of datasets. Section 3 provides a comprehensive view of the directory, including the datasets and the relevant assessments. In Section 4, we discuss how the directory can be useful for answering interesting research questions, and, finally, Section 5 concludes this work, summarizing the key takeaways.

    2. Materials and Methods

    Our methodology for building the directory of MSR datasets is shown in Figure 1. First of all, we extracted the digital object identifiers (DOIs) of all papers. Using these identifiers, we were able to find information about each paper (paper metadata, including author names, publishers, citation information, etc.). After that, we also retrieved the paper documents (in PDF format). Using all of this information, we annotated each paper according to its research area and further extracted the URL of the corresponding dataset. To enrich our annotations with semantics, we also performed topic modeling on the abstracts, and, finally, we ran a FAIR assessment.
    The GitHub repository at https://github.com/AuthEceSoftEng/directory-msr-datasets (accessed on 19 February 2025) includes the results, stored in JSON format, as well as all scripts and instructions to reproduce them, according to the steps of Figure 1. Specifically, the repository includes scripts for metadata downloading, paper document downloading, topic modeling, and FAIR assessment. All other steps (DOI extraction, paper annotation, dataset URL extraction) were performed manually. The steps of Figure 1 are analyzed in the following paragraphs.

    2.1. Paper Identification and Metadata Retrieval

    The first step of our methodology is to retrieve all MSR data papers and the information of their respective datasets. Each dataset in our directory corresponds to a paper that describes it. We first located the proceedings of the MSR conference in dblp8. After that, we examined the proceedings along with the relevant websites to identify specifically the papers of the data track. The data track of MSR started in 2013 as the Data Showcase Track and retained the same format for nine consecutive conferences, until 2021. From 2022 onwards, the track was renamed to the Data and Tool Showcase Track, thus also accepting tool demonstration papers. As our analysis focuses purely on data papers, we examined all papers from 2022, 2023, and 2024 in order to keep only the ones referring to datasets. Any papers that included both datasets and tools as contributions were also included in our analysis.
    Thus, for each paper, we first manually extracted its DOI. DOIs are global, unique, and persistent identifiers [13], so they correctly point to the resources behind them (in this case, the HTML pages of the papers) even after several years have passed. Given the DOI of each paper, we downloaded metadata and citations using the Semantic Scholar API9; to do so, we wrote a script that employed the Python requests library10. The retrieved information for each paper includes the title, author(s), publication year, and other fields, while all references and citations were also retrieved.
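    As an illustration of this step, the following sketch queries the Semantic Scholar Graph API for a paper given its DOI using the requests library. The selected fields and the example DOI are only placeholders; the actual script in the repository may request a different set of fields.

```python
import requests

S2_PAPER_URL = "https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}"
# Fields chosen for illustration; the repository script may request others.
FIELDS = "title,abstract,year,authors,citationCount,citations.title,references.title"

def fetch_paper_metadata(doi: str) -> dict:
    """Download metadata and citation information for a paper identified by its DOI."""
    response = requests.get(S2_PAPER_URL.format(doi=doi), params={"fields": FIELDS}, timeout=30)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    # Hypothetical DOI placeholder; replace it with the DOI of an actual MSR data paper.
    metadata = fetch_paper_metadata("10.1145/0000000.0000000")
    print(metadata.get("title"), metadata.get("year"))
```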

    2.2. Annotation and Dataset URL Extraction

    Upon retrieving the main information about each paper, including its metadata and citations, the next steps were to annotate each paper according to its topic and to extract the dataset URL from its text. Both steps, however, also required the full text of the papers. Using the DOI of each paper, we were able to find the corresponding online link, which was typically found either in IEEE Xplore11 or in the ACM Digital Library12. This way, we were able to automatically download all papers using the Beautiful Soup library13. Having downloaded all papers, we proceeded to annotate them along two different axes. The annotation was performed manually. First, we labeled each paper according to its software mining area. We considered six different areas, roughly corresponding to the clustering of software mining areas performed in [2] (practically, our first five categories are analogous to corresponding clusters proposed in [2], while the remaining clusters fall under our other/miscellaneous category). Specifically, the areas are the following:
    • Version Control: Data extracted from version control systems/online hosting facilities like GitHub, including raw code as well as derived data (e.g., abstract syntax trees, dependency graphs). Examples include the Public Git Archive [14], which includes the source code of top-bookmarked GitHub repositories, or DocMine [15], a dataset of documentation strings extracted from GitHub repositories.
    • Software Issues: Data extracted from issue tracking systems, such as Jira or Bugzilla, or generally bug data, typically including bug reports or even annotated bugs in source code, etc. Examples include the ManySStuBs4J dataset [16], which includes single-statement bug fixes in Java source code, or even the Jira issue tracking dataset [17], which includes the issues of the Jira installation of the Apache Software Foundation.
    • Developer Metrics: Data relevant to software developers as well as the relations among them, typically including interactions in mailing lists, chats, question-answering services, etc. Examples include the OCEAN mailing list dataset [18], which extracts the mail interactions for different open-source communities, or the Apache people dataset [19], which includes the roles of all contributors of the Apache Software Foundation.
    • Software Evolution: Data relevant to the evolution of source code, which may include commits, analyses of code revisions in open-source software, or even edits in question-answering posts. Examples include the repository of 44 years of Unix evolution [20] or the documented Unix facilities over 48 years [21], both focusing on the evolution of software artifacts.
    • Semantic Metrics: Semantically enriched data, usually by extracting semantic features from the data of the other categories. Semantic features typically include topics extracted using a topic modeling approach like LDA [22] or even vectors extracted from a word embedding approach like word2vec [23] or fastText [24] or even transformer models (e.g., BERT [25]). Examples include the dataset of semantically similar Java methods [26] or the word embeddings extracted from Stack Overflow [27].
    • Other Data: All datasets that do not fall into one of the previous categories. Several different types of data may be included, such as dockerfiles [28], Jupyter notebooks [29], even license texts [30].
    Finally, for each paper, we extracted the URL of its dataset from the text. We further tested whether the URL is still functional and even searched online for the dataset, noting down the current links to the datasets in cases where the datasets’ online addresses have changed.
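    A minimal sketch of the URL functionality check follows; it assumes that a dataset link counts as functional if it resolves to a reachable page, and it falls back to a GET request for hosts that reject HEAD requests. The example link is hypothetical.

```python
import requests

def dataset_url_is_reachable(url: str, timeout: int = 15) -> bool:
    """Return True if the dataset URL still resolves to a reachable page."""
    try:
        # Some hosts reject HEAD requests, so fall back to a (streamed) GET on error codes.
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        if response.status_code >= 400:
            response = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
        return response.status_code < 400
    except requests.RequestException:
        return False

# Example usage with a hypothetical dataset link.
print(dataset_url_is_reachable("https://zenodo.org/record/0000000"))
```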

    2.3. Topic Modeling from Abstracts

    As already mentioned, to supplement our analysis, we further apply topic modeling to the abstracts of the MSR data papers. To do so, we initially tokenize the abstracts and remove any punctuation, numbers, and special characters. After that, we employ LDA [22] in order to extract a set of topics that the abstracts are categorized into. LDA generates two probability distributions, one modeling the occurrence of terms in topics and one modeling the involvement of topics in abstracts. LDA, and specifically the Gensim [31] implementation that we used, estimates both distributions using sparse Dirichlet priors, learning all of their parameters except for the number of topics.
    Concerning the number of topics, we performed a coherence analysis to determine the optimal number. Specifically, we vary the number of topics from 5 to 20, and for each configuration, we calculate the coherence metric CV, which is usually highly correlated with human judgment [32]. Our analysis (shown in Figure 2) indicates that the maximum coherence is achieved when the number of topics is 14, meaning that the topics in this case are well defined. As a result, LDA produces the 14 topics shown in Table 1. For each topic, we also highlight in bold typeface certain key terms that better describe it (and that are usually underrepresented in other topics).
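    The following sketch illustrates this coherence-driven selection using Gensim; the tokenization, the sample abstracts, and the LDA hyperparameters (passes, random seed) are illustrative assumptions rather than the exact configuration of our script.

```python
import re
from gensim import corpora
from gensim.models import CoherenceModel, LdaModel

# Placeholder abstracts; in practice, these are read from the paper metadata of the directory.
abstracts = [
    "We present a dataset of GitHub repositories and their issue reports",
    "A collection of word embeddings trained on Stack Overflow posts for software engineering",
    "We share annotated bug fixes extracted from Java open source projects",
]

# Simple tokenization: lowercase words only, dropping punctuation, numbers, and special characters.
texts = [re.findall(r"[a-z]+", abstract.lower()) for abstract in abstracts]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train one LDA model per candidate number of topics and keep the most coherent configuration.
coherence_per_k = {}
for k in range(5, 21):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k, random_state=42, passes=10)
    cm = CoherenceModel(model=lda, texts=texts, dictionary=dictionary, coherence="c_v")
    coherence_per_k[k] = cm.get_coherence()

best_k = max(coherence_per_k, key=coherence_per_k.get)
print("Most coherent number of topics:", best_k)
```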
    From a practical point of view, these topics are similar, yet complementary to the categories extracted in Section 2.2. The topics, and especially the topic terms, can be used for even more detailed paper (and dataset) filtering, since the extracted terms are standardized in the MSR domain. For instance, instead of the generic Software Issues category, one can use the technical topic terms bug, defect, and vulnerability, which mean different things14, and each of them is represented more emphatically by different topics in Table 1. Moreover, in contrast to the categories of Section 2.2, the topics are extracted automatically (and their number is optimized based on coherence), therefore ensuring that the final terms effectively describe certain aspects of each dataset without depending on manual annotations.
    Indeed, the extracted topics seem to effectively cover multiple categories of datasets. Of course, several categories may be similar, which is more or less expected given that there are multiple adjacent research areas in the software mining domain. On a similar note, there are also several terms that are present in multiple topics, which indicates that many of the different datasets may have similar features or the same data source. For instance, almost all topics are relevant to source code (the terms ‘source’ and ‘code’ appear together in the top terms of 11 out of 14 topics, while ‘source’ appears in all topics on its own). Nevertheless, there are also key terms that clearly differentiate topics from one another; these terms often refer to specific areas and/or technologies (e.g., the term ‘security’ or the term ‘android’) and usually appear in the top terms of only a few topics (e.g., the term ‘evolution’, which appears in only two topics).
    The key terms that are highlighted for each topic in Table 1 (using bold typeface) indeed help us make this differentiation. For instance, topics 2, 4, 7, and 8 all seem relevant to issues and bugs (i.e., to the Software Issues category defined in Section 2.2). However, topic 2 further involves testing, as ‘test’ is the second top term for this topic. Topic 8, on the other hand, is more focused on models (term ‘model’), while topics 4 and 7 involve activities and reports (term ‘activity’ appears in both topics, while term ‘report’ appears in topic 7), therefore possibly focusing on bug report/issue tracking data. In the same context, topics 5 and 11 are both relevant to security issues. Topic 5 is mainly focused on security vulnerabilities, while topic 11 is focused on defects that may also lead to security issues. Several topics are relevant to the source code itself, either by focusing on specific frameworks or by targeting specific challenges. For instance, based on the key terms highlighted in Table 1, topic 0 focuses on libraries and tools for Android. Topic 12 involves the challenge of code clones, with the term ‘clone’ appearing only in the top terms of this topic. Similarly, topic 1 focuses on API mining (as it is the only one with ‘api’ as a top term), while topics 10 and 13 focus on the evolution of projects. Finally, topic 3 is relevant to code reviews, while topics 6 and 9 seem to be relevant to developer metrics (including the ‘developer’ key term, along with terms like ‘analysis’ or ‘information’).

    2.4. Dataset FAIRness Evaluation

    The final step of our methodology is the FAIR assessment of the datasets accompanying the MSR data track papers. As already mentioned, the FAIR principles are actually guidelines for making research datasets easier to utilize, or, more specifically, Findable, Accessible, Interoperable, and Reusable [11]. Table 2 depicts the FAIR principles.
    These principles effectively cover the relevant attributes from a conceptual point of view. For instance, given the attribute of reusability, the four principles focus on ensuring that users (either machines or humans) will be able to easily decide whether the data are useful for them and, of course, will be able to easily use them. R1 signifies that the data (and metadata) must be described as richly as possible (i.e., with multiple labels, with information about the data creation sources, etc.). R1.1 states that the rights of the data (license) must be clearly described to ensure that data (re)use is not limited by ambiguous licensing restrictions. R1.2 focuses on the origin of the data; data reuse is facilitated when it is clear where the data originally came from and which parties need to be acknowledged for each transformation the data may have undergone. Finally, R1.3 focuses on using standardized (or well-established) formats for storing the data (e.g., using a specific format for spatial data archiving) and on employing best practices when these are established in a domain (e.g., using a common template or vocabulary)15.
    As one may observe, the principles of Table 2 are themselves abstract, meaning that they do not explicitly define the metrics required to achieve FAIRness. However, there are certain efforts towards defining metrics and developing tools to assess data compliance with the FAIR principles [12,33,34,35]. For our analysis, we chose F-UJI [35], a tool that evaluates FAIRness metrics defined by the FAIRsFAIR project [36]16. Specifically, F-UJI receives as input a link that points to a dataset descriptor and provides as output a FAIR assessment of the dataset. The assessment is performed in a hierarchical manner; each principle defined in Table 2 is evaluated using FAIRness metrics, while each metric is computed using practical tests. We used version 0.5 of the metrics [36], which are shown in Table 3, along with the corresponding FAIR principles.
    An example depicting part of the score hierarchy for the F1 principle is shown in Figure 3. Practically, the F1 principle is evaluated using the FAIRness metrics FsF-F1-01D and FsF-F1-02D. The values for each of these metrics are computed using practical tests. As shown in Figure 3, metric FsF-F1-01D (requirement to have a globally unique identifier) is assessed using two practical tests: one to ensure that the identifier is unique and one to ensure that the identifier points to an accessible web address. All tests have three states (pass, fail, and partial, which receives half a point). For instance, if the identifier is a Uniform Resource Locator (URL), then it is unique, and if the URL points to an existing web page, then it is also reachable. If both tests pass, then the score of the metric is 2/2; otherwise, its value may be lower (e.g., it could be 1/2 if the identifier is a valid URL that points to a non-existent page). Note, however, that these tests do not check whether the identifier is persistent (e.g., like a DOI, which is checked for metric FsF-F1-02D) or whether it points to a location that has useful information/metadata. Similar practical tests are defined for all metrics of Table 3 and are then aggregated at the principle and attribute levels; for instance, the Findable attribute includes 5 metrics and 7 practical tests in total, so its value is measured as the number of tests passed out of these 7 tests17.
    Using the interface of the F-UJI tool18, which we accessed using Selenium19, we ran the FAIRness analysis for all the datasets in our directory automatically. The result for each dataset was extracted in JSON format and includes a hierarchical score describing the FAIRness of the dataset; e.g., the scores of the practical tests of FsF-F1-01D and FsF-F1-02D are aggregated to produce a score for F1, the scores of F2, F3, and F4 are produced similarly, etc. After that, all practical tests of findability metrics are combined to produce a score for the Findable attribute, and so forth, until all scores are finally combined to produce a final score for dataset FAIRness.

    3. Results

    This section presents the results of our methodology, including the directory of datasets, along with an analysis of paper usage (citations) and FAIRness.

    3.1. Directory Schema and Data

    Our directory is provided in JSON format and specifically comprises three JSON objects (files) per paper. These objects are shown in Figure 4, along with their connections.
    The paper metadata are shown at the left of Figure 4, including fields for the title, the abstract, the year, etc. Moreover, there are fields for the citations and the references of the papers. Note that the authors in the papers themselves as well as in the citations are all identified using unique identifiers (URLs); therefore, it is possible to keep track of self-citations. Our annotations for identifying the dataset category and extracting the dataset URL for each paper are shown at the top of Figure 4, along with the most relevant topic of the paper as well as its top terms. Finally, the FAIR assessment of each dataset is shown at the bottom of Figure 4. The FAIR assessment includes all the fields for the hierarchical computation of the FAIRness score. Specifically, each element of the list of results includes a metric (like those described in Table 3), represented by the field metric_name. In line with the hierarchy of the F-UJI tool (for which an example is depicted in Figure 3), each metric involves practical tests, represented by the field metric_tests. As a result, the final score of each metric (field score) can be computed as the number of earned test points divided by the number of total points, both values kept in the corresponding fields of the FAIR assessment JSON object.
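    As a minimal sketch of this computation, the following snippet aggregates the earned and total test points of a FAIR assessment JSON object into per-metric and overall scores. The field names follow the schema of Figure 4 (results, metric_name, score, earned, total); the exact nesting in the repository files may differ slightly, and the file path is hypothetical.

```python
import json

def fairness_summary(assessment_path: str) -> dict:
    """Aggregate earned vs. total practical-test points per metric and overall."""
    with open(assessment_path) as f:
        assessment = json.load(f)

    per_metric = {}
    earned_sum = total_sum = 0
    for result in assessment["results"]:
        earned, total = result["score"]["earned"], result["score"]["total"]
        per_metric[result["metric_name"]] = earned / total if total else 0.0
        earned_sum += earned
        total_sum += total

    return {"per_metric": per_metric, "overall": earned_sum / total_sum if total_sum else 0.0}

# Example usage with a hypothetical assessment file from the directory.
# print(fairness_summary("fair_assessments/example_paper.json"))
```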
    A visualization of a FAIR assessment for a dataset, as performed by the F-UJI tool, is shown at the right of Figure 4. This specific screenshot is relevant to the dataset of [38]. The maturity level for each of the four attributes (Findable, Accessible, Interoperable, Reusable) depends on their scores (it takes the values incomplete, initial, moderate, or advanced), which are computed based on the metrics of Table 3. Upon having performed all practical tests for the metrics of Table 3 (via the hierarchy illustrated in Figure 3), the F-UJI tool provides the scores for all four FAIR attributes. As we can see, in this case, the dataset is rather easily findable and accessible, as the maturity of these attributes is advanced. The imperfect score in findability is because of the metric FsF-F2-01M (see Table 3), where some metadata accompany the dataset (e.g., authors, summary); however, other metadata are missing (e.g., no keywords). Thus, out of the seven practical tests for metrics FsF-F1-01D (two tests in total), FsF-F1-02D (two tests in total), FsF-F2-01M (one test in total), FsF-F3-01M (one test in total), and FsF-F4-01M (one test in total), six are passed for this dataset; therefore, the final score for the Findable attribute is 6/7. Concerning interoperability, the score is moderate, since the dataset is in a structured linked format; however, the semantic resources used in the metadata do not correspond to a known ontology (metric FsF-I2-01M). Finally, concerning reusability, the dataset again receives a moderate score, as it has some metadata specifying data uses; however, it does not include details on data variables (FsF-R1-01MD) and does not include adequate information on data creation/generation (FsF-R1.2-01M).
    As already mentioned, our directory is provided as a set of JSON files available in the GitHub repository https://github.com/AuthEceSoftEng/directory-msr-datasets (accessed on 19 February 2025). Specifically, there are three folders with JSON files, one including the paper metadata, one including the paper annotations, and one including the FAIR assessments of Figure 4. To further facilitate researchers who prefer to view the datasets of the directory without downloading all of it, we have also created a simple web interface using Jekyll20, which is available at https://authecesofteng.github.io/directory-msr-datasets/ (accessed on 19 February 2025) and allows filtering the datasets based on their metadata. An example screenshot of the web interface is shown in Figure 5.
    The text field on top allows filtering according to the titles, the categories, and the rest of the metadata of the publications (datasets). In this example, the term ‘jira’ is used to return datasets relevant to the popular Jira issue tracking system (in descending order of year). Upon having retrieved results based on the filters, the user can view certain information about each dataset, including the publication title, authors, abstract, and DOI (linking to the paper on the publisher’s site). Moreover, for each dataset, we provide the categories and topic terms, the link to the dataset URL, and, finally, its FAIR score.

    3.2. Data Papers Usage and FAIRness

    In this subsection, we provide an analysis of the statistics of our directory, focusing both on the usage of the datasets (citations) and on their FAIRness. Our analysis is performed using pandas21 and Matplotlib22 (the script producing all tables and figures of this subsection is also available in the repository at https://github.com/AuthEceSoftEng/directory-msr-datasets (accessed on 19 February 2025)).
    As already mentioned, we extracted 216 papers that appeared in the MSR Data (and Tool) Showcase tracks over the years. Out of these, 20 were tool papers; therefore, they were excluded from our analysis. For the remaining 196 papers, we applied our methodology to find which of the papers had datasets available. As already mentioned, a dataset was deemed available if it was possible for us to find it and download it. Figure 6 depicts the number of data papers per year as well as the number of corresponding datasets that are still available as of the time of writing. In total, 161 datasets were available, while 35 were not found. As expected, datasets were harder to find for the early years of the data track. Another interesting note is that the data track overall seems like a successful conference track; the number of data papers has been increasing in recent years.
    Concerning citations, as expected, they follow an exponential decay pattern, shown in Figure 7. More than one-fourth of the publications have fewer than five citations, possibly indicating that certain datasets are still not widely known or are very specific (so they do not appeal to a broad research audience). Of course, there are also papers that involve extremely useful data, which have been used again and again by multiple researchers, such as AndroZoo (a collection of millions of Android apps) [39] or GHTorrent (a collection of GitHub’s events and data) [40], which have been cited more than 550 times.
    Our analysis proceeds with the categories of the datasets. Given the annotations for the category of each dataset (described in Section 2.2), we were able to compute certain interesting statistics. First of all, we computed the total number of datasets falling into each category and further computed the corresponding percentages. Moreover, we calculated the average number of citations for each category by dividing the total number of citations by the number of papers in the category. The results of our analysis are shown in Table 4. There are several datasets for all five categories, while there are also datasets not falling into any of them. The Software Issues category is the most popular one, with 58 papers. Note, however, that datasets from version control and datasets including developer metrics are cited more often.
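    A minimal pandas sketch of this computation is shown below; the column names and example rows are illustrative assumptions, since the actual script reads the categories and citation counts from the JSON files of the directory.

```python
import pandas as pd

# Illustrative rows; in practice, one row per data paper with its annotated category and citation count.
papers = pd.DataFrame({
    "category": ["Version Control", "Software Issues", "Version Control", "Other Data"],
    "citations": [120, 4, 35, 17],
})

stats = (
    papers.groupby("category")
    .agg(total_datasets=("citations", "size"), avg_citations=("citations", "mean"))
    .assign(percentage=lambda df: 100 * df["total_datasets"] / df["total_datasets"].sum())
)
print(stats.round({"avg_citations": 0, "percentage": 2}))
```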
    Figure 8 further depicts the distributions of citations for each category using boxplots. For each boxplot, we observe the median value as a light blue line inside the corresponding box (the box spans from the first to the third quartile). Any outliers are also depicted as values outside the whiskers (clipped to a maximum of 100 citations for visualization purposes). The conclusions from this graph are in line with those discussed when analyzing Table 4. Outliers aside (like the papers for AndroZoo [39] and GHTorrent [40] that were mentioned above), it seems that version control data and data related to developer metrics are used more than the other types of datasets.
    The statistics relevant to topics are shown in Table 5. As one may see, all topics have at least five datasets associated with them. Certain topics, such as topics 0 and 6, have many datasets, which possibly indicates that they correspond to frequent data types (indeed, topic 0 may correspond to source code datasets and topic 6 to datasets from version control systems, like GitHub). Other topics have fewer datasets, like topic 2, which may involve datasets relevant to developer communities as well as bugs. Concerning citations, the distribution of citations among most topics seems rather balanced. We may note two interesting outliers: topic 2, which has only two citations on average, possibly due to its small number of datasets (and their specific scope), and topic 5, which has 155 citations on average; this is expected, as it includes GHTorrent [40], an outlier with more than 550 citations on its own.
    Finally, concerning FAIRness, Figure 9 depicts the average FAIR score for the datasets of each year of the MSR data track. An interesting conclusion is that dataset FAIRness seems to increase over the years. This is expected, given that the FAIR principles were introduced in 2016 [11] and that large hosting services like Zenodo and Figshare, which make efforts to comply with these principles, have lately become popular for sharing datasets.
    Figure 10 further depicts the number of citations versus the FAIRness score for all papers in our directory, split per year. Each bubble in this bubble chart represents the metrics of a specific dataset paper. The bubble’s horizontal position denotes the year the paper was published, while its vertical position denotes the FAIRness score of the dataset. The size (diameter) of the bubble indicates the number of citations of the paper, with larger bubbles corresponding to more citations (see legend at the top right of Figure 10). Finally, the bubbles have transparency, so that different bubbles remain visible even if their values overlap; i.e., concentric bubbles with darker color shading indicate that two (or more) dataset papers have the same FAIRness score in the corresponding year. As before, one may see that the FAIRness scores are indeed getting higher in recent years. Interestingly, it seems that newer papers, i.e., from 2021 onwards, are cited more when the corresponding datasets have high FAIR compliance. This is, however, not the case for older papers. This possibly indicates that compliance with the FAIR principles is important before a dataset (or paper) becomes popular. When a dataset is still relatively unknown, it is very important to make it easy for researchers to find and use it. By contrast, if a dataset is already used by multiple papers, other researchers may use it for comparison/completeness purposes, even if it is not easy to use. Indeed, as one may notice in Figure 10, almost all citations from the last two years refer to datasets with FAIR scores above 50.
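    A bubble chart of this kind can be produced with a Matplotlib scatter plot in which the marker area encodes the citation count and transparency keeps overlapping bubbles visible; the values below are placeholders, as the real script reads the years, FAIR scores, and citation counts from the directory.

```python
import matplotlib.pyplot as plt

# Placeholder values for illustration only.
years = [2016, 2019, 2021, 2023]
fair_scores = [35, 55, 70, 80]
citations = [12, 60, 25, 8]

plt.figure(figsize=(7, 4))
# Marker area (s) scales with citations; alpha lets overlapping bubbles remain visible.
plt.scatter(years, fair_scores, s=[10 * c for c in citations], alpha=0.4, color="tab:blue")
plt.xlabel("Publication year")
plt.ylabel("FAIR score")
plt.title("FAIR score vs. number of citations (bubble size) per year")
plt.tight_layout()
plt.show()
```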

    4. Discussion

    Our directory can be used to address several challenges in current research. First of all, it can help researchers in the software engineering domain find useful datasets for their research. By filtering over software mining areas, one can easily retrieve the relevant papers, while it is also possible to check the abstract or the keywords of each paper, as all metadata are included in our directory. Further using the citation count, one can check the popularity of the datasets before deciding which dataset to use.
    Apart from its obvious use of finding a dataset to perform research on, a directory of datasets is also very useful for designing and assessing data engineering methodologies. For instance, one can use our directory to evaluate data quality assessment systems [10,41] or even to develop tools for automated data curation [42]. When designing these types of methodologies, it is often important to have large collections of datasets from the same domain in order to craft data quality rules for the different data variables [43,44] (e.g., one can create an assessment tool specifically for software effort datasets, as described in [45]).
    The directory of datasets is also quite interesting from a bibliometric point of view. The initial analysis provided in the previous section can be further extended in multiple directions. For instance, it would be interesting to perform a more comprehensive citation analysis, by investigating when citations most commonly appear for each paper (e.g., in the first year after its publication, the second year, or even later on, an idea also explored in [2]). Alternatively, more metadata variables could be extracted, either by extracting the semantics of the papers (e.g., to extract topics or brief descriptions [46,47]) or by integrating other data sources, such as the number of downloads for datasets residing in Zenodo. This way, we could differentiate between citations that actually use the data and those that merely cite them.
    All of these analyses could also be performed with respect to dataset compliance with the FAIR principles. In this aspect, the FAIR analysis itself could be extended to answer various research questions. For instance, one could identify whether certain dataset hosting services are (on average) more FAIR than others, by exploring the FAIRness scores of the relevant datasets (an idea also explored in [48], focusing, however, on datasets from the agriculture domain). Finally, it would also be interesting to further analyze the FAIR attributes and identify correlations between individual attributes and dataset usage (e.g., the influence that interoperability has on the number of citations).

    5. Conclusions

    In this work, we have crafted a directory of datasets extracted from the data track of the leading conference in the domain of mining software repositories. Our directory includes useful metadata and annotations, allowing researchers to easily find and reuse datasets in this domain. By further incorporating citation information, we were able to perform a bibliometric analysis of these papers, indicating the most popular categories of datasets. Finally, the datasets are accompanied by FAIRness scores, indicating the extent to which they can be (re)used by the research community.
    Future work lies in several directions. Concerning the metadata and the metrics of the directory, it would be useful to add more variables, either by analyzing the papers themselves (i.e., processing their text) or by extracting information about the datasets, including their hosting service (e.g., Zenodo, Figshare, etc.), their size in MBs, or even their format (e.g., whether a database is used or not). It would also be interesting to determine whether these variables influence the citations of the papers. More annotations could also be added, including, e.g., the data source(s) for each dataset (such as GitHub, Jira, Stack Overflow, etc.) or even its license, to further enhance paper (and dataset) filtering. Finally, concerning compliance with the FAIR principles, it would be interesting to further investigate the effect of specific attributes with respect to dataset variables (e.g., determine whether Findability is correlated with citations or whether datasets with clear licensing information are preferred by researchers, in line with principle R1.1).

    Author Contributions

    Conceptualization, T.D. and A.L.S.; methodology, T.D. and A.L.S.; software, T.D. and A.L.S.; validation, T.D. and A.L.S.; formal analysis, T.D. and A.L.S.; investigation, T.D. and A.L.S.; resources, T.D. and A.L.S.; data curation, T.D. and A.L.S.; writing—original draft preparation, T.D. and A.L.S.; writing—review and editing, T.D. and A.L.S.; visualization, T.D. and A.L.S.; supervision, T.D. and A.L.S.; project administration, T.D. and A.L.S.; funding acquisition, T.D. and A.L.S. All authors have read and agreed to the published version of the manuscript.

    Funding

    This research received no external funding.

    Data Availability Statement

    The data presented in this study are openly available on GitHub athttps://github.com/AuthEceSoftEng/directory-msr-datasets (accessed on 19 February 2025).

    Acknowledgments

    Parts of this work have been supported by the Horizon Europe project ECO-READY (Grant Agreement No 101084201), funded by the European Union.

    Conflicts of Interest

    The authors declare no conflicts of interest.

    Notes

    1
    https://github.com (accessed on 19 February 2025)
    2
    https://gitlab.com (accessed on 19 February 2025)
    3
    https://www.atlassian.com/software/jira (accessed on 19 February 2025)
    4
    https://www.bugzilla.org (accessed on 19 February 2025)
    5
    https://www.msrconf.org (accessed on 19 February 2025)
    6
    7
    https://www.mdpi.com/journal/data (accessed on 19 February 2025)
    8
    https://dblp.org/db/conf/msr/index.html (accessed on 19 February 2025)
    9
    https://api.semanticscholar.org (accessed on 19 February 2025)
    10
    https://requests.readthedocs.io/en/latest/ (accessed on 19 February 2025)
    11
    https://ieeexplore.ieee.org (accessed on 19 February 2025)
    12
    https://dl.acm.org (accessed on 19 February 2025)
    13
    14
    Bugs are errors that result in erroneous behavior, defects involve issues with the functionality or design, and vulnerabilities are weaknesses in the security of the software.
    15
    See also https://www.go-fair.org/fair-principles/ (accessed on 19 February 2025) for a more detailed description of the FAIR principles.
    16
    https://www.fairsfair.eu/ (accessed on 19 February 2025)
    17
    See https://www.f-uji.net/index.php?action=methods (accessed on 19 February 2025) for more details about the implementation/rationale of the practical tests of all metrics and principles.
    18
    https://www.f-uji.net/ (accessed on 19 February 2025)
    19
    https://selenium-python.readthedocs.io/ (accessed on 19 February 2025)
    20
    https://jekyllrb.com/ (accessed on 19 February 2025)
    21
    https://pandas.pydata.org/ (accessed on 19 February 2025)
    22
    https://matplotlib.org/ (accessed on 19 February 2025)

    References

    1. Cukic, B. Guest Editor’s Introduction: The Promise of Public Software Engineering Data Repositories. IEEE Softw. 2005, 22, 20–22. [Google Scholar] [CrossRef]
    2. Kotti, Z.; Kravvaritis, K.; Dritsa, K.; Spinellis, D. Standing on shoulders or feet? An extended study on the usage of the MSR data papers. Empir. Softw. Eng. 2020, 25, 3288–3322. [Google Scholar] [CrossRef]
    3. Sayyad Shirabad, J.; Menzies, T. The PROMISE Repository of Software Engineering Databases; School of Information Technology and Engineering, University of Ottawa: Ottawa, ON, Canada, 2005. [Google Scholar]
    4. European Organization For Nuclear Research and OpenAIRE. Zenodo. 2013. Available online:https://doi.org/10.25495/7GXK-RD71 (accessed on 19 February 2025).
    5. Gu, Y. Global knowledge management research: A bibliometric analysis. Scientometrics 2004, 61, 171–190. [Google Scholar] [CrossRef]
    6. Robles, G. Replicating MSR: A study of the potential replicability of papers published in the Mining Software Repositories proceedings. In Proceedings of the 2010 7th IEEE Working Conference on Mining Software Repositories (MSR 2010), Cape Town, South Africa, 2–3 May 2010; pp. 171–180. [Google Scholar] [CrossRef]
    7. de Freitas, F.G.; de Souza, J.T. Ten years of search based software engineering: A bibliometric analysis. In Proceedings of the Third International Conference on Search Based Software Engineering, Szeged, Hungary, 10–12 September 2011; pp. 18–32. [Google Scholar] [CrossRef]
    8. Kotti, Z.; Spinellis, D. Standing on shoulders or feet? The usage of the MSR data papers. In Proceedings of the 16th International Conference on Mining Software Repositories, Montreal, ON, Canada, 26–27 May 2019; pp. 565–576. [Google Scholar] [CrossRef]
    9. Zogaan, W.; Sharma, P.; Mirahkorli, M.; Arnaoudova, V. Datasets from Fifteen Years of Automated Requirements Traceability Research: Current State, Characteristics, and Quality. In Proceedings of the 2017 IEEE 25th International Requirements Engineering Conference (RE), Lisbon, Portugal, 4–8 September 2017; pp. 110–121. [Google Scholar] [CrossRef]
    10. Liebchen, G.A.; Shepperd, M. Data sets and data quality in software engineering. In Proceedings of the 4th International Workshop on Predictor Models in Software Engineering, New York, NY, USA, 12–13 May 2008; pp. 39–44. [Google Scholar] [CrossRef]
    11. Wilkinson, M.D.; Dumontier, M.; Aalbersberg, I.J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.W.; da Silva Santos, L.B.; Bourne, P.E.; et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018. [Google Scholar] [CrossRef]
    12. Sun, C.; Emonet, V.; Dumontier, M. A comprehensive comparison of automated FAIRness Evaluation Tools. In Proceedings of the Semantic Web Applications and Tools for Health Care and Life Sciences, Rheinisch-Westfaelische Technische Hochschule Aachen * Lehrstuhl Informatik V, Basel, Switzerland, 13–16 February 2022; Volume 3127, pp. 44–53. [Google Scholar]
    13. International DOI Foundation. The DOI® Handbook; International DOI Foundation: Oxford, UK, 2023. [Google Scholar] [CrossRef]
    14. Markovtsev, V.; Long, W. Public git archive: A big code dataset for all. In Proceedings of the 15th International Conference on Mining Software Repositories, Gothenburg, Sweden, 28–29 May 2018; pp. 34–37. [Google Scholar] [CrossRef]
    15. Manasa Venigalla, A.S.; Chimalakonda, S. DocMine: A Software Documentation-Related Dataset of 950 GitHub Repositories. In Proceedings of the 2023 IEEE/ACM 20th International Conference on Mining Software Repositories, Los Alamitos, CA, USA, 15–16 May 2023; pp. 407–411. [Google Scholar] [CrossRef]
    16. Karampatsis, R.M.; Sutton, C. How Often Do Single-Statement Bugs Occur? The ManySStuBs4J Dataset. In Proceedings of the 17th International Conference on Mining Software Repositories, Seoul, Republic of Korea, 25–26 May 2020; pp. 573–577. [Google Scholar] [CrossRef]
    17. Diamantopoulos, T.; Nastos, D.N.; Symeonidis, A. Semantically-enriched Jira Issue Tracking Data. In Proceedings of the 2023 IEEE/ACM 20th International Conference on Mining Software Repositories (MSR), Melbourne, Australia, 15–16 May 2023; pp. 218–222. [Google Scholar] [CrossRef]
    18. Warrick, M.; Rosenblatt, S.F.; Young, J.G.; Casari, A.; Hébert-Dufresne, L.; Bagrow, J. The OCEAN mailing list data set: Network analysis spanning mailing lists and code repositories. In Proceedings of the 19th International Conference on Mining Software Repositories, Pittsburgh, PA, USA, 23–24 May 2022; pp. 338–342. [Google Scholar] [CrossRef]
    19. Squire, M. Project roles in the apache software foundation: A dataset. In Proceedings of the 10th Working Conference on Mining Software Repositories, San Francisco, CA, USA, 18–19 May 2013; pp. 301–304. [Google Scholar] [CrossRef]
    20. Spinellis, D. A repository with 44 years of Unix evolution. In Proceedings of the 12th Working Conference on Mining Software Repositories, Florence, Italy, 16–17 May 2015; pp. 462–465. [Google Scholar] [CrossRef]
    21. Spinellis, D. Documented unix facilities over 48 years. In Proceedings of the 15th International Conference on Mining Software Repositories, Gothenburg, Sweden, 28–29 May 2018; pp. 58–61. [Google Scholar] [CrossRef]
    22. Blei, D.M.; Ng, A.Y.; Jordan, M.I. Latent Dirichlet Allocation. J. Mach. Learn. Res. 2003, 3, 993–1022. [Google Scholar]
    23. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the 1st International Conference on Learning Representations, Scottsdale, AZ, USA, 2–4 May 2013. [Google Scholar]
    24. Joulin, A.; Grave, E.; Bojanowski, P.; Mikolov, T. Bag of Tricks for Efficient Text Classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Valencia, Spain, 3–7 April 2017; pp. 427–431. [Google Scholar]
    25. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
    26. Kamp, M.; Kreutzer, P.; Philippsen, M. SeSaMe: A data set of semantically similar Java methods. In Proceedings of the 16th International Conference on Mining Software Repositories, Montreal, ON, Canada, 26–27 May 2019; pp. 529–533. [Google Scholar] [CrossRef]
    27. Efstathiou, V.; Chatzilenas, C.; Spinellis, D. Word embeddings for the software engineering domain. In Proceedings of the 15th International Conference on Mining Software Repositories, New York, NY, USA, 28–29 May 2018; pp. 38–41. [Google Scholar] [CrossRef]
    28. Henkel, J.; Bird, C.; Lahiri, S.K.; Reps, T. A Dataset of Dockerfiles. In Proceedings of the 17th International Conference on Mining Software Repositories, Seoul, Republic of Korea, 25–26 May 2020; pp. 528–532. [Google Scholar] [CrossRef]
    29. Quaranta, L.; Calefato, F.; Lanubile, F. KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle. In Proceedings of the 2021 IEEE/ACM 18th International Conference on Mining Software Repositories (MSR), Madrid, Spain, 22–30 May 2021; pp. 550–554. [Google Scholar] [CrossRef]
    30. Zacchiroli, S. A large-scale dataset of (open source) license text variants. In Proceedings of the 19th International Conference on Mining Software Repositories, Pittsburgh, PA, USA, 23–24 May 2022; pp. 757–761. [Google Scholar] [CrossRef]
    31. Rehurek, R.; Sojka, P. Gensim–Python Framework for Vector Space Modelling; NLP Centre, Faculty of Informatics, Masaryk University: Brno, Czech Republic, 2011; Volume 3. [Google Scholar]
    32. Röder, M.; Both, A.; Hinneburg, A. Exploring the Space of Topic Coherence Measures. In Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, Shanghai, China, 31 January–6 February 2015; pp. 399–408. [Google Scholar] [CrossRef]
    33. Wilkinson, M.D.; Dumontier, M.; Sansone, S.A.; Bonino da Silva Santos, L.O.; Prieto, M.; Batista, D.; McQuilton, P.; Kuhn, T.; Rocca-Serra, P.; Crosas, M.; et al. Evaluating FAIR maturity through a scalable, automated, community-governed framework. Sci. Data 2019, 6, 174. [Google Scholar] [CrossRef]
    34. Gaignard, A.; Rosnet, T.; De Lamotte, F.; Lefort, V.; Devignes, M.D. FAIR-Checker: Supporting digital resource findability and reuse with Knowledge Graphs and Semantic Web standards. J. Biomed. Semant. 2023, 14, 7. [Google Scholar] [CrossRef] [PubMed]
    35. Devaraju, A.; Mokrane, M.; Cepinskas, L.; Huber, R.; Herterich, P.; de Vries, J.; Akerman, V.; L’Hours, H.; Davidson, J.; Diepenbroek, M. From Conceptualization to Implementation: FAIR Assessment of Research Data Objects. Data Sci. J. 2021, 4, 20. [Google Scholar] [CrossRef]
    36. Devaraju, A.; Huber, R.; Mokrane, M.; Herterich, P.; Cepinskas, L.; de Vries, J.; L’Hours, H.; Davidson, J.; White, A. FAIRsFAIR Data Object Assessment Metrics; FAIRsFAIR: Den Haag, The Netherlands, 2022. [Google Scholar] [CrossRef]
    37. Devaraju, A.; Huber, R. An automated solution for measuring the progress toward FAIR research data. Patterns 2021, 2, 100370. [Google Scholar] [CrossRef] [PubMed]
    38. Diamantopoulos, T.; Papamichail, M.D.; Karanikiotis, T.; Chatzidimitriou, K.C.; Symeonidis, A.L. Employing Contribution and Quality Metrics for Quantifying the Software Development Process. In Proceedings of the 17th International Conference on Mining Software Repositories, Seoul, Republic of Korea, 25–26 May 2020; pp. 558–562. [Google Scholar] [CrossRef]
    39. Allix, K.; Bissyandé, T.F.; Klein, J.; Le Traon, Y. AndroZoo: Collecting millions of Android apps for the research community. In Proceedings of the 13th International Conference on Mining Software Repositories, New York, NY, USA, 14–15 May 2016; pp. 468–471. [Google Scholar] [CrossRef]
    40. Gousios, G. The GHTorent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories, San Francisco, CA, USA, 18–19 May 2013; pp. 233–236. [Google Scholar] [CrossRef]
    41. Ehrlinger, L.; Wöß, W. A Survey of Data Quality Measurement and Monitoring Tools. Front. Big Data 2022, 5, 850611. [Google Scholar] [CrossRef]
    42. Freitas, A.; Curry, E. Big Data Curation. In New Horizons for a Data-Driven Economy: A Roadmap for Usage and Exploitation of Big Data in Europe; Cavanillas, J.M., Curry, E., Wahlster, W., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 87–118. [Google Scholar] [CrossRef]
    43. Batini, C.; Cappiello, C.; Francalanci, C.; Maurino, A. Methodologies for data quality assessment and improvement. ACM Comput. Surv. 2009, 41, 16. [Google Scholar] [CrossRef]
    44. de Haro-Olmo, F.J.; Valencia-Parra, Á.; Varela-Vaca, Á.J.; Álvarez-Bermejo, J.A. Data curation in the Internet of Things: A decision model approach. Comput. Math. Methods 2021, 3, e1191. [Google Scholar] [CrossRef]
    45. Bosu, M.F.; Macdonell, S.G. Experience: Quality Benchmarking of Datasets Used in Software Effort Estimation. J. Data Inf. Qual. 2019, 11, 19. [Google Scholar] [CrossRef]
    46. Onan, A. Two-Stage Topic Extraction Model for Bibliometric Data Analysis Based on Word Embeddings and Clustering. IEEE Access 2019, 7, 145614–145633. [Google Scholar] [CrossRef]
    47. Cachola, I.; Lo, K.; Cohan, A.; Weld, D. TLDR: Extreme Summarization of Scientific Documents. In Proceedings of the Findings of the Association for Computational Linguistics: EMNLP 2020, Online, 16–20 November 2020; pp. 4766–4777. [Google Scholar] [CrossRef]
    48. Petrosyan, L.; Aleixandre-Benavent, R.; Peset, F.; Valderrama-Zurián, J.C.; Ferrer-Sapena, A.; Sixto-Costoya, A. FAIR degree assessment in agriculture datasets using the F-UJI tool. Ecol. Inform. 2023, 76, 102126. [Google Scholar] [CrossRef]
    Figure 1. Methodology steps for creating the directory of datasets.
    Figure 2. Coherence values for different number of topics.
    Figure 3. An example of the F1 principle, its metric, and its two practical tests, adapted from [37].
    Figure 4. The schema of our directory and an example FAIR assessment.
    Figure 5. Example screenshot of the web interface of our directory.
    Figure 6. Number of datasets that are still findable per year.
    Figure 7. Number of citations per paper.
    Figure 8. Distribution of citations per category.
    Figure 9. Average FAIR score per year.
    Figure 10. FAIR score versus number of citations for every year.
    Table 1. Top terms of topics extracted from abstracts.
    ID | Top Terms
    0 | code, source, project, library, tool, android, metric, bug, test, analysis
    1 | project, code, source, developer, study, api, model, system, technique, type
    2 | source, test, bug, provide, code, analysis, community, method, study, package
    3 | code, review, project, source, present, change, set, github, repository, quality
    4 | code, bug, repository, github, tool, study, activity, source, technique, apps
    5 | github, project, source, vulnerability, apps, development, tool, repository, present, info
    6 | code, repository, source, project, datasets, file, analysis, developer, language, github
    7 | bug, project, repository, report, developer, code, study, single, source, activity
    8 | study, project, model, development, based, repository, source, issue, code, tool
    9 | repository, bug, source, project, code, information, method, developer, system, present
    10 | repository, issue, information, system, source, project, github, file, evolution, developer
    11 | source, defect, present, developer, android, test, security, development, project, datasets
    12 | project, source, code, development, clone, set, bug, github, os, datasets
    13 | source, system, evolution, code, build, set, project, developer, comment, information
    Table 2. FAIR principles.
    Findable:
    F1. (Meta)data are assigned a globally unique and persistent identifier
    F2. Data are described with rich metadata (defined by R1 below)
    F3. Metadata clearly and explicitly include the identifier of the data they describe
    F4. (Meta)data are registered or indexed in a searchable resource
    Accessible:
    A1. (Meta)data are retrievable by identifier using a standard communications protocol
    A1.1 The protocol is open, free, and universally implementable
    A1.2 The protocol allows for authentication and authorization, where necessary
    A2. Metadata are accessible, even when the data are no longer available
    Interoperable:
    I1. (Meta)data use a formal, accessible, shared, and broadly applicable language
    I2. (Meta)data use vocabularies that follow FAIR principles
    I3. (Meta)data include qualified references to other (meta)data
    Reusable:
    R1. (Meta)data are richly described with a plurality of accurate and relevant attributes
    R1.1. (Meta)data are released with a clear and accessible data usage license
    R1.2. (Meta)data are associated with detailed provenance
    R1.3. (Meta)data meet domain-relevant community standards
    Table 3. FAIRness evaluation metrics of the F-UJI tool.
    Principle | Metric
    F1 | FsF-F1-01D Data are assigned a globally unique identifier.
    F1 | FsF-F1-02D Data are assigned a persistent identifier.
    F2 | FsF-F2-01M Metadata include descriptive core elements to support data findability.
    F3 | FsF-F3-01M Metadata include the identifier of the data they describe.
    F4 | FsF-F4-01M Metadata are offered in such a way that they can be retrieved by machines.
    A1 | FsF-A1-01M Metadata contain access level and access conditions of the data.
    A1 | FsF-A1-02M Metadata are accessible through a standardized communication protocol.
    A1 | FsF-A1-03D Data are accessible through a standardized communication protocol.
    A2 | FsF-A2-01M Metadata remain available, even if the data are no longer available.
    I1 | FsF-I1-01M Metadata are represented using a formal knowledge representation language.
    I2 | FsF-I2-01M Metadata use semantic resources.
    I3 | FsF-I3-01M Metadata include links between the data and their related entities.
    R1 | FsF-R1-01MD Metadata specify the content of the data.
    R1.1 | FsF-R1.1-01M Metadata include license information under which data can be reused.
    R1.2 | FsF-R1.2-01M Metadata include provenance information for data creation/generation.
    R1.3 | FsF-R1.3-01M Metadata follow the standard recommended by the target research community.
    R1.3 | FsF-R1.3-02D Data are available in a file format recommended by the target research community.
    Table 4. Number of papers/datasets and average number of citations per category.
    Dataset Category | Total Number of Datasets | Percentage of Datasets | Average Number of Citations
    Version Control | 38 | 19.39% | 37
    Software Issues | 58 | 29.59% | 26
    Developer Metrics | 24 | 12.25% | 27
    Software Evolution | 12 | 6.12% | 13
    Semantic Metrics | 20 | 10.20% | 20
    Other Data | 44 | 22.45% | 31
    Table 5. Number of papers/datasets and average number of citations per topic.
    Dataset Topic | Total Number of Datasets | Percentage of Datasets | Average Number of Citations
    Topic 0 | 28 | 14.29% | 23
    Topic 1 | 19 | 9.70% | 13
    Topic 2 | 5 | 2.55% | 2
    Topic 3 | 11 | 5.61% | 34
    Topic 4 | 6 | 3.06% | 12
    Topic 5 | 12 | 6.12% | 155
    Topic 6 | 28 | 14.29% | 22
    Topic 7 | 12 | 6.12% | 34
    Topic 8 | 8 | 4.08% | 24
    Topic 9 | 18 | 9.19% | 13
    Topic 10 | 14 | 7.14% | 11
    Topic 11 | 9 | 4.59% | 18
    Topic 12 | 17 | 8.67% | 18
    Topic 13 | 9 | 4.59% | 25
    Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

    © 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
