Observational Monitoring Records Downstream Impacts of Beaver Dams on Water Quality and Quantity in Temperate Mixed-Land-Use Watersheds
A Comprehensive Data Maturity Model for Data Pre-Analysis
Introducing UWF-ZeekData24: An Enterprise MITRE ATT&CK Labeled Network Attack Traffic Dataset for Machine Learning/AI
Historical Bolide Infrasound Dataset (1960–1972)

Journal Description

Data

Data is a peer-reviewed, open access journal on data in science, with the aim of enhancing data transparency and reusability. The journal publishes in two sections: a section on the collection, treatment and analysis methods of data in science; a section publishing descriptions of scientific and scholarly datasets (one dataset per paper). The journal is published monthly online by MDPI.

Open Access: free for readers, with article processing charges (APC) paid by authors or their institutions.
High Visibility: indexed within Scopus, ESCI (Web of Science), Ei Compendex, dblp, Inspec, RePEc, and other databases.
Journal Rank: JCR - Q2 (Multidisciplinary Sciences) / CiteScore - Q2 (Information Systems and Management)
Rapid Publication: manuscripts are peer-reviewed and a first decision is provided to authors approximately 25.2 days after submission; acceptance to publication is undertaken in 2.9 days (median values for papers published in this journal in the first half of 2025).
Recognition of Reviewers: reviewers who provide timely, thorough peer-review reports receive vouchers entitling them to a discount on the APC of their next publication in any MDPI journal, in appreciation of the work done.

Impact Factor: 2.0 (2024); 5-Year Impact Factor: 2.1 (2024)

ISSN: 2306-5729

Latest Articles

6 pages, 190 KiB

Open Access Data Descriptor

A Combined HF Radar and Drifter Dataset for Analysis of Highly Variable Surface Currents

by Bartolomeo Doronzo, Michele Bendoni, Stefano Taddei, Angelo Boccacci and Carlo Brandini

Data 2025, 10(7), 115; https://doi.org/10.3390/data10070115 (registering DOI) - 12 Jul 2025

Abstract

This data descriptor presents the HF radar and drifter datasets, along with the methods used to process and apply them in a previously published study on the validation of surface current measurements in a region characterized by highly variable coastal dynamics. The data were collected in the framework of a large-scale Lagrangian experiment, which included extensive drifter deployment and the generation of virtual trajectories based on HF radar-derived flow fields. Both Eulerian and Lagrangian approaches were used to assess radar performance through correlation and RMSE metrics, with additional refinement achieved via Kriging interpolation. The validation results, published in Remote Sensing, demonstrated good agreement between HF radar and drifter observations, particularly when quality control parameters were optimized. The datasets and associated methodologies described here support ongoing efforts to enhance HF radar tuning strategies and improve surface current monitoring in complex marine environments.

(This article belongs to the Special Issue Data in Astrophysics and Geophysics: Research and Applications, 3rd Edition)
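
The abstract above names correlation and RMSE as the metrics for comparing radar and drifter velocities. As a minimal sketch of those two metrics only, assuming the observations have already been matched in time and space, they could be computed as follows. The function names and sample velocities are hypothetical and are not taken from the dataset or from the authors' processing chain.

```python
import math


def rmse(a, b):
    """Root-mean-square error between two equal-length sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_a * var_b)


# Hypothetical eastward velocity components (m/s) at matched times.
radar_u = [0.10, 0.20, 0.30, 0.40]
drifter_u = [0.12, 0.18, 0.33, 0.37]

print(f"RMSE: {rmse(radar_u, drifter_u):.3f} m/s")
print(f"Correlation: {pearson(radar_u, drifter_u):.3f}")
```

In practice the same two functions would be applied separately to each velocity component, and the collocation step that pairs radar cells with drifter positions is the part this sketch leaves out.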

15 pages, 2054 KiB

Open Access Data Descriptor

Data on Brazilian Powdered Milk Formulations for Infants of Various Age Groups: 0–6 Months, 6–12 Months, and 12–36 Months

by Francisco José Mendes dos Reis, Antonio Marcos Jacques Barbosa, Elaine Silva de Pádua Melo, Marta Aratuza Pereira Ancel, Rita de Cássia Avellaneda Guimarães, Priscila Aiko Hiane, Flavio Santana Michels, Daniele Bogo, Karine de Cássia Freitas Gielow, Diego Azevedo Zoccal Garcia, Geovanna Vilalva Freire, João Batista Gomes de Souza and Valter Aragão do Nascimento

Data 2025, 10(7), 114; https://doi.org/10.3390/data10070114 - 9 Jul 2025

Abstract

Milk powder is a key nutritional alternative to breastfeeding, but its thermal properties, which vary with temperature, can affect its quality and shelf life. However, there is little information about the physical and chemical properties of powdered milk in several countries. This dataset contains the results of an analysis of the aflatoxins, macroelement and microelement concentrations, oxidative stability, and fatty acid profile of infant formula milk powder. The concentrations of Al, As, Ba, Cd, Co, Cr, Cu, Fe, Mg, Mn, Mo, Ni, Pb, Se, V, and Zn in digested powdered milk samples were quantified through inductively coupled plasma optical emission spectrometry (ICP OES). Thermogravimetry (TG) and differential scanning calorimetry (DSC) were used to estimate the oxidative stability of infant formula milk powder, while the methyl esters of the fatty acids were analyzed by gas chromatography. Most milk samples showed significant concentrations of As (0.5583–1.3101 mg/kg) and Pb (0.2588–0.0847 mg/kg). The concentrations of aflatoxins G2 and B2 are below the limits established by Brazilian regulatory agencies. The thermal degradation behavior differs among samples owing to their fatty acid compositions. The data presented may be useful in identifying compounds present in infant milk powder used as a substitute for breast milk and in understanding the mechanism of thermal stability and degradation, helping to ensure food safety for those who consume it.

10 pages, 4821 KiB

Open Access Data Descriptor

Multi-Resolution Remote Sensing Dataset for the Detection of Anthropogenic Litter: A Multi-Platform and Multi-Sensor Approach

by Robert Rettig, Felix Becker, Alexander Berghoff, Tobias Binkele, Wolfram Michael Butter, Tilman Floehr, Martin Kumm, Carolin Leluschko, Florian Littau, Elmar Reinders, Eike Rodenbäck, Tobias Schmid, Sabine Schründer, Sören Schweigert, Michael Sinhuber, Jens Wellhausen, Frederic Stahl and Christoph Tholen

Data 2025, 10(7), 113; https://doi.org/10.3390/data10070113 - 9 Jul 2025

Abstract

The dataset developed within the PlasticObs+ project aims to facilitate a multi-resolution approach for detecting and quantifying anthropogenic litter through aerial images. Traditional detection methods often suffer from narrow, use-case-specific limitations, reducing their transferability. To address this, an image dataset was created featuring various spatial and spectral resolutions. The highest spatial resolution images (ground sampling distance = 0.2 cm) were used to generate a labeled dataset, which was georeferenced for mapping onto coarser-resolution images.

(This article belongs to the Section Spatial Data Science and Digital Earth)

9 pages, 16281 KiB

Open Access Data Descriptor

Advancements in Regional Weather Modeling for South Asia Through the High Impact Weather Assessment Toolkit (HIWAT) Archive

by Timothy Mayer, Jonathan L. Case, Jayanthi Srikishen, Kiran Shakya, Deepak Kumar Shah, Francisco Delgado Olivares, Lance Gilliland, Patrick Gatlin, Birendra Bajracharya and Rajesh Bahadur Thapa

Data 2025, 10(7), 112; https://doi.org/10.3390/data10070112 - 9 Jul 2025

Abstract

Some of the most intense thunderstorms and extreme weather events on Earth occur in the Hindu Kush Himalaya (HKH) region of Southern Asia. The need to provide end users, stakeholders, and decision makers with accurate forecasts and alerts of extreme weather is critical. To that end, a cutting-edge weather modeling framework coined the High Impact Weather Assessment Toolkit (HIWAT) was created through the National Aeronautics and Space Administration (NASA) SERVIR Applied Sciences Team (AST) effort, which consists of a suite of varied numerical weather prediction (NWP) model runs to provide probabilities of straight-line damaging winds, hail, frequent lightning, and intense rainfall as part of a daily 54 h forecast tool. The HIWAT system was first deployed in 2018, and the recently released model archive hosted by the Global Hydrometeorology Resource Center (GHRC) Distributed Active Archive Center (DAAC) provides daily model outputs for the years of 2018–2022. With a nested modeling domain covering Nepal, Bangladesh, Bhutan, and Northeast India, the HIWAT archive spans the critical pre-monsoon and monsoon months of March–October when severe weather and flooding are most frequent. As part of NASA’s Transformation To Open Science (TOPS), this data archive is freely available to practitioners and researchers.

(This article belongs to the Section Spatial Data Science and Digital Earth)
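
The abstract above says that a suite of model runs is used to provide probabilities of hazards but does not say how those probabilities are derived. One common and simple way to turn a set of runs into a probability is to count the fraction of runs that exceed a threshold. The sketch below illustrates that idea only; the function name, the 12-member ensemble size, the gust values, and the 25 m/s threshold are all hypothetical and should not be read as the HIWAT method.

```python
def exceedance_probability(member_values, threshold):
    """Fraction of ensemble members whose value meets or exceeds the threshold."""
    if len(member_values) == 0:
        raise ValueError("at least one ensemble member is required")
    hits = sum(1 for value in member_values if value >= threshold)
    return hits / len(member_values)


# Hypothetical wind gusts (m/s) at one grid cell, one value per model run.
gusts = [18.0, 22.5, 26.1, 19.4, 27.3, 30.2, 24.9, 21.0, 25.0, 17.8, 28.8, 23.3]

probability = exceedance_probability(gusts, threshold=25.0)
print(f"Probability of gusts >= 25 m/s: {probability:.2f}")
```

Applying the same function at every grid cell and forecast hour would yield a probability map, which is the general shape of product the abstract describes.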

12 pages, 659 KiB

Open Access Article

PlantDRs: A Database of Dispersed Repeats in Plant Genomes Identified by the Iterative Procedure Method

by Valentina Rudenko, Eugene Korotkov and Dmitrii Kostenko

Data 2025, 10(7), 111; https://doi.org/10.3390/data10070111 - 9 Jul 2025

Abstract

In this work, we searched for and analyzed highly divergent dispersed repeats (DRs) in the genomes of four plants: Arabidopsis thaliana, Capsicum annuum, Daucus carota, and Zea mays. DRs were detected using the iterative procedure method, which has shown efficacy in searches for highly divergent repeats in bacteria and algae. The results indicated that the number of DRs in the plant genomes depended on the genome size, whereas the number of repeat families did not. The DRs covered from 36 to 50% of the studied genomes. The shortest repeats were observed in the D. carota genome, but their consensus lengths were similar to those in the other species. Analysis of periodicity in various DR families showed that most periods were 3 bp long. We created a database of the detected DRs, which contains 5,392,216 DRs grouped in 150 families and which can be accessed on the Research Center of Biotechnology RAS server. The server makes it possible to search for repeats based on various criteria and to download the obtained data.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

18 pages, 1810 KiB

Open Access Article

Analysis of Student Dropout Risk in Higher Education Using Proportional Hazards Model and Based on Entry Characteristics

by Liga Paura, Irina Arhipova, Gatis Vitols and Sandra Sproge

Data 2025, 10(7), 110; https://doi.org/10.3390/data10070110 - 8 Jul 2025

Abstract

The aim of this study is to identify the key factors contributing to student dropout and to develop a predictive model that estimates the dropout risk of students based on their entry characteristics and enrolment registration data. Our analysis is based on the registration and academic data of 971 full-time and part-time bachelor’s students in five faculties, who were enrolled in the academic year 2021–2022 at the Latvia University of Life Sciences and Technologies (LBTU). The dropout analysis covered the first 3.5 years of study, by which time the students in engineering and information technology, agriculture and food technology, economics and social sciences, and forest and environmental studies had started their last semester and the veterinary medicine students had completed more than half of their program of study. Survival analysis methods were used during the study. Students’ dropout risk in relation to gender, faculty, priority to study in the program, and secondary school performance (SM) was estimated using the proportional hazards model (Cox model). The highest student dropout was observed during the first year of study. Secondary school performance was a significant predictor of students’ dropout risk; students with higher SM had a lower dropout risk (HR = 0.66, p < 0.05). Student dropout can also be explained by faculty or study programme. Students in economics and social sciences were at lower dropout risk than the students from the other faculties. The model’s concordance index was 0.59, which indicates that additional or stronger predictors may be needed to improve model performance.

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)
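
The concordance index reported in the abstract above measures how well a model ranks subjects by risk: 0.5 corresponds to random ordering and 1.0 to perfect ordering, so 0.59 is a modest result. As a minimal sketch of the usual pairwise definition (Harrell's C) for right-censored data, the function below counts comparable pairs directly. The function name and the toy data are hypothetical, and the authors' own software may handle ties differently.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C: share of comparable pairs in which the subject with the
    earlier observed event also has the higher predicted risk.

    times       -- follow-up time for each subject
    events      -- 1 if the event (e.g. dropout) was observed, 0 if censored
    risk_scores -- model output where a higher value means higher risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i has an observed event
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs in the data")
    return concordant / comparable


# Hypothetical data: semesters until dropout or end of follow-up.
times = [1, 2, 3, 4, 7]
events = [1, 1, 0, 1, 0]
risks = [0.9, 0.4, 0.5, 0.6, 0.1]

print(f"C-index: {concordance_index(times, events, risks):.2f}")
```

The quadratic loop is fine for a cohort of about a thousand students; larger datasets would normally use a library implementation instead.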

16 pages, 3375 KiB

Open Access Data Descriptor

ICA-Based Resting-State Networks Obtained on Large Autism fMRI Dataset ABIDE

by Sjir J. C. Schielen, Jesper Pilmeyer, Albert P. Aldenkamp, Danny Ruijters and Svitlana Zinger

Data 2025, 10(7), 109; https://doi.org/10.3390/data10070109 - 3 Jul 2025

Abstract

Functional magnetic resonance imaging (fMRI) has become instrumental in researching the functioning of the brain. One application of fMRI is investigating the brains of people with autism spectrum disorder (ASD). The Autism Brain Imaging Data Exchange (ABIDE) facilitates this research through its extensive data-sharing initiative. While ABIDE offers raw data and data preprocessed with various atlases, independent component analysis (ICA) for dimensionality reduction remains underutilized. ICA is a data-driven way to reduce dimensionality without prior assumptions on delineations. Additionally, ICA separates the noise from the signal, and the signal components correspond well to functional brain networks called resting-state networks (RSNs). Currently, no large, readily available dataset preprocessed with ICA exists. Here, we address this gap by presenting ABIDE’s data preprocessed to extract ICA-based resting-state networks, which are publicly available. These RSNs unveil neural activation clusters without atlas constraints, offering a perspective on ASD analyses that complements the predominantly atlas-based literature. This contribution provides a resource for further research into ASD, benchmarking between methodologies, and the development of new analytical approaches.

(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)

10 pages, 687 KiB

Open Access Data Descriptor

A DNA Barcode Dataset for the Aquatic Fauna of the Panama Canal: Novel Resources for Detecting Faunal Change in the Neotropics

by Kristin Saltonstall, Rachel Collin, Celestino Aguilar, Fernando Alda, Laura M. Baldrich-Mora, Victor Bravo, María Fernanda Castillo, Sheril Castro, Luis F. De León, Edgardo Díaz-Ferguson, Humberto A. Garcés, Eyda Gómez, Rigoberto G. González, Maribel A. González-Torres, Hector M. Guzman, Alexandra Hiller, Roberto Ibáñez, César Jaramillo, Klara L. Kaiser, Yulang Kam, Mayra Lemus Peralta, Oscar G. Lopez, Maycol E. Madrid C., Matthew J. Miller, Natalia Ossa-Hernandez, Ruth G. Reina, D. Ross Robertson, Tania E. Romero-Gonzalez, Milton Sandoval, Oris Sanjur, Carmen Schlöder, Ashley E. Sharpe, Diana Sharpe, Jakob Siepmann, David Strasiewsky, Mark E. Torchin, Melany Tumbaco, Marta Vargas, Miryam Venegas-Anaya, Benjamin C. Victor and Gustavo Castellanos-Galindo

Data 2025, 10(7), 108; https://doi.org/10.3390/data10070108 - 2 Jul 2025

Abstract

DNA metabarcoding is a powerful biodiversity monitoring tool, enabling simultaneous assessments of diverse biological communities. However, its accuracy depends on the reliability of reference databases that assign taxonomic identities to obtained sequences. Here we provide a DNA barcode dataset for aquatic fauna of the Panama Canal, a region that connects the Western Atlantic and Eastern Pacific oceans. This unique setting creates opportunities for trans-oceanic dispersal while acting as a modern physical dispersal barrier for some terrestrial organisms. We sequenced 852 specimens from a diverse array of taxa (e.g., fishes, zooplankton, mollusks, arthropods, reptiles, birds, and mammals) using COI, and in some cases, 12S and 16S barcodes. These data were collected for a variety of studies, many of which have sought to understand recent changes in aquatic communities in the Panama Canal. The DNA barcodes presented here are all from captured specimens, which confirms their presence in Panama and, in many cases, inside the Panama Canal. Both native and introduced taxa are included. This dataset represents a valuable resource for environmental DNA (eDNA) work in the Panama Canal region and across the Neotropics aimed at monitoring ecosystem health, tracking non-native and potentially invasive species, and understanding the ecology and distribution of these freshwater and euryhaline taxa.

(This article belongs to the Special Issue Benchmarking Datasets in Bioinformatics, 2nd Edition)

7 pages, 1300 KiB

Open Access Data Descriptor

Global Database for Naturally Occurring Radionuclides Associated with Offshore Oil and Gas Production

by Ziran Wei, Songjie He, Stephanie Sharuga and Kanchan Maiti

Data 2025, 10(7), 107; https://doi.org/10.3390/data10070107 - 1 Jul 2025

Abstract

This study compiles a comprehensive dataset on the occurrence, distribution, and potential impacts of Naturally Occurring Radionuclides (NORMs) near offshore oil and gas platforms. It encompasses data, including activities (Bq/l) and exposure levels (mSv), derived from various environmental matrices. A particular emphasis is placed on petroleum products and waste, such as produced water, scales, and sludges. The dataset contributes to a better understanding of the distribution of NORM wastes in marine environments, informs future radiological safety standards, contributes to the formulation of regulatory policies, and facilitates the design of mitigation strategies. The information, which comprises literature and data from five continents over the past 70 years, has been carefully compiled and organized to support intuitive analysis, making it a valuable tool for policymakers and researchers.

27 pages, 1023 KiB

Open Access Article

Exploring Legislative Textual Data in Brazilian Portuguese: Readability Analysis and Knowledge Graph Generation

by Gisliany Lillian Alves de Oliveira, Breno Santana Santos, Marianne Silva and Ivanovitch Silva

Data 2025, 10(7), 106; https://doi.org/10.3390/data10070106 - 1 Jul 2025

Abstract

Legislative documents are crucial to democratic societies, defining the legal framework for social life. In Brazil, legislative texts are particularly complex due to extensive technical jargon, intricate sentence structures, and frequent references to prior legislation. The country’s civil law tradition and multicultural context introduce further interpretative and linguistic challenges. Moreover, the study of Brazilian Portuguese legislative texts remains underexplored, lacking legal-specific models and datasets. To address these gaps, this work proposes a data-driven approach utilizing large language models (LLMs) to analyze these documents and extract knowledge graphs (KGs). A case study was conducted using 1869 proposals from the Legislative Assembly of Rio Grande do Norte (ALRN), spanning January 2019 to April 2024. The Llama 3.2 3B Instruct model was employed to extract KGs representing entities and their relationships. The findings support the method’s effectiveness in producing coherent graphs faithful to the original content. Nevertheless, challenges remain in resolving entity ambiguity and achieving full relationship coverage. Additionally, readability analyses using metrics for Brazilian Portuguese revealed that ALRN proposals require superior reading skills due to their technical style. Ultimately, this study advances legal artificial intelligence by providing insights into Brazilian legislative texts and promoting transparency and accessibility through natural language processing techniques.

21 pages, 287 KiB

Open Access Article

Expert Experiences in Anonymizing Personal Data and Its Use as Open Data: Qualitative Insights

by Norbert Lichtenauer, Johann Guggumos, Matthias Kampmann, Juliane Kis, Florian Laumer, Elena März, Florian Wahl and Sebastian Wilhelm

Data 2025, 10(7), 105; https://doi.org/10.3390/data10070105 - 1 Jul 2025

Abstract

Introduction: The effective and meaningful use of anonymized personal data, including open data, is globally significant across various sectors. Enhancing data utilization aims to generate substantial societal benefits and added value through innovations, products, and services. However, several legal, ethical, and technical challenges currently hinder the development and broader adoption of open data. Furthermore, the availability of technical support tools with high usability is especially desirable to facilitate the anonymization process effectively. Methods: As part of the EAsyAnon research project, preliminary insights were gathered through a scoping review that identified factors promoting or impeding the anonymization and use of personal data. Based on these findings, a structured interview guide was developed. Following a pretest, 19 interviews were conducted with diverse stakeholders from healthcare institutions, research organizations, public authorities, and private companies. The collected data were analyzed using Kuckartz’s structural content analysis methodology, supported by qualitative analysis software. Results: The content analysis yielded five overarching categories and 21 subcategories. These encompassed stakeholder experiences related to anonymization and open data processes, the various types and formats of personal data, identified barriers and enabling factors, support services, and the ethical and legal considerations associated with anonymization. Discussion: The findings highlight significant uncertainty among stakeholders regarding the anonymization of personal data. Although the importance and potential applications of open data for innovation and continuous improvement are widely acknowledged and supported, numerous challenges persist at both the macro and micro levels. The results emphasize a clear need for targeted support measures to address these challenges effectively.

(This article belongs to the Special Issue Ethical AI and Responsible Data Science)

14 pages, 228 KiB

Open Access Article

Extracting Information from Unstructured Medical Reports Written in Minority Languages: A Case Study of Finnish

by Elisa Myllylä, Pekka Siirtola, Antti Isosalo, Jarmo Reponen, Satu Tamminen and Outi Laatikainen

Data 2025, 10(7), 104; https://doi.org/10.3390/data10070104 - 1 Jul 2025

Abstract

In the era of digital healthcare, electronic health records generate vast amounts of data, much of which is unstructured and therefore not in a usable format for conventional machine learning and artificial intelligence applications. This study investigates how to extract meaningful insights from unstructured radiology reports written in Finnish, a minority language, using machine learning techniques for text analysis. With this approach, unstructured information could be transformed into a structured format. The results of this research show that relevant information can be effectively extracted from Finnish medical reports using classification algorithms with default parameter values. For the detection of breast tumour mentions from medical texts, classifiers achieved high accuracy, almost 90%. Detection of metastasis mentions, however, proved more challenging, with the best-performing models, Support Vector Machine (SVM) and logistic regression, achieving an F1-score of 81%. The lower performance in metastasis detection is likely due to the more complex problem, ambiguous labeling, and the smaller dataset size. The results of classical classifiers were also compared with FinBERT, a domain-adapted Finnish BERT model. However, classical classifiers outperformed FinBERT. This highlights the challenge of medical language processing when working with minority languages. Moreover, it was noted that parameter tuning based on translated English reports did not significantly improve the detection rates, likely due to linguistic differences between the datasets. This larger translated dataset used for tuning comes from a different clinical domain and employs noticeably simpler, less nuanced language than the Finnish breast cancer reports, which are written by native Finnish-speaking medical experts. This underscores the need for localised datasets and models, particularly for minority languages with unique grammatical structures.
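
The abstract above reports accuracy for one task and an F1-score for the other. As a reminder of how the F1-score is built from a classifier's confusion counts, here is a minimal sketch. The function name and the counts are hypothetical, chosen only so that the example lands near the reported 81%, and do not come from the study.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from binary confusion counts.

    Each ratio falls back to 0.0 when its denominator is zero.
    """
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts for a "metastasis mentioned" classifier.
precision, recall, f1 = precision_recall_f1(true_pos=81, false_pos=19, false_neg=19)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it ignores true negatives, which is why it can tell a different story from accuracy when positive mentions are rare.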

15 pages, 770 KiB

Open Access Data Descriptor

NPFC-Test: A Multimodal Dataset from an Interactive Digital Assessment Using Wearables and Self-Reports

by Luis Fernando Morán-Mirabal, Luis Eduardo Güemes-Frese, Mariana Favarony-Avila, Sergio Noé Torres-Rodríguez and Jessica Alejandra Ruiz-Ramirez

Data 2025, 10(7), 103; https://doi.org/10.3390/data10070103 - 30 Jun 2025

Abstract

The growing implementation of digital platforms and mobile devices in educational environments has generated the need to explore new approaches for evaluating the learning experience beyond traditional self-reports or instructor presence. In this context, the NPFC-Test dataset was created from an experimental protocol conducted at the Experiential Classroom of the Institute for the Future of Education. The dataset was built by collecting multimodal indicators such as neuronal, physiological, and facial data using a portable EEG headband, a medical-grade biometric bracelet, a high-resolution depth camera, and self-report questionnaires. The participants were exposed to a digital test lasting 20 min, composed of audiovisual stimuli and cognitive challenges, during which synchronized data from all devices were gathered. The dataset includes timestamped records related to emotional valence, arousal, and concentration, offering a valuable resource for multimodal learning analytics (MMLA). The recorded data were processed through calibration procedures, temporal alignment techniques, and emotion recognition models. It is expected that the NPFC-Test dataset will support future studies in human–computer interaction and educational data science by providing structured evidence to analyze cognitive and emotional states in learning processes. In addition, it offers a replicable framework for capturing synchronized biometric and behavioral data in controlled academic settings.

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

24 pages, 1586 KiB

Open Access Article

Effective Education System for Athletes Utilising Big Data and AI Technology

by Martin Mičiak, Dominika Toman, Roman Adámik, Ema Kufová, Branislav Škulec, Nikola Mozolová and Aneta Hoferová

Data 2025, 10(7), 102; https://doi.org/10.3390/data10070102 - 24 Jun 2025

Abstract

Education leads to building successful careers. However, different groups of students have different studying preferences. Our target group is athletes who combine their education with sports training. The main objective is to provide recommendations for an effective education system for athletes, improving their chances of finding new careers after leaving sports. Such a system must include Big Data and utilise the AI possibilities currently available that support athletes’ career planning and development in a meaningful way. The main objective is specified by the following partial objectives: identifying what types of Big Data to analyse in connection with the athletes’ education; revealing what AI tools to include in the athletes’ education for their better preparation for a career after sports; determining what knowledge of AI and Big Data athletes need to stay relevant once they enter the labour market. Our study combines secondary and primary data sources. The secondary data (used in the orientation analysis) include case studies on AI and Big Data connected to education. The primary data were collected via a survey performed on over 200 Slovak junior athletes. The results show directions for sports policymakers and sports organisations’ managers willing to improve their athletes’ career prospects.

(This article belongs to the Special Issue Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition)

8 pages, 786 KiB

Open Access Data Descriptor

OrthoKnow-SP: A Large-Scale Dataset on Orthographic Knowledge and Spelling Decisions in Spanish Adults

by Jon Andoni Duñabeitia

Data 2025, 10(7), 101; https://doi.org/10.3390/data10070101 - 24 Jun 2025

Abstract

Orthographic knowledge is a critical component of skilled language use, yet its large-scale behavioral signatures remain understudied in Spanish. To address this gap, we developed OrthoKnow-SP, a megastudy that captures spelling decisions from 27,185 native Spanish-speaking adults who completed an 80-item forced-choice task. Each trial required selecting the correctly spelled word from a pair comprising a real word and a pseudohomophone foil that preserved pronunciation while violating the correct graphemic representation. The stimuli targeted six high-confusability contrasts in Spanish orthography. We recorded response accuracy and reaction times for over 2.17 million trials, alongside demographic and device metadata. Results show robust variability across items and individuals, with item-level metrics closely aligned with independent norms of word prevalence. A composite difficulty index integrating speed and accuracy further allowed fine-grained item ranking. The dataset provides the first population-scale norms of Spanish spelling difficulty, capturing regional and generational diversity absent from traditional lab-based studies. Public release of OrthoKnow-SP enables new research on the cognitive and demographic factors shaping orthographic decisions, and provides educators, clinicians, and developers with a valuable benchmark for assessing spelling competence and modeling written language processing.

14 pages, 296 KiB

Open Access Article

Collecting and Analyzing IBD Clinical Data for Machine-Learning: Insights from an Italian Cohort

by Aldo Marzullo, Victor Savevski, Maddalena Menini, Alessandro Schilirò, Gianluca Franchellucci, Arianna Dal Buono, Cristina Bezzio, Roberto Gabbiadini, Cesare Hassan, Alessandro Repici and Alessandro Armuzzi

Data 2025, 10(7), 100; https://doi.org/10.3390/data10070100 - 24 Jun 2025

Abstract

Research on inflammatory bowel disease (IBD) involves integrating diverse and heterogeneous data sources, from clinical records to imaging and laboratory results, which presents significant challenges in data harmonization and exploration. These challenges are also reflected in the development of machine-learning applications, where inconsistencies in data quality, missing information, and variability in data formats can adversely affect the performance and generalizability of models. In this study, we describe the collection and curation of a comprehensive dataset focused on IBD. In addition, we present a dedicated research platform. We focus on ethical standards, data protection, and seamless integration of different data types. We also discuss the challenges encountered, as well as the insights gained during its implementation.

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

30 pages, 30383 KiB

Open AccessTechnical Note

Dataset and AI Workflow for Deep Learning Image Classification of Ulcerative Colitis and Colorectal Cancer

byJoaquim Carreras,Giovanna Roncador andRifat Hamoudi

Data2025,10(7), 99;https://doi.org/10.3390/data10070099 - 24 Jun 2025

Abstract

Inflammatory bowel disease (IBD) is a chronic inflammatory condition of the gastrointestinal tract characterized by the deregulation of immuno-oncology markers. IBD includes ulcerative colitis and Crohn’s disease. Chronic active inflammation is a risk factor for the development of colorectal cancer (CRC). This technical note describes a dataset of histological images of ulcerative colitis, CRC (adenocarcinoma), and colon control. The samples were stained with hematoxylin and eosin (H&E), and immunohistochemically analyzed for LAIR1 and TOX2 markers. The methods used for collecting, processing, and analyzing scientific data, including this dataset, using convolutional neural networks (CNNs) and information about the dataset’s use are also described. This article is a companion to the manuscript “Ulcerative Colitis, LAIR1 and TOX2 Expression, and Colorectal Cancer Deep Learning Image Classification Using Convolutional Neural Networks”.Full article

(This article belongs to the Section Computational Biology, Bioinformatics, and Biomedical Data Science)

20 pages, 2245 KiB

Open AccessArticle

Data-Driven Modeling and Simulation in Forestry and Agricultural Product Transportation Management by Small Businesses: A Case Study

by Galina Merkurjeva, Vitalijs Bolsakovs, Jurijs Merkurjevs, Andrejs Romanovs and Wouter Faes

Data 2025, 10(7), 98; https://doi.org/10.3390/data10070098 - 24 Jun 2025

Abstract

This article proposes an innovative methodology for data-driven modeling and simulation of transportation management through cross-sectoral collaboration in small businesses. The present research is multidisciplinary and interdisciplinary in nature. We investigate the improvements in logistics management that can be achieved through cross-sector collaboration in agriculture and forestry. A data-driven method, such as symbolic regression, is used to identify the relationships between factors in a modeled system using mathematical expressions. These expressions are directly integrated into the simulation models. Simulation spreads the modeling of transportation processes over a period of time. The system dynamics model is designed to analyze and assess the performance of a system based on its past behavior and is, therefore, deterministic. The discrete-event model enables the simulation of future scenarios and outcomes over time, given random input variables. As new data become available, relationships within the symbolic regression method are discovered more accurately, and simulations are updated accordingly. The tools offered for implementation are supplemented by a multi-user web simulation. The proposed case study is based on a real-life example. The obtained results allow small agricultural companies to use transportation and labor resources more efficiently when organizing the transportation of their agricultural and forestry products. Integrating data-driven models into simulations enables a better interpretation of data across the entire data value chain.

27 pages, 1050 KiB

Open AccessArticle

Developing Data Workflows: From Conceptual Blueprints to Physical Implementation

by Bruno Oliveira and Óscar Oliveira

Data 2025, 10(7), 97; https://doi.org/10.3390/data10070097 - 23 Jun 2025

Abstract

Data workflows are an important component of modern analytical systems, enabling structured data extraction, transformation, integration, and delivery across diverse applications. Despite their importance, these workflows are often developed using ad hoc approaches, leading to scalability and maintenance challenges. This paper proposes a structured, three-level methodology (conceptual, logical, and physical) for modeling data workflows using Business Process Model and Notation (BPMN). A custom BPMN metamodel is introduced, along with a tool built on BPMN.io, that enforces modeling constraints and supports translation from high-level workflow designs to executable implementations. Logical models are further enriched through blueprint definitions, specified in a formal, implementation-agnostic JSON schema. The methodology is validated through a case study that demonstrates its applicability across ETL and machine learning domains and promotes clarity, reuse, and automation in data pipeline development.

20 pages, 4787 KiB

Open AccessArticle

A Data Imputation Strategy to Enhance Online Game Churn Prediction, Considering Non-Login Periods

by JaeHong Lee, Pavinee Rerkjirattikal and SangGyu Nam

Data 2025, 10(7), 96; https://doi.org/10.3390/data10070096 - 23 Jun 2025

Abstract

User churn in online games refers to players becoming inactive for an extended period. Even a small increase in churn can lead to significant revenue loss, making churn prediction crucial for sustaining long-term player engagement. Although user churn prediction has been extensively studied, most existing approaches either ignore non-login periods or treat all inactivity uniformly, overlooking key behavioral differences. This study addresses this gap by categorizing non-login periods into three types: inactivity due to new or dormant users, genuine loss of interest, and temporary inaccessibility caused by external factors. These periods are treated as either non-existent or missing data and imputed using techniques such as mean or mode substitution, linear interpolation, and multiple imputation by chained equations (MICE). MICE was selected due to its ability to impute missing values more robustly by considering multivariate relationships. A random forest (RF) classifier, chosen for its interpretability and robustness to incomplete data, serves as the primary prediction model. Additionally, classifier chains are used to capture label dependencies, and principal component analysis (PCA) is applied to reduce dimensionality and mitigate overfitting. Experiments on real-world MMORPG data show that our approach improves predictive accuracy, achieving a micro-averaged AUC above 0.92 and a weighted F1 score exceeding 0.70. These findings suggest that our approach improves churn prediction and offers actionable insights for supporting personalized player retention strategies.

(This article belongs to the Section Information Systems and Data Management)

News

7 July 2025
Meet Us at the 2025 Forty-Second International Conference on Machine Learning (ICML), 13–19 July 2025, Vancouver, Canada

4 July 2025
MDPI’s Newly Launched Journals in June 2025

2 July 2025
MDPI INSIGHTS: The CEO's Letter #24 - 2024 Impact Factor & CiteScore, MDPI Summits France & USA, Tu Youyou Award

Topics

Topic in Algorithms, Data, Earth, Geosciences, Mathematics, Land, Water, IJGI

Applications of Algorithms in Risk Assessment and Evaluation
Topic Editors: Yiding Bao, Qiang Wei
Deadline: 31 July 2025

Topic in AI, Data, Economies, Mathematics, Risks

Advanced Techniques and Modeling in Business and Economics
Topic Editors: José Manuel Santos-Jaén, Ana León-Gomez, María del Carmen Valls Martínez
Deadline: 30 September 2025

Topic in Biology, Data, Diversity, Fishes, Animals, Conservation, Hydrobiology

Intersection Between Macroecology and Data Science
Topic Editors: Paulo Branco, Gonçalo Duarte
Deadline: 30 November 2025

Topic in Applied Sciences, Batteries, Buildings, Data, Electricity, Electronics, Energies, Smart Cities

Smart Energy Systems, 2nd Edition
Topic Editors: Hugo Morais, Rui Castro, Cindy Guzman
Deadline: 30 December 2025

Special Issues

Special Issue in Data

Benchmarking Datasets in Bioinformatics, 2nd Edition
Guest Editor: Pufeng Du
Deadline: 31 July 2025

Special Issue in Data

Data Mining and Computational Intelligence for E-Learning and Education—3rd Edition
Guest Editor: Antonio Sarasa-Cabezuelo
Deadline: 20 August 2025

Special Issue in Data

Data in Astrophysics and Geophysics: Research and Applications, 3rd Edition
Guest Editors: Vladimir Sreckovic, Milan S. Dimitrijević, Zoran Mijic
Deadline: 31 August 2025

Special Issue in Data

Navigating Emerging Advancements and Challenges in AI and Big Data Technologies for Business and Society
Guest Editor: Michael Gerlich
Deadline: 30 September 2025

Topical Collections

Topical Collection in Data

Modern Geophysical and Climate Data Analysis: Tools and Methods
Collection Editors: Vladimir Sreckovic, Zoran Mijic

Data, EISSN 2306-5729, Published by MDPI
