An empirical study of business process models and model clones on GitHub


Empirical Software Engineering

Abstract

Business process management underpins a multi-billion-dollar industry founded on modeling business processes to analyze, understand, improve, and automate them. Business processes consist of a set of interconnected activities that an organization follows to achieve its goals and objectives. While the existence of business process models in open source has been reported in the literature, there is little work characterizing their landscape. This paper presents the first characterization of business process models in open source, particularly on GitHub. The landscape comprises 25,866 business process models across 4,954 repositories, with 16% of the repositories belonging to organizations. We discover that the models belong to at least 16 domains, including traditional software, machine learning, sales, business services, and financial services, and were created using at least 28 different tools. Our exploration of cloning among the models shows that about 90% of all models are clones of each other. Application domains such as machine learning, traditional software, and business services exhibit a higher occurrence of clones, and clones are found across more repositories owned by industry than by academia. Also, contrary to code clones, we find that the majority of process model cloning occurs across multiple repositories. While our study acts as a precursor for future efforts to develop effective modeling practices in the field of business processes, it also emphasizes the need to address cloning and its implications for reuse, maintenance, and modeling approaches.


1 Introduction

Business processes consist of a set of interconnected activities that an organization follows to achieve its goals and objectives (Weske 2007). Modeling of business processes is crucial to comprehend and manage these processes to ensure and enhance the efficiency and competitiveness of organizations (Dumas et al. 2018). The business process management (BPM) industry is projected to reach a market capitalization of USD 83+ billion by 2032 (Footnote 1, Footnote 2).

While there are many notations for modeling business processes, such as UML Activity Diagrams (Fowler 2004), Event-driven Process Chains (EPCs) (Mendling 2008), Business Process Model and Notation (BPMN) (OMG 2013), and Yet Another Workflow Language (YAWL) (Van Der Aalst and Ter Hofstede 2005), BPMN is widely accepted as the de facto standard in both industry and academia (OMG 2013; Chinosi and Trombetta 2012; Saeedi Nikoo et al. 2020). BPMN is standardized by the Object Management Group (OMG 2013). This paper studies business process models in BPMN notation.

Characterizing modeling and related artifacts is a foundational step in empirical software engineering research. Software engineering literature on characterization has focused on open source software from industry (Han et al. 2021) and software from specific domains like machine learning (Gonzalez et al. 2020), automotive (Kochanthara et al. 2022), video games (Murphy-Hill et al. 2014), and bots (Wessel et al. 2018). With the increasing adoption of BPMN models across organizations (Geiger et al. 2018), it is vital to gain a deeper understanding of BPMN models in open source. Characterizing these open source models is crucial to understanding how BPMN models are developed in collaborative development settings. This can: (a) inform the development of tools that serve the specific needs of open-source BPM projects; (b) influence best practices in BPMN model creation and maintenance; and (c) guide standardization efforts and future revisions of the BPMN standard.

The BPMN-related literature has focused on mining potential BPMN models from GitHub (Heinze et al. 2020), building a dataset of BPMN models on GitHub (Türker et al. 2022), exploring the usability of BPMN model datasets in the validation and verification stage of tool development (Heinze et al. 2020), and design decisions in BPMN modeling (Lübke and Wutke 2021). However, to the best of our knowledge, a characterization of BPMN models in open source has not been presented in the literature. This paper addresses this research gap by answering the following research questions:

RQ1: What is the landscape of BPMN models in open source?

We explore what application domains the BPMN models belong to, and what tools are used to create these models. We also explore other characteristics of the BPMN models and their repositories, including the temporal trends of the models and their ownership.

We find that the BPMN models on GitHub belong to at least 16 domains, with machine learning, traditional software, sales, business services, and financial services emerging as the most prevalent. We observe an increasing adoption of BPMN models on GitHub, as evidenced by the consistent growth in model creation and updates. While organization-owned repositories remain significant (\(\approx \) 16%) contributors to BPMN models, the majority (\(\approx \) 84%) of repositories hosting the models are owned by individual users. Furthermore, our study reveals a strong reliance on industry-leading process model development tool vendors, such as Drools (2023), Camunda (2023), Activiti (2023), and SAP Signavio (2023), for BPMN models on GitHub, highlighting their importance within the open source BPMN model development community.

Our characterization includes not only aspects such as the diversity and landscape of BPMN models but also their reuse. Cloning is a critical aspect of reuse that can degrade the quality of BPMN models (Haisjackl et al. 2015). Our initial exploration of the BPMN models on GitHub revealed the possibility of duplication or cloning. Cloning or replicating process models across projects can significantly impact maintenance and quality assurance, increase redundancy and inconsistency, and affect the efficiency of development processes and the reliability of BPMN models (Roy and Cordy 2007; La Rosa et al. 2015; Dumas et al. 2013; Uba et al. 2011). In BPMN models, cloning introduces maintainability challenges due to increased complexity and potential inconsistencies (La Rosa et al. 2015). Replicating similar fragments across different models necessitates applying changes multiple times, escalating the risk of errors and bug propagation. Cloned fragments make individual process models larger than they need to be, affecting their comprehensibility (Dumas et al. 2013) and thus their quality (Haisjackl et al. 2015; Monden et al. 2002). Cloning can significantly affect evolution, because the effort required to evolve a model depends heavily on the amount of cloned content (Sneed 2008; April and Abran 2012). Cloning can also negatively impact design practice by discouraging refactoring and improvements (Rattan et al. 2013).

The problem of clone detection has been extensively explored in the software engineering domain, mainly in the context of source code clone detection (Roy et al. 2009), but also for model clone detection (Deissenboeck et al. 2010; Störrle 2015; Deissenboeck et al. 2008; Rattan et al. 2013; Babur et al. 2019; Saeedi Nikoo et al. 2022). There is a significant amount of duplication within and across repositories on GitHub (Lopes et al. 2017; Allamanis 2019; Spinellis et al. 2020). For instance, Babur et al. (2019) report that 66% of metamodels from GitHub are file-level exact duplicates. In the context of BPMN models, prior studies have explored duplication and similarity among BPMN models from specific organizations (Dijkman et al. 2011; Uba et al. 2011). While the mere existence of duplicate BPMN models on GitHub (based on their SHA-1 hashes) has been reported (Türker et al. 2022), no analysis of the similarity of these models has been presented in the scientific literature. Therefore, we complement our first research question on the characterization of BPMN models in open source by answering our second research question:

RQ2: What is the extent of cloning in BPMN models in open source?

We conduct a cloning analysis at the entire-model and subprocess levels in BPMN models. Specifically, we investigate exact clones (i.e., Type A clones) and highly similar clones (i.e., Type B and C clones) (La Rosa et al. 2015; Babur et al. 2019; Saeedi Nikoo et al. 2022). We explore the prevalence of cloning within repositories versus across repositories. We quantitatively analyze the characteristics of clones, including the application domains they belong to and the kind of repositories they come from.

The majority of the model clones (\(\approx \) 80%) were found to be exact clones, with cross-project clones being the most common (\(\approx \) 57%) and predominantly associated with industry-owned (as opposed to academia-owned) repositories. An analysis of the model-level clones indicates that machine learning, traditional software, and business services are the top three domains in the dataset. A further clone detection analysis on subprocess elements within the models reveals a substantial degree (\(\approx \) 92%) of cloning among these model fragments. It also shows that all the detected subprocess clones have at least one exact clone.

In summary, the primary contributions of this study are:

  • The first study into the characterization of BPMN models in open source as seen on GitHub,

  • The first study into cloning of BPMN models at model and subprocess levels on GitHub,

  • A unique dataset containing tagged, distinct, non-trivial BPMN models and their similarity to other models, and distinct, non-trivial BPMN subprocesses.

The structure of this paper is as follows. Section 2 provides a detailed description of our study design. In Section 3, we address RQ1 by presenting the results of our analyses conducted on the model set. Section 4 addresses RQ2, providing the outcomes of our clone analysis on the models and subprocesses within them. In Section 5 we discuss the main implications of our study. To ensure the validity of our findings, we discuss potential threats that may impact our results in Section 6. Section 7 provides a thorough review of prior work related to our study. Finally, Section 8 concludes the paper by summarizing our key contributions and suggesting possible avenues for future research.

2 Study Design

We study BPMN models from GitHub. Our choice of GitHub is motivated by its position as the most popular platform to host open source projects for organizations and individual users alike, with 420 million projects and 4.5 billion contributions as of 2023 (GitHub 2023). In addition to the BPMN models themselves, GitHub also provides their development history, the context (e.g., the software project the BPMN models are a part of), and metadata (Munaiah et al. 2017). This information is required for this study. While there are other collections of BPMN models (PROS-Lab 2019; Sola et al. 2022; Türker et al. 2022), none of them provide any of the above-mentioned information beyond the BPMN models themselves.

The study design of prior similar studies on BPMN models (Uba et al. 2011; Dijkman et al. 2011) focuses on models from a single organization or a few organizations for checking the effectiveness of (clone detection) tools, and is thus not applicable in our context.

Our study is structured in three parts. First, we identify non-trivial BPMN models from all potential BPMN models. The second and third parts use the BPMN models identified in the first part. The second part presents descriptive statistics on the BPMN models and their repositories (Section 3). The third part presents a characterization of similarities (cloning) among BPMN models and related descriptive statistics (Section 4). The study design of the first two parts takes inspiration from recent landscape studies (Han et al. 2021; Gonzalez et al. 2020; Kochanthara et al. 2022; Wessel et al. 2018). The study design of the third part takes inspiration from recent studies on software clones (Spinellis et al. 2020; Allamanis 2019; Gharehyazie et al. 2019; Lopes et al. 2017). The rest of this section presents the first part. An overview of the steps followed in the first, second, and third parts is shown in Fig. 1.

Fig. 1: Overview of our analytics workflow

2.1 Data Retrieval

This study builds on a recent dataset by Türker et al. (2022). The artifacts in their dataset were collected from a set of \(\approx \) 82.8 million non-forked, non-deleted, public repositories, out of 113 million distinct, non-fork repositories in the latest database dump provided by GHTorrent (Gousios 2013). While their dataset offers the latest and most extensive compilation of links to BPMN files on GitHub, its data extends up to March 2021 (i.e., the latest GHTorrent dump). This means that any BPMN models added to GitHub since then are absent from our analysis; however, the downloaded artifacts reflect the most up-to-date information available at the time of retrieval. We choose this dataset because it is the result of a mining process specific to BPMN models and is the most recent and largest available dataset covering open source projects. The dataset was formed by identifying BPMN artifacts and associated repositories in GHTorrent using keywords including business and bpm, combined with tool names such as camunda and signavio (Türker et al. 2022). It contains links to 327,436 potential BPMN artifacts across 18,534 repositories (Türker et al. 2022). Note that directly mining GitHub is a resource-intensive process entailing significant computing resources and potential expenses; for this reason, we refrain from mining GitHub directly and instead use the existing dataset as our starting point.

Using the links in Türker et al.'s dataset (Türker et al. 2022), we downloaded (in September 2022) 287,317 potential BPMN models, belonging to 15,609 distinct repositories, from GitHub. Around 40k (\(\approx \) 12%) of the models could not be downloaded due to unavailability, possibly because the files or their repositories were removed or renamed, or the repositories were made private. For replicability, the script used to download the files is included in our replication package (Saeedi Nikoo et al. 2023).

2.2 Inclusion and Exclusion Criteria

A BPMN model represents an end-to-end workflow, like a flowchart, using interconnected graphical elements (see Fig. 2). To answer our research questions (Section 1), we need to identify meaningful, non-trivial BPMN models (i.e., excluding non-BPMN and toy models) among the 287,317 potential ones. To this end, we use the following inclusion-exclusion criteria, inspired by relevant literature (Sola et al. 2022; Corradini et al. 2018). The inclusion-exclusion steps, along with the number of models left after each step, are shown in Fig. 1. The individual steps were automated; the scripts used in each step and the corresponding lists of included/excluded files are presented in the replication package (Saeedi Nikoo et al. 2023) (Footnote 3).

  • Exclude non-BPMN files. The scope of this study is files in the BPMN standard serialization format (OMG 2013). However, the 287,317 artifacts from GitHub contain files beyond this format, including scripts (e.g., SQL scripts), BibTeX files, and PDFs. To exclude such files, we use the BPMN2 Modeler plugin, a widely used Eclipse-based graphical modeling tool for authoring business processes (Eclipse 2021). The plugin parsed 278,804 files (Saeedi Nikoo et al. 2023) out of the 287,317 artifacts; we discard the remaining 8,473 files.

  • Exclude models with \(\le 3\) activities. This study focuses on non-trivial BPMN models. To identify non-trivial models, we use information on the activities of each model. Activities in BPMN models consist of atomic and compound activities, represented using task and subprocess elements, respectively (OMG 2013) (see Fig. 2). The representative role of activities in business process models (Dumas et al. 2018) implies that models with only a few activity elements are most likely toy (i.e., not real-world) or trivial models (Corradini et al. 2018). To count the activities in each model, we first convert each BPMN model to a list of activity labels using the BPMN2 Modeler API (Eclipse 2021); the length of this list gives the number of activity labels in the model. In the remaining steps in this section, we use this converted format (a list of strings) for each BPMN model. We identified that 205,542 of the 278,804 BPMN models from the previous step have \(\le 3\) activities. After excluding these, we were left with 73,262 models.

  • Exclude models in which every activity label has fewer than three characters. In this study we focus on meaningful BPMN models that represent real-world processes. As in the previous step, we use the activity labels of the BPMN models. No one-letter or two-letter word (Wikibooks 2024) can represent a self-contained, meaningful task, so a BPMN model in which all activity labels are one- or two-letter words cannot be a meaningful BPMN model. We identify such models using the converted format from the previous step (BPMN models as lists of their activity labels). We identified and removed 9,095 such models from the dataset. After excluding them from the 73,262 models of the previous step, we were left with 64,167 models.

  • Exclude models composed of repeating terms. A frequency plot of the activity labels of individual models revealed a high number of models with only the terms “task”, “script”, and “subprocess” in all of their labels, with at most one label having any other content. Examples (lists of activity labels) of such processes are: (1) “subProcess”, “Complete SubTask”, “Complete Task A”, “Complete Task B”; (2) “check before normal end”, “SubProcess”, “Service Task”, “User Task”, “User Task”. A model composed only of these terms (in all or all but one of its activities) is unlikely to represent a real-world process. Trivial single-symbol tokens (e.g., “1”, “A”) are not considered repeating terms. We identified 16,891 models with these repeating terms. After excluding them from the 64,167 models of the previous step, we were left with 47,276 models.

  • Include only English models. We focus only on models in English and therefore exclude models with non-English labels. We identify non-English models using an open source library, Lingua (2022), which has been shown to be effective in identifying natural language, especially for short text (Saier et al. 2022). Language detection was done separately for each activity label of a model, and a model was removed if all of its activity labels were identified as non-English. We identified 20,161 non-English models out of the 47,276 from the previous step and excluded them from the dataset, leaving 27,115 models.
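The model-level filters above can be sketched as plain functions over a model's activity-label list. This is a simplified Python approximation only: the study itself parses models with the BPMN2 Modeler API, the generic-term list here is an illustrative subset, and the language detector is injected as a callable (the study uses the Lingua library):

```python
import re
import xml.etree.ElementTree as ET

# Local names of BPMN activity elements (atomic tasks and subprocesses).
ACTIVITY_TAGS = {"task", "userTask", "serviceTask", "scriptTask", "manualTask",
                 "sendTask", "receiveTask", "businessRuleTask",
                 "subProcess", "callActivity"}

# Illustrative subset of the generic terms filtered in the third step.
GENERIC_TERMS = {"task", "script", "subprocess"}

def activity_labels(bpmn_xml: str) -> list[str]:
    """Extract the 'name' attribute of every activity element in a model."""
    root = ET.fromstring(bpmn_xml)
    labels = []
    for elem in root.iter():
        local = elem.tag.rsplit("}", 1)[-1]  # strip the '{namespace}' prefix
        if local in ACTIVITY_TAGS:
            labels.append(elem.get("name", ""))
    return labels

def has_enough_activities(labels: list[str], minimum: int = 4) -> bool:
    """Second filter: keep models with more than three activities."""
    return len(labels) >= minimum

def has_meaningful_label(labels: list[str]) -> bool:
    """Third filter: keep models where some label has >= 3 characters."""
    return any(len(lb.strip()) >= 3 for lb in labels)

def is_repeating_terms_model(labels: list[str], tolerance: int = 1) -> bool:
    """Fourth filter: flag models where at most `tolerance` labels carry
    content beyond generic terms or single-symbol tokens."""
    def generic(label: str) -> bool:
        words = re.findall(r"[A-Za-z]+|\S", label.lower())
        return bool(words) and all(w in GENERIC_TERMS or len(w) == 1
                                   for w in words)
    non_generic = sum(1 for lb in labels if not generic(lb))
    return non_generic <= tolerance

def keep_model(labels: list[str], detect_language) -> bool:
    """Apply the label-based filters; `detect_language` maps a label to an
    ISO code such as 'en' (or None when unsure, common for short text)."""
    return (has_enough_activities(labels)
            and has_meaningful_label(labels)
            and not is_repeating_terms_model(labels)
            and any(detect_language(lb) in ("en", None)
                    for lb in labels if lb.strip()))
```

A model passes `keep_model` only if it survives every label-based criterion; treating an undetermined language verdict as "not non-English" mirrors the conservative removal rule above (a model is dropped only when all labels are identified as non-English).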

Fig. 2: An example of a BPMN diagram (a variation of a process model from the book Fundamentals of Business Process Management (Dumas et al. 2013)) showing an order-to-cash process. This type of process is performed by a vendor: it starts with receiving an order from a customer to purchase a product or a service, and ends when the product or service is delivered to the customer. Different activity types of the BPMN diagram are annotated.

During our analysis, we encountered repositories with different URLs that referred to the same repository. To avoid double-counting these repositories and their associated models, we removed duplicates based on the final resolved URL of each repository. This resulted in the removal of 256 duplicated repositories from our dataset, collectively containing 1,249 models. These duplicated models were removed from the 27,115 models resulting from our inclusion-exclusion criteria. Thus, we are left with 25,866 models, originating from 4,954 repositories, that are investigated in this study.
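The resolved-URL deduplication can be sketched as follows. The `resolve` callable is an assumption (e.g., an HTTP request that follows redirects to a renamed repository); the grouping logic is the part shown here:

```python
def dedup_repositories(repo_urls: list[str], resolve) -> list[str]:
    """Collapse repository URLs that resolve to the same final URL
    (e.g., after a repository rename or transfer). `resolve` maps a URL
    to its final resolved URL; inject e.g. a function that issues an
    HTTP request with redirects enabled."""
    seen: dict[str, str] = {}
    for url in repo_urls:
        final = resolve(url)
        seen.setdefault(final, url)  # keep the first URL per resolved target
    return list(seen.values())
```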

To summarize, Türker et al.'s dataset (2022) serves as our starting point. We refine the dataset by eliminating noise, including non-BPMN files, trivial models, and non-English models. Additionally, we extend the dataset by adding metadata such as the types of repositories hosting the models, the application domains of the models, their temporal trends, and the clones of the models and their fragments (explained in Sections 3 and 4).

3 RQ1 – What is the Landscape of BPMN Models in Open Source?

We answer RQ1 in three dimensions. First, we classify the models according to their application domains. Second, we determine the range of modeling tools employed in developing these models. Third, we investigate the ownership of the models and their evolution over the years. Our findings are based on data from 25,866 non-trivial BPMN models originating from 4,954 open source projects on GitHub, spanning the years 2010 to 2021. The rest of this section is organized as follows: Section 3.1 presents our approach and Section 3.2 our findings.

3.1 Approach

Application domains

We identify application domains of process models in two parts. In the first part, we refine a list of domains that the models in our dataset might belong to. In the second part, we classify each BPMN model into domain(s) identified in the first part.

Part 1: Model domain definition. To identify the potential domains for BPMN models in our dataset, we start with the literature on the classification of domains. The most popular domain classification schemas include the Global Industry Classification Standard (GICS) (MSCI 1999), The Refinitiv Business Classification (TRBC) (Reuters-Group 2004), and the Digital Commons three-tiered taxonomy of academic disciplines (2016). The first two are known for their extensive yet granular coverage across various business domains, providing a global (non-region-specific) framework for business applications; the third is more suitable for classifications based on academic disciplines. Since our context is business processes, we choose GICS and TRBC as the basis for identifying domains for BPMN models in our dataset. To choose a subset of domains (GICS lists 163 sub-industries and TRBC lists 62 industry groups) from these classifications, we rely on activity labels because of their representative role in business process models (Dumas et al. 2018).

Initially, we attempted a top-down approach to domain identification by creating a list of popular application domains for process modeling. This list was compiled from use cases documented in the proceedings of the top-tier conference series on Business Process Management (2023), spanning a period of ten years. However, this approach proved ineffective in classifying the models in our dataset, as the chosen domains were not representative of our models.

We adopt a bottom-up approach inspired by the open coding process from Grounded Theory (Corbin and Strauss 1990) to classify the individual activity labels in the models into a set of potential domains. We then use the potential domains to define the final list of domains.

We identify a potential list of domains using the following steps:

  • Label extraction: Extract activity labels for all BPMN models (i.e., 44,135 distinct labels).

  • Frequency-based sorting: Sort the labels based on their frequency of occurrence within models, from highest to lowest.

  • Label selection: Keep activity labels (see Fig. 2 for sample activity labels) occurring at least 10 times (2,494 out of 44,135 distinct labels).

  • Label domain assignment: Through open coding, we manually assign domain tags to these labels. During this process, we initially assigned labels to potential domains. For instance, labels like “check insurance” or “evaluate severity of claim” were classified under the insurance domain, and labels like “create loan request” or “invoice credit card” were classified under the financial domain. Note that the list of domains was not predetermined and evolved incrementally as more labels were analyzed. We provide a list of all activity labels and assigned domains in our replication package (Saeedi Nikoo et al. 2023).

  • Model domain definition: We map each of our domains to the industry group within the GICS and TRBC taxonomies that has the closest name similarity. For instance, we map logistics to the Freight & Logistics Services category, education to School, College & University, and business services to Professional & Commercial Services in the TRBC taxonomy. For some of the domains that evolved in our open coding (e.g., manufacturing), GICS and TRBC have a higher granularity (e.g., aerospace manufacturing or special material manufacturing). However, we did not have BPMN activity labels that could be mapped to the individual sub-domains at this higher granularity; in such cases, one domain in our classification maps to multiple domains in the underlying GICS and TRBC classifications. An exception to these straightforward mappings is the software domain. During our open coding, two subdomains (traditional software and machine learning software) evolved that map to the same GICS and TRBC domain, software. This might be due to the increased prominence of machine learning related software in the last decade, while GICS and TRBC were introduced about two decades ago. Therefore, we chose to have two domains in our classification that map to the wider software domain: traditional software and machine learning. The same domain can also have different names in GICS and TRBC; for instance, Software & Services in the GICS taxonomy closely aligns with the Software & IT Services category in the TRBC taxonomy. To address this difference in naming schemas, we defined a consistent naming schema of our own. For instance, by defining distinct domains such as financial services, logistics, insurance, and healthcare, we ensured that each domain encapsulates a coherent set of business activities covering the activity labels in our dataset (Taymouri et al. 2021; Roy et al. 2013). The complete mapping of domains to taxonomies is available in our replication package (Saeedi Nikoo et al. 2023).
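The label-extraction and selection steps above can be sketched as follows. Counting each label once per model is an assumption about how "frequency of occurrence within models" is computed; the exact counting is in the replication package:

```python
from collections import Counter

def frequent_labels(models: dict[str, list[str]],
                    min_freq: int = 10) -> list[str]:
    """Count how many models each distinct activity label occurs in,
    then keep labels at or above `min_freq`, most frequent first.
    `models` maps a model id to its list of activity labels."""
    counts: Counter = Counter()
    for labels in models.values():
        counts.update(set(labels))  # count per-model occurrence, not per-use
    return [label for label, c in counts.most_common() if c >= min_freq]
```

The surviving labels are the input to the manual open-coding step that follows.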

Part 2: Model classification. In this part, we classify individual BPMN models into one or more of the domains identified in Part 1. First, we convert each model to a list of its activity labels (strings). The model classification problem is thereby transformed into a text classification problem, where a list of strings (the activity labels of a BPMN model) needs to be classified into one or more predefined domains.

To classify textual data, there are automated approaches in the literature such as Naive Bayes (Raschka 2014), Support Vector Machines (Joachims 2005), and Deep Learning (Minaee et al. 2021). Since we lack a training set, such non-generalizable learning approaches (Raschka 2014; Joachims 2005; Minaee et al. 2021) are not practical in our case. However, Large Language Models (LLMs) have shown strong generalization to diverse downstream tasks (OpenAI 2023; Zheng et al. 2023), and manual classification of 25,866 individual models is not feasible in our context. Therefore, we choose a state-of-the-art LLM, GPT-4 (OpenAI 2024), for classifying the BPMN models in our dataset into domains.

For our context, we chose “zero-shot” learning, a setup in which an LLM identifies and classifies potentially unseen classes without any explicit training data for those classes (Kojima et al. 2022). This approach has been shown to be effective for text classification and data augmentation (Lin et al. 2023; Imran et al. 2024; Zhou et al. 2022; Li and Liang 2021; Li et al. 2024) in the literature.

We employed ChatGPT (GPT-4 model version “gpt-4”) (Brown et al. 2020) (Nov 2023) in a zero-shot setup to classify each model (represented as a list of its activity labels) into one or more of the 16 domains based on all the activity labels in the model. We discovered that, of the 25,866 models in our dataset, 13,898 share the exact same sequence of activity labels with other models. Therefore, we performed 11,968 (25,866 − 13,898) queries to ChatGPT instead of 25,866. The domain classification generated by ChatGPT for each unique model was then applied to all models sharing the same sequence of activity labels.
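The query-saving deduplication amounts to grouping models by their exact activity-label sequence, querying once per group, and fanning the answer back out. A minimal sketch:

```python
def unique_label_sequences(models: dict[str, list[str]]) -> dict[tuple, list[str]]:
    """Group model ids by their exact sequence of activity labels, so the
    LLM is queried once per distinct sequence and the resulting domain
    classification is applied to every model sharing that sequence."""
    groups: dict[tuple, list[str]] = {}
    for model_id, labels in models.items():
        groups.setdefault(tuple(labels), []).append(model_id)
    return groups
```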

To classify BPMN models, we needed to create a prompt to interact with ChatGPT. White et al. (2023) introduce generic patterns for effectively interacting with conversational LLMs. For our classification task, we use their template pattern to generate our prompts, as it is the most appropriate for our result format: the template pattern (White et al. 2023) ensures that the output of the LLM follows a precise structure (in our case, a list of domain names). Accordingly, we use the following prompt:

[figure e: the classification prompt template]

The prompt evolved iteratively. This process involved trials with different prompts to optimize the accuracy of the resulting classification. Through experimentation, we observed decreasing accuracy with an increasing number of prompt tokens; in particular, attempting to classify multiple models simultaneously appeared to negatively impact result accuracy. To address this, we tailored the prompting so that each prompt includes the activity labels of a single BPMN model alongside the defined list of domains.
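The one-model-per-prompt setup can be sketched as a pure prompt-construction function. The wording and domain list below are hypothetical illustrations in the spirit of the template pattern; the actual prompt used in the study is the one shown in figure e:

```python
# Illustrative excerpt of the 16 domains; the full list is in the study.
DOMAINS = ["machine learning", "traditional software", "sales",
           "business services", "financial services", "unknown"]

PROMPT_TEMPLATE = (
    "Classify the business process described by the following activity "
    "labels into one or more of these domains: {domains}. "
    "Answer with a comma-separated list of domain names only.\n"
    "Activity labels: {labels}"
)

def build_prompt(activity_labels: list[str]) -> str:
    """One model per prompt keeps token counts low, which we observed
    to improve classification accuracy."""
    return PROMPT_TEMPLATE.format(domains="; ".join(DOMAINS),
                                  labels="; ".join(activity_labels))
```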

In our post-processing of the results, we observed that the models initially classified as both machine learning and traditional software (1,098 models) only align with the traditional software category. Consequently, we opted to exclude the machine learning domain from these models. In addition, we observed 96 (\(\approx \) 0.8%) instances of hallucination (Bang et al. 2023) out of 11,968 total attempts (one request per model), where ChatGPT generated domain names outside the list of domains provided in our prompt. We manually classified these 96 models. For the complete list of models and their domains, please refer to our replication package (Saeedi Nikoo et al. 2023).
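Hallucinated domain names of this kind can be flagged mechanically by validating each answer against the domain list supplied in the prompt, routing any out-of-list names to manual review. A sketch (the allowed set is an illustrative excerpt):

```python
# Illustrative excerpt of the domain list given in the prompt.
ALLOWED_DOMAINS = {"machine learning", "traditional software", "sales",
                   "business services", "financial services", "unknown"}

def parse_domains(response: str) -> tuple[list[str], list[str]]:
    """Split a comma-separated LLM answer into recognized domains and
    hallucinated names (domains absent from the prompt's list); models
    with hallucinated names go to manual classification."""
    names = [d.strip().lower() for d in response.split(",") if d.strip()]
    valid = [d for d in names if d in ALLOWED_DOMAINS]
    hallucinated = [d for d in names if d not in ALLOWED_DOMAINS]
    return valid, hallucinated
```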

Note that, for the models categorized as unknown, data beyond the model itself might help in classifying them into one or more domains, for instance, metadata such as the repository name, ownership details, and other projects by the same owner. The scope of this work is the models themselves; future research can explore such possibilities to extend and improve the validity of our classification.

Note that a repository may contain models that belong to various application domains. In this study, we limit our classification to the model level and leave repository-level domain classification as a future work.

Validation: To evaluate the results of transformer-based models like ChatGPT, metrics including accuracy, precision, recall, and the F1-score are used in the literature (Mujahid et al. 2023). The F1-score (Footnote 4) is more effective than other metrics when classes are imbalanced because it considers both precision and recall, thereby offering a more accurate depiction of the model's performance (Imran et al. 2024; Li et al. 2024; Zhang et al. 2022). In our study, a higher F1-score indicates that ChatGPT shows higher agreement with the raters' classification. To calculate an overall score across all classes (i.e., application domains), we employ the widely used micro-averaged variant of the F1-score (Imran et al. 2024; Liu et al. 2019; Sokolova and Lapalme 2009).

To calculate the F1-score, we picked a random sample of nearly 1% of the classified model set (250 models) across the domains. Two authors then manually classified the sample set separately, so each model in the sample set has a list of one or more domains assigned manually by each of the two raters and a list of domains assigned by ChatGPT. Each author then tagged each model with one of the following labels: Yes, if ChatGPT output a superset of their classification; No, if ChatGPT gave a non-matching classification; and Partially, if some part of their classification was in ChatGPT’s classification but another part was not (applicable only to models with more than one domain). The inter-rater agreement between the two classifications was 0.82 as calculated using Cohen’s Kappa (Kvålseth1989), signifying a very good agreement. The two raters discussed their discrepancies until a decision was reached. Accordingly, 223 (\(\approx \)89%) of the models resulted in a correct classification (Yes); 5 (\(\approx \)2%) were partially correct (Partially); and 14 (\(\approx \)5.6%) were incorrect (No), as rated by both raters. There were disagreements in 8 (\(\approx \)3.2%) cases. The F1-score calculated for the sample set is \(\approx \)0.80. Our validation results on a subset of the models indicate that the use of LLMs for classification purposes is promising, achieving a near-perfect F1-score (a score of 1 indicates perfect precision and recall). Please refer to the replication package (Saeedi Nikoo et al.2023) for the classified sample set and the rated results.
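For completeness, Cohen's Kappa over the raters' Yes/Partially/No judgments can be computed as follows. This is a generic sketch with illustrative inputs, not the exact script used in the study:

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items.

    `rater_a` and `rater_b` are equal-length lists of categorical labels
    (here: the Yes / Partially / No judgment per sampled model).
    """
    n = len(rater_a)
    categories = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items with identical labels
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label proportions
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)
```

A kappa of 1 means perfect agreement; 0 means agreement no better than chance.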

For the models where there is complete disagreement between ChatGPT’s classification and that of the raters, the raters classified most of these models as unknown, whereas ChatGPT classified them into specific domains. This includes cases where ChatGPT assigned a domain based on minimal context. For instance, a model that contained an activity label “deliver mail” without additional context was classified by ChatGPT as logistics. There were also borderline cases containing abbreviations with no clear indication of their meaning. For example, a label containing “HC” was apparently interpreted by ChatGPT as healthcare and classified into the healthcare domain. In addition, some models had non-English labels where only the non-English portions hinted at specific domain(s). We believe that the first two cases mentioned above demonstrate examples of biased classification by ChatGPT. Reasoning about such biases is beyond the scope of this study.

Tools for BPMN model development

Insights into the usage of BPMN modeling tools are a crucial step in understanding BPMN model development, including development-related preferences, practices, and the workflow of BPMN model creation and maintenance. As a characterization of the BPMN models, we are interested in finding which modeling tools were used to develop the BPMN models in our dataset. To identify the tool used for each BPMN model, we searched the models for vendor-specific metadata (added by the tool that created the file). This tool-specific metadata includes tags with tool names (e.g., xmlns:camunda="http://camunda.org/schema/1.0/bpmn"), target namespaces (e.g., targetNamespace="http://bpmn.io/schema/bpmn"), and exporter information (e.g., exporter="camunda modeler"). We then manually verified the names of the tools until we could no longer associate a particular model with a tool because it was missing such vendor-specific metadata. The complete set of metadata extracted and the tool associations are available in the replication package (Saeedi Nikoo et al.2023).
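The metadata lookup can be sketched as follows. The pattern table below is an illustrative subset with hypothetical regular expressions; the full set of extracted metadata is in the replication package:

```python
import re

# Known metadata fingerprints of modeling tools (illustrative subset;
# the actual lookup table used in the study is in the replication package)
TOOL_PATTERNS = {
    "Camunda": [r"camunda\.org/schema", r'exporter="camunda'],
    "bpmn.io": [r"bpmn\.io/schema"],
    "Activiti": [r"activiti\.org"],
    "Drools": [r"jboss\.org/drools", r"drools"],
    "SAP Signavio": [r"signavio"],
}

def detect_tools(bpmn_xml: str) -> set:
    """Return every tool whose vendor-specific metadata appears in the
    model header (a model touched by several tools matches several)."""
    header = bpmn_xml[:4000].lower()  # metadata sits in the opening tags
    found = set()
    for tool, patterns in TOOL_PATTERNS.items():
        if any(re.search(p, header) for p in patterns):
            found.add(tool)
    return found
```

Because all patterns are checked, a model created in one tool and modified in another is associated with both, matching the multi-tool handling described below.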

Note that, in some instances, there are multiple tool names present in the model header. A potential reason could be the usage of one tool for model creation and another tool for model modification. In such cases, we associate the model with all the tools whose metadata were found in the model.

We also identified additional entries in the model headers that are not associated with modeling tools but rather pertain to other aspects, such as runtime, validation, or analytics of models. These entries were omitted as they were not relevant to the modeling aspect.

There is a chance that certain tools were utilized during the creation or modification of the BPMN models, yet no explicit data about these tools was included within the models. Consequently, these tools might be missing from our consideration. For 2,169 models (\(\approx \)8.39%), the models contained no name related to BPMN modeling tools; these models were left out of the tool analysis.

Ownership and evolution

To determine the owners of the repositories, we utilize the repository type information provided by GitHub. Using this information, we differentiate between user-owned and organization-owned repositories. Note that for 74 repositories and 363 models in total, we could not retrieve their metadata (i.e., repository type and model commit information), as they were not available at the time of the API calls (in 2023). The parts of the analysis that use these metadata do not cover those repositories and models.

To analyze how BPMN modeling in open source has evolved over time, we utilize two pieces of data available for files on GitHub: the creation time and the last update time. Our analysis covers a span of 11 years from 2010 to 2020, excluding the data for the year 2021 due to the incomplete coverage of the dataset for that year. For 150 repositories, we could not retrieve their creation time due to the unavailability of the repositories at the time these data were retrieved (December 2023). Note that GitHub sets the last update time to the creation time when a file is created. Therefore, when counting updates for a file, we do not consider this initial update.

Table 1 Identified domain categories and the associated number of labels and models per category

3.2Findings

Application domains

BPMN models in our model set originate from at least 16 domains, showing that BPMN models on GitHub span a diverse range of application domains. The domains and the number of models from each domain are presented in Table 1. A list of BPMN models for each category is given in our replication package (Saeedi Nikoo et al.2023). The majority of BPMN models are from the traditional software (\(\approx \)17% of the model set), machine learning (\(\approx \)17%), sales (\(\approx \)14%), business services (\(\approx \)13%), and financial services (\(\approx \)13%) domains, showing the popularity of these domains for BPMN modeling. These domains cumulatively comprise \(\approx \)62% of the model set. The agriculture domain, with a coverage of 20 models (\(\approx \)0.08%), has the least number of models.

Our model classification resulted in 6,857 models (\(\approx \)27% of all models) with more than one domain. This might be due to the overlapping nature of domains (see Table 1 for domain definitions). For instance, a process about packaging and delivery of goods (logistics domain) may also contain activities about goods insurance (insurance domain), or a process containing activities about online product purchases (sales domain) may also contain activities related to the delivery of the product to the customers (logistics domain).


Tools for BPMN model development

We identify 28 tool suites used for BPMN model development. Figure 3 shows a list of these tools with their usage frequency. About 86% of the BPMN models were developed with one of the following five tool suites: Drools (by JBoss) (2023), Camunda (2023), Activiti (2023), bpmn.io (Camunda) (2023), and SAP Signavio (2023), which could be attributed to their extensive and useful features and active community support. Among the five, SAP Signavio is the only closed-source tool, suggesting that some organizations and developers prefer proprietary solutions for business process modeling; the remaining four show the dominance of open source tools for BPMN modeling within the open source community. However, there is a long tail of tools with varying usage numbers, indicating that the GitHub community explores a wide range of tools for business process modeling.

Fig. 3: Identified tools used for BPMN modeling with their usage frequency

Fig. 4: Heatmap showing the comparison between application domains and the frequency of tools used in each domain

Fig. 5: Stacked bar chart showing how application domains have evolved over the years. It shows, for each domain, the number of models created over a span of 11 years

The comparison of the identified tools and application domains suggests that certain tools have higher usage within specific domains. For instance, Drools, Camunda, and Activiti (all of which are open source) are among the most highly used tools in the five domains with the most models. Figure 4 shows a heatmap representing this comparison. It shows that the machine learning domain mostly uses Camunda and Drools; the traditional software domain uses Activiti, bpmn.io, Drools, and Camunda; and models in the unknown category have used Activiti the most. It also shows that Drools and Camunda are the most used tools across all domains. This might stem from the tools providing specialized features tailored to the requirements of those domains. Exploring these domain-specific functionalities could be valuable for users seeking optimal tools within their respective domains, especially those unfamiliar with the available tooling options.

Fig. 6: Left: model-level temporal trends, including the creation and update times of models per year. Right: temporal trends of repositories containing BPMN models relative to all repositories on GitHub, between 2010 and 2020

Fig. 7: How user vs. organization ownership of repositories hosting BPMN models has evolved over time

Ownership and evolution

Fig. 8: Stacked bar chart showing how tool usage has evolved over time. It shows the frequency of tool usage for BPMN modeling over a span of 11 years

Results show significant variations in BPMN model creation across domains, as shown in Fig. 5. Certain domains likesales,business services,financial services, andtraditional software have a well-established history in BPMN modeling, spanning several years. Conversely, domains likemachine learning andmanufacturing represent relatively new entrants to the BPMN modeling landscape.

The numbers of BPMN models and repositories increased steadily every year until 2019 and 2020, respectively, as shown in Fig. 6. The number of BPMN models created on GitHub increased significantly after 2014. A sharp increase is seen around 2016 in the number of models created, but the trend does not change until 2019, when there is another sharp increase due to the substantial number (3,852) of models added to a single repository (UTS-AAi/AVATAR). The increase in model creation in recent years may be due to an increased awareness of BPMN, its relevance in different domains, or improved tooling for BPMN modeling. We notice a significantly higher number of updates, in comparison to models created, from 2010 to 2013. From 2014 onwards, although we see fluctuations in the model updating trends, the number of updates is lower but seems to be in proportion to the number of models created for most years, which may indicate more activity in BPMN modeling in open source. The decline in model updates might suggest that the created models are relatively stable or that the update frequency has decreased.

Until 2015, the proportion of newly created repositories containing BPMN models is close to the overall growth of GitHub repositories. From then until 2018, the rate of BPMN repository creation surpassed the rate of creation of all repositories on GitHub. In contrast to the model creation trend in 2019, we notice a drop in the creation rate of BPMN repositories in 2019.

As shown in Fig. 7, user-owned repositories have consistently played a prominent role in the development of BPMN models, contributing significantly over time. Also, the data indicates an overall gradual increase in contributions from organization-owned repositories, while it shows a decrease in user-owned repositories.

Tools Activiti (2023), Camunda (2023), Drools (2023), and SAP Signavio (2023) consistently show significant usage and maintain a continuous presence throughout the years, as shown in Fig. 8. However, tools like Flowable (2023) and bpmn.io (2023) show activity only in recent years. A reason could be that some of the tools from the second group came into existence later than the ones from the first group. Tools like ADONIS (2023), ARIS (2023), and Enterprise Architect (2023) show very little activity (i.e., fewer than 50 uses) for BPMN modeling throughout the years.

Regarding ownership, we identified 4,954 distinct repositories contributing a total of 25,866 BPMN models. Figure 9 provides an overview of the distribution of the models across repositories. For instance, 2,833 repositories host only one model from the dataset. Most (\(\approx \)94%) of the repositories contain fewer than 10 models, showing that only a few repositories host a larger number of models.

While organizations still contribute a significant number of BPMN models, the majority of repositories and models are owned by individual users. Figure 10 shows the distribution of models per repository in total and separately for each repository type (User and Organization). Out of the 4,880 BPMN model repositories (for which metadata was retrieved), 795 (\(\approx \)16%) belong to organizations, while 4,085 (\(\approx \)84%) are owned by users. Regarding the number of models in these repositories, 6,313 models (\(\approx \)25%) are organization-owned, while users own 19,248 models (\(\approx \)75%). The median count of models is one, two, and one for the User type, Organization type, and Total (User and Organization), respectively. The first boxplot (Total) shows a few outlier repositories (at the top) that contribute the highest numbers of models. For instance, there are 20 repositories that each contribute above 100 models to the dataset. Table 2 presents the 10 organization-type and user-type repositories that contribute the highest number of models. The results may imply that organizations on GitHub play a more focused role in contributing to specific projects, while individual users contribute to a broader range of BPMN models. Nevertheless, organizations on GitHub may also have a unique role in terms of promoting collaboration and knowledge sharing within their networks.

Fig. 9: Repositories and the number of BPMN models they host. It shows, for a given number of repositories (y-axis), the number of models they each host (x-axis)

Fig. 10: Boxplots showing the number of models hosted per repository type (User or Organization) and in total for both types

Camunda, bpmn.io, and Activiti are the top three tools utilized in organization-owned repositories, while Camunda, Drools, and Activiti are the top three tools with the highest prevalence in user-owned repositories (see Fig. 11). Regarding application domains, BPMN models in user-owned repositories are predominantly linked with the machine learning, traditional software, and business services domains, whereas models in organization-owned repositories are mostly associated with the traditional software, financial services, and sales domains (see Fig. 12).

Table 2 Top 10 Organization-type and user-type repositories (based on model count) with the number of BPMN models they host
Fig. 11: Comparison between tools utilized for BPMN model development and type of repositories (Organization vs. User) hosting these models


4RQ2 – What is the Extent of Cloning in BPMN Models in Open Source?

A characterization of BPMN models in open source necessitates an understanding of their uniqueness. While the preceding section explored the variety of models, this section focuses on quantifying the diversity of clones within open source models.

In this section, we explore the cloning among BPMN models and fragments of the models in open source. We characterize model clones across the following dimensions: First, we compare the cloning of models within and across repositories. Second, we classify the model clones according to their application domains. Third, we investigate the ownership of the model clones and their evolution over the years. Finally, we analyze cloning among model fragments. We present our approach in Section 4.1, followed by findings in Section 4.2.

Fig. 12: Comparison between application domains of BPMN models and type of repositories (Organization vs. User) hosting these models

4.1Approach

Identifying model clones.

Clones within process model repositories may arise as a result of copy-pasting or from the existence of multiple variants of a process, e.g., multiple claim handling processes in an insurance company, represented as separate models (La Rosa et al.2015). Such practice is known as clone-and-own, where new variants are created by copying and adapting existing ones (Rubin et al.2013). Clones can fall into the following two categories based on the classification schemes in literature (Alalfi et al.2012; Störrle2015; La Rosa et al.2015; Babur et al.2019; Saeedi Nikoo et al.2022):

  • Models that remain structurally unchanged. While the clones belonging to this category are structurally unchanged, they could have changes to formatting, layout, internal identifiers, and cosmetic changes in labels (lower-/uppercase, snake-/camel case, and other trivial changes) without changes in letters or their order. This type of clone is also known as an exact clone and falls under the category of Type-I (La Rosa et al.2015) or Type-A clones (Babur et al.2019; Saeedi Nikoo et al.2022) in the literature. In this paper, we use Type-A or exact clones to refer to this category. Note that general-purpose duplicate file finder tools cannot recognize all Type-A clones in models. An example of this scenario is minor changes in models, like one model having additional white spaces, changes in the order of elements, or formatting differences.

  • Syntactically similar models whose fragments have undergone structural, type, label, or attribute changes. This category of clones includes models with changes, additions, or removals of names, types, or attributes of one or more elements. In the literature, this type of clone is known as: (1) approximate or near-miss clones; (2) Type-II and Type-III clones; or (3) Type-B and Type-C clones (La Rosa et al.2015; Babur et al.2019; Saeedi Nikoo et al.2022). In this paper, we use Type B & C clones to refer to this category. Note that Type-A clones are a subset of Type B & C clones (La Rosa et al.2015).

In this study, we focus on syntactic similarity among models. Therefore, models without syntactic similarity while having behavioral and semantic similarities (referred to as Type-IV or Type-D clones) (Dijkman et al.2011; Heinze et al.2021) are out of our scope.
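To illustrate why general-purpose duplicate file finders miss some Type-A clones (first category above), a structure-based fingerprint can normalize away formatting, element order, internal identifiers, and cosmetic label differences before hashing. This is a simplified sketch under those assumptions, not the detection approach used in the study:

```python
import hashlib
import xml.etree.ElementTree as ET

def canonical_fingerprint(bpmn_xml: str) -> str:
    """Hash a model on structure only, so byte-level noise (whitespace,
    attribute order, layout sections) does not hide Type-A clones.

    Simplified sketch: keeps element tags and normalized labels, drops
    internal ids and diagram-layout elements, and sorts children.
    """
    root = ET.fromstring(bpmn_xml)

    def summarize(elem):
        tag = elem.tag.split("}")[-1]          # strip XML namespaces
        if tag.startswith("BPMNDiagram"):      # ignore layout information
            return None
        # Normalize cosmetic label differences (case, snake case)
        label = (elem.get("name") or "").strip().lower().replace("_", " ")
        children = sorted(filter(None, (summarize(c) for c in elem)))
        return (tag, label, tuple(children))

    return hashlib.sha256(repr(summarize(root)).encode()).hexdigest()
```

Two models differing only in whitespace, element order, internal ids, or label casing map to the same fingerprint; a plain byte-wise duplicate finder would treat them as distinct files.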

Two state-of-the-art tools available in open source for clone detection of process models are Apromore (Dumas et al.2013; La Rosa et al.2015) and SAMOS (Babur2019; Babur et al.2022; Saeedi Nikoo et al.2022). However, Apromore does not readily support BPMN files (La Rosa et al.2015). Therefore, we use SAMOS with its extension for BPMN models (Saeedi Nikoo et al.2022).

To identify Type-A and Type B & C clones, we use the classification scheme from the SAMOS tool (Saeedi Nikoo et al.2022; Babur et al.2019). According to the tool-related documents, when the tool outputs a distance of zero between a pair of models, they are considered Type-A clones (Saeedi Nikoo et al.2022; Babur et al.2019). If the distance is less than or equal to 0.30, they are considered Type B & C clones (Saeedi Nikoo et al.2022; Babur et al.2019).
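The thresholding scheme can be expressed directly. The function below is a minimal sketch of that mapping; the 0.30 threshold is the one from the SAMOS documents cited above:

```python
def classify_clone_pair(distance: float, type_bc_threshold: float = 0.30) -> str:
    """Map a SAMOS model-pair distance to a clone category.

    Distance 0 means structurally identical (Type-A); anything up to the
    threshold counts as a near-miss Type B & C clone. Type-A pairs are,
    by construction, also Type B & C clones.
    """
    if distance == 0:
        return "Type-A"
    if distance <= type_bc_threshold:
        return "Type B & C"
    return "not a clone"
```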

SAMOS can handle up to a few thousand models at a time (Babur et al.2019) while we have tens of thousands of models. Therefore, to use SAMOS in our context, we partition the collection of all models into sets such that SAMOS can efficiently handle one set at a time. Our partitioning takes a two-phased approach:

Phase 1: Dataset slicing. In this phase, we slice the collection of all non-trivial models (25,866 models) into clusters, where each cluster consists of at most a few thousand models. For this partitioning, we use the models’ activity element labels, since activity labels are one of the most important sources for process model similarity measurement (Schoknecht et al.2017).

We choose Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (Tan et al.2016) based on its demonstrated effectiveness in the context of BPMN models (Babur et al.2019; Saeedi Nikoo et al.2022). DBSCAN requires a distance measure and two parameters: the neighborhood radius and the minimum cluster size (Tan et al.2016). There are various distance measurement techniques in the literature, including Cosine, Manhattan, and Bray-Curtis (Deza et al.2009). Bray-Curtis is a good fit in our case, as it generates a normalized distance (Ricotta and Podani2017), which allows us to specify distance thresholds between models. This distance measure has previously been shown to be effective in similar clone detection settings (Babur et al.2019; Saeedi Nikoo et al.2022).

The neighborhood radius is defined as follows. Given a set of objects O, the neighborhood of an object \(o\, \in \, O\) is defined as the set of models \(N_o = \{o_i \in O \mid d(o, o_i)\le \epsilon \}\), where \(d(o, o_i)\) is a distance measure between o and \(o_i\) and \(\epsilon \) is the neighborhood radius. We fixed the minimum cluster size at 2 to retrieve clusters containing at least two models. We chose the distance threshold (i.e., the neighborhood radius \(\epsilon \)) based on the largest cluster size produced for a given threshold, testing distances in increments of 0.1. For distances above 0.5, the largest cluster size (>8k) exceeded the capacity of the clone detection tool. Accordingly, we set the distance threshold to 0.5.
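The Bray-Curtis distance and the neighborhood definition above can be sketched in a few lines. This is an illustrative implementation operating on term-count vectors, not the clustering code used in the study:

```python
def bray_curtis(u, v):
    """Normalized Bray-Curtis dissimilarity between two non-negative
    term-count vectors: 0 for identical vectors, 1 for disjoint ones."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0

def neighborhood(o, others, eps, d=bray_curtis):
    """N_o: all objects within distance eps of o (the DBSCAN neighborhood)."""
    return [x for x in others if d(o, x) <= eps]
```

Because Bray-Curtis is bounded in [0, 1], a fixed threshold such as 0.5 has the same meaning regardless of the vector magnitudes, which is what makes it convenient for specifying distance thresholds between models.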

Now, the DBSCAN clustering is executed in the following steps.

  • The process models are transformed into a text document in which each line corresponds to the activity label list of a single model.

  • The text document is preprocessed using natural language processing (NLP) techniques (Gupta and Lehal2009) to eliminate attributes or information that are redundant or irrelevant (Uysal and Gunal2014). Our preprocessing of activity labels includes expanding contractions, converting all letters to lowercase, eliminating punctuation marks, stripping digits and alphanumeric terms (Hernández et al.2022), filtering out stop-words, and applying stemming.

  • Models are then vectorized based on the term-frequency/inverse-document-frequency (TF-IDF) measure (Bafna et al.2016) to create a numeric vector for each model (Eq. 1). The vector for each model is formed as follows. For each word w and document d, we calculate tf(w,d), the number of appearances of w in d divided by the total number of words in d, and idf(w), the logarithm of the total number of documents divided by the number of documents that contain w. The entries of the vector for each model are then calculated as:

    $$\begin{aligned} tfidf(w,d)=tf(w,d) * idf(w) \end{aligned}$$
    (1)
  • DBSCAN clustering is performed using the vectorized models from the previous step.
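The preprocessing and vectorization steps above can be sketched end-to-end for the TF-IDF part (Eq. 1). This is a minimal illustration: the stop-word list is a tiny hypothetical subset, and the suffix stripping is a crude stand-in for a real stemmer:

```python
import math
import re
import string

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for"}  # illustrative subset

def preprocess(label_line: str) -> list:
    """Lowercase, strip punctuation/digits/alphanumerics, drop stop-words,
    and apply naive suffix stripping (stand-in for a real stemmer)."""
    text = label_line.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t.isalpha() and t not in STOP_WORDS]
    return [re.sub(r"(ing|ed|s)$", "", t) if len(t) > 4 else t for t in tokens]

def tfidf_vectors(documents):
    """One document per model: its concatenated activity labels (Eq. 1)."""
    tokenized = [preprocess(d) for d in documents]
    vocab = sorted({w for doc in tokenized for w in doc})
    n_docs = len(tokenized)
    df = {w: sum(w in doc for doc in tokenized) for w in vocab}  # document frequency
    vectors = []
    for doc in tokenized:
        total = len(doc) or 1
        vectors.append([
            (doc.count(w) / total) * math.log(n_docs / df[w]) for w in vocab
        ])
    return vocab, vectors
```

The resulting vectors are what the Bray-Curtis distance and DBSCAN operate on in the slicing phase.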

As shown in Fig. 1, the clustering step resulted in 1,838 clusters and an outlier set of 2,483 models. To make sure we do not miss possible clones within the outlier set, we also include this set in the clone detection phase. Note that the clone detection tool covers other aspects such as structural similarity and semantic similarity of labels; thus, it is possible that there are clones within the outlier class not captured in the slicing phase. To ensure there are no inter-cluster clones, we randomly selected 68 sample model pairs, based on a confidence level of 0.90 and a margin of error of 0.10 (Taherdoost2017), across different clusters and ran clone detection on them. The clone detection tool detected no clone pair among any of the sample pairs.

Phase 2: Identification of model clones using SAMOS. SAMOS considers the underlying graph structure in addition to the element labels considered in the dataset slicing phase. SAMOS treats models in a manner similar to how documents are treated in the information retrieval field, and applies document clustering to models.

The workflow begins with feature extraction based on the language metamodel, identifying features such as labels of model elements or larger fragments like n-grams or subtrees. For example, a bigram might be \(Emit\;invoice \rightarrow Receive\;payment\) from the model in Fig. 2. Next, a term-frequency-based Vector Space Model is computed, followed by calculating a distance matrix. In the final phase, clustering techniques are applied to identify clones. More details on different applications of SAMOS can be found in previous studies (Babur et al.2018,2019,2020,2022; Saeedi Nikoo et al.2022). We run SAMOS in the bigram setting, which is the best-performing option available for the clone detection of BPMN models (Saeedi Nikoo et al.2022).
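The bigram features over a model's activity sequence can be sketched as follows. This is a simplified view: SAMOS also uses metamodel-based features such as subtrees, which are not shown here:

```python
def bigrams(activities):
    """Consecutive activity pairs used as structural features,
    e.g. 'Emit invoice' -> 'Receive payment'."""
    return [(a, b) for a, b in zip(activities, activities[1:])]
```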

Note that, for 97 clusters from the clustering phase (Phase 1), the clone detection yielded no clones for the given (30%) threshold. Also, 12 more clusters (totaling 129 models) produced errors in the feature-extraction step of the clone detection tool, mainly because the tool could not parse the models. These were discarded.

In the rest of this paper, the sets of models resulting from the dataset slicing step (Phase 1) are referred to as clusters, and the ones from the clone detection step (Phase 2) are denoted as clone classes. A clone class is a maximal set of model fragments in which a clone relation holds between any pair of model fragments (Kamiya et al.2002; Deissenboeck et al.2008).

Validation: To estimate precision and recall, we performed stratified sampling and randomly selected 69 pairs with a confidence level of 0.90 and a margin of error of 0.10 (Taherdoost2017). Precision was calculated separately for Type-A clones and for Type B & C clones excluding Type-A clones. The exclusion of Type-A clones from the Type B & C clones ensures that we have a unique set of sample pairs in both sets. The first and second authors labeled the sampled pairs (Yes: actual clone, No: not a clone). We calculate the precision and recall using the following formulas:

$$\textit{precision} = \frac{\textit{true positives}}{\textit{true positives} + \textit{false positives}}, \qquad \textit{recall} = \frac{\textit{true positives}}{\textit{true positives} + \textit{false negatives}}$$

For the Type-A samples, 63 pairs were identified as actual clones by both raters (true positives), while 6 pairs were labeled as non-clones by both raters (false positives). The raters found no Type-A clone that the tool had failed to report (false negatives). Given these values, the precision is \(\approx \)91%, and the recall is 100%. For the Type B & C samples, 65 pairs were identified as actual clones by both raters (true positives), while 4 pairs were labeled as non-clones by both raters (false positives). Again, there were no false negatives. Given these values, the precision is \(\approx \)94%, and the recall is 100%. For Type-A clones, the inter-rater agreement was 0.90 as calculated using Cohen’s Kappa (Kvålseth1989), and for Type B & C excluding Type-A clones, the inter-rater agreement was \(\approx \)0.85, both signifying a very good agreement. The two raters discussed and resolved their discrepancies (one pair in each clone category).

Characterizing model clones

We characterize the cloning among BPMN models using the information gathered to answer RQ1 (see Section 3.1). The data presented in Section 3 is combined with cloning-related information to characterize the cloning of models across multiple dimensions, including application domains, ownership, and evolution.

Identifying model fragment clones

Clones may occur at the entire model level, but may also occur in finer-grained fragments of process models (La Rosa et al.2015). Thus, examining cloning at both the model and fragment levels provides a holistic view of the cloning landscape. Understanding fragment-level clones may also help in recognizing opportunities for reuse and modular design, while reducing redundancy and enhancing the overall maintainability of BPMN models. To gain further insights into the finer-grained cloning patterns within the process models, we analyze cloning in specific kinds of process fragments called subprocesses. The subprocess is the only self-contained BPMN element with functionality similar to that of functions in programming languages, as subprocesses are invoked according to a call-and-return semantics (Uba et al.2011). A subprocess is a composite activity that can be broken down into smaller units of work. Figure 2 shows an example usage (Ship and invoice) of the subprocess element. Subprocesses can also be good reuse candidates in recommendation systems, similar to functions in programming languages (Hammad et al.2021), and are important in the refactoring of process models (La Rosa et al.2015).

Similar to the model-level clone analysis, we investigate Type-A and Type B & C clones in subprocesses. From the models left after the filtering steps in Section 2.2, we collect models containing one or more subprocesses. Models with subprocess(es) are identified using a heuristic, where we look for models containing tags with the subprocess keyword in them. This method is based on the fact that the subprocess tag name is used in the XML schema for the subprocess element as defined in the BPMN 2.0 specification by the Object Management Group (OMG2013).
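The tag heuristic can be sketched as a single regular expression over the raw XML. This is an approximation of the described method, matching opening tags whose element name is subProcess (with an optional namespace prefix):

```python
import re

def has_subprocess(bpmn_xml: str) -> bool:
    """Heuristic sketch: a model contains a subprocess if an XML tag named
    after the BPMN 2.0 `subProcess` schema element appears,
    e.g. <bpmn:subProcess ...> or <subProcess>."""
    return re.search(r"<\s*(?:\w+:)?subprocess\b", bpmn_xml, re.IGNORECASE) is not None
```

Anchoring the match to a tag opening (`<`, optional prefix) avoids false hits on the word "subprocess" appearing inside attribute values such as activity labels.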

After collecting these models, we performed clone detection on them using SAMOS in subprocess decomposition setting. For further details on decomposition styles for clone detection of BPMN models in SAMOS, please refer to study in Saeedi Nikoo et al. (2022).

Our search for BPMN models with subprocess elements resulted in 7,331 models. We followed similar inclusion/exclusion steps as we did for the models in Section 2. First, we excluded subprocesses with no activity element; in this step, 608 subprocesses were removed. Second, we excluded non-English subprocesses, which led to the removal of 2,244 subprocesses. Third, we excluded subprocesses where all labels have fewer than three letters, which led to the removal of 869 subprocesses. In total, 4,667 subprocess fragments were left after the inclusion/exclusion steps. We ran SAMOS on these models with the subprocess setting.

Note that, for a more accurate result, clones with a containment relation (i.e., considering two graphs as clones only because one contains the other) should be avoided in clone detection analysis (Pham et al.2009). To avoid this issue in our analysis, we only include outer subprocesses (i.e., we exclude nested subprocesses, if there are any). This approach has shortcomings, though, as there might be clones among the nested subprocesses as well. We leave better handling of clones with containment relations as future work.

Fig. 13: (a) A bubble chart showing the subset relation among Type-A and Type B & C clones. The larger bubbles represent a higher number of clone classes with the same count of Type-A and Type B & C clones. (b) Boxplots showing to how many clone classes repositories contribute Type B & C and Type-A clones

4.2Findings

Our clone detection resulted in 2,109 clone classes containing 23,190 Type B & C clones (i.e., the union of all cloned models). Out of these, 18,482 models are detected as Type-A clones. This shows that about 80% of the clones are exact clones (i.e., Type-A clones).

Our results show that about 90% of the BPMN models in open source are formed by clone-and-own practices (i.e., Type B & C clones), and that 71% of all models are copied without any further editing (i.e., Type-A clones). This finding aligns with prior research in software development (Lopes et al.2017; Allamanis2019; Spinellis et al.2020; Gharehyazie et al.2019; Babur et al.2019). Plotting Type-A clones against Type B & C clones (Fig. 13) shows that clone classes with few clones are much more frequent than clone classes with many clones. Most (\(\approx \)89%) of the clone classes (1,877 out of 2,109) contain fewer than 10 model clones. About 11% of clone classes contain ten or more (Type B & C) model clones (232 out of 2,109), while only about 1.5% of them contain more than a hundred (Type B & C) clones (31 out of 2,109).
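The reported shares follow directly from the clone-class counts stated above; a quick check using only those numbers:

```python
total_classes = 2109
small = 1877   # clone classes with fewer than 10 model clones
large = 232    # clone classes with ten or more model clones
huge = 31      # clone classes with more than a hundred model clones

# The two size bands partition the clone classes
assert small + large == total_classes

print(f"< 10 clones: {small / total_classes:.1%}")   # 89.0%
print(f">= 10 clones: {large / total_classes:.1%}")  # 11.0%
print(f"> 100 clones: {huge / total_classes:.1%}")   # 1.5%
```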


Cloning among models within and across repositories

Clone classes consist of model clones originating from a varying set of repositories, as shown in Fig. 14(a). Our results show that 913 (43%) of the total clone classes are composed of models from the same repository, while the majority (57%) consist of model clones from multiple repositories. This shows a different cloning behaviour compared with previous findings for code clones (Gharehyazie et al.2019), which indicate that most clones come from the same repository. The maximum number of repositories contributing to a single clone class is 525, although only a few clone classes (12 out of 2,109) are composed of models from over 100 repositories.

Fig. 14

(a) Number of clone classes vs. number of distinct repositories to which their models belong. For a given number of clone classes (y-axis), it shows the number of repositories that contribute their model(s) to them. (b) Distribution of the number of BPMN models in repositories that have no model clones. For a given number of repositories (y-axis), it shows the number of models they each contain (x-axis)

We are also interested in identifying super-sources (Gharehyazie et al.2019) of clones. These are repositories that contain a significant number of distinct model clones, where each clone belongs to a separate clone class. The median number of clone classes that models in a repository belong to is 1 (Fig. 13(b)). Super-sources are identifiable as outliers at the top of the boxplot in Fig. 13(b). For instance, 12 super-source repositories contribute Type B & C clones to more than 50 clone classes, cumulatively contributing about 11% of all Type B & C model clones (2,573 out of 23,190). Also, a significant portion of repositories (2,885) have only one Type B & C clone, and 2,198 of them have only one Type-A clone.

Table 3 (left) shows the top 5 repositories selected based on the number of clone classes they contribute to. For instance, the top repository (SpikeLavender/SO) contains model clones found in 85 clone classes. Although the third repository (jeremiahlumontod/activiti-myeis) contains many more model clones (768), it contributes clones to fewer (71) clone classes. Clones within the repositories with the highest number of clones (Table 3 (right)) are not necessarily distinct, unlike those found in clone super-sources. The repository “jeremiahlumontod/activiti-myeis” exemplifies a clone super-source with one of the highest counts of model clones.

Table 3 Left table shows top 5 repositories with highest number of distinct model clones (i.e., clone super-sources)

The majority (\(\approx \)84%, 901 out of 1,073) of repositories that contain no model clones host just a single BPMN model, as shown in Fig. 14(b). In total, 1,073 repositories contain no model clones.

Table 4 Domain categories and associated number of clone classes with number of model clones covered by those clone classes
Fig. 15

Distribution of models and model clones per domain

Fig. 16

(a) Repositories vs. the number of model clones coming from them. For a given number of repositories (y-axis), it shows the number of model clones that are contained in each of these repositories (x-axis). (b) Comparison between the number of models (x-axis) and model clones (y-axis) in the repositories

Application domains and model clones

Our results show that cloning occurs across all the identified application domains (see Section 3.2 for the list of application domains). Table 4 summarizes the domain categories identified in model clones. The domain categories for a clone class combine the domains associated with the models inside it. For instance, if some models in a clone class are associated with sales and others with logistics, the clone class counts as part of both domain categories. A domain appearing in a greater number of clone classes indicates a higher diversity of model clones associated with that domain.
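The aggregation described above, where a clone class inherits the union of its models' domains, can be sketched as follows. The data-structure shapes here are our assumption for illustration, not the study's actual code:

```python
from collections import defaultdict

def domains_per_clone_class(clone_classes, model_domains):
    """clone_classes: {class_id: [model_id, ...]}
    model_domains: {model_id: set of domain names}
    A clone class inherits the union of its models' domains."""
    return {
        cid: set().union(*(model_domains.get(m, set()) for m in models))
        for cid, models in clone_classes.items()
    }

def classes_per_domain(class_domains):
    """Count, for each domain, the number of clone classes it appears in."""
    counts = defaultdict(int)
    for doms in class_domains.values():
        for d in doms:
            counts[d] += 1
    return dict(counts)

# A clone class with a "sales" model and a "logistics" model
# counts toward both domain categories:
cd = domains_per_clone_class(
    {"c1": ["m1", "m2"], "c2": ["m3"]},
    {"m1": {"sales"}, "m2": {"logistics"}, "m3": {"sales"}},
)
```

With this toy input, `classes_per_domain(cd)` counts "sales" in two clone classes and "logistics" in one, mirroring the per-domain counts reported in Table 4.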

Our results show that for most (13) of the domains, more than 80% of the models associated with those domains are clones (see Fig. 15). About 25% of the model clones are associated with more than one domain, which may indicate the overlapping nature of these domains. The top 5 multi-domains (i.e., multiple domains assigned to a single model) ranked by the number of appearances are, in order: “traditional software, business services”, “sales, traditional software”, “sales, logistics”, “sales, leisure & recreation”, and “business services, financial services”. For the full list of overlapping domains, please refer to the replication package (Saeedi Nikoo et al.2023).


Ownership and evolution of model clones

We obtained 3,812 distinct repositories from the clone classes, demonstrating the wide distribution of BPMN model clones on GitHub (see Fig. 16(a)). In \(\approx \)53% of the repositories, only a single model clone exists in the model set, contrasting with the remaining 47%, where multiple model clones are present. The results also show that model clone counts are mostly close to the total model counts for the repositories. Figure 16(b) shows the close relationship between model and model clone counts. For 3,391 repositories (\(\approx \)87% of the total), the model count and model clone count are equal, implying that all the models in them are cloned.

We retrieved repository type information for the repositories with clones, resulting in 3,213 repositories of user type, and 599 repositories of organization type. We further classified the organization repositories, manually assigning academia or industry labels. Out of these, 44 (\(\approx \)7%) repositories were identified as academia (covering 2,422 model clones), and 335 (\(\approx \)56%) repositories as industry (covering 1,973 model clones), highlighting a more widespread industry-related BPMN model cloning on GitHub. A total of 45 organization-owned repositories (222 model clones) were classified as other (e.g., government organizations or non-profit initiatives), and 175 other repositories (929 model clones) were labeled as unclear due to lack of clear information.

The results show that about 56% of the organization-owned repositories belong to industry, while only about 7% belong to academia. Regarding the model clone population within the organization-owned repositories, about 44% of the clones belong to academic organizations, while about 36% belong to industrial organizations. This suggests that while industrial organizations own more repositories with clones, academic organizations play a significant role in the creation and distribution of model clones.

Fig. 17

Number of models created and cloned (Type-A and Type B & C) over the years. The blue plot shows the number of models created in a specific year, while the orange and green plots show the number of Type B & C and Type-A clones, respectively, for that year

We observed a few cases where a repository owner has two repositories containing a set of models as exact clones, or versions of the same repository hosted by different owners. These occurrences may be more common than initially anticipated. It would be valuable to systematically identify such repositories and report the cloning rate within them, in addition to the overall cloning rates.

Our research reveals that the frequency of cloning has been proportional to the number of models created (see Fig. 17), indicating a steady inclination to reuse existing models in a clone-and-own manner, which also represents a prominent challenge within open source software ecosystems (Lapeña et al.2016; Dubinsky et al.2013). For instance, 3,673 models were created in 2018, of which 3,289 (\(\approx \)90%) are detected as Type B & C clones, and 2,856 (\(\approx \)87%) are detected as Type-A clones. We exclude the results from 2021 onward, as the dataset covers only part of 2021. The number of models created and the number of model clones remain closely proportional over the years. In 2019, about 59% of the process models are related to the field of machine learning and come from a single repository, mostly as clones (see the classification results in Table 4), resulting in a peak in model creation and cloning rate in that year.

Fig. 18

Detected subprocess clone classes. X-axis shows the unique clone class sizes (ascending order) with two bars for each size: The blue one shows the count of clone classes with that size, and the orange one shows the total Type B & C clones contained in those clone classes

Cloning among model fragments

The clone detection process yielded 4,284 Type B & C clones (i.e., \(\approx \)92% of the subprocesses after filtering), indicating a high rate of cloning within subprocesses. This process resulted in 365 clone classes. Figure 18 provides an overview of these results. Each bar group represents a unique clone class size. For instance, there are 134 clone classes of size 2, containing 268 subprocess clones in total.
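The per-size summary behind such a bar chart can be computed directly from the clone-class sizes. A minimal sketch; the sizes below are illustrative, not the study's data:

```python
from collections import Counter

def clone_class_histogram(class_sizes):
    """Map each unique clone-class size to
    (number of classes of that size, total clones they contain)."""
    counts = Counter(class_sizes)
    return {size: (n, size * n) for size, n in sorted(counts.items())}

# Three classes of size 2 contribute 2 * 3 = 6 clones, and so on;
# in the study, the 134 size-2 classes contain 268 subprocess clones.
hist = clone_class_histogram([2, 2, 3, 5, 2])
# hist == {2: (3, 6), 3: (1, 3), 5: (1, 5)}
```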

All subprocess clones from Type B & C clone classes are also among Type-A clones detected by the tool, meaning that for any Type B & C clone detected, there is at least one Type-A clone among the subprocesses. We identified 950 repositories as the sources of the models containing subprocess clones. The results show that the majority (63%) of clone classes comprise more than two subprocess clones, indicating a high reuse rate in subprocesses. Additionally, about 92% of subprocess clone pairs from all clone classes are part of models that are also clones. While most (92%) of the subprocess duplication is the result of duplication at the entire model level, for the remaining subprocess clones (\(\approx \)8%), reusability is independent of the model’s similarity. This highlights the need for fine-grained clone detection techniques (Saeedi Nikoo et al.2022; La Rosa et al.2015).

Validation: To estimate precision and recall, we performed stratified sampling over the clone classes and randomly selected 69 pairs with a confidence level of 0.90 and a margin of error of 0.10. Precision and recall were calculated only for Type B & C clones, as all Type B & C subprocess clones were among Type-A clones. The first and second authors labeled the sampled pairs (Yes: actual clone, No: not clone), and \(\approx \)74% inter-rater agreement was observed for the sample clones. The two raters discussed to resolve the cases in which they had discrepancies. We calculate precision and recall similarly to the model-level clones.

Based on the results, 64 pairs were identified as actual clones by both raters (true positives), while 5 pairs were labeled as non-clone by both raters (false positives). The tool did not fail to predict any Type B & C clone pair when it was actually a Type B & C clone (false negatives). Given these values, the precision is\(\approx \)93%, and the recall is 100%.
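For reference, the precision and recall values follow directly from the reported confusion counts (TP = 64, FP = 5, FN = 0):

```python
def precision_recall(tp, fp, fn):
    """Standard precision/recall from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Counts from the validation sample of 69 subprocess clone pairs
p, r = precision_recall(tp=64, fp=5, fn=0)
print(f"precision = {p:.0%}, recall = {r:.0%}")  # precision = 93%, recall = 100%
```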


5 Implications

Understanding the BPMN landscape in open source reveals multiple facets of potential implications. These include enabling users to select the right model for a specific domain, creating compliance requirements targeted at specific industries, enabling informed choices about tools that fit an organization’s technical ecosystem and expertise, enabling tool vendors to identify future directions for tool features, enhancing the reusability of models, and shedding light on the evolution of business process management practices. We elaborate on some of the important implications below.

Temporal trends: The consistent increase in model creation and updates over recent years (Fig. 6), accompanied by a proportional relationship between model creation and clone instances, suggests an inclination to reuse existing models. This could be due to the reuse of certain process patterns or templates. A future research direction along these lines is understanding how cloning practices evolve.

Ownership: Despite the majority of the repositories being user-owned, it is possible that some of these repositories began as personal projects due to the absence of company policies on open-sourcing projects (Kochanthara et al.2022). Given that organizational participation in open-source projects within the BPMN domain is still emerging, some projects that currently are user owned may not yet have been transferred to organizational ownership. This trend and its potential impact on our results present an interesting area for future research.

An investigation of the motivations behind individuals and organizations opting to open-source their models, as well as knowing more about the users of these open source models, can help organizations strategically leverage external knowledge, foster innovation, build communities, and gain a competitive edge in the market.

BPMN modeling tools: Current tools (Drools2023; camunda2023; Activiti2023; bpmn.io2023; Signavio2023) are generic and do not cater to domain-specific needs. Our findings may also aid in the development of design or analysis tools that serve the specific needs of open source BPM projects. For instance, we find that a significant fraction (4,348 out of 25,866) of BPMN models in open source are machine-learning-related models. Therefore, developing BPMN tools catering to the specific needs of ML-related models is a potential next step. In addition, understanding the landscape of BPMN models can inform the development of more sophisticated tools for model analysis and management. For instance, the consistent growth in BPMN model creation and updates suggests the need for tools that can track and analyze trends over time. These tools could help organizations identify emerging practices and technologies within their domains. In addition, tools that facilitate collaboration and repository management, including features for version control, model sharing, and collaborative editing, could make it easier for individuals and organizations to work together.

Our findings suggest that certain tools have higher usage within specific domains. Exploring these domain-specific functionalities could be valuable for practitioners seeking optimal tools within their respective domains, especially those unfamiliar with the available tooling options.

Understanding application domains: The identified domains can help researchers and practitioners better understand the usage and application of BPMN models in various industries and contexts. Our results point to the domains traditional software, machine learning, sales, business services, and financial services as the most promising areas for future research and development in BPMN models, given that they contribute \(\approx \)62% of the models in open source.

While BPMN is a general-purpose business process modeling notation, it provides extension mechanisms to capture domain-specific concepts, enabling the reuse of the modeling language in various contexts (Zarour et al.2020). Moreover, process modeling may differ across domains (Pinggera et al.2015). Therefore, any meaningful usage of business process models targeted at specific domains or industries (e.g., recommender systems targeted at specific domains or industries) requires identification of existing models specific to that domain/industry. Another direction that a domain classification offers is a comparison between models from different domains.

Our study serves as a precursor for future research focused on specific domains by first providing an overview of BPMN models in open-source environments. This broader understanding helps contextualize the significance, size, and availability of models in various domains before delving into targeted analyses.

In our domain classification process, we solely utilized the activity labels of the models. A potential future research direction is leveraging the richer context provided by additional information such as repository descriptions, readme files, and contributor profiles. Incorporating such details can not only further validate our classification but also identify domains for those models that are currently classified as unknown, potentially adding further domains.

Further work can also focus on improving the accuracy of the domain classification for business process models. This could involve multi-shot prompting, fine-tuning, or training LLMs specially designed for business process models.

The large number of clones of BPMN models in the machine learning, traditional software, business services, sales, and financial services domains reflects the high number of models in these domains. This might stem from domain-specific patterns used in these models (Koschmider and Reijers2015) or potentially be influenced by variations specific to process models (Rosa et al.2017) in these domains that show up as multiple model variants. It may be that a domain involves more standardized processes that need to be followed, or that it includes repetitive tasks or workflows that are performed frequently. For example, data preprocessing, model training, and evaluation are often reused across different machine learning projects (Xin et al.2021). Also, some domains are more likely to be regulated, which can lead to more cloning. For example, the finance industry is regulated by multiple agencies, which can lead to clones being created as different companies comply with the same regulations. Standardization bodies could create reference BPMN models that comply with the regulations specific to a domain and can act as a potential starting point for customization by practitioners for individual organizations (Frank2007; Li et al.2008). While this study is scoped at characterizing clones, a potential future direction for researchers is studying the factors that led to these characteristics and validating the above-mentioned hypotheses.

Another aspect for future research in this context is investigating recurring process structures or designs within these domains, which may lead to the development of domain-specific process templates. Studies are also required to assess the quality and effectiveness of cloned BPMN models within these domains, e.g., whether there are issues related to correctness, efficiency, or compliance as a result of cloning, and how these issues vary across domains. Another dimension is analyzing possible BPMN extensions (Zarour et al.2020) used in the cloned models and their impact on cloning. Our study can act as a starting point in all the above directions.

To the best of our knowledge, this is the first attempt at classifying BPMN models across application domains. Our dataset with model classifications can be used in future research to train classifiers to classify BPMN models into multiple domains.

Clone analysis: In the context of BPMN process models, our research reveals that cross-repository cloning is more common than intra-repository cloning. This may be due to the reuse of common business workflows across projects. It would be interesting to further investigate this phenomenon.

Our results show that approximately 8% of the subprocess clone pairs are part of models that are not considered clones at the model level. This implies that even though the entire models may not be clones, there are subprocess clones across these models. An investigation of such non-clone models sharing cloned subprocesses would provide information on why developers reuse specific subprocesses across seemingly unrelated (i.e., non-clone) models.

Managing clones in BPMN models requires a multifaceted approach. Employing specialized clone management tools is essential for accurate clone detection, comprehensive analysis, detection of inconsistencies among clones and their changes, and correction of those inconsistencies (Nguyen et al.2011; Duala-Ekoko and Robillard2008). Our recommendation for organizations is to utilize automation for regular clone detection to identify duplicates within their model repositories. Adopting modular modeling techniques not only encourages reuse but also facilitates the consolidation of clones within BPMN models, thereby minimizing redundancy, enhancing maintenance, and avoiding bug propagation (Van Der Aalst and Van Hee2004). Maintaining metadata about clones would be useful for tracking model changes and efficiently managing them (Volanschi2018).

Our recommendation for BPMN modeling environments is to incorporate features that enable clone management directly within their environments. The BPMN environments can use metadata about clones for tracking model changes and efficiently managing clones (Volanschi2018).

While several studies have suggested approaches and solutions for clone management in software systems, we are unaware of successful implementations of such capabilities in industrial environments or widely-used modeling platforms. We aim to investigate this area in future work.

Note that identifying best practices and validating their efficacy is beyond the scope of our study. Such a study would require aggregating best practices for managing clones in BPMN models and validating these practices with historical data and/or a study involving developers/experts.

Ownership of cloned models: The large percentage of repositories (\(\approx \)77%) containing model clones points to potentially increased maintenance costs and lower quality and reliability of software systems (Koschke2007). This impacts software development methodologies and emphasizes the necessity for greater caution when reusing model artifacts in industrial settings. To mitigate the expenses and resources needed for managing software projects, we suggest prioritizing modular and reusable modeling practices.

According to the results, \(\approx \)22% of all repositories containing BPMN models do not contain any clones. Despite the fact that they host a minimal number of models (\(\approx \)84% of them have only a single model), an investigation of these repositories and their models could offer valuable insights into why clones are absent in them. It may be that these repositories host models associated with less common or specialized processes. Future research is needed to confirm this hypothesis.

Reuse opportunities: The frequency of cloned models closely matches the frequency of newly created models. This trend shows the need to develop more effective methodologies for managed reuse. Our findings motivate the creation of specialized tools (similar to the one proposed by Gharehyazie et al. (2019) for code clones) that can be used to streamline finding and tracking process model clones across open source projects, enforcing reuse and consistency. It may also motivate the creation of centralized repositories and search engines specialized for BPMN models based on their functionality, complexity, and other attributes (La Rosa et al.2011; PROS-Lab2019). Such platforms can facilitate knowledge sharing, collaboration, and reuse of process models and their fragments across projects and organizations.

Another direction of the reuse aspect of clones is in the context of recommender systems for process models (Almonte et al.2021). Through the identification and storage of recurring model fragments in repositories, process model recommender systems can utilize them to provide modelers with recommendations similar to model fragments under development (La Rosa et al.2015; Stephan2019). Furthermore, the unique dataset presented in this study can be utilized both as a training set, and for further pattern analysis within clones, which can be useful in development of recommender systems (Li et al.2013).

Generalizability: This study focused on BPMN models on GitHub. There are other BPMN models available in open source (PROS-Lab2019; Sola et al.2022), albeit less diverse than GitHub. GitHub hosts a more diverse set of projects and BPMN models within them, with different modeling purposes. For example, academic models often prioritize technical precision and correctness, while industry models usually target specific business goals like facilitating stakeholder alignment (Sola et al.2022). Additionally, there might be differences in modeling practices, objectives, and the characteristics of BPMN models between open-source and proprietary environments. Closed-source environments, such as those used by large enterprises, might exhibit a more domain-specific and standardized use of BPMN, with stricter version control, as opposed to the diverse and sometimes inconsistent modeling practices observed in open-source projects.

The methodology used in our study directly generalizes to any characterization study on any kind of process model. While our results might be unique to the specific mix of models available on GitHub, the generalizability of our findings (for instance, the comparative number of clones in different industries, the prominence of different tools used for creating process models, and the prominence of BPMN models across different domains) needs further study, ideally in closed-source settings. Such studies are the next step in expanding the understanding of model cloning across different industries.

6 Threats to Validity

In this section, we address the potential validity concerns related to our study, as outlined by Wohlin et al.’s classification scheme (Wohlin et al.2012).

Conclusion Validity. We chose GitHub as the source based on its popularity and widespread usage (Cosentino et al.2017). However, this choice has led to the systematic exclusion of BPMN models in open source beyond GitHub. Therefore, the validity of our results might be limited to the models on GitHub. In addition, our original BPMN dataset is based on the GHTorrent database from March 2021 (Türker et al.2022). Therefore, our results are scoped to this period and might not generalize beyond it. Further research into models added after that date would provide additional value over our findings.

Internal Validity. The selection and elimination (i.e., filtering) criteria or process may inadvertently exclude certain models, which might have led to a biased sample that does not accurately represent the population. We might also have excluded models that represent real-world processes, for instance, models with formatting issues or errors in the serialization format, or genuine processes with fewer than three activities. To mitigate this, we adopted a rigorous approach during the filtering steps: we examined the results before and after each filtering step, with multiple trials on the filtering parameters, to ensure optimal results. Furthermore, the selection of external tools, such as the BPMN2 modeler for parsing BPMN models and Lingua for detecting non-English models, was guided by previous studies indicating that these tools are among the best options for the specific tasks at hand. Additionally, we ensured that the filtering criteria were not designed to cater exclusively to this particular dataset but rather aimed to be applicable to other datasets as well.

While our filtering steps aim to minimize noise, it is important to note that some models in our dataset may not accurately represent realistic BPMN models. Similarly, the repositories hosting these models may not always reflect real-world scenarios. In this study, we have not implemented specific criteria to assess repositories for credibility. To address this limitation, future research could employ techniques similar to the ones applied in Babur et al.’s work (2024) for a more in-depth evaluation of the reliability and authenticity of the repositories included in the dataset.

In all validation processes, we follow a random sampling approach to minimize selection bias. Also, for each validation, we implemented a rigorous process involving inter-rater agreement between the first and second authors.

There is a threat regarding the filtering process applied to exclude non-English models. While we employed rigorous criteria by utilizing the Lingua tool (2022) to identify non-English models, there is a possibility, stemming from the limitation of this tool, that after the exclusion, some models could still contain mixed-language labels or non-English terms that were not accurately detected.

We adopted an approach to slice the model set into smaller clusters of data that are manageable by our clone detection tool. This may pose a threat regarding the final formation of clone classes. Since we did not directly apply clone detection to the entire model set, we took a precautionary measure by using a significantly higher similarity threshold (50%) at this stage compared to the clone detection step (30%).

A potential threat to our study’s validity arises from the limitations of SAMOS, the only available tool with native support for BPMN model clone detection. Despite achieving a high precision for model- and fragment-level clone detection, our reliance on SAMOS may not capture all clones, potentially limiting the validity of our findings. To mitigate this threat, we performed manual, independent validation by two authors, along with inter-rater agreement, to ensure the reliability of the results from SAMOS.

Construct Validity. We inherit threats regarding the inclusiveness of our dataset from the prior study (Türker et al.2022). Since the original mining in the prior study is based on data from GHTorrent (from March 2021), we may have missed models not indexed in GHTorrent since that time. Also, the prior study may have missed identifying all BPMN models due to the search technique it applied. Additionally, approximately 40k (\(\approx \)12%) models from the previous study were unavailable for download from GitHub, further affecting the completeness of our dataset. To mitigate this, a subsequent study could be conducted to identify any missing BPMN models. This could involve extending the time frame for data collection beyond the cut-off date of the initial study.

Our domain identification process relies on activity labels that appear at least 10 times within the dataset. While this criterion ensures a focused and consistent domain list, it is worth noting that examining less frequent labels could expand our domain categories further.

There is also a potential for researcher bias during the model exclusion steps. While we have documented inclusion-exclusion criteria and manually checked models, it is still possible that models were inadvertently excluded.

External Validity. Our study is based on BPMN models from GitHub, which may not be representative of all publicly available BPMN models. For example, there are other open-access datasets, such as the SAP Signavio Academic Models dataset and RePROSitory, as well as private and closed-source projects hosted on GitHub, which may have unique characteristics. Future work could explore these datasets to gain a better understanding of the generalizability of our conclusions.

A key limitation of our classification approach involves the challenges of reproducibility with OpenAI’s proprietary GPT models, and large language models (LLMs) in general (Biderman et al.2024; Spirling2023). Despite the use of the most up-to-date LLM available at the time we conducted this study, updates to GPT models by OpenAI could alter our experimental outcomes, which could complicate consistent comparisons across studies (Biderman et al.2024). To minimize the impact of this threat, we provide details of our experimental setup to ensure our results are as reproducible as possible.

Another potential threat is the use of a GPT LLM in a zero-shot setting for classifying individual BPMN models. These models are designed for general-purpose tasks and may not be fine-tuned to classify process models into application domains. To address this threat, we evaluate the results produced by the LLM against classifications by human raters and assess the model’s effectiveness for the given task. There could also be a threat related to the construction of the prompt. To mitigate it, we followed existing literature and iteratively refined the prompt for use with LLMs in a zero-shot context.

7 Related Work

Research related to our study can broadly be classified into two categories: (1) studies characterizing other software artifacts; and (2) studies of model/code clones in open source, including studies exploring BPMN on GitHub.

7.1 Software Characterization

The studies related to the characterization of artifacts in the context of this paper can be broadly classified into two groups: (a) studies on BPMN-related artifacts and (b) studies from the wider software engineering discipline. We summarize both, starting with the former.

Heinze et al. (2020) mine potential BPMN models in 10% of GitHub by searching for XML files containing a URL for the BPMN 2.0 schema definition. They perform a preliminary analysis of the obtained models, first checking their diversity, e.g., in terms of age, origin, and size, and second their compliance with the BPMN 2.0 standard. Türker et al. (2022) conduct a similar study, this time covering the whole of GitHub. In another study, Heinze et al. (2020) use their own dataset (as reported in Heinze et al. 2020) and discuss how it can be used to guide tool validation and development; they report a high number of syntax and semantic rule violations in the models, thus motivating the use of linting tools for this purpose. Finally, Lübke and Wutke (2021) study the layout choices made by designers of BPMN models and their relationship to other modeling parameters such as tool use, process model type, and design purpose. Although we employ the same dataset mined by Türker et al. (2022), we go further by performing a detailed characterization of the models as well as a clone analysis of the dataset, neither of which was done in previous studies. There are also studies that explore BPMN models from publicly available data sources (Compagnucci et al. 2024; Muehlen and Recker 2024; Compagnucci et al. 2021). While these studies analyze BPMN models from open source, their scope is limited to the usage of BPMN model elements, focusing on the syntactic dimension of BPMN. Our work analyzes other dimensions of BPMN utilization, such as application domain, temporal trends, ownership, and cloning prevalence.
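The namespace-based search that Heinze et al. and Türker et al. rely on can be approximated with a simple heuristic: flag any XML file that references the BPMN 2.0 model namespace. The sketch below is a simplified stand-in for their actual mining setup, not a reproduction of it:

```python
import re

# Official BPMN 2.0 model namespace published by the OMG; files that
# reference it are candidate BPMN models.
BPMN_NS = re.compile(r"http://www\.omg\.org/spec/BPMN/20100524/MODEL")

def looks_like_bpmn(xml_text: str) -> bool:
    """Heuristically flag an XML document as a candidate BPMN 2.0 model."""
    return bool(BPMN_NS.search(xml_text))

sample = '<bpmn:definitions xmlns:bpmn="http://www.omg.org/spec/BPMN/20100524/MODEL"/>'
print(looks_like_bpmn(sample))    # True
print(looks_like_bpmn("<html/>")) # False
```

As with any textual heuristic, this over-approximates (the URL may appear in non-model files) and under-approximates (models serialized without the standard namespace are missed), which is one source of the dataset-completeness threat discussed earlier.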

There are also studies characterizing other software artifacts in open source. We can categorize these studies by application domain (Gonzalez et al. 2020; Murphy-Hill et al. 2014; Wessel et al. 2018) and by other factors such as geography (Han et al. 2021). We refer to some of these studies, which have inspired our own. One recent study by Kochanthara et al. (2022) categorizes and characterizes automotive software on GitHub with respect to factors such as temporal trends, popularity, and development activities. Another recent exploration targets the large Chinese technology companies Alibaba, Baidu, and Tencent (Han et al. 2021): a regional investigation of open source software developed by these companies, comparing it to other software systems and examining the companies’ motivations for making their software open source. An empirical study (Babur et al. 2024) on the language usage of Eclipse Modeling Framework metamodels in public engineered projects on GitHub provides insight into the usage patterns of specific language constructs and their alignment with the language design. Our study design is partly inspired by their generic analytics workflow, particularly the phases of data collection, data filtering, and data analysis. Another study investigates a decade of open source software related to ML and AI (Gonzalez et al. 2020), comparing trends in ML/AI evolution with general software systems and also examining autonomy and collaboration in these systems. Furthermore, an interview-based investigation characterizes and distinguishes video game development from traditional software development (Murphy-Hill et al. 2014). Finally, there are studies on bots that explore their applications and how they can be useful in the development of other software systems (Wessel et al. 2018). While the above studies characterize specific types of open source software in particular domains or regions, we draw inspiration from them and focus specifically on BPMN models in open source, examining their unique characteristics.

7.2 Clones in Open Source

Saeedi Nikoo et al. (2022) present a tool for clone detection in BPMN models. As part of their evaluation, they conduct an experiment on a collection of 395 BPMN models collected from GitHub. While their study is a tool paper that demonstrates the effectiveness of the presented tool through a preliminary experiment on a selection of GitHub models, this paper is an empirical study. Our aim is to offer a comprehensive characterization of BPMN models and their clones as found on GitHub. They introduce a new tool with an emphasis on its design and practical applications, whereas we investigate specific research questions with a focus on data analysis and contribute new empirical findings and insights.

Studies concerning the cloning of other model types in open source are also related to our work. Babur et al. (2019) mine Ecore metamodels on GitHub as part of an experiment to validate their clone detection tool. In their iterative approach to clone detection, they look into exact and approximate clones of the collected metamodels and reveal a high degree of cloning among them. Babur et al. have further related work detecting clones in open source feature models (Babur et al. 2018) and in industrial domain-specific models (Babur et al. 2020).

Our study is similarly related to code clones in open source datasets. Källén et al. (2021) study code cloning in Jupyter notebooks on GitHub. They analyze clones at the level of snippets contained in notebooks, studying how these snippets recur across notebooks and distinguishing identical from approximate clones. They also take a repository view, examining the frequency of clones within and across repositories. Golubev and Bryksin (2021) compile a dataset of over 23k Java projects on GitHub and search for code clones at various levels, including methods and files. They study near-miss and exact clones over code fragments, analyzing metrics such as the size and age of code fragments, the prevalence of their clones, and possible connections between exact and non-exact clones. They find that cloning occurs throughout the entire history of Java code, spanning all years, and that method-level cloning is much more common than file-level cloning, with nearly 65% of methods having clones.
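At file level, the exact-clone detection these studies report can be illustrated by grouping artifacts on a hash of their whitespace-normalized content. This is a minimal sketch for intuition, not the actual tooling used in any of the cited works:

```python
import hashlib
from collections import defaultdict

def exact_clone_groups(files):
    """Group file contents by SHA-256 of whitespace-normalized text.

    Groups with more than one member are exact clones of each other
    (up to whitespace differences)."""
    groups = defaultdict(list)
    for name, text in files.items():
        normalized = " ".join(text.split())  # collapse all whitespace runs
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        groups[digest].append(name)
    return [members for members in groups.values() if len(members) > 1]

files = {
    "a.bpmn": "<task name='review'/>",
    "b.bpmn": "<task   name='review'/>",  # whitespace-only difference
    "c.bpmn": "<task name='approve'/>",
}
print(exact_clone_groups(files))  # [['a.bpmn', 'b.bpmn']]
```

Near-miss (approximate) clone detection, by contrast, requires similarity measures over model or code structure rather than content hashing, which is why the cited works rely on dedicated clone detection tools.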

In another study, Gharehyazie et al. (2019) conduct an in-depth study of cross-project code clones on GitHub. They use a clone detection tool of their choice to find code fragments shared across projects, inspecting their prevalence and characteristics using statistical and network science approaches. They also present an open source tool that streamlines finding and tracking code clones on GitHub. Finally, Lopes et al. (2017) conduct a large-scale study of GitHub projects written in Java, C++, Python, and JavaScript. They investigate the relationship between the languages used and the number of clones, along with various other quantitative and qualitative analyses of the dataset, and introduce a publicly available map of code duplicates in GitHub repositories. While the above studies center on code clones in various programming languages, our focus in this study is on business process model clones.

To the best of our knowledge, apart from the study by Saeedi Nikoo et al. (2022), no prior research studies business process model clones in open source. Inspired by the studies above, we offer an exploration of BPMN models in open source, presenting a qualitative and quantitative analysis of them along similar lines.

8 Conclusion and Future Work

This study characterizes BPMN models on GitHub and studies the prevalence and types of clones of BPMN models and of the subprocess elements within them. We characterized 25,866 BPMN models from GitHub created over the span of 12 years, from 2010 to 2021. Our characterization delved into factors such as application domain, temporal trends, ownership, and the modeling tools used in the development of BPMN models. We also detected clones in BPMN models, identifying Type-A and Type B & C clones. We found that a significant portion of model-level clones were exact clones and that clones frequently occurred across repositories rather than within the same repository. Our results indicate that, despite a higher number of industry repositories hosting clones, the cumulative count of model clones is significantly greater in academia than in industry. Machine learning, traditional software, and business services are the top three application domains from which clones originate. In addition, our cloning analysis of subprocess elements revealed an even higher rate of cloning among subprocesses compared to model-level cloning.

As future work, we aim to expand our study by including newer models not present in the latest GHTorrent database and by conducting independent investigations on other datasets originating beyond GitHub. We also aim to explore model cloning in closed-source projects and compare our findings with those from open source projects. Furthermore, we intend to conduct a comparative investigation of business process models in open source development using other notations, such as EPC. Such research will enhance our understanding of process models and their use in various contexts, aiding in the development of more effective modeling practices.

Data Availability Statement

The data and code supporting the findings of this study are available in Zenodo with the identifier https://doi.org/10.5281/zenodo.13955920.

Notes

  1. https://www.globenewswire.com/news-release/2022/12/20/2577259/0/en/The-Global-Business-Process-Management-Market-size-is-expected-to-reach-40-6-billion-by-2028-rising-at-a-market-growth-of-19-4-CAGR-during-the-forecast-period.html

  2. Due to the different licenses associated with individual repositories and the GitHub terms of service (https://docs.github.com/en/github/site-policy/github-terms-of-service), it is not possible to share the BPMN files and associated contents publicly. Therefore, we have shared links to the files.

  3. \(\textit{F1--score} = 2 * \frac{\textit{Recall*Precision}}{\textit{Recall+Precision}}\).

  4. Note that, in our inclusion/exclusion phase, models with activity labels entirely in non-English languages are filtered out. However, models that include some non-English labels can pass through this filter. We encountered only 4 such models among the 250 models in our sample set.

References

  • Activiti (2023)https://www.activiti.org/. Last Accessed on 24 Dec 2023

  • Alalfi MH, Cordy JR, Dean TR, Stephan M, Stevenson A (2012) Models are code too: near-miss clone detection for simulink models. In: 2012 28th IEEE International Conference on Software Maintenance (ICSM). IEEE, pp 295–304

  • Allamanis M (2019) The adverse effects of code duplication in machine learning models of code. In: Proceedings of the 2019 ACM SIGPLAN international symposium on new ideas, new paradigms, and reflections on programming and software. pp 143–153

  • Almonte L, Guerra E, Cantador I, De Lara J (2021) Recommender systems in model-driven engineering: a systematic mapping review. Softw Syst Model 1–32

  • April A, Abran A (2012) Software maintenance management: evaluation and continuous improvement. John Wiley & Sons

  • Babur Ö (2019) Model analytics and management. Ph.D. thesis, Technische Universiteit Eindhoven. Proefschrift

  • Babur Ö, Cleophas L, van den Brand M (2018) Model analytics for feature models: case studies for S.P.L.O.T. repository. In: Proc. of MODELS 2018 workshops, co-located with ACM/IEEE 21st Int. Conf. on Model driven engineering languages and systems. pp 787–792

  • Babur Ö, Suresh A, Alberts W, Cleophas L, Schiffelers R, van den Brand M (2020) Model analytics for industrial MDE ecosystems. In: Model management and analytics for large scale systems. Elsevier, pp 273–316

  • Babur Ö, Cleophas L, van den Brand M (2019) Metamodel clone detection with SAMOS. J Comput Lang 51:57–74

  • Babur Ö, Cleophas L, van den Brand M (2022) SAMOS - a framework for model analytics and management. Sci Comput Program 223:102877

  • Babur Ö, Constantinou E, Serebrenik A (2024) Language usage analysis for emf metamodels on github. Empir Softw Eng 29(1):23

  • Bafna P, Pramod D, Vaidya A (2016) Document clustering: Tf-idf approach. In: 2016 International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT). IEEE, pp 61–66

  • Bang Y, Cahyawijaya S, Lee N, Dai W, Su D, Wilie B, Lovenia H, Ji Z, Yu T, Chung W, et al (2023) A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity.arXiv:2302.04023

  • bpmn.io (2023)https://bpmn.io/. Last Accessed on 24 Dec 2023

  • Biderman S, Schoelkopf H, Sutawika L, Gao L, Tow J, Abbasi B, Aji AF, Ammanamanchi PS, Black S, Clive J, DiPofi A, Etxaniz J, Fattori B, Forde JZ, Foster C, Jaiswal M, Lee WY, Li H, Lovering C, Muennighoff N, Pavlick E, Phang J, Skowron A, Tan S, Tang X, Wang KA, Winata GI, Yvon F, Zou A (2024) Lessons from the trenches on reproducible evaluation of language models

  • Brown T, Mann B, Ryder N, Subbiah M, Kaplan JD, Dhariwal P, Neelakantan A, Shyam P, Sastry G, Askell A et al (2020) Language models are few-shot learners. Adv Neural Inf Process Syst 33:1877–1901

  • camunda (2023)https://camunda.com/. Last Accessed on 24 Dec 2023

  • Chinosi M, Trombetta A (2012) Bpmn: an introduction to the standard. Comput Stand Interfaces 34(1):124–134

  • ARIS Community (2023) https://www.ariscommunity.com/. Last Accessed on 24 Dec 2023

  • Compagnucci I, Corradini F, Fornari F, Re B (2024) A study on the usage of the bpmn notation for designing process collaboration, choreography, and conversation models. Bus Inf Syst Eng 66(1):43–66

  • Compagnucci I, Corradini F, Fornari F, Re B (2021) Trends on the usage of bpmn 2.0 from publicly available repositories. In: International conference on business informatics research. Springer, pp 84–99

  • Corbin JM, Strauss A (1990) Grounded theory research: procedures, canons, and evaluative criteria. Qual Sociol 13(1):3–21

  • Corradini F, Ferrari A, Fornari F, Gnesi S, Polini A, Re B, Spagnolo GO (2018) A guidelines framework for understandable bpmn models. Data Know Eng 113:129–154

  • Cosentino V, Izquierdo JLC, Cabot J (2017) A systematic mapping study of software development with github. Ieee Access 5:7173–7192

  • Deissenboeck F, Hummel B, Juergens E, Pfaehler M, Schaetz B (2010) Model clone detection in practice. In: Proceedings of the 4th international workshop on software clones. pp 57–64

  • Deissenboeck F, Hummel B, Jürgens E, Schätz B, Wagner S, Girard JF, Teuchert S (2008) Clone detection in automotive model-based development. In: Proceedings of the 30th international conference on software engineering. pp 603–612

  • Deza E, Deza MM, Deza MM, Deza E (2009) Encyclopedia of distances. Springer

  • Dijkman R, Dumas M, Van Dongen B, Käärik R, Mendling J (2011) Similarity of business process models: metrics and evaluation. Inf Syst 36(2):498–516

  • Dijkman R, Gfeller B, Küster J, Völzer H (2011) Identifying refactoring opportunities in process model repositories. Inf Softw Technol 53(9):937–948

  • Disciplines: digital commons three-tiered taxonomy of academic disciplines (2016) Digital Commons Reference Material and User Guides. Paper 9.http://digitalcommons.bepress.com/reference/9

  • Drools (2023)https://www.drools.org/. Last Accessed on 24 Dec 2023

  • Duala-Ekoko E, Robillard MP (2008) Clonetracker: tool support for code clone management. In: Proceedings of the 30th international conference on Software engineering. pp 843–846

  • Dubinsky Y, Rubin J, Berger T, Duszynski S, Becker M, Czarnecki K (2013) An exploratory study of cloning in industrial software product lines. In: 2013 17th European conference on software maintenance and reengineering. IEEE, pp 25–34

  • Dumas M, García-Bañuelos L, La Rosa M, Uba R (2013) Fast detection of exact clones in business process model repositories. Inf Syst 38(4):619–633

  • Dumas M, La Rosa M, Mendling J, Reijers HA, Dumas M, La Rosa M, Mendling J, Reijers HA (2018) Introduction to business process management. Fundamentals of business process management. pp 1–33

  • Dumas M, La Rosa M, Mendling J, Reijers HA, et al (2013) Fundamentals of business process management, vol 1. Springer

  • Dumas M, Rosa ML, Mendling J, Reijers HA (2018) Business process management, vol 64.https://doi.org/10.1016/j.datak.2007.06.004

  • Eclipse (2021) Bpmn2 modeler.https://www.eclipse.org/bpmn2-modeler

  • Flowable (2023).https://www.flowable.com/ Last Accessed on 24 Dec 2023

  • Fowler M (2004) UML distilled: a brief guide to the standard object modeling language. Addison-Wesley Professional

  • Frank U (2007) Evaluation of reference models. In: Reference modeling for business systems analysis. IGI Global, pp 118–140

  • Geiger M, Harrer S, Lenhard J, Wirtz G (2018) Bpmn 2.0: the state of support and implementation. Future Gener Comput Syst 80:250–262

  • Gharehyazie M, Ray B, Keshani M, Zavosht MS, Heydarnoori A, Filkov V (2019) Cross-project code clones in github. Empir Softw Eng 24(3):1538–1573

  • GitHub (2023) Github octoverse.https://octoverse.github.com/

  • Golubev Y, Bryksin T (2021) On the nature of code cloning in open-source java projects. In: Proceedings of the 2021 IEEE 15th International Workshop on Software Clones (IWSC). pp 22–28. https://doi.org/10.1109/IWSC53727.2021.00010

  • Gonzalez D, Zimmermann T, Nagappan N (2020) The state of the ml-universe: 10 years of artificial intelligence & machine learning software development on github. In: Proceedings of the 17th international conference on mining software repositories. pp 431–442

  • Gousios G (2013) The ghtorent dataset and tool suite. In: 2013 10th Working conference on Mining Software Repositories (MSR). IEEE, pp 233–236

  • BOC Group (2023) Adonis. https://www.boc-group.com/en/adonis/. Last Accessed on 24 Dec 2023

  • Gupta V, Lehal GS et al (2009) A survey of text mining techniques and applications. J Emerg Technol Web Intell 1(1):60–76

  • Haisjackl C, Pinggera J, Soffer P, Zugal S, Lim SY, Weber B (2015) Identifying quality issues in bpmn models: an exploratory study. In: International workshop on business process modeling, development and support. Springer, pp 217–230

  • Hammad M, Babur Ö, Basit HA, van den Brand M (2021) Clone-advisor: recommending code tokens and clone methods with deep learning and information retrieval. PeerJ Comput Sci 7:e737

  • Han J, Deng S, Lo D, Zhi C, Yin J, Xia X (2021) An empirical study of the landscape of open source projects in baidu, alibaba, and tencent. In: 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP). IEEE, pp 298–307

  • Heinze TS, Stefanko V, Amme W (2020) BPMN in the Wild: BPMN on GitHub.com. In: Proceedings of the 12th ZEUS workshop on services and their composition. pp 2–5.https://www.semanticscholar.org/paper/BPMN-in-the-Wild%3A-BPMN-on-GitHub.com-Heinze-Stefanko/54e27c20a44109f64c76b083cfe8e5891617e4ac

  • Heinze TS, Stefanko V, Amme W (2020) Mining bpmn processes on github for tool validation and development. In: Enterprise, business-process and information systems modeling. Springer, pp 193–208

  • Heinze T, Amme W, Schäfer A (2021) Detecting semantic business process model clones. In: 13th European workshop on services and their composition, ZEUS 2021. CEUR-ws.org, pp 25–28

  • Hernández N, Batyrshin I, Sidorov G (2022) Evaluation of deep learning models for sentiment analysis. J Intelli Fuzz Syst 1–11. (Preprint)

  • Imran MM, Chatterjee P, Damevski K (2024) Uncovering the causes of emotions in software developer communication using zero-shot llms. In: Proceedings of the IEEE/ACM 46th international conference on software engineering. pp 1–13

  • Joachims T (2005) Text categorization with support vector machines: learning with many relevant features. In: Machine learning: ECML-98: 10th European conference on machine learning Chemnitz, Germany, April 21–23, 1998 Proceedings. Springer, pp 137–142

  • Källén M, Sigvardsson U, Wrigstad T (2021) Jupyter notebooks on GitHub: characteristics and code clones. Art Sci Engi Programm 5(3):100.https://doi.org/10.22152/programming-journal.org/2021/5/15

  • Kamiya T, Kusumoto S, Inoue K (2002) Ccfinder: a multilinguistic token-based code clone detection system for large scale source code. IEEE Trans Software Eng 28(7):654–670

  • Kochanthara S, Dajsuren Y, Cleophas L, van den Brand M (2022) Painting the Landscape of Automotive Software in GitHub. pp 215–226.arXiv:2203.08936

  • Kojima T, Gu SS, Reid M, Matsuo Y, Iwasawa Y (2022) Large language models are zero-shot reasoners. Adv Neural Inf Process Syst 35:22199–22213

  • Koschke R (2007) Survey of research on software clones. In: Dagstuhl seminar proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik

  • Koschmider A, Reijers HA (2015) Improving the process of process modelling by the use of domain process patterns. Enterp Inf Syst 9(1):29–57

  • Kvålseth TO (1989) Note on cohen’s kappa. Psychol Rep 65(1):223–226

  • La Rosa M, Reijers HA, Van Der Aalst WM, Dijkman RM, Mendling J, Dumas M, García-Bañuelos L (2011) Apromore: an advanced process model repository. Expert Syst Appl 38(6):7029–7040

  • La Rosa M, Dumas M, Ekanayake CC, García-Bañuelos L, Recker J, ter Hofstede AH (2015) Detecting approximate clones in business process model repositories. Inf Syst 49:102–125

  • Lapeña R, Ballarin M, Cetina C (2016) Towards clone-and-own support: locating relevant methods in legacy products. In: Proceedings of the 20th international systems and software product line conference. pp. 194–203

  • Li XL, Liang P (2021) Prefix-tuning: optimizing continuous prompts for generation.arXiv:2101.00190

  • Li Y, Cao B, Xu L, Yin J, Deng S, Yin Y, Wu Z (2013) An efficient recommendation method for improving business process modeling. IEEE Trans Industr Inf 10(1):502–513

  • Li L, Fan L, Atreja S, Hemphill L (2024) “hot’’ chatgpt: the promise of chatgpt in detecting and discriminating hateful, offensive, and toxic comments on social media. ACM Trans Web 18(2):1–36

  • Lin YT, Papangelis A, Kim S, Lee S, Hazarika D, Namazifar M, Jin D, Liu Y, Hakkani-Tur D (2023) Selective in-context data augmentation for intent detection using pointwise v-information.arXiv:2302.05096

  • Lingua (2022) Lingua.https://github.com/pemistahl/lingua

  • Li C, Reichert M, Wombacher A (2008) Discovering reference process models by mining process variants. In: 2008 IEEE international conference on web services. IEEE, pp 45–53

  • Liu C, Osama M, De Andrade A (2019) Dens: a dataset for multi-class emotion analysis.arXiv:1910.11769

  • Lopes CV, Maj P, Martins P, Saini V, Yang D, Zitny J, Sajnani H, Vitek J (2017) Déjàvu: a map of code duplicates on github. Proc ACM Programm Lang 1(OOPSLA):1–28

  • Lübke D, Wutke D (2021) Analysis of prevalent BPMN layout choices on GitHub. CEUR Workshop Proc 2839(February):46–54

  • Mendling J (2008) Event-driven process chains (epc). In: Metrics for process models. Springer, pp 17–57

  • Minaee S, Kalchbrenner N, Cambria E, Nikzad N, Chenaghlu M, Gao J (2021) Deep learning-based text classification: a comprehensive review. ACM Comp Surv 54(3):1–40

  • Monden A, Nakae D, Kamiya T, Sato SI, Matsumoto KI (2002) Software quality analysis by code clones in industrial legacy software. In: Proceedings Eighth IEEE symposium on software metrics. IEEE, pp 87–94

  • MSCI, S&P (1999) The global industry classification standard (GICS). https://www.msci.com/our-solutions/indexes/gics

  • Muehlen Mz, Recker J (2013) How much language is enough? Theoretical and practical use of the business process modeling notation. Seminal Contributions to Information Systems Engineering: 25 Years of CAiSE 429–443

  • Mujahid M, Rustam F, Shafique R, Chunduri V, Villar MG, Ballester JB, Diez IdlT, Ashraf I (2023) Analyzing sentiments regarding chatgpt using novel bert: a machine learning approach. Information 14(9):474

  • Munaiah N, Kroh S, Cabrey C, Nagappan M (2017) Curating github for engineered software projects. Empir Softw Eng 22(6):3219–3253

  • Murphy-Hill E, Zimmermann T, Nagappan N (2014) Cowboys, ankle sprains, and keepers of quality: How is video game development different from software development? In: Proceedings of the 36th international conference on software engineering. pp 1–11

  • Nguyen HA, Nguyen TT, Pham NH, Al-Kofahi J, Nguyen TN (2011) Clone management for evolving software. IEEE Trans Softw Eng 38(5):1008–1026

  • OMG (2013) Business process model and notation v.2.0.2.https://www.omg.org/spec/BPMN/2.0.2

  • BPM Conference (2023) Conferences on business process management. https://bpm-conference.org/

  • OpenAI (2023) ChatGPT.https://openai.com/blog/chatgpt

  • OpenAI (2024) Gpt-4 technical report

  • Pham NH, Nguyen HA, Nguyen TT, Al-Kofahi JM, Nguyen TN (2009) Complete and accurate clone detection in graph-based models. In: 2009 IEEE 31st International Conference on Software Engineering. IEEE, pp 276–286

  • Pinggera J, Soffer P, Fahland D, Weidlich M, Zugal S, Weber B, Reijers HA, Mendling J (2015) Styles in business process modeling: an exploration and a model. Softw Syst Model 14:1055–1080

  • PROS-Lab (2019) Repository of open process models and logs.https://pros.unicam.it:4200/index

  • Raschka S (2014) Naive bayes and text classification i-introduction and theory.arXiv:1410.5329

  • Rattan D, Bhatia R, Singh M (2013) Software clone detection: a systematic review. Inf Softw Technol 55(7):1165–1199

  • Reuters-Group (2004) The refinitiv business classification (TRBC). https://www.lseg.com/en/data-analytics/financial-data/indices/trbc-business-classification

  • Ricotta C, Podani J (2017) On some properties of the bray-curtis dissimilarity and their ecological meaning. Ecol Complex 31:201–205

  • Rosa ML, Aalst WMVD, Dumas M, Milani FP (2017) Business process variability modeling: a survey. ACM Comp Surv 50(1):1–45

  • Roy CK, Cordy JR (2007) A survey on software clone detection research. Queens School Comp TR 541(115):64–68

  • Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: a qualitative approach. Sci Comput Program 74(7):470–495

  • Roy S, Sajeev A, Bihary S, Ranjan A (2013) An empirical study of error patterns in industrial business process models. IEEE Trans Serv Comput 7(2):140–153

  • Rubin J, Czarnecki K, Chechik M (2013) Managing cloned variants: a framework and experience. In: Proceedings of the 17th international software product line conference. pp 101–110

  • Saeedi Nikoo M, Babur Ö, van den Brand M (2022) Clone detection for business process models. PeerJ Comput Sci 8:e1046

  • Saeedi Nikoo M, Babur Ö, Van Den Brand M (2020) A survey on service composition languages. In: Proceedings of the 23rd ACM/IEEE international conference on model driven engineering languages and systems: companion proceedings. pp 1–5

  • Saeedi Nikoo M, Kochanthara S, Babur Ö, Van den Brand M (2023) Supplemental material including data and code for the paper “An empirical study of business process models and model clones on GitHub”.https://doi.org/10.5281/zenodo.13955920

  • Saier T, Färber M, Tsereteli T (2022) Cross-lingual citations in english papers: a large-scale analysis of prevalence, usage, and impact. Int J Digit Libr 23(2):179–195

  • Schoknecht A, Thaler T, Fettke P, Oberweis A, Laue R (2017) Similarity of business process models-a state-of-the-art analysis. ACM Comp Surv 50(4):1–33

  • Signavio S (2023)https://www.signavio.com/. Last Accessed on 24 Dec 2023

  • Sneed HM (2008) Offering software maintenance as an offshore service. In: 2008 IEEE International Conference on Software Maintenance. IEEE, pp 1–5

  • Sokolova M, Lapalme G (2009) A systematic analysis of performance measures for classification tasks. Inf Process Manag 45(4):427–437

  • Sola D, Warmuth C, Schäfer B, Badakhshan P, Rehse JR, Kampik T (2022) Sap signavio academic models: a large process model dataset.https://doi.org/10.48550/ARXIV.2208.12223.arXiv:2208.12223

  • Spinellis D, Kotti Z, Mockus A (2020) A dataset for github repository deduplication. In: Proceedings of the 17th international conference on mining software repositories. pp 523–527

  • Spirling A (2023) Why open-source generative ai models are an ethical way forward for science. Nature 616(7957):413–413

  • Stephan M (2019) Towards a cognizant virtual software modeling assistant using model clones. In: 2019 IEEE/ACM 41st international conference on software engineering: new ideas and emerging results (ICSE-NIER). IEEE, pp 21–24

  • Störrle H (2015) Effective and efficient model clone detection. In: Software, services, and systems. Springer, pp 440–457

  • Sparx Systems (2023) Enterprise Architect. https://sparxsystems.com/. Last accessed on 24 Dec 2023

  • Taherdoost H (2017) Determining sample size; how to calculate survey sample size. Int J Econ Manag Syst 2

  • Tan PN, Steinbach M, Kumar V (2016) Introduction to data mining. Pearson Education India

  • Taymouri F, La Rosa M, Dumas M, Maggi FM (2021) Business process variant analysis: survey and classification. Knowl-Based Syst 211:106557

  • Türker J, Völske M, Heinze TS (2022) BPMN in the wild: a reprise. CEUR Workshop Proc 3113(February):68–75

  • Türker J, Völske M, Heinze T (2022) GitHub BPMN artifacts dataset 2021. https://doi.org/10.5281/zenodo.5903352

  • Uba R, Dumas M, García-Bañuelos L, Rosa ML (2011) Clone detection in repositories of business process models. In: International conference on business process management. Springer, pp 248–264

  • Uysal AK, Gunal S (2014) The impact of preprocessing on text classification. Inf Process Manag 50(1):104–112

  • Van Der Aalst WM, Ter Hofstede AH (2005) Yawl: yet another workflow language. Inf Syst 30(4):245–275

  • Van Der Aalst W, Van Hee KM (2004) Workflow management: models, methods, and systems. MIT Press

  • Volanschi N (2018) Stereo: editing clones refactored as code generators. In: 2018 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, pp 595–604

  • Weske M (2007) Business process management architectures. Springer

  • Wessel M, De Souza BM, Steinmacher I, Wiese IS, Polato I, Chaves AP, Gerosa MA (2018) The power of bots: characterizing and understanding bots in oss projects. Proc ACM Hum-Comput Interact 2(CSCW):1–19

  • White J, Fu Q, Hays S, Sandborn M, Olea C, Gilbert H, Elnashar A, Spencer-Smith J, Schmidt DC (2023) A prompt pattern catalog to enhance prompt engineering with chatgpt. arXiv:2302.11382

  • Wikibooks (2024) Scrabble. https://en.wikibooks.org/wiki/Scrabble/Two-Letter_Words

  • Wohlin C, Runeson P, Höst M, Ohlsson MC, Regnell B, Wesslén A (2012) Experimentation in software engineering. Springer Science & Business Media

  • Xin D, Miao H, Parameswaran A, Polyzotis N (2021) Production machine learning pipelines: empirical analysis and optimization opportunities. In: Proceedings of the 2021 international conference on management of data. pp 2639–2652

  • Zarour K, Benmerzoug D, Guermouche N, Drira K (2020) A systematic literature review on bpmn extensions. Bus Process Manag J 26(6):1473–1503

  • Zhang B, Ding D, Jing L (2022) How would stance detection techniques evolve after the launch of chatgpt? arXiv:2212.14548

  • Zheng O, Abdel-Aty M, Wang D, Wang Z, Ding S (2023) Chatgpt is on the horizon: could a large language model be all we need for intelligent transportation? arXiv:2303.05382

  • Zhou K, Yang J, Loy CC, Liu Z (2022) Learning to prompt for vision-language models. Int J Comput Vision 130(9):2337–2348


Author information

Authors and Affiliations

  1. Department of Mathematics and Computer Science, Eindhoven University of Technology, Eindhoven, The Netherlands

    Mahdi Saeedi Nikoo, Sangeeth Kochanthara, Önder Babur & Mark van den Brand

  2. Information Technology Group, Wageningen University & Research, Wageningen, The Netherlands

    Önder Babur

Authors
  1. Mahdi Saeedi Nikoo

  2. Sangeeth Kochanthara

  3. Önder Babur

  4. Mark van den Brand

Corresponding author

Correspondence to Mahdi Saeedi Nikoo.

Ethics declarations

Conflicts of Interest

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Slinger Jansen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
