The filters for event data subsetting can mostly be divided into two types: event filters and case filters. Event filters subset parts of cases based on criteria applied to the events (e.g. the resource that performed them), while case filters subset complete cases based on criteria applied to the cases (e.g. the trace length).
Each filter has a reverse argument, which makes it easy to invert the filter. Furthermore, each filter has an interface alternative, which can be called by adding an i before the function name.
The filter_activity function can be used to filter activities by name. It has three arguments: the event log, a vector of activity labels, and the reverse argument.
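The call that produced the output that follows is not preserved in this text. A minimal sketch of what such a call might look like, assuming the patients event log comes from the eventdataR package and that the filtered log is passed to summary() (both are assumptions):

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_* functions
library(eventdataR) # assumed source of the patients event log

# Keep only the events of the two listed activities
xray_blood <- patients %>%
  filter_activity(activities = c("X-Ray", "Blood test"))

summary(xray_blood)

# reverse = TRUE keeps everything except the listed activities
others <- patients %>%
  filter_activity(activities = c("X-Ray", "Blood test"), reverse = TRUE)
```

The second call illustrates the reverse argument: the same activity vector is used, but the selection is inverted.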
## Number of events:  996
## Number of cases:  498
## Number of traces:  2
## Number of distinct activities:  2
## Average trace length:  2
##
## Start eventlog:  2017-01-05 08:59:04
## End eventlog:  2018-05-05 01:34:30
##                   handling      patient          employee  handling_id
##  Blood test           :474   Length:996         r1:  0    Length:996
##  Check-out            :  0   Class :character   r2:  0    Class :character
##  Discuss Results      :  0   Mode  :character   r3:474    Mode  :character
##  MRI SCAN             :  0                      r4:  0
##  Registration         :  0                      r5:522
##  Triage and Assessment:  0                      r6:  0
##  X-Ray                :522                      r7:  0
##  registration_type      time                         .order
##  complete:498      Min.   :2017-01-05 08:59:04   Min.   :  1.0
##  start   :498      1st Qu.:2017-05-06 12:31:43   1st Qu.:249.8
##                    Median :2017-09-08 00:10:11   Median :498.5
##                    Mean   :2017-09-03 07:11:55   Mean   :498.5
##                    3rd Qu.:2017-12-23 02:06:20   3rd Qu.:747.2
##                    Max.   :2018-05-05 01:34:30   Max.   :996.0
##

As one can see, there are only 2 distinct activities left in the event log.
It is also possible to filter on activity frequency. This filter uses a percentile cut-off and keeps the most frequent activities until the required percentage of events has been reached. Thus, a percentile cut-off of 80% keeps the activities needed to represent 80% of the events. In the example below, the least frequent activities covering 50% of the event log are selected, since the reverse argument is true.
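The original call is missing from this text. A hedged reconstruction follows, assuming the cut-off argument is named percentage (as in recent edeaR releases; this is an assumption about the installed version) and that the frequencies are computed with activity_frequency():

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_* functions and activity_frequency()
library(eventdataR) # assumed source of the patients event log

# Keep the least frequent activities; reverse = TRUE inverts the
# selection of the most frequent ones covering 50% of the events
rare <- patients %>%
  filter_activity_frequency(percentage = 0.5, reverse = TRUE)

# Absolute and relative frequency of the remaining activities
rare %>% activity_frequency(level = "activity")
```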
## # A tibble: 4 × 3
##   handling   absolute relative
##   <fct>         <int>    <dbl>
## 1 Check-out       492    0.401
## 2 X-Ray           261    0.213
## 3 Blood test      237    0.193
## 4 MRI SCAN        236    0.192

The filter_attributes function is a very generic function and can be supplied with conditions on the data set, in the same way as the dplyr::filter function. As such, it allows you to filter on event or case attributes. Multiple conditions can be listed, separated by a comma, in which case the comma is treated as “and”. You can use the | symbol to state “or”. Since the patients data set does not have many additional attributes, the example below uses the resource and activity. If both conditions were required, this filter would be the same as the combination of filter_activity and filter_resource. However, it has the advantage that both conditions can also be combined with OR.
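The call itself is not preserved here. A sketch of an OR condition on resource and activity, assuming the installed edeaR release exports filter_attributes and that the condition values below are the ones used (both are assumptions; the column names employee and handling are taken from the printed log):

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_attributes()
library(eventdataR) # assumed source of the patients event log

# Keep events performed by resource r1 OR belonging to activity X-Ray.
# Separating the two conditions with a comma instead would mean AND.
r1_or_xray <- patients %>%
  filter_attributes(employee == "r1" | handling == "X-Ray")

r1_or_xray
```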
## # Log of 1522 events consisting of:
## 2 traces
## 500 cases
## 761 instances of 2 activities
## 2 resources
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 01:34:30
##
## # Variables were mapped as follows:
## Case identifier:               patient
## Activity identifier:           handling
## Resource identifier:           employee
## Activity instance identifier:  handling_id
## Timestamp:                     time
## Lifecycle transition:          registration_type
##
## # A tibble: 1,522 × 7
##    handling   patient employee handling_id registration_type time
##    <fct>      <chr>   <fct>    <chr>       <fct>             <dttm>
##  1 Registrat… 1       r1       1           start             2017-01-02 11:41:53
##  2 Registrat… 2       r1       2           start             2017-01-02 11:41:53
##  3 Registrat… 3       r1       3           start             2017-01-04 01:34:05
##  4 Registrat… 4       r1       4           start             2017-01-04 01:34:04
##  5 Registrat… 5       r1       5           start             2017-01-04 16:07:47
##  6 Registrat… 6       r1       6           start             2017-01-04 16:07:47
##  7 Registrat… 7       r1       7           start             2017-01-05 04:56:11
##  8 Registrat… 8       r1       8           start             2017-01-05 04:56:11
##  9 Registrat… 9       r1       9           start             2017-01-06 05:58:54
## 10 Registrat… 10      r1       10          start             2017-01-06 05:58:54
## # ℹ 1,512 more rows
## # ℹ 1 more variable: .order <int>

Similar to the activity filter, the resource filter can be used to filter events by listing one or more resources.
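The call behind the frequency table that follows is missing from this text. A plausible form, assuming the resources r1 and r4 were selected and the result was summarised with resource_frequency() (both are assumptions inferred from the printed table):

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_resource() and resource_frequency()
library(eventdataR) # assumed source of the patients event log

# Keep only the events performed by the listed resources
r1_r4 <- patients %>%
  filter_resource(resources = c("r1", "r4"))

# Absolute and relative frequency per remaining resource
r1_r4 %>% resource_frequency(level = "resource")
```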
## # A tibble: 2 × 3
##   employee absolute relative
##   <fct>       <int>    <dbl>
## 1 r1            500    0.679
## 2 r4            236    0.321

The trim filter is a special event filter, as it also takes into account the notion of cases. It trims cases such that they start with a certain activity and end with a certain activity. It requires two lists: one for possible start activities and one for end activities. The cases are trimmed from the first appearance of a start activity to the last appearance of an end activity. When reversed, these slices of the event log are removed instead of preserved.
patients %>%
    filter_trim(start_activities = "Registration", end_activities = c("MRI SCAN", "X-Ray")) %>%
    traces()

## # A tibble: 2 × 3
##   trace                                    absolute_frequency relative_frequency
##   <chr>                                                 <int>              <dbl>
## 1 Registration,Triage and Assessment,X-Ray                261              0.525
## 2 Registration,Triage and Assessment,Bloo…                236              0.475

The filter_activity_presence function filters cases that contain certain activities. It requires as input a vector containing one or more activity labels and has a method argument. The latter can have the values all, none or one_of. When set to all, all the specified activity labels must be present for a case to be selected; none means that none of them may be present; and one_of means that at least one of them must be present.
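The method values can be illustrated with a short sketch. It assumes the patients event log from eventdataR; the activity labels are taken from the outputs above, and the choice of labels is only illustrative:

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_activity_presence()
library(eventdataR) # assumed source of the patients event log

scans <- c("X-Ray", "MRI SCAN")

# Cases that contain at least one of the two activities
any_scan <- patients %>%
  filter_activity_presence(activities = scans, method = "one_of")

# Cases that contain neither activity
no_scan <- patients %>%
  filter_activity_presence(activities = scans, method = "none")

any_scan %>% traces()
```

Because one_of and none are complementary for the same activity vector, the two selections together should cover every case in the log.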
The case filter subsets the event log to a given set of case identifiers. As its only argument it requires a vector of case ids. The selection can also be negated using reverse = T.
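A minimal sketch of the case filter, assuming the function is filter_case and that the case identifiers of the patients log are the character values shown in the patient column above:

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_case()
library(eventdataR) # assumed source of the patients event log

ids <- c("1", "2", "3") # illustrative case ids

# Keep only the listed cases
selected <- patients %>% filter_case(cases = ids)

# Keep every case except the listed ones
remaining <- patients %>% filter_case(cases = ids, reverse = TRUE)
```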
The filter_endpoints method filters cases based on the first and last activity label. It can be used in two ways: by specifying vectors with allowed start activities and/or allowed end activities, or by specifying a percentile. In the latter case, the percentile value is used as a cut-off. For example, when set to 0.9, it selects the most common endpoint pairs which together cover at least 90% of the cases, and filters the event log accordingly. This filter can also be reversed.
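Both ways of using the endpoints filter can be sketched as follows. The sketch assumes the cut-off argument is named percentage (as in recent edeaR releases) and uses activity labels from the patients log purely as an illustration:

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_endpoints()
library(eventdataR) # assumed source of the patients event log

# Way 1: cases that start with Registration and end with Check-out
by_label <- patients %>%
  filter_endpoints(start_activities = "Registration",
                   end_activities = "Check-out")

# Way 2: the most common endpoint pairs covering at least 90% of cases
by_share <- patients %>%
  filter_endpoints(percentage = 0.9)
```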
In order to extract a subset of an event log which conforms to a set of precedence rules, one can use the filter_precedence method. There are two types of precedence relations which can be tested: activities that should directly follow each other, or activities that should eventually follow each other. The type can be set with the precedence_type argument. Further, the filter requires a vector of one or more antecedents (containing activity labels), and one or more consequents. Finally, a filter_method argument can be set. This argument is relevant when there is more than one antecedent or consequent. In such a case, you can specify that all possible precedence combinations must be present (all), or at least one of them (one_of).
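As an illustration of these arguments, the sketch below keeps cases in which a triage is directly followed by at least one of two examinations. It assumes the patients event log from eventdataR; the chosen antecedent and consequents are illustrative:

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_precedence()
library(eventdataR) # assumed source of the patients event log

# One antecedent and two consequents give two precedence rules;
# filter_method = "one_of" requires at least one of them to hold
triaged <- patients %>%
  filter_precedence(antecedents = "Triage and Assessment",
                    consequents = c("Blood test", "X-Ray"),
                    precedence_type = "directly_follows",
                    filter_method = "one_of")

triaged %>% traces()
```

Switching precedence_type to "eventually_follows" would also accept cases in which other activities occur between the antecedent and the consequent.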
There are three different filters which take into account the length of a case: the trace length filter, the processing time filter, and the throughput time filter.
Each of these filters can work in two ways, similar to the endpoints filter: either by using an interval or by using a percentile cut-off. The percentile cut-off always starts with the shortest cases and stops including cases once the specified percentile is reached. The processing and throughput time filters also have a units argument to specify the time unit used when defining an interval. All the methods can be reversed by setting reverse = T.
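The interval and percentile variants can be sketched as follows. The function names filter_trace_length and filter_throughput_time, the argument name percentage, and the interval values are assumptions chosen for illustration:

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_trace_length(), filter_throughput_time()
library(eventdataR) # assumed source of the patients event log

# Interval: cases with between 2 and 5 activity instances
short <- patients %>%
  filter_trace_length(interval = c(2, 5))

# Percentile cut-off: the shortest cases until 50% of cases is reached
shortest_half <- patients %>%
  filter_trace_length(percentage = 0.5)

# Interval with an explicit time unit: throughput time of 2 to 5 days
few_days <- patients %>%
  filter_throughput_time(interval = c(2, 5), units = "days")
```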
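Cases can also be filtered by supplying a time window to the method filter_time_period. There are four different filter methods, one of which can be passed as an argument: contained (cases that lie entirely within the time window), start (cases that started in the window), complete (cases that were completed in the window), and intersecting (cases with at least one event in the window).
The selection can also be reversed. Note that there is a fifth filter method, trim, but this is actually an event filter and will thus be discussed in the next section.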
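A sketch of the time period filter, assuming the window is passed as a length-two POSIXct vector through an interval argument and the method through filter_method; the dates are illustrative and lie within the range of the patients log shown earlier:

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_time_period()
library(eventdataR) # assumed source of the patients event log

# Illustrative time window (second half of 2017)
window <- as.POSIXct(c("2017-07-01", "2017-12-31"), tz = "UTC")

# Cases that lie entirely inside the window
inside <- patients %>%
  filter_time_period(interval = window, filter_method = "contained")

# Cases with at least one event inside the window
touching <- patients %>%
  filter_time_period(interval = window, filter_method = "intersecting")
```

Every contained case is also an intersecting case, so the second selection can never be smaller than the first.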
The last case filter can be used to filter cases based on the frequency of the corresponding trace. A trace is a sequence of activity labels, and will be discussed in more detail in Section \(\ref{mining-and-analysis-1}\). There are again two ways to select cases based on trace frequency: by interval or by percentile cut-off. The percentile cut-off starts with the most frequent traces. This filter also has the reverse argument.
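Both variants can be sketched as follows, assuming the function is filter_trace_frequency and the cut-off argument is named percentage; the interval bounds are illustrative:

```r
library(bupaR)      # event log classes, also provides the %>% pipe
library(edeaR)      # filter_trace_frequency()
library(eventdataR) # assumed source of the patients event log

# Percentile cut-off: the most frequent traces covering 90% of cases
frequent <- patients %>%
  filter_trace_frequency(percentage = 0.9)

# Interval: cases whose trace occurs at most 10 times in the log
rare_traces <- patients %>%
  filter_trace_frequency(interval = c(0, 10))

frequent %>% traces()
```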