This repository contains the source code for the paper Discovering Process Models from Unlabelled Event Logs presented at the 7th International Conference on Business Process Management (BPM 2009) in Ulm, Germany, September 7-10, 2009.
The approach is referred to as the Multiple Instance Model (MIM) and the source code comprises three main files:
-
mim.py
contains the Python classmodel
and the routines to estimate the model parameters, namely the transition matrix M, the global transition matrix M+ used to initialize M, the source sequence s, and the separated source sequences y(1…K) that can be obtained from the symbol sequence x and the source sequence s. -
mimest.py
contains the sample code for creating and estimating a multiple instance model from a symbol sequence x. The input sequence is read from standard input with one symbol per line. -
mimgen.py
contains useful code to generate a symbol sequence x from a distribution of sequences with different probabilities, using a number of instances and a number of overlapping instances as specified in the input parameters.
In general, only mim.py
and mimest.py
will be required. Basically, mimest.py
imports and invokes mim.py
in order to estimate a model. To test this code one must have a sample symbol sequence x and therefore mimgen.py
is provided as a utility program to generate such sample sequence. For this purpose, an additional example file is provided:
example.txt
contains a sample distribution to be used as input tomimgen.py
in order to create a symbol sequence x that can given to mimest.py.
Here is a tutorial example on how to use the code:
-
Start by looking at the contents of
example.txt
, which basically contains the behavior that will be used to create the output symbol sequence x. In each line there is a sequence and its corresponding probability. These probabilities should add up to 1.0, but the code inmimgen.py
will normalize the distribution even if this is not the case. -
Assuming that Python 2 is installed and available in the system, open a command line and run:
$ python mimgen.py 20 5 example.txt sequence.txt
In this example, were are asking for a symbol sequence to be generated from a set of 20 instances (i.e. 20 sequences drawn from the set of possible sequences found in
example.txt
) and using a maximum number of 5 overlapping instances (i.e. at any moment there will be no more than 5 active instances). The input sequence distribution is to be read fromexample.txt
and the output symbol sequence x is to be written tosequence.txt
. -
Check the output that
mimgen.py
produces on the screen. In this example, the results could look like something similar to the following picture. Note that the actual results may differ sincemimgen.py
is picking the sequences and interleaving them randomly.There are different sections in this output. The first couple of lines display the positions in the symbol sequence. Then the complete symbol sequence is shown as a sequence of characters. The middle section with the numbered and dotted lines contains the sequence of events for each instance, placed at the position where they occurred. At the end, the symbol sequence is repeated again for convenience. The two lines before the last show the source number and these must agree with the line numbers shown in the left of the figure. The very last line displays a count of the active sources at each position in the sequence; here it can be seen that the maximum number of overlapping sources (5) has been achieved twice in this run.
-
Check the contents of
sequence.txt
, which contains the symbol sequence x that has been generated bymimgen.py
. In this file there is one event per line as required bymimest.py
. Verify that the sequence insequence.txt
is the same as the one displayed in the screen output ofmimgen.py
. -
Run
mimest.py
on the generated symbol sequence:$ python mimest.py < sequence.txt
and check the results, which show the transition matrix M and the distribution of sequences that is obtained from the separate source sequences y(1...K). The symbols "o" and "x" represent the special start and stop states, respectively.
Note that due to imprecise assignments there may be sequences that were not present in the original log and also the number of detected sources may differ.
-
Repeat the above steps with different parameters and even different input sequences and probabilities in
example.txt
. In general, the higher the number of instances, the more accurate the results become. The number of overlapping instances also has some influence in the results, as it will be easier for the algorithm to discover the original sequences if this number is low.
From step 5 above it is possible to use the transition matrix M to obtain a graphical model such as this one:
In the paper, we introduce the G-metric to compare the behavior of the estimated model to that of the true model. We also present other results on the performance and accuracy of the algorithm for different logs and under different working assumptions.
-
Download the 4 files:
mim.py
,mimest.py
,mimgen.py
andexample.txt
. -
Run:
$ python mimgen.py 20 5 example.txt sequence.txt
Try different values for the number of instances and the overlap (in the example above, the values used were 20 and 5, respectively)
-
Run:
$ python mimest.py < sequence.txt
See the publisher's website to download a citation in the desired format.