What are the limitations of traditional process mining?
Although extremely powerful, traditional process mining techniques have, according to Prof. van der Aalst, the following limitations:
- Data extraction and transformation is painful and needs to be repeated.
- Interactions between objects are not captured.
- 3D reality is squeezed into 2D event logs and models.
To understand each of these limitations, it’s first necessary to understand how modern processes work and how the data associated with them is stored.
A classic process model assumes a single case notion: the assumption that each case (or process instance) happens in isolation. In event logs, like the one above, this single case notion shows up as exactly one case identifier (Case ID) per event.
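To make this concrete, here is a minimal sketch of such a flat event log in Python (the case IDs, activities and timestamps are invented for illustration):

```python
import pandas as pd

# A classic "flat" event log: every event refers to exactly one case.
# Case IDs, activities and timestamps are illustrative only.
event_log = pd.DataFrame(
    [
        {"case_id": "SO-1001", "activity": "Place order",   "timestamp": "2024-03-01 09:15"},
        {"case_id": "SO-1001", "activity": "Approve order", "timestamp": "2024-03-01 11:02"},
        {"case_id": "SO-1002", "activity": "Place order",   "timestamp": "2024-03-01 10:40"},
        {"case_id": "SO-1001", "activity": "Ship order",    "timestamp": "2024-03-03 16:20"},
    ]
)
event_log["timestamp"] = pd.to_datetime(event_log["timestamp"])

# Traditional process mining replays each case in isolation:
for case_id, trace in event_log.sort_values("timestamp").groupby("case_id"):
    print(case_id, "->", " -> ".join(trace["activity"]))
```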
This is not how events happen in the real world, nor is it how data is typically stored in the information systems from which event log data is pulled. For example, the sales department may track its processes through digital documents such as sales orders. When you apply traditional process mining tools to the sales order business process, you look at a single object, the sales order, and examine how it flows through the sales department. With this information you can identify areas for process improvement and implement process changes to close those execution gaps.
However, the sales department doesn’t operate in a vacuum. Events that occur during the sales process affect other departments, such as procurement, production, warehouse, distribution, finance and so on. Furthermore, different departments use different objects in their processes. Production could use production orders, finance could use invoices, procurement could use purchase orders and so on. All these objects are connected because the processes are connected.
Consider a customer who places an order for four items at the same time. One item is in stock and the other three need to be manufactured. For the items that need to be manufactured, three production orders are created. The in-stock item is packaged and shipped on its own, while the remaining items are packaged together and shipped later once they have been produced. The customer is then sent a single invoice.
https://delivery-p141552-e1488202.adobeaemcloud.com/adobe/assets/urn:aaid:aem:ffa7b493-1703-4c1f-b20f-1da764b1991c/as/what_is_ocpm_multiple_objects.png
Processes usually involve multiple objects (e.g., sales orders, sales order items, production orders, shipments and invoices) that are all related to each other.
This example illustrates how a single event, the creation of a sales order, can involve multiple objects (customers, sales orders, sales order items, production orders, shipments and invoices) that are all related to each other. For example, a customer can have multiple sales orders (one-to-many relationship). Likewise, a sales order can contain multiple sales order items and a sales order item can be included on multiple sales orders (many-to-many relationship). To handle these relationships, ERP, CRM, SCM and other business systems often store information in what’s known as a relational database.
In a relational model, data is organized into tables of rows and columns. Rows are called records, columns are called attributes and tables are called relations. For example, in a customer table, each row could correspond to an individual customer record and each column would be an attribute such as name, address, phone number and the like. Returning to our sales order example above, a sales order table could have a row/record for each order and columns/attributes for information such as the order number, date and time, customer, items ordered, price for each item, billing address, shipping address, shipment method and so on.
Each row within a table of a relational database has a unique identifier called a key. For example, on a customer table, each customer would be assigned a distinct customer ID that would be used as the table’s primary key (PK). On the sales order table, the primary key could be a sales order ID. Each time a new record is written to a table, a new primary key value is generated for that record.
Primary keys are how the system accesses information in an individual table, and they are also used to express the relationships between tables. For example, rows in the customer table can be linked to rows in the sales order table by adding a customer ID column to the sales order table. Each time a customer makes a new purchase, a record is added to the sales order table, and the customer’s unique customer ID is added as an attribute called a foreign key (the primary key of a linked row in another table). This model works well for one-to-many relationships, where many rows in one table carry a column value that matches the primary key of a single row in another table.
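As a rough sketch of how this looks in practice, the following snippet builds a tiny in-memory database with a customer table and a sales order table linked by a foreign key (all table names, column names and values are invented for illustration, not any particular system’s schema):

```python
import sqlite3

# Minimal sketch of a one-to-many relationship: one customer, many sales orders.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,  -- primary key (PK)
        name        TEXT
    );
    CREATE TABLE sales_order (
        sales_order_id INTEGER PRIMARY KEY,                       -- PK of this table
        customer_id    INTEGER REFERENCES customer(customer_id),  -- foreign key (FK)
        order_date     TEXT
    );
""")
conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO sales_order VALUES (1001, 1, '2024-03-01')")
conn.execute("INSERT INTO sales_order VALUES (1002, 1, '2024-03-07')")

# The foreign key lets us find every order belonging to a customer.
rows = conn.execute("""
    SELECT c.name, o.sales_order_id
    FROM customer c JOIN sales_order o ON o.customer_id = c.customer_id
""").fetchall()
print(rows)  # [('Acme Corp', 1001), ('Acme Corp', 1002)]
```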
To account for many-to-many relationships, the primary keys from the related tables are often copied to an additional table that resolves the relationship between the two entity tables. Consider, for example, the many-to-many relationship between sales orders and sales order items outlined above. Here you could have a sales order table, a sales order item table and a third resolution table. The resolution table would contain, at the very least, a column for each of the foreign keys from the sales order and sales order item tables (the primary keys of those tables), plus a composite primary key built from the two foreign keys.
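Continuing the sketch, a minimal resolution table for the order-to-item relationship might look like this (again, all names and IDs are illustrative assumptions):

```python
import sqlite3

# Minimal sketch of a many-to-many relationship resolved through a link table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales_order      (sales_order_id INTEGER PRIMARY KEY);
    CREATE TABLE sales_order_item (item_id        INTEGER PRIMARY KEY, description TEXT);

    -- Resolution table: one row per (order, item) pair.
    CREATE TABLE sales_order_item_link (
        sales_order_id INTEGER REFERENCES sales_order(sales_order_id),
        item_id        INTEGER REFERENCES sales_order_item(item_id),
        PRIMARY KEY (sales_order_id, item_id)  -- composite PK built from the two FKs
    );
""")
conn.executemany("INSERT INTO sales_order VALUES (?)", [(1001,), (1002,)])
conn.executemany("INSERT INTO sales_order_item VALUES (?, ?)",
                 [(1, "Widget"), (2, "Gadget")])
# The same item can appear on several orders, and an order can contain several items.
conn.executemany("INSERT INTO sales_order_item_link VALUES (?, ?)",
                 [(1001, 1), (1001, 2), (1002, 1)])

rows = conn.execute("""
    SELECT l.sales_order_id, i.description
    FROM sales_order_item_link l
    JOIN sales_order_item i ON i.item_id = l.item_id
    ORDER BY l.sales_order_id, i.item_id
""").fetchall()
print(rows)  # [(1001, 'Widget'), (1001, 'Gadget'), (1002, 'Widget')]
```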
In our simple sales orders and sales order items example, there are at least three tables needed to define the relationship. In the real world, it takes a lot more. Think back to the sales order process outlined above. A single sales order is related to four sales order items, three production orders, two shipments and one invoice. Each of these objects is related to other objects, such as customers, suppliers, shippers and so on. Therefore, it shouldn’t be hard to see why modern business information systems, such as an ERP system, use databases with hundreds, thousands and even hundreds of thousands of tables.
We can now consider the three limitations of traditional process mining noted earlier:
- Data extraction and transformation is painful and needs to be repeated. Classic process mining requires you to pull data from the multi-table, relational databases of the source information system(s) into a flat event log (i.e., a table) where each event (i.e., row) refers to a case, an activity and a timestamp. In our sales order example, for instance, each case would be a specific sales order, and we could ask a question like, “What’s the most common reason for a blocked order?” However, if we wanted to ask a question such as, “Which customers are paying their invoices on time?” we would need a different event log with invoices, not sales orders, as the case notion. Thus, we would need to go back to the data and make a new extraction, which can be a time-consuming and painful process (see the sketch after this list).
- Interactions between objects are not captured. When you extract data from a relational database and flatten it into a process mining event log where each event refers to exactly one case, you lose the interactions between objects defined in the source data. Likewise, any process models you generate from the flattened data only describe the lifecycle of a case in isolation. To get complete process transparency, you must understand the relationships between all the different objects and object types involved in the process.
- Three-dimensional reality is squeezed into two-dimensional event logs and models. In traditional process mining, an event log is created for each object type and analyzed separately. For example, we create an event log for sales orders to examine the sales order process, an event log for customer invoices to look at the accounts receivable process, a log for vendor invoices to analyze accounts payable and an event log for purchase orders for the procurement process. Even if these flat event logs are linked through a related cases data model, we’re still squeezing three-dimensional, object-centric data and models into two-dimensional, case-centric event logs and process models.
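The following sketch illustrates the repeated-extraction problem from the first bullet: the same hypothetical source data has to be flattened once with the sales order and once with the invoice as the case notion (all object IDs, activities and dates are invented):

```python
import pandas as pd

# Hypothetical extract from a relational source: each event is linked to one or
# more object IDs.
events = pd.DataFrame(
    [
        {"activity": "Place order",     "timestamp": "2024-03-01", "sales_order": "SO-1", "invoice": None},
        {"activity": "Create invoice",  "timestamp": "2024-03-05", "sales_order": "SO-1", "invoice": "INV-9"},
        {"activity": "Receive payment", "timestamp": "2024-03-20", "sales_order": None,   "invoice": "INV-9"},
    ]
)

def flatten(events: pd.DataFrame, case_column: str) -> pd.DataFrame:
    """Build a classic flat event log using the given column as the case notion."""
    log = events.dropna(subset=[case_column]).rename(columns={case_column: "case_id"})
    return log[["case_id", "activity", "timestamp"]]

# Each question needs its own extraction with its own case notion:
order_log   = flatten(events, "sales_order")  # e.g. "Why are orders blocked?"
invoice_log = flatten(events, "invoice")      # e.g. "Which invoices are paid on time?"
print(order_log, invoice_log, sep="\n\n")
```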
This data flattening leads to the problems of convergence and divergence.
Convergence happens when one event is replicated across different cases, possibly leading to unintentional duplication. Consider the sales order process example outlined above: if we select the sales order item as the case notion, events at the sales order level (e.g., place order) are duplicated for every sales order item related to a specific sales order. This replication of events can cause misleading diagnostics.
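A small sketch of convergence, using invented identifiers: one order-level event is copied into every item-level case, inflating the event counts.

```python
import pandas as pd

# Convergence sketch: sales order SO-1 has four items, but "Place order"
# happened only once.
order_items = ["ITEM-1", "ITEM-2", "ITEM-3", "ITEM-4"]
place_order = {"activity": "Place order", "timestamp": "2024-03-01", "sales_order": "SO-1"}

# Flattening with the sales order item as the case notion replicates the event:
flattened = pd.DataFrame([{"case_id": item, **place_order} for item in order_items])

print(len(flattened))                        # 4 -- one copy of "Place order" per item case
print(flattened["activity"].value_counts())  # activity frequencies are now inflated
```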
Divergence happens when multiple instances of the same activity occur within a single case. Using our sales order process example from above, if we select the sales order as the case notion, events at the sales order item level (e.g., pick item) become indistinguishable from one another and seem related even when they are not. Although the pick item and pack item events happen only once per item and in a fixed sequence (you pick the item, then pack it), within a flattened sales order event log these events may appear to happen many times and in a seemingly random order for an individual case.
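And a matching sketch of divergence, again with invented identifiers: once the item ID is dropped, the per-item pick and pack events repeat inside a single order-level case.

```python
import pandas as pd

# Divergence sketch: each item is picked once and then packed once, but the
# flattened order-level log loses the item ID.
item_events = pd.DataFrame(
    [
        {"sales_order": "SO-1", "item": "ITEM-1", "activity": "Pick item", "timestamp": "2024-03-02 09:00"},
        {"sales_order": "SO-1", "item": "ITEM-1", "activity": "Pack item", "timestamp": "2024-03-02 09:20"},
        {"sales_order": "SO-1", "item": "ITEM-2", "activity": "Pick item", "timestamp": "2024-03-02 09:30"},
        {"sales_order": "SO-1", "item": "ITEM-2", "activity": "Pack item", "timestamp": "2024-03-02 09:45"},
    ]
)

# Flatten with the sales order as the case notion; the item column is dropped.
flat = item_events.rename(columns={"sales_order": "case_id"})[["case_id", "activity", "timestamp"]]

# The trace for SO-1 now shows "Pick item" and "Pack item" twice each, and it is
# no longer possible to tell which pick belongs to which pack.
print(flat.sort_values("timestamp")["activity"].tolist())
# ['Pick item', 'Pack item', 'Pick item', 'Pack item']
```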