Marco Bjarne Schuster, Boris Wiegand, and Jilles Vreeken. “Data is Moody: Discovering Data Modification Rules from Process Event Logs.” Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2024.
Although event logs are a powerful source to gain insight into the behavior of the underlying business process, existing work primarily focuses on finding patterns in the activity sequences of an event log, while ignoring event attribute data. Event attribute data has mostly been used to predict event occurrences and process outcome, but the state of the art neglects to mine succinct and interpretable rules describing how event attribute data changes during process execution. Subgroup discovery and rule-based classification approaches lack the ability to capture the sequential dependencies present in event logs, and thus lead to unsatisfactory results with limited insight into the process behavior.
Given an event log, we aim to find accurate yet succinct and interpretable if-then rules that describe how the process modifies data. We formalize the problem in terms of the Minimum Description Length (MDL) principle, by which we choose the model with the best lossless description of the data. Additionally, we propose the greedy Moody algorithm to efficiently search for rules. Through extensive experiments on both synthetic and real-world data, we show Moody indeed finds compact and interpretable rules, needs little data for accurate discovery, and is robust to noise.
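For readers unfamiliar with MDL, the generic two-part form of the principle can be written as follows. This is the textbook objective only; the symbols M (a candidate rule set), the model class \mathcal{M}, and D (the event log) are notation assumed here, and the concrete encodings L(M) and L(D | M) used by Moody are defined in the paper itself.

```latex
% Two-part MDL: pick the model that minimizes the total encoded length,
% where L(M) is the cost in bits of describing the model and
% L(D \mid M) is the cost in bits of describing the data given the model.
M^{*} = \operatorname*{arg\,min}_{M \in \mathcal{M}} \; L(M) + L(D \mid M)
```

A rule is therefore only worth adding if the bits it saves in L(D | M) exceed the bits it costs in L(M), which is what keeps the resulting rule sets succinct.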
Boris Wiegand, Dietrich Klakow, and Jilles Vreeken. “What are the Rules? Discovering Constraints from Data.” Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2024.
Constraint programming and AI planning are powerful tools for solving assignment, optimization, and scheduling problems. They require, however, the rarely available combination of domain knowledge and mathematical modeling expertise. Learning constraints from exemplary solutions can close this gap and alleviate the effort of modeling. Existing approaches either require extensive user interaction, need exemplary invalid solutions that must be generated by experts at great expense, or show high noise-sensitivity.
We aim to find constraints from potentially noisy solutions, without the need for user interaction. To this end, we formalize the problem in terms of the Minimum Description Length (MDL) principle, by which we select the model with the best lossless compression of the data. Solving the problem involves model counting, which is #P-hard to approximate. We therefore propose the greedy URPILS algorithm to find high-quality constraints in practice. Extensive experiments on constraint programming and AI planning benchmark data show URPILS not only finds more accurate and succinct constraints, but is also more robust to noise and has lower sample complexity than the state of the art.
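To see why model counting enters an MDL score for constraints, consider the following toy sketch. It assumes, purely for illustration, that a solution is encoded with a uniform code over all assignments that satisfy the current constraints, so its cost is the base-2 logarithm of the model count. The function names and the Boolean setting are hypothetical and are not taken from URPILS; the brute-force enumeration is exponential in the number of variables, which hints at why exact counting does not scale.

```python
import math
from itertools import product


def count_models(n_vars, constraints):
    """Count Boolean assignments over n_vars variables satisfying all constraints.

    Brute force over all 2**n_vars assignments; only feasible for tiny examples.
    """
    return sum(
        all(c(x) for c in constraints)
        for x in product((0, 1), repeat=n_vars)
    )


def bits_per_solution(n_vars, constraints):
    """Bits to name one solution with a uniform code over all satisfying assignments.

    Assumed encoding for illustration. Raises ValueError if no assignment
    satisfies the constraints (log2 of zero).
    """
    return math.log2(count_models(n_vars, constraints))


# Hypothetical constraints over three Boolean variables x[0], x[1], x[2].
differ = lambda x: x[0] != x[1]
same = lambda x: x[2] == x[0]

unconstrained = bits_per_solution(3, [])
one_constraint = bits_per_solution(3, [differ])
two_constraints = bits_per_solution(3, [differ, same])
```

Under this assumed encoding, every correct constraint shrinks the set of admissible solutions and thus the bits needed per observed solution, while a constraint violated by noisy data would have to be paid for separately.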
Boris Wiegand, Dietrich Klakow, and Jilles Vreeken. “Why Are We Waiting? Discovering Interpretable Models for Predicting Sojourn and Waiting Times.” Proceedings of the SIAM International Conference on Data Mining (SDM), 2023.
Boris Wiegand, Dietrich Klakow, and Jilles Vreeken. “Discovering Interpretable Data-to-Sequence Generators.” Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2022.
We study the problem of predicting an event sequence given some metadata. In particular, we are interested in learning easily interpretable models that can accurately generate a sequence based on an attribute vector. To this end, we propose to learn a sparse event-flow graph over the training sequences, and statistically robust rules that use metadata to determine which paths to follow.
We formalize the problem in terms of the Minimum Description Length (MDL) principle, by which we identify the best model as the one that compresses the data best. As the resulting optimization problem is NP-hard, we propose the efficient CONSEQUENCE algorithm to discover good event-flow graphs from data.
Through an extensive set of experiments including a case study, we show that it ably discovers compact, interpretable and accurate models for the generation and prediction of event sequences from data, has a low sample complexity, and is particularly robust against noise.
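The idea of an event-flow graph whose paths are selected by rules over an attribute vector can be sketched as below. This is a minimal illustration under assumed names: the graph, the event labels, the "thickness" attribute, and the first-matching-rule semantics are all hypothetical and do not reproduce the model class or the CONSEQUENCE algorithm from the paper.

```python
from typing import Callable, Dict, List, Tuple

Attributes = Dict[str, float]
# A rule pairs a condition on the attribute vector with the next event to emit.
Rule = Tuple[Callable[[Attributes], bool], str]

# Hypothetical event-flow graph: each node lists its outgoing rules in order;
# the first rule whose condition holds decides which edge to follow.
GRAPH: Dict[str, List[Rule]] = {
    "start": [(lambda x: True, "cut")],
    "cut": [
        (lambda x: x["thickness"] > 5.0, "reheat"),
        (lambda x: True, "inspect"),  # default edge
    ],
    "reheat": [(lambda x: True, "inspect")],
    "inspect": [(lambda x: True, "end")],
}


def generate(attrs: Attributes,
             graph: Dict[str, List[Rule]] = GRAPH,
             max_len: int = 50) -> List[str]:
    """Walk the graph from 'start' to 'end', choosing edges by the rules."""
    sequence: List[str] = []
    node = "start"
    # max_len guards against graphs that contain loops.
    while node != "end" and len(sequence) < max_len:
        node = next(target for cond, target in graph[node] if cond(attrs))
        if node != "end":
            sequence.append(node)
    return sequence


thick = generate({"thickness": 8.0})
thin = generate({"thickness": 2.0})
```

Because each decision is a readable condition attached to a single edge, a domain expert can trace why a given attribute vector leads to a given sequence, which is the interpretability argument made above.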
Boris Wiegand, Dietrich Klakow, and Jilles Vreeken. “Mining Easily Understandable Models from Complex Event Logs.” Proceedings of the SIAM International Conference on Data Mining (SDM), 2021.
We consider the problem of discovering accurate, yet easily understandable graph-based models from complex event sequence data. Real-world event data, such as production logs, exhibit complex behaviors. These include sequences, choices, loops, optionals, and combinations thereof that make it hard to gain insight into what is going on, and how we can improve the process. Current approaches do not solve this problem satisfactorily, as their modeling language is too restricted to capture complex behavior or they return models that are still too difficult to understand.
We formalize the problem in terms of the Minimum Description Length (MDL) principle, by which we say that the best model provides the shortest lossless description of the data. The resulting problem is NP-hard, and hence we propose the greedy Proseqo algorithm to discover good models from data. Proseqo iteratively simplifies the current description by removing nodes, edges, and applying patterns, until MDL tells us to stop. For cases where this result is still too complex, we propose Prosimple, which iteratively removes further edges until we satisfy a user-specified threshold.
Through an extensive set of experiments, we show both methods perform very well in practice. They return simple models that reconstruct the ground truth well, need only little data to do so, are robust against noise, and scale well. A case study shows that, unlike the state of the art, we discover easily understandable models that capture the key aspects of the data generation process.
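The general shape of a greedy, MDL-guided simplification loop can be sketched as follows. This is a toy under stated assumptions, not Proseqo: the model is reduced to a set of directly-follows edges, only edge removal is considered, and the encoding in total_bits (two node names per edge, a one-bit flag plus an empirical or uniform code per transition) is invented here for illustration.

```python
import math
from collections import Counter


def transitions(seqs):
    """Yield every pair of consecutive events in the log."""
    for s in seqs:
        yield from zip(s, s[1:])


def total_bits(edges, seqs, alphabet):
    """Toy two-part cost: bits for the edge set plus bits for the log given it."""
    counts = Counter(transitions(seqs))
    n = len(alphabet)
    # Model cost: name source and target of every edge.
    model_bits = len(edges) * 2 * math.log2(n)
    # Outgoing mass per node, restricted to edges kept in the model.
    out_total = Counter()
    for (a, b), c in counts.items():
        if (a, b) in edges:
            out_total[a] += c
    data_bits = 0.0
    for (a, b), c in counts.items():
        if (a, b) in edges:
            # Modeled transition: flag bit plus code from its empirical probability.
            data_bits += c * (1 - math.log2(c / out_total[a]))
        else:
            # Unmodeled transition: flag bit plus uniform code over all events.
            data_bits += c * (1 + math.log2(n))
    return model_bits + data_bits


def greedy_simplify(seqs):
    """Remove the most beneficial edge per round until the total cost stops shrinking."""
    alphabet = sorted({e for s in seqs for e in s})
    edges = set(transitions(seqs))
    best = total_bits(edges, seqs, alphabet)
    while edges:
        # sorted() makes tie-breaking deterministic.
        cand = min(sorted(edges),
                   key=lambda e: total_bits(edges - {e}, seqs, alphabet))
        cost = total_bits(edges - {cand}, seqs, alphabet)
        if cost >= best:
            break  # no removal compresses better: the score says stop
        edges.remove(cand)
        best = cost
    return edges, best


# Twenty regular traces and one noisy trace.
log = [list("abc")] * 20 + [list("acb")]
model, bits = greedy_simplify(log)
print(sorted(model))
```

On this toy log the loop drops the two edges that occur only in the single noisy trace and keeps a -> b and b -> c, since encoding the rare transitions as exceptions is cheaper than keeping them in the model. A threshold-based second stage in the spirit of Prosimple would continue removing edges past this stopping point until a user-given size limit is met.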