Interpretable Sequence Classification via Discrete Optimization

Maayan Shvo
Andrew C. Li
Rodrigo Toro Icarte
Keywords:
early classification; discrete optimization; percentage convergence accuracy; Discrete Optimization for Interpretable Sequence Classification (DISC); context-free grammars

Abstract:

Sequence classification is the task of predicting a class label given a sequence of observations. In many applications such as healthcare monitoring or intrusion detection, early classification is crucial to prompt intervention. In this work, we learn sequence classifiers that favour early classification from an evolving observation trace.

Introduction
Highlights
  • Sequence classification—the task of predicting a class label given a sequence of observations—has a myriad of applications, including biological sequence classification (e.g., Deshpande and Karypis 2002), document classification (e.g., Sebastiani 2002), and intrusion detection (e.g., Lane and Brodley 1999)
  • While our approach does not scale as well as gradient-based optimization, our use of prefix trees significantly reduces the size of the discrete optimization problem, allowing us to tackle real-world datasets with nearly 100,000 observation tokens, as demonstrated in our experiments
  • We considered three goal recognition domains: Crystal Island (Ha et al 2011; Min et al 2016), a narrative-based game where players solve a science mystery; ALFRED (Shridhar et al 2020), a virtual-home environment where an agent can interact with various household items and perform a myriad of tasks; and MIT Activity Recognition (MIT-AR) (Tapia, Intille, and Larson 2004), comprised of noisy, real-world sensor data with labelled activities in a home setting
  • LSTM displayed an advantage over Discrete Optimization for Interpretable Sequence Classification (DISC) on datasets with long traces. n-gram models excelled in some low-data settings but performed poorly overall, as they fail to model long-term dependencies
  • In this paper we proposed a method to address this class of problems by combining the learning of Deterministic Finite Automata (DFA) sequence classifiers via Mixed Integer Linear Programming (MILP) with Bayesian inference
  • The resulting DFA classifiers offer a set of interpretability services that include explanation, counterfactual reasoning, verification of properties, and human modification
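The DFA sequence classifiers described above can be pictured with a minimal sketch. This is illustrative Python, not the authors' implementation; the class name, the toy transition table, and the "alarm" alphabet are all assumptions made for the example. The key idea is that the DFA reads one observation token at a time and emits a yes/no decision from whatever state it currently occupies, which is what makes the classifier's reasoning inspectable.

```python
# Minimal sketch of a DFA sequence classifier (illustrative, not the
# paper's implementation). The DFA consumes one observation token at a
# time and emits a yes/no decision based on the current state.

class DFAClassifier:
    def __init__(self, transitions, accepting, start="q0"):
        self.transitions = transitions  # (state, token) -> next state
        self.accepting = accepting      # set of accepting states
        self.start = start

    def decisions(self, trace):
        """Yield a yes/no decision after each observation in the trace."""
        state = self.start
        for token in trace:
            state = self.transitions[(state, token)]
            yield state in self.accepting

# Toy two-state DFA: predict "yes" once an "alarm" token has been seen.
dfa = DFAClassifier(
    transitions={("q0", "ok"): "q0", ("q0", "alarm"): "q1",
                 ("q1", "ok"): "q1", ("q1", "alarm"): "q1"},
    accepting={"q1"},
)
print(list(dfa.decisions(["ok", "ok", "alarm", "ok"])))
# -> [False, False, True, True]
```

Because every prediction is tied to a named state and an explicit transition table, the interpretability services mentioned above (explanation, counterfactual reasoning, verification, modification) amount to inspecting or editing this table.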
Methods
  • Cumulative convergence accuracy (CCA) at time t is defined as the percentage of traces τ that are correctly classified given min(t, |τ |) observations.
  • The authors considered three goal recognition domains: Crystal Island (Ha et al 2011; Min et al 2016), a narrative-based game where players solve a science mystery; ALFRED (Shridhar et al 2020), a virtual-home environment where an agent can interact with various household items and perform a myriad of tasks; and MIT Activity Recognition (MIT-AR) (Tapia, Intille, and Larson 2004), comprised of noisy, real-world sensor data with labelled activities in a home setting.
  • Crystal Island and MIT-AR are challenging as subjects may pursue goals non-deterministically
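The cumulative convergence accuracy metric defined above can be sketched in a few lines. The classifier, traces, and labels below are toy assumptions chosen only to exercise the metric; `predict` stands in for any model that maps an observation prefix to a class label.

```python
def cumulative_convergence_accuracy(predict, traces, labels, t):
    """CCA at time t: fraction of traces correctly classified from their
    first min(t, |trace|) observations. `predict` maps a prefix to a label."""
    correct = 0
    for trace, label in zip(traces, labels):
        prefix = trace[:min(t, len(trace))]
        if predict(prefix) == label:
            correct += 1
    return correct / len(traces)

# Toy setup: predict class 1 iff the prefix contains an "a" token.
predict = lambda prefix: int("a" in prefix)
traces = [["b", "a"], ["b", "b"], ["a"]]
labels = [1, 0, 1]
print(cumulative_convergence_accuracy(predict, traces, labels, 1))  # 2 of 3 correct
print(cumulative_convergence_accuracy(predict, traces, labels, 2))  # all correct
```

Sweeping t from 1 to the maximum trace length produces the convergence curves reported in the experiments, since CCA at large t reduces to full-trace accuracy.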
Results
  • Detailed results for StarCraft, MIT-AR, Crystal Island, and BatteryLow are shown in Figure 3, while a summary of results from all domains is provided in Table 1.

    DISC generally outperformed n-gram, HMM, and DFA-FT, achieving near-LSTM performance on most domains.
  • The results show that DISC has strong performance on each domain, matched only by LSTM.
  • DISC produces robust confidences in its predictions; the confidence it reports accurately approximates its true predictive accuracy.
Conclusion
  • The authors described an approach to learning DFAs for sequence classification based on mixed integer linear programming.
  • While DISC makes a Markov assumption similar to that of an HMM (that the information in any prefix of a trace can be captured by a single state), DISC considers only discrete state transitions, does not model an observation emission distribution, and regularizes the size of the model
  • The authors believe these were important factors in handling noise in the data.
  • DISC achieves performance similar to LSTMs and superior to HMMs and n-grams on a set of synthetic and real-world datasets, with the important advantage of being interpretable
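The pairing of a learned DFA with Bayesian inference can be illustrated with a hedged sketch. This is not the paper's exact formulation: the training-time state counts, the Laplace smoothing, and the state and class names below are all illustrative assumptions. The idea shown is only that reaching a given DFA state with a prefix provides evidence that can be combined with a class prior via Bayes' rule.

```python
# Hedged sketch of a Bayesian posterior over class labels given the DFA
# state reached by the current observation prefix (illustrative, not the
# authors' formulation). counts[c][q] is assumed to record how many
# training traces of class c pass through state q.

def posterior(counts, prior, state):
    """P(class | state) proportional to P(class) * P(state | class),
    with add-one (Laplace) smoothing over unseen states."""
    scores = {}
    for c, p in prior.items():
        total = sum(counts[c].values())
        likelihood = (counts[c].get(state, 0) + 1) / (total + len(counts[c]) + 1)
        scores[c] = p * likelihood
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

# Toy counts: traces labelled "goal" mostly reach q2; "no-goal" never does.
counts = {"goal": {"q0": 2, "q2": 8}, "no-goal": {"q0": 7, "q3": 3}}
prior = {"goal": 0.5, "no-goal": 0.5}
post = posterior(counts, prior, "q2")
print(max(post, key=post.get))  # most probable class after reaching q2
```

A posterior of this kind is what lets the classifier attach a confidence to every prefix, which in turn supports the early-classification trade-off studied in the experiments.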
Summary
  • Introduction:

    Sequence classification—the task of predicting a class label given a sequence of observations—has a myriad of applications, including biological sequence classification (e.g., Deshpande and Karypis 2002), document classification (e.g., Sebastiani 2002), and intrusion detection (e.g., Lane and Brodley 1999).
  • In hospital neonatal intensive care units, early diagnosis of infants with sepsis can be life-saving (Griffin and Moorman 2001)
  • Neural networks such as LSTMs (Hochreiter and Schmidhuber 1997), learned via gradient descent, are natural and powerful sequence classifiers (e.g., Zhou et al 2015; Karim et al 2019), but the rationale for classification can be difficult for a human to discern.
  • Methods:

    Cumulative convergence accuracy (CCA) at time t is defined as the percentage of traces τ that are correctly classified given min(t, |τ |) observations.
  • The authors considered three goal recognition domains: Crystal Island (Ha et al 2011; Min et al 2016), a narrative-based game where players solve a science mystery; ALFRED (Shridhar et al 2020), a virtual-home environment where an agent can interact with various household items and perform a myriad of tasks; and MIT Activity Recognition (MIT-AR) (Tapia, Intille, and Larson 2004), comprised of noisy, real-world sensor data with labelled activities in a home setting.
  • Crystal Island and MIT-AR are challenging as subjects may pursue goals non-deterministically
  • Results:

    Detailed results for StarCraft, MIT-AR, Crystal Island, and BatteryLow are shown in Figure 3, while a summary of results from all domains is provided in Table 1.

    DISC generally outperformed n-gram, HMM, and DFA-FT, achieving near-LSTM performance on most domains.
  • The results show that DISC has strong performance on each domain, matched only by LSTM.
  • DISC produces robust confidences in its predictions; the confidence it reports accurately approximates its true predictive accuracy.
  • Conclusion:

    The authors described an approach to learning DFAs for sequence classification based on mixed integer linear programming.
  • While DISC makes a Markov assumption similar to that of an HMM (that the information in any prefix of a trace can be captured by a single state), DISC considers only discrete state transitions, does not model an observation emission distribution, and regularizes the size of the model
  • The authors believe these were important factors in handling noise in the data.
  • DISC achieves performance similar to LSTMs and superior to HMMs and n-grams on a set of synthetic and real-world datasets, with the important advantage of being interpretable
Tables
  • Table1: A summary of results from all domains (DISC is our approach). With respect to the full dataset, N is the total number of traces, and |τ | is the average length of a trace. Reported are the percentages of traces correctly classified given the full observation trace, with 90% confidence error in parentheses. Highest accuracy is bolded
  • Table2: The average number of DFA states (first), and the average number of state transitions (second) in learned models for DISC (ours) and DFA-FT over twenty runs, using the experimental procedure in Section C.1
  • Table3: Results for the early classification experiment. Average utility per trace over twenty runs is reported with 90% confidence error, with the best mean performance in each row in bold
Related work
  • We build on the large body of work concerned with learning automata from sets of traces (e.g., Gold 1967; Angluin 1987; Oncina and Garcia 1992; Carmel and Markovitch 1996; Heule and Verwer 2010; Ulyantsev, Zakirzyanov, and Shalyto 2015; Angluin, Eisenstat, and Fisman 2015; Giantamidis and Tripakis 2016; Smetsers, Fiterau-Brostean, and Vaandrager 2018). Previous approaches to learning such automata have typically constructed the prefix tree from a set of traces and employed heuristic methods or SAT solvers to minimize the resulting automaton. Here we follow a similar approach, but instead specify and realize a MILP model that is guaranteed to find optimal solutions given enough time; optimizes for a different objective function than those commonly used by previous work (see Section 3); does not assume noise-free traces or prior knowledge of the problem (e.g., a set of DFA templates); and introduces new forms of regularization.

    Some work in automata learning has also shown (some) robustness to noisy data. For instance, Xue et al (2015) combine domain-specific knowledge with domain-independent automata learning techniques and learn minimal DFAs that capture malware behaviour, with empirical results suggesting a degree of robustness to noisy data. While we eschew domain knowledge in this work, our approach allows for domain knowledge to be incorporated during the learning process. Ulyantsev, Zakirzyanov, and Shalyto (2015) also work with noisy data, but their SAT-based model assumes that at most k training instances have wrong labels, which is not a natural hyperparameter in machine learning, and does not support regularization.
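The prefix-tree construction mentioned above, which also underpins the size reduction of the discrete optimization problem, can be sketched with toy code. This is illustrative, not the authors' implementation; the point is only that merging shared prefixes means the optimization reasons over one node per distinct prefix rather than one unit per raw observation token.

```python
# Sketch of prefix-tree compression (illustrative, not the authors'
# implementation): traces sharing a prefix share the corresponding nodes,
# so the discrete optimization problem grows with the number of distinct
# prefixes rather than the total number of observation tokens.

def prefix_tree_size(traces):
    """Count distinct non-empty prefixes across all traces."""
    prefixes = set()
    for trace in traces:
        for i in range(1, len(trace) + 1):
            prefixes.add(tuple(trace[:i]))
    return len(prefixes)

traces = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"]]
total_tokens = sum(len(t) for t in traces)     # 9 raw tokens
print(total_tokens, prefix_tree_size(traces))  # tree has 4 nodes: a, ab, abc, abd
```

On datasets with many overlapping traces the gap between raw token count and prefix count can be large, which is consistent with the claim above about scaling to nearly 100,000 observation tokens.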
Funding
  • We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chairs Program, and Microsoft Research
  • Rodrigo also gratefully acknowledges funding from ANID (Becas Chile)
  • Resources used in preparing this research were provided, in part, by the Province of Ontario, the Government of Canada through CIFAR, and companies sponsoring the Vector Institute for Artificial Intelligence www.vectorinstitute.ai/partners
Study subjects and analysis
A decision is provided after each incoming observation based on the current state: yes for the blue accepting state, and no for red, non-accepting states. For example, on the trace (B, H2, H1, …) the DFA would transition through the states (q0, q3, q0, q2), producing the decision no after each of the first three observations and yes after the fourth. Note that this learned DFA leverages biases in the data—namely, that in the training data the agent only pursues optimal paths

Reference
  • Amado, L.; Pereira, R. F.; Aires, J.; Magnaguagno, M.; Granada, R.; and Meneguzzi, F. 2018. Goal recognition in latent space. In 2018 International Joint Conference on Neural Networks (IJCNN), 1–8. IEEE.
  • Angluin, D. 1987. Learning regular sets from queries and counterexamples. Information and Computation 75(2): 87–106.
  • Angluin, D.; Eisenstat, S.; and Fisman, D. 2015. Learning regular languages via alternating automata. In Twenty-Fourth International Joint Conference on Artificial Intelligence.
  • Bernardi, M. L.; Cimitile, M.; Distante, D.; Martinelli, F.; and Mercaldo, F. 2019. Dynamic malware detection and phylogeny analysis using process mining. International Journal of Information Security 18(3): 257–284.
  • Camacho, A.; and McIlraith, S. A. 2019. Learning interpretable models expressed in linear temporal logic. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 29, 621–630.
  • Carmel, D.; and Markovitch, S. 1996. Learning models of intelligent agents. In AAAI/IAAI, Vol. 1, 62–67.
  • Chang, A.; Bertsimas, D.; and Rudin, C. 2012. Ordered rules for classification: A discrete optimization approach to associative classification. Poster in Proceedings of Neural Information Processing Systems (NIPS).
  • Chomsky, N. 1956. Three models for the description of language. IRE Transactions on Information Theory 2(3): 113–124.
  • De la Higuera, C. 2010. Grammatical Inference: Learning Automata and Grammars. Cambridge University Press.
  • Deshpande, M.; and Karypis, G. 2002. Evaluation of techniques for classifying biological sequences. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 417–431. Springer.
  • Doshi-Velez, F.; and Kim, B. 2017. Towards a rigorous science of interpretable machine learning. arXiv preprint arXiv:1702.08608.
  • Dunning, T. 1994. Statistical identification of language. Computing Research Laboratory, New Mexico State University, Las Cruces, NM, USA.
  • Geib, C. W.; and Goldman, R. P. 2001. Plan recognition in intrusion detection systems. In Proceedings DARPA Information Survivability Conference and Exposition II (DISCEX'01), volume 1, 46–55. IEEE.
  • Geib, C. W.; and Kantharaju, P. 2018. Learning combinatory categorial grammars for plan recognition. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), 3007–3014.
  • Ghalwash, M. F.; Radosavljevic, V.; and Obradovic, Z. 2013. Extraction of interpretable multivariate patterns for early diagnostics. In 2013 IEEE 13th International Conference on Data Mining, 201–210. IEEE.
  • Giantamidis, G.; and Tripakis, S. 2016. Learning Moore machines from input-output traces. In International Symposium on Formal Methods, 291–309. Springer.
  • Gold, E. M. 1967. Language identification in the limit. Information and Control 10(5): 447–474.
  • Gold, E. M. 1978. Complexity of automaton identification from given data. Information and Control 37(3): 302–320.
  • Griffin, M. P.; and Moorman, J. R. 2001. Toward the early diagnosis of neonatal sepsis and sepsis-like illness using novel heart rate analysis. Pediatrics 107(1): 97–104.
  • Ha, E. Y.; Rowe, J. P.; Mott, B. W.; and Lester, J. C. 2011. Goal recognition with Markov logic networks for player-adaptive games. In Seventh Artificial Intelligence and Interactive Digital Entertainment Conference.
  • Harman, H.; and Simoens, P. 2020. Action graphs for proactive robot assistance in smart environments. Journal of Ambient Intelligence and Smart Environments 12(2): 79–99.
  • Heule, M. J.; and Verwer, S. 2010. Exact DFA identification using SAT solvers. In International Colloquium on Grammatical Inference, 66–79. Springer.
  • Hochreiter, S.; and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8): 1735–1780.
  • Hsu, E.-Y.; Liu, C.-L.; and Tseng, V. S. 2019. Multivariate time series early classification with interpretability using deep learning and attention mechanism. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, 541–553. Springer.
  • Huang, H.-S.; Liu, C.-L.; and Tseng, V. S. 2018. Multivariate time series early classification using multi-domain deep neural network. In 2018 IEEE 5th International Conference on Data Science and Advanced Analytics (DSAA), 90–98. IEEE.
  • Jünger, M.; Liebling, T. M.; Naddef, D.; Nemhauser, G. L.; Pulleyblank, W. R.; Reinelt, G.; Rinaldi, G.; and Wolsey, L. A. 2009. 50 Years of Integer Programming 1958–2008: From the Early Years to the State-of-the-Art. Springer Science & Business Media.
  • Kantharaju, P.; Ontanon, S.; and Geib, C. W. 2019. Scaling up CCG-based plan recognition via Monte-Carlo tree search. In Proceedings of the IEEE Conference on Games 2019.
  • Karim, F.; Majumdar, S.; Darabi, H.; and Harford, S. 2019. Multivariate LSTM-FCNs for time series classification. Neural Networks 116: 237–245.
  • Kautz, H. A.; and Allen, J. F. 1986. Generalized plan recognition. In Fifth National Conference on Artificial Intelligence (AAAI), 32–37.
  • Kim, J.; Muise, C.; Shah, A.; Agarwal, S.; and Shah, J. 2019. Bayesian inference of linear temporal logic specifications for contrastive explanations. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), volume 776.
  • Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kupiec, J. 1992. Robust part-of-speech tagging using a hidden Markov model. Computer Speech & Language 6(3): 225–242.
  • Lane, T.; and Brodley, C. E. 1999. Temporal sequence learning and data reduction for anomaly detection. ACM Transactions on Information and System Security (TISSEC) 2(3): 295–331.
  • Levenshtein, V. I. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, volume 10, 707–710.
  • Liu, J.; Shahroudy, A.; Xu, D.; and Wang, G. 2016. Spatio-temporal LSTM with trust gates for 3D human action recognition. In European Conference on Computer Vision, 816–833. Springer.
  • Ma, S.; Sigal, L.; and Sclaroff, S. 2016. Learning activity progression in LSTMs for activity detection and early detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1942–1950.
  • McDermott, D.; Ghallab, M.; Howe, A.; Knoblock, C.; Ram, A.; Veloso, M.; Weld, D.; and Wilkins, D. 1998. PDDL: the Planning Domain Definition Language.
  • Miller, T. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267: 1–38.
  • Min, W.; Mott, B. W.; Rowe, J. P.; Liu, B.; and Lester, J. C. 2016. Player goal recognition in open-world digital games with long short-term memory networks. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI), 2590–2596.
  • Neider, D.; and Gavran, I. 2018. Learning linear temporal properties. In 2018 Formal Methods in Computer Aided Design (FMCAD), 1–10. IEEE.
  • Oncina, J.; and Garcia, P. 1992. Identifying regular languages in polynomial time. In Advances in Structural and Syntactic Pattern Recognition, 99–108. World Scientific.
  • Pereira, R. F.; Oren, N.; and Meneguzzi, F. 2017. Landmark-based heuristics for goal recognition. In Thirty-First AAAI Conference on Artificial Intelligence.
  • Pnueli, A. 1977. The temporal logic of programs. In 18th Annual Symposium on Foundations of Computer Science (SFCS 1977), 46–57. IEEE.
  • Polyvyanyy, A.; Su, Z.; Lipovetzky, N.; and Sardina, S. 2020. Goal recognition using off-the-shelf process mining techniques. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, 1072–1080.
  • Ramírez, M.; and Geffner, H. 2011. Goal recognition over POMDPs: Inferring the intention of a POMDP agent. In Proceedings of the 22nd International Joint Conference on Artificial Intelligence (IJCAI).
  • Riabov, A.; Sohrabi, S.; Sow, D.; Turaga, D.; Udrea, O.; and Vu, L. 2015. Planning-based reasoning for automated large-scale data analysis. In Proceedings of the Twenty-Fifth International Conference on Automated Planning and Scheduling, 282–290.
  • Rozenberg, G.; and Salomaa, A. 2012. Handbook of Formal Languages. Springer Science & Business Media.
  • Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys (CSUR) 34(1): 1–47.
  • Shridhar, M.; Thomason, J.; Gordon, D.; Bisk, Y.; Han, W.; Mottaghi, R.; Zettlemoyer, L.; and Fox, D. 2020. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). URL https://arxiv.org/abs/1912.01734.
  • Smetsers, R.; Fiterau-Brostean, P.; and Vaandrager, F. 2018. Model learning as a satisfiability modulo theories problem. In International Conference on Language and Automata Theory and Applications, 182–194. Springer.
  • Sohrabi, S.; Udrea, O.; and Riabov, A. V. 2013. Hypothesis exploration for malware detection using planning. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 883–889.
  • Sonnhammer, E. L.; Von Heijne, G.; Krogh, A.; et al. 1998. A hidden Markov model for predicting transmembrane helices in protein sequences. In ISMB, volume 6, 175–182.
  • Tapia, E. M.; Intille, S. S.; and Larson, K. 2004. Activity recognition in the home using simple and ubiquitous sensors. In International Conference on Pervasive Computing, 158–175. Springer.
  • Ulyantsev, V.; Zakirzyanov, I.; and Shalyto, A. 2015. BFS-based symmetry breaking predicates for DFA identification. In International Conference on Language and Automata Theory and Applications, 611–622. Springer.
  • Vardi, M. Y.; and Wolper, P. 1986. An automata-theoretic approach to automatic program verification. In Proceedings of the First Symposium on Logic in Computer Science, 322–331. IEEE Computer Society.
  • Wagner, R. A. 1974. Order-n correction for regular languages. Communications of the ACM 17(5): 265–268.
  • Wang, W.; Chen, C.; Wang, W.; Rai, P.; and Carin, L. 2016. Earliness-aware deep convolutional networks for early time series classification. arXiv preprint arXiv:1611.04578.
  • Xing, Z.; Pei, J.; Dong, G.; and Yu, P. S. 2008. Mining sequence classifiers for early prediction. In Proceedings of the 2008 SIAM International Conference on Data Mining, 644–655. SIAM.
  • Xing, Z.; Pei, J.; and Keogh, E. 2010. A brief survey on sequence classification. ACM SIGKDD Explorations Newsletter 12(1): 40–48.
  • Xing, Z.; Pei, J.; Yu, P. S.; and Wang, K. 2011. Extracting interpretable features for early classification on time series. In Proceedings of the 2011 SIAM International Conference on Data Mining, 247–258. SIAM.
  • Xue, Y.; Wang, J.; Liu, Y.; Xiao, H.; Sun, J.; and Chandramohan, M. 2015. Detection and classification of malicious JavaScript via attack behavior modelling. In Proceedings of the 2015 International Symposium on Software Testing and Analysis, 48–59.
  • Zhou, C.; Sun, C.; Liu, Z.; and Lau, F. 2015. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630.