MarMot: Metamorphic Runtime Monitoring of Autonomous Driving Systems

  1. Ayerdi, Jon
  2. Iriarte, Asier
  3. Valle, Pablo
  4. Roman, Ibai
  5. Illarramendi, Miren
  6. Arrieta, Aitor

Affiliation: Mondragon University, Mondragon, Spain
Journal: ACM Transactions on Software Engineering and Methodology

ISSN: 1049-331X (print), 1557-7392 (electronic)

Year of publication: 2024

Volume: 34

Issue: 1

Pages: 1-35

Type: Article

DOI: 10.1145/3678171


Abstract

Autonomous driving systems (ADSs) are complex cyber-physical systems (CPSs) that must ensure safety even in uncertain conditions. Modern ADSs often employ deep neural networks (DNNs), which may not produce correct results in every possible driving scenario. Thus, an approach to estimate the confidence of an ADS at runtime is necessary to prevent potentially dangerous situations. In this article, we propose MarMot, an online monitoring approach for ADSs based on metamorphic relations (MRs), which are properties of a system that hold among multiple inputs and the corresponding outputs. Using domain-specific MRs, MarMot estimates the uncertainty of the ADS at runtime, allowing the identification of anomalous situations that are likely to cause faulty behavior of the ADS, such as driving off the road. We perform an empirical assessment of MarMot with five different MRs, using two subject ADSs: a small-scale physical ADS and a simulated ADS. Our evaluation encompasses the identification of both external anomalies, e.g., fog, and internal anomalies, e.g., faulty DNNs due to mislabeled training data. Our results show that MarMot can identify up to 65% of the external anomalies and 100% of the internal anomalies in the physical ADS, and up to 54% of the external anomalies and 88% of the internal anomalies in the simulated ADS. With these results, MarMot outperforms or is comparable to other state-of-the-art approaches, including SelfOracle, Ensemble, and MC Dropout-based ADS monitors.
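To illustrate the core idea of MR-based runtime monitoring (this is a minimal sketch, not the authors' implementation), consider a domain-specific MR for lane keeping: horizontally mirroring the camera frame should approximately negate the predicted steering angle. The function, threshold, and toy model below are hypothetical stand-ins for illustration only.

```python
import numpy as np

def mirror_mr_violation(model, frame, threshold=0.1):
    """Check a mirror-symmetry metamorphic relation (MR) at runtime.

    MR: horizontally flipping the camera frame should (approximately)
    negate the predicted steering angle. A large violation score
    suggests the model is uncertain about the current scene.
    """
    steer = float(model(frame))
    steer_flipped = float(model(frame[:, ::-1, :]))  # flip left-right
    violation = abs(steer + steer_flipped)           # 0 if MR holds exactly
    return violation > threshold, violation

# Hypothetical stand-in for a steering DNN: exactly antisymmetric
# under horizontal mirroring (difference of left/right half means).
def toy_steering_model(frame):
    w = frame.shape[1]
    return frame[:, : w // 2].mean() - frame[:, w // 2 :].mean()

rng = np.random.default_rng(0)
frame = rng.random((66, 200, 3))  # PilotNet-style input resolution
anomalous, score = mirror_mr_violation(toy_steering_model, frame)
```

In an online monitor, such a check would run on each incoming frame, and repeated MR violations would raise an uncertainty alarm before the vehicle reaches an unsafe state.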

Funding information

Funders

  • Basque Government, through its Elkartek program (grants KK-2022/00119 and KK-2022/00007)
  • Department of Education, Universities and Research of the Basque Country

Bibliographic References

  • Raja Ben Abdessalem, Shiva Nejati, Lionel C. Briand, and Thomas Stifter. 2018a. Testing vision-based control systems using learnable evolutionary algorithms. In Proceedings of the 40th International Conference on Software Engineering. 1016–1026.
  • 10.1145/3238147.3238192
  • Aitor Arrieta. 2022. Multi-objective metamorphic follow-up test case selection for deep learning systems. In Proceedings of the Genetic and Evolutionary Computation Conference. 1327–1335.
  • Jon Ayerdi, Pablo Valle, Asier Iriarte, Ibai Roman, Miren Illarramendi, and Aitor Arrieta. 2023. Dataset for “MarMot: Metamorphic runtime monitoring of autonomous driving systems.” DOI: 10.5281/zenodo.10716933
  • Jonathan Bell, Christian Murphy, and Gail Kaiser. 2015. Metamorphic runtime checking of applications without test oracles. CrossTalk 28, 2 (2015).
  • 10.1145/2970276.2970311
  • Matteo Biagiola and Paolo Tonella. 2022. Testing the plasticity of reinforcement learning-based systems. ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 4 (2022), 1–46.
  • Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. 2016. End to end learning for self-driving cars. arXiv:1604.07316. Retrieved from https://arxiv.org/abs/1604.07316
  • Alessandro Calo, Paolo Arcaini, Shaukat Ali, Florian Hauer, and Fuyuki Ishikawa. 2020. Generating avoidable collision scenarios for testing autonomous driving systems. In Proceedings of the IEEE 13th International Conference on Software Testing, Validation and Verification (ICST ’20). IEEE, 375–386.
  • T. Y. Chen, S. C. Cheung, and S. M. Yiu. 1998. Metamorphic Testing: A New Approach for Generating Next Test Cases. Technical Report HKUST-CS98-01, Department of Computer Science, The Hong Kong University of Science and Technology.
  • Yao Deng, Guannan Lou, Xi Zheng, Tianyi Zhang, Miryung Kim, Huai Liu, Chen Wang, and Tsong Yueh Chen. 2021. BMT: Behavior driven development-based metamorphic testing for autonomous driving models. In Proceedings of the IEEE/ACM 6th International Workshop on Metamorphic Testing (MET ’21). IEEE, 32–36.
  • Yao Deng, Xi Zheng, Tianyi Zhang, Huai Liu, Guannan Lou, Miryung Kim, and Tsong Yueh Chen. 2022. A declarative metamorphic testing framework for autonomous driving. IEEE Transactions on Software Engineering 49, 4 (2022), 1964–1982.
  • Raul Sena Ferreira, Jean Arlat, Jérémie Guiochet, and Hélène Waeselynck. 2021. Benchmarking safety monitors for image classifiers with machine learning. In Proceedings of the IEEE 26th Pacific Rim International Symposium on Dependable Computing (PRDC ’21). IEEE, 7–16.
  • Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In Proceedings of the International Conference on Machine Learning. PMLR, 1050–1059.
  • 10.1145/3338906.3338942
  • Alessio Gambi, Marc Mueller, and Gordon Fraser. 2019b. Automatically testing self-driving cars with search-based procedural content generation. In Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis. 318–328.
  • 10.1109/ICRA46639.2022.9811924
  • 10.1145/3510003.3510188
  • 10.1109/ICSE48619.2023.00155
  • Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations (2019).
  • Jens Henriksson, Christian Berger, Markus Borg, Lars Tornberg, Cristofer Englund, Sankar Raman Sathyamoorthy, and Stig Ursing. 2019. Towards structured evaluation of deep neural network supervisors. In Proceedings of the IEEE International Conference on Artificial Intelligence Testing (AITest ’19). IEEE, 27–34.
  • 10.1073/pnas.79.8.2554
  • 10.1145/3377811.3380395
  • 10.1145/3460319.3464825
  • 10.1007/s10515-021-00310-0
  • 10.1109/ICSE.2019.00108
  • Anis Koubâa et al. 2017. Robot Operating System (ROS). Vol. 1. Springer.
  • Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc.
  • LeoRover. 2022. LeoRover Dataset. Retrieved from https://www.kaggle.com/datasets/aleksanderszymaski/full-track
  • LeoRover. 2023. LeoRover. Retrieved from https://github.com/LeoRover
  • 10.5555/3103620.3103632
  • 10.1109/TSE.2022.3150788
  • 10.1109/ICRA40945.2020.9196844
  • Galen E. Mullins, Paul G. Stankiewicz, R. Chad Hawthorne, and Satyandra K. Gupta. 2018. Adaptive generation of challenging scenarios for testing and evaluation of autonomous vehicles. Journal of Systems and Software 137 (2018), 197–215.
  • Christian Murphy and Gail E. Kaiser. 2009. Metamorphic Runtime Checking of Non-Testable Programs. Retrieved from https://core.ac.uk/reader/161435520
  • Christian Murphy, Kuang Shen, and Gail Kaiser. 2009. Automatic system testing of programs without test oracles. In Proceedings of the 18th International Symposium on Software Testing and Analysis. 189–200.
  • 10.1145/3132747.3132785
  • Vincenzo Riccio and Paolo Tonella. 2023. When and why test generators for deep learning produce invalid inputs: An empirical study. In Proceedings of the IEEE/ACM 45th International Conference on Software Engineering (ICSE ’23). IEEE, 1161–1173.
  • Jeanine Romano, Jeffrey D. Kromrey, Jesse Coraggio, Jeff Skowronek, and Linda Devine. 2006. Exploring methods for evaluating group differences on the NSSE and other surveys: Are the t-test and Cohen’s d indices the most appropriate choices? In Proceedings of the Annual Meeting of the Southern Association for Institutional Research. 1–51.
  • 10.1109/JPROC.2021.3052449
  • David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams. 1986. Learning representations by back-propagating errors. Nature 323, 6088 (1986), 533–536.
  • Franz Scheuer, Alessio Gambi, and Paolo Arcaini. 2023. STRETCH: Generating challenging scenarios for testing collision avoidance systems. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV ’23). IEEE, 1–6.
  • 10.1109/TSE.2016.2532875
  • 10.1016/j.jss.2020.110574
  • Andrea Stocco, Paulo J. Nunes, Marcelo D’Amorim, and Paolo Tonella. 2022a. ThirdEye: Attention maps for safe autonomous driving systems. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12.
  • Andrea Stocco, Brian Pulfer, and Paolo Tonella. 2022b. Mind the gap! A study on the transferability of virtual vs physical-world testing of autonomous driving systems. IEEE Transactions on Software Engineering 49, 4 (2022), 1928–1940.
  • Andrea Stocco and Paolo Tonella. 2022. Confidence-driven weighted retraining for predicting safety-critical failures in autonomous driving systems. Journal of Software: Evolution and Process 34, 10 (2022), e2386.
  • 10.1145/3377811.3380353
  • Yang Sun, Christopher M. Poskitt, Jun Sun, Yuqi Chen, and Zijiang Yang. 2022. LawBreaker: An approach for specifying traffic laws and fuzzing autonomous vehicles. In Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering. 1–12.
  • 10.1145/3180155.3180220
  • Huiyan Wang, Jingwei Xu, Chang Xu, Xiaoxing Ma, and Jian Lu. 2020. Dissector: Input validation for deep learning applications by crossing-layer dissection. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering. 727–738.
  • Michael Weiss and Paolo Tonella. 2023. Uncertainty quantification for deep neural networks: An empirical comparison and usage guidelines. Software Testing, Verification and Reliability 33, 6 (2023), e1840.
  • 10.1109/ICSE43902.2021.00044
  • 10.1007/s10270-017-0609-6
  • Man Zhang, Bran Selic, Shaukat Ali, Tao Yue, Oscar Okariz, and Roland Norgren. 2016. Understanding uncertainty in cyber-physical systems: A conceptual model. In Proceedings of the Modelling Foundations and Applications: 12th European Conference, ECMFA 2016, Held as Part of STAF 2016. Springer, 247–264.
  • Mengshi Zhang, Yuqun Zhang, Lingming Zhang, Cong Liu, and Sarfraz Khurshid. 2018. DeepRoad: GAN-based metamorphic testing and input validation framework for autonomous driving systems. In Proceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering. 132–142.
  • Ziyuan Zhong, Gail Kaiser, and Baishakhi Ray. 2022. Neural network guided evolutionary fuzzing for finding traffic violations of autonomous vehicles. IEEE Transactions on Software Engineering 49, 4 (2022), 1860–1875.
  • Yuan Zhou, Yang Sun, Yun Tang, Yuqi Chen, Jun Sun, Christopher M. Poskitt, Yang Liu, and Zijiang Yang. 2023. Specification-based autonomous driving system testing. IEEE Transactions on Software Engineering 49, 6 (2023), 3391–3410.
  • Zhi Quan Zhou and Liqun Sun. 2019. Metamorphic testing of driverless cars. Communications of the ACM 62, 3 (2019), 61–67.