Can human testers understand automatically generated test cases?

post
database testing
developer productivity
How well do humans understand automatically generated tests?
Author

Gregory M. Kapfhammer

Published

2020

Throughout my career, my research has focused on creating and evaluating methods for automatically generating test cases and test inputs, thereby supporting programmers with clever tests that may find defects in, for instance, their programs or database schemas. As an example, (Alsharif, Kapfhammer, and McMinn 2018) introduces a tool for automatically generating tests for relational database schemas, and (McMinn and Kapfhammer 2016) presents a framework that supports the creation of automated test data generation tools for the Java programming language. However, until recently, none of my research papers assessed whether these approaches actually help human testers. Of course, my colleagues and I recognized that this was an area that warranted further investigation! I’m pleased to report that our recent paper (Alsharif, Kapfhammer, and McMinn 2019) presents a human study of automated test data generation techniques for relational database schemas.

Since relational databases are a key component of software systems ranging from small mobile apps to large enterprise applications, there are well-studied methods that automatically generate test cases for database-related functionality. As explained in a previous blog post called Introducing a research foundation for testing relational database schemas, a schema testing tool automatically generates INSERT statements with data values designed to either satisfy the schema (i.e., be accepted into the database) or violate it (i.e., be rejected from the database). From my own experience, writing schema tests is difficult because a human tester has to carefully reason about the sometimes complex relational schema that protects the database’s contents. While having tools like SchemaAnalyst (McMinn et al. 2016) to automatically generate tests is useful, as I mentioned previously, there has been no research analyzing how well testers understand tests involving SQL and decide whether those tests reveal flaws. So, my collaborators and I decided to conduct a human study!
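To make this concrete, here is a minimal sketch of what such generated tests might look like; the `customers` table and its data values are illustrative examples of my own, not subjects from the study.

```sql
-- Hypothetical schema under test (an illustrative example,
-- not one of the study's subject schemas)
CREATE TABLE customers (
  id    INT PRIMARY KEY,
  email VARCHAR(50) UNIQUE NOT NULL,
  age   INT CHECK (age >= 18)
);

-- A satisfying test case: every constraint holds, so the
-- database should accept this row
INSERT INTO customers (id, email, age) VALUES (1, 'a@x.com', 21);

-- A violating test case: the CHECK constraint on age fails,
-- so the database should reject this row
INSERT INTO customers (id, email, age) VALUES (2, 'b@x.com', 17);
```

A tester inspecting statements like these must predict whether each INSERT will be accepted or rejected, which is exactly the kind of judgment our study asked its participants to make.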

The aforementioned paper (Alsharif, Kapfhammer, and McMinn 2019) reports on a human study of test comprehension in the context of automatically generated tests, created by SchemaAnalyst, that assess the correct specification of the integrity constraints in a relational database schema. The study reveals two key findings. First, the choice of data values in INSERTs influences human understandability: the use of default values for elements not involved in the test (but necessary for adhering to SQL’s syntax rules) aided participants, allowing them to easily identify and understand the important test values, whereas negative numbers and “garbage” strings hindered this process. The second finding is more far-reaching: humans found the outcome of test cases very difficult to predict when NULL was used in conjunction with foreign keys and CHECK constraints. This suggests that, while including NULLs can surface the confusing semantics of database schemas, their use makes tests less understandable for humans. While these results specifically apply to database schema tests, we anticipate that they may also apply to automatically generated tests for general-purpose programs in languages like Java.
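The confusion around NULL stems from SQL’s three-valued logic: a CHECK constraint rejects a row only when its predicate evaluates to FALSE, so a NULL that makes the predicate UNKNOWN lets the row through, and a NULL foreign key value is never compared against the referenced table at all. The following sketch, which extends the hypothetical schema above with an `orders` table of my own invention, shows two INSERTs that are accepted even though they may look like violations.

```sql
-- Hypothetical extension of the schema above: orders reference
-- customers and must have a positive total
CREATE TABLE orders (
  id          INT PRIMARY KEY,
  customer_id INT REFERENCES customers(id),
  total       INT CHECK (total > 0)
);

-- Accepted, counter-intuitively: NULL makes "total > 0" evaluate
-- to UNKNOWN rather than FALSE, so the CHECK is not violated
INSERT INTO orders (id, customer_id, total) VALUES (1, 1, NULL);

-- Also accepted: a NULL foreign key value triggers no referential
-- check against the customers table
INSERT INTO orders (id, customer_id, total) VALUES (2, NULL, 10);
```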

In the context of database schema testing, my co-authors and I therefore offer suggestions for both the software engineers who manually write tests and the developers who create tools that automatically generate tests: prefer simple, recognizable default values for the columns that a test does not exercise, avoid distracting values such as negative numbers and “garbage” strings, and treat NULL with particular care, since its semantics are easy to misjudge.
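As a small illustration of the first two suggestions, compare these two hypothetical violating test cases for the `customers` table sketched earlier; both exercise the same CHECK constraint, but the second makes the value under test far easier to spot.

```sql
-- Harder to read: arbitrary values obscure which one is under test
INSERT INTO customers (id, email, age) VALUES (-352, 'kj#qw!z', 17);

-- Easier to read: default values for the columns that do not
-- matter draw attention to the single tested value (age = 17)
INSERT INTO customers (id, email, age) VALUES (0, '', 17);
```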


While some automated test data generation methods are starting to adopt these suggestions, more work remains to create automated test data generation tools and manual testing guidelines that ensure tests are understandable by human testers. Do you have ideas about how to make tests more human-readable? Are you interested in collaborating with me on research to create the next generation of automated test data generation techniques? If so, then I hope that you will contact me to share some of your ideas!


References

Alsharif, Abdullah, Gregory M. Kapfhammer, and Phil McMinn. 2018. “DOMINO: Fast and Effective Test Data Generation for Relational Database Schemas.” In Proceedings of the 11th International Conference on Software Testing, Verification and Validation.
———. 2019. “What Factors Make SQL Test Cases Understandable for Testers? A Human Study of Automated Test Data Generation Techniques.” In Proceedings of the 35th International Conference on Software Maintenance and Evolution.
McMinn, Phil, and Gregory M. Kapfhammer. 2016. “AVMf: An Open-Source Framework and Implementation of the Alternating Variable Method.” In Proceedings of the 8th International Symposium on Search-Based Software Engineering.
McMinn, Phil, Chris J. Wright, Cody Kinneer, Colton J. McCurdy, Michael Camara, and Gregory M. Kapfhammer. 2016. “SchemaAnalyst: Search-Based Test Data Generation for Relational Database Schemas.” In Proceedings of the 32nd International Conference on Software Maintenance and Evolution.