Friday, April 21, 2023

Bad Tests Running Wild - an #InflectraCON2023 Live Blog


 


Paul Grizzaffi

Senior QE Automation Architect, Vaco

Paul and I go way back. It's always fun to see my fellow in heavy metal arms at these events. We frequently talk music as much as we talk testing, so we are often in each other's sessions and today is no exception. Plus, Paul and I love making musical puns in our talk titles, and seeing Bad Tests Running Wild, I knew that was a reference to Scorpions' lead-off track from 1984's "Love at First Sting", aka "Bad Boys Running Wild"... yeah, this is going to be fun :).


The point here is that, especially with CI/CD pipelines, we need the tests to pass for the pipeline to complete and deploy the application. If a test fails, the whole process fails. Because of how tests run in a CI/CD pipeline, we need to make sure that any test we have can run at any time, independent of any other test and independent of any state of our product. That means a flaky test can really derail us. Note, this is not about a test legitimately failing or finding a fault; this is more the "random timeout caused by a latency blip that has nothing to do with our application".
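Not from Paul's slides, but to make that concrete, here's a minimal sketch of the kind of thing he's describing (Python/pytest flavored; the FakeOrderService is a made-up stand-in for a real API): poll with a timeout instead of a fixed sleep, so a one-off latency blip doesn't sink the whole pipeline.

```python
import time

def wait_for(condition, timeout=10.0, interval=0.5):
    """Poll `condition` until it returns truthy or `timeout` expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    raise TimeoutError(f"condition not met within {timeout} seconds")

class FakeOrderService:
    """Stand-in for a real API; the order becomes READY after a variable delay."""
    def __init__(self, ready_after=1.5):
        self.ready_at = time.monotonic() + ready_after

    def get_order_status(self, order_id):
        return "READY" if time.monotonic() >= self.ready_at else "PENDING"

def test_order_becomes_ready():
    service = FakeOrderService(ready_after=1.5)
    # A hard-coded time.sleep(1) would flake whenever latency runs long;
    # polling with a generous timeout tolerates the variance.
    assert wait_for(lambda: service.get_order_status("order-1") == "READY")
```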

Let's think about how we create our calls and procedures. Do we have everything under our own umbrella? How much of our solution uses third-party code, and do we understand that third-party code? If we are using threads for concurrency, are all of our components actually safe to use from those concurrent threads?
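My own hedge on that last question, sketched in Python (LegacyReportClient is a made-up stand-in for any third-party library whose thread-safety we haven't verified): if we're not sure a component is thread-safe, a safe default is one instance per worker thread rather than one shared instance.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class LegacyReportClient:
    """Made-up stand-in for a third-party client with unknown thread-safety."""
    def fetch(self, report_id):
        return f"{report_id} fetched on {threading.current_thread().name}"

# One client per worker thread instead of a single shared instance.
_local = threading.local()

def get_client():
    if not hasattr(_local, "client"):
        _local.client = LegacyReportClient()
    return _local.client

def fetch_report(report_id):
    return get_client().fetch(report_id)

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=4) as pool:
        for line in pool.map(fetch_report, ["r1", "r2", "r3", "r4"]):
            print(line)
```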

Let's think about configuration and how we set things up. Why do we want or need parallelization? Overall, it comes down to time and speed. I remember well our earlier setup with Jenkins from about a decade ago. It took us several hours to run everything in serial, so we needed to set up the environment in such a way that we could run four servers in parallel. At some point, we have to weigh the cost of running our CI/CD pipeline against the time it takes to deploy. Our sweet spot turned out to be four servers running in parallel. Those four servers ran our tests in twenty minutes and then did our deployment if everything went smoothly. Going from several hours to twenty minutes was a big time saving, but yes, it cost money to set up servers robust enough to get those savings. Beyond four servers, we found that adding more machines gave a less favorable ratio of cost to time saved. Still, it was critical to make sure that any tests we ran, and any state they changed, were completely self-contained. No test was allowed to leave any residual footprints. Additionally, we had to ensure that our main server and our client machines were responding quickly enough that we didn't introduce latency across multiple machines (heck, spinning up a machine in a different server farm could mess everything up, so you needed to make sure everything was proximate to everything else).
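For what it's worth, here's a sketch of that kind of self-containment in pytest terms (the in-memory FakeAccountService is made up, and pytest-xdist's `pytest -n 4` stands in for the four parallel servers): every test creates uniquely named data and deletes it on the way out, so it can run on any worker, in any order.

```python
import uuid
import pytest

class FakeAccountService:
    """In-memory stand-in for the real service; just enough to run the sketch."""
    def __init__(self):
        self.accounts = {}

    def create_account(self, name):
        self.accounts[name] = {"name": name, "balance": 0}
        return self.accounts[name]

    def delete_account(self, name):
        self.accounts.pop(name, None)

@pytest.fixture
def service():
    return FakeAccountService()

@pytest.fixture
def temp_account(service):
    # Unique name per test so parallel workers never collide on shared data.
    name = f"test-acct-{uuid.uuid4().hex[:8]}"
    account = service.create_account(name)
    yield account
    service.delete_account(name)  # leave no residual footprint behind

def test_account_starts_with_zero_balance(temp_account):
    assert temp_account["balance"] == 0
```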

So far, we have only considered what happens when a test fails when it isn't supposed to. We also have to consider the flip side: what happens if a test passes that shouldn't? That's the other face of a flaky test. What if we have made a change but our test is too generic to catch the specific error we have introduced? That means we may well have introduced a bug that we didn't, or wouldn't, catch.
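Here's a toy illustration of that, mine rather than Paul's: an assertion so generic that it still passes after a regression sneaks in, next to one specific enough to catch it.

```python
def apply_discount(price, percent):
    # Imagine a regression sneaks in and the discount gets applied twice.
    return price - (price * percent / 100) - (price * percent / 100)

def test_discount_too_generic():
    # Still passes with the double-discount bug: any value under 100 "looks fine".
    assert apply_discount(100, 10) < 100

def test_discount_specific():
    # Fails with the bug, because it pins down the exact expected result.
    assert apply_discount(100, 10) == 90
```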

Risks are always going to be present, and our goal as testers and automation specialists is to look at the risks in front of us. What is the basic risk we need to mitigate? What happens when we deploy our systems? Do we have the ability to back out of a change? What do we need to do to redeploy if necessary? If we deploy, do we have an easy way to monitor what has gone in? Paul makes the point that if a change is potentially expensive, then you probably need human eyes watching and monitoring the situation. If there's little cost or risk of failure, then it can be handled without a person looking over it. Regardless, you will need the ability to monitor, and to that end you need logs that tell you meaningful information. More to the point, you need to know where the logs are and that they are actually accessible.
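A quick sketch of what "logs that tell you meaningful information" might look like in Python (the deploy function and the service name are hypothetical): context on every line and a known, accessible destination.

```python
import logging
import sys

logging.basicConfig(
    stream=sys.stdout,  # a known, accessible destination (e.g. collected by the CI runner)
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger("deploy")

def deploy(service, version):
    log.info("deploy started service=%s version=%s", service, version)
    try:
        ...  # the real deployment steps would go here
        log.info("deploy succeeded service=%s version=%s", service, version)
    except Exception:
        log.exception("deploy failed service=%s version=%s", service, version)
        raise

if __name__ == "__main__":
    deploy("orders-api", "1.4.2")
```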

As always, exciting, interesting, and great food for thought. Thanks, Paul. Rock on!!!
