>> BINDER: Good morning, everybody. My name is Bob Binder. I want to talk to you this morning
about testability. The title of the talk and its general structure have changed a little bit from what's in the announcement, but don't worry, it will be pretty much the same content; the stories are just going to be told in a somewhat different manner.
So, what I'd like to do is to talk a little bit first about why testability matters, or at least why I think it matters. Then we'll look at two dimensions of testability from a kind of high-level perspective; I'm going to call them white box and black box. I'll talk a little bit about the role that test automation plays in testability, and then try to draw some conclusions about strategy and how we go about the process of testing: designing tests and running them. And then I think we'll have some time at the end for questions and answers. So why does testability matter? Basically, I look
at testability from an economic perspective. Let's start with a few assumptions. In software, sooner is better than later. Bug escapes are bad. Fewer tests, again, other things being equal, mean more escapes. And in testing, we have a fixed budget. So the question is, given a fixed and finite amount of resources, and of course, I don't know, maybe things are different here at Google, but seriously, in most circumstances we have a project deadline and a finite amount of time and resources: ability, hours in a day, people, et cetera. And so the question becomes, how can we put that to best use? And
in terms of testing, the usual goal is to contribute value by removing defects from the system, and perhaps by some of the knock-on, secondary effects that often come from doing good testing. So, for me, testability is basically the thing that defines the limits on our ability to produce a complex system with an acceptable risk of costly or dangerous defects. There are two dimensions of testability, effectiveness and efficiency, which I'll come back to towards the end of this talk. Basically, if
you look at the total cost of doing a test, that's all the resources that are consumed
in doing testing and divide that by the number of tests, for me, that's the average efficiency.
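In symbols, that works out to (my shorthand, not a formula from the slides):

$$\text{Efficiency} \;=\; \frac{\text{total cost of testing}}{\text{number of tests}}, \qquad \text{Effectiveness} \;=\; P(\text{a given test reveals a bug}).$$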
And when I look at this, it's not just the time that you might spend first writing a test against your source code; it's everything that you have to do afterwards, everything somebody else might have to do, or can't do, because of the way things have been built. And effectiveness is the average probability that when you run that test, you'll find a bug. Hopefully that's low but not zero, at least starting out. So, you know, other things being equal, higher testability means more and better tests at the same cost; lower testability means fewer, weaker tests at the same cost. What makes a system under test testable? Classically, there are
two dimensions to this: controllability and observability. This goes back to hardware engineering, digital logic design. Hardware people started to work on testability issues a long time ago, driven especially by increasing miniaturization. When you went from LSI, Large-Scale Integration, to VLSI, Very Large-Scale Integration, you got several orders of magnitude more circuits on a piece of silicon, and now we're at the sort of nanoscale wires that are in the computers we're all using. You couldn't just stick a probe in it, right? The wires are too small. It wasn't like the old breadboard where you had everything exposed and hanging out. So, in order to determine what was going on within a circuit, they had to have a way in which you could controllably observe what was going on inside a system, and to make a long story short, there's a whole standard for this, called the JTAG standard. Basically every chip that's made has four additional wires coming out of it that allow you to do testing. So the idea of controllability and observability, at least from my perspective, has its roots in hardware engineering. Controllability means, "What do we have to
do to run a test case? How hard is it? How expensive is it?" Does the system under test make it impractical to run some kinds of tests? There may be questions that we'd like to ask, a scenario we'd like to evaluate because it's likely to recur in the real world, but in our test environment it may be prohibitively expensive, or just technologically infeasible; I'll give you some examples of this later on. Given a testing goal, do we know enough about the system, its behavior, and its likely environment to produce a test which is realistic and meaningful? You might say, "Well, sure, how hard can that be?" Well, think about it: let's say our testing goal is to cover all the requirements of our system, one test for each. That sort of presupposes that you actually have requirements. How many of you have worked on a system where you had a full set of requirements? So the knowledge we have, what we approach our testing with, what drives our design of test cases, really does affect testability. And how much tooling, by the way, can we afford to achieve controllability? These are all factors that influence
this. Observability has kind of a symmetric relationship: what do we have to do to determine whether a test has passed or failed? Again, this may seem simple. When you're talking about, you know, straightforward unit testing, where everything is on your desktop and under your control in a nice, well-organized sandbox, it's not so hard to do. But the testing that I'm sure many of you are involved in, large distributed systems, is not quite so simple. And again, the questions are: how hard or expensive is it to achieve a particular kind of testing? Does the system under test, the way it's structured or designed, make it hard or easy to do this? Can we easily find the information that we need to determine whether a particular situation has occurred or not? And do we know enough to determine pass, fail, or did-not-finish? Here's a fishbone chart
that I first produced about 15 years ago. I got interested in the subject of testability for a number of different reasons; the story is not terribly important. Anyhow, I looked at all the factors in this, and this is kind of what I came up with. Now, for those of you who don't have, you know, one-thousand-by-one-thousand vision, here's the short form. Testability, at least in that initial analysis, had six basic factors, and each of them had a lot of separate individual drivers. Today I'm going to focus basically on representation, implementation, and, to a certain extent, the test tools. I'm not going to talk a whole lot about process, how the testing is organized, but I am going to talk a little bit about built-in test. There is an article on this that you can read at your leisure if you like. The reason I put this up, the takeaway, is to say that testability is not, at least to my mind, a single-dimensional issue. There's really a whole web of forces and factors that influence whether or not a system in your particular context is testable. Let me give you some examples from personal experience of systems that I have worked on and the issues in controllability and observability that I've had to wrestle with in trying to create tests for these systems.
Well, GUIs. Everybody deals with a GUI sooner or later. It's basically impossible to test a GUI, other than by manually interacting with it, without some abstract widget set/get capability. If you do happen to have one, there are commercial tools such as, you know, HP's WinRunner or something of that nature, or Selenium if you're using a web interface. That's great, it does a lot for you, but it's also brittle; this is, of course, capture/replay, and we all know what headaches are involved there. Latency is another interesting problem: latency in terms of the response of the system, and also the think time that users impose in real interaction. Our testing usually isn't very good at capturing that, so it's a controllability issue, and we have a hard time actually dealing with the variations in response and think time. Dynamic widgets: by that I mean widgets that essentially define themselves on the fly, and things that are very specialized, which abstract setters and getters can't really deal with. Observability: not everything is as simple as a text box where you can just get a string and figure out whether that string says what you want. There's structured content with lists and all sorts of things, and this implies some kind of notion of a cursor, not just for the tester but as a position within the data structure. Can that be established? Can it be manipulated to get something of interest out? There's a lot of non-determinism and noise in non-text output; graphical output is notoriously difficult to parse if you can't parse it as text. There have been some very interesting and successful attempts to extract information meaningfully from, you know, basically a bunch of bits that represent a picture. It's still a hard problem, I think. Image recognition as a test: I looked at this a few years ago and got a lot of interesting research proposals, but nothing that was immediately a takeaway, at least as far as I could see; I mean, there's plenty of proprietary lockouts. So testing GUIs, in terms of controllability and observability, even with the tooling set that we have, and it's a fairly large industry that supports this, used to be close to a billion dollars a year, is still not that great. One system I worked on, we had to basically drive a lot
of exceptions out of the operating system. This was a Unix platform. There were hundreds of exceptions that the system under test could throw, and the issue for us was whether or not the application we were testing could actually catch them and do something reasonable with them. Well, we had to generate them first; so how the heck do you do that? How do you force exceptions? There were certain things that were kind of difficult to get to. And then another interesting issue, observability in this case, was silent failures. If you could force the exception, oftentimes the application would just say, "I don't care." So we really had no way of knowing, perhaps other than the absence of a response, whether what actually occurred was as expected.
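A minimal sketch of the general technique for making such failures forcible, using a seam the test can substitute; FileSystem, FailingFileSystem, and loadConfig are hypothetical names, not from the system described here:

```cpp
#include <cstdio>
#include <stdexcept>

// The unit under test talks to the OS through a seam, not directly.
struct FileSystem {
    virtual ~FileSystem() = default;
    virtual std::FILE* open(const char* path) { return std::fopen(path, "r"); }
};

// Controllability: a test can substitute a file system that always fails.
struct FailingFileSystem : FileSystem {
    std::FILE* open(const char*) override { return nullptr; }
};

void loadConfig(FileSystem& fs) {
    std::FILE* f = fs.open("app.conf");
    if (!f) throw std::runtime_error("config unavailable");  // no silent failure
    std::fclose(f);
}

int main() {
    FailingFileSystem fs;
    try { loadConfig(fs); }                                   // force the exception
    catch (const std::exception& e) { std::puts(e.what()); }  // observe the response
}
```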
My first exposure to [INDISTINCT] programming was back in the Objective-C world, the so-called NeXT world; a very interesting experience. Objective-C is a highly dynamic language in which it was common programming practice to define objects on the fly, and to define the classes for those objects on the fly, so programs had this sort of feeling of writing themselves. Well, that's very interesting, and it also creates a lot of headaches in terms of testing, because you don't know what you're testing, what the testing target is, how to evaluate it, and whether, you know, it's remotely close to what you want. And then these things tended to sprawl out of control, so the source code that you looked at was nothing like what the actual implementation was. There was also a problem when we tried to instrument objects on the fly. How many of you have worked with mock objects in a system that has a DBMS, a database or a large data store, behind it? Okay, so, you know, as they say, "How's that working out for you?" This sometimes
can be quite challenging. We may just want to take a little piece of functionality away from the database for our particular application, and it may turn out that writing the mock object is, in some sense, a project of similar complexity to constructing the database management system itself. A system I worked on a number of years ago was a multi-tier distributed-object application, and we had a real challenge getting all the distributed objects to a particular desired state to achieve a particular test. When I tried to describe the problem to people, my family, who are not software people, didn't know much about it, didn't want to, I said, "Well, it's like this: suppose you had a dog act. You had six different dogs. You want to get them all up on the stage at the same time, perched on a little chair, you know, balancing a ball on their nose and barking out, you know, 'Merry Christmas' or something like that." It was comparable to that. So there were lots of issues in controllability there, and then we had some other interesting things that went on with tracing message propagation. When you have distributed systems, message propagation and figuring out what happened at all the points in the
path is quite a task. Another system I worked on was a cellular base station, and this is kind of the ultimate in non-testability. A base station is essentially a big radio tower, and the physics of radio transmission are very hard to emulate. You can kind of fake it out, but there are certain things that happen that are not easily emulated. So basically the best place to test a cellular base station is to take it out in the field and have, you know, 10,000 people pick up their cellphones and try to make a call. Well, that gets to be ridiculously expensive, and by the way, the customers who are paying for the base stations want people making calls and not getting them disrupted in the process. There are also lots of proprietary lockouts in this, and all sorts of other interesting things going on. The systems are never offline. So the point here is that controllability and observability have some very real dimensions in lots of different kinds of systems.
Let's talk a little bit about some of the dimensions that come to us from the implementation. Things that hurt testability in the implementation are complexity and non-deterministic dependencies, or what I call NDDs. Things that help are points of control and observation, built-in test, state helpers, and good structure. And by the way, I'm not going to claim that this is an exhaustive list, but it is what I'm going to touch on today, just to give you some sense of what things you have to pay attention to. Before I go much further into this, I want to introduce just a little bit of theory about testability, and I hope I'm not taking anything away from a later speaker, Jeff Offutt, who basically helped define this theory some years ago. To reveal a bug, a test has to do several things; several things have to line up. You have to get the buggy code executed. You have to trigger the bug at that code location. And executing
a piece of code even if it has a defect in it does not necessarily mean that it fails.
When it does fail, we have to propagate the incorrect result to something that's observable.
There has to be an observer of the incorrect result, and the incorrect result must be recognized
as such by the observer. So we say, "Well, yeah, gee, that's all kind of obvious." In a sense it is, but there are some interesting takeaways from this. Here's a trivial fragment of code; this is an example devised by Jeff many years ago. The bug in it is the kind of thing that I used to do all the time: a wrong operator, so I should have had an addition instead of a subtraction, but I didn't. Anybody want to guess which test cases would reveal this bug? Suppose you didn't know it was there, so no cheating. What test would you have chosen to exercise this method?
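The slide itself isn't reproduced in the transcript, so here is a stand-in fragment with the same flavor: a wrong operator over a 16-bit input domain of 65,536 values, where only a handful of inputs reveal the fault. The function and constants are illustrative, not Jeff Offutt's actual example:

```cpp
#include <cstdint>

// Intended: bucket = (x + 2) / 20000.  Coded with the wrong operator below.
// For a 16-bit input there are 65,536 possible values, but the off-by-4
// difference only changes the output when x straddles a bucket boundary:
// x in {19998..20001, 39998..40001, 59998..60001}, i.e. 12 revealing inputs.
// Every other input executes the buggy line yet still produces the expected
// result, so the fault stays hidden (the infection never propagates).
int bucket(uint16_t x) {
    int32_t z = static_cast<int32_t>(x) - 2;  // BUG: should be x + 2
    return static_cast<int>(z / 20000);
}
```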
Well, it turns out that we could do exhaustive testing on this, about 65,000 possible inputs, and there are only six test cases out of those 65,000 which would reveal this bug. If you'd chosen one, the number one, which is kind of an obvious choice, and a lot of testers say, "Well, you know, we should at least do that," sorry, you wouldn't find the bug. If you chose zero, well, you had better luck there. So this is kind of a very low-level notion of testability, but it's an important one. Basically the idea is: what is the propensity of code
to hide bugs. You know, when it's wrong, how easy is it for us to determine or write a
test that shows that it's wrong? This example is somewhat contrived, but it indicates that there are plenty of very simple circumstances where it's pretty darn hard to find those problems. So what can we do about this? We'll come to that later. Here's another one. I couldn't find any good dancing-dog pictures, but I did find this, one of these dancing hamsters; somebody really got busy with Photoshopping this, and it wasn't me. This suggests to me what these sort of non-deterministic dependencies are. These are the classic conditions: message latency; threading, and all the wonderful things that can happen when you use threading in your applications; and create, read, update, delete, the typical operations, on shared and unprotected data. All the stuff that used to happen before we had databases, and sometimes still does.
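A minimal sketch of the shared-data flavor of NDD (illustrative, not from the talk): the final value below varies from run to run, so a test asserting on it fails only intermittently.

```cpp
#include <iostream>
#include <thread>

int counter = 0;  // shared, unprotected: formally a data race (undefined
                  // behavior), shown only to illustrate intermittent failure

void bump() {
    for (int i = 0; i < 100000; ++i)
        ++counter;  // unsynchronized read-modify-write
}

int main() {
    std::thread a(bump), b(bump);
    a.join();
    b.join();
    // Expected 200000, but lost updates make the result nondeterministic.
    std::cout << counter << '\n';
}
```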
So these are, basically, things that are hard to control in an environment and that an application may allow or even rely on. They tend to be things that cause failures intermittently. Another key element of testability is the extent to which our systems are complex. Software complexity is a subject about which people have said lots and lots of things; I'm not going to get into too much detail today, other than to say it is critically important for testability, because the harder it is to get to each of those points in the code, the less likely you are to get there and therefore the less likely you are to see the bug. We have two kinds of basic complexity, essential
and accidental. Essential complexity is basically: you have a big job, you have a big system, and you can't get away from that. Accidental complexity is what gets dragged in, usually kind of coincidentally, because of technical decisions and commitments. There's a great analysis of this in Essential Systems Analysis, published a long time ago, which made the same kind of distinction. Usually we see some kind of graph diagram or other way of representing complexity; I thought today you might like to see a somewhat different one. This is by a well-known modern artist, Jackson Pollock, who did some very interesting things. I find that looking at this picture, and I don't know what your experience is, there's something about it that draws you in, and my experience of looking at it, of being drawn in, is that I can't quite figure out what it is. And then there's sort of an echo somehow. Without getting too much further down that path, by the way, the music that I chose illustrates a similar kind of complexity and compositional structure. So I thought this might suggest that complexity is kind of a psychological phenomenon, and testability lies in our ability to understand things and then construct tests from there. Think about Jackson Pollock the next time you think about complexity. What improves testability? Points of control and observation, state-based test helpers, built-in test, well-structured code. I'll talk a little bit about each of these.
What's a PCO? Are you familiar with something called TTCN-3? It's a notation, an abstract notation for test suites and test harnesses for protocol verification. Within it, it has this notion of a point of control and observation, and that's an abstraction for any kind of interface of interest. So what do we have to do as testers, basically, to activate a component or an aspect? You know what components are; what's an aspect? You may have heard of this. The notion is that there is some slice of functionality within a system that may not map cleanly onto a [INDISTINCT]; that's an aspect. These are things we're actually quite often interested in testing. Aspects are usually more interesting, but usually not directly controllable. So, for example, performance, the way the system responds or the way it consumes resources, is an aspect. There typically isn't one interface which you can touch to evaluate performance. And what do you have to do as a tester, or build into your test harness, to inspect the resulting state? Traces are one way of doing this, but they're often insufficient or noisy, at least traces that are designed for purposes other than testing. Embedded state observers, which I'll spend a little bit of time sketching for you this morning, are often effective, but they can be expensive to do, and some people complain that they're polluting. So, aspects are often critical but typically not directly observable. Design for testability, and this is like going back to large-scale integration, where I can't put a probe into, you know, a wafer of silicon that has nanometer wires in it, is to determine the requirements for aspect-oriented points of control and observation and build those into your system. Ask yourself in advance, as you're building the system: what are the aspects that I care about, that my customers care about, and how can I put something into my design which allows me to easily observe and evaluate those?
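Here's a minimal sketch of what such a built-in PCO for a performance aspect might look like; Server and setLatencyObserver are hypothetical names, not from TTCN-3 or the talk:

```cpp
#include <chrono>
#include <functional>
#include <iostream>

// A built-in point of control and observation for a performance aspect:
// the harness registers a hook and observes per-request latency without
// probing the server's internals.
class Server {
public:
    using Observer = std::function<void(double /*milliseconds*/)>;
    void setLatencyObserver(Observer obs) { observer_ = std::move(obs); }  // PCO
    void handleRequest() {
        auto start = std::chrono::steady_clock::now();
        // ... real work elided ...
        std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start;
        if (observer_) observer_(elapsed.count());  // observation point
    }
private:
    Observer observer_;
};

int main() {
    Server s;
    s.setLatencyObserver([](double ms) { std::cout << ms << " ms\n"; });
    s.handleRequest();
}
```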
One thing along that vein is state-based test helpers. I was very interested in this subject a number of years ago and collected some patterns about doing it.
Basically, to do state-based testing, you need to do several things. You need to be able to set the state and get the current state of whatever it is you're testing, and then use something that I've called a logical state invariant function, or LSIF. You'll also typically find it useful to have a reset, which takes the system under test back to some starting state. All the actions and events should be controllable and observable.
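A minimal sketch of those helpers in C++, with a hypothetical TrafficLight class standing in for the unit under test (setTestState, reset, and the LSIF name are illustrative):

```cpp
#include <cassert>

enum class State { Red, Yellow, Green };

class TrafficLight {
public:
    void cycle() { /* production behavior elided */ }

    // --- State-based test helpers ---
    void setTestState(State s) { state_ = s; }                 // controllability
    State getTestState() const { return state_; }              // observability
    void reset() { state_ = State::Red; goSignal_ = false; }   // known start state
    // Logical state invariant function (LSIF): is the concrete data
    // consistent with the abstract state "Red"?
    bool lsifRed() const { return state_ == State::Red && !goSignal_; }

private:
    State state_ = State::Red;
    bool goSignal_ = false;
};

int main() {
    TrafficLight t;
    t.reset();
    t.setTestState(State::Red);
    assert(t.lsifRed());  // verify the resulting state via the LSIF
}
```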
What does this look like, and why are we interested? Here is an implementation model of part of a system that supports a two-player and a three-player racquet game; a racquet game is something like racquetball, tennis, or squash.
The bottom is the state chart that might result when you work out what to do for a three-player game as an extension of a two-player game. The test model is somewhat different: if we want to test the three-player game, we need to consider the aggregate behavior, not just of this individual unit but of the whole composition. That takes us to producing a test model called a flattened machine. The flattened state machine looks like this, and it takes into account all the interactions. Now we can produce a test plan for this. There are many strategies for doing so; this is one that I like. What it does, basically, is trace all of the round trips within the state machine which take you from one state to another and back to the same one. If we have K events and N states with logical state invariant functions, that's basically on the order of K times N tests. So it doesn't explode, which is good, right?
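In rough symbols (my notation, consistent with the estimate in the talk):

$$|T_{\text{round-trip}}| = O(K \times N) \quad \text{for } K \text{ events and } N \text{ states},$$

linear in the size of the machine rather than exponential in the number of paths.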
If we don't have any logical state invariant functions and you really want to know what the resultant state is, then at the end of the test you have to say, "Okay, I did this, I did this, I did that. And then, is that what happened? Did I get to the player-two serve state? Is that actually what would occur?" Another system, an operating system handling the scheduling of processes and the management of resources, had to check all those kinds of things, and we had a real controllability-observability problem. The strategy was to add an invariant function into every class in the system; they called it sanity checking. The invariant function would basically call another function, which was globally allocated, to determine whether or not the invariant conditions of that particular object were met. The invariant check had some very simple global settings. One of them scaled the number of times it actually fired and spent the CPU cycles to do its checking, from once in every 256 calls, usually, up to always. So you had a way of randomly sampling; as the system stabilized, you could dial that down so you wouldn't have to check everything.
Perhaps in an earlier release, when things are still unstable, you do well to do more checking. Then there was kind of a clever trick, one I'm sure some of you may use. It's a combination of const and inline, a C++ [INDISTINCT] which basically causes no object code to be generated, without any changes to source code. Because this was a shipped product, they didn't want to leave the instrumentation in the operating system they actually shipped, but they didn't want to fuss with the source code either, because of the risk of introducing regressions; this was a very clever strategy. We actually took the same strategy and built it into the system I most recently worked on, which was a test automation system. We used it, and it was quite effective.
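A minimal reconstruction of the idea (mine, not the shipped code, and using conditional compilation rather than the const/inline trick itself): in release builds the check compiles to an empty inline function and generates no object code; in debug builds it samples at a globally settable rate.

```cpp
#include <cstdlib>

inline bool invariantHolds() { return true; }  // per-class check, stubbed here

#ifdef NDEBUG
inline void sanityCheck() {}  // release build: empty, optimized away entirely
#else
inline void sanityCheck() {
    static unsigned calls = 0;
    static unsigned interval = 256;  // global dial: 256 = sample, 1 = always
    if (++calls % interval == 0 && !invariantHolds())
        std::abort();                // invariant violated: fail loudly
}
#endif

int main() {
    for (int i = 0; i < 1000; ++i)
        sanityCheck();  // call sites never change between build modes
}
```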
The Percolation pattern, basically, is design-by-contract for class hierarchies, and it's a way of enforcing compliance with something called Liskov substitutability. That simply means that a subclass must honor everything its base class promises. If you implement this with a kind of "no code left behind" discipline, these additional functions give you a runtime check on the consistency of the extensions to the class hierarchy. So you can do some pretty sophisticated things with built-in test.
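A minimal sketch of the percolation idea, with hypothetical Account classes: each subclass invariant conjoins its own condition with the base class's, so an extension can never silently weaken the base contract.

```cpp
#include <cassert>

class Account {
public:
    virtual ~Account() = default;
    virtual bool invariant() const { return balance_ >= 0; }
protected:
    long balance_ = 0;
};

class SavingsAccount : public Account {
public:
    bool invariant() const override {
        // Percolation: the subclass condition AND the base invariant must hold.
        return rate_ >= 0.0 && Account::invariant();
    }
private:
    double rate_ = 0.01;
};

int main() {
    SavingsAccount s;
    assert(s.invariant());  // runtime check on hierarchy consistency
}
```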
The issue here is, of course: is it worth it? When you put the effort into built-in test, you put it in once, and it's there and it works. So I would say that when you have the opportunity to do things like this, it's at least worth thinking about. Well-structured code: this is a subject on which a lot has been said, many well-established principles, and I won't delve into all of that. But there are several principles that turn out to be fairly significant, in
particular, for testability. Here's one: no cyclic dependencies. A cyclic dependency is where A calls B, B calls C, and C calls A; that's a cycle. Those are bad; don't go there. Why? In terms of testing, it means we basically have to take all of those parts, everything within the scope of the cycle, and test it as a unit. And then doing state set and get to bring something that participates in a cycle to a particular state may be difficult. There's an idea called levelization; John Lakos has a great book about this, which I recommend to you if you're doing C++ development. Basically, one of the takeaways from it is to not allow static dependencies to leak across functional or package boundaries. Something that performs a function at, say, one level of the stack should not reach up to another level of the stack through a static compile-time dependency and make it do something else. And this principle is kind of general but really powerful for testability: partition classes and packages to minimize interface complexity. As you're designing, deciding what goes where, what the façade is and what it looks like, and you have several alternatives, choose the one which minimizes interface complexity. All right. So, that's a lot. Maybe I'll just take a moment here and see if there are any questions at this point about what we've talked about so far. Sam. >> SAM: Bob, you haven't mentioned security.
>> BINDER: That's right, I haven't. >> SAM: And I'm curious how you increase controllability and observability without increasing attack surface from a security standpoint. >> BINDER: I think this is a tradeoff, and I don't have a good answer for that.
I think you do necessarily increase it. It's like putting in a backdoor: if somebody else, you know, the bad guys, find out about it, they'll probably break in and do something. So it's definitely a concern; that's one of the tradeoffs involved in doing
this. Yes sir. >> Hey, this is Ramesh here. So, at least in my experience with testability, these are very good points and thoughts. Most of the time, when we think that we have done quite enough improvement in testability, by the time we measure, we always realize that there is still a long way to go. I'm just wondering whether there is anything in the slides about how you, say, for example, you talk about states, right? I'm sure we would all design a test assuming that most of the states are covered, but some transitions which you would not have thought about, you later realize are not covered at all. Is there a way of finding these? Because the challenge is, we always assume that we have done a good job, and then we realize that there is something we have never even thought about. So how do we find these...? >> BINDER: Well, if I understand your question,
you're saying, you know, how can we have confidence that our test suites, our test strategies
are complete? >> Yeah. And also, are there any measurements you could suggest to us? We generally talk about code coverage, condition coverage, and some of these. And sometimes we say the code coverage is fine, but we are not able to achieve, you know, better condition coverage and things like that. So do you have anything which would add more value beyond the basics of what we are talking about, something we could explore more? >> BINDER: Well, yeah. So the question
is: what kinds of additional criteria might we consider, beyond code coverage metrics, to help us have confidence that our test suites are complete? There's no end of coverages, so, you know, if you want to go and [INDISTINCT] some new ones, be my guest. The whole point of coverage is to take a particular testing goal and say: how much of it have we done? How close have we gotten to it? And why do we have testing goals? Because we have some intuition, or suspicion at least, that the goal is related to finding bugs. Underneath every testing goal is an assumption that says: I think if I look under this particular rock, I am more likely to find bugs. So I would say it depends on the particular kind of system that you're looking at. If you want to develop more specialized criteria, you should look to that system itself and the things in it that you were uncomfortable with, or that you're not certain about, or where you said, "We'll put this thing together, we'll do the best we can; we had to punt on this one; we don't know." If there are areas where you have either demonstrated risk or a higher subjective assessment of the likelihood of a problem, I would then ask: what can we do to try to identify problems given that assumption, and then go after that. So make up your own coverage criteria. All right, I'm going to get back
the pacing a little bit here. It's a lot of stuff. One last question. Yes, sir.
>> I just have a comment on that. I think you started out with an economic definition of testing, and I think what you just alluded to is that it's all risk assessment and how much time you have. There's no magic in it.
>> BINDER: I'm sorry, say again. What's the question?
>> Oh, it's just a comment. >> BINDER: A comment.
>> Yes. >> BINDER: Okay. All right.
>> I'm just supporting your view of testability as economics...
>> BINDER: Right. If you're an engineer and you don't like talking about money because money is dirty, you can just say "tradeoffs." So you've got absolution there. Okay. Black box testability. Factors that decrease it, and this is looking at
a system from an external perspective: size, nodes, variants, and weather. Weather? What the heck is that? I'll tell you in a minute. Factors that help: the test model, the oracle, and automation. Again, I don't claim that this is a complete inventory, but these illustrate some of the things to think about. System size. Of all the interesting technical things about a system that drive testability, one that I think is not often mentioned is: how big is it? A huge system obviously is going to take more work to test. From my economic perspective on testability, if I assume I have a fixed amount of resources, other things being equal, I'm going to be able to do less testing. So the larger and more complex the system, the lower its intrinsic testability, in my view. All right, so how can we size our systems?
There are many, many metrics; choose the one that you like best. I'd use something like: how many methods, how many well-understood use cases, how many singularly invocable menu items in the command/subcommand structure. Another dimension is computational strategy. In some systems, most of what they do is visible at the boundaries; in others, most of what they do is hidden away. A transaction processing system is mostly visible at the boundaries. Compare something like simulation: I worked, for a large oil company, on a reservoir simulation system, which created huge finite element models with lots of mathematics to simulate the behavior of underground oil, gas, and water reservoirs. It just chugged away for weeks and weeks. You set maybe a hundred parameters to start the simulation going, it ran, and several weeks later you got either a report or, in some cases, a picture of what things looked like underground, or at least the best guess. Most of the work was going on in those computations. Video games are another interesting area, where the surface matters, and how you size those has its own unique dimensions. Another way of sizing a system is storage: how many tables, how many views, what are the things we put into it and look at. What about the extent
of the network? How many independent nodes do you have to get going? So that's how many
dogs do I have to line up on the stage and make them bark out, you know, Jingle Bells.
Client/server systems: simple, okay, at least two, maybe a lot more, depending on what I want to accomplish. In tiered systems, of course, we can have division of labor across different kinds of servers and computers, and maybe peer-to-peer systems. This example is from one I worked on recently; it's basically an explanation of the Microsoft implementation of two-phase commit, and it takes five computers to do two-phase commit. So in our test lab, and this is fact, not fiction, we actually had five computers that each performed those roles. If you want to get a little more formal about this, you can look to mathematical modeling: find a minimum spanning tree of the configuration, and you must have at least one of those online. If you're devising a large networked system, you have lots of nodes that have to participate, and you have allocated functionality across those nodes, what does that imply for testing? It means you're going to have to have a lab where you have at least one of each. Variants: this is another dimension
of the test problem that often isn't paid a lot of attention until, you know, several weeks before you have to ship. How many configuration options are there? A configuration option is something you usually set once: the user sets it and forgets it. But there are lots and lots of them, and there are many possible interactions, many things that can go wrong. How many platforms are supported? How many versions of Windows will this run on? Does it run on the Mac? Which flavors of Linux, et cetera, et cetera? How many localization variants do we support? How many editions for commercial competitive marketing purposes? Each of those contributes combinations, and each of those can have interaction effects. Everybody who's been in the commercial software world knows that if you don't test this stuff, you will pay dearly. Combination coverage. One way, one
strategy for picking these combinations, is to do what's called pairwise testing. That basically means trying to be sure that you test each option value with each other option value at least once. It's not a bad strategy; actually, it's very powerful in terms of finding bugs. The worst case for pairwise is basically the product of the sizes of the two largest options, so if you have options with five values each, that's at least 25 tests. There are a lot of very sophisticated pairwise selection strategies that try to compress that; without going into it, the number can be reduced with some good tools for choosing the pairs.
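A toy illustration (mine, not from the talk): three boolean options have 2^3 = 8 exhaustive combinations, but the four tests below already cover every pair of option values, and the loop verifies it.

```cpp
#include <array>
#include <cassert>
#include <iostream>

int main() {
    // Four tests that pairwise-cover three boolean options (exhaustive = 8).
    const std::array<std::array<int, 3>, 4> tests = {{
        {0, 0, 0}, {0, 1, 1}, {1, 0, 1}, {1, 1, 0},
    }};
    // Check every pair of options (i, j) and every value combination (u, v).
    for (int i = 0; i < 3; ++i)
        for (int j = i + 1; j < 3; ++j)
            for (int u = 0; u < 2; ++u)
                for (int v = 0; v < 2; ++v) {
                    bool covered = false;
                    for (const auto& t : tests)
                        covered |= (t[i] == u && t[j] == v);
                    assert(covered);  // every value pair appears in some test
                }
    std::cout << "4 tests pairwise-cover what takes 8 exhaustively\n";
}
```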
But this is another element of system testability and size. Weather. What I meant by weather is environmental stuff, the real world that your system has to operate in, where all you can do is complain about it but you can't control it. So, the example I mentioned earlier,
the cellular base station. We really had to struggle in that circumstance to find ways to adequately load the system without basically fielding it. Once it was fielded and the customer had paid for it, and each of those cellular stations was about 10 million bucks at least, the customers wanted to use them; they didn't want us to test them.
What about an expensive server farm? I think Google as an organization knows a lot about big, expensive server farms. Let's say you have one that's going to support a certain kind of demand, a certain kind of computing. Do you basically have a second one which you dedicate to testing, or do you share it somehow? Suppose there are competitor or aggressor capabilities which are part of the real world your system has to be deployed into; how easily can you replicate those in your lab? If you try to do anything in cyber warfare, this is an interesting problem. The Internet, of course: the way I think of this is that you can never step into the same river twice. In other words, it's a little bit hard to recreate circumstances when you're dependent on all the vagaries of network communication. And suppose, of course, and you may or may not have experienced this, I have a few times, that there is no test environment.
I worked at an early version of what was then called the Chicago Stock Exchange, putting in one of the early floor trading systems, and there was only one computer, basically, at the Exchange. During the day, it ran the existing stuff. We couldn't shut it down and run our apps on it, because there were literally billions of dollars riding on this machine, and if it hiccupped, you know, there was a lot of grief. By the time everything was wrapped up after the end of the trading day, it was 10 o'clock, so we got to test from about 10 p.m. to 4 a.m., and just like Cinderella's coachmen mice, we had to be out of there and have the system clean so it could be rebooted for the production run the next day. And there were actually some times when, surprisingly, the test system did something bad, and there were some very tense moments in getting that system brought up the next morning. So there are circumstances where the production [INDISTINCT], the production fielded system, must be used for test, and the kinds of things you can do are limited. You can't stress it. One of the things I wanted to do in mobile
testing early on, we went to a cellular [INDISTINCT]: look, we've got this great tool for you, we're going to put it on the air and it's going to saturate your cell tower. And the answer was, no, you aren't going to do that. So this is what I mean by environmental factors. This varies, of course, from one system to another, but it is part of the dimension of testability. So, other things being equal, a larger system is less testable. You spread
the same budget over more, and it becomes thinner. So here's a hypothetical case: 10,000 feature points, 6 network nodes, 20 options with 5 values each, and we can run tests from 9 a.m. to 3 p.m. How big is that? [PAUSE] Well, it's big.
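To put a rough number on just one slice of it (my arithmetic, not a figure from the talk), the option space alone is

$$5^{20} \approx 9.5 \times 10^{13} \text{ configurations},$$

before you multiply by feature points and node arrangements.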
I don't know if the correlation here is exact, but I did some looking. This is the M66 galaxy, and
one pixel in this picture is about 400 light years, and the distance across the solar system, by the best estimate, is about 1/400th of a light year. So it's big. But, you know, if you start to think about the number of states in a complex system and the number of combinations of things that we have to test, pretty soon you get up to astronomical numbers. So while the comparison is somewhat for dramatic effect, it's not entirely [INDISTINCT]. The other
element of black box testability is understanding. What do we know? How much do we know about a system? What's our primary source of knowledge about the system we are testing? Is it documented? Is it validated, or is it intuited or guessed, or perhaps it doesn't even exist yet? If you want a good place to start, in terms of requirements at least, there are many standards and guidelines for this. IEEE 830 is a good one. It's kind of old and fairly simple, but I still think it provides a lot of useful guidance. So how do we know what we're going to test? How do we know what our system is supposed to do, besides just guessing at it? Have you ever had
the circumstance where you go into a room with other developers, and you think you know what the system is supposed to do, and so do they? And after talking for 5, 10 minutes, you get this kind of queasy, uneasy feeling, like, "What the hell did they just say?" or something to that effect. And then about half an hour later everybody leaves the meeting and, you know, something dramatic may happen. We've known for a long time that getting a shared vision of a complex system is essentially the biggest challenge in software engineering. Get a roomful of people, 20 people, working on something extraordinarily complex and difficult, and they'll each have a picture in their mind. I can guarantee it's not the same picture. Maybe mostly the same, but not the same. Then, as testers,
we try to say, "Well, which picture is right? Which one should I believe?" The tester takes that and derives a test model from it. There are many different kinds of test models, you know, different strokes for different folks; I think having a test model of any kind is better than having none. Test models may be formal, they may be informal, et cetera, et cetera. One distinction I like to make is: are they test-ready, or are they kind of hints? A test-ready model is one which you can commit to code, or is already in code, and you can produce tests from it or evaluate tests against it. And then finally, do we have an oracle? And I don't mean the database company. An oracle is basically something
which allows us to determine whether or not a test result is as expected. In model-based testing, which is something that I like and do a lot of, it's relatively easy to produce tens of thousands, even millions, of tests automatically in a matter of minutes. The question then becomes: if I run those tests, what happens? Can I decide whether the results of running them are actually [INDISTINCT]? If I can't, the tests are not very meaningful.
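One computable-oracle pattern, sketched with a hypothetical integer square root (illustrative, not from the talk): check a fast unit under test against a slow but obviously correct reference.

```cpp
#include <cassert>
#include <cstdint>

// Unit under test: a fast bitwise integer square root.
uint32_t isqrtFast(uint32_t x) {
    uint32_t r = 0, bit = 1u << 30;
    while (bit > x) bit >>= 2;
    while (bit) {
        if (x >= r + bit) { x -= r + bit; r = (r >> 1) + bit; }
        else r >>= 1;
        bit >>= 2;
    }
    return r;
}

// Computable oracle: slow but obviously correct reference implementation.
uint32_t isqrtSlow(uint32_t x) {
    uint32_t r = 0;
    while (static_cast<uint64_t>(r + 1) * (r + 1) <= x) ++r;
    return r;
}

int main() {
    for (uint32_t x = 0; x < 100000; ++x)
        assert(isqrtFast(x) == isqrtSlow(x));  // the oracle decides pass/fail
}
```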
So do we have an oracle? Is it computable, or is it judgment? Sometimes judgment is the best we can do, and in some circumstances it's a good strategy to have a person interact with the system and then judge whether or not the system makes sense. Finally, let
me say a few things about test automation. I'm a big believer in automation. I know there are other people in the testing world who are big believers in people. I believe in people, too, but computers are better than people at certain kinds of testing tasks. In particular, in bigger systems we need more tests. Automation properly used, and of course it can be misused, gives us intellectual leverage. It allows us to extend our vision and understanding across a very large and complex space. It's repeatable. We can scale a functional test into a load test, and many other kinds of things. There are lots and lots of different kinds of automation; I'm sure you'll hear about different strategies at this conference today and tomorrow. I mention just a few here; this is far from an exhaustive list. Model-based testing, again an area
of particular interest to me, does two things: it generates tests, and good model-based testing systems also choose their models carefully so that the models can serve the purpose of evaluation as well. Finally, why does test automation matter? This is a kind of notional slide. I don't
claim that this has any deep research behind it, but it's kind of the way that I look at
the world. Look at effectiveness as the ability to create a system which is reliable, and suppose we categorize this according to reliability or availability statistics: 5 nines versus 1 nine. Five nines basically means a system which has about five minutes of downtime a year; 1 nine means a system that is down roughly a couple of hours a day.
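The arithmetic behind those figures (my calculation):

$$(1 - 0.99999) \times 365 \times 24 \times 60 \approx 5.3\ \text{minutes/year}; \qquad (1 - 0.9) \times 24 \approx 2.4\ \text{hours/day}.$$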
If we look at the other factor, efficiency: if I can produce a system that is 5 nines versus one that is 1 nine with the same testing strategy, I would say my testing strategy is more effective if I can achieve higher reliability at the same cost. Productivity is kind of the total cost per test: how many tests per hour, or per dollar, or whatever your unit of measure is, can I do? My experience of manual testing was in the region where we can get probably 2 nines, a system that would run for several weeks without burping seriously, and we get on average about one test an hour. And I take into this my experience of the total cost of testing: not just the first time you write the test, but when you come back to it later and have to maintain it, fix it, or throw it out and start over. So it's kind of the total cost of that, along with everything else; if you take all the inputs, economic and otherwise, that's what I'm talking about here. Of course, any reasonable tester can do more than one test an hour; I'm saying that on average, that's about where it lands. Scripting, both
of the GUI capture/replay kind as well as unit testing with the various test frameworks, gets us about an order of magnitude better productivity, and in my experience this helps us find, other things being equal, about another notch up in bugs. My own experience in creating model-based testing systems for specialized purposes puts this higher still. I claim, and I do have some data to back this up, that we're able to achieve two orders of magnitude better productivity in terms of the number of tests generated per unit of economic cost in producing them. And at the same time, the tests were much more intensive and broad, and reached into parts of the system that we could not have covered, or even imagined, doing simply manual or, let's say, traditional kinds of testing. I worked for the last several years on a model-based testing vision which, unfortunately, is incomplete; it had a lot of kind of hokey limitations that I accepted because customers wanted them, so, you know, I was on their nickel and didn't do all the things that I wanted. But I believe that model-based testing, properly understood, can get us to what might seem to be kind of fantastic levels of efficiency as well as effectiveness. So
the kind of test automation that you have is a factor in your testability. The takeaway from this chart, for me, is that system scale, complexity, and difficulty of testing are getting bigger, not smaller, right? Our systems aren't getting smaller; we have an expanding universe. If we stick with strategies at the low end, basically you're going to run into more problems than you want: you're going to run out of resources, or you're going to produce systems that are unnecessarily or unacceptably buggy. You won't be able to keep up. Let's talk a little bit about strategy. So what's the
bottom line? How do we improve testability? Well, for white box testing, you've seen that there are several things we can try to improve: built-in test, state helpers, PCOs. We'd like to maximize those and minimize the corresponding blockers of testability. With black box, the same thing: we'd like to maximize our ability to produce meaningful models, evaluate the results, and have a harness that helps us do that, and we'd like to minimize all that other stuff. Okay, so that's not very profound. The thing that's of interest
to me is: who owns these factors in most organizations? Now, this may not be true in yours, and if it's not, you should consider yourself lucky. But in most organizations that I've worked with, the fact is that testers don't typically control or own the work that drives testability. The things that drive testability are basically dictated or handed to them by the architects, the people managing the system, and the developers. Testers are basically working on the test parts, and the rest is determined by someone else. So the things that set the bounds on how effective you can be are often outside your control. This is kind of a whole process and organizational issue, a whole other subject I'm not going to attempt to discuss. But it's something I think you may want to reflect on, and if your circumstance is not like this, again, I say, consider yourself lucky. So here's
a strategy box, finally, at the end. Suppose we're in a circumstance where we have high black box testability and low white box testability. My argument is we should emphasize functional testing, the black box approach, because we can't really do much on the white box side. The implementation might be a legacy system which is just hard to test; what should we do? "Okay, let's not kill ourselves. Let's go for the low-hanging fruit," pardon the expression. Emphasize functional testing when that is the thing you can do, because if you take the alternative strategy and try to dig into the code, yeah, it's a losing battle; you can burn a lot of cycles, a lot of time and money, and not get the reward. Symmetrically, the other thing is true when you have high white box testability: you have a system that is cooperative and well structured, but you might not know much about its behavior. For example, the system is relatively new; something has just been through development and beta test for the first time. You might want to emphasize implementation-specific aspects of it. This invites the question the gentleman asked there: what kinds of things might we know, what should we pay more attention to? It depends on what you expect to go wrong. If you're in a circumstance where you have low testability on both counts, I think your best attack is to learn how to manage expectations. And finally, if you are in a circumstance where you have produced a system which has achieved high testability on both the implementation and representation sides, I think you should be trying to figure out how to do it again. Because the news here is that you've done a tremendously good job, something that's unusual, unique, and you're going to want to figure out what the magic was that made it happen and put it in a bottle. Okay, so that basically
is my story and I'll take questions. I think we might have a little bit of time. I don't
want to run over too much. Yes, sir. >> Hello. My name is Rizuan. You mentioned something about testing in the production environment. How do you manage the strategy for testability in terms of performance, in terms of load and stress testing? Do you do some, like, capacity planning or estimation? Do you do a prediction based on your knowledge of, you know, white box testing and black box testing? How do you do that in advance? >> BINDER: That's a kind of general question, and it's sort of hard to answer without knowing the specifics. I think you just have to look at the tradeoffs and decide what makes sense and what's doable there. So it's negotiable. Other
questions? All right, so I have a question for you. When I played the Messier M66 galaxy slide, I actually had an internal debate about what kind of music to play along with it. What I chose is an orchestral piece with some very, very dramatic brass in it; I thought the other thing that might work was Jimi Hendrix's Purple Haze. So I don't know, would you have preferred to hear Purple Haze this morning? I don't know. Okay. Well, another question over here. >> You see, one thing that you didn't mention
is the test data. A lot of times we are doing performance testing and load testing. Here I am. >> BINDER: Oh, okay. >> Yeah. So, I was going through a lot of the open questions you were posing in the talk, and one thing I thought was an extra element is the test data, especially when we are doing performance, load, and stress tests, when we actually load the databases, when we are required to do that with about 40% of the production data. So my question is: don't you think test data is a challenge when we're actually handling, you know, testability, in the kind of time it takes for us to set the stage?
>> BINDER: Yeah, it certainly is, and that's a very good point. How do you populate and instantiate that setup? I didn't get into that, but in a data-intensive system which has a large database, getting it just to some initial usable state which is consistent can be quite a lot of work. So it's something that's worthwhile paying attention to, and I think automating as well. With model-based testing, if you have a model from which you can assume certain scenarios and then generate a database snapshot that corresponds to those scenarios, unload the database, and reload it with that scenario, I've found that to be a very useful kind of tool to have in the situation you described. Other questions? Okay. Well, thank you very much for your attention this morning.