>> BINDER: Good morning, everybody. My name is Bob Binder. I want to talk to you this morning
about testability. The title of the talk and its general structure have changed a little bit from what's in the announcement, but don't worry, it will be pretty much the same content; the stories are just going to be told in a somewhat different manner.
So, what I'd like to do is to talk a little bit first about why testability matters, or at least why I think it matters. Then we'll look at two dimensions of testability from a kind of high-level perspective; I'm going to call them white box and black box. I'll talk a little bit about the role that test automation plays in testability, and then try to draw some conclusions about strategy and how we go about the process of testing: designing tests and running them. And then I think we'll have some time at the end for questions and answers. So why does testability matter? Basically, I look
at testability from an economic perspective. Let's start with a few assumptions. In software, sooner is better than later. Bug escapes are bad. Fewer tests, again, other things being equal, mean more escapes. And in testing, we have a fixed budget. So the question is, given a fixed and finite amount of resources, and of course, I don't know, maybe things are different here at Google, but seriously, in most circumstances we have a project deadline and a finite amount of time and resources: ability, hours in a day, people, et cetera. And so the question becomes, how can we put that to best use? And
in terms of testing, the usual goal is to contribute value by removing defects from the system, and perhaps by some of the knock-on, secondary effects that often come from doing good testing. So, for me, testability is basically the thing that defines the limits on our ability to produce a complex system with an acceptable risk of costly or dangerous defects. There are two dimensions of testability, effectiveness and efficiency, which I'll come back to towards the end of this talk. Basically, if
you look at the total cost of doing a test, that's all the resources that are consumed
in doing testing and divide that by the number of tests, for me, that's the average efficiency.
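In symbols, that works out to (my shorthand, not a formula from the slides):

$$\text{Efficiency} \;=\; \frac{\text{total cost of testing}}{\text{number of tests}}, \qquad \text{Effectiveness} \;=\; P(\text{a given test reveals a bug}).$$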
And when I look at this, it's not just the time that you might spend first writing a test against your source code; it's everything that you have to do afterwards, everything somebody else might have to do, or can't do, because of the way things have been built. And effectiveness is the average probability that when you run that test, you'll find a bug. Hopefully that's low but not zero, at least starting out. So, you know, other things being equal, higher testability means more and better tests at the same cost; lower testability means fewer, weaker tests at the same cost. What makes a system under test testable? Classically, there are
two dimensions to this: controllability and observability. This goes back to hardware engineering, digital logic design. Hardware people started to work on testability issues a long time ago, driven especially by increasing miniaturization. When you went from LSI, Large-Scale Integration, to VLSI, Very Large-Scale Integration, you got several orders of magnitude more circuits on a piece of silicon, and now we're at the sort of nanoscale wires that are in the computers we're all using. You couldn't just stick a probe in it, right? The wires are too small. It wasn't like the old breadboard where you had everything exposed and hanging out. So, in order to determine what was going on within a circuit, they had to have a way in which you could controllably observe what was going on inside a system, and to make a long story short, there's a whole standard for this, called the JTAG standard. Basically every chip that's made has four additional wires coming out of it that allow you to do testing. So the idea of controllability and observability, at least from my perspective, has its roots in hardware engineering. Controllability means, "What do we have to
do to run a test case? How hard is it? How expensive is it?" Does the system under test make it impractical to run some kinds of tests? There may be questions that we'd like to ask, a scenario we'd like to evaluate because it's likely to recur in the real world, but in our test environment it may be prohibitively expensive, or just technologically infeasible; I'll give you some examples of this later on. Given a testing goal, do we know enough about the system, its behavior, and its likely environment to produce a test which is realistic and meaningful? You might say, "Well, sure, how hard can that be?" Well, think about it: let's say our testing goal is to cover all the requirements of our system, one test for each. That sort of presupposes that you actually have requirements. How many of you have worked on a system where you had a full set of requirements? So the knowledge we have, what we approach our testing with, what drives our design of test cases, really does affect testability. And how much tooling, by the way, can we afford to achieve controllability? These are all factors that influence
this. Observability has kind of a symmetric relationship: what do we have to do to determine whether a test has passed or failed? Again, this may seem simple. When you're talking about, you know, straightforward unit testing, where everything is on your desktop and under your control in a nice, well-organized sandbox, it's not so hard to do. But the testing that I'm sure many of you are involved in, large distributed systems, is not quite so simple. And again, the questions are: how hard or expensive is it to achieve a particular kind of testing? Does the system under test, the way it's structured or designed, make it hard or easy to do this? Can we easily find the information that we need to determine whether a particular situation has occurred or not? And do we know enough to determine pass, fail, or did-not-finish? Here's a fishbone chart
that I first produced about 15 years ago. I got interested in the subject of testability for a number of different reasons; the story is not terribly important. Anyhow, I looked at all the factors in this, and this is kind of what I came up with. Now, for those of you who don't have, you know, one-thousand-by-one-thousand vision, here's the short form. Testability, at least in that initial analysis, had six basic factors, and each of them had a lot of separate individual drivers. Today I'm going to focus basically on representation, implementation, and, to a certain extent, the test tools. I'm not going to talk a whole lot about process, how the testing is organized, but I am going to talk a little bit about built-in test. There is an article on this that you can read at your leisure if you like. The reason I put this up, the takeaway, is to say that testability is not, at least to my mind, a single-dimensional issue. There's really a whole web of forces and factors that influence whether or not a system in your particular context is testable. Let me give you some examples from personal experience of systems that I have worked on and the issues in controllability and observability that I've had to wrestle with in trying to create tests for these systems.
Well, GUIs. Everybody deals with a GUI sooner or later. It's basically impossible to test a GUI, other than by manually interacting with it, without some abstract widget set/get capability. If you do happen to have one, there are commercial tools such as, you know, HP's WinRunner or something of that nature, or Selenium if you're using a web interface. That's great, it does a lot for you, but it's also brittle; this is, of course, capture/replay, and we all know what headaches are involved there. Latency is another interesting problem: latency in terms of the response of the system, and also the think time that users impose in real interaction. Our testing usually isn't very good at capturing that, so it's a controllability issue, and we have a hard time actually dealing with the variations in response and think time. Dynamic widgets: by that I mean widgets that essentially define themselves on the fly, and things that are very specialized, which abstract setters and getters can't really deal with. Observability: not everything is as simple as a text box where you can just get a string and figure out whether that string says what you want. There's structured content with lists and all sorts of things, and this implies some kind of notion of a cursor, not just for the tester but as a position within the data structure. Can that be established? Can it be manipulated to get something of interest out? There's a lot of non-determinism and noise in non-text output; graphical output is notoriously difficult to parse if you can't parse it as text. There have been some very interesting and successful attempts to extract information meaningfully from, you know, basically a bunch of bits that represent a picture. It's still a hard problem, I think. Image recognition as a test: I looked at this a few years ago and got a lot of interesting research proposals, but nothing that was immediately a takeaway, at least as far as I could see; I mean, there's plenty of proprietary lockouts. So testing GUIs, in terms of controllability and observability, even with the tooling set that we have, and it's a fairly large industry that supports this, used to be close to a billion dollars a year, is still not that great. One system I worked on, we had to basically drive a lot
of exceptions out of the operating system. This was a Unix platform. There were hundreds of exceptions that the system under test could throw, and the issue for us was whether or not the application we were testing could actually catch them and do something reasonable with them. Well, we had to generate them first; so how the heck do you do that? How do you force exceptions? There were certain things that were kind of difficult to get to. And then another interesting issue, observability in this case, was silent failures. If you could force the exception, oftentimes the application would just say, "I don't care." So we really had no way of knowing, perhaps other than the absence of a response, whether what actually occurred was as expected.
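A minimal sketch of the general technique for making such failures forcible, using a seam the test can substitute; FileSystem, FailingFileSystem, and loadConfig are hypothetical names, not from the system described here:

```cpp
#include <cstdio>
#include <stdexcept>

// The unit under test talks to the OS through a seam, not directly.
struct FileSystem {
    virtual ~FileSystem() = default;
    virtual std::FILE* open(const char* path) { return std::fopen(path, "r"); }
};

// Controllability: a test can substitute a file system that always fails.
struct FailingFileSystem : FileSystem {
    std::FILE* open(const char*) override { return nullptr; }
};

void loadConfig(FileSystem& fs) {
    std::FILE* f = fs.open("app.conf");
    if (!f) throw std::runtime_error("config unavailable");  // no silent failure
    std::fclose(f);
}

int main() {
    FailingFileSystem fs;
    try { loadConfig(fs); }                                   // force the exception
    catch (const std::exception& e) { std::puts(e.what()); }  // observe the response
}
```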
My first exposure to [INDISTINCT] programming was back in the Objective-C world, the so-called NeXT world; a very interesting experience. Objective-C is a highly dynamic language in which it was common programming practice to define objects on the fly, and to define the classes for those objects on the fly, so programs had this sort of feeling of writing themselves. Well, that's very interesting, and it also creates a lot of headaches in terms of testing, because you don't know what you're testing, what the testing target is, how to evaluate it, and whether, you know, it's remotely close to what you want. And then these things tended to sprawl out of control, so the source code that you looked at was nothing like what the actual implementation was. There was also a problem when we tried to instrument objects on the fly. How many of you have worked with mock objects in a system that has a DBMS, a database or a large data store, behind it? Okay, so, you know, as they say, "How's that working out for you?" This sometimes
can be quite challenging. We may just want to take a little piece of functionality away from the database for our particular application, and it may turn out that writing the mock object is, in some sense, a project of similar complexity to constructing the database management system itself. A system I worked on a number of years ago was a multi-tier distributed-object application, and we had a real challenge getting all the distributed objects to a particular desired state to achieve a particular test. When I tried to describe the problem to people, my family, who are not software people, didn't know much about it, didn't want to, I said, "Well, it's like this: suppose you had a dog act. You had six different dogs. You want to get them all up on the stage at the same time, perched on a little chair, you know, balancing a ball on their nose and barking out, you know, 'Merry Christmas' or something like that." It was comparable to that. So there were lots of issues in controllability there, and then we had some other interesting things that went on with tracing message propagation. When you have distributed systems, message propagation and figuring out what happened at all the points in the
path is quite a task. Another system I worked on was a cellular base station, and this is kind of the ultimate in non-testability. A base station is essentially a big radio tower, and the physics of radio transmission are very hard to emulate. You can kind of fake it out, but there are certain things that happen that are not easily emulated. So basically the best place to test a cellular base station is to take it out in the field and have, you know, 10,000 people pick up their cellphones and try to make a call. Well, that gets to be ridiculously expensive, and by the way, the customers who are paying for the base stations want people making calls and not getting them disrupted in the process. There are also lots of proprietary lockouts in this, and all sorts of other interesting things going on. The systems are never offline. So the point here is that controllability and observability have some very real dimensions in lots of different kinds of systems.
Let's talk a little bit about some of the dimensions that come to us from the implementation. Things that hurt testability in the implementation are complexity and non-deterministic dependencies, or what I call NDDs. Things that help are points of control and observation, built-in test, state helpers, and good structure. And by the way, I'm not going to claim that this is an exhaustive list, but it is what I'm going to touch on today, just to give you some sense of what things you have to pay attention to. Before I go much further into this, I want to introduce just a little bit of theory about testability, and I hope I'm not taking anything away from a later speaker, Jeff Offutt, who basically helped define this theory some years ago. To reveal a bug, a test has to do several things; several things have to line up. You have to get the buggy code executed. You have to trigger the bug at that code location. And executing
a piece of code even if it has a defect in it does not necessarily mean that it fails.
When it does fail, we have to propagate the incorrect result to something that's observable.
There has to be an observer of the incorrect result, and the incorrect result must be recognized
as such by the observer. So we say, "Well, yeah, gee, that's all kind of obvious." In a sense it is, but there are some interesting takeaways from this. Here's a trivial fragment of code; this is an example devised by Jeff many years ago. The bug in it is the kind of thing that I used to do all the time: a wrong operator, so I should have had an addition instead of a subtraction, but I didn't. Anybody want to guess which test cases would reveal this bug? Suppose you didn't know it was there, so no cheating. What test would you have chosen to exercise this method?
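The slide itself isn't reproduced in the transcript, so here is a stand-in fragment with the same flavor: a wrong operator over a 16-bit input domain of 65,536 values, where only a handful of inputs reveal the fault. The function and constants are illustrative, not Jeff Offutt's actual example:

```cpp
#include <cstdint>

// Intended: bucket = (x + 2) / 20000.  Coded with the wrong operator below.
// For a 16-bit input there are 65,536 possible values, but the off-by-4
// difference only changes the output when x straddles a bucket boundary:
// x in {19998..20001, 39998..40001, 59998..60001}, i.e. 12 revealing inputs.
// Every other input executes the buggy line yet still produces the expected
// result, so the fault stays hidden (the infection never propagates).
int bucket(uint16_t x) {
    int32_t z = static_cast<int32_t>(x) - 2;  // BUG: should be x + 2
    return static_cast<int>(z / 20000);
}
```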
Well, it turns out that we could do exhaustive testing on this, about 65,000 possible inputs, and there are only six test cases out of those 65,000 which would reveal this bug. If you'd chosen one, the number one, which is kind of an obvious choice, and a lot of testers say, "Well, you know, we should at least do that," sorry, you wouldn't find the bug. If you chose zero, well, you had better luck there. So this is kind of a very low-level notion of testability, but it's an important one. Basically the idea is: what is the propensity of code
to hide bugs. You know, when it's wrong, how easy is it for us to determine or write a
test that shows that it's wrong? This example is somewhat contrived, but it indicates that there are plenty of very simple circumstances where it's pretty darn hard to find those problems. So what can we do about this? We'll come to that later. Here's another one. I couldn't find any good dancing-dog pictures, but I did find this, one of these dancing hamsters; somebody really got busy with Photoshopping this, and it wasn't me. This suggests to me what these sort of non-deterministic dependencies are. These are the classic conditions: message latency; threading, and all the wonderful things that can happen when you use threading in your applications; and create, read, update, delete, the typical operations, on shared and unprotected data. All the stuff that used to happen before we had databases, and sometimes still does.
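A minimal sketch of the shared-data flavor of NDD (illustrative, not from the talk): the final value below varies from run to run, so a test asserting on it fails only intermittently.

```cpp
#include <iostream>
#include <thread>

int counter = 0;  // shared, unprotected: formally a data race (undefined
                  // behavior), shown only to illustrate intermittent failure

void bump() {
    for (int i = 0; i < 100000; ++i)
        ++counter;  // unsynchronized read-modify-write
}

int main() {
    std::thread a(bump), b(bump);
    a.join();
    b.join();
    // Expected 200000, but lost updates make the result nondeterministic.
    std::cout << counter << '\n';
}
```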
So these are, basically, things that are hard to control in an environment and that an application may allow or even rely on. They tend to be things that cause failures intermittently. Another key element of testability is the extent to which our systems are complex. Software complexity is a subject about which people have said lots and lots of things; I'm not going to get into too much detail today, other than to say it is critically important for testability, because the harder it is to get to each of those points in the code, the less likely you are to get there and therefore the less likely you are to see the bug. We have two kinds of basic complexity, essential
and accidental. Essential complexity is basically: you have a big job, you have a big system, and you can't get away from that. Accidental complexity is what gets dragged in, usually kind of coincidentally, because of technical decisions and commitments. There's a great analysis of this in Essential Systems Analysis, published a long time ago, which made the same kind of distinction. Usually we see some kind of graph diagram or other way of representing complexity; I thought today you might like to see a somewhat different one. This is by a well-known modern artist, Jackson Pollock, who did some very interesting things. I find that looking at this picture, and I don't know what your experience is, there's something about it that draws you in, and my experience of looking at it, of being drawn in, is that I can't quite figure out what it is. And then there's sort of an echo somehow. Without getting too much further down that path, by the way, the music that I chose illustrates a similar kind of complexity and compositional structure. So I thought this might suggest that complexity is kind of a psychological phenomenon, and testability lies in our ability to understand things and then construct tests from there. Think about Jackson Pollock the next time you think about complexity. What improves testability? Points of control and observation, state-based test helpers, built-in test, well-structured code. I'll talk a little bit about each of these.
What's a PCO? Are you familiar with something called TTCN-3? It's a notation, an abstract notation for test suites and test harnesses for protocol verification. Within it, it has this notion of a point of control and observation, and that's an abstraction for any kind of interface of interest. So what do we have to do as testers, basically, to activate a component or an aspect? You know what components are; what's an aspect? You may have heard of this. The notion is that there is some slice of functionality within a system that may not map cleanly onto a [INDISTINCT]; that's an aspect. These are things we're actually quite often interested in testing. Aspects are usually more interesting, but usually not directly controllable. So, for example, performance, the way the system responds or the way it consumes resources, is an aspect. There typically isn't one interface which you can touch to evaluate performance. And what do you have to do as a tester, or build into your test harness, to inspect the resulting state? Traces are one way of doing this, but they're often insufficient or noisy, at least traces that are designed for purposes other than testing. Embedded state observers, which I'll spend a little bit of time sketching for you this morning, are often effective, but they can be expensive to do, and some people complain that they're polluting. So, aspects are often critical but typically not directly observable. Design for testability, and this is like going back to large-scale integration, where I can't put a probe into, you know, a wafer of silicon that has nanometer wires in it, is to determine the requirements for aspect-oriented points of control and observation and build those into your system. Ask yourself in advance, as you're building the system: what are the aspects that I care about, that my customers care about, and how can I put something into my design which allows me to easily observe and evaluate those?
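Here's a minimal sketch of what such a built-in PCO for a performance aspect might look like; Server and setLatencyObserver are hypothetical names, not from TTCN-3 or the talk:

```cpp
#include <chrono>
#include <functional>
#include <iostream>

// A built-in point of control and observation for a performance aspect:
// the harness registers a hook and observes per-request latency without
// probing the server's internals.
class Server {
public:
    using Observer = std::function<void(double /*milliseconds*/)>;
    void setLatencyObserver(Observer obs) { observer_ = std::move(obs); }  // PCO
    void handleRequest() {
        auto start = std::chrono::steady_clock::now();
        // ... real work elided ...
        std::chrono::duration<double, std::milli> elapsed =
            std::chrono::steady_clock::now() - start;
        if (observer_) observer_(elapsed.count());  // observation point
    }
private:
    Observer observer_;
};

int main() {
    Server s;
    s.setLatencyObserver([](double ms) { std::cout << ms << " ms\n"; });
    s.handleRequest();
}
```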
One thing along that vein is state-based test helpers. I was very interested in this subject a number of years ago and collected some patterns about doing it.
Basically, to do state-based testing, you need to do several things. You need to be able to set the state and get the current state of whatever it is you're testing, and then use something that I've called a logical state invariant function, or LSIF. You'll also typically find it useful to have a reset, which takes the system under test back to some starting state. All the actions and events should be controllable and observable.
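A minimal sketch of those helpers in C++, with a hypothetical TrafficLight class standing in for the unit under test (setTestState, reset, and the LSIF name are illustrative):

```cpp
#include <cassert>

enum class State { Red, Yellow, Green };

class TrafficLight {
public:
    void cycle() { /* production behavior elided */ }

    // --- State-based test helpers ---
    void setTestState(State s) { state_ = s; }                 // controllability
    State getTestState() const { return state_; }              // observability
    void reset() { state_ = State::Red; goSignal_ = false; }   // known start state
    // Logical state invariant function (LSIF): is the concrete data
    // consistent with the abstract state "Red"?
    bool lsifRed() const { return state_ == State::Red && !goSignal_; }

private:
    State state_ = State::Red;
    bool goSignal_ = false;
};

int main() {
    TrafficLight t;
    t.reset();
    t.setTestState(State::Red);
    assert(t.lsifRed());  // verify the resulting state via the LSIF
}
```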
What does this look like, and why are we interested? Here is an implementation model of part of a system that supports a two-player and a three-player racquet game; a racquet game is something like racquetball, tennis, or squash.
The bottom is the state chart that might result when you work out what to do for a three-player game as an extension of a two-player game. The test model is somewhat different: if we want to test the three-player game, we need to consider the aggregate behavior, not just of this individual unit but of the whole composition. That takes us to producing a test model called a flattened machine. The flattened state machine looks like this, and it takes into account all the interactions. Now we can produce a test plan for this. There are many strategies for doing so; this is one that I like. What it does, basically, is trace all of the round trips within the state machine which take you from one state to another and back to the same one. If we have K events and N states with logical state invariant functions, that's basically on the order of K times N tests. So it doesn't explode, which is good, right?
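In rough symbols (my notation, consistent with the estimate in the talk):

$$|T_{\text{round-trip}}| = O(K \times N) \quad \text{for } K \text{ events and } N \text{ states},$$

linear in the size of the machine rather than exponential in the number of paths.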
If we don't have any logical state invariant functions and you really want to know what the resultant state is, then at the end of the test you have to say, "Okay, I did this, I did this, I did that. And then, is that what happened? Did I get to the player-two serve state? Is that actually what would occur?" Another system, an operating system handling the scheduling of processes and the management of resources, had to check all those kinds of things, and we had a real controllability-observability problem. The strategy was to add an invariant function into every class in the system; they called it sanity checking. The invariant function would basically call another function, which was globally allocated, to determine whether or not the invariant conditions of that particular object were met. The invariant check had some very simple global settings. One of them scaled the number of times it actually fired and spent the CPU cycles to do its checking, from once in every 256 calls, usually, up to always. So you had a way of randomly sampling; as the system stabilized, you could dial that down so you wouldn't have to check everything.
Perhaps in an earlier release, when things are still unstable, you do well to do more checking. Then there was kind of a clever trick, one I'm sure some of you may use. It's a combination of const and inline, a C++ [INDISTINCT] which basically causes no object code to be generated, without any changes to source code. Because this was a shipped product, they didn't want to leave the instrumentation in the operating system they actually shipped, but they didn't want to fuss with the source code either, because of the risk of introducing regressions; this was a very clever strategy. We actually took the same strategy and built it into the system I most recently worked on, which was a test automation system. We used it, and it was quite effective.
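A minimal reconstruction of the idea (mine, not the shipped code, and using conditional compilation rather than the const/inline trick itself): in release builds the check compiles to an empty inline function and generates no object code; in debug builds it samples at a globally settable rate.

```cpp
#include <cstdlib>

inline bool invariantHolds() { return true; }  // per-class check, stubbed here

#ifdef NDEBUG
inline void sanityCheck() {}  // release build: empty, optimized away entirely
#else
inline void sanityCheck() {
    static unsigned calls = 0;
    static unsigned interval = 256;  // global dial: 256 = sample, 1 = always
    if (++calls % interval == 0 && !invariantHolds())
        std::abort();                // invariant violated: fail loudly
}
#endif

int main() {
    for (int i = 0; i < 1000; ++i)
        sanityCheck();  // call sites never change between build modes
}
```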
The Percolation pattern, basically, is design-by-contract for class hierarchies, and it's a way of enforcing compliance with something called Liskov substitutability. That simply means that a subclass must honor everything its base class promises. If you implement this with a kind of "no code left behind" discipline, these additional functions give you a runtime check on the consistency of the extensions to the class hierarchy. So you can do some pretty sophisticated things with built-in test.
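A minimal sketch of the percolation idea, with hypothetical Account classes: each subclass invariant conjoins its own condition with the base class's, so an extension can never silently weaken the base contract.

```cpp
#include <cassert>

class Account {
public:
    virtual ~Account() = default;
    virtual bool invariant() const { return balance_ >= 0; }
protected:
    long balance_ = 0;
};

class SavingsAccount : public Account {
public:
    bool invariant() const override {
        // Percolation: the subclass condition AND the base invariant must hold.
        return rate_ >= 0.0 && Account::invariant();
    }
private:
    double rate_ = 0.01;
};

int main() {
    SavingsAccount s;
    assert(s.invariant());  // runtime check on hierarchy consistency
}
```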
The issue here is, of course: is it worth it? When you put the effort into built-in test, you put it in once, and it's there and it works. So I would say that when you have the opportunity to do things like this, it's at least worth thinking about. Well-structured code: this is a subject on which a lot has been said, many well-established principles, and I won't delve into all of that. But there are several principles that turn out to be fairly significant, in
particular, for testability. Here's one: no cyclic dependencies. A cyclic dependency is where A calls B, B calls C, and C calls A; that's a cycle. Those are bad; don't go there. Why? In terms of testing, it means we basically have to take all of those parts, everything within the scope of the cycle, and test it as a unit. And then doing state set and get to bring something that participates in a cycle to a particular state may be difficult. There's an idea called levelization; John Lakos has a great book about this, which I recommend to you if you're doing C++ development. Basically, one of the takeaways from it is to not allow static dependencies to leak across functional or package boundaries. Something that performs a function at, say, one level of the stack should not reach up to another level of the stack through a static compile-time dependency and make it do something else. And this principle is kind of general but really powerful for testability: partition classes and packages to minimize interface complexity. As you're designing, deciding what goes where, what the façade is and what it looks like, and you have several alternatives, choose the one which minimizes interface complexity. All right. So, that's a lot. Maybe I'll just take a moment here and see if there are any questions at this point about what we've talked about so far. Sam. >> SAM: Bob, you haven't mentioned security.
>> BINDER: That's right, I haven't. >> SAM: And I'm curious how you increase controllability and observability without increasing attack surface from a security standpoint. >> BINDER: I think this is a tradeoff, and I don't have a good answer for that.
I think you do necessarily increase it. It's like putting in a backdoor: if somebody else, you know, the bad guys, find out about it, they'll probably break in and do something. So it's definitely a concern; that's one of the tradeoffs involved in doing
this. Yes sir. >> Hey, this is Ramesh here. So, at least in my experience with testability, these are very good points and thoughts. Most of the time, when we think that we have done quite enough improvement in testability, by the time we measure, we always realize that there is still a long way to go. I'm just wondering whether there is anything in the slides about how you, say, for example, you talk about states, right? I'm sure we would all design a test assuming that most of the states are covered, but some transitions which you would not have thought about, you later realize are not covered at all. Is there a way of finding these? Because the challenge is, we always assume that we have done a good job, and then we realize that there is something we have never even thought about. So how do we find these...? >> BINDER: Well, if I understand your question,
you're saying, you know, how can we have confidence that our test suites, our test strategies
are complete? >> Yeah. And also, are there any measurements you could suggest to us? We generally talk about code coverage, condition coverage, and some of these. And sometimes we say the code coverage is fine, but we are not able to achieve, you know, better condition coverage and things like that. So do you have anything which would add more value beyond the basics of what we are talking about, something we could explore more? >> BINDER: Well, yeah. So the question
is: what kinds of additional criteria might we consider, beyond code coverage metrics, to help us have confidence that our test suites are complete? There's no end of coverages, so, you know, if you want to go and [INDISTINCT] some new ones, be my guest. The whole point of coverage is to take a particular testing goal and say: how much of it have we done? How close have we gotten to it? And why do we have testing goals? Because we have some intuition, or suspicion at least, that the goal is related to finding bugs. Underneath every testing goal is an assumption that says: I think if I look under this particular rock, I am more likely to find bugs. So I would say it depends on the particular kind of system that you're looking at. If you want to develop more specialized criteria, you should look to that system itself and the things in it that you were uncomfortable with, or that you're not certain about, or where you said, "We'll put this thing together, we'll do the best we can; we had to punt on this one; we don't know." If there are areas where you have either demonstrated risk or a higher subjective assessment of the likelihood of a problem, I would then ask: what can we do to try to identify problems given that assumption, and then go after that. So make up your own coverage criteria. All right, I'm going to get back
the pacing a little bit here. It's a lot of stuff. One last question. Yes, sir.
>> I just have a comment on that. I think you started out with an economic definition of testing, and I think what you just alluded to is that it's all risk assessment and how much time you have. There's no magic in it.
>> BINDER: I'm sorry, say again. What's the question?
>> Oh, it's just a comment. >> BINDER: A comment.
>> Yes. >> BINDER: Okay. All right.
>> I'm just supporting your view of testability as economics...
>> BINDER: Right. If you're an engineer and you don't like talking about money because money is dirty, you can just say "tradeoffs." So you've got absolution there. Okay. Black box testability. Factors that decrease it, and this is looking at
a system from an external perspective: size, nodes, variants, and weather. Weather? What the heck is that? I'll tell you in a minute. Factors that help: the test model, the oracle, and automation. Again, I don't claim that this is a complete inventory, but these illustrate some of the things to think about. System size. Of all the interesting technical things about a system that drive testability, one that I think is not often mentioned is: how big is it? A huge system obviously is going to take more work to test. From my economic perspective on testability, if I assume I have a fixed amount of resources, other things being equal, I'm going to be able to do less testing. So the larger and more complex the system, the lower its intrinsic testability, in my view. All right, so how can we size our systems?
There are many, many metrics; choose the one that you like best. I'd use something like: how many methods, how many well-understood use cases, how many singularly invocable menu items in the command/subcommand structure. Another dimension is computational strategy. In some systems, most of what they do is visible at the boundaries; in others, most of what they do is hidden away. A transaction processing system is mostly visible at the boundaries. Compare something like simulation: I worked, for a large oil company, on a reservoir simulation system, which created huge finite element models with lots of mathematics to simulate the behavior of underground oil, gas, and water reservoirs. It just chugged away for weeks and weeks. You set maybe a hundred parameters to start the simulation going, it ran, and several weeks later you got either a report or, in some cases, a picture of what things looked like underground, or at least the best guess. Most of the work was going on in those computations. Video games are another interesting area, where the surface matters, and how you size those has its own unique dimensions. Another way of sizing a system is storage: how many tables, how many views, what are the things we put into it and look at. What about the extent
of the network? How many independent nodes do you have to get going? So that's how many
dogs do I have to line up on the stage and make them bark out, you know, Jingle Bells.
Client/server systems: simple, okay, at least two, maybe a lot more, depending on what I want to accomplish. In tiered systems, of course, we can have division of labor across different kinds of servers and computers, and maybe peer-to-peer systems. This example is from one I worked on recently; it's basically an explanation of the Microsoft implementation of two-phase commit, and it takes five computers to do two-phase commit. So in our test lab, and this is fact, not fiction, we actually had five computers that each performed those roles. If you want to get a little more formal about this, you can look to mathematical modeling: find a minimum spanning tree of the configuration, and you must have at least one of those online. If you're devising a large networked system, you have lots of nodes that have to participate, and you have allocated functionality across those nodes, what does that imply for testing? It means you're going to have to have a lab where you have at least one of each. Variants: this is another dimension
of the test problem that often isn't paid a lot of attention until, you know, several weeks before you have to ship. How many configuration options are there? A configuration option is something you usually set once: the user sets it and forgets it. But there are lots and lots of them, and there are many possible interactions, many things that can go wrong. How many platforms are supported? How many versions of Windows will this run on? Does it run on the Mac? Which flavors of Linux, et cetera, et cetera? How many localization variants do we support? How many editions for commercial competitive marketing purposes? Each of those contributes combinations, and each of those can have interaction effects. Everybody who's been in the commercial software world knows that if you don't test this stuff, you will pay dearly. Combination coverage. One way, one
strategy for picking these combinations, is to do what's called pairwise testing. That basically means trying to be sure that you test each option value with each other option value at least once. It's not a bad strategy; actually, it's very powerful in terms of finding bugs. The worst case for pairwise is basically the product of the sizes of the two largest options, so if you have options with five values each, that's at least 25 tests. There are a lot of very sophisticated pairwise selection strategies that try to compress that; without going into it, the number can be reduced with some good tools for choosing the pairs.
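A toy illustration (mine, not from the talk): three boolean options have 2^3 = 8 exhaustive combinations, but the four tests below already cover every pair of option values, and the loop verifies it.

```cpp
#include <array>
#include <cassert>
#include <iostream>

int main() {
    // Four tests that pairwise-cover three boolean options (exhaustive = 8).
    const std::array<std::array<int, 3>, 4> tests = {{
        {0, 0, 0}, {0, 1, 1}, {1, 0, 1}, {1, 1, 0},
    }};
    // Check every pair of options (i, j) and every value combination (u, v).
    for (int i = 0; i < 3; ++i)
        for (int j = i + 1; j < 3; ++j)
            for (int u = 0; u < 2; ++u)
                for (int v = 0; v < 2; ++v) {
                    bool covered = false;
                    for (const auto& t : tests)
                        covered |= (t[i] == u && t[j] == v);
                    assert(covered);  // every value pair appears in some test
                }
    std::cout << "4 tests pairwise-cover what takes 8 exhaustively\n";
}
```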
But this is another element of system testability and size. Weather. What I meant by weather is environmental stuff, the real world that your system has to operate in, where all you can do is complain about it but you can't control it. So, the example I mentioned earlier,
the cellular base station. We really had to struggle in that circumstance to find ways to adequately load the system without basically fielding it. Once it was fielded and the customer had paid for it, and each of those cellular stations was about 10 million bucks at least, the customers wanted to use them; they didn't want us to test them.
What about an expensive server farm? I think Google as an organization knows a lot about big, expensive server farms. Let's say you have one that's going to support a certain kind of demand, a certain kind of computing. Do you basically have a second one which you dedicate to testing, or do you share it somehow? Suppose there are competitor or aggressor capabilities which are part of the real world your system has to be deployed into; how easily can you replicate those in your lab? If you try to do anything in cyber warfare, this is an interesting problem. The Internet, of course: the way I think of this is that you can never step into the same river twice. In other words, it's a little bit hard to recreate circumstances when you're dependent on all the vagaries of network communication. And suppose, of course, and you may or may not have experienced this, I have a few times, that there is no test environment.
I worked at an early version of what was then called the Chicago Stock Exchange, putting in one of the early floor trading systems, and there was only one computer, basically, at the Exchange. During the day, it ran the existing stuff. We couldn't shut it down and run our apps on it, because there were literally billions of dollars riding on this machine, and if it hiccupped, you know, there was a lot of grief. By the time everything was wrapped up after the end of the trading day, it was 10 o'clock, so we got to test from about 10 p.m. to 4 a.m., and just like Cinderella's coachmen mice, we had to be out of there and have the system clean so it could be rebooted for the production run the next day. And there were actually some times when, surprisingly, the test system did something bad, and there were some very tense moments in getting that system brought up the next morning. So there are circumstances where the production [INDISTINCT], the production fielded system, must be used for test, and the kinds of things you can do are limited. You can't stress it. One of the things I wanted to do in mobile
testing early on, we went to a cellular [INDISTINCT]: look, we've got this great tool for you, we're going to put it on the air and it's going to saturate your cell tower. And the answer was, no, you aren't going to do that. So this is what I mean by environmental factors. This varies, of course, from one system to another, but it is part of the dimension of testability. So, other things being equal, a larger system is less testable. You spread
the same budget over more, and it becomes thinner. So here's a hypothetical case: 10,000 feature points, 6 network nodes, 20 options with 5 values each, and we can run tests from 9 a.m. to 3 p.m. How big is that? [PAUSE] Well, it's big.
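To put a rough number on just one slice of it (my arithmetic, not a figure from the talk), the option space alone is

$$5^{20} \approx 9.5 \times 10^{13} \text{ configurations},$$

before you multiply by feature points and node arrangements.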
I don't know if the correlation here is exact, but I did some looking. This is the M66 galaxy, and
one pixel in this picture is about 400 light years, and the distance across the solar system, by the best estimate, is about 1/400th of a light year. So it's big. But, you know, if you start to think about the number of states in a complex system and the number of combinations of things that we have to test, pretty soon you get up to astronomical numbers. So while the comparison is somewhat for dramatic effect, it's not entirely [INDISTINCT]. The other
element of black box testability is understanding. What do we know? How much do we know about a system? What's our primary source of knowledge about the system we are testing? Is it documented? Is it validated, or is it intuited or guessed, or perhaps it doesn't even exist yet? If you want a good place to start, in terms of requirements at least, there are many standards and guidelines for this. IEEE 830 is a good one. It's kind of old and fairly simple, but I still think it provides a lot of useful guidance. So how do we know what we're going to test? How do we know what our system is supposed to do, besides just guessing at it? Have you ever had
the circumstance where you go into a room with other developers, and you think you know what the system is supposed to do, and so do they? And after talking for 5, 10 minutes, you get this kind of queasy, uneasy feeling, like, "What the hell did they just say?" or something to that effect. And then about half an hour later everybody leaves the meeting and, you know, something dramatic may happen. We've known for a long time that getting a shared vision of a complex system is essentially the biggest challenge in software engineering. Get a roomful of people, 20 people, working on something extraordinarily complex and difficult, and they'll each have a picture in their mind. I can guarantee it's not the same picture. Maybe mostly the same, but not the same. Then, as testers,
we try to say, "Well, which picture is right? Which one should I believe?" The tester takes that and derives a test model from it. There are many different kinds of test models, you know, different strokes for different folks; I think having a test model of any kind is better than having none. Test models may be formal, they may be informal, et cetera, et cetera. One distinction I like to make is: are they test-ready, or are they kind of hints? A test-ready model is one which you can commit to code, or is already in code, and you can produce tests from it or evaluate tests against it. And then finally, do we have an oracle? And I don't mean the database company. An oracle is basically something
which allows us to determine whether or not a test result is as expected. In model-based testing, which is something that I like and do a lot of, it's relatively easy to produce tens of thousands, even millions, of tests automatically in a matter of minutes. The question then becomes: if I run those tests, what happens? Can I decide whether the results of running them are actually [INDISTINCT]? If I can't, the tests are not very meaningful.
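One computable-oracle pattern, sketched with a hypothetical integer square root (illustrative, not from the talk): check a fast unit under test against a slow but obviously correct reference.

```cpp
#include <cassert>
#include <cstdint>

// Unit under test: a fast bitwise integer square root.
uint32_t isqrtFast(uint32_t x) {
    uint32_t r = 0, bit = 1u << 30;
    while (bit > x) bit >>= 2;
    while (bit) {
        if (x >= r + bit) { x -= r + bit; r = (r >> 1) + bit; }
        else r >>= 1;
        bit >>= 2;
    }
    return r;
}

// Computable oracle: slow but obviously correct reference implementation.
uint32_t isqrtSlow(uint32_t x) {
    uint32_t r = 0;
    while (static_cast<uint64_t>(r + 1) * (r + 1) <= x) ++r;
    return r;
}

int main() {
    for (uint32_t x = 0; x < 100000; ++x)
        assert(isqrtFast(x) == isqrtSlow(x));  // the oracle decides pass/fail
}
```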
So do we have an oracle? Is it computable, or is it judgment? Sometimes judgment is the best we can do, and in some circumstances it's a good strategy to have a person interact with the system and then judge whether or not the system makes sense. Finally, let
me say a few things about test automation. I'm a big believer in automation. I know there are other people in the testing world who are big believers in people. I believe in people, too, but computers are better than people at certain kinds of testing tasks. In particular, in bigger systems we need more tests. Automation properly used, and of course it can be misused, gives us intellectual leverage. It allows us to extend our vision and understanding across a very large and complex space. It's repeatable. We can scale a functional test into a load test, and many other kinds of things. There are lots and lots of different kinds of automation; I'm sure you'll hear about different strategies at this conference today and tomorrow. I mention just a few here; this is far from an exhaustive list. Model-based testing, again an area
of particular interest to me, does two things: it generates tests, and good model-based testing systems also choose their models carefully so that the models can serve the purpose of evaluation as well. Finally, why does test automation matter? This is a kind of notional slide. I don't
claim that this has any deep research behind it, but it's kind of the way that I look at
the world. Look at effectiveness as the ability to create a system which is reliable, and suppose we categorize this according to reliability or availability statistics: 5 nines versus 1 nine. Five nines basically means a system which has about five minutes of downtime a year; 1 nine means a system that is down roughly a couple of hours a day.
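The arithmetic behind those figures (my calculation):

$$(1 - 0.99999) \times 365 \times 24 \times 60 \approx 5.3\ \text{minutes/year}; \qquad (1 - 0.9) \times 24 \approx 2.4\ \text{hours/day}.$$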
If we look at the other factor, efficiency: if I can produce a system that is 5 nines versus one that is 1 nine with the same testing strategy, I would say my testing strategy is more effective if I can achieve higher reliability at the same cost. Productivity is kind of the total cost per test: how many tests per hour, or per dollar, or whatever your unit of measure is, can I do? My experience of manual testing was in the region where we can get probably 2 nines, a system that would run for several weeks without burping seriously, and we get on average about one test an hour. And I take into this my experience of the total cost of testing: not just the first time you write the test, but when you come back to it later and have to maintain it, fix it, or throw it out and start over. So it's kind of the total cost of that, along with everything else; if you take all the inputs, economic and otherwise, that's what I'm talking about here. Of course, any reasonable tester can do more than one test an hour; I'm saying that on average, that's about where it lands. Scripting, both
of the GUI capture/replay kind as well as unit testing with the various test frameworks, gets us about an order of magnitude better productivity, and in my experience this helps us find, other things being equal, about another notch up in bugs. My own experience in creating model-based testing systems for specialized purposes puts this higher still. I claim, and I do have some data to back this up, that we're able to achieve two orders of magnitude better productivity in terms of the number of tests generated per unit of economic cost in producing them. And at the same time, the tests were much more intensive and broad, and reached into parts of the system that we could not have covered, or even imagined, doing simply manual or, let's say, traditional kinds of testing. I worked for the last several years on a model-based testing vision which, unfortunately, is incomplete; it had a lot of kind of hokey limitations that I accepted because customers wanted them, so, you know, I was on their nickel and didn't do all the things that I wanted. But I believe that model-based testing, properly understood, can get us to what might seem to be kind of fantastic levels of efficiency as well as effectiveness. So
the kind of test automation that you have is a factor in your testability. The takeaway from this chart, for me, is that system scale, complexity, and difficulty of testing are getting bigger, not smaller, right? Our systems aren't getting smaller; we have an expanding universe. If we stick with strategies at the low end, basically you're going to run into more problems than you want: you're going to run out of resources, or you're going to produce systems that are unnecessarily or unacceptably buggy. You won't be able to keep up. Let's talk a little bit about strategy. So what's the
bottom line? How do we improve testability? Well, for white box testing, you've seen that there are several things we can try to improve: built-in test, state helpers, PCOs. We'd like to maximize those and minimize the corresponding blockers of testability. With black box, the same thing: we'd like to maximize our ability to produce meaningful models, evaluate the results, and have a harness that helps us do that, and we'd like to minimize all that other stuff. Okay, so that's not very profound. The thing that's of interest
to me is: who owns these factors in most organizations? Now, this may not be true in yours, and if it's not, you should consider yourself lucky. But in most organizations that I've worked with, the fact is that testers don't typically control or own the work that drives testability. The things that drive testability are basically dictated or handed to them by the architects, the people managing the system, and the developers. Testers are basically working on the test parts, and the rest is determined by someone else. So the things that set the bounds on how effective you can be are often outside your control. This is kind of a whole process and organizational issue, a whole other subject I'm not going to attempt to discuss. But it's something I think you may want to reflect on, and if your circumstance is not like this, again, I say, consider yourself lucky. So here's
a strategy box, finally, at the end. Suppose we're in a circumstance where we have high black box testability and low white box testability. My argument is we should emphasize functional testing, the black box approach, because we can't really do much on the white box side. The implementation might be a legacy system which is just hard to test; what should we do? "Okay, let's not kill ourselves. Let's go for the low-hanging fruit," pardon the expression. Emphasize functional testing when that is the thing you can do, because if you take the alternative strategy and try to dig into the code, yeah, it's a losing battle; you can burn a lot of cycles, a lot of time and money, and not get the reward. Symmetrically, the other thing is true when you have high white box testability: you have a system that is cooperative and well structured, but you might not know much about its behavior. For example, the system is relatively new; something has just been through development and beta test for the first time. You might want to emphasize implementation-specific aspects of it. This invites the question the gentleman asked there: what kinds of things might we know, what should we pay more attention to? It depends on what you expect to go wrong. If you're in a circumstance where you have low testability on both counts, I think your best attack is to learn how to manage expectations. And finally, if you are in a circumstance where you have produced a system which has achieved high testability on both the implementation and representation sides, I think you should be trying to figure out how to do it again. Because the news here is that you've done a tremendously good job, something that's unusual, unique, and you're going to want to figure out what the magic was that made it happen and put it in a bottle. Okay, so that basically
is my story and I'll take questions. I think we might have a little bit of time. I don't
want to run over too much. Yes, sir. >> Hello. My name is Rizuan. You mentioned something about testing in the production environment. How do you manage the strategy for testability in terms of performance, in terms of load and stress testing? Do you do some, like, capacity planning or estimation? Do you do a prediction based on your knowledge of, you know, white box testing and black box testing? How do you do that in advance? >> BINDER: That's a kind of general question, and it's sort of hard to answer without knowing the specifics. I think you just have to look at the tradeoffs and decide what makes sense and what's doable there. So it's negotiable. Other
questions? All right, so I have a question for you. When I played the Messier M66 galaxy slide, I actually had an internal debate about what kind of music to play along with it. What I chose is an orchestral piece with some very, very dramatic brass in it; I thought the other thing that might work was Jimi Hendrix's Purple Haze. So I don't know, would you have preferred to hear Purple Haze this morning? I don't know. Okay. Well, another question over here. >> You see, one thing that you didn't mention
is the test data. A lot of times we are doing performance testing and load testing. Here I am. >> BINDER: Oh, okay. >> Yeah. So, I was going through a lot of the open questions you were posing in the talk, and one thing I thought was an extra element is the test data, especially when we are doing performance, load, and stress tests, when we actually load the databases, when we are required to do that with about 40% of the production data. So my question is: don't you think test data is a challenge when we're actually handling, you know, testability, in the kind of time it takes for us to set the stage?
>> BINDER: Yeah, it certainly is, and that's a very good point. How do you populate and instantiate that setup? I didn't get into that, but in a data-intensive system which has a large database, getting it just to some initial usable state which is consistent can be quite a lot of work. So it's something that's worthwhile paying attention to, and I think automating as well. With model-based testing, if you have a model from which you can assume certain scenarios and then generate a database snapshot that corresponds to those scenarios, unload the database, and reload it with that scenario, I've found that to be a very useful kind of tool to have in the situation you described. Other questions? Okay. Well, thank you very much for your attention this morning.