Post-release testing and measurement
Purpose
It must be understood that all measurement involves sampling.
A sample is a small subset of a large and usually indefinable
population. During development, the user sample sizes available
to most design teams are extremely small: usually on the order
of 3 to 10 users. Similarly, testing procedures will usually
focus on a small number of user tasks -- presumably those
adjudged to be the most critical ones. Typically, a testing
session will last up to 60 minutes and test the execution
of 4 to 5 tasks, whereas the product will typically support
a large and possibly indefinable number of tasks. Thus task
sample sizes will also be extremely small.
Compared to the projected sale and use of a product, these
are extremely risky samples to base major decisions on. Even
for a measure whose standard deviation (spread) is as low as
4.00, fig. 1 shows that with small sample sizes the 95%
confidence interval (the range within which, on the basis of
our sample mean, we are 95% sure the true population mean lies)
is extraordinarily wide. It only becomes reasonable when we have
measured approx. 40 users or investigated usability over
approx. 40 tasks.

Fig 1: 95% confidence interval of the mean with various small
sample sizes.
The particular difficulty is that we do not know where
within the 95% confidence interval the true (population) mean
actually lies.
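To get a feel for the numbers, the width of the interval can be estimated with a few lines of Python. This is only a sketch: it assumes the normal approximation (z of about 1.96 for 95%), which may not be the calculation behind fig. 1 and which understates the interval for very small samples, where a t-based interval would be wider still. The function name `ci_half_width` is made up for this illustration.

```python
from math import sqrt
from statistics import NormalDist


def ci_half_width(sd: float, n: int, confidence: float = 0.95) -> float:
    """Half-width of the confidence interval of a mean.

    Assumption: normal approximation, i.e. z * sd / sqrt(n).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return z * sd / sqrt(n)


SD = 4.0  # the standard deviation used in the text above

# Sample sizes chosen for illustration only.
for n in (3, 5, 10, 20, 40):
    print(f"n = {n:>2}: sample mean +/- {ci_half_width(SD, n):.2f}")
```

Because the half-width shrinks only with the square root of the sample size, going from 10 to 40 users halves the interval rather than quartering it.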
The situation is not as bad as it appears in the graph, because
usually, testing during development attempts to diagnose as
well as measure, and user data is usually qualitative rather
than quantitative, but still, an inappropriate choice of user
testers and tasks, with small samples of both, may give either
extremely optimistic results or extremely pessimistic ones
- and we have no way of telling.
However, there is usually no way to test with large samples
until the product is released.
The development of applications for the web may prove to
be an exception: a web site may be launched as a Beta version
to a limited but usually sizeable sample of users, and
testing can be carried out on that basis, to achieve a well-tested
final version. However, this is only possible if the measures
used in testing are themselves reliable, which is not something
one can take for granted.
Thus there is a considerable need to continue testing and
measurement after the product has been released, to gain an
increasingly accurate picture of how the product is performing
in relation to the usability goals set for it. Data from such
an activity can be used by the organisation developing the
product in a number of ways. It can:
- give advance warning of where user support will be most
needed;
- indicate what are the good sales points of the product;
- prioritise bug fixes and improvements;
- be used as benchmarks for future releases;
- feed into the requirements spec of the next version.
Selecting the right method: discussion of principles
A frequently asked question is: what are the right methods
to use for evaluation? Which methods should an organisation
acquire first?
It is extremely important to note that unless an organisation
has mature quality processes (at least level 2 or 3 of the
CMM hierarchy, for instance), usability testing will do very
little for the organisation as a whole. It may still demonstrate
the gains to be made in small, localised projects, which can
become an incentive to raise the organisation’s general CMM level.
In fact, unless carefully handled, usability testing in an
immature organisation may actually demonstrate that usability
testing is neither cost effective nor useful, thus setting
the entire organisation back in its attempts to develop better
quality standards.
The ISO 9241-11 definition of usability is well worth remembering:
The effectiveness, efficiency and satisfaction with which
a well defined sample of users carry out a fixed set of tasks
in a particular environment with a particular release of the
software.
Four conclusions can be drawn from this definition, leading
to an acquisition strategy of four main classes of usability
methods:
- Understand the users, tasks, and environment: the Context
of Use analysis method;
- Effectiveness: find ways of measuring how effective the
users are in carrying out the set tasks;
- Efficiency: measure user efficiency (performance testing),
and also cognitive workload to understand the cost of efficient
performance;
- Satisfaction: acquire a user satisfaction questionnaire
and keep records of data obtained with it.
Follow the entries in the methods table for more information.
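As a rough illustration of the effectiveness and efficiency classes, the sketch below derives two simple figures from test session records. It assumes effectiveness is taken as the proportion of task attempts completed, and efficiency as the mean time on task over completed attempts; these are common choices, not the only ones. The `TaskAttempt` record, the function names, and the sample data are all invented for this example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class TaskAttempt:
    """One user's attempt at one task (hypothetical record layout)."""
    user: str
    task: str
    completed: bool
    seconds: float


def completion_rate(attempts: List[TaskAttempt]) -> float:
    """Effectiveness, taken here as the share of attempts completed."""
    if not attempts:
        raise ValueError("no attempts recorded")
    return sum(a.completed for a in attempts) / len(attempts)


def mean_time_on_task(attempts: List[TaskAttempt]) -> Optional[float]:
    """Efficiency, taken here as mean seconds over completed attempts."""
    times = [a.seconds for a in attempts if a.completed]
    return mean(times) if times else None


# Invented data, for illustration only.
attempts = [
    TaskAttempt("u1", "create invoice", True, 95.0),
    TaskAttempt("u2", "create invoice", True, 120.0),
    TaskAttempt("u3", "create invoice", False, 300.0),
    TaskAttempt("u4", "create invoice", True, 85.0),
]

print(f"effectiveness: {completion_rate(attempts):.0%} of attempts completed")
print(f"efficiency: {mean_time_on_task(attempts):.0f} s mean time on task")
```

Keeping the raw attempts rather than only the summary figures makes it possible to recompute the measures per task or per user group later.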
Selecting the right method: a quick approach for newbies
Here is a kit of measurement methods that is designed to
get the novice practitioner off to a quick start. The emphasis
is on ‘guerrilla methods’: that is, methods which can be used
on their own to make a point. Later on, you’ll want to connect
the methods together to show there’s an underlying thread
behind all this usability work and to use the more complex
methods which yield more data.
- Context of Use: use ‘Context of Use analysis’;
- Effectiveness: use ‘participatory evaluation’;
- Efficiency: use ‘Time on Task’;
- Satisfaction: use SUS, or SUMI if you can afford it (see
‘subjective assessment’).
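For SUS, the standard scoring rule is easy to automate: each of the ten items is answered on a 1 to 5 scale, odd-numbered items contribute (response - 1), even-numbered items contribute (5 - response), and the sum is multiplied by 2.5 to give a score from 0 to 100. The function below is a minimal sketch of that rule; the name `sus_score` and the input format (a list of ten integers in questionnaire order) are assumptions made for this example.

```python
from typing import Sequence


def sus_score(responses: Sequence[int]) -> float:
    """Score one completed SUS questionnaire on the 0-100 scale.

    `responses` holds the ten answers in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(responses) != 10:
        raise ValueError("SUS has exactly 10 items")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")

    total = 0
    for item_number, r in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += r - 1   # odd items are positively worded
        else:
            total += 5 - r   # even items are negatively worded
    return total * 2.5


# Invented responses from one user, for illustration only.
print(sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1]))
```

Score each questionnaire separately and then summarise the scores across users; averaging the raw item responses first hides the spread between users.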
Reporting and documentation
No amount of testing and measurement is of any use if you cannot
make an impact on your organisation with your results. Equally,
if you don’t keep records, you won’t improve your processes.
Reporting is therefore useful for two reasons:
- for yourself, to develop your own casebook;
- for others in your organisation, so they get to hear about
your work.
The fundamental principle in any method of reporting is the
5-minute rule: the most important people you hope to influence
will not spend more than five minutes reading your report.
This means you have to get your message across within the
first few pages. Both the schemes mentioned below allow you
to present a report following this principle.
The second principle is the principle of traceability. You
must lay down an audit trail in your report so that you, your
reader, and your reviewer can always see how you got from
your findings to your conclusions. This is harder than in science,
where your paper will come under scrutiny by expert peer reviewers
and will therefore, if published, receive their imprimatur.
Your reports will be issued by you under your own or your manager’s
authority. A sceptical reviewer must be able to follow your
trail.
There are two standards you can
follow.