Newsletter Mar 2005 – Louise Pryor

News update 2005-03: March 2005
===================

A monthly newsletter on risk management in financial services,
operational risk and user-developed software from Louise Pryor
(http://www.louisepryor.com).

Comments and feedback to news-admin@louisepryor.com. Please tell me if
you don’t want to be quoted.

Subscribe by sending an email to news-subscribe@louisepryor.com.
Unsubscribe by sending an email to news-unsubscribe@louisepryor.com.
Newsletter archived at http://www.louisepryor.com/newsArchive.do.

In this issue:
1. There’s a little man inside the box…
2. Somebody else’s problem
3. FSA update
4. Count on failure
5. Newsletter information

===============
1. There’s a little man inside the box…

The clocks changed in the UK at the weekend, as they do twice a
year. So you’d think that computer systems would be able to cope,
and that there would be no major disruption. And, on the whole,
you’d be right, though you wouldn’t necessarily know it from the
press coverage.

About 1,500 Barclays ATMs (out of a total of about 4,000) were out
of action for over 12 hours on Sunday. We were told that a manager
put the clocks back rather than forward, and that this mistake had
caused the problems. The Daily Telegraph carried a leader opining
on the lessons that Barclays could learn from its employee’s
blunder.

But hang on a minute: A real live person, changing the clocks in
the data centre at 01:00 on Sunday morning? It just doesn’t make
sense. Why on earth wouldn’t the time change be automated? After
all, it is in just about every other computer in the world. Did you
have to change the time on your PC this weekend?

And in fact, Barclays say that it was a hardware fault, and not
related to the time change at all. This is much more plausible, and
is what I heard a Barclays person say on the radio. But if it’s
true, where did the story of the error-prone manager come from? The
Telegraph said that they had it from customer services staff.

I imagine it happened something like this: The ATMs go down. (And,
it appears, the online banking too.) Calls pile into the call
centre. Nobody at the call centre knows what the problem is. (And
why should they know? They are not omniscient, and these things
often take time to track down.) They are talking to each other
about what is going on. Someone says that it must be something to
do with the clocks changing, as that’s something that doesn’t
happen every day. And someone else says “Yeah, I bet that’s
it. Some stupid person changed them in the wrong direction!” And
before you know where you are, an off-the-cuff remark (probably
made in jest) has spread around the call centre and become the
official version.

People are very unwilling to believe in coincidences. They also
have mental models of how things work. And surprisingly often,
those mental models boil down to a little man in the box (or, in
this case, in the data centre). So when they were told that the
problem arose because a person made a mistake, they didn’t stop to
think about whether the story really made sense.

http://news.zdnet.co.uk/hardware/0,39020351,39193138,00.htm
http://makeashorterlink.com/?M170229CA
http://www.forbes.com/facesinthenews/2005/03/28/0328autofacescan05.html
http://edition.cnn.com/2005/BUSINESS/03/28/barclays.machines/

===============
2. Somebody else’s problem

There have been a number of stories about outsourcing and its
problems recently, though they are rarely expressed as such. To
mammothly over-simplify, the trouble with outsourcing is that you
lose control; the benefit is that you offload the problems onto
someone else. The risk is that there are gaps: the outsourcer
doesn’t deal with the problems, and you no longer can.

Computer security is a prime candidate for outsourcing. Specialists
can do a much better job of keeping up with all the latest threats
and how to deal with them. But a number of organisations recently
lost whole tranches of email messages because there was a bug in a
system that an outsourcer used for email scanning. When the update
mechanism tried to install the updates on the customer networks,
the system started to delete all emails by default. Oops! At least
one customer claimed that someone at the outsourcer said that the
update hadn’t been tested, but this was denied.

http://news.zdnet.co.uk/internet/security/0,39020375,39189933,00.htm

Internet hosting is also outsourced. There are comparatively few
organisations that have the expertise or funds to run a full data
centre themselves; it’s a field in which there are very definitely
economies of scale. Interestingly, there are often two or three
layers of outsourcing: the end customer uses an ISP, who in turn
uses one of the big data centres (or may bulk buy from another ISP,
who uses…). So if anything goes wrong in one of the big data
centres, the effects are widely felt.

Which is just what happened recently. A routine test was being
carried out when a fault developed in a switchgear panel (whatever
that is). This caused a short circuit in the UPS (uninterruptible
power supply) modules, so everything moved to battery power. The
fire alarms also went off (I can’t make out whether this was
connected, or just a coincidence), the building was evacuated and
everyone stood around outside while the batteries ran down.
Customers suffered hours of downtime, and a number of them had
equipment destroyed by a power surge that occurred at some point
during the episode.

One of the problems here for the end customer is that they may not
even know where the chain ends for them, and so have next to no
chance of really being able to manage the risks. I believe that the
physical bits and bytes that make up my web sites currently live in
Calgary, for example, but when I chose my hosting company I was at
least as interested in the software they supported as the historic
uptimes. And I didn’t do any work on finding out whether I expected
future performance to reflect historic, or whether there were
special factors that should have caused me to be wary. (In fact, I
haven’t had any trouble since my last move 18 months ago.)

http://makeashorterlink.com/?G3B0219CA
http://news.zdnet.co.uk/business/0,39020645,39190518,00.htm

A recent Gartner survey has pointed out another problem with
outsourcing: it can raise costs. Apparently outsourced customer
service operations can cost almost a third more than those retained
in-house.

http://makeashorterlink.com/?H221249CA

According to Jamie Oliver, this applies to school meals, too.

===============
3. FSA update

The FSA’s new web site appears to have outgrown some of its
teething problems. Many of the old links now work again, which
makes life easier.

New issues of both the General Insurance and Life Insurance
newsletters have appeared. Both of them contain information on the
FSA’s current thinking on various aspects of the ICAS process,
including confidence levels and time horizons.

http://www.fsa.gov.uk/pubs/other/gi_newsletter5.pdf
http://www.fsa.gov.uk/pubs/other/li_newsletter3.pdf

New consultation and discussion papers out this month:
------------------------------------------------------

CP05/4 FSMA 2 Year Review: Financial Ombudsman Service

DP05/1 Integrated Regulatory Reporting (IRR) for: Deposit
takers, principal position takers, and other investment
firms subject to the Capital Requirements Directive

Feedback published this month:
------------------------------

PS05/3 Implementation of the Market Abuse Directive

A list of current consultations is available at
http://www.fsa.gov.uk/Pages/Library/Policy/CP/current/index.shtml

===============
4. Count on failure

One of the reasons for Google’s success is that the folk there
count on bad things happening. It’s well known that they use
large numbers of cheap machines for the heavy computations that are
involved in indexing so many web pages, instead of buying expensive
supercomputers. A normal PC might fail once in three years (that
seems a bit optimistic to me), so if you have thousands of them you
can expect on the order of one failure a day. So they assume that
failures will happen, and develop systems to handle them.

http://makeashorterlink.com/?V2A1149CA
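
The arithmetic behind that estimate is easy to check. Here is a rough
sketch in Python; the one-failure-in-three-years figure comes from the
paragraph above, but the fleet sizes are my own illustrative
assumptions, not Google's actual numbers.

```python
# Rough sketch: average machine failures per day across a fleet.
# The three-year failure interval is the figure quoted in the text;
# the fleet sizes below are hypothetical, chosen only for illustration.
DAYS_PER_YEAR = 365


def expected_failures_per_day(machines: int,
                              years_between_failures: float = 3.0) -> float:
    """Average number of failures per day, assuming each machine
    fails once every `years_between_failures` years on average."""
    return machines / (years_between_failures * DAYS_PER_YEAR)


if __name__ == "__main__":
    for fleet in (1_000, 5_000, 10_000):
        rate = expected_failures_per_day(fleet)
        print(f"{fleet:>6} machines: about {rate:.1f} failures a day")
```

With a thousand machines the average is already a little under one
failure a day, and it scales in proportion to the size of the fleet.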

It’s obvious when you put it like that: clearly you should allow
for anything that happens as often as once a day. But where do you
draw the line? And how can you tell how often failure is likely to
occur? Consider spreadsheets, for example (I had to get there
eventually…) How often are you likely to get something going
wrong in a spreadsheet? An optimistic estimate is that about 1% of
unique formulae will have errors in them. A spreadsheet only has to
have about 69 unique formulae to be more likely than not to contain
an error. And that’s not a particularly large spreadsheet. So what
do you do about it? Testing, reviewing, good development
processes… If you want to know more, do get in touch!
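
If you want to check the figure of 69, here is a minimal sketch in
Python. It assumes that each unique formula goes wrong independently
with the same 1% probability, which is a simplification of my own; the
function names are just for illustration.

```python
import math


def prob_at_least_one_error(n_formulae: int, error_rate: float = 0.01) -> float:
    """Chance that a spreadsheet with n unique formulae contains at
    least one error, assuming errors are independent (a simplification)."""
    return 1.0 - (1.0 - error_rate) ** n_formulae


def formulae_for_even_odds(error_rate: float = 0.01) -> int:
    """Smallest number of unique formulae for which an error is more
    likely than not: solve 1 - (1 - p)^n > 0.5 for n."""
    return math.ceil(math.log(0.5) / math.log(1.0 - error_rate))


if __name__ == "__main__":
    n = formulae_for_even_odds()
    print(n, round(prob_at_least_one_error(n), 4))
```

The break-even point moves quickly with the error rate: under the same
assumptions, a less optimistic rate means far fewer formulae are
needed before an error becomes more likely than not.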

I discussed banks, robberies and phishing in the last issue. This
month it came to light that key logging software was used in an
attempt to steal £220 million from a Japanese bank in the
City. Arrests have been made. The bank’s security worked well, in
that it was internal security officers who first spotted the
attempt.

http://www.timesonline.co.uk/article/0,,2-1529429,00.html

===============
5. Newsletter information

This newsletter is issued approximately monthly by Louise Pryor
(http://www.louisepryor.com). Copyright (c) Louise Pryor 2005. All
rights reserved. You may distribute it in whole or in part as long
as this notice is included. To subscribe, email
news-subscribe@louisepryor.com. To unsubscribe, email
news-unsubscribe@louisepryor.com. All comments, feedback and other
queries to news-admin@louisepryor.com. Archives at
http://www.louisepryor.com/newsArchive.do.