Any work we do on a computer needs to be kept safe.
There are different causes of danger to computer systems and the data they hold.
Basically there are three main sources of trouble: accidents, mistakes (or errors), and malevolence (or malice). We can assume that people will try to help minimise the effects of the first two; with the last, at least one person is deliberately trying to be unhelpful.
Accidents come in two types: those that happen outside the computer and those that happen inside it.
Typically the accidents which happen outside the computer are known as disasters - fires, earthquakes, power failures, water damage.
Preventing disasters is not currently within our power; however, certain precautions can be taken to minimise their effects.
Since a disaster can destroy the physical computer system, the only way to save anything in this situation is to make sure that regular (automatic) backups are made of all files, and that the backups are stored somewhere completely separate from the computer centre.
Equipment failure is more common than we would like. All it takes is one bad experience and we never take reliability for granted again (or at least not until a sizable chunk of failure-free time has passed). The failure might be a disk head crash, a processor overheating, etc.
Whatever the cause of the failure, we can either try to mask the error or try to minimise the damage done.
Fault tolerant systems try to mask or hide the error from the users. If part of the computer breaks or is acting erratically the system will detect this and switch access to another device or processor.
Finding the fault is one problem. Some systems include microdiagnostic code in their microcode to verify the operation of instructions. We will look at other methods below.
Multiprocessors are frequently designed to be fault tolerant: even if a processor malfunctions, the other processors can take the load. To make this work, there have to be multiple copies of all critical data. Idle processor time can be spent looking for problems.
Multiprocessor systems should be designed to keep running (even though not at full capacity) in the case of a processor failure. We use the term graceful degradation to mean that the users do not experience a sudden change in behaviour.
Some "reliable" systems have levels of redundancy built in to them. The system never works at full capacity, processor power and disk and memory storage are duplicated just waiting for those times when something goes wrong.
Such a system must know which units are currently active, and when units come back online after being repaired they must be brought back to being full copies as soon as possible.
I said there were other ways of detecting errors. In data communications we commonly use parity bits and checksums, so our software can tell if the data was scrambled. If we add a bit more redundancy to our data we can include error correcting codes.
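As a small illustration of the parity and checksum ideas, here is a sketch in Python (the functions are purely illustrative; real communication hardware and protocols do this at a much lower level):

    # Illustrative only: an even-parity bit and a trivial checksum.
    def even_parity_bit(byte):
        """The extra bit needed so the total number of 1s is even."""
        return bin(byte).count("1") % 2

    def simple_checksum(data):
        """A very simple checksum: the sum of all bytes, modulo 256."""
        return sum(data) % 256

    message = b"HELLO"
    parities = [even_parity_bit(b) for b in message]   # could be sent alongside the data
    sent = (message, simple_checksum(message))

    # The receiver recomputes the checksum and compares.
    received_data, received_sum = sent
    if simple_checksum(received_data) != received_sum:
        print("Data was scrambled in transit")
    else:
        print("Checksum matches:", received_sum)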
You have probably heard of RAID systems (Redundant Arrays of Inexpensive Disks), in which the same information may be stored several times. This approach originally came from disk striping, where sub-blocks of a file block are spread over several disks so that all the disks can be accessed in parallel, speeding up access to the data.
RAID systems work in different ways: some simply mirror the data onto a second disk, while others stripe the data across several disks together with parity information, which allows the contents of a failed disk to be reconstructed.
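To give a feel for how parity lets the array survive a lost disk, here is a small sketch using XOR parity (roughly the idea behind striping-with-parity schemes; the in-memory byte strings simply stand in for disks):

    # Striping with XOR parity, sketched with byte strings standing in for disks.
    def xor_blocks(blocks):
        """XOR corresponding bytes of several equal-sized blocks together."""
        result = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                result[i] ^= b
        return bytes(result)

    data_disks = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe across three data disks
    parity_disk = xor_blocks(data_disks)       # the parity block for this stripe

    # If disk 1 fails, its block can be rebuilt from the survivors plus parity.
    rebuilt = xor_blocks([data_disks[0], data_disks[2], parity_disk])
    assert rebuilt == data_disks[1]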
Another approach is majority polling. If we do have several devices maintaining the same data we can use that fact to detect faults. Usually there will be three devices with the same data and the data is taken from all devices and compared. If there is a discrepancy (hopefully two against one) the data from the majority is taken as correct. If a device consistently seems to be holding incorrect data it needs to be fixed.
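A sketch of the majority polling idea, with three in-memory values standing in for the three devices:

    # Majority polling over three replicas of the same data.
    from collections import Counter

    def majority_poll(replicas):
        """Return the majority value and the indices of dissenting devices."""
        value, votes = Counter(replicas).most_common(1)[0]
        dissenters = [i for i, v in enumerate(replicas) if v != value]
        return value, dissenters

    value, dissenters = majority_poll([b"balance=100", b"balance=100", b"balance=1OO"])
    print("Accepted value:", value)
    if dissenters:                        # hopefully two against one
        print("Devices needing attention:", dissenters)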
Special data which needs to be kept safe, or data which we don't currently need but don't want to lose, can be archived. This means taking a copy of the data and storing it somewhere else. This can be done by the user, however if the system has access to another system for archival storage, a proper archiving package controlled by someone employed to do the job is better.
If we don't try to mask breakdowns from the users we need to do something to minimise the pain. The usual way to do this is by backing up the file system.
Even though we may want to allow the user the luxury of backing up their data when they want to, it is far safer to have the backups taken automatically.
There are two types of backup: full and incremental.
A full backup copies the complete file system (possibly excluding temporary files). This is time consuming and takes up a lot of space. However it has the advantage of being easy to restore the system to the state it was in.
Incremental backups only save copies of the files which have changed since the last backup. This is usually only a small proportion of all files in the system, so it is much faster to do and each backup occupies far less space. The disadvantage is that it is a much more complicated process to restore the system from an incremental backup.
What should be done with files which have been deleted under an incremental backup regime?
A combination of the two is commonly used. A full backup may be made weekly, and an incremental backup daily.
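As a rough sketch, an incremental backup can be as simple as copying the files modified since the last run (the directory names below are just examples, and a real backup tool would do far more):

    # Copy only the files that have changed since the last backup ran.
    import os, shutil, time

    SOURCE = "/home/users"            # example: the file system to protect
    BACKUP = "/backup/incremental"    # example: the backup area
    STAMP = os.path.join(BACKUP, "last_backup_time")

    last_backup = os.path.getmtime(STAMP) if os.path.exists(STAMP) else 0

    for dirpath, dirnames, filenames in os.walk(SOURCE):
        for name in filenames:
            src = os.path.join(dirpath, name)
            if os.path.getmtime(src) > last_backup:          # changed since last time
                dest = os.path.join(BACKUP, os.path.relpath(src, SOURCE))
                os.makedirs(os.path.dirname(dest), exist_ok=True)
                shutil.copy2(src, dest)

    # Record when this backup was taken, ready for the next run.
    os.makedirs(BACKUP, exist_ok=True)
    with open(STAMP, "w") as f:
        f.write(str(time.time()))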
When it comes to the work actually being done at the moment a fault occurs, many systems don't even try to protect it. However some systems have to, e.g. transaction handling systems.
These systems have write-ahead logs, whereby all changes to the system are stored in advance, including the old and new data values. After the change has succeeded, a commit record is written to the log. This way if the system goes down, it can always be brought back to the state it was in by two operations, redo and undo.
Any transaction which was committed has redo applied to it. Redo is written in such a way that applying the same operation more than once does not produce incorrect effects (idempotent). Any transaction which didn't commit is undone. Now the system is back where it was.
These logs are huge, and restoring the system from the time it was booted would be an enormous job if it wasn't for checkpoints. At a checkpoint the state of the system is stored completely, so we only have to redo and undo changes made since the last checkpoint.
In this case we have a situation analogous to file backups. We must take regular snapshots of currently running processes, say after every few important changes. Checkpoints are analogous to full backups; the write-ahead logs are analogous to incremental backups.
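A toy sketch of recovery from a write-ahead log may make the redo/undo idea concrete. The log format here is invented for illustration; real transaction systems are considerably more involved:

    # State as restored from the last checkpoint.
    database = {"x": 0, "y": 0}

    # Log records since the checkpoint: (transaction, item, old value, new value),
    # plus ("commit", transaction) records for transactions that completed.
    log = [
        ("T1", "x", 0, 5),
        ("commit", "T1"),
        ("T2", "y", 0, 9),        # T2 never committed before the crash
    ]

    committed = {rec[1] for rec in log if rec[0] == "commit"}

    # Redo committed changes; writing the new value twice is harmless (idempotent).
    for rec in log:
        if rec[0] != "commit" and rec[0] in committed:
            txn, item, old, new = rec
            database[item] = new

    # Undo, in reverse order, changes made by transactions that did not commit.
    for rec in reversed(log):
        if rec[0] != "commit" and rec[0] not in committed:
            txn, item, old, new = rec
            database[item] = old

    print(database)               # back to a consistent state: {'x': 5, 'y': 0}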
The next category of problems we are going to consider is mistakes or errors. These are similar to accidents, and some can have exactly the same results, for example someone inserting a tape the wrong way and destroying the data.
However we are going to concentrate on errors associated with the user and any programs they are using.
In this case we can try to implement schemes to minimise the chance of errors occurring, thus providing a fence at the top of the cliff rather than an ambulance at the bottom.
And not only the programs which you are running: what about programs that someone else is running?
How do we get safer programs?
A lot of work has gone into making programs safer. In some situations it is absolutely necessary that we have safe programs, e.g. air traffic control, flying planes, control of dangerous manufacturing techniques, nuclear power stations, computer controlled x-ray equipment.
How we set up and organise our programming teams has an effect on their products. Small teams of programmers working on program modules with well defined specifications and interfaces give the best results.
At the least we want to use structured programming techniques, even though these days it is more likely to be OO design.
The choice of programming language can make a substantial difference. Some languages deal far better with specific types of problem. Similarly some languages (e.g. functional languages) can be far easier to check for correctness.
If the program is important enough, or if we have enough time, we can try to prove our program correct. We still seem to be quite some way off being able to do this for realistic programs in realistic languages.
Given that we can't prove our programs correct, we can at least test them thoroughly. There are different methods of testing: black box testing, code read-throughs, special case testing, etc.
You may think that since we have memory protection and processor privilege levels, programs can't really damage anything they shouldn't. Unfortunately this is not true. A program may have complete permission to do something stupid. Even more unfortunate is when a program manages to crash the system. But surely that is impossible?
A study was done on common UNIX commands: to check them for bugs they were fed random input. Under this scheme many of the programs died (I think it was about 25%), some even taking the system down with them.
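You can get a feel for this kind of test with a few lines of script that pump random bytes into a command and watch for crashes (the command tested here, sort, is just an example):

    # Feed a command random input and see whether it survives.
    import os, subprocess

    for trial in range(10):
        junk = os.urandom(1024)              # a kilobyte of random bytes
        try:
            result = subprocess.run(["sort"], input=junk,
                                    capture_output=True, timeout=5)
        except subprocess.TimeoutExpired:
            print("Trial", trial, "hung")
            continue
        if result.returncode < 0:            # negative code: killed by a signal
            print("Trial", trial, "crashed with signal", -result.returncode)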
Apart from mistakes in the programs that we use, we can make mistakes ourselves. The most common are deleting files, or saving something else over existing information which we intended to keep.
Notorious commands like "rm -r *" should not be allowed. Or at least the user should have to explicitly verify that they really want the system to do what they just told it to. The same goes for throwing away documents in word processors and other applications. The verification should be more than just pushing return or the space bar. Ideally the user should be asked to do something which requires at least some attention. Emacs asks you to type "yes"; "y" by itself is not good enough.
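A sketch of a confirmation prompt along those lines, requiring the whole word rather than a single keystroke:

    # A confirmation that needs more than a single keystroke.
    def really_sure(prompt):
        answer = input(prompt + " Type 'yes' to proceed: ")
        return answer.strip().lower() == "yes"    # "y" by itself is not good enough

    if really_sure("This will delete every file in the directory."):
        print("Deleting...")      # the dangerous operation would go here
    else:
        print("Nothing was deleted.")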
It sometimes helps to be working on a simple single user system. In cases like this, if we accidentally delete a file and realise straight away, we can usually restore the data using an undelete utility. On most multiuser systems this can't be done because the space used by the file has most probably already been reused by some other file.
Minimising damage from mistakes is very similar to minimising damage from system or device failures. We need to keep copies of the information we want to protect.
The backup system we have already talked about can be useful here. Since we have a regular backup, when we delete a file accidentally we can always restore a previous version, as long as a backup was done while a previous version existed.
Unfortunately when we are working with documents we frequently do things which are wrong. We would like to have some sort of universal UNDO command. In fact sometimes I wish we had something like that for real life as well.
Some programs implement unlimited depths of UNDO. The Macintosh seems to be content with only one. How could the system help with levels of UNDO?
One way is to maintain different generations of files. As new versions of a file are saved the older versions aren't lost; instead they just become older generations. In fact only a small number of generations may be saved (say 4 or 6), enough to take you back to something which you remember had things how you wanted them.
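A system might implement this by renaming older versions on each save rather than overwriting them; the ".1", ".2" naming below is just illustrative:

    # Keep a few generations of a file when saving a new version.
    import os

    GENERATIONS = 4     # how many old versions to keep

    def save_with_generations(path, new_contents):
        # Shuffle existing generations up: file.3 -> file.4, file.2 -> file.3, ...
        for n in range(GENERATIONS, 1, -1):
            older = "%s.%d" % (path, n - 1)
            if os.path.exists(older):
                os.replace(older, "%s.%d" % (path, n))
        # The current file becomes generation 1.
        if os.path.exists(path):
            os.replace(path, path + ".1")
        with open(path, "w") as f:
            f.write(new_contents)

    save_with_generations("report.txt", "latest draft")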
It is important to remember that file generations are different from archives and backups, even though they do share some things in common. The biggest difference is that archives and backups must be stored on a different device (preferably at a different site), whereas file generations should be immediately accessible to the user.