File systems - Part 1

Introduction

So far we have discussed how people control computers, with the user interface and how their work can be protected with security. Arising out of security we talked about protecting objects and some of the most important objects to protect are files.

Files are very important because on most current computing systems they are the only permanent information. When a person logs out or switches the computer off the only information which will still be there when they return will be what they saved in a file (or what the system saved in a file for them).

What are they?

So a file is just a permanent collection of data. If this is the case we need to be able to retrieve this data so that it can be used again. So we need to have some way of referring to the data and distinguishing this clump of data from that clump of data. In other words we need a name for the file. This doesn't have to be a string of text (although it almost always is), it could be an icon, a sound, or an address on the device the file is stored on.

There is other information which we would like to know about files (or the operating system needs to know). As we saw in the section on security we need to know who owns the file and what rights the owner and other users have on the file. Other important information includes the size of the file, possibly the type of information stored in the file, when the file was last changed or accessed, whether the file has been backed up or not etc.

It seems quite clear that since we are going to want to store anything in a file we need to have a flexible system. There are two ways of approaching this flexibility:

we can design a truly flexible system which enables us to store information in all sorts of expandible structures. The structures have to be expandible because we don't know what sorts of files we will be dealing with in the future.
or we can ignore the idea and say that everything has to fit in a common structure. If your data doesn't fit nicely into the common structure then you write little routines which pack it into the structure when you save your information and which unpack it when you read the information back in.

We will come back to this question when we examine the concept of file types. Almost all operating systems understand at least two different types of files.

What should the file system do?

Command level file operations

There are some things we want to do with files from the command level. In this situation we work with the file as a whole. What are the sorts of things we would like to do with complete files? All of the following operations should only be allowed after checking the users privileges. We can only allow authorised access.

Create

Some systems allow us to create files from the command level. More commonly files are created by programs such as editors (this is because there is seldom a reason to create a file until there is something to store in it). UNIX allows us to create files by redirecting output to them.

What does file creation entail? Space must be found for both the data and attributes on the file system (more on this when we come to implementation) and a file table entry must be filled in for the file pointing to the allocated area. Some systems require that the size of the file be specified when it is created. This is usually tied in to an implementation requirement, files may be stored in contiguous secondary storage for example.

For most of what we want to do this is an unnecessary restriction. Files are limited enough by virtue of the contiguous nature of the underlying devices. E.g. You can't easily insert new material into a file. We certainly do want to be able to allow files to grow as much as they want (until the device or the user's disk quota is full).

Delete

Since some of the files are ours we want to be able to manage them, this includes being able to get rid of them. We can delete files in different ways. When we come to file systems with generations we will have to handle deletions differently. It is pretty common for deletions to return file space (preferably clearing it if we are talking about disk files) and then to mark the file table entry as deleted.

Copy

This should be a simple copy of all the information, maybe with different attributes such as creation date? In this way it is possible to have a last modified date before the creation date on some systems.

Move

This could be a copy followed by a deletion but since the information is already on the device, why don't we just change its file table entries? It may not be as simple as that because the file system might present a logical file space which spans several different partitions or devices (even over a network). In this case a true copy will have to be made.

Append

Sometimes we want to combine the information of one file with that of another. The system can do this either trivially by tacking on all the bytes of one file onto another or it can do it sensibly if it understands what sort of data is held in the files (e.g. files containing video or sound). UNIX does it trivially, but presumes you are working with text files.

Attach

Make this file appear somewhere else in the directory structure as well as where it currently is, the UNIX "ln" command. More about this when we discuss directories.

Change attributes and name

Some of these should be unchangeable but it makes sense to allow the user to fiddle some of these.

Command level directory operations

Similar commands should be available to cope with directories. In some ways directories are like ordinary files and in some ways they need to be handled specially. For example when we delete a directory we really want to make sure that it is empty first. If the user really wants the directory deleted all files in the directory must be deleted before the directory disappears, otherwise we will end up with files which we can't access anymore. Just like dynamically allocated variables to which the pointers are lost. Because of this many systems (such as MSDOS don't allow the deletion of directories until all files contained in the directory have already been removed.

Apart from creating, deleting, changing names, moving, etc of directories one command specific to directories is changing the value of the current directory. In almost all hierarchic file systems there is the concept of current working directory. This makes searching for files and naming files easier for the user and the system.

Command level volume operations

There are also commands to do with the device that the files are found on. Generally these are disk drives and so the commands are things like initialising and verifying disks. Other examples of this type of command would be mounting and dismounting file systems whereby a new disk is added to or removed from the file system. More on this when we come to implementation later in the year.

The Macintosh system is interesting because it does a lot more work than other personal computer operating systems when a new disk (e.g. a floppy) is mounted.

First of all the disk is checked to see that it has a file system on it which is recognised by the Mac. This includes MSDOS disks amongst others. If not it gives the user a message about whether they want the disk formatted.

If there is an understood file system, it is then checked to see if there are any executable programs stored on it. These are registered in the desktop, just in case a file is opened which needs that program. More about this when we talk about file attributes in a moment.

Program level file operations

The other type of operations the system should provide are to do with accessing the files from within our programs. All of the above commands should also be available from within a program, however programs also want to deal with the actual data stored in the file, not just the file itself. Before we will look at these operations we need to widen our view of files. It has been becoming a little narrow with some of the above commands.

Device independence

From the point of view of our programs it is not important where the data comes from, in fact it is convenient if programs can accept data from a variety of sources. If we want to allow this we can't recompile our program every time we want to use a different input or output device. The most common requirement would be to change input from a file to the terminal, and the same for output or else to a printer. Because our program does not change, the system must make all of these devices appear similar. This is known as device independence. We will return to the implementation of this when we look at devices later in the course.

For the moment, the important thing to realise is that in theory at least our program doesn't know what type of thing is providing or taking its information and if we design our file system properly it is the thing which takes care of this, at least at the top level.

From the point of view of our program data is coming in from the file system and going out to the file system.

Streams

This gives us another way of looking at files. Rather than the static things which exist outside programs we can look at them from the program point of view. From the program a file is something which provides or accepts a "stream" of data. At one level how the file is structured is irrelevant, as long as the correct bytes arrive at the program when it needs them. Even a random access file whereby the program directs the file system to move to a specific location within the file and return the contents of that location can be seen as nothing more than output and inputs on a continuous(?) stream. By the way what we mean by streams in this section of the course has nothing (not quite true) to do with UNIX streams which are used to provide layers of service between devices and processes.

So when we look at input and output we can view any source or destination as a stream.

From the programs point of view, it makes requests either for data from a stream or puts data to a stream. The stream can also provide information about its state to the program, such as there is no more information at the moment, or the file cannot be written to.

In order for all sources and destinations of input and output to be used in the same manner extra layers of software specific to the different devices must be added. The process will use a general interface which talks to device specific code (the device driver). The details of the interface and how requests can be controlled by the operating system we will look at when we get to implementation later in the year.

Operations on streams

If we restrict ourselves to thinking about streams, the operations we require become very simple.

Read

Certainly we need to be able to read data from the stream. What do we need in order to do this? At a bare minimum we need to say where we want to put the data we are requesting from the stream and how much data we require (this doesn't have to be done explicitly, it could be done by the type of variable we are requesting to read, or the stream may have a record length associated with it). And of course we need to specify which stream we want the data to come from.

Seeking

The commonest form of file access is sequential which corresponds to the idea of streams. Data is read in the order it is stored in the file. Of course if we are using a file from a disk device (which allows random access) we may in fact say where we want to get data from. A position within the file. Some systems require that every read call must include such file position information. It is common for systems to keep track of where the last read was made. In this way the programs which read the file don't need to explicitly state where the next read is to come from. The file system keeps track of it for them. If the program wants to read from somewhere else it makes a file reposition or seek call. Seeks can be relative to the start or the end of the file, or to the current position within the file.

Write

A write operation is similar. We need to say which data (where it is and how much) we want to write to which stream. Similar comments apply to moving the write position with a seek into the file.

Open

Both reads and writes need to direct their attention to a particular stream. To get access to a stream we need to make a link between the internal world of the process and the external world of the devices which provide the stream. This is normally accomplished with an "open" system call. This requires the file or stream name. More on names in a moment.

It is informative to think about what the job of such a call actually is. Usually it maps a file name to some other way of talking about the file in subsequent reads and writes. It has to find what device the named file is located on and where the file is on the device (more about this when we talk about implementation). It does this by accessing a file directory and finding the named file's entry. It notifies the operating system that this particular file is going to be used, this can help the operating system make subsequent accesses to the file more efficient. Information such as where the file exists in secondary storage can be loaded into memory to enable fast location. If the operating system knows about the type of access a process requires it can help with deadlock prevention and some concurrency problems e.g. if a process is writing the file, it is probably not a good idea to allow another process to open it for writing.

Of course we need to check to see if the file exists. If it doesn't and is being opened for writing we may have to create the file.

A more general condition performed while opening a stream is checking the access rights of the process (or its owner) to the stream. As I mentioned in the security section, it would be safer to do this on every access however in the move to a practical solution many operating systems only check such rights when a file is opened.

Buffers

Another task is to allocate a buffer for use by the stream. Data may not come in from the device associated with the stream in the right sized chunks. In particular when we are talking about disk devices, they tend to move data around in blocks of between 512 and 8192 bytes. The file system may make such blocks uniform by putting a logical layer over the device layer.

By having a buffer set aside for the stream, as data is requested, blocks of data are placed in the buffer. Then subsequent reads of say 4 bytes can come from the buffer, without having to do another access to the device.

Other systems don't associate permanent buffers with open files. UNIX is interesting, it doesn't allocate permanent buffers, but the standard io library routines do.

A side effect of using a file buffer is that we speed up execution of our processes. Every time the file system has to go out to disk and retrieve information our process has to wait (usually at least). If our last read actually brought the information we want next time into our file buffer, then we don't have to do a physical read next time. This is the reason that character by character i/o isn't as costly as you might think it should be. Similarly when writing information to a file, we just put the information in the buffer and the file system writes it, either when the buffer is full, or is not needed anymore (because the file has been closed or the process has terminated).

Buffering will be discussed in detail in Bruce's part of the course.

Other approaches

Is there any other way of doing these things? Is an "open" call actually necessary? (Well, it simplifies many of the questions. If we just used a filename and said read("filename") we would have to use a different method to indicate reading from the current position rather than from the beginning of the file. We could still do this with an open("filename") and then refer to the file information block with the filename. What about opening the same file twice?).

Close

If the file system has been informed that a process is going to use a file with the open system call, it must also be informed when the file is no longer needed by the process. The information concerning the use of the file needs to be changed. This may mean that another process which had been waiting for the file can now be given access and continue. Alternatively if the close results in no processes using the file and no other processes are waiting to use it the system information about the active file is no longer necessary and can be returned, along with any buffers the file may have used.

If information about the file has changed then this must be written to disk, e.g. time of last access.

Locking

Since two or more files may want to change a file at the same time, it is usual to provide some way of locking files. The system may only allow one process at a time to write to a file. Database systems must allow more than this, they require finer grain record locking.

If the system doesn't enforce sequential write access, the processes themselves will have to use some other mechanism in order to guarantee the integrity of the file data. Bruce will be covering some of these techniques in his part of the course. Many UNIX programs use a lock file. They try to create a file with a common name. If the file already exists they know that the program they want to read from is currently being written to, and they try again later.

This technique has all sorts of problems:

When should the program try again?
What if there has been an error and the lock file is never going to be removed?

Back to the lecture index

On to the next lecture