File systems - Part 2

Last time we saw a common set of commands which the operating system must provide to deal with files, and the file system. In this lecture we are going to look at files themselves and discuss ways in which the operating system can help us deal with files, if it knows something about the types of data they contain. But first we need to look at the second important data structure which the file system must control. The first was the file buffer, where data is stored by the operating system between the process and the device.

File Information Block

When using a file we don't want to keep track of all the information required to access it - the operating system is supposed to do that for us. All the information about where the file is, where the buffer associated with the file is, how many processes are reading the file, where we are up to in reading it, whether it is open for reading or writing, and so on, needs to be stored somewhere.

A data structure with this file information goes by different names in different systems. A commonly used name is "file descriptor"; however, this means something related but different in UNIX. To avoid confusion we will call it a file information block. Every open file associated with a process has one. In fact it makes sense to keep them together in a table, the process's open file table.

Unfortunately we have a bit of a problem here. Some information is needed by the file system irrespective of the process using the file, e.g. how many processes are using the file, and is it being written to? The count of processes must be decremented every time the file is closed, and when the count reaches zero, the file information block can be released (after ensuring that any changes have been written to disk). Other information is going to be different for each process using the file: which file block am I reading, and where am I up to in the file (for sequential access)? And so we are forced into having a two part file information block, one part associated with each process using the file and one part outside individual processes. Each process file information block points to the system file information block associated with the same file.
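
The two part arrangement can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the names SystemFileInfo, ProcessFileInfo, open_file and close_file are invented for this sketch and do not belong to any real operating system.

```python
from dataclasses import dataclass


@dataclass
class SystemFileInfo:
    """System-wide part: one per open file, shared by every process."""
    name: str
    open_count: int = 0   # how many opens are currently outstanding
    writers: int = 0      # how many of those opens are for writing


@dataclass
class ProcessFileInfo:
    """Per-process part: private state plus a pointer to the shared part."""
    system: SystemFileInfo
    mode: str             # "r" or "w"
    position: int = 0     # where this process is up to in the file


# Hypothetical system open file table: file name -> SystemFileInfo
system_table = {}


def open_file(name, mode):
    info = system_table.setdefault(name, SystemFileInfo(name))
    info.open_count += 1
    if mode == "w":
        info.writers += 1
    return ProcessFileInfo(info, mode)


def close_file(pfi):
    info = pfi.system
    info.open_count -= 1
    if pfi.mode == "w":
        info.writers -= 1
    if info.open_count == 0:
        # A real system would first make sure changes are on disk.
        del system_table[info.name]
```

Two processes opening the same file get separate ProcessFileInfo objects, each with its own position, but both point to the same SystemFileInfo, which is released only when the last one closes the file.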

Why do we need a file information block? Couldn't we provide all the information we need when we want to read a particular number of bytes from a file?

The short answer is yes, we could survive without a file information block. After all, where did most of this information come from? It came from the file table permanently stored on the device, and the information could be read from the device every time it was needed. This certainly saves memory, but it is usually regarded as unacceptably slow. We want to minimise the number of times we go to disk for every file access.

There are other limitations. Every read or write would have to pass, as a parameter, where it wanted to read from or write to. And how would the system know if a file was open for writing by more than one process? So in practice we always use something like a file information block.
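
To see what life without a file information block would look like, here is a minimal sketch in Python. The functions stateless_read and read_in_chunks are hypothetical names made up for this illustration. Every call looks the file up again by name, and the caller has to remember the position itself.

```python
def stateless_read(path, offset, count):
    """Read `count` bytes at `offset`, keeping no state between calls."""
    with open(path, "rb") as f:   # the file is located again on every call
        f.seek(offset)            # the caller must say where to read from
        return f.read(count)


def read_in_chunks(path, chunk_size):
    """Sequential reading: the caller, not the system, tracks the position."""
    offset = 0
    chunks = []
    while True:
        chunk = stateless_read(path, offset, chunk_size)
        if not chunk:
            return chunks
        chunks.append(chunk)
        offset += len(chunk)
```

Nothing here records who else is using the file, which is exactly the second limitation mentioned above.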

File Handles

We touched on this earlier when asking questions about "open". When we use the "open" file system call we invariably are given some pointer or small integer value which we use for all our subsequent actions on that file. This goes by different names (in UNIX this is the "file descriptor"). We will refer to these as file handles.

All a file handle has to do is to give the correct connection between our program and the corresponding file information block we think we are dealing with. Usually it is just an integer index into an open file table.
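
A handle as an index into a table can be sketched as follows. The class OpenFileTable and its methods are our own invention for illustration. Reusing the lowest free slot mirrors what UNIX does with its file descriptors, but nothing else here is meant to match a real system.

```python
class OpenFileTable:
    """A per-process table; a file handle is simply an index into it."""

    def __init__(self):
        self._slots = []

    def allocate(self, info):
        """Store `info` in the lowest free slot and return its index."""
        for handle, slot in enumerate(self._slots):
            if slot is None:
                self._slots[handle] = info
                return handle
        self._slots.append(info)
        return len(self._slots) - 1

    def lookup(self, handle):
        """Map a handle back to the file information it stands for."""
        if not 0 <= handle < len(self._slots) or self._slots[handle] is None:
            raise ValueError(f"bad file handle: {handle}")
        return self._slots[handle]

    def release(self, handle):
        """Free the slot so that a later open can reuse the handle."""
        self.lookup(handle)          # rejects handles that are not open
        self._slots[handle] = None
```

The program only ever holds the small integer; everything the system knows about the file stays inside the table.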

The peculiar thing is that there is another level in the UNIX file system. The inodes (which we mention later, but which are basically the file table entries on UNIX disk devices) are brought into memory when a file is accessed and are then known as incore inodes. Incore inodes differ from their on-disk copies because they record the changes which are made to the file (when it was last accessed etc) as well as keeping track of how many processes currently have the file open.

UNIX provides an extra level between the per process file table (which is really just a table of pointers to the next level) and the system open file table - the incore inodes. This intermediate open file table or file-structure table (see pg 639 of the textbook) contains the file position. This table is shared by related processes giving the ability for these processes to read consecutive parts of the file for example.
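
The effect of a shared file position can be observed from Python on a UNIX-like system. In this sketch we use os.dup within a single process as a stand-in for the sharing between related processes: a duplicated descriptor refers to the same file-structure entry, whereas a second os.open creates a new entry with its own position. The function name and the file name are just for this example.

```python
import os
import tempfile


def shared_vs_separate_offsets():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.txt")
        with open(path, "wb") as f:
            f.write(b"abcdefgh")

        fd1 = os.open(path, os.O_RDONLY)
        fd2 = os.dup(fd1)                 # shares fd1's file position
        fd3 = os.open(path, os.O_RDONLY)  # new entry, independent position
        try:
            first = os.read(fd1, 4)       # advances the shared position
            via_dup = os.read(fd2, 4)     # continues where fd1 stopped
            via_open = os.read(fd3, 4)    # starts from the beginning
        finally:
            for fd in (fd1, fd2, fd3):
                os.close(fd)
        return first, via_dup, via_open
```

The reads through fd1 and fd2 return consecutive parts of the file, while fd3 sees the start of the file again. All three still lead back to the same incore inode.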

Regardless of how many processes are accessing a file they all eventually refer back to the file via its one incore inode.

File Attributes

Regardless of the underlying structure of our file system, operating systems need to know various things about our files. The first (and most obvious) is the filename itself. There must be some way of identifying each particular set of data. Apart from the name, there are other sorts of information about files which most operating systems deal with. We will give these the general name of file attributes.

Names

Names have several tasks. Since each user commonly has several hundred files on a multiuser system, one of the tasks of a filename must be to aid the user: to remember what a particular file was called, possibly where it was placed and, when they come to look at the file, what was in it.

This means that filenames should usually be readable text. We could try icons but most of us would have difficulty designing unique ones which conveyed the information we would need. For graphics or video files, miniature pictures of the contents may be better for when the user is browsing looking for a particular picture, however, we still want to associate a textual name with the file for other ways of referring to it.

Other consequences are that we need to have a reasonable length. About 30 characters seems adequate for most purposes. Some systems allow filenames of "any" length, usually limited to 256 (or 65536) characters.

Very long filenames have the problem of being difficult to type correctly, either directly or in our programs. They are also difficult to display. If two files have the same name for 200 characters, the user interface is going to have a hard time displaying the difference.

Normally when we are trying to store information we try to group it into related blocks. We can do the same sort of thing with files via their names. We could make sure that all files from our book on the America's Cup start with the words "America's Cup", then we may have files associated with each challenge, and we could name these with the year of the challenge - "America's Cup 1992". We could then have a file dealing with each race in the final - "America's Cup 1992 Race 4". By the way this shows that it would be nice if we could use special characters in our filenames as well as alphabetic and numeric characters. MS-DOS forces one of the worst filename conventions imaginable on us (now undone with Windows 95?).

Grouping files according to name

Of course it becomes a strain to remember and type in such long names (even selecting them from a menu can be a problem if the menu is very long). Instead we would like to be able to refer to them according to components in the filename. If we are going to do this we need to have some way of separating the components. A typical way to do this is to allocate some character such as "/", "\", or ":", so that our big filename above becomes - "America's Cup:1992:Race 4".
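
Referring to files by their components is easy to sketch. The helper names split_components and select are hypothetical, and ":" is assumed as the separator only because it is the one used in the example above.

```python
def split_components(filename, separator=":"):
    """Break a filename into its components."""
    return filename.split(separator)


def select(filenames, leading, separator=":"):
    """Return the filenames whose first components match `leading`."""
    n = len(leading)
    return [name for name in filenames
            if split_components(name, separator)[:n] == list(leading)]
```

Given a list of filenames, select(names, ["America's Cup", "1992"]) would pick out every race file from the 1992 challenge without our having to type any complete name.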

What is to stop us calling it "Race 4:America's Cup:1992"? Well, nothing really. When we come to look at the way file systems are implemented, we will see that almost all of them use some sort of directory hierarchy, which gives us reasons why this form is not as good as our first one. Thinking in terms of chapters and sections of books tends to give a clear structure to the way we name such files. However, not all such structure is evident. If our book was on all ocean boat racing we may have structured our chapter as "1992:America's Cup:Race 4".

This system provides us with a way of successively refining our filenames. Since we have a multiuser system it makes sense to give each user a starting component of their own login name for all of their private files.

In later systems we have worked on the analogy between trying to store information in files and storing information in filing cabinets. In the same way we put files in folders and folders in drawers and drawers in cabinets, we do analogous things with computer files.

Directories

The other aspect about filenames is that the system must be able to use them to provide you with the information stored in the file. Somehow the system must be able to use the name you provide to find the information stored in the file.

What must the operating system have in order to find your file?

Obviously it must have some sort of table in which it stores the filename and other information, including where the data in the file is actually stored. On older and smaller systems there was only one such table per disk device. The obvious limitation of this was that the table could grow too big. Remember that each file has to have a unique name, and selecting from a large list can be tedious both for the system and for the user.

We can solve the problem for the system using a convenient data structure such as a B-tree or a hash table. However we are still stuck with the problem for the user, in particular if we have a multiuser system.

Oberon again

Since Oberon was designed as a single user system, it had what is known as a flat directory. Each file name could appear only once. The file name information was actually stored in a B-tree to enable fast searches to find the file table information about a file. Note that this structure has nothing to do with the appearance of the file system to the user. It is purely an implementation detail.

Two level

An improvement over this method was to have a system of directories. Each user on the system was provided with a directory. Each directory contained all of the user's files. The file's complete name consisted of the user's directory name and the file name.
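
A two level scheme can be modelled as a table of tables. The class TwoLevelFileSystem is a made-up name for this sketch, and the "/" between the directory name and the file name is an assumption on our part, since real two level systems used a variety of separators.

```python
class TwoLevelFileSystem:
    """One directory per user; names need only be unique within it."""

    def __init__(self):
        self.users = {}   # user directory name -> {file name -> contents}

    def create(self, user, filename, contents):
        directory = self.users.setdefault(user, {})
        if filename in directory:
            raise FileExistsError(f"{user}/{filename}")
        directory[filename] = contents

    def read(self, complete_name):
        # The complete name is the user's directory name plus the file name.
        user, _, filename = complete_name.partition("/")
        return self.users[user][filename]
```

Two users can now both have a file called "notes" without clashing, which a flat directory cannot offer.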

Multi-level

The obvious extension to this layout is the hierarchical system that exists on most popular operating systems. The file layout is seen as a tree, with files being at the leaves. Now we can take all the components of our pathname derived above and turn all of them except the last one into directories.

There are all sorts of interesting questions about the implementations of directories. Should they be just like ordinary files or should they be treated in special ways by the system? Should the file attributes (to be explained shortly) be stored in the directories? They aren't in UNIX.

The advantages of a multilevel hierarchy are that it is easier for users to find their way around. Referring to a file in the "current directory" is always just a simple matter of referring to the filename. It is easier to ensure unique system wide filenames because the full filename includes all enclosing directories.

Sharing

It is common for such systems to allow sharing of parts of the directory tree. In other words a directory may be included in different parent directories. In UNIX this type of sharing is set up with the "ln" command. "ln" also allows aliasing of files (or directories) so that different names can lead to the same files (or directories). "ln" doesn't allow hard links to other directories because it doesn't want cycles being formed in our directory graph (it is no longer a tree, because the same file can be a descendant of more than one parent), however it does allow "symbolic" links to directories. A "symbolic" link is just a file, marked in a special way so that the operating system knows it points to another file. The contents of the link itself are merely the name of the file the link is pointing to. If you do an "ls -l" you see this information.
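
The difference between the two kinds of link can be checked from Python on a UNIX-like system, where os.link and os.symlink correspond to "ln" and "ln -s". The function link_demo and the file names are only for this illustration.

```python
import os
import tempfile


def link_demo():
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "original.txt")
        hard = os.path.join(tmp, "hard.txt")
        soft = os.path.join(tmp, "soft.txt")
        with open(original, "w") as f:
            f.write("shared contents")

        os.link(original, hard)      # like "ln original.txt hard.txt"
        os.symlink(original, soft)   # like "ln -s original.txt soft.txt"

        return {
            # A hard link is a second name for the same inode.
            "same_inode": os.stat(original).st_ino == os.stat(hard).st_ino,
            "link_count": os.stat(original).st_nlink,
            "hard_is_symlink": os.path.islink(hard),
            # A symbolic link is a separate file holding the target's name.
            "soft_is_symlink": os.path.islink(soft),
            "soft_names_target": os.readlink(soft) == original,
        }
```

The hard link shares the original's inode and raises its link count to two, while the symbolic link merely stores the name of the file it points to.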

A little later on we may see how UNIX finds its way through the directory structure. At the moment you can imagine that each pathname component (as it is called) is searched for in turn from the root of the directory tree. In UNIX this root is actually called "/" which is a little peculiar since this is the character used to separate the pathname components. Other peculiarities include naming the current directory "." (why not call it its true name?) and the current directory's parent directory "..".
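
Searching for each pathname component in turn can be sketched like this. The Directory class and the resolve function are hypothetical and keep the whole tree in memory, which a real file system would not do. The one UNIX detail we rely on is that the parent of the root directory is the root itself.

```python
class Directory:
    def __init__(self, parent=None):
        # The root directory is its own parent, as in UNIX.
        self.parent = parent if parent is not None else self
        self.entries = {}   # name -> Directory, or file contents

    def mkdir(self, name):
        child = Directory(parent=self)
        self.entries[name] = child
        return child


def resolve(root, current, pathname):
    """Follow `pathname` one component at a time."""
    # A leading "/" means start at the root, otherwise at the current directory.
    node = root if pathname.startswith("/") else current
    for component in pathname.split("/"):
        if component in ("", "."):
            continue                      # "." is the directory we are in
        if not isinstance(node, Directory):
            raise NotADirectoryError(component)
        if component == "..":
            node = node.parent            # ".." is the parent directory
        elif component in node.entries:
            node = node.entries[component]
        else:
            raise FileNotFoundError(component)
    return node
```

With a tree containing /home/alice/notes.txt, the pathnames "/home/alice/notes.txt" from anywhere and "./notes.txt" from inside alice's directory both lead to the same file.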
