Peter Gutmann, pgut001@cs.auckland.ac.nz
http://www.cs.auckland.ac.nz/~pgut001/pubs/sal.html
Code analysis tools (or more generally formal methods) have had a long and rather troubled history that may be summarised briefly by saying that they're so incredibly painful to use that almost no-one does. This has led to Eric's Razor (after Eric S. Raymond, author of “The Cathedral and the Bazaar”):
In order to gain user acceptance, a software assurance tool mustn't require that you re-invent your entire universe before allowing you to write 'Hello, world'.
PREfast is a (rare) example of such a tool. Another one is the open-source Deputy, which I'll compare it with later on. There are also various commercial tools available, like the Coverity static code analyzer and the Fortify security analyzer, that are extremely powerful but also extremely expensive. If you're interested in this sort of thing at a more technical level, Secure Programming with Static Analysis provides a general overview, although much of the book is devoted to general secure-programming issues rather than the juicy details of how a static analyser works.
At the moment PREfast is only included in the Deluxe Professional Enterprise Team High-def Surround-sound Extreme Plus Ultra edition of Visual Studio. Luckily though there's a backdoor method of getting it that works with all versions down to the free Express edition, and (at least as of late 2007) looks like it'll continue to be supported in the future. If you download the latest Windows SDK (which is currently the Windows SDK Update for Windows Vista, but use whatever version is current at the time you read this), you can get the latest version of the development tools to go with your existing Visual Studio setup. Download the installer for the SDK and run it. You'll be given a long list of things to be installed totaling what seems like about half a terabyte of data. Deselect everything and then select only “Vista headers + libraries” and “VC++ compilers” (but omitting the IA64 one). The total download size should now be about 50MB, or about 10-15MB less if you don't care about the x86-64 stuff (note that if you're using the free Express edition you definitely don't want to bother getting the x86-64 stuff since it won't work with that).
Go through the install and then fire up Visual Studio. Open up Tools | Options | Projects and Solutions | VC++ Directories, and in “Executable Files” add the path to the compiler tools in the SDK, which should be something like /Program Files/Microsoft SDKs/Windows/v6.0/VC/Bin (but not the other 'Bin' directory /Program Files/Microsoft SDKs/Windows/v6.0/Bin, note the lack of a VC in the path).
Then in “Include Files” add the path to the headers. In this case there are two directories to add, first something like /Program Files/Microsoft SDKs/Windows/v6.0/VC/INCLUDE for the standard C headers and then something like /Program Files/Microsoft SDKs/Windows/v6.0/Include/ for the Win32 headers (if you don't add both of these then you won't get the SAL-annotated versions, since Visual Studio will fall back to the default non-annotated ones). Finally, add the library files in the same location. These should be the first entries in each list, before all of the existing entries.
Now you're ready to go, using the PREfast-enabled compiler from the SDK rather than the default that comes with the IDE. Open the project properties and click on “Configuration Manager”. In the “Active solution configuration” combobox select “<New…>”. Give the new configuration a name like “Analyse” (if you're from the US and spell things funny then you can use “Analyze” instead) and copy across the settings from some convenient existing configuration like “Debug”.
Finally, again in the project properties, go to Configuration Properties | C/C++ | Command Line, and in the “Additional Options” box add /analyze.
That's it, you're now set to run PREfast when you select the “Analyse” configuration.
So you've installed PREfast and fired up your project in Visual Studio. There's an immediate temptation to use the Big Bang approach and build everything in analysis mode to see what happens. If you try this you'll get about, oh, eight million warnings, and probably decide that it's not worth the trouble. This is not a good way to get started.
Instead, you're better off building your code file by file, annotating each one as you go. Open the first file and then in turn open the headers that it uses (not the system headers, Microsoft has already annotated those for you). Annotate your header files and the source file and then build just that one file. Fix any problems (if possible), or add the necessary annotations to guide PREfast if the problem is (genuinely) a false positive (more on those further on). Then do the same for the next file, until you've completed the entire project.
Unless you're an exceptionally talented programmer who's managed to write near-perfect code, this is not going to be a small task. For any reasonable-sized project, plan to spend at least a couple of weeks going through all of your code adding the necessary annotations and making code changes. On the other hand the level of checking that you're getting by doing this will definitely pay off in the long term.
Once we get down to the specifics of using SAL things get a bit complicated since the documentation is rather confusing. If you look at the documentation for VS 2008 you'll see that it actually talks about the low-level attribute notation used internally (or at least in very low-level headers) by the compiler, and not what you'd be using to annotate your code. The best source for proper documentation is either A Brief Introduction to the Standard Annotation Language (SAL) or the (somewhat device driver-specific) PREfast for Drivers. There's also an overview page in MSDN that doesn't use the attribute notation, but what's there doesn't correspond to what's in the SAL header file from the SDK or in the previous two linked articles. There is however an older version of the page available that does correspond to the header and the other articles. Finally, the only really complete source seems to be the code comments in the SAL header file itself, /Program Files/Microsoft SDKs/Windows/v6.0/VC/INCLUDE/sal.h.
If you look inside sal.h you'll see that there are vast numbers of possible annotations, most of which you'll probably never use. The reason for this huge list is that in order to avoid requiring users to write long strings of attributes to annotate their code, Microsoft have collected the most common combinations into a single name like “__deref_inout”, which expands to “__notnull __elem_readableTo(1) __pre __deref __valid __post __deref __valid __refparam” (phew!), each of which in turn expands further to a “__declspec(…)” in the older implementation or the “([…])” attribute notation in the newer one.
The resulting mass of annotations can be rather intimidating. Rather than trying to use SAL directly, I've found that it's easier to identify common code patterns and then create a subset of the SAL annotations that work for you, which I'll cover in the next section. This both helps manage the complexity and means that you can retarget your annotations for whatever format might be used in future versions of Visual Studio.
The first step in setting up a SAL workflow is to identify common patterns in your code that correspond to SAL annotations. For example one pattern that's used frequently is { buffer, length }, in which a function takes a buffer containing a set number of bytes and does something with it. An example of this is memcpy(), which might be annotated as:
memcpy( dest, __in_bcount( length ) src, length );
which tells PREfast that the number of bytes in src is given by length.
memcpy() also requires a second annotation to tell PREfast how many bytes of output are going to be produced. Since this is the same as the number of input bytes, the final annotation is:
memcpy( __out_bcount( length ) dest, __in_bcount( length ) src, length );
whose meaning should be fairly obvious. Using these annotations, PREfast can now check that the dest buffer is large enough to contain what's in src.
There are some other annotations that you can add as well, but some of the more obvious ones are automatic if you're reasonably careful with your function declarations. For example the length parameter would in theory need a __in annotation, but if you use const with the declaration then it's automatic.
Another common pattern is { buffer, maxLength, *length }, in which a function takes a buffer of a certain maximum length and deposits data in it, returning an indication of how many bytes were processed. Windows contains many instances of this pattern, an example being ReadFile, which might be annotated as:
ReadFile( HANDLE hFile,
__out_bcount_part( nNumberOfBytesToRead, *lpNumberOfBytesRead ) \
LPVOID lpBuffer,
DWORD nNumberOfBytesToRead,
LPDWORD lpNumberOfBytesRead,
LPOVERLAPPED lpOverlapped );
which tells PREfast that lpBuffer has a maximum possible size of nNumberOfBytesToRead and the number of bytes actually read will be returned in *lpNumberOfBytesRead.
All in all there's a huge number of annotation combinations that are possible. You can get an idea of what's available in the MSDN summary. As you can see by cross-referencing this to sal.h, this can lead to a combinatorial explosion of annotation complexity.
In order to manage this complexity, the next step once you've identified useful patterns is to reduce the number of annotations to something that you can work with and (hopefully) remember. What I've found works best for me is to use the preprocessor to create names for the small subset of annotations I'll be using that are immediately meaningful to me. So instead of having to try and recall whether a parameter that returns a pointer is meant to be annotated with __out, __deref_out, __out_ecount, __deref_out_z, __deref_out_ecount, or several dozen other annotations, I can just say OUT_PTR and leave it at that. Isomorphic annotations like __deref_out_opt and __deref_opt_out are particularly entertaining, so having a single form that's meaningful for you helps a lot here.
Another advantage to using your own names is that you're not tied to the use of sal.h in all situations. In other words by including in a global header code like:
#if defined( _MSC_VER ) && defined( _PREFAST_ )
#include <sal.h>
#define IN_BUFSIZE __in_bcount
[…]
#else
#define IN_BUFSIZE( size )
[…]
#endif
you can ensure that your code builds without problems whether code analysis is available or not. Here's an extract from the header file I use with SAL:
#if defined( _MSC_VER ) && defined( _PREFAST_ )
[…]
#define IN_BUFSIZE __in_bcount
#define IN_BUFSIZE_OPT __in_bcount_opt
[…]
#define OUT_BUFSIZE __out_bcount_part
#define OUT_BUFSIZE_OPT __out_bcount_part_opt
#define OUT_PTR __deref_out
#define OUT_PTR_OPT __deref_opt_out
#else
[…]
#define IN_BUFSIZE( size )
#define IN_BUFSIZE_OPT( size )
[…]
#define OUT_BUFSIZE( max, size )
#define OUT_BUFSIZE_OPT( max, size )
#define OUT_PTR
#define OUT_PTR_OPT
#endif /* VC++ with source analysis enabled */
This has reduced the 700-odd-line sal.h to just over a dozen annotations that I actively need and use, represented in a form that I find easy to remember.
A final advantage is the fact that you can use this extra level of indirection when the SAL annotation format changes. An earlier section mentioned the changes in the annotation format from Visual Studio 2005 to Visual Studio 2008; by using indirection you can say:
#if _MSC_VER <= 1400
#define IN_BUFSIZE __in_bcount
#else
#define IN_BUFSIZE _In_count_
#endif /* Different versions of SAL markup */
(Note that this uses the notation from the Visual Studio 2008 MSDN documentation, which doesn't correspond to any version of sal.h that I've been able to get my hands on, so I haven't been able to check it).
One thing that you'll quickly notice when you start annotating your code with SAL is that (unless you're exceptionally disciplined) you've taken various shortcuts that need to be fixed. For example instead of:
int function( void *buffer, const int maxLength, int *length );
it's probably been much easier to write:
int function( void *buffer, const int maxLength );
with the length returned as the function return value and errors reported as negative values (this is a pattern carried over from the Posix API, which really likes to do this). In other words the return value is being overloaded to perform two tasks at once. While PREfast can perform at least some checking on this sort of function, it can't perform the same level as you'd get with the function success status and the returned length information separated out (there is an annotation __success but it seems intended to designate when by-reference parameters are modified rather than for this type of overloaded return-value handling).
This is why I mentioned earlier that you should plan to take at least a few weeks on this if you want to do the job properly. Using PREfast effectively isn't just a case of spraying annotations around your code; you actually have to put in a bit of effort to fix problem areas.
Here's a short list of typical code patterns and the annotations that you might use with them. (Note: This is taken from comments in sal.h and may not be 100% correct).
int writeData( const void *buffer, const int length );
This function accesses a block of memory of size length (without modifying it). The resulting annotation is pretty straightforward:
int writeData( __in_bcount( length ) const void *buffer, const int length );
A simple variation on this is the read/write version of the above:
int readData( void *buffer, const int maxLength, int *length );
This function takes a block of memory of up to maxLength bytes and returns the byte count in length:
int readData( __out_bcount_part( maxLength, *length ) void *buffer,
const int maxLength, int *length );
The next function returns a pointer to a list element:
int getListPointer( void **listPtrPtr );
The annotation for this is:
int getListPointer( __deref_out void **listPtrPtr );
to indicate that the result is stored in the dereference of *listPtrPtr (I've mapped this to OUT_PTR when I use it because this is easier to remember than __deref_out).
The next function is a standard by-reference call:
int getInfo( struct thing *thingPtr );
The annotation is again fairly straightforward:
int getInfo( __inout struct thing *thingPtr );
A final example for standard null-terminated C strings:
int writeString( const char *string );
The annotation is again straightforward:
int writeString( __in_z const char *string );
Note that PREfast can perform much better checking if you provide an explicit length (although calling strlen() to fake out the analyser just before you call the function isn't going to help much). The TR 24731 safer-C functions also address this in some detail.
(If anyone has any further common patterns that they'd like to see listed here, please let me know).
One problem that every code analysis tool has is false positives. Fortunately PREfast has relatively few of them, with one big exception: If a pointer is supposed to be non-NULL, it'll complain if you use it without checking it first. This is problematic in terms of false positives because passing complex data structures around by reference is standard practice and if you have several levels of nesting then a single check at the top level is sufficient — there's no need to keep checking for null pointers at every level of the call stack.
Similarly, if you're doing something like following a linked list using code like:
for( listCursor = listHead; listCursor != NULL; listCursor = listCursor->next )
{
do_function( listCursor );
}
then you know that anything passed to do_function, and any functions that it in turn calls, will be non-NULL. However, PREfast can't easily detect this and warns about a potential NULL pointer dereference in do_function() and each function that it in turn passes the pointer to. The brute-force fix for this problem is:
#if defined( _MSC_VER ) && defined( _PREFAST_ )
#pragma warning( disable: 6011 )
#endif /* VC++ with source analysis enabled */
but this may be overkill since it's turning off all warnings about potential NULL pointer dereferences and not just the flood of false positives. Luckily there's an alternative solution, the under-documented __analysis_assume annotation, which you can use to guide the analyser. So in the above example you would use:
__analysis_assume( listCursor != NULL );
to remove the false positive. You need to be a bit careful in how you use __analysis_assume. It may look like the traditional C assert but its function is quite different. To make sure I don't accidentally use it as if it was an assert I've mapped it to ANALYSER_HINT in my code-analysis header file.
Some of the problem of false positives with pointers is caused by the fact that there's no way to tell PREfast that you specifically want things checked in the caller rather than the callee. In other words if you have a function like:
int function( __in_bcount( length ) const void *buffer, const int length );
then PREfast will warn when it processes the code in function() that an access at a disallowed point in buffer is possible if length has an inappropriate value. At this point you need to track down every possible use of function() in your code to make sure that length has an appropriate value. What would be more useful is an ability to tell PREfast that it should check all calls to function() but not function() itself:
int function( __in_bcount( length ) const void *buffer, __check_caller const int length );
This annotation would direct PREfast to assume that length has a valid value when it's passed to function() and instead check every call to function() to make sure that what's being passed is valid. Implementing this would require an ability to specify via SAL what values length can legally take, which is covered in the next section.
A second area in which PREfast occasionally has problems is when figuring out which variables are live or not. Consider the following code sequence:
int var1, var2, var3;
status = foo( &var1 );
if( status == SUCCESS )
status = bar( &var2 );
if( status == SUCCESS )
status = baz( &var3 );
if( status != SUCCESS )
return( status );
result = var1 + var2 + var3;
In some cases PREfast has trouble following the code flow and emits a used-before-initialised warning. This only happens occasionally, and I'm not sure if there's a way to fix this (__analysis_assume doesn't address it because you can say __analysis_assume( ptr != NULL ); but there's no way to say __analysis_assume( value == __initialised__ );). This is probably one of those things that you just have to live with. Fortunately it's reasonably rare. In addition there's quite a bit of overlap between PREfast warnings of this type and compiler warning level 4 warnings, so if you eliminate all the warning level 4 warnings you'll get rid of most of the PREfast ones as well.
A variation of this is that PREfast has problems with the code flow in min() and max() and similar x ? y : z-style operations, so that limiting the byte count for an operation with min() still produces a warning about a potential buffer overrun. For example a code sequence:
__out_bcount_full( maxLength ) void *dest;
[…]
length = min( maxLength, byteCount );
memcpy( dest, src, length );
will produce a warning that the buffer is maxLength bytes long but UINT_MAX bytes might be written.
In the case of the used-before-initialised warning, one possible workaround is to define a dummy initialiser:
#define DUMMY_INIT 0
#define DUMMY_INIT_PTR NULL
and use that to perform a dummy initialisation of any variables for which PREfast returns a false positive (this also helps for other false positives such as when building with warning level 4 but without PREfast). One downside to this is that if you're telling others to do this rather than doing it yourself you're going to encounter remarkable levels of pushback from developers who refuse to insert totally superfluous initialisations into their code just because some stupid compiler can't figure out what is and isn't initialised. A few public executions pour encourager les autres may be required.
A second problem, which is a bit more hypothetical, arises when you've added annotations purely to suppress false positives and a later version of PREfast then fixes those false positives: the workaround annotations remain behind, potentially masking genuine problems that the newer analyser would otherwise have found.
This is where the use of the preprocessor comes in. Remember the earlier discussion about using a custom mapping from whatever works for you to SAL? If you're using the preprocessor in this way, you can add different defines for genuine vs. fix-false-positive annotations and then when a new generation of PREfast arrives you can temporarily disable the fix-false-positive annotations to see if (a) they're still a problem and (b) there isn't something else there that wouldn't have been detected otherwise.
A final problem, which I'll cover a bit more in the next section, is that there's no way to tell PREfast about integer ranges. For example if you have code like:
int length;
get_data_length( &length );
memcpy( dest, src, length );
and you know that get_data_length() will only ever return a value from 1 to 10, there's no way to tell PREfast this. As a result it'll issue a warning saying that dest is only X bytes long but up to UINT_MAX bytes may be copied. This one goes into my wishlist in the next section.
The one big thing missing from PREfast is handling of integer ranges and range checks. The previous section already gave one example of this where the analyser has to assume that all integer values are a worst-case INT_MAX/UINT_MAX because there's no way to tell it otherwise. What would help here is the addition of an integer range notation, say __range, that allows you to specify an allowable range for a variable. In the above example it'd be:
get_data_length( __out __range( 1, 10 ) int *length );
This then leads on to a second wishlist item, the extension of this ability to perform range checking to cover integer values in general. Detecting problems with integer ranges is something that makes finding buffer overflows look like child's play by comparison (the recent universal-remote Windows kernel vulnerability may have been due to an integer-range error, for example, although this is based on third-party reverse-engineering comments since no technical details have yet been released). CERT's secure coding guidelines provide an overview of the magnitude of the problem, but its true size is only really apparent when you start going through some code trying to apply these rules to all uses of chars, ints, and longs.
To make use of PREfast for range checking, when declaring a variable you'd use:
int __range( 1, 10 ) foo;
and PREfast would warn when an operation on foo could potentially cause it to go outside this range. If you're familiar with Pascal and related languages you'll have encountered this before, although Pascal et al do their checking at runtime rather than compile time. It's also supported in the Cyclone safer-C variant, which uses the same mechanism for checking integer ranges as it does for array bounds checks:
typedef { int x : x >= 1 && x <= 10 } smallint;
A more typical use for restricted-subrange integers would be to declare a distinct type based on ranges:
typedef int __range( 1, 10 ) smallint;
and then wherever you used smallint you'd get warnings if it could potentially go outside the defined range. Even more so than buffer overflows, integer range checking is something that really needs automated assistance because once you get to more than one or two levels of arithmetic operations it's no longer possible for most humans to track where the numbers are going.
A final wishlist feature is something like SQL's EXPLAIN verb to provide more information on why PREfast thinks that a particular issue is a problem. Without knowing why a (confusing) warning is issued it's too easy to dismiss something as a false positive when it may really be a well-hidden genuine problem that only PREfast's analysis is capable of spotting.
In the introduction I mentioned that a few other tools like PREfast exist. One that's been around for a while is splint, which evolved from the earlier LCLint, which in turn came from the even earlier Larch formal specification and verification project. Unfortunately splint is dangerously close to violating Eric's Razor: it's very labour-intensive, a large amount of annotation is required to get useful output, and the huge number of false positives produced requires even further annotations to suppress them.
Here's a quick example of splint annotation. Remember the earlier comment about Microsoft reducing long strings of annotations into a single combined form in sal.h? With splint you don't have this facility, so you end up with annotations like:
void /*@alt char * @*/ strcpy( /*@unique@*/ /*@out@*/ /*@returned@*/ char *s1,
char *s2 ) /*@modifies *s1@*/ /*@requires maxSet( s1 ) >= maxRead( s2 ) @*/
/*@ensures maxRead( s1 ) == maxRead( s2 ) @*/;
(yes, there's a function prototype buried under all that!). The SAL equivalent is something like:
void strcpy( __out_z char *s1, __in_z const char *s2 );
The closest open-source equivalent that I've found to PREfast is Deputy, which is quite SAL-like in its notation. For example the same function as above decorated for use by Deputy would be:
void strcpy( NTS char *s1, NTS const char *s2 );
(with NTS being short for “null-terminated string”). The nice thing about this is that because PREfast and Deputy use quite similar notation it should be possible to annotate code so that it can be checked by both tools, where one may find problems that the other can't detect (so far I've only checked this on a small amount of toy code; SAL is Windows-only and Deputy is anything-but-Windows only, so it's a bit tricky setting up a full-blown test).