Peter Gutmann, pgut001@cs.auckland.ac.nz
http://www.cs.auckland.ac.nz/~pgut001/pubs/paper_howto.html
February 2008
Every year I serve on the program committees of a number of computer conferences, which means that I get given a pile of papers to review for publication. Every year I see the same problems with submitted papers. And every year I decide I should write down some guidelines on things to look out for when writing a computer conference/journal paper. This year I've finally got around to it...
The following is a list of the problems that most commonly crop up in submitted papers. Treat this as a checklist to apply to the paper you're planning to write, or have already written.
Providing a summary of existing work in the area covered by your paper is extremely important. This serves two purposes: firstly it lets the reader know that you're familiar with the field, and secondly (and more importantly) it lets you know whether what you're doing is useful and/or original. I don't know how many papers I've received that I've had to reject with a comment like “This is a poorly-done reinvention of a technique from the 1980s” (and no matter what topic your paper is on, at least one referee is going to tell you that it's a reinvention of something that Multics was doing 40 years ago). Many authors simply plough ahead with a topic without (apparently) performing any checking to see whether anyone has done this before. Others do a quick check with Google and leave it at that; however, online archives typically only reach back a decade or so and are unlikely to tell you that you've reinvented something that first appeared four decades ago. Just because a paper was published in 10 BG (Before Google) doesn't mean that the ideas in it don't exist.
I can't over-emphasise how important this first step is. I've seen fifteen- to twenty-page papers killed in an instant because the authors never checked to see whether someone else had had this idea before them.
Experimental design can be a bit of an art form. As the saying goes, a problem well stated is a problem half solved. Although it's relatively rare to find the kind of biased metrics that are traditional in competitive benchmarking (when the goal is to prove that your system is better than the competition's), it's unfortunately too common to find problems like an inappropriate level of detail (either too much or too little), or poor or even no analysis.
The level-of-detail issue is an important one to get right. You want to present enough information that others can repeat the experiment, but not so much detail that you overwhelm the reader. At the too-little-detail end of the scale, I've seen papers with results that would be impossible for anyone else to reproduce, either because the raw data isn't available (a relatively common problem that others have complained about), because the data acquisition details are insufficiently specified, or because the data-to-information transformation step(s) aren't adequately explained. The goal of a scientific process is repeatability and verifiability. This is particularly important if the results are unusual or controversial. In one psychology paper published in the late 1990s with rather astonishing results, the authors took the unusual step of having the work repeated by independent teams in the US and Europe, as well as making the raw data available for others to use (many did, and got the same results as the original authors). Without this level of detail the very next research result could invalidate yours, and without repeatability no-one can demonstrate that you were correct.
At the other end of the scale are the papers decorated with various distractions like snippets of source code that caught the authors' attention, or long-winded discussions of possibly interesting but not terribly relevant minutiae. Yes, it's interesting stuff, but it detracts greatly from the main flow of the work. If you're going to include it, put it in a separate section so that it doesn't distract from the main body of the paper.
One other thing that you need to do is state your assumptions in advance. This is particularly important for papers in areas like networking and security, where the lack of a proper network performance model or threat model can make a paper almost meaningless (“assume a perfectly spherical elephant of negligible mass and volume”). A threat model shouldn't be just “we'll defend against anything that it's cryptographically interesting to defend against” but an at least vaguely realistic assessment of practical threats. In theory an unrealistic model could kill a paper (I've seen security papers where the threat model is taken straight from la-la land) but in practice this doesn't seem to be much of a problem. One paper that I really like that does this well is “Tor: The Second-Generation Onion Router” by Roger Dingledine, Nick Mathewson, and Paul Syverson, in which the bulk of the paper is made up of threat modelling and design considerations. When was the last time you saw a security requirement like the following in a paper: “A hard-to-use system has fewer users, and because anonymity systems hide users among users, a system with fewer users provides less anonymity. Usability is thus not only a convenience: it is a security requirement”? These guys really put some thought into the design of all aspects of their system.
Unrealistic models on the other hand are a sure sign that your workflow was { design a cool algorithm, invent an artificial problem for it to solve }. A friend once told me about a paper he'd co-authored on a neat algorithm for which they'd had to do exactly this: invent an artificial problem for the algorithm to solve. The next year the same conference where they'd published it received three more papers on solutions to this non-problem; it probably has its own research field by now. So you can usually get these things published if you're careful about which conference you submit to. On the other hand if I'm refereeing something I'd prefer to see an honest assessment of the algorithm's value, “it's a cool algorithm and we wanted to share the design with others”, rather than some totally bogus attempt at a justification for publishing.
While an artificial network or threat model isn't a killer, lack of any model at all is, because people are going to come up with situations that your design doesn't handle and tell you to resubmit once you've addressed them. Providing a model frames your solution and lets the reader know what it is that you're trying to achieve. It also means that reviewers will try and pick holes in your model rather than your design. An unfortunate side-effect of this is that you can then design completely unrealistic models (see the previous paragraphs).
If your paper presents a radical new solution to a long-standing problem and you can't believe that no-one has ever thought of this before because it's so simple and obvious, then it's usually because someone has thought of it before and it's not simple at all. If your paper requires a global PKI, DNSSEC, global single-sign-on, universal deployment of smart cards or TPMs, in fact anything with the word “global” or “universal” in it, then it's a sign that you haven't actually solved the problem at all but just made it someone else's problem. In computational complexity theory the standard way of relating problems is by reduction, showing that a solution to one problem would yield a solution to another. If your solution reduces trivially to an existing unsolvable problem then what you're presenting isn't part of the solution set, it's part of the problem set. Imagine someone submitting a paper to an algorithms conference that begins “We present a radical new algorithm for performing XYZ that reduces to the halting problem. Assuming a solution to the halting problem, our algorithm can then be used to ...”. This is the algorithmic-complexity equivalent of what most sweeping-solution-to-a-longstanding-security-problem papers are trying to do.
For some kinds of work it's more or less expected that you demonstrate some level of real-world relevance or functionality, for example by creating a prototype or plugin to demonstrate that what you've dreamed up actually works. In specific cases it's mandatory, for example when the work that you're presenting is something that's intended to be used by humans rather than, say, a new routing algorithm.
The computer science field has a proud tradition of analytical advocacy papers, ones in which the author analyses a problem, presents a proposed solution, and then advocates its universal adoption without ever passing through the intermediate step of empirically evaluating whether the solution actually works or not. This has led in the past to attempts to deploy technology that we now know, either through hard experience or, eventually, when someone publishes a paper that evaluates its effectiveness, just doesn't work no matter how much you tweak it.
If whatever you're proposing requires action by or interaction with humans, you need to perform some sort of real-world evaluation of the technology to demonstrate that it actually works in practice. If the result of the evaluation is that it doesn't, you can at least salvage a publishable paper by turning it into a “Why XYZ Doesn't Work” publication (although admittedly papers with negative results are rather harder to publish than papers with positive ones).
The topic of how to run a human-factors study is far too complex to cover here; see any standard reference if you're unsure. One source would be Jeffrey Rubin et al's “Handbook of Usability Testing”, although that's aimed more at someone creating a finished product than at testing whether a particular individual idea works or not. Larry Christensen's “Research Methods, Design, and Analysis” may be a better place to start, and since it's a standard text you can get copies of earlier editions, under the former name “Experimental Methodology”, very cheaply.
As a quick rule of thumb though, make sure you pick an appropriate sample size, be careful how you set up your study (people change their behaviour if they know you're checking for factor X, so you may need to disguise what you're looking for in order to make things more realistic), and as the following section points out, be careful how you analyse your results. If you want to get some advice from those who have gone before you, find a few papers from others who have worked in this area and see how they carried out their evaluations.
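If you're unsure what an appropriate sample size looks like, a standard power analysis is a reasonable starting point. Here's a minimal sketch in Python using statsmodels; the effect size, significance level, and power figures are illustrative assumptions rather than recommendations, so substitute values that match your own study:

    # Rough sample-size estimate for a two-group comparison (e.g. existing
    # interface vs. new interface).  The numbers are assumptions: a "medium"
    # effect size of 0.5, the conventional 5% significance level, 80% power.
    from statsmodels.stats.power import TTestIndPower

    n_per_group = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05,
                                              power=0.8)
    print(f"participants needed per group: {n_per_group:.0f}")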
One thing that can completely kill a paper is lack of analysis of the results. I've seen otherwise fine papers die in review because the “analysis” consists of the authors stating what's obvious from the graphs they present (“foo increases linearly”) without ever explaining why foo increases linearly, or whether foo increasing linearly is a good or bad thing. More scary are papers where highly suspicious results (for example an O(n) and an O(n^3) algorithm having identical runtimes) pass unnoticed. This is an indication that either the authors have been careless or that they don't understand the data that they're looking at, neither of which provides much confidence in a scientific paper.
When you present your results, you need to provide an analysis of why you obtained those results, not just a voiceover for the graphs and diagrams. If you have an O(n) algorithm but are seeing identical runtimes for n=100, n=1000, and n=10000, you'd better have a good explanation for this. More generally, any significant anomaly in your results will need either further investigation or an explanation. One of the purposes of a paper is to demonstrate and convey to readers an understanding of what you're doing. Multiple unexplained anomalies can lead to the impression that this understanding is absent in the paper.
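As a concrete illustration of the kind of sanity check that catches this, the following Python sketch times a stand-in function (my_algorithm here is just a placeholder for whatever you're actually measuring) at several input sizes, so that an O(n) claim backed by flat runtimes stands out immediately:

    import time

    def my_algorithm(data):
        # Placeholder for the algorithm actually being benchmarked.
        return sum(data)

    for n in (100, 1_000, 10_000, 100_000):
        data = list(range(n))
        start = time.perf_counter()
        for _ in range(100):          # repeat to get above clock granularity
            my_algorithm(data)
        per_call = (time.perf_counter() - start) / 100
        print(f"n = {n:>7}: {per_call * 1e6:9.1f} microseconds per call")
    # If the algorithm really is O(n) the per-call time should grow roughly
    # tenfold from row to row; identical times at every n need an explanation.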
Even worse than an unexplained anomaly is a totally unnoticed anomaly. I once read a paper on crypto performance measurement that compared (among other things) the performance of DES with triple DES, where triple DES is DES iterated three times. The execution times for DES and triple DES were identical (!!), but the authors didn't appear to have noticed this (in case you're wondering, the explanation was that they were using a triple DES crypto core that took advantage of the encrypt-decrypt-encrypt (EDE) design to turn triple DES into single DES by setting all three keys to the same value, so that both single and triple DES went through the same crypto core).
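If the mechanism behind that isn't obvious, here's a minimal sketch that uses a toy byte-wise “cipher” (standing in for DES, which it most definitely isn't) to show how an encrypt-decrypt-encrypt construction collapses to a single encryption when all three keys are the same:

    # Toy "cipher": add or subtract a key byte mod 256.  This is not DES, it's
    # just the simplest thing that shows the structure of the EDE collapse.
    def E(key, data):
        return bytes((b + key) % 256 for b in data)

    def D(key, data):
        return bytes((b - key) % 256 for b in data)

    def ede(k1, k2, k3, data):
        # Encrypt-decrypt-encrypt, the construction used by triple DES.
        return E(k3, D(k2, E(k1, data)))

    msg = b"some plaintext"
    # With k1 == k2 == k3 the middle decryption undoes the first encryption,
    # so "triple" encryption degenerates to a single encryption ...
    assert ede(42, 42, 42, msg) == E(42, msg)
    # ... which is why a benchmark keyed this way ends up measuring single
    # DES running at single-DES speed.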
A feature of many network protocols (and, underneath the protocols, network hardware) is the extensive cacheing and buffering performed whenever possible to reduce latency and increase throughput. One conference received a paper whose results showed that the SSL/TLS handshake was about a thousand times faster than the IPsec IKE handshake. Now despite the best efforts of the IPsec design committee, they probably didn't manage to make IKE more than a few times slower than SSL/TLS. What had happened was that SSL can perform a re-handshake using cached data, and what the paper was actually measuring was the overhead of n IPsec handshakes vs. 1 SSL handshake and n-1 cached reconnects.
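One way to avoid this trap is to report the first (uncached) operation separately from the subsequent (cached) ones rather than averaging them together. A minimal sketch, with do_handshake() as a hypothetical stand-in for whatever protocol operation you're actually measuring:

    import time

    def do_handshake():
        # Hypothetical stand-in for the operation being measured (a TLS or
        # IKE handshake, a connection setup, a query, ...).
        time.sleep(0.001)

    timings = []
    for _ in range(20):
        start = time.perf_counter()
        do_handshake()
        timings.append(time.perf_counter() - start)

    # Report the cold case (no cached session, empty caches) separately from
    # the warm case (session resumption, filled caches) instead of lumping
    # them into one average.
    warm = timings[1:]
    print(f"cold: {timings[0] * 1e3:.2f} ms")
    print(f"warm: {sum(warm) / len(warm) * 1e3:.2f} ms (mean of {len(warm)} runs)")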
Another thing that's important when presenting results of measurements is to explain (and demonstrate understanding of) exactly what it is that you're measuring. The two protocols mentioned above, IPsec and SSL, both have numerous variations in their crypto mechanisms that can lead to vastly different performance results. For example the IKE handshake was originally designed by mathematical cryptographers who assumed that all crypto operations had an O(1) cost. As a result, the original IKE public-key handshake design required no less than four (!!) very expensive private-key operations, compared to zero for the SSL client and one for an SSL server.
In areas like this it's important both to describe explicitly which types of operations your results cover, and either to measure the standard, default operating mode or, if you're using variant modes, to make it clear to the reader that the results are atypical and not necessarily representative of normal performance.
(There's a lot more to say about this area, such as techniques for analytical modelling, how to handle outliers, sensitivity analysis, and so on. This is covered in a number of standard texts, of which one of the most extensive is Raj Jain's “The Art of Computer Systems Performance Analysis”. You also need to adjust for things like clock granularity, the overhead of clock sampling, and influence from external factors like other network traffic or CPU load. For example relying on the operating system's ability to generate distinct per-process timing data isn't going to help if another process is thrashing the CPU cache, which no amount of process-granularity timing is going to fix.)
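Two of those adjustments, the clock granularity and the cost of reading the clock itself, are easy to estimate before you measure anything. A minimal Python sketch:

    import time

    # Reported resolution of the timer you're about to use.
    info = time.get_clock_info("perf_counter")
    print(f"claimed resolution: {info.resolution} s")

    # Rough cost of reading the clock: if the operation you're measuring isn't
    # much larger than this, you're mostly benchmarking the timer itself.
    N = 1_000_000
    start = time.perf_counter()
    for _ in range(N):
        time.perf_counter()
    elapsed = time.perf_counter() - start
    print(f"cost per clock read: {elapsed / N * 1e9:.1f} ns")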
In terms of statistical analysis of results, papers tend to fall into two categories: Either very little, or way too much. Very little can work (it's an O(n) algorithm, end of story). The other end of the scale is rather more problematic. The result is a paper that's only comprehensible to another statistician, a variation of the “much data, little analysis” problem mentioned earlier. Applying Noodleheinz's strategy for quantum normalisation to yield a correlation factor of 0.167 with a variance of 7.31 to a pile of data may be a fine demonstration of your grasp of statistics (or perhaps a demonstration of your lack of grasp of statistics), but it may as well be written in Klingon for most of the audience (unless your audience consists of the attendees of a conference on network performance analysis). If you're going to apply statistical techniques, explain what you're doing and explain the significance of the results so that the reader can understand it as well.
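The bar to clear is usually quite low: say what you computed and what it means in plain language. A minimal sketch with made-up measurements, assuming they're roughly normally distributed and using the usual 1.96 multiplier for an approximate 95% confidence interval:

    import math
    import statistics

    # Made-up example measurements (runtimes in milliseconds).
    samples = [10.2, 9.8, 10.5, 10.1, 9.9, 10.4, 10.0, 10.3]

    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    # Approximate 95% confidence interval for the mean (use a t-value rather
    # than 1.96 if the sample is this small and precision matters).
    half_width = 1.96 * stdev / math.sqrt(len(samples))

    print(f"mean {mean:.2f} ms, 95% CI +/- {half_width:.2f} ms")
    # The accompanying text should then say what this means in plain language
    # ("the average runtime is about 10 ms, give or take a small fraction of
    # a millisecond"), not just quote the numbers.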
Some papers are severely let down by the quality of the graphs and tables that they contain. There have been plenty of books written on the graphical representation of data, so I'll just go through the main problem areas here. The four biggest offenders that I've seen are inappropriate use of graphs, inappropriate scales, bad labeling, and bad choice of units.
In a number of cases graphs are used where none are necessary. Graphing white noise is an extreme example of this. Graphs are intended to show a trend in data: if there is no trend, use a bar graph or a table, not a line graph. Similarly, if your graph has fewer than (say) five data points, use a table, or just put the data inline in the text.
When you're presenting information on a graph, don't try and overload it. In almost all cases a graph should present one set of y-variables (and by extension one set of x-variables). In some very special cases you can use two sets of y-variables, but you'd better have a very good reason for this. Any more than this and you're making your graph incomprehensible. In addition, don't try and cram more than five or six curves onto a graph, or you run into the same problem.
Use of inappropriate scales results in data being crammed into one corner on one side of a graph, or all detail for 99% of the data being lost through an attempt to fit a single anomalous spike into the result. Look at your graph. If 90% of it is taken up by empty real estate, you need to consider an alternative way of representing it. Conversely, if 90% of it is taken up by ink, you've got a similar problem. A good rule of thumb is to try to maximise the information-to-ink ratio in your graph. There's something called the three-quarter-high rule which says that the highest point on the graph should be at least three quarters of the horizontal offset of the rightmost point, which is meant to avoid biased representations when using techniques like nonzero origins of the axes, but can lead to odd-looking graphs (I'll leave it up to you whether you want to use this or not).
Another thing to be aware of is that when you're plotting data with a random component, so that if you were to repeat the experiment the results wouldn't be exactly identical, you should really show confidence intervals in your graph. People often don't do this and it doesn't seem to cause any problems, but you should at least mention the fact in the accompanying text if what you're measuring requires it. Obviously if it's a completely deterministic process there's no need to do this, and doing so will just clutter the graph or text.
Bad labeling is another problem, and can take many forms. I've seen text crammed in at odd angles, complex graphs with data represented in near-indistinguishable shades of grey, graphs where it wasn't obvious whether large values were better or worse, graphs where the axes were barely labeled (10 minutes? 10 K/s? 10 MB?), graphs where the units aren't obvious (“Time” and a scale of 1...10, which could be nanoseconds, minutes, days, or years), and everything in between. Remember, graphs are intended to visually represent trends in data. If this isn't immediately obvious from your graph, redesign it or replace it with a table.
Your conference paper, once printed, is going to be in black-and-white. Referring to the red line or the green bar on a graph isn't notably helpful in this case.
Finally, something that besets both graphs and tables is the inappropriate choice of scaling units, so that everything ends up with strings of leading or trailing zeroes. Go with the scaling unit that fits the bulk of your data. Even if “seconds” is a nice basic time unit, if all of your graph labels or table entries are coming out as “0.00000x s” then it's time to consider using microseconds and not seconds.
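Most of these presentation problems (unlabeled axes, missing units, colour-only distinctions that vanish in black-and-white proceedings, awkward scaling) can be avoided with a few lines of deliberate plotting code. A minimal sketch using matplotlib and made-up data:

    import matplotlib.pyplot as plt

    # Made-up measurements: request sizes, mean response times in microseconds
    # (not "0.00012 s"), and a spread value for the error bars.
    sizes_kb = [1, 2, 4, 8, 16, 32]
    baseline = [120, 150, 210, 340, 610, 1150]
    cached   = [100, 125, 170, 260, 450, 820]
    spread_b = [10, 12, 15, 25, 40, 80]
    spread_c = [8, 10, 13, 20, 35, 60]

    fig, ax = plt.subplots()
    # Distinguish the two series by marker and line style, not by colour, so
    # the graph survives black-and-white printing.
    ax.errorbar(sizes_kb, baseline, yerr=spread_b, color="black",
                marker="o", linestyle="-", capsize=3, label="baseline")
    ax.errorbar(sizes_kb, cached, yerr=spread_c, color="black",
                marker="s", linestyle="--", capsize=3, label="with caching")

    ax.set_xlabel("Request size (KB)")      # axis labels with explicit units
    ax.set_ylabel("Response time (microseconds)")
    ax.legend()
    fig.savefig("response_time.png", dpi=300)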
This is a brief overview of some of the bigger problems I've seen in paper submissions. If you want a more detailed analysis of how to use diagrams and tables I'd recommend Stephen Few's “Show Me the Numbers: Designing Tables and Graphs to Enlighten”, which'll give you pretty much all you'll ever need on the topic.
A problem that crops up in some papers is that the authors construct an (often elaborate) demonstration and analysis of a problem without really providing a solution. If the problem is of the grand-challenge type (for example breaking AES encryption) then this is fine. Showing how to do this is enough to get you worldwide attention, and fixing it becomes someone else's problem. On the other hand if the problem is a TCP/IP congestion issue under certain network conditions then simply demonstrating that there's a problem isn't sufficient. Even providing a general solution (“congestion control is Good”) may not be sufficient. If your paper has pointed out that there's a problem, you also need to provide the fix for the problem. The contribution in an AES-breaking paper is to demonstrate that AES is broken, but the contribution in a general paper is the solution to the problem, not the problem itself.
If you're planning to submit a paper to a conference then some time well before you submit (in fact preferably before you start writing), grab a copy of the proceedings from a recent conference and look at the material there. Is this an appropriate sort of conference to submit your work to? If the conference papers are all prose with the odd illustration or graph and your paper consists of 98.5% Greek symbols and 1.5% prepositions and conjunctions, then you probably want to find somewhere else to submit it. If you're planning to submit a printout of PowerPoint slides and no other paper published at that conference has ever consisted of PowerPoint slides (yes, I've seen this happen), you've also chosen the wrong venue.
If you do decide that the conference and your paper are a good match, check your formatting to make sure that it matches that of existing published papers. While the format of the published papers won't always be completely on target, you should be able to spot whether what you're using is way off the mark.
Acronyms can be a particular problem with technical papers. If you're not careful with them you can overwhelm the reader and make the paper very difficult to understand. Scan your paper for acronym-riddled sentences and try and rephrase them in plain English, because someone who isn't intimately familiar with what you're doing (which many readers won't be) probably won't be able to do much with a sentence like that. As a rule of thumb you should expand the acronym the first time that it's used (unless it's a very common one), and again if it reappears after a gap of several pages. In other words if you use an acronym on page 2 of your paper and then again on page 11, you should expand it both times to avoid forcing the reader to backtrack through 9 pages of text to locate the initial definition. In addition, if even the expanded form of the acronym is non-obvious (Illudium Explosive Space Modulator), add a few further words explaining what it is.
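A crude way to check your own paper for this is to scan it for acronyms and see whether each one is ever introduced in the usual “Expanded Form (EF)” style. The following rough sketch does that; the pattern will produce false positives (roman numerals, author initials, and the like), so treat its output as a prompt rather than a verdict:

    import re
    import sys
    from collections import Counter

    text = open(sys.argv[1], encoding="utf-8").read()

    # Anything that looks like an acronym: two or more consecutive capitals.
    acronyms = Counter(re.findall(r"\b[A-Z]{2,}\b", text))

    for acro, count in acronyms.most_common():
        # Does "(ACRO)" appear anywhere, i.e. was it introduced in the
        # "Expanded Form (EF)" style?
        introduced = re.search(r"\(\s*" + re.escape(acro) + r"\s*\)", text)
        status = "introduced" if introduced else "never expanded?"
        print(f"{acro:12} used {count:3} times   {status}")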
Traditionally conferences used anonymous submissions, but more recently there's been a trend towards having author names and affiliations on submissions in order to catch conflicts of interest in which a reviewer is inadvertently assigned a close colleague's paper to review. Check to see which option your chosen conference is using. It's possible in some cases that if the conference requires author identities in order to identify conflict-of-interest issues and you haven't provided this, your paper may be rejected because it can't be safely reviewed.
Try and stay within the required page count. Most reviewers are reasonably lenient and will allow a little bit extra (particularly if it's a good paper), but some will simply stop reading at the published page limit. If you're going to submit a 15-page paper to a conference with a 10-page limit, you'd better have a proof of the Goldbach conjecture or something similar on the 15th page.
Before you send in your paper, try and get someone who's not directly involved in your research to check your work for comprehensibility. If all else fails, use the Mechanical Turk.
Run a spell-checker over your submission before you submit it. Running into a pile of typos and spelling errors in a submitted paper creates a poor impression of the amount of care taken with the paper as a whole.
Make sure that your references have a consistent style, and contain enough information for the reader to be able to locate them without having to Google for bits and pieces of the title and author name. I once refereed a paper in which the author had somehow managed to get every single reference in a different format. Having also read a security standard he'd authored, I now understand why almost no-one has ever managed to implement it in an interoperable manner.
If you're including source code in your paper, make sure that it actually compiles. Getting a submission with code that's obviously not syntactically valid immediately raises warning flags about the paper as a whole. If your code isn't meant to compile, indicate that it's pseudocode or an outline of the actual code, either by explicitly describing it as pseudocode (“the algorithm in a C-like notation is presented in Figure 5”) or by using code elements that are obviously not meant to be real code but merely indicative of what should be there (“forall i = 0...max { array[i] = 0; }”).
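Checking this takes almost no effort: most languages have a compiler or interpreter mode that will syntax-check a file without running it, so run your snippets through it before they go into the paper. As a sketch, here's the equivalent check for Python snippets:

    import sys

    # Syntax-check each snippet destined for the paper without running it;
    # a SyntaxError here is one a reader would hit too.
    for path in sys.argv[1:]:
        source = open(path, encoding="utf-8").read()
        try:
            compile(source, path, "exec")
            print(f"{path}: OK")
        except SyntaxError as err:
            print(f"{path}: line {err.lineno}: {err.msg}")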
There are certain types of problems for which it's possible to arbitrarily tune input parameters or some part of the algorithm in order to obtain results that differ from any current ones. Not really better or worse results, just different. You may have seen examples of this already: it's the sort of paper whose destiny is to end its days as a one-line note in the “Related Work” section of the next paper on the topic that comes along. While this sort of problem is perfect for someone trying to establish a publication record for a tenured position, you'll have to decide whether you're really contributing anything to the field by publishing yet another variation on a problem that's already covered by a dozen other papers containing different variations of the same problem. The publish-or-perish issue is a long-standing problem in academia; one good overview is David Parnas' “Stop the Numbers Game” from the November 2007 Communications of the ACM.
There's a reason why conferences and journals provide detailed feedback from referees rather than just a credit-card-style “accepted/declined”. Someone once submitted a paper to a conference that duplicated (somewhat poorly) work that others had done a few years earlier (see “Existing Material” above). I sent it back with a reference to the earlier work and a longish explanation of how they needed to provide significant new material in order to make it suitable for publication. A few months later they submitted the same paper, unchanged, to another conference. Since they hadn't bothered to update the paper, I didn't bother to update my comments, and sent it back with exactly the same feedback. A few months later they tried again at a third conference, and the same thing happened. I guess eventually they found a conference obscure enough that no referee present recognised it as a poor duplication of existing work, and got it published there.
The point here is that the referees aren't just providing comments because they want to exercise their typing skills. If you get a paper back with comments that there's a serious problem with it then there (usually) really is a serious problem with it that needs to be addressed. Even if it's quite obvious that the referees are idiots and don't understand your work, they are (hopefully) representative of your audience as a whole and if the referees don't understand it then it's likely that the audience won't either (although obviously there are exceptions — a paper on the wonders of X.509 PKI submitted to an OSS conference full of SSH and PGP users, or a paper on SSH's key-continuity key management submitted to an X.509 PKI conference, is going to be a tough sell no matter how well it's written).
In general, though, trying a different set of referees isn't going to change anything, because they'll find the same problem as well. Sure, you can eventually get around it by finding a venue so obscure that no-one will notice, but if you're doing it for the publication credits, are you really going to get much value from being listed in the appendix to the apocrypha of the Proceedings of the Bratislava Philological Society?