Georgia Frantzeskou, Stephen MacDonell, Efstathios Stamatatos, Stefanos
Gritzalis, Examining the significance of high-level programming features in
source code author classification, Journal of Systems and Software, Volume 81,
Issue 3, Selected Papers from the 2006 Brazilian Symposia on Databases and on
Software Engineering, March 2008, Pages 447-460.

DOI: 10.1016/j.jss.2007.03.004 

Abstract: 

The use of Source Code Author Profiles (SCAP) represents a new, highly accurate
approach to source code authorship identification that is, unlike previous
methods, language independent. While accuracy is clearly a crucial requirement
of any author identification method, in cases of litigation regarding
authorship, plagiarism, and so on, there is also a need to know why it is
claimed that a piece of code is written by a particular author. What is it about
that piece of code that suggests a particular author? What features in the code
make one author more likely than another? In this study, we describe a means of
identifying the high-level features that contribute to source code authorship
identification, using the SCAP method as a tool. A variety of features are
considered for Java and Common Lisp and the importance of each feature in
determining authorship is measured through a sequence of experiments in which we
remove one feature at a time. The results show that, for these programs,
comments, layout features and package-related naming influence classification
accuracy, whereas user-defined naming, an obvious programmer-related feature,
does not appear to influence accuracy. A comparison is also made between the
relative feature contributions in programs written in the two languages.
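
The remove-one-feature experiment described above can be illustrated with a
minimal sketch. The abstract does not say how SCAP builds or compares author
profiles, so the sketch assumes each profile is the set of most frequent
character n-grams and that similarity is the size of the overlap between two
profiles. All function names, the values n=3 and size=50, and the regex-based
comment stripper are hypothetical choices made for illustration, not the
authors' implementation.

```python
import re
from collections import Counter


def ngram_profile(text, n=3, size=50):
    """Return the set of the `size` most frequent character n-grams in `text`.

    Assumption: a profile is a set of frequent n-grams; n and size are
    illustrative values, not taken from the paper.
    """
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {gram for gram, _ in counts.most_common(size)}


def profile_overlap(profile_a, profile_b):
    """Similarity score: number of n-grams the two profiles share."""
    return len(profile_a & profile_b)


def strip_comments(source):
    """Hypothetical ablation step: remove /* ... */ and // comments.

    Simplification: comment markers inside string literals are not handled.
    """
    source = re.sub(r"/\*.*?\*/", "", source, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", "", source)


def classify(sample, author_sources, transform=lambda s: s):
    """Return the author whose profile overlaps most with the sample's.

    `transform` is applied to every source before profiling, so passing
    `strip_comments` simulates removing the comment feature.
    """
    sample_profile = ngram_profile(transform(sample))
    scores = {
        author: profile_overlap(sample_profile, ngram_profile(transform(src)))
        for author, src in author_sources.items()
    }
    return max(scores, key=scores.get)
```

Running classify over a test set once with the default transform and once with
transform=strip_comments, then comparing the two accuracies, gives a rough
indication of how much comments contribute under these assumptions. Other
features (layout, naming) would each need their own transform.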

Keywords: Authorship; Source code; Program features; Fraud