Nicholas Hynes (nick@1site.co.nz).
Igor Kononenko. "Estimating Attributes: Analysis and Extensions of RELIEF". In Proceedings of European Conference on Machine Learning, pp 171-182, 1994.
Feature selection, dimensionality reduction, data preprocessing, data mining, feature interaction, feature dependency, information gain, RELIEF, RELIEF-F.
M. Dash, H. Liu, J. Yao. "Dimensionality Reduction for Unsupervised Data". In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, pp 523-539, 1997.
Thomas G. Dietterich. "Machine Learning Research: Four Current Directions". In AI Magazine, pp 97-136, v.18 n.4, Winter 1997.
Kenji Kira, Larry A. Rendell. "The Feature Selection Problem: Traditional Methods and a New Algorithm". In Proceedings of the 10th National Conference on Artificial Intelligence, sponsored by the American Association for Artificial Intelligence, pp 129-134, 1992.
Igor Kononenko, E. Šimec, M. Robnik-Šikonja. "Overcoming the Myopia of Inductive Learning Algorithms with RELIEF". In Machine Learning, pp 67-80, v.6, 1994.
The need to process large databases is becoming increasingly common. Contemporary data warehouses can house millions of transactions per day; learners working with full-text databases typically deal with tens of thousands of features; and vision, spoken-word and character recognition problems can involve hundreds of classes and thousands of input features (Dietterich: 1997, Dash et al.: 1997). However, the current machine learning toolset is ill-equipped to deal with contemporary datasets that can have hundreds, or even thousands, of features. Many algorithms exhibit poor complexity with respect to the number of features. Furthermore, when faced with many noisy features, some algorithms take an inordinately long time to converge, or never converge at all. Even when they do converge, conventional algorithms such as C4.5 and backpropagation tend to construct poor classifiers (Dietterich: 1997).
Many of these problems can be alleviated by preprocessing the dataset to remove noisy and low-information attributes. "Feature selection is the problem of choosing a small subset of features that ideally is necessary and sufficient to describe the target concept." (Kira & Rendell: 1992). The benefits of preprocessing to remove noisy, non-predictive features include lower computational cost, faster and more reliable convergence, and more accurate classifiers.
In this paper Kononenko extends an algorithm, called RELIEF, that deals with the problem of feature selection. The original algorithm was developed by Kira & Rendell (Kira & Rendell: 1992) to overcome problems of complexity and feature dependencies. Many algorithms at the time used an information gain approach to feature selection:
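The review does not reproduce the formula, but the standard definition of information gain, given here purely for reference, scores each attribute A by the expected reduction in class entropy:

    Gain(A) = H(C) - \sum_{v \in values(A)} P(A = v) \, H(C \mid A = v),  where  H(C) = -\sum_{c} P(c) \log_2 P(c)

Because each attribute is scored in isolation, an attribute whose relevance only shows up in combination with other attributes receives a near-zero score.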
But there were problems with this approach: it assumed that the features were statistically independent of each other. Datasets such as the 'Parity 3+3' problem yield zero information gain across the dataset, including over the subset of relevant features. To get around this problem, Kira & Rendell used a nearest neighbour algorithm on a two-class classification problem to find the nearest neighbours of each sampled member of the dataset. Their algorithm (RELIEF) finds both the nearest member of the same class (the 'nearest hit') and the nearest member of the opposing class (the 'nearest miss'). RELIEF then calculates the 'distance' between each feature's value in the sampled training instance and in its neighbours. Features which carry more information about the target class will typically be close in value to the nearest neighbour of the same class, and relatively far from the nearest member of the opposite class.
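To make the idea concrete, here is a minimal sketch of the two-class RELIEF weight update in Python. The code is my own illustration, not Kira & Rendell's implementation: the function name, the Manhattan distance and the assumption that attributes are numeric and scaled to [0, 1] are choices made for the sketch.

    import numpy as np

    def relief(X, y, n_samples=100, rng=None):
        """Minimal sketch of two-class RELIEF.

        X : (n_instances, n_features) array, attributes scaled to [0, 1]
        y : (n_instances,) array of class labels (two classes)
        Returns one weight per feature; higher means more relevant.
        """
        rng = np.random.default_rng(rng)
        n, p = X.shape
        w = np.zeros(p)
        for _ in range(n_samples):
            i = rng.integers(n)
            xi, yi = X[i], y[i]
            # Manhattan distance from the sampled instance to every instance.
            dist = np.abs(X - xi).sum(axis=1).astype(float)
            dist[i] = np.inf                        # never pick the instance itself
            same = np.where(y == yi)[0]
            other = np.where(y != yi)[0]
            hit = same[np.argmin(dist[same])]       # nearest hit (same class)
            miss = other[np.argmin(dist[other])]    # nearest miss (opposing class)
            # Relevant features differ little from the hit and a lot from the miss.
            w += np.abs(xi - X[miss]) - np.abs(xi - X[hit])
        return w / n_samples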
Kira & Rendell's research bore some very good results, but there were many restirictions on the original RELIEF algorithm. Kononenko extended RELIEF in a series of stages that dealt, in turn, with problems caused by:
Kononenko created a number of datasets based on the benchmark 'Parity 3+3' dataset. Each dataset in the family groups one, two or three attributes together; collectively the group has a modulo-two sum of 0 (or 1) with an assigned probability, and that probability depends on the class. So, for instance, members of class 1 would tend to have attributes V1, V2 and V3 that sum (modulo two) to 1, while members of class 0 would tend to have values for V1, V2 and V3 that sum to 0. The probabilities differ between groups of attributes, so that each group provides a different amount of information about the actual target class of each training instance.
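A rough sketch of how such a dataset might be generated follows. The group size, the probabilities and the number of irrelevant 'noise' attributes are illustrative placeholders, not the values Kononenko actually used.

    import numpy as np

    def parity_group_dataset(n=1000, group_probs=(0.9, 0.8, 0.7), n_noise=3, rng=None):
        """Each group of three boolean attributes has a modulo-two sum equal to
        the class label with the given probability; remaining attributes are noise."""
        rng = np.random.default_rng(rng)
        y = rng.integers(0, 2, size=n)
        groups = []
        for prob in group_probs:
            g = rng.integers(0, 2, size=(n, 3))
            # Per instance, decide whether the group's parity should match the class.
            target = np.where(rng.random(n) < prob, y, 1 - y)
            # Flip one bit of the group wherever the parity is wrong.
            wrong = (g.sum(axis=1) % 2) != target
            g[wrong, 0] ^= 1
            groups.append(g)
        noise = rng.integers(0, 2, size=(n, n_noise))
        X = np.hstack(groups + [noise])
        return X, y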
Kononenko derived a metric to describe the amount of information provided by each attribute, which he called the 'Intended Information Gain' (IIG). If three attributes together could predict the target class 85% of the time, then their intended information gains would sum to 85% of the gain assigned to an attribute that always correctly predicts the class. This metric is very similar to the 'information gain' metric, but is robust to dependencies between attributes. The intended information gain for each attribute, in each dataset, was calculated by hand, given the probability distribution used to create the test sets.
The various versions of RELIEF were tested against datasets with different numbers of dependent attributes, varying amounts of missing attribute values, up to four different class labels, and a range of class noise (generated by changing the class label of correctly classified items). In each case the intended information gain of each feature was compared with RELIEF's 'weight' for that feature. The comparison was done by plotting IIG against weight for each attribute and measuring the linear correlation coefficient between the two quantities. The results show excellent correlations in almost all of the tests performed. The final test was performed on a 'real world' dataset called 'Primary Tumor', and RELIEF-F performed very well.
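For illustration only, the comparison amounts to something like the following; the attribute values shown are hypothetical and the review does not say what tooling was actually used.

    import numpy as np

    # iig: intended information gain per attribute, computed from the known
    # generating distribution; weights: RELIEF's estimated weight per attribute.
    iig = np.array([0.40, 0.25, 0.20, 0.0, 0.0])         # hypothetical values
    weights = np.array([0.38, 0.27, 0.18, 0.02, -0.01])  # hypothetical values

    # Linear (Pearson) correlation coefficient between the two quantities.
    r = np.corrcoef(iig, weights)[0, 1]
    print(f"correlation between IIG and RELIEF weight: {r:.3f}")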
The paper was well presented, easy to understand, and fleshed out with just the right amount of detail to explain the underlying processes and concepts. Kononenko's approach is scientific, and this is reflected in the transparency with which he presents the algorithms used, the experiments performed and the results obtained.
There were some shortcomings, though, that warrant a mention. Firstly, the complexity of the algorithm was not given proper attention. Kira & Rendell reported that the original algorithm completes in O(pn) time, linear in the number of features times the number of training examples. That analysis assumed that the nearest neighbour search would be of smaller complexity or had been computed before running the algorithm. RELIEF-F's extensions to the original algorithm require that the k nearest neighbours of each sampled instance be found in each of the other classes, and calculating this kind of information is not a trivial task. Given that the algorithm is specifically designed to help the researcher dealing with a cumbersomely large database with many features (which typically requires many training examples to differentiate noise from information), the complexity of the algorithm deserved at least a mention.
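To make the concern concrete, the extra work implied by RELIEF-F looks roughly like this per sampled instance. This is a brute-force sketch of my own, not the paper's implementation; with m sampled instances, n training examples and p features it costs O(mnp) for the neighbour search alone.

    import numpy as np

    def neighbours_per_class(X, y, i, k=10):
        """For sampled instance i, find its k nearest neighbours in every class:
        the k nearest hits in its own class and the k nearest misses in each
        other class. Brute force: O(n * p) distance work per sampled instance,
        on top of the weight updates themselves."""
        xi = X[i]
        dist = np.abs(X - xi).sum(axis=1).astype(float)
        dist[i] = np.inf                      # exclude the instance itself
        result = {}
        for c in np.unique(y):
            idx = np.where(y == c)[0]
            order = idx[np.argsort(dist[idx])]
            result[c] = order[:k]
        return result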
Secondly, the experiments reported very good results, but it should be pointed out that the test sets chosen for this paper were somewhat limited. The parity-style datasets are useful for demonstrating RELIEF's ability to handle dependencies between attributes, but they have a very limited range of attribute values (all attributes are either 0 or 1). It would be nice to see evidence that the extensions to RELIEF work with a wider range of inputs. The only real-world dataset used ('Primary Tumor') has, I believe, only two target classes, which cannot demonstrate RELIEF-F's performance on a multi-class problem. Recall that dealing with multiple classes was one of the primary goals of Kononenko's research.
Thirdly, the mathematical analysis presented in the paper was unclear. Kononenko offers a possible explanation for RELIEF's success by comparing it to the Gini index. However, the approximation relies on two mathematical assumptions that are not justified by the analysis given. Of primary concern were the assumption that a normalized squared probability approximately equals the original probability (see Figure 3), and the unexplained definition of the conditional probability of choosing two instances of the same class with the same attribute value. The latter was further confused by presenting it as a simple conditional probability (see Figure 4).
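For reference (the review does not restate it, and the paper's exact formulation may differ), the standard Gini index of a class distribution is

    Gini(C) = 1 - \sum_{c} P(c)^2

and if the 'normalized squared probability' in question has the usual form p_i^2 / \sum_j p_j^2, it coincides with p_i only when the distribution is uniform; otherwise it inflates the larger probabilities at the expense of the smaller ones, which is one way of seeing why the assumption needed justification.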
The probability in Figure 4 also seems to assume that features are discrete, since Pr(X = x) is zero for continuous values; this was not explored in the paper.
Finally, the use of the correlation coefficient was a nice touch. A good linear correlation between a feature's intended information gain and RELIEF's weight for that feature is a good indication that RELIEF is correctly ranking the features in order of importance. The correlation coefficient allowed Kononenko to summarize a large amount of information concisely. There are, however, circumstances in which it is not appropriate. More detail is needed to understand the results when the correlation coefficient is small: are the deviations from the intended information gain random, or do they follow some non-linear function? Is the basic ranking from important to unimportant features still preserved? The author did provide further information of this kind in the context of the real-world test on the 'Primary Tumor' dataset.
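One way to answer the ranking question, not something the paper does, would be to report a rank correlation alongside the linear one. The snippet below is a hypothetical illustration, reusing the same made-up IIG and weight vectors as before.

    import numpy as np
    from scipy.stats import spearmanr

    iig = np.array([0.40, 0.25, 0.20, 0.0, 0.0])         # hypothetical values
    weights = np.array([0.38, 0.27, 0.18, 0.02, -0.01])  # hypothetical values

    # Spearman's rho depends only on ordering, so it tests whether the ranking
    # from important to unimportant features is preserved even when the
    # relationship between IIG and weight is non-linear.
    rho, p_value = spearmanr(iig, weights)
    print(f"rank correlation between IIG and RELIEF weight: {rho:.3f}")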
These criticisms shouldn't detract from the real value of the paper. Kononenko's extensions to RELIEF have transformed it from a brilliant piece of theoretical research into a commercially applicable utility. The evolution of the RELIEF algorithm is presented in an interesting, informative and scientific manner.