Use of a sequential forward floating search algorithm to detect subgroup features in heterogeneous data sets
Abstract
In pattern recognition problems, a feature is defined as any property that may be useful in differentiating among classes of inputs. A classic example entails using color and shape as features to sort apples and bananas into their respective classes. Both the generation and the assessment of features have become an essential part of modern machine learning and pattern classification endeavors, particularly with the continued growth of feature spaces in recent years. In support of such problems, many feature selection algorithms with distinct benefits and drawbacks have been devised. Despite these advancements in the field, questions remain about feature subset selection and performance in certain scenarios. Many classes, such as some found in the medical domain, are not reliably described by a single defining feature. In these heterogeneous classes, detecting feature synergies is essential to high-performance classification. A confounding issue occurs when sample sizes for heterogeneous classes are small. Small sample size is often unavoidable for various reasons, and such situations may present challenges to commonly employed techniques of feature selection, such as statistical methods. In this dissertation, a sequential forward floating search (SFFS) algorithm is investigated as a feature subset selection technique on heterogeneous data sets of small sample size. The objective is to assess SFFS performance against statistical selection techniques that are common in the literature. To this end, an exemplar data set from the neuroimaging domain is used to represent a typical heterogeneous data set. This data set is analyzed using both traditional approaches and the SFFS algorithm. The findings of this investigation are then used to inform synthetic data simulations to establish the ground-truth performance of SFFS with respect to such heterogeneous data. Results of this investigation show SFFS to be a sensitive technique for feature subset selection in such data sets. In many cases, SFFS is shown to outperform the t-test in individual feature detectability. However, potential outliers and increased variability in small sample size data remain a concern for successful assessment of classification results in specific problems.
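To make the idea of a floating search concrete, the following is a minimal Python sketch of SFFS on a toy problem. It is not the dissertation's implementation. The data set, the criterion function `lookup_accuracy`, and the function name `sffs` are all assumptions introduced here for illustration. The toy class label is the XOR of features 0 and 1, so neither feature is informative on its own, while feature 2 is a moderately good individual predictor. A purely univariate ranking would favor feature 2, whereas the floating backward step lets the search recover the synergistic pair.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

# Toy data (an assumption for illustration, not the dissertation's data).
# The label is XOR of features 0 and 1; feature 2 agrees with the label
# in 6 of the 8 samples.
X = [
    (0, 0, 0), (0, 0, 0), (0, 1, 1), (0, 1, 1),
    (1, 0, 1), (1, 0, 0), (1, 1, 0), (1, 1, 1),
]
y = [0, 0, 1, 1, 1, 1, 0, 0]


def lookup_accuracy(subset: List[int]) -> float:
    """Hypothetical criterion: resubstitution accuracy of a majority-vote
    lookup table built on the selected features (a toy stand-in for a
    cross-validated classifier score)."""
    groups: Dict[tuple, Counter] = defaultdict(Counter)
    for row, label in zip(X, y):
        groups[tuple(row[f] for f in subset)][label] += 1
    return sum(max(c.values()) for c in groups.values()) / len(y)


def sffs(n_features: int,
         score: Callable[[List[int]], float],
         max_size: int) -> Dict[int, Tuple[float, List[int]]]:
    """Return the best (score, subset) found for each subset size."""
    selected: List[int] = []
    best: Dict[int, Tuple[float, List[int]]] = {}
    while len(selected) < max_size:
        # Forward step: add the feature that maximizes the criterion.
        candidates = [f for f in range(n_features) if f not in selected]
        f_add = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [f_add]
        s = score(selected)
        if len(selected) not in best or s > best[len(selected)][0]:
            best[len(selected)] = (s, list(selected))
        # Floating step: drop the least useful feature for as long as the
        # reduced subset beats the best subset of that size seen so far.
        while len(selected) > 2:
            f_rem = max(selected,
                        key=lambda f: score([g for g in selected if g != f]))
            reduced = [g for g in selected if g != f_rem]
            s_red = score(reduced)
            if s_red > best[len(reduced)][0]:
                selected = reduced
                best[len(reduced)] = (s_red, list(reduced))
            else:
                break
    return best


best = sffs(n_features=3, score=lookup_accuracy, max_size=3)
for size in sorted(best):
    print(size, best[size])
```

On this toy data the search first selects feature 2 alone, but after reaching three features the backward step discards it, leaving features 0 and 1 as the best pair. The resubstitution criterion is used only to keep the sketch short. With small samples, as the abstract cautions, a held-out or cross-validated criterion would be the safer choice.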