Feature selection
System modeling is an important component of managing and upgrading large distributed systems. We consider a system model to be a mathematical or quantitative representation of how a system behaves. For example, a neural network can be trained on user access data to determine which file types are accessed most often in a storage cluster. Using such models, systems engineers can make concrete observations about a system and use those findings to make adjustments or recommend upgrades. As systems become more complex and the demand for higher performance increases, so does the number of metrics collected from them. These metrics often amount to vast quantities of data and are sometimes released for public analysis, as with the Backblaze dataset.
However, using such datasets can be problematic: as the number of tracked metrics increases, so does the difficulty of choosing the right metrics to model a specific aspect of a system. Feature selection, also called variable or attribute selection, is a set of quantitative techniques that select a subset of relevant features from a larger feature set, which can then be used to construct an accurate model. Common feature selection techniques, such as Principal Component Analysis, reduce the number of features in a dataset to a target number (e.g., from 50 to 4). However, it is difficult to know beforehand exactly how many features to keep, especially when models are sensitive both to which features are used for training and to how many. Modeling a system becomes a balancing act of discovering which features should be used for training, how many of them should be used, and how combinations of features affect accuracy with a specific modeling technique. Unlike in other machine learning-based problems, this interaction is a major hurdle to overcome when maximizing accuracy.
We propose an automated and generalized feature selection method, WinnowML, that can be customized to analyze different neural network architectures and different problems such as classification and regression. WinnowML provides a reduced set of features for a model architecture when the interaction of model type, chosen features, and feature importance impacts modeling accuracy. This reduced set of features considers both how important each feature is to modeling accuracy and how many of those features must be used. WinnowML ranks the features in a dataset using techniques such as Permutation Feature Importance (PFI) or Self-Organizing Maps to identify the features that best model a target value given a neural network architecture provided by the user. Experimentally, we demonstrate that WinnowML outperforms both training on all available features and PCA.
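To make the ranking step more concrete, the following is a minimal sketch of Permutation Feature Importance in general, not WinnowML's actual implementation. It assumes a regression setting scored by mean squared error; the toy data, the `model` function (a stand-in for a trained network), and the helper names `mse` and `permutation_importance` are all hypothetical. A feature's importance is taken to be the average increase in error when that feature's column is shuffled.

```python
import random

def mse(model, X, y):
    """Mean squared error of the model's predictions on rows X against targets y."""
    return sum((model(row) - t) ** 2 for row, t in zip(X, y)) / len(y)

def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """For each feature, return the mean increase in MSE when its column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, X, y)
    importances = []
    for j in range(len(X[0])):
        increases = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            increases.append(mse(model, X_perm, y) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Hypothetical toy data: the target depends strongly on feature 0,
# weakly on feature 1, and not at all on feature 2.
data_rng = random.Random(42)
X = [[data_rng.uniform(-1, 1) for _ in range(3)] for _ in range(200)]
y = [5.0 * r[0] + 0.5 * r[1] for r in X]

def model(row):
    # Stand-in for a trained neural network (assumption for illustration only).
    return 5.0 * row[0] + 0.5 * row[1]

importances = permutation_importance(model, X, y)
# Rank feature indices from most to least important.
ranking = sorted(range(len(importances)), key=lambda j: importances[j], reverse=True)
print(ranking, importances)
```

Because the importance is measured through the model's own error, the ranking depends on the architecture being analyzed, which is the interaction the paragraph above describes.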
Status
In the past six months, we have built the WinnowML system, which combines wrapper and filter techniques to identify a subgroup of features that accurately models the measured value the user wants to predict. We have applied it to a regression problem, modeling the throughput of the CERN EOS system, and to a classification problem, classifying disk failures using the Backblaze dataset. We compared the accuracy of WinnowML with that of a widely used feature selection method, Principal Component Analysis (PCA). We found that when the feature group selected by WinnowML is used, the prediction error is 2X smaller than when PCA is used with the same number of features. PCA is most commonly used to reduce dimensionality in high-dimensional objects such as image and video data. Such data contains a lot of noise, which impacts the prediction accuracy of the model. Log data already has less noise, and therefore applying PCA reduces the information the model gains from those features.
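As a generic illustration of how a filter stage and a wrapper stage can be combined, consider the sketch below. How WinnowML itself combines the two is not described here, so every detail is an assumption: the filter ranks features by absolute Pearson correlation with the target, the wrapper evaluates the top-k ranked features with a small k-nearest-neighbor regressor (standing in for the user's network), and the toy data and helper names are hypothetical.

```python
import math
import random

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (v - mb) for x, v in zip(a, b))
    sa = math.sqrt(sum((x - ma) ** 2 for x in a))
    sb = math.sqrt(sum((v - mb) ** 2 for v in b))
    return cov / (sa * sb) if sa and sb else 0.0

def knn_predict(train_X, train_y, row, cols, k=5):
    """Average target of the k nearest training rows, measured only on columns `cols`."""
    dists = sorted(
        (sum((tr[c] - row[c]) ** 2 for c in cols), t)
        for tr, t in zip(train_X, train_y)
    )
    return sum(t for _, t in dists[:k]) / k

def validation_mse(train_X, train_y, val_X, val_y, cols):
    """Validation error of the stand-in model restricted to columns `cols`."""
    errs = [(knn_predict(train_X, train_y, r, cols) - t) ** 2
            for r, t in zip(val_X, val_y)]
    return sum(errs) / len(errs)

# Hypothetical toy data: features 0 and 1 drive the target, 2 and 3 are noise.
rng = random.Random(0)
X = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(300)]
y = [3.0 * r[0] + 2.0 * r[1] for r in X]
train_X, val_X = X[:200], X[200:]
train_y, val_y = y[:200], y[200:]

# Filter stage: rank features by a model-independent statistic.
scores = [abs(pearson([r[j] for r in train_X], train_y)) for j in range(4)]
ranking = sorted(range(4), key=lambda j: scores[j], reverse=True)

# Wrapper stage: evaluate the model on the top-k features for each k
# and keep the k with the lowest validation error.
errors = {k: validation_mse(train_X, train_y, val_X, val_y, ranking[:k])
          for k in range(1, 5)}
best_k = min(errors, key=errors.get)
selected = ranking[:best_k]
print(selected, errors)
```

The filter stage is cheap and narrows the search, while the wrapper stage decides how many of the ranked features to keep by measuring accuracy with the model itself.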
For the next six months, we will work on finishing the paper and submitting it to Sigmetrics’20. After the paper is submitted, we will apply WinnowML to clustering problems. Additionally, we will compare our grouping algorithm with other existing algorithms. Our grouping algorithm must avoid generating too many groups so that WinnowML can run in a reasonable amount of time.