Data Mining and Trend Analysis
From Capsil Wiki
With the large amounts of data obtained from sensors, and the availability of storage space, the role of data-mining in wireless sensor networks is to provide techniques for the analysis of data, identifying underlying rules and features, discovering patterns and trends. Usually, data-mining tends to work from the data up, and the most efficient techniques are normally those developed with an orientation towards large volumes of data. IBM has identified two types of models that can be used to mine information: 1- The verification model, which takes a hypothesis from the user and tests the validity of it against the data and 2- the discovery model, which requires the system to automatically discover frequently occurring patterns and trends without intervention or guidance from the user. Data-mining is a very important topic in BSN data analysis and it encompasses many areas that can be expanded. However, in this section, we will focus on several issues that are of importance to data processing in BSN. These are:
- Data pre-processing
- Data visualisation
- Descriptive Modelling and Clustering
- Predictive Modelling
- Pattern Mining
Contents |
Data-Preprocessing
Since sensor data can be noisy, inconsistent or incomplete, preprocessing data is needed as a first step before data-analysis. The tasks in data-preprocessing include:
- Data integration: involves combining data at different sources and providing the user with a unified view.
- Data cleaning: filling in missing values, smoothing noisy data and removing outliers.
- Data transformation: aggregation of data and normalisation.
- Data reduction and discretisation: reducing the number of attributes by removing irrelevant attributes or dimensionality reduction (see Sensor Fusion page). Other methods include aggregation, sampling and clustering.
Data Visualisation
In many cases, visualising data before analysis offers the user some guidance on existing patterns in a dataset as well as relationships between variables. However, high data dimensionality and large amounts of data could prevent the user from visualising all the data available in a database. Thus, methods of reducing the data (mentioned above) could be used before visualisation. Summary variables can be created to summarize important data, examples are: Range, variance, mean, median, maximum, minimum, covariance and correlation matrices. Single variable plots showing histograms and scatter plots including several variables can be used to observe data relationships and outliers. Projection methods (detailed in Sensor Fusion) can be used to project multivariate data into 2 dimensions. Projection pursuit is a method that involves finding the most 'interesting' possible projections in multivariate data. Projections that deviate from a Normal distribution are considered to be more interesting. As each projection is found, the data is reduced by removing the component along that projection. The process is repeated to find new projections. Interesting links are:
-NASA's scientific visualisation studio
-National Visualisation and Analytics centre
Descriptive Modelling and Clustering
Descriptive models generally provide a summary of the data which could lead to the observation of the main trends that exist in a dataset. In this section we will cover some of the widely used techniques in this area:
Modeling probability distributions
In many applications, data can be described by using probabilistic distributions. Some of the methods used are:
- Parametric models: Include the use of a single Gaussian or a mixture of Gaussians, among other methods of modelling probabilistic distributions (Bernoulli, Poisson). Gaussian Mixture models(GMM) aim at modelling several components (Gaussians) weighted by a parameter. Advantages of using GMMs are that they have well studied statistical inference techniques, offer a flexibility in using component distributions, and allow obtaining a density estimation for each cluster.
- Non-parametric models: Include Kernel density estimation, but some can be computationally expensive to compute on large datasets.
- Graphical models: Probabilistic graphical models provide a compact representation of joint probability distributions. The nodes represent random variables and arcs represent dependence between these variables. Graphical models can be directed or undirected. The latter is used more to express causality between variables. Some tutorials on graphical models are available on: Murphy: A Brief Introduction to Graphical Models and Bayesian Networks and Jordan: An introduction to variational models for graphical models. Applications to activity recognition are summarised in the following web page.
Clustering
Clustering is the grouping of data into subsets or clusters, where the items (data) in each cluster share common traits. Clustering is a widely used method in data mining, some of the clustering methods include:
- k-means: ASsigns each point to the cluster whose centroid is nearest. An interactive demo is available on the following webpage. Code and examples are available in the Netlab toolbox.
- Fuzzy c-means clustering: A method that allows a data point to belong to more than one cluster. A tutorial is available on: Fuzzy c-means clustering.
- QT (Quality threshold) clustering: QT clustering looks for clusters of data points such that each point in the cluster is within a specified distance (based on a user-defined distance metric) of every other point in the cluster. This technique is widely used for gene clustering. More info is available on: the GeneSpring Tutorial.
- Self Organising maps: The self-organising map consists of a regular unit of neurons that attempts to represent all observations with optimal accuracy using a restricted set of models. At the same time the models become ordered on the grid so that similar models are close to each other and dissimilar models far from each other. A webpage on SOMs is available which provides demos, software and research papers.
- Multi-dimensional scaling: This visualisation technique starts with a matrix of item-item similarities then assigns a location of each item in a low dimensional space which can be used for graphing or 3D visualisation. More information and the relationship to factor analysis is available on the following webpage.
- Hierarchical clustering: In hierarchical clustering, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. Hierarchical Clustering is subdivided into agglomerative methods, which proceed by series of fusions of the n objects into groups, and divisive methods, which separate n objects successively into finer groupings. Some more information on hierarchical clustering is available on the following webpages: How to explain hierarchical clustering, a tutorial on clustering algorithms. An application to energy efficiency in wireless sensor networks is available on Bandyopadhyay and Coyle.
Predictive Modelling
Predictive modelling involves both regression and classification. Regression models continuous valued functions whereas classification aims at predicting class labels (both discrete and nominal).
Regression methods include: Neural networks, Kernel machines and Decision trees. Classification includes model based techniques that are aimed at using a model based on training data for testing new data points, and discriminative techniques that aim at separating between classes without specifically having to model each class. Model-based classification techniques include Gaussian mixture models (GMMs)- mentioned above, Naive Bayes and other Bayesian techniques. Temporal models aim at learning models of the data by taking into consideration its temporal patterns. These include Hidden Markov Models, Conditional Random fields and Dynamic Bayesian Networks. A summary of discriminative models for classification is available on the following web page.
Pattern Mining
Pattern mining is the task of finding existing patterns in data. In this context patterns often means association rules between variables. A survey on pattern mining with a focus on association rules was compiled by Geothals. A summary of the current status and future directions is given in Jiawei et al.
While there is significant variability in human activity that can be observed from wearable sensor data, there is typically a repeating structure over the long term. To this end, frequent pattern mining constitutes a promising technique for the discovery of rich routine information and has been successfully applied to a number of applications. FP-Stream is an example of a frequent pattern mining algorithm that incorporates temporal information by mining a data stream at different granularities. It constructs a prefix-tree of patterns, where each node specifies support for a pattern over time windows. Other fast algorithms have been investigated for finding frequent patterns. Closet+, for example, is an algorithm which can quickly find closed frequent patterns using optimized search strategies, and provides advantages over existing mining algorithms in terms of runtime and memory usage. Based on FP-stream and Closet+, a data structure called the routine tree designed to describe behaviour patterns is proposed in Ali et al. . The routine tree is a variable resolution mapping of time periods within a day to frequent patterns of activity levels.