Sunday, January 28, 2007

Some Generic Challenges for Computer Models

Source for this discussion (with thanks).

Data Smoothing and Data Filtering

From Numberwatch UK

Data smoothing is a form of low-pass filtering, which means that it blocks out the high-frequency components (short wiggles) in order to emphasize the low-frequency ones (longer trends).

There are two popular forms: (a) the running mean (or moving average) and (b) the exponentially weighted average. Both are implemented by means of efficient recursive formulae. For a running mean over a window of $N$ points,

$$\bar{x}_n = \bar{x}_{n-1} + \frac{x_n - x_{n-N}}{N},$$

and for an exponentially weighted average with smoothing factor $\alpha$,

$$s_n = \alpha x_n + (1 - \alpha)\, s_{n-1}.$$
Or, from an image processing website:

Smoothing is a process by which data points are averaged with their neighbours in a series, such as a time series, or image. This (usually) has the effect of blurring the sharp edges in the smoothed data. Smoothing is sometimes referred to as filtering, because smoothing has the effect of suppressing high frequency signal and enhancing low frequency signal. There are many different methods of smoothing...
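A minimal sketch of the two recursions in Python (the function names and the NumPy dependency are my own choices, not from either source):

```python
import numpy as np

def running_mean(x, N):
    """Running (moving) average over a window of N points,
    updated recursively: add the newest point, drop the oldest."""
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)     # undefined until a full window exists
    s = 0.0
    for n, v in enumerate(x):
        s += v
        if n >= N:
            s -= x[n - N]
        if n >= N - 1:
            out[n] = s / N
    return out

def exp_weighted_average(x, alpha):
    """Exponentially weighted average: s_n = alpha*x_n + (1 - alpha)*s_{n-1}."""
    x = np.asarray(x, dtype=float)
    out = np.empty_like(x)
    s = x[0]                          # seed with the first observation
    for n, v in enumerate(x):
        s = alpha * v + (1.0 - alpha) * s
        out[n] = s
    return out
```

A small alpha (or a large N) means heavier smoothing: more of the short wiggles are suppressed.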
What I am doing is a heuristic version of this: I am filtering out only the very low-frequency (rarely occurring), very high numerical values, to eliminate the bias that such large values introduce into the set.
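The exact cutoff rule isn't spelled out above, so this is only a hypothetical sketch, assuming a simple percentile threshold stands in for "very high":

```python
import numpy as np

def drop_rare_extremes(x, pct=99.0):
    """Hypothetical sketch: discard the rare, very large values
    (above the pct-th percentile) so they cannot bias the series."""
    x = np.asarray(x, dtype=float)
    cutoff = np.percentile(x, pct)
    return x[x <= cutoff]
```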

Friday, January 26, 2007

Curse of Dimensionality

Curse of Dimensionality: the exponential growth in the complexity of a problem that results from an increase in the number of dimensions (for example, the dimension of the input vector).

From Wikipedia:

The curse of dimensionality is a term coined by Richard Bellman to describe the problem caused by the exponential increase in volume associated with adding extra dimensions to a (mathematical) space.

The curse of dimensionality is a significant obstacle in machine learning problems that involve learning from few data samples in a high-dimensional feature space.
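A toy illustration of that exponential growth in volume (the ball-volume formula is standard; the script itself is my own): the unit ball occupies a vanishing fraction of its enclosing cube as the dimension grows, so uniformly scattered samples rapidly become sparse.

```python
import math

# Volume of the unit-radius ball versus the enclosing cube [-1, 1]^d.
for d in (1, 2, 3, 10, 20):
    ball = math.pi ** (d / 2) / math.gamma(d / 2 + 1)
    cube = 2.0 ** d
    print(f"d={d:2d}  ball/cube volume ratio = {ball / cube:.2e}")
# The ratio collapses toward zero: by d=20 it is already ~2.5e-08,
# so almost all of the cube's volume sits out in its "corners".
```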

Sunday, January 21, 2007

Cross Validation and Split-Sample Method

According to the neural nets Usenet FAQ, "What are cross-validation and bootstrapping?":

Cross Validation

In k-fold cross-validation, you divide the data into k subsets of (approximately) equal size. You train the net k times, each time leaving out one of the subsets from training, but using only the omitted subset to compute whatever error criterion interests you.

If k equals the sample size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a more elaborate and expensive version of cross-validation that involves leaving out all possible subsets of v cases.
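A minimal sketch of the k-fold mechanics in Python (NumPy, the function name, and the fold count are my own choices; the model fit itself is left as a placeholder):

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k
    (approximately) equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)

folds = k_fold_indices(n_samples=100, k=5)
for i, test_idx in enumerate(folds):
    # Train on everything except fold i; score only on fold i.
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    print(f"fold {i}: train on {len(train_idx)} cases, test on {len(test_idx)}")
```

With k equal to the sample size this reduces to leave-one-out.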


Split-Sample or Hold-Out

Cross-validation is quite different from the "split-sample" or "hold-out" method that is commonly used for early stopping in NNs. In the split-sample method, only a single subset (the validation set) is used to estimate the generalization error, instead of k different subsets; i.e., there is no "crossing".
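A toy sketch of how the single validation set drives early stopping (the error curves below are synthetic, made up purely for illustration):

```python
import numpy as np

epochs = np.arange(50)
train_err = np.exp(-epochs / 10.0)                     # keeps falling
val_err = train_err + 0.00008 * (epochs - 25.0) ** 2   # turns back up

best, best_epoch, patience, waited = np.inf, 0, 5, 0
for epoch, e in enumerate(val_err):
    if e < best:
        best, best_epoch, waited = e, epoch, 0   # new best on validation set
    else:
        waited += 1
        if waited >= patience:                   # no improvement for a while
            break
print(f"stopped at epoch {epoch}; best validation error at epoch {best_epoch}")
```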

While various people have suggested that cross-validation be applied to early stopping, the proper way of doing so is not obvious.


The rest of the document is interesting - it defines and discusses Jackknifing and Bootstrapping.

The MATLAB neural network manual does not use either term (as far as I can see; I was mistaken earlier in thinking that this is called cross-validation in MATLAB). It uses the term "early stopping" for improving generalisation (p. 5-55, Neural Network Toolbox User's Guide, Version 4).

PPS: there is more to the term "cross-validation" and the ambiguous way it is used in the literature. I have seen more than one paper use the term in place of early stopping. I will investigate that later if necessary; otherwise I will stick to the definition above.

Thursday, January 18, 2007

Sensitivity Analysis - the confusion between definitions

Sensitivity Analysis (SA) is the study of how the variation in the output of a model (numerical or otherwise) can be apportioned, quantitatively or qualitatively, to different sources of variation \cite{AppliedEuropeanCommission2006}.

SA has also been defined mathematically as differentiation of the output with respect to the input \cite{Saltelli2006}. This confusion is apparent in the reviews of SA techniques for neural networks \cite{Olden2004}, \cite{Gevrey2003}. While Olden \cite{Olden2004} refers to the algorithms by various names, including one called SA, Gevrey \cite{Gevrey2003} uses SA as a generic term encompassing all the techniques used to compare the contributions of variables in a neural network model. We will use SA as a generic term encompassing all of these techniques, following the definition above from \cite{AppliedEuropeanCommission2006}.
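To make the "differentiation" definition concrete, here is a toy finite-difference sketch (the model function f and all names here are hypothetical, not from the cited papers):

```python
import numpy as np

def finite_difference_sensitivity(f, x, eps=1e-6):
    """Approximate the sensitivities df/dx_i at the point x:
    one number per input, per the differentiation definition of SA."""
    x = np.asarray(x, dtype=float)
    base = f(x)
    grads = np.empty_like(x)
    for i in range(x.size):
        xp = x.copy()
        xp[i] += eps
        grads[i] = (f(xp) - base) / eps
    return grads

# Toy model: the output is far more sensitive to the second input.
f = lambda x: x[0] + 10.0 * x[1] ** 2
print(finite_difference_sensitivity(f, [1.0, 2.0]))   # approx. [1., 40.]
```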

Thursday, January 11, 2007

How much is too much in sampling? How much is enough?