Wednesday, February 28, 2007

LaTeX Tutorials

Decent LaTeX lectures here, 1, 2, 3, 4, 5, 6, 7.

Sunday, February 25, 2007

Misspecified Model

Could not find the definition of 'misspecified model' directly. However, there are 2 links talking about it -

the statistical model he used to reach his conclusions is "misspecified." This means, in part, that he did not adequately account for other factors which have an impact on crime rates - and which provide alternative explanations for his findings. When a statistical model is misspecified, it cannot be used as the basis from which to draw conclusions about the impact of policy decisions. One clue that a model is misspecified is if it produces implausible findings.
Now, if you see that this is being said in a debate against firearms, you'll take the "implausible findings" with a pinch (no more) of salt.

Even a misspecified model can be highly accurate in its predictions. Its problems will show up in other ways (e.g. non-random error, which I am guessing is probably the case here; they are probably more likely to be wrong for some cases than for others, e.g. self-funders).
(In point of fact -- most media types tend to make predictions based upon misspecified models. And the biggest problem is not misspecification, but that they do not realize that they are actually using models in the first place. This is one of the many problems that occur when English majors do political science.)
Ah, that too is from a debate! The other side, this time - lovely!

oh and from
@inbook{White2006,
  chapter   = "Approximate nonlinear forecasting methods",
  title     = "Handbook of Economic Forecasting",
  volume    = "1",
  author    = "Halbert White",
  publisher = "Elsevier B.V.",
  year      = 2006
}

When one's goal is to make predictions, the use of a misspecified model is by no means fatal. Our predictions will not be as good as they would be if \mu
(the true function) were accessible.

Wednesday, February 14, 2007

Regression Coefficient

An asymmetric measure of association; a statistic computed as part of a regression analysis.
www.ojp.usdoj.gov/BJA/evaluation/glossary/glossary_r.htm
when the regression line is linear (y = ax + b) the regression coefficient is the constant (a) that represents the rate of change of one variable (y) as a function of changes in the other (x); it is the slope of the regression line
wordnet.princeton.edu/perl/webwn
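The slope definition above can be sketched in a few lines (a hypothetical illustration of the cov/var formula, not taken from either glossary; the data is made up):

```python
def regression_slope(xs, ys):
    """Least-squares slope a in y = a*x + b: cov(x, y) / var(x)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

# Toy series lying exactly on y = 2x + 1: the slope (rate of change
# of y as a function of x) comes out as 2.
print(regression_slope([0, 1, 2, 3], [1, 3, 5, 7]))  # 2.0
```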

Time-Series Analysis. You can use regression analysis to analyze trends that appear to be related to time.
General knowledge, isn't it?

The type of inference

In any kind of choice between techniques, it is important to know the type of inference we want to make. There is no universal solution, because there is no lossless, generalised answer.

Reliability in statistical techniques

Reliability addresses the question of whether repeated application of a procedure will produce similar results.

J. Scott Armstrong and Fred Collopy,
Error measures for generalizing about forecasting methods: Empirical comparisons;
pp. 69-80; International Journal of Forecasting; Vol. 8; Year 1992

From before:

Stability is consistency of results, during the validation phase, with different samples of data.

(Monica Adya and Fred Collopy, J Forecast. 17 481-495 (1998) )

To look at stability in SA, we will define stability as
the consistency of (SA) results within candidate (network) solutions.

This can be justified for SA results in FFNNs because we already know that there is high redundancy in the free parameters of FFNNs. Therefore, the SA technique that shows consistency across all the networks - and also shows a corresponding change in consistency when the data quality changes - promises to be the better technique?

Sunday, January 28, 2007

Some Generic Challenges for Computer Models

Source for this discussion (with thanks).

Data Smoothing and Data Filtering

From Numberwatch UK

Data smoothing is a form of low pass filtering, which means that it blocks out the high frequency components (short wiggles) in order to emphasise the low frequency ones (longer trends).

There are two popular forms; (a) the running mean (or moving average) and (b) the exponentially weighted average. They are both implemented by means of efficient recursive formulae:
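Both smoothers can be stated recursively; a minimal sketch of the two update rules (my own illustration, not Numberwatch's code): the running mean updates as m_t = m_{t-1} + (x_t - x_{t-n}) / n, and the exponentially weighted average as s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

```python
def running_mean(xs, n):
    """Running mean of window n, updated recursively:
    m_t = m_{t-1} + (x_t - x_{t-n}) / n."""
    m = sum(xs[:n]) / n
    out = [m]
    for t in range(n, len(xs)):
        m += (xs[t] - xs[t - n]) / n
        out.append(m)
    return out

def ewma(xs, alpha):
    """Exponentially weighted average, updated recursively:
    s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    s = xs[0]
    out = [s]
    for x in xs[1:]:
        s = alpha * x + (1 - alpha) * s
        out.append(s)
    return out
```

Both are "efficient" in the stated sense: each new smoothed value costs a constant amount of work, regardless of how much history has gone by.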


or from an imaging processing website

Smoothing is a process by which data points are averaged with their neighbours in a series, such as a time series, or image. This (usually) has the effect of blurring the sharp edges in the smoothed data. Smoothing is sometimes referred to as filtering, because smoothing has the effect of suppressing high frequency signal and enhancing low frequency signal. There are many different methods of smoothing...
blah blah - need to rewrite.
What I am doing is using a heuristic technique:
I am only filtering out very low frequency, very high numerical values, to eliminate bias due to large numerical values in the set.
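A sketch of what such a heuristic filter might look like - this is an assumption for illustration, not the exact rule used - capping values that lie more than a chosen number of standard deviations from the mean:

```python
def clip_extremes(xs, factor=3.0):
    """Illustrative heuristic (an assumption, not the exact rule used):
    cap values lying more than `factor` standard deviations from the
    mean, so a few very large values cannot dominate the set."""
    n = len(xs)
    mean = sum(xs) / n
    std = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    lo, hi = mean - factor * std, mean + factor * std
    return [min(max(x, lo), hi) for x in xs]
```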

Friday, January 26, 2007

Curse of Dimensionality

The exponential growth in the complexity of the problem that results from an increase in the number of dimensions
(for example, the dimension of the input vector).

From the wikipedia:

The curse of dimensionality is a term coined by Richard Bellman to describe the problem caused by the exponential increase in volume associated with adding extra dimensions to a (mathematical) space.

The curse of dimensionality is a significant obstacle in machine learning problems that involve learning from few data samples in a high-dimensional feature space.
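A toy illustration of the exponential growth (my own, not from Wikipedia): covering the input space at a fixed resolution of bins per axis needs exponentially many cells as dimensions are added, so a fixed number of training samples spreads ever thinner.

```python
def grid_cells(bins_per_dim, dims):
    """Cells needed to cover an input space at a fixed resolution:
    exponential in the number of dimensions."""
    return bins_per_dim ** dims

# 10 bins per input: 1-D needs 10 cells, 5-D already needs 100,000,
# 10-D needs 10 billion - far more cells than typical sample sizes.
for d in (1, 2, 5, 10):
    print(d, grid_cells(10, d))
```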

Sunday, January 21, 2007

Cross Validation and Split-Sample Method

According to the Neural Net Usenet FAQ -
"What are cross-validation and bootstrapping?"

Cross Validation

In k-fold cross-validation, you divide the data into k subsets of
(approximately) equal size. You train the net k times, each time leaving out one of the subsets from training, but using only the omitted subset to compute whatever error criterion interests you.

If k equals the sample size, this is called "leave-one-out" cross-validation. "Leave-v-out" is a more elaborate and expensive version of cross-validation that involves
leaving out all possible subsets of v cases.
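A minimal sketch of the k-fold procedure as the FAQ describes it (plain Python; `train_fn` and `error_fn` are placeholder callables of my own, not from the FAQ):

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k (approximately) equal-sized folds."""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(data, k, train_fn, error_fn):
    """Train k times, each time leaving one fold out of training and
    computing the error criterion on the omitted fold only; return
    the average error over the k folds."""
    folds = k_fold_indices(len(data), k)
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in range(len(data)) if i not in held]
        test = [data[i] for i in held_out]
        model = train_fn(train)               # fit on the k-1 folds
        errors.append(error_fn(model, test))  # score on the omitted fold
    return sum(errors) / k
```

With k equal to the sample size, each fold holds exactly one case, which is the "leave-one-out" variant mentioned above.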


Split Sample or Hold Out

cross-validation is quite different from the "split-sample" or "hold-out" method that is commonly used for early stopping in NNs. In the split-sample method, only a single subset (the validation set) is used to estimate the generalization error, instead of k different subsets; i.e., there is no "crossing".

While various people have suggested that cross-validation be applied to early stopping, the proper way of doing so is not obvious.
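For reference, one common way early stopping on a single validation set is operationalised - stop once the validation error has failed to improve for a fixed number of epochs. The "patience" rule and naming here are an illustrative assumption, not taken from the FAQ or MATLAB:

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch with the best validation-set error, stopping
    the scan once the error has failed to improve for `patience`
    consecutive epochs."""
    best, best_epoch, waited = float("inf"), 0, 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch, waited = err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch

# Validation error falls, then rises: training should stop at epoch 2.
print(early_stopping_epoch([5.0, 4.0, 3.0, 3.5, 3.6, 3.7]))  # 2
```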


The rest of the document is interesting - it defines and discusses Jackknifing and Bootstrapping.

The MATLAB neural network manual does not use either term (as far as I can see; I was mistaken earlier in thinking that it is called cross validation in MATLAB) - it uses the term "early stopping" for improving generalisation.
(pg 5-55, Neural Network Toolbox User's Guide, Version 4)

PPS: there is more to the term 'cross validation', and the ambiguous way it is used in the literature. I have seen more than one paper using the term in place of early stopping (etc.). Will investigate that later if necessary - else will stick to the definition above.

Thursday, January 18, 2007

Sensitivity Analysis - the confusion between definitions

Sensitivity Analysis (SA) is the study of how the variation in the output of a model (numerical or otherwise) can be apportioned, quantitatively or qualitatively, to different sources of variation
\cite{AppliedEuropeanCommission2006}.

SA has also been defined mathematically as differentiation of output with respect to input \cite{Saltelli2006}. This confusion is apparent in the reviews of SA techniques for neural networks \cite{Olden2004}, \cite{Gevrey2003}. While Olden \cite{Olden2004} refers to the algorithms by various names, including one called SA, Gevrey \cite{Gevrey2003} uses SA as a generic term encompassing all the techniques used to compare the contributions of variables in the neural network model. We will use SA as a generic term encompassing all the various techniques, following the definition given above by \cite{AppliedEuropeanCommission2006}.
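The "differentiation" definition of \cite{Saltelli2006} can be approximated numerically for any black-box model; a minimal finite-difference sketch (illustrative only, not one of the cited algorithms):

```python
def sensitivities(model, x, h=1e-5):
    """Local sensitivity of a (black-box) model to each input,
    estimated by central finite differences: d(output)/d(x_i)."""
    grads = []
    for i in range(len(x)):
        up = list(x); up[i] += h
        dn = list(x); dn[i] -= h
        grads.append((model(up) - model(dn)) / (2 * h))
    return grads

# Toy model y = 2*x0 + 0.1*x1: the output is 20x more sensitive to x0.
toy = lambda x: 2 * x[0] + 0.1 * x[1]
print(sensitivities(toy, [1.0, 1.0]))  # approximately [2.0, 0.1]
```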

Thursday, January 11, 2007

How much is too much in sampling? How much is enough?

Monday, December 25, 2006

Statistical Model Fit Measure

Akaike's Information Criterion
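For reference, AIC = 2k - 2 ln(L-hat), where k is the number of fitted parameters and L-hat the maximised likelihood; for a least-squares fit this reduces (up to an additive constant) to n ln(RSS/n) + 2k. A minimal sketch of the least-squares form:

```python
import math

def aic_least_squares(rss, n, k):
    """AIC for a least-squares model, up to an additive constant:
    n * ln(RSS / n) + 2k, where k is the number of fitted parameters.
    Lower AIC indicates a better fit/complexity trade-off."""
    return n * math.log(rss / n) + 2 * k
```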

Saturday, December 16, 2006

To Print Blog Post

To print blog post, increase the margin as shown in the page settings:

Thursday, December 14, 2006

predicting change or predicting absolute values



We define
forecasting as testing a model on data not utilised to develop the model, and
predicting as testing a model on data which is obtained from observing the system at a future time.

In that case, the above graph shows a prediction of algal biomass one hour ahead of time. The system was developed to emulate the natural function relating (water quality parameters) to (chlorophyll one hour ahead in time) as observed over a period of 388 hrs, which is around 16 days and 4 hrs. It is validated over the next (approximately) 16 days and is tested over the next 32 days. The figure below summarises this.

What we find is that we follow trends well, but the base value is lost - which suggests that we might as well try to predict the 'change' in algal biomass. We could experiment by defining change as a vector value, with a magnitude and a direction.
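A sketch of what "change as a vector value" could look like (illustrative only): decompose the series into (magnitude, direction) pairs and rebuild it from a starting value.

```python
def to_changes(values):
    """Decompose an absolute series into (magnitude, direction) pairs;
    direction is +1 (rise), -1 (fall) or 0 (no change)."""
    changes = []
    for prev, curr in zip(values, values[1:]):
        d = curr - prev
        changes.append((abs(d), (d > 0) - (d < 0)))
    return changes

def from_changes(start, changes):
    """Rebuild absolute values from a start value plus the changes."""
    out = [start]
    for mag, sign in changes:
        out.append(out[-1] + sign * mag)
    return out
```

A model predicting the (magnitude, direction) pairs could then be judged on trend direction separately from base value.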

One reason why this idea has not been pursued yet is that the above graph collapses to almost gibberish when the time gap is increased further - see below.

To be fair, the above graph is not ALWAYS the case; the exact graph changes often enough - in one case it was tracing OK for a while and then turned into a straight line. But in all cases the regression coefficient drops to something like 0.3, and whatever their individual flaws, the graphs all have this in common: they ARE NOT ACCURATE!

Monday, December 11, 2006

Predictions and Forecasts

Typically, the terms are used as synonyms - and it is important to remember that.
Principles of Forecasting, which is linked to by Journal of Forecasting, defines
Forecasting
Estimating in unknown situations. Predicting is a more general term and connotes estimating for any time series, cross-sectional, or longitudinal data. Forecasting is commonly used when discussing time series.

Prediction
A statement regarding future events or events that are unknown to the forecaster. Generally used as synonymous with forecast. Often, but not always used when the task involves forecasting with cross-sectional data (e.g., personnel predictions).

Forecast
A prediction or estimate of an actual value in a future time period (for time series) or for another situation (for cross-sectional data). Forecast, prediction, and prognosis are typically used interchangeably.

However, since there are predictions, and then there are predictions, and then of course there are forecasts - many people use these terms as 'jargon' with subtle differences; usually these are not obvious. Google gives some results, which I will put up when I update.

The definition that is of significance to me was found on the NASA website, where it talks in the context of fluid dynamics here

Prediction.
Prediction is defined as
Use of a CFD model to foretell the state of a physical system under conditions for which the CFD model has not been validated. (AIAA G-077-1998)
Prediction is going beyond the validation database and performing simulations of untested systems.


The rest of the definitions at the NASA site are quite relevant too.

Saturday, December 09, 2006

On plagiarism

...just because a student gets the book on the works cited page, doesn’t mean the student hasn’t plagiarized. A misplaced comma, a forgotten quotation mark, a borrowed phrase, or image, is still plagiarism if it's not clear who the originator of the material is. It’s not intentional plagiarism. It’s not the kind of plagiarism that would get a kid kicked out of class, but it is the kind of plagiarism I have to talk to them about, make sure they are aware of it and fix it.


Quinn discussing plagiarism here.

Friday, December 08, 2006

Journals

Some major journals - as found on Jstor

Find each journal link as well -

In 2003, the Annual Review of Ecology and Systematics became the Annual Review of Ecology, Evolution, and Systematics or here

The next three are publications of Ecological Society of America which also publishes other periodicals:
Ecological Applications
Ecological Monographs
Ecology

The British Ecological Society publishes:
Journal of Ecology
Journal of Animal Ecology - apparently ranked 11th in the world.
and other periodicals

Friday, December 01, 2006

Environmental Modelling and Monitoring - Sites

1. UCL DEPARTMENT OF GEOGRAPHY
ENVIRONMENTAL MONITORING AND MODELLING GROUP

Well designed site.
Has a list of
- publications
- researcher contacts
- PhD projects - ongoing and recently finished.

2. School of GeoSciences, Institute of Geography, Science and Engineering at The University of Edinburgh
This site did not impress.
Firstly, they talk a lot about GIS (Geographic Information Systems) - somehow the context is not clear. Is the term being used as a synonym for all geo-monitoring techniques, or what?
Secondly, the site seems to have been last updated in 2000 (either that, or they haven't published anything since).

Environmental Fellowship Program
Monitoring and Modeling
University of Massachusetts Amherst

This is a work group consisting of people from different departments.



work in progress

Thursday, November 30, 2006

Comparing Models - 2

The question is - is the error being amplified or is the accuracy being amplified?

In high variance systems, it appears that the model ends up emulating different sections of the data set. From the point of view of the statistical measures of accuracy, models with very different ____ qualities may appear equivalent.

Since this situation would always be reflected in higher error in at least one of the three error values, we at least know when the model is definitely incomplete. An objective measure of completeness is not easily found, because we do not have information other than the training data (which is called lack of metadata) to compare it with - an issue resulting from dealing with a largely unknown system.

Regarding Sensitivity Analysis:
if similarities are found between complete models and incomplete models,
can it be concluded
that the similarities are strongly persistent in the entire set?

(Data and random numbers should not give the same kinds of results... does this need any more work to be done?)

Eventually, the results of sensitivity analysis are dependent on
- the raw data,
- the neural network model.