Regression analysis is any statistical method in which the mean of one or more random variables is predicted conditional on other (measured) random variables. Common forms include linear regression, logistic regression, and Poisson regression; regression is also a central task in supervised learning. Regression analysis is more than curve fitting (choosing a curve that best fits given data points): it involves fitting a model with both deterministic and stochastic components. The deterministic component is called the predictor and the stochastic component is called the error term.
Sometimes there are only two variables. One of them, called X, can be regarded as constant, i.e., non-random, because it can be measured without substantial error and its values can even be chosen at will. For this reason it is called the independent or controlled variable. The other variable, called Y, is a random variable called the dependent variable, because its values depend on X. In regression we are interested in how Y varies with X.
Typical examples are the dependence of the blood pressure Y on the age X of a person, or the dependence of the weight Y of certain animals on their daily ration of food X. This dependence is called the regression of Y on X.
See also: multivariate normal distribution, important publications in regression analysis.
Regression is usually posed as an optimization problem: we seek the solution for which the error is at a minimum. The most common error measure is the sum of squared errors (least squares), which corresponds to a Gaussian likelihood of generating the observed data given the (hidden) random variable. In a certain sense, least squares is an optimal estimator: see the Gauss-Markov theorem.
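The link between least squares and a Gaussian likelihood can be made explicit. The following is a sketch under assumptions that are introduced here only for illustration: n observations $(x_i, y_i)$, a parametrized model $f_\theta$, and independent errors with a common variance $\sigma^2$.

```latex
\begin{aligned}
y_i &= f_\theta(x_i) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2) \ \text{independently} \\
\log L(\theta) &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\bigl(y_i - f_\theta(x_i)\bigr)^2 \\
\arg\max_\theta \, \log L(\theta) &= \arg\min_\theta \, \sum_{i=1}^{n}\bigl(y_i - f_\theta(x_i)\bigr)^2
\end{aligned}
```

Because the first term of the log-likelihood does not depend on $\theta$, maximizing the likelihood and minimizing the sum of squared errors select the same parameters.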
The optimization problem in regression is typically solved by algorithms such as the gradient descent algorithm, the Gauss-Newton algorithm, and the Levenberg-Marquardt algorithm. Probabilistic algorithms such as RANSAC can be used to find a good fit for a sample set, given a parametrized model of the curve function.
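As a minimal sketch of the simplest of these algorithms, the following pure-Python code runs gradient descent on the mean squared error of a straight-line model. The function name `gradient_descent_line`, the step size, the number of steps and the data are all assumptions chosen for illustration, not values from the text.

```python
def gradient_descent_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ a + b*x by gradient descent on the mean squared error."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current fit.
        res = [a + b * x - y for x, y in zip(xs, ys)]
        # Gradient of (1/n) * sum(res^2) with respect to a and b.
        grad_a = 2.0 / n * sum(res)
        grad_b = 2.0 / n * sum(r * x for r, x in zip(res, xs))
        # Move against the gradient.
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b


# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
a, b = gradient_descent_line(xs, ys)
print(a, b)  # close to 1 and 2
```

The step size has to be small enough for the iteration to converge; Gauss-Newton and Levenberg-Marquardt use curvature information to choose steps and usually need far fewer iterations.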
Regression can be expressed as a maximum likelihood method of estimating the parameters of a model. However, for small amounts of data, this estimate can have high variance. Bayesian methods can also be used to estimate regression models. A prior is placed over the parameters, which incorporates everything known about the parameters. (For example, if one parameter is known to be non-negative, a prior distribution supported on non-negative values can be assigned to it.) A posterior distribution is then obtained for the parameter vector. Bayesian methods have the advantages that they use all the information that is available and that they are exact, not asymptotic, and thus work well for small data sets. Some practitioners use maximum a posteriori (MAP) methods, a simpler approach than full Bayesian analysis, in which the parameters chosen are those that maximize the posterior. MAP methods are related to Occam's Razor: there is a preference for simplicity among a family of regression models (curves) just as there is a preference for simplicity among competing theories.
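The difference between the maximum likelihood and MAP estimates can be sketched on a deliberately tiny model. The code below assumes a slope-only model y = w·x + noise with Gaussian noise of known variance and a zero-mean Gaussian prior on w; the model, the function names and the data are illustrative assumptions. Under these assumptions the posterior is maximized at w = Σxy / (Σx² + noise_var / prior_var).

```python
def mle_slope(xs, ys):
    """Maximum likelihood (least-squares) slope for y = w*x + noise."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / sxx


def map_slope(xs, ys, noise_var, prior_var):
    """MAP slope under a zero-mean Gaussian prior with variance prior_var.

    Setting the derivative of the log-posterior to zero gives
    w * (sum(x^2) + noise_var / prior_var) = sum(x*y).
    """
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + noise_var / prior_var)


# Two hypothetical observations: a very small data set.
xs = [1.0, 2.0]
ys = [2.5, 3.5]
w_mle = mle_slope(xs, ys)
w_map = map_slope(xs, ys, noise_var=1.0, prior_var=1.0)
print(w_mle, w_map)
```

With so little data the prior pulls the MAP estimate noticeably toward zero; as more observations are added, Σx² grows and the two estimates approach each other.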
General formulation
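We want to predict the values of a random variable Y conditional on other random variables called factors. Let $p \in \mathbb{N}^*$ be the number of factors used for this prediction.
$(\Omega, \mathcal{A}, P)$ will denote a probability space and (Γ,S) will be a measurable space such that (Γ, +, ·) is an ordered field (e.g. $\Gamma = \mathbb{R}$ and $S = \mathcal{B}$, with $\mathcal{B}$ the Borel σ-algebra on $\mathbb{R}$). We can now define the dependent variable $Y : (\Omega, \mathcal{A}) \to (\Gamma, S)$ and the factors $X_i : (\Omega, \mathcal{A}) \to (\Gamma, S)$ for $i \in \{1, \dots, p\}$. Now, let F be a set of functions defined on Ω with values in Γ such that $Y \in F$ and $X_i \in F$ for all $i$, and let d be a metric such that (F,d) is a complete metric space.
We are looking for a measurable function $f : (\Gamma^p, S^{\otimes p}) \to (\Gamma, S)$ such that $d(Y, f(X_1, \dots, X_p))$ is minimal.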
Linear regression
Linear regression is the most common case in practice. We suppose that the function f depends linearly on the covariates so we are really just looking for the right coefficients.
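Let Θ be a set of coefficients and let $X_1, \dots, X_p$ denote the p factors. The hypothesis of the linear regression is:
$$\exists (\theta_0, \dots, \theta_p) \in \Theta^{p+1}, \quad \mathbb{E}(Y \mid X_1, \dots, X_p) = \theta_0 + \sum_{j=1}^{p} \theta_j X_j$$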
and the metric used is:
$$\forall f,g\in F, \quad d(f,g) = \mathbb{E}[(f-g)^2]$$
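We therefore want to minimize $\mathbb{E}[(Y - f(X_1, \dots, X_p))^2]$, which means that
$$f(X_1, \dots, X_p) = \mathbb{E}(Y \mid X_1, \dots, X_p).$$
Hence, we only need to find the coefficients $\theta_0, \dots, \theta_p$.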
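The step from minimizing the expected squared error to the conditional expectation can be spelled out. In the sketch below, X abbreviates the collection of factors and $m(X)$ denotes the conditional expectation of Y given the factors; both abbreviations are notation assumed here for brevity.

```latex
\begin{aligned}
\mathbb{E}\bigl[(Y - f(X))^2\bigr]
  &= \mathbb{E}\bigl[(Y - m(X))^2\bigr]
   + 2\,\mathbb{E}\bigl[(Y - m(X))\,(m(X) - f(X))\bigr]
   + \mathbb{E}\bigl[(m(X) - f(X))^2\bigr] \\
  &= \mathbb{E}\bigl[(Y - m(X))^2\bigr]
   + \mathbb{E}\bigl[(m(X) - f(X))^2\bigr]
\end{aligned}
```

The cross term vanishes because, conditionally on X, the factor $m(X) - f(X)$ is fixed and $Y - m(X)$ has mean zero. The first remaining term does not depend on f, and the second is non-negative and equals zero when $f = m$, so the conditional expectation is a minimizer.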
In order to solve this problem efficiently, several methods exist. The most common one is the Gauss-Markov method, but it requires extra hypotheses.
The Gauss-Markov linear model
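Under assumptions that are met relatively often, there exists an optimal solution to the linear regression problem. These assumptions (called the Gauss-Markov hypotheses) are:
We use the linear regression model, observed on n samples, so that $Y = (y_1, \dots, y_n)$ and each factor $X_j = (x_{1j}, \dots, x_{nj})$. We then define the error $\varepsilon = Y - X\theta$, where X is the matrix of factor values and θ the vector of coefficients. The components of the error are assumed independent, where $\varepsilon \sim \mathcal{N}(0, \sigma^2 I)$ and I is the identity matrix.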
Least-squares estimation of the coefficients
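We want an estimate $\hat{\theta}$ of $\theta$. Under the Gauss-Markov assumptions, there exists an optimal solution. We can see the unknown function $\mathbb{E}(Y \mid X_1, \dots, X_p)$ as the projection of Y on the subspace of F generated by $1, X_1, \dots, X_p$. Let $\hat{Y} = X\hat{\theta}$, where X is the matrix whose columns are $1, X_1, \dots, X_p$.
If we define the scalar product by $\langle f, g \rangle = \mathbb{E}[fg]$ and write $\|\cdot\|$ for the induced norm, the metric d can be written $d(f,g) = \|f - g\|^2$. Minimizing this norm is equivalent to projecting Y orthogonally on the subspace induced by the columns of X. We have
$$X^{\mathsf T}(Y - X\hat{\theta}) = 0$$
because the projection is orthogonal; therefore, an estimate of the unknown coefficients is
$$\hat{\theta} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} Y.$$
This is called the least-squares estimate of the linear regression coefficients.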
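The least-squares estimate solves the normal equations (XᵀX)θ = XᵀY. A minimal pure-Python sketch follows; the helper names `solve` and `least_squares` and the data are hypothetical, and a leading column of ones is assumed so that the first coefficient plays the role of the intercept.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def least_squares(X, y):
    """Return theta solving the normal equations (X^T X) theta = X^T y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)


# Hypothetical data generated exactly from y = 1 + 2*x1 - x2.
factors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
X = [[1.0, x1, x2] for x1, x2 in factors]  # leading 1.0 is the intercept column
y = [1.0 + 2.0 * x1 - x2 for x1, x2 in factors]
theta = least_squares(X, y)
print(theta)  # close to [1, 2, -1]
```

Forming XᵀX explicitly is the most direct transcription of the formula, but it can be numerically ill-conditioned; library routines typically rely on QR or singular value decompositions instead.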
How good is this estimate? Under the Gauss-Markov assumptions, the Gauss-Markov theorem states that the least-squares estimates of the linear regression coefficients are the best we can do. More precisely, of all linear unbiased estimators of the linear regression coefficients, the least-squares ones have the smallest variance, i.e. they are the most efficient.
However attractive, this method lacks robustness: departures from the normality assumption will corrupt the results. Nevertheless, it is the most widely used method in practice and, because of the central limit theorem, the Gauss-Markov assumptions are often approximately met for large values of n (the number of observations).
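The lack of robustness can be illustrated with a single gross error, one simple way in which data can depart from normality. The sketch below uses the closed-form least-squares line for one factor; the function name `ols_line` and the data are assumptions made up for this illustration.

```python
def ols_line(xs, ys):
    """Closed-form least-squares intercept and slope for y ~ a + b*x."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
clean = [0.0, 2.0, 4.0, 6.0, 8.0]      # exactly y = 2x
spoiled = [0.0, 2.0, 4.0, 6.0, 28.0]   # last value replaced by a gross error

_, slope_clean = ols_line(xs, clean)
_, slope_spoiled = ols_line(xs, spoiled)
print(slope_clean, slope_spoiled)  # 2.0 6.0
```

One corrupted observation out of five triples the fitted slope, because squaring the residuals gives large errors a disproportionate weight.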
If the Gauss-Markov hypotheses are not met, a variety of techniques are available.
Example
The simplest example of regression is the one-dimensional case. We are given a vector of x values and another vector of y values, and we are attempting to find a function f such that $f(x_i) \approx y_i$.
- let $x = (x_1, \dots, x_n)$ and $y = (y_1, \dots, y_n)$ denote these vectors.
Let's assume that our solution is in the family of functions defined by a 3rd degree Fourier expansion written in the form:
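- $f(x) = a_0/2 + a_1\cos(x) + b_1\sin(x) + a_2\cos(2x) + b_2\sin(2x) + a_3\cos(3x) + b_3\sin(3x)$
where $a_i, b_i$ are real numbers. This problem can be represented in matrix notation as:
$$f(x) = \begin{bmatrix} 1/2 & \cos(x) & \sin(x) & \cos(2x) & \sin(2x) & \cos(3x) & \sin(3x) \end{bmatrix} \begin{bmatrix} a_0 \\ a_1 \\ b_1 \\ a_2 \\ b_2 \\ a_3 \\ b_3 \end{bmatrix}$$
Filling this form in with our given values yields a problem in the form Xw = y, where row i of X is the row vector above evaluated at $x_i$, w is the coefficient vector, and y is the vector of observed y values.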
This problem can now be posed as an optimization problem to find the minimum sum of squared errors.
Figure: 3rd degree Fourier function
Solving this with least squares yields the coefficient vector w; thus the 3rd-degree Fourier function that fits the data best is given by:
- $f(x) = 4.25\cos(x) - 6.13\cos(2x) + 2.88\cos(3x)$.
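The whole procedure of this example can be sketched in pure Python. The data vectors of the example are not reproduced here, so the code below uses synthetic data as a stated assumption: 12 equally spaced points sampled from the fitted function quoted above. The helper names `design_row`, `solve` and `fit_fourier3` are hypothetical.

```python
import math


def design_row(x):
    """Basis functions of the 3rd-degree Fourier expansion evaluated at x."""
    return [0.5, math.cos(x), math.sin(x),
            math.cos(2 * x), math.sin(2 * x),
            math.cos(3 * x), math.sin(3 * x)]


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def fit_fourier3(xs, ys):
    """Least-squares coefficients (a0, a1, b1, a2, b2, a3, b3) via the normal equations."""
    X = [design_row(x) for x in xs]  # one row of X per observation
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * y for r, y in zip(X, ys)) for i in range(k)]
    return solve(XtX, Xty)


# Synthetic data (an assumption, not the example's data): 12 equally spaced
# points sampled without noise from the fitted function quoted above.
xs = [2 * math.pi * i / 12 for i in range(12)]
ys = [4.25 * math.cos(x) - 6.13 * math.cos(2 * x) + 2.88 * math.cos(3 * x) for x in xs]
w = fit_fourier3(xs, ys)
print([round(c, 2) for c in w])
```

Because the synthetic data are noise-free and lie in the span of the basis functions, the fit recovers the generating coefficients up to rounding error, with the constant and sine coefficients close to zero.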