
Open Source Software - Regression


Lesson Description


Lesson #1519 - Regression Multicollinearity


Collinearity is a linear association between two explanatory variables. Two variables are perfectly collinear if there is an exact linear relationship between them. For example, $X_1$ and $X_2$ are perfectly collinear if there exist parameters $\lambda_0$ and $\lambda_1$ such that, for all observations $i$, we have

$$X_{2i} = \lambda_0 + \lambda_1 X_{1i}.$$
Multicollinearity refers to a situation in which two or more explanatory variables in a multiple regression model are highly linearly related. We have perfect multicollinearity if, for example as in the equation above, the correlation between two independent variables is equal to 1 or −1. In practice, we rarely face perfect multicollinearity in a data set. More commonly, the issue of multicollinearity arises when there is an approximate linear relationship among two or more independent variables.
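
As a quick numerical illustration (a minimal sketch in Python with NumPy; the variable names and the choice of $\lambda_0 = 2$ and $\lambda_1 = 3$ are ours, not part of the lesson), the sample correlation between two perfectly collinear variables is exactly 1, or −1 when $\lambda_1$ is negative:

    import numpy as np

    # x2 is an exact linear function of x1 (lambda_0 = 2, lambda_1 = 3),
    # so x1 and x2 are perfectly collinear.
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = 2.0 + 3.0 * x1

    # The correlation coefficient between perfectly collinear variables
    # is exactly 1 (it would be -1 if lambda_1 were negative).
    print(np.corrcoef(x1, x2)[0, 1])  # 1.0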

A definition of multicollinearity.

Mathematically, a set of variables is perfectly multicollinear if there exist one or more exact linear relationships among some of the variables. For example, we may have

$$\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} = 0$$

holding for all observations $i$, where the $\lambda_k$ are constants and $X_{ki}$ is the $i$th observation on the $k$th explanatory variable. We can explore one issue caused by multicollinearity by examining the process of attempting to obtain estimates for the parameters of the multiple regression equation

$$Y_i = \beta_0 + \beta_1 X_{1i} + \cdots + \beta_k X_{ki} + \varepsilon_i.$$

The ordinary least squares estimates involve inverting the matrix

$$X^{T} X$$

where

$$X = \begin{bmatrix} 1 & X_{11} & \cdots & X_{k1} \\ \vdots & \vdots & & \vdots \\ 1 & X_{1N} & \cdots & X_{kN} \end{bmatrix}$$

is an $N \times (k+1)$ matrix, where $N$ is the number of observations and $k$ is the number of explanatory variables (with $N$ required to be greater than or equal to $k+1$). If there is an exact linear relationship (perfect multicollinearity) among the independent variables, then at least one of the columns of $X$ is a linear combination of the others, and so the rank of $X$ (and therefore of $X^{T} X$) is less than $k+1$, and the matrix $X^{T} X$ cannot be inverted.
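
To see this rank deficiency concretely, here is a minimal sketch (the data are made up, with the second regressor deliberately constructed as an exact linear combination of the intercept column and the first regressor):

    import numpy as np

    # Hypothetical data: X2 is an exact linear combination of the intercept
    # column and X1 (X2 = 2 + 3 * X1), i.e. perfect multicollinearity.
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = 2.0 + 3.0 * x1

    # Design matrix with an intercept column: N x (k + 1), here 5 x 3.
    X = np.column_stack([np.ones_like(x1), x1, x2])
    XtX = X.T @ X

    # The rank of X (and of X^T X) is 2, less than k + 1 = 3, so X^T X is
    # singular and the OLS formula (X^T X)^(-1) X^T y cannot be evaluated.
    print(np.linalg.matrix_rank(X))    # 2
    print(np.linalg.matrix_rank(XtX))  # 2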

Perfect multicollinearity is fairly common when working with raw datasets, which frequently contain redundant information. Once redundancies are identified and removed, however, nearly multicollinear variables often remain due to correlations inherent in the system being studied. In such a case, instead of the equation above holding exactly, we have that equation in modified form with an error term $v_i$:

$$\lambda_0 + \lambda_1 X_{1i} + \lambda_2 X_{2i} + \cdots + \lambda_k X_{ki} + v_i = 0.$$

In this case, there is no exact linear relationship among the variables, but the $X_j$ variables are nearly perfectly multicollinear if the variance of $v_i$ is small for some set of values of the $\lambda$'s. When this happens, the matrix $X^{T} X$ has an inverse, but it is ill-conditioned, so a given computer algorithm may or may not be able to compute an approximate inverse; and if it does, the resulting computed inverse may be highly sensitive to slight variations in the data (due to magnified effects of either rounding error or slight variations in the sampled data points) and so may be very inaccurate or very sample-dependent.
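
The nearly multicollinear case can be sketched the same way (again an illustrative example of ours: the small noise term plays the role of $v_i$, and its standard deviation is an arbitrary choice). Here $X^{T} X$ is invertible, but its very large condition number means the computed estimates are highly sensitive to tiny changes in the data:

    import numpy as np

    rng = np.random.default_rng(0)

    # x2 is almost, but not exactly, a linear function of x1; the small
    # noise term plays the role of v_i in the modified equation above.
    x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    x2 = 2.0 + 3.0 * x1 + rng.normal(scale=1e-4, size=x1.shape)

    X = np.column_stack([np.ones_like(x1), x1, x2])
    XtX = X.T @ X

    # X^T X now has full rank and an inverse, but it is ill-conditioned.
    print(np.linalg.matrix_rank(XtX))  # 3
    print(np.linalg.cond(XtX))         # very large, signalling near-singularity

    # OLS estimates computed from two nearly identical response vectors can
    # differ substantially, showing how sample-dependent the results become.
    y = 1.0 + 2.0 * x1 + 0.5 * x2
    beta_a = np.linalg.solve(XtX, X.T @ y)
    beta_b = np.linalg.solve(XtX, X.T @ (y + rng.normal(scale=1e-3, size=y.shape)))
    print(beta_a)
    print(beta_b)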