Transformation of the observed data is a very common practice when a troubling degree of near multicollinearity is detected in a linear regression model. However, these transformations may themselves affect the detection of the problem, so they should not be applied systematically. In this paper we analyze the transformation of the data when applying the R package mcvis, showing that it detects essential near multicollinearity only when the studentise transformation is performed.
Given the model \(\mathbf{Y} = \mathbf{X} \cdot \boldsymbol{\beta} + \mathbf{u}\) for \(n\) observations and \(p\) independent variables, where \(\mathbf{Y}\) is the vector of observations of the dependent variable, \(\mathbf{X} = [\mathbf{1} \ \mathbf{X}_{2} \dots \mathbf{X}_{p}]\) is the matrix whose columns contain the observations of the independent variables (the first column being a vector of ones representing the intercept) and \(\mathbf{u}\) is a spherical random disturbance, multicollinearity refers to the existence of linear relationships between the independent variables of the model. It is well known that a high degree of multicollinearity can affect the analysis of a linear regression model; in this case, the multicollinearity is said to be troubling (Novales 1988; Ramanathan 2002; Wooldridge 2008; Gujarati 2010). It is also worth noting the distinction made, for example, by Marquardt (1980) or Snee and Marquardt (1984) between essential multicollinearity (a near-linear relationship between at least two independent variables, excluding the intercept) and non-essential multicollinearity (a near-linear relationship between the intercept and at least one of the remaining independent variables).
Note that the detection process is key to determining which tool is best suited to mitigating the problem: for example, the ridge regression of Hoerl and Kennard (1970), the LASSO of Tibshirani (1996) or the elastic net of Zou and Hastie (2005), among others.
The most commonly applied measures to detect whether the degree of multicollinearity is troubling are the variance inflation factor (VIF) and the condition number (CN).
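For reference, in standard notation (not quoted verbatim from the references cited in this paper), these two measures are usually defined as

\[
VIF(i) = \frac{1}{1 - R_{i}^{2}}, \quad i = 2,\dots,p, \qquad CN = \sqrt{\frac{\mu_{\max}}{\mu_{\min}}},
\]

where \(R_{i}^{2}\) is the coefficient of determination of the auxiliary regression of \(\mathbf{X}_{i}\) on the remaining independent variables, and \(\mu_{\max}\) and \(\mu_{\min}\) are the largest and smallest eigenvalues of \(\mathbf{X}^{t} \mathbf{X}\).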
Another set of measures for detecting troubling multicollinearity comprises the matrix of simple linear correlations between the independent variables, \(\mathbf{R} = \left( cor(\mathbf{X}_{l}, \mathbf{X}_{m}) \right)_{l,m=2,\dots,p}\), and its determinant, \(| \mathbf{R} |\). García et al. (2019) show that values of the coefficient of simple correlation between the independent variables higher than \(\sqrt{0.9}\), and values of the determinant lower than \(0.1013 + 0.00008626 \cdot n - 0.01384 \cdot p\), indicate a troubling degree of multicollinearity (see Salmerón et al. (2021a) or Salmerón et al. (2021b) for more details). The first value differs markedly from the threshold of 0.7 normally proposed to indicate a problem of near collinearity (see, for example, Halkos and Tsilika (2018)).
The coefficient of variation (CV) is also useful: values lower than 0.1002506 indicate the existence of troubling multicollinearity (see Salmerón et al. (2020b) for more details).
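As an illustration, the two rules above can be checked with the multiColl package used later in this paper. The following is a minimal sketch; the helper function and its name are our own, not part of the package:

```r
# Minimal sketch (our own helper, not part of multiColl): checks the
# determinant rule of Garcia et al. (2019) and the CV rule of
# Salmeron et al. (2020b). X is the design matrix with the intercept
# in its first column, as expected by multiColl.
library(multiColl)

check_thresholds <- function(X) {
  n <- nrow(X)
  p <- ncol(X)
  detR <- RdetR(X)$`Correlation matrix's determinant`
  det_limit <- 0.1013 + 0.00008626 * n - 0.01384 * p
  list(
    troubling_determinant = detR < det_limit,  # essential multicollinearity
    troubling_CV = CVs(X) < 0.1002506          # non-essential multicollinearity
  )
}
```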
García et al. (2016) and Salmerón et al. (2020a) showed that the VIF is invariant to changes of origin and scale; that is, the model \(\mathbf{Y} = \mathbf{X} \cdot \boldsymbol{\beta} + \mathbf{u}\) and the model \(\mathbf{Y} = \mathbf{x} \cdot \boldsymbol{\beta} + \mathbf{u}\) present the same VIFs, where \(\mathbf{x} = [\mathbf{x}_{1} \ \mathbf{x}_{2} \dots \mathbf{x}_{p}]\) with \(\mathbf{x}_{i} = \frac{\mathbf{X}_{i}-a_{i}}{b_{i}}\) for \(a_{i} \in \mathbb{R}\), \(b_{i}>0\) and \(i=1,\dots,p\). Note that if \(a_{i} = \overline{\mathbf{X}}_{i}\), then \(\mathbf{x}_{1}\) is a vector of zeros, i.e. the intercept disappears from the model. In contrast, Salmerón et al. (2018b) showed that the CN is not invariant to changes of origin and scale, meaning that the two previous models present different CNs. This implies that the models \(\mathbf{Y} = \mathbf{X} \cdot \boldsymbol{\beta} + \mathbf{u}\) and \(\mathbf{Y} = \mathbf{x} \cdot \boldsymbol{\beta} + \mathbf{u}\) present different eigenvalues.
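The invariance of the VIF can be checked empirically. A minimal sketch with simulated data, using arbitrary values of \(a_{i}\) and \(b_{i}\):

```r
# Empirical check of the invariance of the VIF to origin and scale
# changes: any a_i in R and b_i > 0 applied to the regressors leaves
# the VIFs unchanged.
library(multiColl)

set.seed(1)
n <- 100
X2 <- rnorm(n, 10, 10)
X3 <- X2 + rnorm(n)                     # induce essential collinearity
X <- cbind(rep(1, n), X2, X3)           # intercept in the first column
x <- cbind(rep(1, n), (X2 - 5) / 2, (X3 - 1) / 10)  # arbitrary a_i, b_i

VIF(X)   # identical to ...
VIF(x)   # ... the VIFs after the origin/scale change
```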
Consequently, transforming the data in a linear regression model may affect the detection of the multicollinearity problem, depending on the diagnostic measure used. Furthermore, this sensitivity to scaling arises because certain transformations (such as centering the data) mitigate the multicollinearity problem, so the dependence (or not) on scaling simply reflects the capacity of each measure to detect this reduction in the degree of near multicollinearity.
Therefore, when transforming the data in a linear regression model and analyzing whether the degree of multicollinearity is of concern, it is necessary to be clear about whether the measure used for detection is affected by the transformation and whether it is capable of detecting the two types of near multicollinearity mentioned above (essential and non-essential). With this aim, in this paper we analyze the MC index recently presented in Lin et al. (2020).
First, this paper briefly reviews the MC index. To show that the MC index depends on the transformation of the data and is unable to detect non-essential multicollinearity, we present two simulations with a troubling degree of essential and non-essential multicollinearity, respectively, and a third simulation where the degree of multicollinearity is not troubling. For all these cases, we calculate the multicollinearity diagnostics discussed in the introduction together with the MC index. Two empirical applications recently presented in the scientific literature are also analyzed. After a discussion of the results, we propose a scatter plot of the VIF against the CV to detect which variables cause the troubling degree of multicollinearity and which kind of multicollinearity (essential or non-essential) exists in the model. Finally, the main conclusions of the paper are summarized.
The MC index presented in Lin et al. (2020) is based on the relation between the VIFs and the inverses of the eigenvalues of the matrix \(\mathbf{Z}^{t} \mathbf{Z}\), where \(\mathbf{Z}\) is the standardized version of \(\mathbf{X}\). That is, \(\mathbf{Z}\) is the matrix \(\mathbf{x}\) mentioned in the introduction with \(a_{i} = \overline{\mathbf{X}}_{i}\) and \(b_{i} = \sqrt{n \cdot var \left( \mathbf{X}_{i} \right)}\) for all \(i\), where \(var \left( \mathbf{X}_{i} \right)\) is the variance of \(\mathbf{X}_{i}\). More precisely, taking into account that the main diagonal of \(\left( \mathbf{Z}^{t} \mathbf{Z} \right)^{-1}\) contains the VIFs (for standardized data), it is possible to establish the following relation:
\[\begin{equation} \left( \begin{array}{c} VIF(2) \\ \vdots \\ VIF(p) \end{array} \right) = \mathbf{A} \cdot \left( \begin{array}{c} \frac{1}{\mu_{2}} \\ \vdots \\ \frac{1}{\mu_{p}} \end{array} \right), \end{equation}\]
where \(\mathbf{A}\) is a matrix that depends on the eigenvalues of \(\mathbf{Z}^{t} \mathbf{Z}\) and \(\mu_{p}\) is the smallest eigenvalue of this matrix.
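This relation can be illustrated numerically. In the following sketch, \(\mathbf{A}\) is the matrix of elementwise squared eigenvectors of \(\mathbf{Z}^{t} \mathbf{Z}\), which follows from the spectral decomposition \(\left( \mathbf{Z}^{t} \mathbf{Z} \right)^{-1} = \mathbf{V} \, diag(1/\mu) \, \mathbf{V}^{t}\):

```r
# Numerical illustration of the relation above: for standardised data Z
# (centered, unit-length columns, so that Z'Z is the correlation matrix),
# the diagonal of (Z'Z)^{-1} contains the VIFs, and each VIF is a
# combination of the inverse eigenvalues with weights A = V^2.
set.seed(1)
n <- 100
X <- cbind(rnorm(n, 10, 10), rnorm(n, 10, 10))
X <- cbind(X, X[, 2] + rnorm(n))               # collinear third column

Zc <- scale(X, center = TRUE, scale = FALSE)
Z  <- sweep(Zc, 2, sqrt(colSums(Zc^2)), "/")   # b_i = sqrt(n * var(X_i))

diag(solve(t(Z) %*% Z))                        # VIFs on the diagonal
e <- eigen(t(Z) %*% Z)
rowSums(e$vectors^2 %*% diag(1 / e$values))    # same values via A %*% (1/mu)
```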
From this relationship, by resampling and regressing \(1/\mu_{p}\) on the VIFs, Lin et al. (2020) proposed using the t-statistics of this regression to conclude which variable contributes most to the relationship, thus identifying the variables responsible for the degree of near multicollinearity in the model. These authors defined the MC index as an index between zero and one, with larger values indicating a greater contribution of variable \(i\) in explaining the observed severity of multicollinearity.
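The index is implemented in the mcvis package. A minimal usage sketch, assuming the interface of the CRAN version, where mcvis() takes the matrix of independent variables without the intercept:

```r
# Minimal usage sketch of the mcvis package (interface assumed from the
# CRAN version). The transformation option is left at its default,
# studentise, which is the setting analyzed in this paper.
library(mcvis)

set.seed(1)
X <- matrix(rnorm(100 * 3, 10, 10), ncol = 3)
X[, 3] <- X[, 2] + rnorm(100)   # essential collinearity between columns 2 and 3
mc_out <- mcvis(X = X)          # reports the MC index of each variable
mc_out
```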
Taking into account that the calculation of the MC index is based on the relation established between the VIFs and the inverse of the smallest eigenvalue, it seems logical to expect that transforming the data may affect the calculation of this measure. Indeed, regardless of whether the data are transformed or not, the MC index is not capable of detecting non-essential multicollinearity; it is only expected to be useful in the case of essential multicollinearity.
Therefore, when Lin et al. (2020) comment that "there are different views on what centering technique is most appropriate in regression […] To facilitate this diversity of opinion, in the software implementation of mcvis, we allow the option of passing matrices with different centering techniques as input. The role of scaling is not the focus of our work as our framework does not rely on any specific scaling method", far from facilitating the use of their proposal, they consider scenarios for which the MC index is not designed, since both their theoretical development and step 2 of their method require the standardization of the data. However, as shown in this paper, the MC index is capable of detecting multicollinearity of the essential type only when used with its default option, studentise.
In this section, different versions of the matrix \(\mathbf{X} = [\mathbf{1} \ \mathbf{X}_{2} \ \mathbf{X}_{3} \ \mathbf{X}_{4}]\) are simulated. The results for the correlation matrix, its determinant, the condition number (with and without the intercept), the variance inflation factors and the coefficients of variation are obtained using the multiColl package (see Salmerón et al. (2021a) and Salmerón et al. (2021b) for more details).
In all cases, the values of the MC index are calculated for each set of simulated data considering the two alternative transformations of the data: Euclidean (centered by the mean and divided by the Euclidean length) and studentise (centered by the mean and divided by the standard deviation). In each case (simulation and kind of transformation), the calculation of the MC index was repeated 100 times.
In this case, 100 observations are generated according to \(\mathbf{X}_{i} \sim N(10, 100)\), \(i=2,3\), and \(\mathbf{X}_{4} = \mathbf{X}_{3} - \mathbf{p}\), where \(\mathbf{p} \sim N(1, 0.5)\). The goal of this simulation is to ensure that the variables \(\mathbf{X}_{3}\) and \(\mathbf{X}_{4}\) are highly correlated (essential multicollinearity). This is confirmed by the following results for the correlation matrix, its determinant, the CN with and without the intercept (with the corresponding percentage increase), the VIFs and the coefficients of variation of the different variables.
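The data can be generated as in the following sketch, reading \(N(\mu, \sigma^{2})\) as a normal distribution parameterised by mean and variance (the seed is our own, chosen only for reproducibility of the sketch); the diagnostics shown next are then obtained with the multiColl functions RdetR, CNs, VIF and CVs:

```r
# Sketch of the data generation for Simulation 1. N(mu, sigma^2) is read
# as mean/variance, so rnorm() receives the standard deviation.
set.seed(2021)                          # arbitrary seed
n <- 100
X2 <- rnorm(n, mean = 10, sd = 10)
X3 <- rnorm(n, mean = 10, sd = 10)
p  <- rnorm(n, mean = 1, sd = sqrt(0.5))
X4 <- X3 - p                            # X4 = X3 - p: essential collinearity
X_S1 <- cbind(rep(1, n), X2, X3, X4)    # intercept in the first column
```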
```
RdetR(X_S1)

$`Correlation matrix`
            X2_S1      X3_S1       X4_S1
X2_S1  1.00000000 -0.0875466 -0.09110579
X3_S1 -0.08754660  1.0000000  0.99881845
X4_S1 -0.09110579  0.9988184  1.00000000

$`Correlation matrix's determinant`
[1] 0.002330192

CNs(X_S1)

$`Condition Number without intercept`
[1] 38.67204

$`Condition Number with intercept`
[1] 66.94135

$`Increase (in percentage)`
[1] 42.22997

VIF(X_S1)

     X2_S1      X3_S1      X4_S1
  1.013525 425.587184 425.860062

CVs(X_S1)

[1] 1.239795 1.052896 1.166335
```
Table 1 shows three random repetitions, together with the average value and the standard deviation over the 100 repetitions. As expected in the case of essential multicollinearity, the average values of Simulation 1 show that (especially with the studentise transformation) the MC index correctly identifies the variables \(\mathbf{X}_{3}\) and \(\mathbf{X}_{4}\) as causing the troubling degree of essential multicollinearity. However, in some cases the intercept or the variable \(\mathbf{X}_{2}\) is identified as relevant in the existing linear relations when the Euclidean transformation is performed. This behavior is not observed when the studentise transformation is performed. This seems to indicate that the MC index depends on the transformation, with the studentise transformation being the most appropriate.
Table 1: MC index for Simulation 1 (essential multicollinearity) under the Euclidean and studentise transformations: three random repetitions, average and standard deviation over the 100 repetitions.

| | X2 | X3 | X4 |
|---|---|---|---|
| Euclidean - Random 1 | 0.2846628 | 0.3670736 | 0.3482636 |
| Euclidean - Random 2 | 0.1466484 | 0.4444270 | 0.4089246 |
| Euclidean - Random 3 | 0.4026253 | 0.3140131 | 0.2833616 |
| Euclidean - Average | 0.2505292 | 0.3899482 | 0.3595226 |
| Euclidean - Standard Deviation | 0.1075164 | 0.0550333 | 0.0526817 |
| Studentise - Random 1 | 0.0000338 | 0.4942761 | 0.5056901 |
| Studentise - Random 2 | 0.0001294 | 0.4901233 | 0.5097473 |
| Studentise - Random 3 | 0.0000307 | 0.4950536 | 0.5049157 |
| Studentise - Average | 0.0000519 | 0.4944290 | 0.5055191 |
| Studentise - Standard Deviation | 0.0000264 | 0.0023510 | 0.0023528 |
In this case, 100 observations are generated according to \(\mathbf{X}_{i} \sim N(10, 100)\), \(i=2,3\), and \(\mathbf{X}_{4} \sim N(10, 0.0001)\). The goal of this simulation is to ensure that the variable \(\mathbf{X}_{4}\) is highly correlated with the intercept (non-essential multicollinearity). This is confirmed by the following results, taking into account that Salmerón et al. (2020b) showed that a value of the CV lower than 0.1002506 indicates a troubling degree of non-essential multicollinearity.
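Continuing the session of the Simulation 1 sketch (so that n, X2 and X3 are reused, which is consistent with the identical correlations reported below), only the fourth column changes:

```r
# Simulation 2: X4 is almost constant (variance 0.0001), so it is nearly
# collinear with the intercept (non-essential multicollinearity).
X4_S2 <- rnorm(n, mean = 10, sd = sqrt(0.0001))
X_S2  <- cbind(rep(1, n), X2, X3, X4_S2)
```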
```
RdetR(X_S2)

$`Correlation matrix`
           X2_S2       X3_S2      X4_S2
X2_S2  1.0000000 -0.08754660 0.06676070
X3_S2 -0.0875466  1.00000000 0.09445547
X4_S2  0.0667607  0.09445547 1.00000000

$`Correlation matrix's determinant`
[1] 0.9778526

CNs(X_S2)

$`Condition Number without intercept`
[1] 2.999836

$`Condition Number with intercept`
[1] 2430.189

$`Increase (in percentage)`
[1] 99.87656

VIF(X_S2)

   X2_S2    X3_S2    X4_S2
1.013525 1.018091 1.014811

CVs(X_S2)

[1] 1.239794695 1.052896496 0.001022819
```
Table 2 presents three random repetitions, the average value over the 100 repetitions and the standard deviation for Simulation 2 for the Euclidean and studentise transformations. Before commenting on the results of Simulation 2, it is important to take into account that transformations that imply the elimination of the intercept make it impossible to detect non-essential multicollinearity. Note that, on some occasions when the Euclidean transformation is performed, \(\mathbf{X}_{3}\) and \(\mathbf{X}_{4}\) are identified as the most relevant variables while, when the studentise transformation is performed, all variables seem to be equally relevant. In the first case, greater stability is observed when considering the average values, although the conclusion would be that there is a relation between \(\mathbf{X}_{3}\) and \(\mathbf{X}_{4}\) when the relationship is actually between the intercept and \(\mathbf{X}_{4}\).
Table 2: MC index for Simulation 2 (non-essential multicollinearity) under the Euclidean and studentise transformations: three random repetitions, average and standard deviation over the 100 repetitions.

| | X2 | X3 | X4 |
|---|---|---|---|
| Euclidean - Random 1 | 0.2671991 | 0.2097102 | 0.5230907 |
| Euclidean - Random 2 | 0.1649092 | 0.5119198 | 0.3231711 |
| Euclidean - Random 3 | 0.2126011 | 0.2678652 | 0.5195337 |
| Euclidean - Average | 0.1678676 | 0.3335266 | 0.4986058 |
| Euclidean - Standard Deviation | 0.0725673 | 0.0845660 | 0.1029678 |
| Studentise - Random 1 | 0.3307490 | 0.3342851 | 0.3349658 |
| Studentise - Random 2 | 0.3646487 | 0.2995420 | 0.3358093 |
| Studentise - Random 3 | 0.3107573 | 0.3481032 | 0.3411395 |
| Studentise - Average | 0.3541923 | 0.3150319 | 0.3307758 |
| Studentise - Standard Deviation | 0.0201173 | 0.0203717 | 0.0187872 |
Finally, in this case 100 observations are generated according to \(\mathbf{X}_{i} \sim N(10, 100)\), \(i=2,3,4\). The goal of this simulation is to ensure that the degree of multicollinearity (essential and non-essential) is not troubling. This is confirmed by the following results.
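Again continuing the session of the previous sketches, the fourth column is simply another independent draw:

```r
# Simulation 3: three independent regressors, no troubling collinearity
# of either type.
X4_S3 <- rnorm(n, mean = 10, sd = 10)
X_S3  <- cbind(rep(1, n), X2, X3, X4_S3)
```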
```
RdetR(X_S3)

$`Correlation matrix`
           X2_S3       X3_S3      X4_S3
X2_S3  1.0000000 -0.08754660 0.06676070
X3_S3 -0.0875466  1.00000000 0.09445547
X4_S3  0.0667607  0.09445547 1.00000000

$`Correlation matrix's determinant`
[1] 0.9778526

CNs(X_S3)

$`Condition Number without intercept`
[1] 2.07584

$`Condition Number with intercept`
[1] 3.60862

$`Increase (in percentage)`
[1] 42.47552

VIF(X_S3)

   X2_S3    X3_S3    X4_S3
1.013525 1.018091 1.014811

CVs(X_S3)

[1] 1.239795 1.052896 1.019006
```
Table 3 presents three random repetitions, the average value over the 100 repetitions and the standard deviation for Simulation 3 for the Euclidean and studentise transformations. Simulation 3 shows different situations depending on the transformation: when the Euclidean transformation is performed, the variable \(\mathbf{X}_{3}\) is identified as relevant in addition to \(\mathbf{X}_{4}\); with the studentise transformation, all the variables seem to be equally relevant.
Table 3: MC index for Simulation 3 (no troubling multicollinearity) under the Euclidean and studentise transformations: three random repetitions, average and standard deviation over the 100 repetitions.

| | X2 | X3 | X4 |
|---|---|---|---|
| Euclidean - Random 1 | 0.1318422 | 0.3832372 | 0.4849206 |
| Euclidean - Random 2 | 0.1679453 | 0.4315253 | 0.4005294 |
| Euclidean - Random 3 | 0.0728076 | 0.5693939 | 0.3577986 |
| Euclidean - Average | 0.1083812 | 0.4107279 | 0.4808910 |
| Euclidean - Standard Deviation | 0.0410296 | 0.0661189 | 0.0628212 |
| Studentise - Random 1 | 0.3586837 | 0.3154482 | 0.3258681 |
| Studentise - Random 2 | 0.3805084 | 0.3205751 | 0.2989166 |
| Studentise - Random 3 | 0.3651891 | 0.3069920 | 0.3278189 |
| Studentise - Average | 0.3541923 | 0.3150319 | 0.3307758 |
| Studentise - Standard Deviation | 0.0201173 | 0.0203717 | 0.0187872 |
From the above results, it is concluded that the MC index applied individually is not able to detect whether the degree of multicollinearity is troubling. This conclusion is in line with Lin et al. (2020), who state that "those classical collinearity measures are used together with mcvis for the better learning of how one or more variables display dominant behavior in explaining multicollinearity". That is to say, it is recommended to use measures such as the VIF and the CN to detect whether the degree of multicollinearity is troubling and, if it is, to then use the MC index to detect which variables are most relevant.
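A sketch of this two-step workflow follows; the thresholds used (VIF above 10, CN above 30) are the usual rules of thumb in the literature, not values prescribed by Lin et al. (2020):

```r
# Sketch of the recommended workflow: classical measures first, MC index
# afterwards. VIF > 10 and CN > 30 are common rules of thumb, not
# thresholds taken from Lin et al. (2020).
library(multiColl)
library(mcvis)

if (max(VIF(X_S1)) > 10 ||
    CNs(X_S1)$`Condition Number with intercept` > 30) {
  mcvis(X = X_S1[, -1])   # drop the intercept column before calling mcvis
}
```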
We should reiterate that the results of the MC index depend on the transformation performed on the data.

Finally, it is worth noting that the lowest dispersion is obtained when the studentise transformation is performed, which indicates that this transformation yields greater stability in the results across the 100 repetitions.
In this section we analyze two empirical applications recently used in the literature to illustrate the multicollinearity problem. The first focuses on the existence of non-essential near multicollinearity, while the second focuses on essential multicollinearity.
Salmerón et al. (2020b) analyzed the Euribor as a function of the harmonized index of consumer prices (HICP), the balance of payments to net current account (BC) and the government deficit to net non-financial accounts (GD).
The following values of the determinant of the matrix of correlations of the independent variables, the VIFs, the condition number without and with the intercept, and the coefficients of variation indicate that the degree of essential near multicollinearity is not troubling while the non-essential type (due to the variable HICP) is.
Figure 1 shows a tour displayed as a scatterplot using the tourr package, which allows tours of multivariate data (see Wickham et al. (2011) for more details). From the tour over all the explanatory variables (it runs for 3.47 minutes in the html version), no linear relation is observed between the explanatory variables. Note that this package does not allow us to work with the intercept.
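A sketch of how such a tour can be produced with tourr; the data below are a hypothetical stand-in, since the Euribor dataset is not reproduced here:

```r
# Hypothetical stand-in for the Euribor regressors (HICP, BC, GD): the
# actual data are not reproduced here. tourr works on the explanatory
# variables without the intercept.
library(tourr)

set.seed(1)
X <- matrix(rnorm(50 * 3), ncol = 3,
            dimnames = list(NULL, c("HICP", "BC", "GD")))
animate_xy(X)   # grand tour displayed as a dynamic 2-d scatterplot
```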