Curse of dimensionality?
This is a summary of an interesting chat between the greatest Loyal Data Scientist Chris and me.
The whole thing happened because I came across a LinkedIn post about “the relationship between the curse of high-dimensionality and degrees of freedom”. I do not completely agree with the connection between the two concepts, but the post was interesting because it explained high-dimensionality via Euclidean space. After a quick Google search, I realized the curse of high-dimensionality is a problem in many different fields, not only in Statistics.
Continuing with Euclidean space, we can consider the curse of high-dimensionality from a sampling perspective. For one random variable with sample space of real numbers ($\mathcal{R}$), we probably won’t need too many samples to get a good sense of how it is distributed across its sample space. For two random variables with a joint distribution on $\mathcal{R}^2$, we will need more samples to find where the distribution is more dense in the 2-D Euclidean space. Now, imagine we have three random variables on $\mathcal{R}^3$: it will be even more challenging to figure out where the distribution is more or less dense in a 3-D space. I understand this is how the post tried to explain the curse of high-dimensionality.
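To make this concrete, here is a minimal sketch (assuming uniform samples on the unit cube and an arbitrarily chosen radius, both my own choices for illustration) of how quickly a fixed-size neighborhood around a point empties out as the dimension grows, which is why many more samples are needed to see where a distribution is dense:

```python
# A minimal sketch: with the same number of uniform samples on the unit cube,
# the fraction of points falling inside a fixed-radius neighborhood of the
# center shrinks rapidly as the dimension increases.
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100_000
radius = 0.3  # neighborhood radius around the cube's center (arbitrary choice)

for dim in (1, 2, 3, 10):
    points = rng.uniform(0.0, 1.0, size=(n_samples, dim))
    center = np.full(dim, 0.5)
    # Fraction of samples within `radius` of the center point.
    inside = np.linalg.norm(points - center, axis=1) < radius
    print(f"dim={dim:>2}: {inside.mean():.4f} of samples are 'nearby'")
```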
“Degrees of freedom” is a different concept, in my opinion. When estimating parameters of interest, we are solving some estimating equations (e.g., score functions for maximum likelihood estimation, moment conditions for method of moments estimation, derivatives of an objective function for general prediction problems, etc.). So the data are constrained by these estimating equations: once the parameters are estimated, fewer data points remain free to vary. For example, suppose we have $N$ data points and are interested in the mean ($k = 1$). Once the sample average is computed, only $N-1$ data points are free to change; the last one is determined so that the average stays the same. Now imagine we have $N+k$ parameters to estimate (the curse of high-dimensionality): it is impossible to estimate them with $N$ samples. In other words, a negative ($-k$) degrees of freedom does not make sense. Intuitively, we cannot magically “create” an additional $k$ data points.
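Here is a toy sketch of the $N-1$ idea (the sample size and random values are made up for illustration): once the average is fixed, the first $N-1$ points can change freely, but the last one is fully determined.

```python
# Toy illustration of N - 1 degrees of freedom for the sample mean:
# once the average is fixed, the last data point is determined by the rest.
import numpy as np

rng = np.random.default_rng(1)
N = 10
x = rng.normal(size=N)
xbar = x.mean()  # one estimated parameter (k = 1)

# Change the first N - 1 points freely...
free_part = rng.normal(size=N - 1)
# ...then the last point is forced if the mean must stay equal to xbar.
last_point = N * xbar - free_part.sum()

new_sample = np.append(free_part, last_point)
print(new_sample.mean(), xbar)  # identical up to floating-point error
```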
So the curse of high-dimensionality and degrees of freedom are two separate and different concepts, and of course, I do not think they are related.
Chris also emphasized that the curse of dimensionality is more about the joint distribution of $\mathbf{X}$ (covariates, independent variables, factors, etc.). The greatest example of the “curse” is that the kNN algorithm fails as the number of $X$’s increases: as the dimension grows, even with a large number of samples, it becomes impossible to find enough nearby neighbors of a single data point to compute an average. However, a linear model can still predict well even when the dimension of $\mathbf{X}$ is pretty large. The need for lasso regression or other penalties is more about avoiding over-fitting and providing good inference.
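Below is a rough sketch of that point, not the exact example Chris had in mind; the linear data-generating model, sample sizes, noise level, and number of neighbors are all assumptions made for illustration. With the same amount of data, kNN regression deteriorates as the dimension of $\mathbf{X}$ grows, while the linear model keeps predicting well.

```python
# Compare kNN regression and a linear model as the number of covariates grows.
# The truth is linear with standard normal noise (an assumption for this demo).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)
n_train, n_test = 500, 500

for dim in (2, 10, 50):
    beta = rng.normal(size=dim)
    X_train = rng.normal(size=(n_train, dim))
    X_test = rng.normal(size=(n_test, dim))
    y_train = X_train @ beta + rng.normal(size=n_train)
    y_test = X_test @ beta + rng.normal(size=n_test)

    knn = KNeighborsRegressor(n_neighbors=10).fit(X_train, y_train)
    ols = LinearRegression().fit(X_train, y_train)

    # kNN's test error blows up with dimension; the linear model stays near
    # the irreducible noise level.
    print(
        f"dim={dim:>2}: "
        f"kNN MSE={mean_squared_error(y_test, knn.predict(X_test)):.2f}, "
        f"linear MSE={mean_squared_error(y_test, ols.predict(X_test)):.2f}"
    )
```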
From a statistical perspective, I believe that all models are wrong but some are useful, and the simpler the better. I think in practice the goal is always to find a plausible model that approximates the truth. So the curse of high-dimensionality, for me, is really that even though big data is available, we cannot extract important or useful information from it with a complex model. In practice, the truth is always complex and involves a lot of randomness that we cannot control. If we can find a simple model to explain the data-generating mechanism, why not?
Chris and I strongly agree with each other that “simplicity is good”. He also added that there can be a cost to not controlling for factors that are indeed important. If we had infinite data, we could include as many factors as we want, but this is not the case in the real world. The two of us now kind of focus on different problems: I work more on statistical inference while he works more on prediction (my words). I think there is a slight difference in the focus of the two problems. For inference, it is very important to make the model as simple as possible, separating the parameters of interest from the nuisance parameters. The goal is to efficiently estimate the parameters of interest and also draw inference. Including too many parameters of interest will not help the inference, and we do need to be careful to confirm that the model supports the inferential goal. For prediction, however, it is more important to find a working model that can accurately predict the response. Intuitively, as long as the model works, it is not that important to figure out how many parameters are included or which of them are nuisance parameters. Explainable machine learning is fascinating but also challenging.