Question title says it all, what's the most efficient way to calculate the pairwise partial correlation between every column of a matrix controlling for every other variable?
Basically, something analogous to the cor
function below, but resulting in the partial correlations instead of simple correlations.
#> cor(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
#Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
#Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
#Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
The result should match the one we obtain with the ppcor
library:
#> ppcor::pcor(iris[,-5])$estimate
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
We know the pairwise partial correlations controlling for every other variable can be obtained through the inversion of the correlation or covariance matrix (see here) in O(n^3) time. So a possible solution is simply:
pcor.solve = function(x){
res = solve(cov(x))
res = -res/sqrt(diag(res) %o% diag(res))
diag(res) = 1
return(res)
}
This is basically a stripped-down version of ppcor::pcor
. The result is:
pcor.solve(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
Note however the covariance matrix (or the correlation matrix, the result is the same) must be positive definite.
As this bogs down mostly to an effective inversion operation, I've looked at this thread in stats.SE. qr.solve
and chol2inv
can be used in covariance matrices to the same effect.
pcor.qr = function(x){
res = qr.solve(cov(x))
res = -res/sqrt(diag(res) %o% diag(res))
diag(res) = 1
dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
return(res)
}
pcor.qr(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
pcor.chol = function(x){
res = chol2inv(chol(cov(x)))
res = -res/sqrt(diag(res) %o% diag(res))
diag(res) = 1
dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
return(res)
}
pcor.chol(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
SVD can also be used to solve. If we have a positive definite square matrix, it's SVD decomposition is A = UDU^T, and it's inverse is simply A^-1 = UD^-1U^T.
pcor.svd = function(x){
res = svd(cov(x))
res = res$v %*% diag(1/res$d) %*% t(res$v)
res = -res/sqrt(diag(res) %o% diag(res))
diag(res) = 1
dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
return(res)
}
pcor.svd(iris[,-5])
# Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length 1.0000000 0.6285707 0.7190656 -0.3396174
#Sepal.Width 0.6285707 1.0000000 -0.6152919 0.3526260
#Petal.Length 0.7190656 -0.6152919 1.0000000 0.8707698
#Petal.Width -0.3396174 0.3526260 0.8707698 1.0000000
microbenchmark
with 10000 repeats:
library(microbenchmark)
#iris
dt1 = iris[,-5]
microbenchmark(
ppcor = ppcor::pcor(dt1)$estimate,
solve = pcor.solve(dt1),
qr = pcor.qr(dt1),
chol = pcor.chol(dt1),
svd = pcor.svd(dt1),
times = 10000L)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# ppcor 247.728 267.790 314.8356 280.853 296.248 196962.601 10000 c
# solve 176.816 198.743 217.1298 205.274 221.603 2425.964 10000 b
# qr 240.264 258.459 282.7005 270.123 285.518 4015.438 10000 c
# chol 131.562 148.824 163.3567 154.423 167.019 1593.205 10000 a
# svd 179.615 199.675 219.2781 208.074 223.469 1920.710 10000 b
#random data
dt2 = cbind(rnorm(1E4), rnorm(1E4)+2)
microbenchmark(
ppcor = ppcor::pcor(dt2)$estimate,
solve = pcor.solve(dt2),
qr = pcor.qr(dt2),
chol = pcor.chol(dt2),
svd = pcor.svd(dt2),
times = 10000L)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# ppcor 243.063 267.323 306.4535 284.585 311.177 1833.936 10000 d
# solve 180.548 190.812 222.6685 198.277 216.004 84776.704 10000 a
# qr 229.068 248.662 282.8142 262.658 285.518 1954.301 10000 c
# chol 179.148 189.413 212.6551 198.277 216.005 1383.733 10000 a
# svd 213.672 230.933 262.5084 243.529 264.058 5261.543 10000 b
#uncorrelated data
dt3 = cbind(sin(seq(0, 2*pi, length.out = 1000L)), cos(seq(0, 2*pi, length.out = 1000L)))
microbenchmark(
ppcor = ppcor::pcor(dt3)$estimate,
solve = pcor.solve(dt3),
qr = pcor.qr(dt3),
chol = pcor.chol(dt3),
svd = pcor.svd(dt3),
times = 10000L)
#Unit: microseconds
# expr min lq mean median uq max neval cld
# ppcor 142.759 162.354 188.7767 172.1500 191.745 2230.021 10000 d
# solve 80.711 89.108 102.8269 92.3740 101.704 1709.372 10000 a
# qr 130.629 145.092 168.0627 153.0220 169.351 4914.910 10000 c
# chol 79.777 87.709 102.2984 92.3740 101.238 6731.117 10000 a
# svd 112.901 127.363 147.1913 134.1285 148.358 1401.928 10000 b
[UPDATED] Or, in other words, chol
< solve
< svd
< qr
< ppcor
for now. Probably some speed up can be gained talking advantage of tha fact that the covariance matrix is symmetric (the chol
solution uses this fact already), and time could be gained also in the covariance calculation.
Of course, the ppcor
library is much more versatile, handling cases where the covariance matrix is not invertible etc, so it's in a disadvantage in comparisons. One can also make the case though that it's desirable to have the simpler solutions when partial correlations are going to be exhaustively computed and one knows the covariance matrices are positive definite.
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments