Most efficient way to calculate pairwise partial correlations in base R?

debugcn Published at Dev

catastrophic-failure

Question title says it all, what's the most efficient way to calculate the pairwise partial correlation between every column of a matrix controlling for every other variable?

Basically, something analogous to the cor function below, but resulting in the partial correlations instead of simple correlations.

#> cor(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
#Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
#Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
#Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000

The result should match the one we obtain with the ppcor library:

#> ppcor::pcor(iris[,-5])$estimate
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000

Solutions on other partial correlation coefficients (i.e. not Pearson) are welcome as well.

catastrophic-failure

We know the pairwise partial correlations controlling for every other variable can be obtained through the inversion of the correlation or covariance matrix (see here) in O(n^3) time. So a possible solution is simply:

pcor.solve = function(x){
  res = solve(cov(x))
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  return(res)
}

This is basically a stripped-down version of ppcor::pcor. The result is:

pcor.solve(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000

Note however the covariance matrix (or the correlation matrix, the result is the same) must be positive definite.

As this bogs down mostly to an effective inversion operation, I've looked at this thread in stats.SE. qr.solve and chol2inv can be used in covariance matrices to the same effect.

pcor.qr = function(x){
  res = qr.solve(cov(x))
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
  return(res)
}
pcor.qr(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000

pcor.chol = function(x){
  res = chol2inv(chol(cov(x)))
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
  return(res)
}
pcor.chol(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000

UPDATE:

SVD can also be used to solve. If we have a positive definite square matrix, it's SVD decomposition is A = UDU^T, and it's inverse is simply A^-1 = UD^-1U^T.

pcor.svd = function(x){
  res = svd(cov(x))
  res = res$v %*% diag(1/res$d) %*% t(res$v)
  res = -res/sqrt(diag(res) %o% diag(res))
  diag(res) = 1
  dimnames(res)[[1]] = dimnames(res)[[2]] = colnames(x)
  return(res)
}

pcor.svd(iris[,-5])
#             Sepal.Length Sepal.Width Petal.Length Petal.Width
#Sepal.Length    1.0000000   0.6285707    0.7190656  -0.3396174
#Sepal.Width     0.6285707   1.0000000   -0.6152919   0.3526260
#Petal.Length    0.7190656  -0.6152919    1.0000000   0.8707698
#Petal.Width    -0.3396174   0.3526260    0.8707698   1.0000000

microbenchmark with 10000 repeats:

library(microbenchmark)
#iris
dt1 = iris[,-5]
microbenchmark(
  ppcor = ppcor::pcor(dt1)$estimate,
  solve = pcor.solve(dt1),
  qr = pcor.qr(dt1),
  chol = pcor.chol(dt1),
  svd = pcor.svd(dt1),
  times = 10000L)

#Unit: microseconds
#  expr     min      lq     mean  median      uq        max neval cld
# ppcor 247.728 267.790 314.8356 280.853 296.248 196962.601 10000   c
# solve 176.816 198.743 217.1298 205.274 221.603   2425.964 10000  b 
#    qr 240.264 258.459 282.7005 270.123 285.518   4015.438 10000   c
#  chol 131.562 148.824 163.3567 154.423 167.019   1593.205 10000 a  
#   svd 179.615 199.675 219.2781 208.074 223.469   1920.710 10000  b 

#random data
dt2 = cbind(rnorm(1E4), rnorm(1E4)+2)
microbenchmark(
  ppcor = ppcor::pcor(dt2)$estimate,
  solve = pcor.solve(dt2),
  qr = pcor.qr(dt2),
  chol = pcor.chol(dt2),
  svd = pcor.svd(dt2),
  times = 10000L)

#Unit: microseconds
#  expr     min      lq     mean  median      uq       max neval  cld
# ppcor 243.063 267.323 306.4535 284.585 311.177  1833.936 10000    d
# solve 180.548 190.812 222.6685 198.277 216.004 84776.704 10000 a   
#    qr 229.068 248.662 282.8142 262.658 285.518  1954.301 10000   c 
#  chol 179.148 189.413 212.6551 198.277 216.005  1383.733 10000 a   
#   svd 213.672 230.933 262.5084 243.529 264.058  5261.543 10000  b  

#uncorrelated data
dt3 = cbind(sin(seq(0, 2*pi, length.out = 1000L)), cos(seq(0, 2*pi, length.out = 1000L)))
microbenchmark(
  ppcor = ppcor::pcor(dt3)$estimate,
  solve = pcor.solve(dt3),
  qr = pcor.qr(dt3),
  chol = pcor.chol(dt3),
  svd = pcor.svd(dt3),
  times = 10000L)

#Unit: microseconds
#  expr     min      lq     mean   median      uq      max neval  cld
# ppcor 142.759 162.354 188.7767 172.1500 191.745 2230.021 10000    d
# solve  80.711  89.108 102.8269  92.3740 101.704 1709.372 10000 a   
#    qr 130.629 145.092 168.0627 153.0220 169.351 4914.910 10000   c 
#  chol  79.777  87.709 102.2984  92.3740 101.238 6731.117 10000 a   
#   svd 112.901 127.363 147.1913 134.1285 148.358 1401.928 10000  b

[UPDATED] Or, in other words, chol < solve < svd < qr < ppcor for now. Probably some speed up can be gained talking advantage of tha fact that the covariance matrix is symmetric (the chol solution uses this fact already), and time could be gained also in the covariance calculation.

Of course, the ppcor library is much more versatile, handling cases where the covariance matrix is not invertible etc, so it's in a disadvantage in comparisons. One can also make the case though that it's desirable to have the simpler solutions when partial correlations are going to be exhaustively computed and one knows the covariance matrices are positive definite.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at2021-07-21

Comments

0 comments

From Dev

Related Related

Article