Compute a random data matrix (fake data) without rmvnorm
This is a skeleton post until I have time to flesh it out. The post is motivated by a question on twitter about creating fake data that has a covariance matrix that simulates a known (given) covariance matrix that has one or more negative (or zero) eigenvalues.
First, some libraries
library(data.table)
library(mvtnorm)
library(MASS)
Second, some functions…
random.sign <- function(u){
# this is fastest of three
out <- sign(runif(u)-0.5)
#randomly draws from {-1,1} with probability of each = 0.5
return(out)
}
fake.eigenvectors <- function(p){
a <- matrix(rnorm(p*p), p, p) # only orthogonal if p is infinity so need to orthogonalize it
a <- t(a)%*%a # this is a pseudo-covariance matrix
E <- eigen(a)$vectors # decompose to truly orthogonal columns
return(E)
}
fake.eigenvalues <- function(p, m=p, start=2/3, rate=2){
# m is the number of positive eigenvalues
# start and rate control the decline in the eigenvalue
s <- start/seq(1:m)^rate
s <- c(s, rep(0, p-m)) # add zero eigenvalues
L <- diag(s/sum(s)*m) # rescale so that sum(s)=m and put into matrix,
# which would occur if all the traits are variance standardized
return(L)
}
fake.cov.matrix <- function(p){
# p is the size of the matrix (number of cols and rows)
E <- fake.eigenvectors(p)
L <- diag(fake.eigenvalues(p))
S <- E%*%L%*%t(E)
return(S)
}
# two functions to compute the random data
fake.X <- function(n,p,E,L){
# n is number of observations
# p is number of variables
X <- matrix(rnorm(n*p),nrow=n,ncol=p) %*% t(E%*%sqrt(L))
return(X)
}
And finally, some fake data. Since, I’m not starting with a known covariance matrix, I have to create one. This isn’t necessary if you already have data. I start with fake data that does have a full-rank covariance matrix, and then create fake data that has a single zero eigenvalue.
set.seed(2)
n <- 10^4 # number of cases (rows of the data)
p <- 5 # number of variables (columns of the data)
# start with a matrix. In real life this would be our data
X <- fake.X(n, p, fake.eigenvectors(p), fake.eigenvalues(p))
# and here is our "real" covariance matrix
S <- cov(X)
# okay now we want to create fake data sets with this structure
decomp <- eigen(S)
E <- decomp$vectors
L <- diag(decomp$values)
fake_data <- fake.X(n, p, E, L)
# okay what if our real covariance matrix is not positive definite, that is has zero (negative) eigenvalues.
# Here is our fake data and cov matrix
k <- 1 # number of eigenvalues to delete
X <- fake.X(n, p, fake.eigenvectors(p), fake.eigenvalues(p, m=(p-k)))
# and here is our "real" covariance matrix
S <- cov(X)
decomp <- eigen(S)
E <- decomp$vectors
L <- diag(decomp$values) # note last eigenvalue is negative
# can we simulate with rmvnorm? mvrnorm?
fake_data_rmvnorm <- rmvnorm(n, mean=rep(0, p), sigma=S)
fake_data_mvrnorm <- mvrnorm(n, mu=rep(0, p), Sigma=S)
# now let's sample fake data from this non-pos matrix
# set m to p-1
m <- p-k
E_reduced <- E[1:p, 1:m] # the first m columns of E
L_reduced <- L[1:m, 1:m] # the first m diag elements of L
fake_data <- fake.X(n, m, E_reduced, L_reduced)
S # compare...pretty good!
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.4617029 -0.21686839 0.27537359 -0.4226386 -0.4834831
## [2,] -0.2168684 0.33629531 -0.03734511 0.3562057 0.2632369
## [3,] 0.2753736 -0.03734511 1.08725135 -0.5114116 -0.6554844
## [4,] -0.4226386 0.35620567 -0.51141163 0.7362452 0.8899933
## [5,] -0.4834831 0.26323687 -0.65548437 0.8899933 1.3156042
cov(fake_data_rmvnorm)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.4576706 -0.21395310 0.27322299 -0.4155847 -0.4731524
## [2,] -0.2139531 0.33445024 -0.02847549 0.3487285 0.2528381
## [3,] 0.2732230 -0.02847549 1.07250120 -0.4958794 -0.6397814
## [4,] -0.4155847 0.34872855 -0.49587939 0.7217563 0.8751866
## [5,] -0.4731524 0.25283813 -0.63978139 0.8751866 1.3047852
cov(fake_data)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.4596107 -0.2148138 0.2670967 -0.4164482 -0.4760025
## [2,] -0.2148138 0.3366904 -0.0379339 0.3556336 0.2612796
## [3,] 0.2670967 -0.0379339 1.0760712 -0.5059885 -0.6473009
## [4,] -0.4164482 0.3556336 -0.5059885 0.7306168 0.8807419
## [5,] -0.4760025 0.2612796 -0.6473009 0.8807419 1.3014562
The top matrix is the observed covariance matrix. The middle matrix is using rmvnorm. The bottom matrix is from my own code to generate fake data that is modeled to simulate the real data (although here, even the real data is fake).