Compute a random data matrix (fake data) without rmvnorm

This is a skeleton post until I have time to flesh it out. The post is motivated by a question on twitter about creating fake data that has a covariance matrix that simulates a known (given) covariance matrix that has one or more negative (or zero) eigenvalues.

First, some libraries

library(data.table)
library(mvtnorm)
library(MASS)

Second, some functions…

random.sign <- function(u){
  # this is fastest of three
    out <- sign(runif(u)-0.5)
    #randomly draws from {-1,1} with probability of each = 0.5
    return(out)
}

fake.eigenvectors <- function(p){
    a <- matrix(rnorm(p*p), p, p) # only orthogonal if p is infinity so need to orthogonalize it
    a <- t(a)%*%a # this is a pseudo-covariance matrix
    E <- eigen(a)$vectors # decompose to truly orthogonal columns
    return(E)
}

fake.eigenvalues <- function(p, m=p, start=2/3, rate=2){
  # m is the number of positive eigenvalues
  # start and rate control the decline in the eigenvalue
  s <- start/seq(1:m)^rate
  s <- c(s, rep(0, p-m)) # add zero eigenvalues
  L <- diag(s/sum(s)*m) # rescale so that sum(s)=m and put into matrix,
  # which would occur if all the traits are variance standardized
  return(L)
}

fake.cov.matrix <- function(p){
    # p is the size of the matrix (number of cols and rows)
    E <- fake.eigenvectors(p)
    L <- diag(fake.eigenvalues(p))
    S <- E%*%L%*%t(E)
    return(S)
}

# two functions to compute the random data
fake.X <- function(n,p,E,L){
  # n is number of observations
  # p is number of variables
  X <- matrix(rnorm(n*p),nrow=n,ncol=p) %*% t(E%*%sqrt(L))
    return(X)
}

And finally, some fake data. Since, I’m not starting with a known covariance matrix, I have to create one. This isn’t necessary if you already have data. I start with fake data that does have a full-rank covariance matrix, and then create fake data that has a single zero eigenvalue.

set.seed(2)
n <- 10^4 # number of cases (rows of the data)
p <- 5 # number of variables (columns of the data)

# start with a matrix. In real life this would be our data
X <- fake.X(n, p, fake.eigenvectors(p), fake.eigenvalues(p))
# and here is our "real" covariance matrix
S <- cov(X)

# okay now we want to create fake data sets with this structure
decomp <- eigen(S)
E <- decomp$vectors
L <- diag(decomp$values)
fake_data <- fake.X(n, p, E, L)

# okay what if our real covariance matrix is not positive definite, that is has zero (negative) eigenvalues.
# Here is our fake data and cov matrix
k <- 1 # number of eigenvalues to delete
X <- fake.X(n, p, fake.eigenvectors(p), fake.eigenvalues(p, m=(p-k)))
# and here is our "real" covariance matrix
S <- cov(X)
decomp <- eigen(S)
E <- decomp$vectors
L <- diag(decomp$values) # note last eigenvalue is negative

# can we simulate with rmvnorm? mvrnorm?
fake_data_rmvnorm <- rmvnorm(n, mean=rep(0, p), sigma=S)
fake_data_mvrnorm <- mvrnorm(n, mu=rep(0, p), Sigma=S)

# now let's sample fake data from this non-pos matrix
# set m to p-1
m <- p-k
E_reduced <- E[1:p, 1:m] # the first m columns of E
L_reduced <- L[1:m, 1:m] # the first m diag elements of L
fake_data <- fake.X(n, m, E_reduced, L_reduced)
S # compare...pretty good!

##            [,1]        [,2]        [,3]       [,4]       [,5]
## [1,]  0.4617029 -0.21686839  0.27537359 -0.4226386 -0.4834831
## [2,] -0.2168684  0.33629531 -0.03734511  0.3562057  0.2632369
## [3,]  0.2753736 -0.03734511  1.08725135 -0.5114116 -0.6554844
## [4,] -0.4226386  0.35620567 -0.51141163  0.7362452  0.8899933
## [5,] -0.4834831  0.26323687 -0.65548437  0.8899933  1.3156042

cov(fake_data_rmvnorm)

##            [,1]        [,2]        [,3]       [,4]       [,5]
## [1,]  0.4576706 -0.21395310  0.27322299 -0.4155847 -0.4731524
## [2,] -0.2139531  0.33445024 -0.02847549  0.3487285  0.2528381
## [3,]  0.2732230 -0.02847549  1.07250120 -0.4958794 -0.6397814
## [4,] -0.4155847  0.34872855 -0.49587939  0.7217563  0.8751866
## [5,] -0.4731524  0.25283813 -0.63978139  0.8751866  1.3047852

cov(fake_data)

##            [,1]       [,2]       [,3]       [,4]       [,5]
## [1,]  0.4596107 -0.2148138  0.2670967 -0.4164482 -0.4760025
## [2,] -0.2148138  0.3366904 -0.0379339  0.3556336  0.2612796
## [3,]  0.2670967 -0.0379339  1.0760712 -0.5059885 -0.6473009
## [4,] -0.4164482  0.3556336 -0.5059885  0.7306168  0.8807419
## [5,] -0.4760025  0.2612796 -0.6473009  0.8807419  1.3014562

The top matrix is the observed covariance matrix. The middle matrix is using rmvnorm. The bottom matrix is from my own code to generate fake data that is modeled to simulate the real data (although here, even the real data is fake).

Compute a random data matrix (fake data) without rmvnorm

R doodles. Some ecology. Some physiology. Much fake data.

How to make plots with factor levels below the x-axis (bench-biology style)

What is an interaction?

How to estimate synergism or antagonism

Type 3 ANOVA in R -- an easy way to publish wrong tables

Linear models with a covariate ("ANCOVA")

Normal Q-Q plots - what is the robust line and should we prefer it?

ANCOVA when the covariate is a mediator affected by treatment

Bootstrap confidence intervals when sample size is really small

What is the consequence of normalizing by each case in the control?

Melting a list of columns