R Doodles
https://rdoodles.rbind.io/
Recent content on R Doodles
Wed, 17 Apr 2019 00:00:00 +0000

The statistical significance filter
https://rdoodles.rbind.io/2019/04/the-statistical-significance-filter/
Wed, 17 Apr 2019 00:00:00 +0000
Contents: 1 Why reported effect sizes are inflated; 2 Setup; 3 Exploration 1; 4 Unconditional means, power, and sign error; 5 Conditional means (5.1 filter = 0.05; 5.2 filter = 0.2)
1 Why reported effect sizes are inflated
This post is motivated by many discussions on Gelman’s blog, but start here.
When we estimate an effect, the estimate will be a little inflated or a little diminished relative to the true effect, but the expectation of the estimate is the true effect.

Covariate adjustment in randomized experiments
https://rdoodles.rbind.io/2019/04/covariate-adjustment-in-randomized-experiments/
Fri, 12 Apr 2019 00:00:00 +0000
This post is motivated by a tweetorial from Darren Dahly.
In an experiment, do we adjust for covariates that differ between treatment levels measured pre-experiment (“imbalance” in random assignment), where a difference is inferred from a t-test with p < 0.05? Or do we adjust for all covariates, regardless of differences pre-test? Or do we adjust only for covariates that have substantial correlation with the outcome? Or do we not adjust at all?

What to write, and not write, in a results section — an ever-growing list
https://rdoodles.rbind.io/2019/01/what-to-write-and-not-write-in-a-results-section-an-ever-growing-list/
Thu, 31 Jan 2019 00:00:00 +0000
“GPP (n=4 per site) increased from the No Wildlife site to the Hippo site but was lowest at the Hippo + WB site (Fig. 6); however, these differences were not significant due to low sample sizes and high variability.” – Subalusky, A.L., Dutton, C.L., Njoroge, L., Rosi, E.J., and Post, D.M. (2018). Organic matter and nutrient inputs from large wildlife influence ecosystem function in the Mara River, Africa. Ecology 99, 2558–2574.

Paired line plots
https://rdoodles.rbind.io/2019/01/paired-line-plots/
Tue, 22 Jan 2019 00:00:00 +0000
Contents: load libraries; make some fake data; make a plot with ggplot
ggplot scripts to draw figures like those in the Dynamic Ecology post Paired line plots (a.k.a. “reaction norms”) to visualize Likert data
load libraries
library(ggplot2)
library(ggpubr)
library(data.table)
make some fake data
set.seed(3)
n <- 40
self <- rbinom(n, 5, 0.25) + 1
others <- self + rbinom(n, 3, 0.5)
# use "=" (not "<-") inside data.table() so the third column is named "stigma"
fd <- data.table(id=factor(rep(1:n, 2)), who=factor(rep(c("self", "others"), each=n)), stigma = c(self, others))
make a plot with ggplot
The students are identified by the column “id”.

GLM vs. t-tests vs. non-parametric tests if all we care about is NHST
https://rdoodles.rbind.io/2019/01/glm-vs-non-parametric-tests-if-all-we-care-about-is-nhst/
Mon, 07 Jan 2019 00:00:00 +0000
A skeleton simulation of different strategies for NHST for count data if all we care about is a p-value, as in bench biology where p-values are used to simply give one confidence that something didn’t go terribly wrong (similar to doing experiments in triplicate – it’s not the effect size that matters, only “we have experimental evidence of a replicatable effect”)
load libraries
library(ggplot2)
library(MASS)
library(data.table)
do_sim <- function(){
  set.seed(1)
  niter <- 1000
  methods <- c("t", "Welch", "log", "Wilcoxan", "nb")
  p_table <- matrix(NA, nrow=niter, ncol=length(methods))
  colnames(p_table) <- methods
  res_table <- data.

Expected covariances in a causal network
https://rdoodles.rbind.io/2019/01/expected-covariances-in-a-causal-network/
Thu, 03 Jan 2019 00:00:00 +0000
This is a skeleton post
Standardized variables (Wright’s rules)
n <- 10^5
# z is the common cause of g1 and g2
z <- rnorm(n)
# effects of z on g1 and g2
b1 <- 0.7
b2 <- 0.7
r12 <- b1*b2
g1 <- b1*z + sqrt(1-b1^2)*rnorm(n)
g2 <- b2*z + sqrt(1-b2^2)*rnorm(n)
var(g1) # E(VAR(g1)) = 1
## [1] 0.997149
var(g2) # E(VAR(g2)) = 1
## [1] 0.9972956
cor(g1, g2) # E(COR(g1,g2)) = b1*b2
## [1] 0.

Compute a random data matrix (fake data) without rmvnorm
https://rdoodles.rbind.io/2018/12/compute-a-random-data-matrix-fake-data-without-rmvnorm/
Thu, 20 Dec 2018 00:00:00 +0000
This is a skeleton post until I have time to flesh it out. The post is motivated by a question on twitter about creating fake data that has a covariance matrix that simulates a known (given) covariance matrix that has one or more negative (or zero) eigenvalues.
First, some libraries
library(data.table)
library(mvtnorm)
library(MASS)
Second, some functions…
random.sign <- function(u){
  # this is fastest of three
  out <- sign(runif(u)-0.5) #randomly draws from {-1,1} with probability of each = 0.

Reporting effects as relative differences...with a confidence interval
https://rdoodles.rbind.io/2018/11/reporting-effects-as-relative-differences-with-a-confidence-interval/
Wed, 14 Nov 2018 00:00:00 +0000
Researchers frequently report results as relative effects, for example,
“Male flies from selected lines had 50% larger upwind flight ability than male flies from control lines (Control mean: 117.5 cm/s; Selected mean 176.5 cm/s).”
where a relative effect is
\[ 100 \frac{\bar{y}_B - \bar{y}_A}{\bar{y}_A} \]
If we are to follow best practices, we should present this effect with a measure of uncertainty, such as a confidence interval. The absolute effect is 59.

Interaction plots with ggplot2
https://rdoodles.rbind.io/2018/10/interaction-plots-with-ggplot2/
Mon, 15 Oct 2018 00:00:00 +0000
ggpubr is a fantastic resource for teaching applied biostats because it makes ggplot a bit easier for students. I’m not super familiar with all that ggpubr can do, but I’m not sure it includes a good “interaction plot” function. Maybe I’m wrong. But if I’m not, here is a simple function to create a gg_interaction plot.
The gg_interaction function returns a ggplot of the modeled means and standard errors and not the raw means and standard errors computed from each group independently.

Textbook error 101 -- A low p-value for the full model does not mean that the model is a good predictor of the response
https://rdoodles.rbind.io/2018/09/a-low-p-value-for-the-full-model-does-not-mean-that-the-model-is-a-good-predictor-of-the-response/
Tue, 11 Sep 2018 00:00:00 +0000
On page 606 of Lock et al., “Statistics: Unlocking the Power of Data”, the authors state in item D: “The p-value from the ANOVA table is 0.000 so the model as a whole is effective at predicting grade point average.” Ah no.
library(data.table)
library(mvtnorm)
rho <- 0.5
n <- 10^5
Sigma <- diag(2)
Sigma[1,2] <- Sigma[2,1] <- rho
X <- rmvnorm(n, mean=c(0,0), sigma=Sigma)
colnames(X) <- c("X1", "X2")
beta <- c(0.01, -0.

A simple ggplot of some measure against depth
https://rdoodles.rbind.io/2018/09/a-simple-ggplot-of-some-measure-against-depth/
Mon, 10 Sep 2018 00:00:00 +0000
set up
The goal is to plot the measure of something, say O2 levels, against depth (soil or lake), with the measures taken on multiple days
library(ggplot2)
library(data.table)
First – create fake data
depths <- c(0, seq(10,100, by=10))
dates <- c("Jan-18", "Mar-18", "May-18", "Jul-18")
x <- expand.grid(date=dates, depth=depths)
n <- nrow(x)
head(x)
##     date depth
## 1 Jan-18     0
## 2 Mar-18     0
## 3 May-18     0
## 4 Jul-18     0
## 5 Jan-18    10
## 6 Mar-18    10
X <- model.

Should the model-averaged prediction be computed on the link or response scale in a GLM?
https://rdoodles.rbind.io/2018/05/should-the-model-averaged-prediction-be-computed-on-the-link-or-response-scale-in-a-glm/
Sun, 13 May 2018 00:00:00 +0000
[updated to include additional output from MuMIn, BMA, and BAS]
This post is a follow-up to my initial post, which was written as a way for me to pen my thoughts on the recent review “Model averaging in ecology: a review of Bayesian, information‐theoretic and tactical approaches for predictive inference”. It was also written without contacting and discussing the issue with the authors. This post benefits from a series of e-mails with the lead author Carsten Dormann and the last author Florian Hartig.

On model averaging the coefficients of linear models
https://rdoodles.rbind.io/2018/05/on-model-averaging-regression-coefficients/
Thu, 10 May 2018 00:00:00 +0000
a shorter argument based on a specific example is here
“What model averaging does not mean is averaging parameter estimates, because parameters in different models have different meanings and should not be averaged, unless you are sure you are in a special case in which it is safe to do so.” – Richard McElreath, p. 196 of Statistical Rethinking, the textbook I wish I had learned from
This is an infrequent but persistent criticism of model-averaged coefficients in the applied statistics literature on model averaging.

An even more compact defense of coefficient model averaging
https://rdoodles.rbind.io/2018/05/an-even-more-compact-defense-of-coefficient-model-averaging/
Mon, 07 May 2018 00:00:00 +0000
a longer, more detailed argument is here
The parameter that is averaged “needs to have the same meaning in all “models” for the equations to be straightforwardly interpretable; the coefficient of x1 in a regression of y on x1 is a different beast than the coefficient of x1 in a regression of y on x1 and x2.” – David Draper in a comment on Hoeting et al. 1999.
David Draper suggested this example from the textbook by Freedman, Pisani and Purves.

Model-averaged coefficients of a GLM
https://rdoodles.rbind.io/2018/05/model-averaged-coefficients-of-a-glm/
Fri, 04 May 2018 00:00:00 +0000
This is a very quick post as a comment to the statement
“For linear models, predicting from a parameter-averaged model is mathematically identical to averaging predictions, but this is not the case for non-linear models…For non-linear models, such as GLMs with log or logit link functions g(x), such coefficient averaging is not equivalent to prediction averaging.”
from the supplement of Dormann et al., “Model averaging in ecology: a review of Bayesian, information‐theoretic and tactical approaches for predictive inference”.

On alpha
https://rdoodles.rbind.io/2018/04/on-alpha/
Mon, 23 Apr 2018 00:00:00 +0000
This post is motivated by Terry McGlynn’s thought-provoking “How do we move beyond an arbitrary statistical threshold?” I have been struggling with the ideas explored in Terry’s post ever since starting my PhD 30 years ago, and it’s only been in the last couple of years that my own thoughts have begun to gel. This long marination period is largely because of my very classical biostatistical training. My PhD is from the Department of Anatomical Sciences at Stony Brook but the content was geometric morphometrics and James Rohlf was my mentor for morphometrics specifically, and multivariate statistics more generally.

Combining data, distribution summary, model effects, and uncertainty in a single plot
https://rdoodles.rbind.io/2018/03/combining-data-distribution-summary-model-effects-and-uncertainty-in-a-single-plot/
Tue, 27 Mar 2018 00:00:00 +0000
A Harrell plot combines a forest plot of estimated treatment effects and uncertainty, a dot plot of raw data, and a box plot of the distribution of the raw data into a single plot. A Harrell plot encourages best practices such as exploration of the distribution of the data and focus on effect size and uncertainty, while discouraging bad practices such as ignoring distributions and focusing on \(p\)-values. Consequently, a Harrell plot should replace the bar plots and Cleveland dot plots that are currently ubiquitous in the literature.

What is the range of reasonable P-values given a two standard error difference in means?
https://rdoodles.rbind.io/2018/03/what-is-the-range-of-reasonable-p-values-given-a-two-standard-error-difference-in-means/
Sun, 18 Mar 2018 00:00:00 +0000
Here is the motivating quote for this post, from Andrew Gelman’s blog post “Five ways to fix statistics”
I agree with just about everything in Leek’s article except for this statement: “It’s also impractical to say that statistical metrics such as P values should not be used to make decisions. Sometimes a decision (editorial or funding, say) must be made, and clear guidelines are useful.” Yes, decisions need to be made, but to suggest that p-values be used to make editorial or funding decisions—that’s just horrible.

Bias in pre-post designs -- An example from the Turnbaugh et al (2006) mouse fecal transplant study
https://rdoodles.rbind.io/2018/03/bias-in-pre-post-designs-an-example-from-the-turnbaugh-et-al-2006-mouse-fecal-transplant-study/
Thu, 08 Mar 2018 00:00:00 +0000
This post is motivated by a twitter link to a recent blog post critical of the old but influential study “An obesity-associated gut microbiome with increased capacity for energy harvest”, which has impressive citation metrics. In the post, Matthew Dalby smartly used the available data to reconstruct the final weights of the two groups. He showed these final weights were nearly the same, which is not good evidence for a treatment effect, given that the treatment was randomized among groups.

What is an R doodle?
https://rdoodles.rbind.io/2018/03/what-is-an-r-doodle/
Wed, 07 Mar 2018 00:00:00 +0000
An R doodle is a short script to check intuition or understanding. Almost always, this involves generating fake data. I might create an R doodle when I’m reviewing a manuscript or reading a published paper and I want to check if their statistical analysis is doing what the authors think it is doing. Or maybe I create it to help me figure out what the authors are doing. Or I might be teaching some method and I create an R doodle to help me understand how the method behaves given different input (fake) data sets.
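To make the idea concrete, here is a minimal sketch of what such a doodle might look like. It is not taken from any post on this site. It checks the intuition that a two-sample t-test rejects a true null about 5% of the time, and the seed, number of iterations, and sample size are arbitrary choices made only for this illustration.

```r
# A tiny doodle (illustrative only): generate fake data under a true null
# and see how often a two-sample t-test returns p < 0.05.
set.seed(1)        # arbitrary seed, for reproducibility only
n_iter <- 2000     # number of fake data sets (arbitrary choice)
n <- 10            # sample size per group (arbitrary choice)
p <- numeric(n_iter)
for (i in seq_len(n_iter)) {
  control <- rnorm(n)  # both groups are drawn from the same distribution,
  treated <- rnorm(n)  # so the true effect is zero
  p[i] <- t.test(control, treated, var.equal = TRUE)$p.value
}
type_1 <- mean(p < 0.05)  # observed frequency of "significant" results
type_1
```

If the observed frequency is far from 0.05, either the intuition or the script is wrong, and finding out which is the point of the exercise. The same skeleton can be reused by swapping in a different fake-data generator or a different test.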