R Doodles

R Doodles https://rdoodles.rbind.io/ Recent content on R Doodles Sun, 27 Dec 2020 00:00:00 +0000 How to make plots with factor levels below the x-axis (bench-biology style) https://rdoodles.rbind.io/2020/12/how-to-make-plots-with-factor-levels-below-the-x-axis-bench-biology-style/ Sun, 27 Dec 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/12/how-to-make-plots-with-factor-levels-below-the-x-axis-bench-biology-style/ The motivation for this post was to create a pipeline for generating publication-ready plots entirely within ggplot and avoid post-generation touch-ups in Illustrator or Inkscape. These scripts are a start. The ideal modification would be turning the chunks into functions with personalized detail so that a research team could quickly and efficiently generate multiple plots. I might try to turn the scripts into a very-general-but-not-ready-for-r-package function for my students. What is an interaction? https://rdoodles.rbind.io/2020/11/what-is-an-interaction/ Wed, 04 Nov 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/11/what-is-an-interaction/ A factorial experiment is one in which there are two or more factor variables (categorical $X$) that are crossed, resulting in a group for each combination of the levels of each factor. Factorial experiments are used to estimate the interaction effect between factors. Two factors interact when the effect of one factor depends on the level of the other factors. Interactions are ubiquitous, although sometimes they are small enough to ignore with little to no loss of understanding.
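A minimal sketch of the definition above, using made-up cell means for a 2 x 2 factorial design. The factor names (`a`, `b`) and all numbers are hypothetical and do not come from the post. The interaction is computed as the change in the effect of one factor between the levels of the other; with R's default dummy coding, the same quantity appears as the coefficient of the `a:b` term.

```r
# Hypothetical cell means for a 2 x 2 factorial experiment (illustrative numbers only)
cell_means <- data.frame(
  a = factor(c("a0", "a1", "a0", "a1")),
  b = factor(c("b0", "b0", "b1", "b1")),
  y = c(10, 12, 13, 20)
)
y <- cell_means$y

# Effect of A (a1 - a0) at each level of B
effect_a_at_b0 <- y[2] - y[1]
effect_a_at_b1 <- y[4] - y[3]

# The interaction is how much the effect of A changes between the levels of B
interaction_effect <- effect_a_at_b1 - effect_a_at_b0

# With dummy (treatment) coding, the a:b coefficient is the same quantity
fit <- lm(y ~ a * b, data = cell_means)
interaction_coef <- unname(coef(fit)["aa1:bb1"])
```

If the two effects of A were equal, `interaction_effect` would be zero and the factors would be additive.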
How to estimate synergism or antagonism https://rdoodles.rbind.io/2020/11/how-to-estimate-synergism-or-antagonism/ Tue, 03 Nov 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/11/how-to-estimate-synergism-or-antagonism/ motivating source: Integration of two herbivore-induced plant volatiles results in synergistic effects on plant defense and resistance What is synergism or antagonism? (this post is a follow up to What is an interaction?) In the experiment for Figure 1 of the motivating source article, the researchers were explicitly interested in measuring any synergistic effects of hac and indole on the response. What is a synergistic effect? If hac and indole act independently, then the response should be additive – the HAC+Indole effect should simply be the sum of the independent HAC and Indole effects. Type 3 ANOVA in R -- an easy way to publish wrong tables https://rdoodles.rbind.io/2020/10/type-3-anova-in-r-an-easy-way-to-publish-wrong-tables/ Fri, 23 Oct 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/10/type-3-anova-in-r-an-easy-way-to-publish-wrong-tables/ In R, so-called “Type I sums of squares” are default. With balanced designs, inferential statistics from Type I, II, and III sums of squares are equal. Type III sums of squares are returned using car::Anova instead of base R anova. But to get the correct Type III statistics, you cannot simply specify car::Anova(m1, type = 3). You also have to set the contrasts in the model matrix to contr.sum in your linear model fit. Linear models with a covariate ("ANCOVA") https://rdoodles.rbind.io/2020/10/linear-models-with-a-covariate-ancova/ Wed, 21 Oct 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/10/linear-models-with-a-covariate-ancova/ Normal Q-Q plots - what is the robust line and should we prefer it? 
https://rdoodles.rbind.io/2020/10/normal-q-q-plots-what-is-the-robust-line-and-should-we-prefer-it/ Thu, 15 Oct 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/10/normal-q-q-plots-what-is-the-robust-line-and-should-we-prefer-it/ Warning - This is a long, exploratory post on Q-Q plots motivated by the specific data set analyzed below and the code follows my stream of thinking this through. I have not gone back through to economize length. So yeh, some repeated code I’ve turned into functions and other repeated code is repeated. This post is not about how to interpret a Q-Q plot but about which Q-Q plot? to interpret. ANCOVA when the covariate is a mediator affected by treatment https://rdoodles.rbind.io/2020/07/ancova-when-the-covariate-is-a-mediator-affected-by-treatment/ Sun, 12 Jul 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/07/ancova-when-the-covariate-is-a-mediator-affected-by-treatment/ This is fake data that simulates an experiment to measure effect of treatment on fat weight in mice. The treatment is “diet” with two levels: “control” (blue dots) and “treated” (gold dots). Diet has a large effect on total body weight. The simulated data are in the plot above - these look very much like the real data. The question is, what are problems with using an “ancova” linear model to estimate the direct effect of treatment on fat weight? Bootstrap confidence intervals when sample size is really small https://rdoodles.rbind.io/2020/06/bootstrap-confidence-intervals-when-sample-size-is-really-small/ Mon, 01 Jun 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/06/bootstrap-confidence-intervals-when-sample-size-is-really-small/ TL;DR A sample table from the full results for data that look like this Table 1: Coverage of 95% bca CIs. parameter n=5 n=10 n=20 n=40 n=80 means Control 81.4 87.6 92.2 93.0 93.6 b4GalT1-/- 81.3 90.2 90.8 93.0 93.8 difference in means diff 83. What is the consequence of normalizing by each case in the control? 
https://rdoodles.rbind.io/2020/05/what-is-the-consequence-of-normalizing-by-each-case-in-the-control/ Mon, 18 May 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/05/what-is-the-consequence-of-normalizing-by-each-case-in-the-control/ Motivator: Novel metabolic role for BDNF in pancreatic β-cell insulin secretion I’ll finish this some day… knitr::opts_chunk$set(echo = TRUE, message=FALSE) library(tidyverse) library(data.table) library(mvtnorm) library(lmerTest) normal response niter <- 2000 n <- 9 treatment_levels <- c("cn", "high", "high_bdnf") insulin <- data.table(treatment = rep(treatment_levels, each=n)) X <- model.matrix(~ treatment, data=insulin) beta <- c(0,0,0) # no effects # the three responses are taken from the same cluster of cells and so have expected # correlation rho. Melting a list of columns https://rdoodles.rbind.io/2020/04/melting-a-list-of-columns/ Mon, 27 Apr 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/04/melting-a-list-of-columns/ An answer to this tweet “Are there any #Rstats tidy expeRts who’d be interested in improving the efficiency of this code that gathers multiple variables from wide to long? This works but it’s not pretty. There must be a prettier way…" Wide data frame has three time points where participants answer two questions on two topics. create data from original code #Simmed data Time1.Topic1.Question1 <- rnorm(500) data <- data.frame(Time1.Topic1.Question1) data$Time1.TOpic1.Question2 <- rnorm(500) data$Time1. Analyzing longitudinal data -- a simple pre-post design https://rdoodles.rbind.io/2020/03/analyzing-longitudinal-data-a-simple-pre-post-design/ Thu, 19 Mar 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/03/analyzing-longitudinal-data-a-simple-pre-post-design/ A skeletal response to a twitter question: “ANOVA (time point x group) or ANCOVA (group with time point as a covariate) for intervention designs? 
Discuss.” follow-up “Only 2 time points in this case (pre- and post-intervention), and would wanna basically answer the question of whether out of the 3 intervention groups, some improve on measure X more than others after the intervention” Here I compare five methods using fake pre-post data, including Using Wright's rules and a DAG to compute the bias of an effect when we measure proxies for X and Y https://rdoodles.rbind.io/2020/02/using-wright-s-rules-and-a-dag-to-compute-the-bias-of-an-effect-when-we-measure-proxies-for-x-and-y/ Fri, 07 Feb 2020 00:00:00 +0000 https://rdoodles.rbind.io/2020/02/using-wright-s-rules-and-a-dag-to-compute-the-bias-of-an-effect-when-we-measure-proxies-for-x-and-y/ This is a skeletal post to work up an answer to a twitter question using Wright’s rules of path models. The figure used here is Panel A of a figure from Hernan and Cole, with scribbled red path coefficients added. The question is: I want to know about A->Y but I measure A* and Y*. So in figure A, is the bias the backdoor path from A* to Y* through A and Y? "Nested" random factors in mixed (multilevel or hierarchical) models https://rdoodles.rbind.io/2019/11/nested-random-factors-in-mixed-multilevel-or-hierarchical-models/ Sat, 30 Nov 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/11/nested-random-factors-in-mixed-multilevel-or-hierarchical-models/ Setup Import Models as nested using “tank” nested within “room” as two random intercepts (using lme4 to create the combinations) A safer (lme4) way to create the combinations of “room” and “tank”: as two random intercepts using “tank2” Don’t do this This is a skeletal post to show the equivalency of different ways of thinking about “nested” factors in a mixed model. The data are measures of life history traits in lice that infect salmon. 
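A small base-R sketch of why the labeling of the nested factor matters. The layout below is fake (two rooms, with tank labels reused in each room), and the reading of “tank2” as a label that is unique to each room-by-tank combination is an assumption, since the summary above does not define it. The lme4 formulas appear only as comments, so the block runs without lme4 installed.

```r
# Hypothetical layout: tank labels "t1" and "t2" are reused within each room
fake <- data.frame(
  room = factor(rep(c("r1", "r2"), each = 4)),
  tank = factor(rep(c("t1", "t2"), times = 4))
)

# "tank" alone has only 2 levels even though there are 4 physical tanks
n_tank_labels <- nlevels(fake$tank)

# "tank2" gives every room-by-tank combination its own label
fake$tank2 <- interaction(fake$room, fake$tank, sep = "_", drop = TRUE)
n_tanks <- nlevels(fake$tank2)

# With lme4 (not loaded here), these random-effect terms describe the same nesting:
#   (1 | room/tank)
#   (1 | room) + (1 | room:tank)
#   (1 | room) + (1 | tank2)
# whereas (1 | room) + (1 | tank) would treat "t1" in both rooms as the same tank.
```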
Estimate of marginal ("main") effects instead of ANOVA for factorial experiments https://rdoodles.rbind.io/2019/11/estimate-of-marginal-main-effects-instead-of-anova-for-factorial-experiments/ Fri, 29 Nov 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/11/estimate-of-marginal-main-effects-instead-of-anova-for-factorial-experiments/ Background Comparing marginal effects to main effect terms in an ANOVA table First, some fake data Comparison of marginal effects vs. “main” effects term of ANOVA table when data are balanced Comparison of marginal effects vs. “main” effects term of ANOVA table when data are unbalanced When to estimate marginal effects keywords: estimation, ANOVA, factorial, model simplification, conditional effects, marginal effects Background I recently read a paper from a very good ecology journal that communicated the results of an ANOVA like that below (Table 1) using a statement similar to “The removal of crabs strongly decreased algae cover (\(F_{1,36} = 17. Can a linear model reproduce a Welch t-test? https://rdoodles.rbind.io/2019/10/can-a-linear-model-reproduce-a-welch-t-test/ Sun, 27 Oct 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/10/can-a-linear-model-reproduce-a-welch-t-test/ This doodle was motivated by Jake Westfall’s answer to a Cross-Validated question. The short answer is yes but most R scripts that I’ve found on the web are unsatisfying because only the t-value reproduces, not the df and p-value. Jake notes the reason for this in his answer on Cross-Validated. To get the adjusted df, and the p-value associated with this, one can use the emmeans package by Russell Lenth, as he notes here. 
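For reference, here is a hand calculation of the three numbers a linear-model route would need to reproduce: the Welch t-value, the Welch-Satterthwaite adjusted df, and the p-value. This is not the emmeans approach mentioned in the summary, only a base-R sketch on fake data (the sample sizes, means and standard deviations are arbitrary), checked against `t.test()`.

```r
# Fake data with unequal variances and unequal n (illustrative only)
set.seed(1)
y1 <- rnorm(8, mean = 10, sd = 1)
y2 <- rnorm(12, mean = 11, sd = 3)

# Squared standard error of each group mean
v1 <- var(y1) / length(y1)
v2 <- var(y2) / length(y2)

# Welch t statistic: difference in means over the unpooled standard error
t_welch <- (mean(y1) - mean(y2)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation of the degrees of freedom
df_welch <- (v1 + v2)^2 / (v1^2 / (length(y1) - 1) + v2^2 / (length(y2) - 1))

# Two-sided p-value using the adjusted df
p_welch <- 2 * pt(-abs(t_welch), df = df_welch)

# Reference: base R's Welch test (var.equal = FALSE is the default)
ref <- t.test(y1, y2, var.equal = FALSE)
```

A linear model with a single pooled residual variance uses `n1 + n2 - 2` degrees of freedom, which is why only a model that allows group-specific variances, combined with a df adjustment, can match `df_welch`.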
Normalization results in regression to the mean and inflated Type I error conditional on the reference values https://rdoodles.rbind.io/2019/10/normalization-results-in-regression-to-the-mean/ Wed, 16 Oct 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/10/normalization-results-in-regression-to-the-mean/ Fig 1C of the Replication Study: Melanoma exosomes educate bone marrow progenitor cells toward a pro-metastatic phenotype through MET uses an odd (to me) three stage normalization procedure for the quantified western blots. The authors compared blot values between a treatment (shMet cells) and a control (shScr cells) using GAPDH to normalize the values. The three stages of the normalization are first, the value for the Antibody levels were normalized by the value of a reference (GAPDH) for each Set. A comment on the novel transformation of the response in " Senolytics decrease senescent cells in humans: Preliminary report from a clinical trial of Dasatinib plus Quercetin in individuals with diabetic kidney disease" https://rdoodles.rbind.io/2019/10/a-comment-on-the-novel-transformation-of-the-response-in-senolytics-decrease-senescent-cells-in-humans-preliminary-report-from-a-clinical-trial-of-dasatinib-plus-quercetin-in-individuals-with-diabetic-kidney-disease/ Wed, 02 Oct 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/10/a-comment-on-the-novel-transformation-of-the-response-in-senolytics-decrease-senescent-cells-in-humans-preliminary-report-from-a-clinical-trial-of-dasatinib-plus-quercetin-in-individuals-with-diabetic-kidney-disease/ Motivation: https://pubpeer.com/publications/8DF6E66FEFAA2C3C7D5BD9C3FC45A2#2 and https://twitter.com/CGATist/status/1175015246282539009 tl;dr: Given the transformation done by the authors, for any response in day_0 that is unusually small, there is automatically a response in day_14 that is unusually big and vice-versa. 
Consequently, if the mean for day_0 is unusually small, the mean for day_14 is automatically unusually big, hence the elevated type I error with an unpaired t-test. The transformation is necessary and sufficient to produce the result (meaning even in conditions where a paired t-test isn’t needed, the transformation still produces elevated Type I error). What is the consequence of a Shapiro-Wilk test-of-normality filter on Type I error and Power? https://rdoodles.rbind.io/2019/08/what-is-the-consequence-of-a-shapiro-wilk-test-of-normality-filter-on-type-i-error-and-power/ Thu, 08 Aug 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/08/what-is-the-consequence-of-a-shapiro-wilk-test-of-normality-filter-on-type-i-error-and-power/ Set up Normal distribution Type I error Power Right skewed continuous – lognormal What the parameterizations look like Type I error Power This 1990-wants-you-back doodle explores the effects of a Normality Filter – using a Shapiro-Wilk (SW) test as a decision rule for using either a t-test or some alternative such as a 1) non-parametric Mann-Whitney-Wilcoxon (MWW) test, or 2) a t-test on the log-transformed response. What is the bias in the estimation of an effect given an omitted interaction term? https://rdoodles.rbind.io/2019/07/what-is-bias-in-the-estimation-of-an-effect-giving-an-omitted-interaction-term/ Wed, 31 Jul 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/07/what-is-bias-in-the-estimation-of-an-effect-giving-an-omitted-interaction-term/ Some background (due to Sewall Wright’s method of path analysis) Given a generating model: \[\begin{equation} y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \end{equation}\] where $x_3 = x_1 x_2$; that is, it is an interaction variable. The total effect of $x_1$ on $y$ is $\beta_1 + \frac{\mathrm{COV}(x_1, x_2)}{\mathrm{VAR}(x_1)} \beta_2 + \frac{\mathrm{COV}(x_1, x_3)}{\mathrm{VAR}(x_1)} \beta_3$. 
If $x_3$ (the interaction) is missing, its component on the total effect is added to the coefficient of $x_1$. Is the power to test an interaction effect less than that for a main effect? https://rdoodles.rbind.io/2019/07/is-the-power-to-test-an-interaction-effect-less-than-that-for-a-main-effect/ Tue, 02 Jul 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/07/is-the-power-to-test-an-interaction-effect-less-than-that-for-a-main-effect/ I was googling around and somehow landed on a page that stated “When effect coding is used, statistical power is the same for all regression coefficients of the same size, whether they correspond to main effects or interactions, and irrespective of the order of the interaction”. Really? How could this be? The p-value for an interaction effect is the same regardless of dummy or effects coding, and, with dummy coding (R’s default), the power of the interaction effect is less than that of the coefficients for the main factors when they have the same magnitude, so my intuition said this statement must be wrong. Analyze the mean (or median) and not the max response https://rdoodles.rbind.io/2019/06/analyze-the-mean-or-median-and-not-the-max-response/ Tue, 25 Jun 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/06/analyze-the-mean-or-median-and-not-the-max-response/ This is an update of Paired t-test as a special case of linear model and hierarchical model Figure 2A of the paper Meta-omics analysis of elite athletes identifies a performance-enhancing microbe that functions via lactate metabolism uses a paired t-test to compare endurance performance in mice treated with a control microbe (Lactobacillus bulgaricus) and a test microbe (Veillonella atypica) in a cross-over design (so each mouse was treated with both bacteria). 
Paired t-test as a special case of linear model and hierarchical (linear mixed) model https://rdoodles.rbind.io/2019/06/paired-t-test-as-a-special-case-of-linear-model-and-hierarchical-linear-mixed-model/ Tue, 25 Jun 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/06/paired-t-test-as-a-special-case-of-linear-model-and-hierarchical-linear-mixed-model/ Update – Fig. 2A is an analysis of the maximum endurance over three trials. This has consequences. Figure 2A of the paper Meta-omics analysis of elite athletes identifies a performance-enhancing microbe that functions via lactate metabolism uses a paired t-test to compare endurance performance in mice treated with a control microbe (Lactobacillus bulgaricus) and a test microbe (Veillonella atypica) in a cross-over design (so each mouse was treated with both bacteria). What does cell biology data look like? https://rdoodles.rbind.io/2019/06/what-does-cell-biology-data-look-like/ Sun, 09 Jun 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/06/what-does-cell-biology-data-look-like/ If I’m going to evaluate the widespread use of t-tests/ANOVAs on count data in bench biology then I’d like to know what these data look like, specifically the shape (“overdispersion”) parameter. Set up library(ggplot2) library(readxl) library(ggpubr) library(cowplot) library(plyr) #mapvalues library(data.table) # glm packages library(MASS) library(pscl) #zeroinfl library(DHARMa) library(mvabund) data_path <- "../data" # notebook, console source("../../../R/clean_labels.R") # notebook, console Data from The enteric nervous system promotes intestinal health by constraining microbiota composition Import read_enteric <- function(sheet_i, range_i, file_path, wide_2_long=TRUE){ dt_wide <- data. 
Reanalyzing data from Human Gut Microbiota from Autism Spectrum Disorder Promote Behavioral Symptoms in Mice https://rdoodles.rbind.io/2019/06/reanalyzing-data-from-human-gut-microbiota-from-autism-spectrum-disorder-promote-behavioral-symptoms-in-mice/ Mon, 03 Jun 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/06/reanalyzing-data-from-human-gut-microbiota-from-autism-spectrum-disorder-promote-behavioral-symptoms-in-mice/ Update - This post has been updated A very skeletal analysis of Sharon, G., Cruz, N.J., Kang, D.W., Gandal, M.J., Wang, B., Kim, Y.M., Zink, E.M., Casey, C.P., Taylor, B.C., Lane, C.J. and Bramer, L.M., 2019. Human Gut Microbiota from Autism Spectrum Disorder Promote Behavioral Symptoms in Mice. Cell, 177(6), pp.1600-1618. which got some attention on pubpeer. Commenters are questioning the result of Fig1G. It is very hard to infer a p-value from plots like these, where the data are multi-level, regardless of whether means and some kind of error bar are presented. GLM vs. t-tests vs. non-parametric tests if all we care about is NHST -- Update https://rdoodles.rbind.io/2019/05/glm-vs-t-tests-vs-non-parametric-tests-if-all-we-care-about-is-nhst-update/ Thu, 30 May 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/05/glm-vs-t-tests-vs-non-parametric-tests-if-all-we-care-about-is-nhst-update/ Update to the earlier post, which was written in response to my own thinking about how to teach statistics to experimental biologists working in fields that are dominated by hypothesis testing instead of estimation. That is, should these researchers learn GLMs or is a t-test on raw or log-transformed data on something like count data good enough – or even superior? My post was written without the benefit of either [Ives](Ives, Anthony R. Should we be skeptical of a "large" effect size if p > 0.05? 
https://rdoodles.rbind.io/2019/05/should-we-be-skeptical-of-a-large-effect-size-if-p-0-05/ Tue, 28 May 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/05/should-we-be-skeptical-of-a-large-effect-size-if-p-0-05/ Motivator: A twitter comment “Isn’t the implication that the large effect size is a direct byproduct of the lack of power? i.e. that if the the study had more power, the effect size would have been found to be smaller.” A thought: our belief in the magnitude of an observed effect should be based on our priors, which, hopefully, are formed from good mechanistic models and not sample size. Blocking vs. covariate adjustment https://rdoodles.rbind.io/2019/04/blocking-vs-covariate-adjustment/ Sat, 27 Apr 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/04/blocking-vs-covariate-adjustment/ “A more efficient design would be to first group the rats into homogeneous subsets based on baseline food consumption. This could be done by ranking the rats from heaviest to lightest eaters and then grouping them into pairs by taking the first two rats (the two that ate the most during baseline), then the next two in the list, and so on. The difference from a completely randomised design is that one rat within each pair is randomised to one of the treatment groups, and the other rat is then assigned to the remaining treatment group. 
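The pairing procedure in the quote can be sketched in a few lines of base R. Everything here is hypothetical: the number of rats, the baseline values, and the treatment labels ("control", "treated") are placeholders, not values from the post or its source.

```r
set.seed(42)

# Hypothetical baseline food consumption for 12 rats (fake data)
rats <- data.frame(id = 1:12, baseline = rnorm(12, mean = 20, sd = 3))

# Rank from heaviest to lightest eaters and group consecutive rats into pairs
rats <- rats[order(rats$baseline, decreasing = TRUE), ]
rats$pair <- rep(seq_len(nrow(rats) / 2), each = 2)

# Within each pair, randomly assign one rat to each treatment group
rats$treatment <- unname(unlist(lapply(
  split(rats$id, rats$pair),
  function(ids) sample(c("control", "treated"))
)))

# Each pair (block) should contain exactly one rat per treatment
pair_table <- table(rats$pair, rats$treatment)
```

Because rats within a pair have similar baseline consumption, `pair` can then enter the analysis as a blocking factor.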
The statistical significance filter https://rdoodles.rbind.io/2019/04/the-statistical-significance-filter/ Wed, 17 Apr 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/04/the-statistical-significance-filter/ 1 Why reported effect sizes are inflated 2 Setup 3 Exploration 1 4 Unconditional means, power, and sign error 5 Conditional means 5.1 filter = 0.05 5.2 filter = 0.2 1 Why reported effect sizes are inflated This post is motivated by many discussions in Gelman’s blog but start here When we estimate an effect, the estimate will be a little inflated or a little diminished relative to the true effect but the expectation of the effect is the true effect. Covariate adjustment in randomized experiments https://rdoodles.rbind.io/2019/04/covariate-adjustment-in-randomized-experiments/ Fri, 12 Apr 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/04/covariate-adjustment-in-randomized-experiments/ This post is motivated by a tweetorial from Darren Dahly In an experiment, do we adjust for covariates that differ between treatment levels measured pre-experiment (“imbalance” in random assignment), where a difference is inferred from a t-test with p < 0.05? Or do we adjust for all covariates, regardless of differences pre-test? Or do we adjust only for covariates that have substantial correlation with the outcome? Or do we not adjust at all? What to write, and not write, in a results section — an ever-growing list https://rdoodles.rbind.io/2019/01/what-to-write-and-not-write-in-a-results-section-an-ever-growing-list/ Thu, 31 Jan 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/01/what-to-write-and-not-write-in-a-results-section-an-ever-growing-list/ “GPP (n=4 per site) increased from the No Wildlife site to the Hippo site but was lowest at the Hippo + WB site (Fig. 
6); however, these differences were not significant due to low sample sizes and high variability.” If we know these are not significant due to low sample size and high variability, why even do the test? 
“TRE led to a modest, but not significant, increase in sleep duration to 449. Paired line plots https://rdoodles.rbind.io/2019/01/paired-line-plots/ Tue, 22 Jan 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/01/paired-line-plots/ load libraries make some fake data make a plot with ggplot ggplot scripts to draw figures like those in the Dynamic Ecology post Paired line plots (a.k.a. “reaction norms”) to visualize Likert data load libraries library(ggplot2) library(ggpubr) library(data.table) make some fake data set.seed(3) n <- 40 self <- rbinom(n, 5, 0.25) + 1 others <- self + rbinom(n, 3, 0.5) fd <- data.table(id=factor(rep(1:n, 2)), who=factor(rep(c("self", "others"), each=n)), stigma = c(self, others)) make a plot with ggplot The students are identified by the column “id”. GLM vs. t-tests vs. non-parametric tests if all we care about is NHST https://rdoodles.rbind.io/2019/01/glm-vs-non-parametric-tests-if-all-we-care-about-is-nhst/ Mon, 07 Jan 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/01/glm-vs-non-parametric-tests-if-all-we-care-about-is-nhst/ This post has been updated. A skeleton simulation of different strategies for NHST for count data if all we care about is a p-value, as in bench biology where p-values are used to simply give one confidence that something didn’t go terribly wrong (similar to doing experiments in triplicate – it’s not the effect size that matters only “we have experimental evidence of a replicable effect”). tl;dr - At least for Type I error at small $n$, log(response) and Wilcoxon have the best performance over the simulation space. 
Expected covariances in a causal network https://rdoodles.rbind.io/2019/01/expected-covariances-in-a-causal-network/ Thu, 03 Jan 2019 00:00:00 +0000 https://rdoodles.rbind.io/2019/01/expected-covariances-in-a-causal-network/ This is a skeleton post Standardized variables (Wright’s rules) n <- 10^5 # z is the common cause of g1 and g2 z <- rnorm(n) # effects of z on g1 and g2 b1 <- 0.7 b2 <- 0.7 r12 <- b1*b2 g1 <- b1*z + sqrt(1-b1^2)*rnorm(n) g2 <- b2*z + sqrt(1-b2^2)*rnorm(n) var(g1) # E(VAR(g1)) = 1 ## [1] 1.001849 var(g2) # E(VAR(g2)) = 1 ## [1] 1.006102 cor(g1, g2) # E(COR(g1,g2)) = b1*b2 ## [1] 0. Compute a random data matrix (fake data) without rmvnorm https://rdoodles.rbind.io/2018/12/compute-a-random-data-matrix-fake-data-without-rmvnorm/ Thu, 20 Dec 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/12/compute-a-random-data-matrix-fake-data-without-rmvnorm/ This is a skeleton post until I have time to flesh it out. The post is motivated by a question on twitter about creating fake data that has a covariance matrix that simulates a known (given) covariance matrix that has one or more negative (or zero) eigenvalues. First, some libraries library(data.table) library(mvtnorm) library(MASS) Second, some functions… random.sign <- function(u){ # this is fastest of three out <- sign(runif(u)-0.5) #randomly draws from {-1,1} with probability of each = 0. 
Reporting effects as relative differences...with a confidence interval https://rdoodles.rbind.io/2018/11/reporting-effects-as-relative-differences-with-a-confidence-interval/ Wed, 14 Nov 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/11/reporting-effects-as-relative-differences-with-a-confidence-interval/ Researchers frequently report results as relative effects, for example, “Male flies from selected lines had 50% larger upwind flight ability than male flies from control lines (Control mean: 117.5 cm/s; Selected mean 176.5 cm/s).” where a relative effect is \[\begin{equation} 100 \frac{\bar{y}_B - \bar{y}_A}{\bar{y}_A} \end{equation}\] If we are to follow best practices, we should present this effect with a measure of uncertainty, such as a confidence interval. The absolute effect is 59. Interaction plots with ggplot2 https://rdoodles.rbind.io/2018/10/interaction-plots-with-ggplot2/ Mon, 15 Oct 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/10/interaction-plots-with-ggplot2/ ggpubr is a fantastic resource for teaching applied biostats because it makes ggplot a bit easier for students. I’m not super familiar with all that ggpubr can do, but I’m not sure it includes a good “interaction plot” function. Maybe I’m wrong. But if I’m not, here is a simple function to create a gg_interaction plot. The gg_interaction function returns a ggplot of the modeled means and standard errors and not the raw means and standard errors computed from each group independently. 
Textbook error 101 -- A low p-value for the full model does not mean that the model is a good predictor of the response https://rdoodles.rbind.io/2018/09/a-low-p-value-for-the-full-model-does-not-mean-that-the-model-is-a-good-predictor-of-the-response/ Tue, 11 Sep 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/09/a-low-p-value-for-the-full-model-does-not-mean-that-the-model-is-a-good-predictor-of-the-response/ On page 606, of Lock et al “Statistics: Unlocking the Power of Data”, the authors state in item D “The p-value from the ANOVA table is 0.000 so the model as a whole is effective at predicting grade point average.” Ah no. library(data.table) library(mvtnorm) rho <- 0.5 n <- 10^5 Sigma <- diag(2) Sigma[1,2] <- Sigma[2,1] <- rho X <- rmvnorm(n, mean=c(0,0), sigma=Sigma) colnames(X) <- c("X1", "X2") beta <- c(0.01, -0. A simple ggplot of some measure against depth https://rdoodles.rbind.io/2018/09/a-simple-ggplot-of-some-measure-against-depth/ Mon, 10 Sep 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/09/a-simple-ggplot-of-some-measure-against-depth/ set up The goal is to plot the measure of something, say O2 levels, against depth (soil or lake), with the measures taken on multiple days library(ggplot2) library(data.table) First – create fake data depths <- c(0, seq(10,100, by=10)) dates <- c("Jan-18", "Mar-18", "May-18", "Jul-18") x <- expand.grid(date=dates, depth=depths) n <- nrow(x) head(x) ## date depth ## 1 Jan-18 0 ## 2 Mar-18 0 ## 3 May-18 0 ## 4 Jul-18 0 ## 5 Jan-18 10 ## 6 Mar-18 10 X <- model. Should the model-averaged prediction be computed on the link or response scale in a GLM? 
https://rdoodles.rbind.io/2018/05/should-the-model-averaged-prediction-be-computed-on-the-link-or-response-scale-in-a-glm/ Sun, 13 May 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/05/should-the-model-averaged-prediction-be-computed-on-the-link-or-response-scale-in-a-glm/ [updated to include additional output from MuMIn, BMA, and BAS] This post is a follow up to my initial post, which was written as a way for me to pen my mental thoughts on the recent review of “Model averaging in ecology: a review of Bayesian, information‐theoretic and tactical approaches for predictive inference”. It was also written without contacting and discussing the issue with the authors. This post benefits from a series of e-mails with the lead author Carsten Dormann and the last author Florian Hartig. On model averaging the coefficients of linear models https://rdoodles.rbind.io/2018/05/on-model-averaging-regression-coefficients/ Thu, 10 May 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/05/on-model-averaging-regression-coefficients/ a shorter argument based on a specific example is here “What model averaging does not mean is averaging parameter estimates, because parameters in different models have different meanings and should not be averaged, unless you are sure you are in a special case in which it is safe to do so.” – Richard McElreath, p. 196 of the textbook I wish I had learned from Statistical Rethinking This is an infrequent but persistent criticism of model-averaged coefficients in the applied statistics literature on model averaging. 
An even more compact defense of coefficient model averaging https://rdoodles.rbind.io/2018/05/an-even-more-compact-defense-of-coefficient-model-averaging/ Mon, 07 May 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/05/an-even-more-compact-defense-of-coefficient-model-averaging/ a longer, more detailed argument is here The parameter that is averaged “needs to have the same meaning in all “models” for the equations to be straightforwardly interpretable; the coefficient of x1 in a regression of y on x1 is a different beast than the coefficient of x1 in a regression of y on x1 and x2.” – David Draper in a comment on Hoeting et al. 1999. David Draper suggested this example from the textbook by Freedman, Pisani and Purves. Model-averaged coefficients of a GLM https://rdoodles.rbind.io/2018/05/model-averaged-coefficients-of-a-glm/ Fri, 04 May 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/05/model-averaged-coefficients-of-a-glm/ This is a very quick post as a comment to the statement “For linear models, predicting from a parameter-averaged model is mathematically identical to averaging predictions, but this is not the case for non-linear models…For non-linear models, such as GLMs with log or logit link functions g(x)1, such coefficient averaging is not equivalent to prediction averaging.” from the supplement of Dormann et al. Model averaging in ecology: a review of Bayesian, information‐theoretic and tactical approaches for predictive inference. On alpha https://rdoodles.rbind.io/2018/04/on-alpha/ Mon, 23 Apr 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/04/on-alpha/ This post is motivated by Terry McGlynn’s thought provoking How do we move beyond an arbitrary statistical threshold? I have been struggling with the ideas explored in Terry’s post ever since starting my PhD 30 years ago, and it’s only been in the last couple of years that my own thoughts have begun to gel. This long marination period is largely because of my very classical biostatistical training. 
My PhD is from the Department of Anatomical Sciences at Stony Brook, but the content was geometric morphometrics, and James Rohlf was my mentor for morphometrics specifically and multivariate statistics more generally. Combining data, distribution summary, model effects, and uncertainty in a single plot https://rdoodles.rbind.io/2018/03/combining-data-distribution-summary-model-effects-and-uncertainty-in-a-single-plot/ Tue, 27 Mar 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/03/combining-data-distribution-summary-model-effects-and-uncertainty-in-a-single-plot/ A Harrell plot combines a forest plot of estimated treatment effects and uncertainty, a dot plot of raw data, and a box plot of the distribution of the raw data into a single plot. A Harrell plot encourages best practices such as exploration of the distribution of the data and focus on effect size and uncertainty, while discouraging bad practices such as ignoring distributions and focusing on $p$-values. Consequently, a Harrell plot should replace the bar plots and Cleveland dot plots that are currently ubiquitous in the literature. What is the range of reasonable P-values given a two standard error difference in means? https://rdoodles.rbind.io/2018/03/what-is-the-range-of-reasonable-p-values-given-a-two-standard-error-difference-in-means/ Sun, 18 Mar 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/03/what-is-the-range-of-reasonable-p-values-given-a-two-standard-error-difference-in-means/ Here is the motivating quote for this post, from Andrew Gelman’s blog post “Five ways to fix statistics”: I agree with just about everything in Leek’s article except for this statement: “It’s also impractical to say that statistical metrics such as P values should not be used to make decisions. Sometimes a decision (editorial or funding, say) must be made, and clear guidelines are useful.” Yes, decisions need to be made, but to suggest that p-values be used to make editorial or funding decisions—that’s just horrible. 
Bias in pre-post designs -- An example from the Turnbaugh et al (2006) mouse fecal transplant study https://rdoodles.rbind.io/2018/03/bias-in-pre-post-designs-an-example-from-the-turnbaugh-et-al-2006-mouse-fecal-transplant-study/ Thu, 08 Mar 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/03/bias-in-pre-post-designs-an-example-from-the-turnbaugh-et-al-2006-mouse-fecal-transplant-study/ This post is motivated by a Twitter link to a recent blog post critical of an old but influential study with impressive citation metrics, “An obesity-associated gut microbiome with increased capacity for energy harvest”. In the post, Matthew Dalby smartly used the available data to reconstruct the final weights of the two groups. He showed these final weights were nearly the same, which is not good evidence for a treatment effect, given that the treatment was randomized among groups. What is an R doodle? https://rdoodles.rbind.io/2018/03/what-is-an-r-doodle/ Wed, 07 Mar 2018 00:00:00 +0000 https://rdoodles.rbind.io/2018/03/what-is-an-r-doodle/ An R doodle is a short script to check intuition or understanding. Almost always, this involves generating fake data. I might create an R doodle when I’m reviewing a manuscript or reading a published paper and I want to check if their statistical analysis is doing what the authors think it is doing. Or maybe I create it to help me figure out what the authors are doing. Or I might be teaching some method and I create an R doodle to help me understand how the method behaves given different input (fake) data sets. 
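A minimal sketch of what a doodle in this spirit might look like. It is not taken from the post, and the seed, sample size, and number of iterations are arbitrary, hypothetical choices. It generates fake data with no true treatment effect and checks the intuition that a two-sample t-test should then reject at the 0.05 level in roughly 5% of data sets.

```r
# A doodle: with no true effect, how often does a t-test give p < 0.05?
set.seed(1)      # arbitrary seed so the doodle is reproducible
n_iter <- 2000   # number of fake data sets (hypothetical choice)
n <- 10          # sample size per group (hypothetical choice)

p_values <- replicate(n_iter, {
  control <- rnorm(n, mean = 0, sd = 1)  # fake control group
  treated <- rnorm(n, mean = 0, sd = 1)  # fake treated group, same true mean
  t.test(treated, control, var.equal = TRUE)$p.value
})

# Fraction of fake data sets with p < 0.05 (the simulated Type I error rate)
type_1_rate <- mean(p_values < 0.05)
type_1_rate
```

Changing the fake data (unequal variances, skewed distributions, a nonzero difference in means) and rerunning is the kind of "different input data sets" exploration the description refers to.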
https://rdoodles.rbind.io/data/data-from-human-gut-microbiota-from-autism-spectrum-disorder-promote-behavioral-symptoms-in-mice/fig1/a2564b3a-0a93-4bf9-a424-084dac9878d3/ Mon, 01 Jan 0001 00:00:00 +0000 https://rdoodles.rbind.io/data/data-from-human-gut-microbiota-from-autism-spectrum-disorder-promote-behavioral-symptoms-in-mice/fig1/a2564b3a-0a93-4bf9-a424-084dac9878d3/ Mendeley Data - Viewer - Fig1C_ObsOTUs_16samples.csv