index improvements
mikabr committed Nov 6, 2024
1 parent 1827375 commit fde812d
Showing 10 changed files with 27 additions and 21 deletions.
003-replication.qmd (2 changes: 1 addition & 1 deletion)
@@ -129,7 +129,7 @@ And yet, RPP's results are still important and compelling, and they undeniably c

## Replication

-Beyond verifying a paper's original analysis pipeline,\index{analysis pipeline} we are often interested in understanding whether the study can be replicated---if we repeat the study methods and obtain new data, do we get similar results? To quote @popper2005 [p. 86], "The scientifically significant ... effect may be defined as that which can be regularly [replicated] by anyone who carries out the appropriate experiment in the way prescribed."
+Beyond verifying a paper's original analysis pipeline, we are often interested in understanding whether the study can be replicated---if we repeat the study methods and obtain new data, do we get similar results? To quote @popper2005 [p. 86], "The scientifically significant ... effect may be defined as that which can be regularly [replicated] by anyone who carries out the appropriate experiment in the way prescribed."

Replications can be conducted for many reasons [@schmidt2009]. One goal can be to verify that the results of an existing study can be obtained again if the study is repeated in exactly the same way, to the best of our abilities. A second goal can be to gain a more precise estimate of the effect of interest by conducting a larger replication study, or combining the results of a replication study with the existing study. A third goal can be to investigate whether an effect will persist when, for example, the experimental manipulation is done in a different, but still theory-consistent, manner. Alternatively, we might want to investigate whether the effect persists in a different population. Such replications are often efforts to "replicate and extend," and are common both when the same research team wants to conduct a sequence of experiments that each build on one another and when a new team wants to build on a result from a paper they have read [@rosenthal1990].

006-inference.qmd (6 changes: 3 additions & 3 deletions)
Expand Up @@ -300,7 +300,7 @@ In practice, the thing that is both tricky and good about Bayes Factors\index{Ba

Now let's turn back to NHST\index{null hypothesis significance testing (NHST)} and the $p$-value. We already have a working definition of what a $p$-value is from our discussion above: it's the probability of the data (or any data that would be more extreme) under the null hypothesis. How is this quantity related to either our Bayesian estimate\index{Bayesian estimation} or the BF? Well, the first thing to notice is that the $p$-value is very close (but not identical) to the likelihood itself.^[The likelihood---for both Bayesians and frequentists---is the probability of the data, just like the $p$-value. But unlike the $p$-value, it doesn't include the probability of more extreme data as well.]

-Next we can use a simple statistical test, a $t$-test,\index{t-test} to compute $p$-values for our experiment. In case you haven't encountered one, a $t$-test is a procedure for computing a $p$-value by comparing the distribution of two variables using the null hypothesis that there is no difference between them.^[$t$-tests can also be used in cases where one sample is being compared to some baseline.] The $t$-test uses the data to compute a **test statistic**\index{test statistic} whose distribution under the null hypothesis is known. Then the value of this statistic can be converted to $p$-values for making an inference.
+Next we can use a simple statistical test, a $t$-test,\index{t-test|(} to compute $p$-values for our experiment. In case you haven't encountered one, a $t$-test is a procedure for computing a $p$-value by comparing the distribution of two variables using the null hypothesis that there is no difference between them.^[$t$-tests can also be used in cases where one sample is being compared to some baseline.] The $t$-test uses the data to compute a **test statistic**\index{test statistic} whose distribution under the null hypothesis is known. Then the value of this statistic can be converted to $p$-values for making an inference.
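To make this conversion concrete, here is a minimal R sketch (the test statistic and degrees of freedom below are made-up numbers, not the chapter's data); it also shows how the $p$-value differs from the bare likelihood by folding in more extreme outcomes:

```r
# Made-up test statistic and degrees of freedom, purely for illustration
t_stat <- 2.1
df <- 38

# p-value: probability under H0 of a statistic at least this extreme (two-sided)
p_value <- 2 * pt(-abs(t_stat), df)

# likelihood: density of exactly this statistic under H0 (no "more extreme" tail)
likelihood <- dt(t_stat, df)

c(p = p_value, likelihood = likelihood)
```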

```{r inference-t-test}
set.seed(42)
@@ -437,7 +437,7 @@ Based on this evidence, should we conclude that precognition exists? Probably no

First, we've already discussed the need to be skeptical about situations where experimenters have the opportunity for analytic flexibility\index{analytic flexibility} in their choice of measures, manipulations, samples, and analyses. Flexibility leads to the possibility of cherry-picking the set of decisions from the "garden of forking paths" that leads to a positive outcome for the researcher's favored hypothesis (for more details, see @sec-prereg). And there is plenty of flexibility on display even in experiment 1 of Bem's paper. Although there were 100 participants in the study, they may have been combined post hoc from two distinct samples of 40 and 60, each of which saw different conditions. The 40 made guesses about the location of erotic, negative, and neutral pictures; the 60 saw erotic, positive non-romantic, and positive romantic pictures. The means of each of these conditions were presumably tested against chance (at least six comparisons, for a false positive\index{false positive} rate of `r round(1 - .95^6,2)`). Had positive romantic pictures been found significant, Bem certainly could have interpreted this finding the same way he interpreted the erotic ones.

-Second, as we discussed, a $p$-value close to 0.05 does not necessarily provide strong evidence against the null hypothesis. Wagenmakers et al. computed the Bayes Factor\index{Bayes Factor (BF)} for each of the experiments in Bem's paper and found that, in many cases, the amount of evidence for $H_1$ was quite modest under a default Bayesian $t$-test.\index{t-test} Experiment 1 was no exception: the BF was `r round(1/.61,2) # they reported bf_01`, giving only "anecdotal" support for the hypothesis of some nonzero effect, even before the multiple-comparisons problem mentioned above.
+Second, as we discussed, a $p$-value close to 0.05 does not necessarily provide strong evidence against the null hypothesis. Wagenmakers et al. computed the Bayes Factor\index{Bayes Factor (BF)} for each of the experiments in Bem's paper and found that, in many cases, the amount of evidence for $H_1$ was quite modest under a default Bayesian $t$-test.\index{t-test|)} Experiment 1 was no exception: the BF was `r round(1/.61,2) # they reported bf_01`, giving only "anecdotal" support for the hypothesis of some nonzero effect, even before the multiple-comparisons problem mentioned above.

Finally, since precognition is not supported by any prior compelling scientific evidence (despite many attempts to obtain such evidence) and defies well-established physical laws, perhaps we should assign a low prior probability to Bem's $H_1$, a nonzero precognition effect. Taking a strong Bayesian position, Wagenmakers et al. suggest that we might do well to adopt a prior reflecting how unlikely precognition is, say $p(H_1) = 10^{-20}$. And if we adopt this prior, even a very well-designed, highly informative experiment (with a Bayes Factor\index{Bayes Factor (BF)} conveying substantial or even decisive evidence) would still lead to a very low posterior probability of precognition.
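The arithmetic behind this point is simple Bayesian updating: posterior odds equal the Bayes Factor times the prior odds. A minimal sketch, where the BF value is hypothetical rather than one reported by Wagenmakers et al.:

```r
# Posterior odds = Bayes Factor x prior odds; the BF here is illustrative
prior_h1 <- 1e-20                          # the skeptical prior for precognition
prior_odds <- prior_h1 / (1 - prior_h1)
bf_10 <- 100                               # hypothetical "decisive" evidence for H1
posterior_odds <- bf_10 * prior_odds
posterior_odds / (1 + posterior_odds)      # posterior p(H1) is still ~1e-18
```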

@@ -450,7 +450,7 @@ Up until now, we've presented Bayesian and frequentist tools as two different se

[^inference-22]: This is really a very rough description. If you're interested in learning more about this philosophical background, we recommend the Stanford Encyclopedia of Philosophy entry, "Interpretations of Probability" (<https://plato.stanford.edu/entries/probability-interpret>).

-You don't have to take sides in this deep philosophical debate about what probability is. But it's helpful to know that people actually seem to reason about the world in ways that are well described by the subjective Bayesian view of probability. Recent cognitive science research has made a lot of headway in describing reasoning as a process of Bayesian inference\index{Bayesian inference} where probabilities describe degrees of belief in different hypotheses [for a textbook review of this approach, see @probmods2]. These hypotheses in turn are a lot like the theories we described in @sec-theories: they describe the relationships between different abstract entities [@tenenbaum2011]. You might think that scientists are different from laypeople in this regard, but one of the striking findings from research on probabilistic reasoning and judgment is that expertise doesn't matter that much. Statistically trained scientists---and even statisticians---make many of the same reasoning mistakes as their untrained students [@kahneman1979]. Even children seem to reason intuitively in a way that looks a bit like Bayesian inference [@gopnik2012].
+You don't have to take sides in this deep philosophical debate about what probability is. But it's helpful to know that people actually seem to reason about the world in ways that are well described by the subjective Bayesian view of probability. Recent cognitive science research has made a lot of headway in describing reasoning as a process of Bayesian inference\index{Bayesian inference} where probabilities describe degrees of belief in different hypotheses [for a textbook review of this approach, see @probmods2]. These hypotheses in turn are a lot like the theories we described in @sec-theories: they describe the relationships between different abstract entities [@tenenbaum2011]. You might think that scientists are different from laypeople in this regard, but one of the striking findings from research on probabilistic reasoning and judgment is that expertise doesn't matter that much. Statistically trained scientists---and even statisticians---make many of the same reasoning mistakes as their untrained students [@kahneman1979]. Even children seem to reason intuitively in a way that looks a bit like Bayesian inference [@gopnik2012].\index{Bayes Factor (BF)}

These cognitive science findings help to explain some of the problems that people (scientists included) have in reasoning about $p$-values. If you are an intuitively Bayesian reasoner, the quantity that you're probably tracking is how much you believe in your hypothesis (its posterior probability). So, many people treat the $p$-value as the posterior probability of the null hypothesis.[^inference-23] That's exactly what fallacy 1 in @tbl-dirty-dozen states---"If $p = 0.05$, the null hypothesis has only a 5% chance of being true." But this equivalence is incorrect! Written in math, $p(\text{data} | H_0)$ (the likelihood that lets us compute the $p$-value) is not the same thing as $p(H_0 | \text{data})$ (the posterior that we want). Pulling from our [accident report]{.smallcaps} above, even if the *probability of the observed ESP data given the null hypothesis* is low, that doesn't mean that the *probability of ESP* is high.
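A quick way to see the asymmetry is to run Bayes' rule with illustrative numbers (every probability below is hypothetical, chosen only to make the contrast vivid):

```r
# p(data | H0) is not p(H0 | data); all numbers below are hypothetical
p_data_h0 <- 0.05    # likelihood of the data under the null
p_data_h1 <- 0.20    # likelihood of the data under the alternative
prior_h0  <- 0.99    # strong prior belief in the null (e.g., no ESP)

posterior_h0 <- (p_data_h0 * prior_h0) /
  (p_data_h0 * prior_h0 + p_data_h1 * (1 - prior_h0))
posterior_h0         # ~0.96, nowhere near 0.05
```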

007-models.qmd (4 changes: 2 additions & 2 deletions)
@@ -129,7 +129,7 @@ You can also extract the coefficient values using `coef(mod)` and put them in a

### Adding predictors

-The regression model we just wrote down is the same model that underlies the $t$-test\index{t-test} from @sec-inference. But the beauty of regression modeling is that much more complex estimation problems can also be written as regression models that extend the model we made above. For example, we might want to add another predictor variable, such as the age of the participant.^[The ability to estimate multiple coefficients at once is a huge strength of regression modeling, so much so that sometimes people use the label **multiple regression**\index{multiple regression} to denote that there is more than one predictor + coefficient pair.]
+The regression model we just wrote down is the same model that underlies the $t$-test from @sec-inference. But the beauty of regression modeling is that much more complex estimation problems can also be written as regression models that extend the model we made above. For example, we might want to add another predictor variable, such as the age of the participant.^[The ability to estimate multiple coefficients at once is a huge strength of regression modeling, so much so that sometimes people use the label **multiple regression**\index{multiple regression} to denote that there is more than one predictor + coefficient pair.]
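One way to convince yourself of this equivalence is with simulated data; a minimal sketch (the variable names and effect size are arbitrary):

```r
# A two-group regression and an equal-variance t-test give the same t and p
set.seed(1)
d <- data.frame(condition = rep(c(0, 1), each = 20))
d$y <- 0.5 * d$condition + rnorm(40)

coef(summary(lm(y ~ condition, data = d)))        # slope = difference in group means
t.test(y ~ condition, data = d, var.equal = TRUE) # same t (up to sign) and same p
```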

Let's add this new independent variable and a corresponding regression coefficient to our model:
$$
@@ -320,7 +320,7 @@ But---practically speaking---how should you go about building a model for your e

A second class of methods that helps resolve issues of clustering is **generalized estimating equations** (GEE)\index{generalized estimating equations (GEE)}. In this approach, we leave the linear predictor alone. We do not add random intercepts or slopes, nor do we assume anything about the distribution of the errors (i.e., we no longer assume that they are normal, independent, and homoskedastic).

-In GEE, we instead provide the model with an initial "guess" about how we think the errors might be related to one another; for example, in a repeated-measures experiment, we might guess that the errors are exchangeable,\index{exchangeability} meaning that they are correlated to the same degree within each participant but are uncorrelated across participants. Instead of *assuming* that our guess is correct, as do linear mixed models (LMM)\index{linear mixed models (LMM)}, GEE estimates the correlation structure of the errors empirically, using our guess as a starting point, and it uses this correlation structure to arrive at point estimates and inference for the regression coefficients. Remarkably, as the number of clusters and observations become very large, GEE will *always* provide unbiased point estimates and valid inference, *even if* our guess about the correlation structure was bad. Additionally, with simple finite-sample corrections [@mancl2001covariance], GEE seems to provide valid inference at smaller numbers of clusters than does LMM.
+In GEE, we instead provide the model with an initial "guess" about how we think the errors might be related to one another; for example, in a repeated-measures experiment, we might guess that the errors are exchangeable,\index{exchangeability} meaning that they are correlated to the same degree within each participant but are uncorrelated across participants. Instead of *assuming* that our guess is correct, as do linear mixed models (LMM)\index{linear mixed effect model (LMM)}, GEE estimates the correlation structure of the errors empirically, using our guess as a starting point, and it uses this correlation structure to arrive at point estimates and inference for the regression coefficients. Remarkably, as the number of clusters and observations become very large, GEE will *always* provide unbiased point estimates and valid inference, *even if* our guess about the correlation structure was bad. Additionally, with simple finite-sample corrections [@mancl2001covariance], GEE seems to provide valid inference at smaller numbers of clusters than does LMM.
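As a concrete sketch, here is what a GEE fit with an exchangeable working correlation might look like using the geepack package (this assumes geepack is installed; the data are simulated and all names are illustrative):

```r
# GEE with an exchangeable working correlation; data simulated for illustration
library(geepack)

set.seed(2)
n_subj <- 30
d <- data.frame(
  id        = rep(1:n_subj, each = 4),       # 4 repeated measures per participant
  condition = rep(c(0, 1), times = 2 * n_subj)
)
d$y <- 0.3 * d$condition +
  rep(rnorm(n_subj, sd = 0.5), each = 4) +   # participant-level clustering
  rnorm(nrow(d))

mod <- geeglm(y ~ condition, id = id, data = d,
              family = gaussian, corstr = "exchangeable")
summary(mod)  # robust ("sandwich") standard errors for the condition effect
```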

The price paid for these nice safeguards against model misspecification is that, in principle, GEE will typically have less statistical power\index{statistical power} than LMM *if* the LMM is in fact correctly specified, but the difference may be surprisingly slight in practice [@bie2021fitting]. For these reasons, some of this book's authors actually favor GEE with finite-sample corrections over LMM as the default model for clustered data, though they are much less common in psychology.
<!-- In general, we tend to favor LMM over GEE only when the number of observations and clusters are quite large, and when careful diagnostics also indicate that distributional assumptions are fulfilled. -->
008-measurement.qmd (2 changes: 1 addition & 1 deletion)
@@ -20,7 +20,7 @@ What does it mean to measure something? Intuitively, we know that a ruler measur
\clearpage
We first have to keep in mind that not every measure is equally precise. This point is obvious when you think about physical measurement instruments: a caliper will give you a much more precise estimate of thickness than a ruler will. One way to see that the measurement is more precise is by repeating it a bunch of times. The measurements from the caliper will likely be more similar to one another, reflecting the fact that the amount of error in each individual measurement is smaller. We can do the same thing with a psychological measurement---repeat and assess variation---though as we'll see below it's a little trickier. Measurement instruments that have less error are called more **reliable**\index{reliability} instruments.^[Is reliability the same as **precision**? Yes, more or less. Confusingly, different fields call these concepts different things [there's a helpful table of these names in @brandmaier2018]. Here we'll talk about reliability as a property of instruments specifically while using the term precision to talk about the measurements themselves.]
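Here is a toy simulation of that "repeat and assess variation" logic (the true value and instrument error SDs are invented for illustration):

```r
# Two instruments measuring the same true value; error SDs are invented
set.seed(3)
true_value <- 10
ruler   <- true_value + rnorm(20, sd = 0.5)   # noisier instrument
caliper <- true_value + rnorm(20, sd = 0.05)  # more reliable instrument

c(sd_ruler = sd(ruler), sd_caliper = sd(caliper))  # caliper readings vary less
```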

-Second, psychological measurements do not directly reflect latent theoretical constructs\index{latent construct} of interest, quantities like happiness, intelligence, or language processing ability. And unlike quantities like length and mass, there is often disagreement in psychology about what the right theoretical quantities are. Thus, we have to measure an observable behavior---our operationalization of the construct---and then make an argument about how the measure relates to a proposed construct of interest (and sometimes whether the construct really exists at all). This argument is about the **validity**\index{validity} of our measurements.^[We are also going to talk in @sec-design about the validity of manipulations. The way you identify a causal effect on some measure is by operationalizing some construct as well. To identify causal effects, we must link a particular construct of interest to something we can concretely manipulate in an experiment, like the stimuli or instructions.]
+Second, psychological measurements do not directly reflect latent theoretical constructs\index{latent construct} of interest, quantities like happiness, intelligence, or language processing ability. And unlike quantities like length and mass, there is often disagreement in psychology about what the right theoretical quantities are. Thus, we have to measure an observable behavior---our operationalization of the construct---and then make an argument about how the measure relates to a proposed construct of interest (and sometimes whether the construct really exists at all). This argument is about the **validity** of our measurements.^[We are also going to talk in @sec-design about the validity of manipulations. The way you identify a causal effect on some measure is by operationalizing some construct as well. To identify causal effects, we must link a particular construct of interest to something we can concretely manipulate in an experiment, like the stimuli or instructions.]

These two concepts, reliability and validity, provide a conceptual toolkit for assessing a psychological measurement and how well it serves the researcher's goal.
