Equivalence
This Lab is based on Hartman and Hidalgo (2018).
Also see Hartman (2021) for an advanced discussion.
Download the updated Lab folder from here.
Run the .RProj file, then open equivalence-10022024.qmd.
Mechanics
The equivalence t-test assumes the following Two One-Sided Test (TOST) form: \[ \begin{align*} H_0&: \frac{\mu_T - \mu_C}{\sigma} \geq \epsilon_U \quad \text{or} \quad \frac{\mu_T - \mu_C}{\sigma} \leq \epsilon_L \\ \text{versus} \quad H_1&: \epsilon_L < \frac{\mu_T - \mu_C}{\sigma} < \epsilon_U \end{align*} \] where \([\epsilon_L, \epsilon_U]\) is the equivalence range, \(\mu_T\) and \(\mu_C\) are the means of the treated and control groups, respectively, for a given covariate, and \(\sigma\) is the common standard deviation. The terms \(\epsilon_U\) and \(\epsilon_L\) are the upper and lower bounds within which the two groups are considered equivalent.
Choosing appropriate values for \(\epsilon_L\) and \(\epsilon_U\) is the most important aspect of equivalence testing (refer to the Selecting an Equivalence Range section in H&H 2018 for more details).
As shown here, the test is conducted using two one-sided t-tests, and the null hypothesis of difference is rejected in favor of equivalence if the p-value for both one-sided tests is less than \(\alpha\).
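The TOST logic can be sketched directly with base R's t.test. This is a minimal illustration, not the package's implementation: the data are simulated and the raw-scale bound eps is a hypothetical choice.

```{r}
# Minimal TOST sketch with base R (simulated data; eps is a hypothetical
# raw-scale equivalence bound chosen purely for illustration)
set.seed(1)
x <- rnorm(50, mean = 0.05)   # "treated" covariate values
y <- rnorm(50, mean = 0)      # "control" covariate values
eps <- 0.5                    # equivalence range [-eps, eps] on the raw scale

# H0a: mu_x - mu_y >= eps, tested against the one-sided alternative "less"
p_upper <- t.test(x, y, mu = eps, alternative = "less")$p.value
# H0b: mu_x - mu_y <= -eps, tested against the one-sided alternative "greater"
p_lower <- t.test(x, y, mu = -eps, alternative = "greater")$p.value

# Equivalence is concluded only if BOTH one-sided tests reject,
# i.e. if the larger of the two p-values is below alpha
tost_p <- max(p_upper, p_lower)
tost_p < 0.05
```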
Equivalence Testing in R¹
The equivalence testing package developed by Hartman and Hidalgo (2018) is not yet available on CRAN. We install it from GitHub using the following code:

```{r}
# install.packages("devtools")
library(devtools)
```

Loading required package: usethis

```{r}
# install_github("ekhartman/equivtest", force = TRUE)
library(equivtest)
```
Using the same example as H&H 2018
The equivalence range for the t-test for equivalence is typically defined in standardized differences rather than the raw difference in means between the two groups, but researchers can easily map their substantive ranges to standardized differences by scaling by the standard deviation in the covariate. The standardized difference is a useful metric when testing for equivalence because, given some difference between the means of the two distributions, the two groups are increasingly indistinguishable as the variance of the distributions grows towards infinity, and increasingly disjoint as the variance of the distributions shrinks towards zero (Wellek 2010). We also recommend the t-test for equivalence because it is the uniformly most powerful invariant (UMPI) test for two normally distributed variables (Wellek 2010, pg. 120).
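The scaling described above can be illustrated with made-up numbers (these values are hypothetical, not from the paper):

```{r}
# Hypothetical mapping of a substantive range to standardized units:
# suppose a difference of 2 points on a covariate is substantively negligible,
# and the covariate's standard deviation is 8
substantive_eps <- 2
sigma_hat <- 8
eps_std <- substantive_eps / sigma_hat
eps_std  # 0.25 standard deviations
```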
For the equivalence t-test, we are interested in \(H_1: \epsilon_L < \frac{\mu_T-\mu_C}{\sigma} < \epsilon_U\). For more information concerning acceptable \(\epsilon\) inputs, refer to the equiv.t.test documentation or Hartman & Hidalgo (2018).
We now implement Example 6.1 from Wellek (2010). In summary, we wish to compare two treatments using a nonsymmetric equivalence range.
```{r}
# Wellek p. 124
x = c(10.3, 11.3, 2, -6.1, 6.2, 6.8, 3.7, -3.3, -3.6, -3.5, 13.7, 12.6)
y = c(3.3, 17.7, 6.7, 11.1, -5.8, 6.9, 5.8, 3, 6, 3.5, 18.7, 9.6)
t.test(x, y)
```
Welch Two Sample t-test
data: x and y
t = -1.0862, df = 21.915, p-value = 0.2892
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-8.826339 2.759672
sample estimates:
mean of x mean of y
4.175000 7.208333
```{r}
# Compare with t.test
res = equiv.t.test(x, y, eps_std = c(.5, 1), alpha = .05)
summary(res)
```

Equivalence t-test
Input: eps_std, SE = 2.793
T-statistic critical interval: 0.28 to 0.931
Substantive equivalence CI: NA to NA
Standardized equivalence CI: NA to NA
Reject the null hypothesis? FALSE, p-value of NA

The data do not allow us to reject the null hypothesis of nonequivalence of treatment A and treatment B.
How does the equiv.t.test function work?
```{r}
#| code-fold: true
equiv.t.test <- function(x, y, alpha = .05, epsilon = .2, std.err = "nominal", cluster.x = NULL, cluster.y = NULL) {
# Remove NAs from the data
x = x[!is.na(x)]
y = y[!is.na(y)]
# Calculate the difference in means
dbar <- mean(x) - mean(y)
# Get the sample sizes as doubles
m <- as.double(length(x))
n <- as.double(length(y))
N <- m+n
# Calculate the variances of each group
x.var <- var(x)
y.var <- var(y)
# Calculate the non-centrality parameter for the power calculation
non.cent <- (m*n*epsilon^2)/N
# Calculate the critical value for the t-statistic based on the non-centrality parameter
critical.const <- sqrt(qf(alpha,1,N-2,non.cent))
# Calculate the standard error of the difference in means
se = sqrt((m-1)*x.var + (n-1)*y.var) / sqrt(m*n * (N-2)/N)
# Calculate the degrees of freedom
df = N - 2
# Calculate the t-statistic
t.stat <- dbar / se
# Calculate the p-value
p = pf(abs(t.stat)^2, 1, df , non.cent)
# Calculate the observed standardized mean difference
obs_smd = (mean(x) - mean(y)) / sd(y)
# Calculate the inverted test statistic
inverted <- try(uniroot(function(x) pf(abs(t.stat)^2, 1, N-2, ncp = (m*n*x^2)/N) - ifelse(pf(abs(t.stat)^2, 1, N-2, ncp = (m*n*0^2)/N) < alpha, pf(abs(t.stat)^2, 1, N-2, ncp = (m*n*obs_smd^2)/N), alpha), c(0,10*abs(t.stat)), tol = 0.0001)$root, silent = TRUE)
# If the uniroot function throws an error, set the inverted test statistic to NA
if(inherits(inverted, "try-error")) {
inverted = NA
}
# Determine if the null hypothesis should be rejected
rej = abs(t.stat) <= critical.const
# Return the test results
return(list(t.stat = t.stat, critical.const = critical.const, power = 2*pt(critical.const, N-2)-1, rej = rej, p = p, inverted = inverted))
}
```
The code defines a function equiv.t.test that performs an equivalence t-test.
The function takes two vectors of data (x and y) as input, along with several optional parameters.
The function first calculates the difference in means between the two groups, the standard error of the difference, the degrees of freedom, and the t-statistic.
It then calculates the p-value for the test.
The function also calculates the power of the test, which is the probability of rejecting the null hypothesis when it is false.
Finally, the function returns a list of results, including the t-statistic, the critical value, the power, the rejection decision, the p-value, and the inverted test statistic.
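To see these pieces in action, the core quantities can be computed by hand on simulated data. This mirrors the function body above; the data and parameter values are illustrative only.

```{r}
# Hand computation mirroring the internals of equiv.t.test (simulated data)
set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
m <- length(x); n <- length(y); N <- m + n
epsilon <- 0.2   # equivalence range of 0.2 standard deviations
alpha <- 0.05

# Difference in means and its standard error
dbar <- mean(x) - mean(y)
se <- sqrt((m - 1) * var(x) + (n - 1) * var(y)) / sqrt(m * n * (N - 2) / N)
t.stat <- dbar / se

# Non-centrality parameter and critical constant
non.cent <- (m * n * epsilon^2) / N
critical.const <- sqrt(qf(alpha, 1, N - 2, non.cent))

# p-value from the non-central F distribution; reject nonequivalence
# if |t| falls inside the critical interval
p <- pf(t.stat^2, 1, N - 2, non.cent)
c(t.stat = t.stat, crit = critical.const, p = p,
  reject = abs(t.stat) <= critical.const)
```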
Equivalence Testing: A statistical test used to determine if two groups are similar within a predefined margin, rather than significantly different.
t-test: A statistical test used to compare the means of two groups.
Alpha (\(\alpha\)): The significance level, typically set at 0.05, representing the probability of rejecting the null hypothesis when it is true (Type I error).
Epsilon (\(\epsilon\)): The equivalence margin, defining the maximum difference between two groups considered practically insignificant.
Degrees of Freedom (df): The number of values in a statistical calculation free to vary.
p-value: The probability of obtaining the observed results (or more extreme) if the null hypothesis were true.
Standard Error (SE): A measure of the variability of a sample mean.
Critical Constant: The threshold value used to determine whether to reject the null hypothesis.
Power: The probability of correctly rejecting the null hypothesis when it is false.
Uniroot Function: An R function used to find the root (solution) of an equation.
Non-centrality Parameter (ncp): A parameter in non-central distributions, like the non-central t-distribution, that measures the departure from the null hypothesis.
Standardized Mean Difference (SMD): A measure of effect size, calculated as the difference between two means divided by the standard deviation.
Illustration used in Hartman and Hidalgo (2018)
Using data from Brady and McNulty (2011)
From page 119, the following code is taken from equivalence_replication_file.R in the H&H 2018 replication docket, available here.
```{r}
## From figure 1 -- 3045206 voters, assuming roughly equal split between treatment and control

# Difference of means between the two groups
dbar <- 0.034

# Sample sizes for group 1 and group 2 (half of the total sample size)
m <- (3045206)/2
n <- (3045206)/2

# Total sample size
N <- m + n

# Variances for group 1 and group 2 (both equal here)
x.var <- (.2772)^2
y.var <- (.2772)^2

# Tolerance level of 0.2 standard deviations
epsilon <- 0.2 # As per H&H, page 18

# Significance level (alpha = 5%)
alpha <- 0.05

# Non-centrality parameter (NCP)
# This measures how far the true difference is from the null hypothesis under the alternative hypothesis
non.cent <- (m * n * epsilon^2) / N

# Critical constant for the F-distribution (inverse CDF)
# The critical constant determines the boundary value for hypothesis testing
critical.const <- sqrt(qf(alpha, 1, N - 2, non.cent))

# T-statistic calculation
# The t-statistic is used to test whether the observed difference in means is statistically significant
t.stat <- sqrt(m * n * (N - 2) / N) * dbar / sqrt((m - 1) * x.var + (n - 1) * y.var)

# P-value calculation using the F-distribution CDF
# The p-value indicates the probability of observing such a t-statistic under the null hypothesis
p = pf(abs(t.stat)^2, 1, N - 2, non.cent)

# Finding the root for the equivalence confidence interval
# 'uniroot()' is used to find the boundary where the p-value equals alpha
inverted <- uniroot(function(x) pf(abs(t.stat)^2, 1, N - 2, ncp = (m * n * x^2) / N) - alpha,
                    c(0, 2 * abs(t.stat)), tol = 0.0001)$root

# Output the p-value to assess significance
p
```

[1] 0

```{r}
# Output the equivalence confidence interval in standardized terms
inverted
```

[1] 0.1245523

```{r}
# Convert the standardized confidence interval to real terms (based on variance)
inverted * sqrt(y.var)
```

[1] 0.0345259

```{r}
# Calculate the inverted value in real terms (one side of the CI)
inverted_real <- inverted * sqrt(y.var)

# Calculate the lower and upper bounds of the confidence interval
(lower_bound <- dbar - inverted_real)
```

[1] -0.0005259025

```{r}
(upper_bound <- dbar + inverted_real)
```

[1] 0.0685259
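As a quick sanity check on the numbers above (using the same assumed difference of means and standard deviation), the observed standardized difference sits inside the 0.2-SD equivalence range:

```{r}
# Observed standardized difference, using the values assumed above
dbar <- 0.034
sd_y <- 0.2772
dbar / sd_y  # about 0.123 standard deviations, inside the 0.2-SD range
```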
So far we have run the equivalence test for only one covariate: distance.
To learn how to run it for multiple covariates at once:
Download the replication docket for Hartman and Hidalgo (2018) from here.
Open and follow the file equivalence_replication_file.R.
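As a rough sketch of how a multi-covariate balance check could be looped (this is not the replication code; the data frame, column names, and 0.2-SD range are all hypothetical, and the TOST is done with base R rather than equivtest):

```{r}
# Hypothetical sketch: looping a TOST equivalence test over several covariates
set.seed(7)
dat <- data.frame(treat  = rep(0:1, each = 100),
                  age    = rnorm(200, 40, 10),
                  income = rnorm(200, 50, 15))
covs <- c("age", "income")

pvals <- sapply(covs, function(v) {
  x <- dat[dat$treat == 1, v]
  y <- dat[dat$treat == 0, v]
  eps <- 0.2 * sd(c(x, y))  # 0.2-SD equivalence range mapped to the raw scale
  p_u <- t.test(x, y, mu =  eps, alternative = "less")$p.value
  p_l <- t.test(x, y, mu = -eps, alternative = "greater")$p.value
  max(p_u, p_l)             # TOST p-value for this covariate
})
pvals
```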
How to run equivalence tests on your own data?
Glossary of Key Terms in H&H 2018
Balance: Refers to the similarity of the distributions of pretreatment covariates between the treatment and control groups in a study.
Bias: Systematic error in the estimation of a causal effect, often arising from confounding factors.
Causal Empiricism: An approach to research that emphasizes the importance of testing the plausibility of causal assumptions using empirical data.
Confounding: Occurs when a variable is associated with both the treatment and the outcome, making it difficult to isolate the treatment’s true effect.
Equivalence: In the context of statistical testing, equivalence implies that the difference between two groups or parameters is within a predefined range considered substantively unimportant.
Exchangeability: The idea that the treatment and control groups are sufficiently similar that they could have been interchanged without affecting the outcome of interest.
Identification Assumption: An untestable assumption about the data-generating process that is necessary to estimate a causal effect.
Natural Experiment: A study where the assignment of treatment is “as-if” random, occurring due to external factors or policy changes, rather than through researcher manipulation.
Null Hypothesis: A statement of no effect or no difference, often used as the baseline for statistical testing. In equivalence testing, the null hypothesis typically posits a meaningful difference.
Observational Study: Research where the researcher observes and measures variables without directly manipulating the treatment or exposure.
Placebo Effect: An observed effect on an outcome that is due to the act of receiving a treatment or intervention itself, rather than the treatment’s specific active ingredients.
Power: The probability of correctly rejecting the null hypothesis when it is false. Higher power indicates a greater ability to detect a true effect or difference.
Randomization: The process of randomly assigning units to treatment and control groups, which helps to ensure that the groups are similar on average.
Sensitivity Analysis: A method for examining how sensitive the results of an analysis are to changes in assumptions, such as the presence of unobserved confounding.
Standardized Effect Size: A measure of the magnitude of an effect (e.g., difference between groups) that is standardized to a common scale, often using standard deviations, to allow for comparisons across studies or variables with different units.
Statistical Significance: The likelihood of observing the data or more extreme results if the null hypothesis were true. Often determined by a p-value.
Substantive Significance: Whether a statistically significant finding is meaningful or important in the context of the research question and the real world.
Type I Error: Incorrectly rejecting a true null hypothesis. In equivalence testing, this would mean incorrectly concluding equivalence when a meaningful difference exists.
Type II Error: Incorrectly failing to reject a false null hypothesis. In equivalence testing, this would mean failing to conclude equivalence when the groups are actually equivalent within the defined range.
This section is adapted from the code and documentation given on https://github.com/ekhartman/equivtest↩︎