Equivalence

This Lab is based on Hartman and Hidalgo (2018).

Also see Hartman (2021) for advanced discussion.

Note
  1. Download the updated Lab folder from here.

  2. Open the .Rproj file (this launches the project in RStudio).

  3. Open equivalence-10022024.qmd.

Equivalence Testing for Causal Inference in Social Science Research Design

Causal identification depends critically on assumptions about the data-generating process. These assumptions are inherently untestable, but they have observable implications that can be checked empirically. Unconfoundedness, which is achieved by randomization in experiments and holds conditionally on covariates in observational and quasi-experimental data, is probed with balance tests. In simple terms, balance tests check an observable implication of the assumption: if treatment assignment is indeed independent of the potential outcomes conditional on a vector of confounders, then the covariates should not differ systematically between the treatment and control groups.
This implies that, to test the unconfoundedness assumption (the conditional independence assumption, CIA), the null hypothesis we should begin with is that the data are inconsistent with a research design valid for causal inference.

“We argue that researchers should begin with the initial hypothesis that the data are inconsistent with a valid research design, and provide sufficient statistical evidence in favor of a valid design.” - H&H 2018

This can be framed in hypothesis testing terms as:

\(H_0:\) Data inconsistent with the observable implications of an unconfounded research design.

\(\implies\) The distribution of covariates differs across treatment groups.

\(H_A:\) Data consistent with the observable implications of an unconfounded research design.

\(\implies\) The distribution of covariates is the same across treatment groups.

However, traditional practice, reflected in most of the approaches and papers in your coursework, especially those on natural experiments, runs this logic in reverse: balance tests treat a non-significant difference between covariate distributions as evidence that the distributions are the same across treatment groups. As Hartman and Hidalgo (2018) put it, this practice amounts to “incorrectly equating non-significant difference with significant homogeneity” (quoted from Wellek 2010).

Traditional balance tests, which rely on null hypotheses of no difference, can be misleading because of low statistical power. Hartman and Hidalgo (2018) advocate equivalence tests, in which the null hypothesis assumes a meaningful difference exists and researchers seek evidence of equivalence within a pre-defined range. They emphasize the importance of selecting appropriate equivalence ranges and of interpreting the results in the context of potential bias and causal identification.
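
To see the power problem concretely, consider a small illustrative simulation (not from H&H; the numbers are arbitrary). With 20 units per group and a true imbalance of half a standard deviation, the conventional t-test fails to detect the difference most of the time, so a traditional balance test would “pass” precisely when it is least informative:

```{r}
# Illustrative simulation: a true imbalance of 0.5 sd with 20 units per group
# is usually *not* flagged as significant by the conventional t-test
set.seed(42)
p_diff <- replicate(2000, t.test(rnorm(20, mean = 0.5), rnorm(20, mean = 0))$p.value)
mean(p_diff > 0.05)  # share of simulations in which the balance test would "pass"
```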

Key Concepts:

  • Tests of Design: Procedures used to assess the plausibility of causal identification assumptions. This includes balance tests (comparing pretreatment covariates) and placebo tests (examining treatment effects on unaffected outcomes).

  • Balance Test: Assessing whether the distributions of pretreatment covariates are similar between treatment and control groups. Good balance strengthens the credibility of an unconfounded design.

  • Equivalence Testing: A statistical testing framework where the null hypothesis posits a meaningful difference, and the goal is to find evidence for equivalence within a defined range.

  • Equivalence Range: The range of values within which the difference between two groups is considered substantively inconsequential. Selecting a justifiable equivalence range is crucial because it operationalizes what “similar enough” means in the context of the specific study, directly incorporating researcher judgment and subject-matter knowledge.

  • Two One-Sided Test (TOST): A common equivalence test that conducts two one-sided tests to determine whether the difference between groups falls within the equivalence range.

  • Equivalence Confidence Interval (ECI): Similar to a confidence interval, the ECI represents the smallest equivalence range supported by the data at a given significance level. It helps researchers assess the uncertainty surrounding the true difference and defend the chosen range.

Mechanics

The equivalence t-test takes the following Two One-Sided Test (TOST) form:

\[
\begin{align*}
H_0&: \frac{\mu_T - \mu_C}{\sigma} \geq \epsilon_U \quad \text{or} \quad \frac{\mu_T - \mu_C}{\sigma} \leq \epsilon_L \\
&\text{versus} \\
H_1&: \epsilon_L < \frac{\mu_T - \mu_C}{\sigma} < \epsilon_U
\end{align*}
\]

where \([\epsilon_L, \epsilon_U]\) is the equivalence range, \(\mu_T\) and \(\mu_C\) are the means of the treated and control groups, respectively, for a given covariate, and \(\sigma\) is the common standard deviation. The terms \(\epsilon_U\) and \(\epsilon_L\) are the upper and lower bounds within which the two groups are considered equivalent.

Choosing appropriate values for \(\epsilon_L\) and \(\epsilon_U\) is the most important aspect of equivalence testing (refer to the “Selecting an Equivalence Range” section in H&H 2018 for more details).

As shown in Figure 1, the test is conducted using two one-sided t-tests, and the null of difference is rejected in favor of equivalence if the p-values of both one-sided tests are less than \(\alpha\).

Figure 1: Source: Hartman and Hidalgo (2018, p. 5)
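
To make the mechanics concrete, here is a minimal base-R sketch of the TOST logic on the raw difference-in-means scale. This is an illustrative helper (tost_raw is our name, not part of any package), and it assumes equal variances; the equivtest package instead works with standardized differences via a noncentral-F formulation, as shown later in this Lab.

```{r}
# Minimal TOST sketch on the raw scale: reject the null of nonequivalence
# only if BOTH one-sided tests reject at level alpha
tost_raw <- function(x, y, eps_L, eps_U, alpha = 0.05) {
  m <- length(x); n <- length(y)
  dbar <- mean(x) - mean(y)
  # pooled standard error, assuming equal variances across groups
  sp <- sqrt(((m - 1) * var(x) + (n - 1) * var(y)) / (m + n - 2))
  se <- sp * sqrt(1 / m + 1 / n)
  df <- m + n - 2
  p_L <- pt((dbar - eps_L) / se, df, lower.tail = FALSE)  # H0: diff <= eps_L
  p_U <- pt((dbar - eps_U) / se, df, lower.tail = TRUE)   # H0: diff >= eps_U
  list(diff = dbar, p_L = p_L, p_U = p_U,
       equivalent = max(p_L, p_U) < alpha)
}
```

Declaring equivalence when both p-values fall below \(\alpha\) is the same as checking that the \(1 - 2\alpha\) confidence interval for the difference lies entirely inside \([\epsilon_L, \epsilon_U]\).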

Equivalence Testing in R¹

The equivalence testing package developed by Hartman and Hidalgo (2018) is not yet available on CRAN. We install it from GitHub using the following code:

#install.packages("devtools")
library(devtools)
Loading required package: usethis
#install_github("ekhartman/equivtest", force = TRUE)
library(equivtest)
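
If devtools causes trouble on your system, the lighter remotes package provides the same installer:

```{r}
# Alternative installation via 'remotes' (same GitHub source)
# install.packages("remotes")
# remotes::install_github("ekhartman/equivtest", force = TRUE)
```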

Using the worked example from H&H (2018)

The equivalence range for the t-test for equivalence is typically defined in standardized differences rather than the raw difference in means between the two groups, but researchers can easily map their substantive ranges to standardized differences by scaling by the standard deviation in the covariate. The standardized difference is a useful metric when testing for equivalence because, given some difference between the means of the two distributions, the two groups are increasingly indistinguishable as the variance of the distributions grows towards infinity, and increasingly disjoint as the variance of the distributions shrinks towards zero (Wellek 2010). We also recommend the t-test for equivalence because it is the uniformly most powerful invariant (UMPI) test for two normally distributed variables (Wellek 2010, pg. 120).
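
As a quick sketch of that mapping (the 0.05-mile substantive bound here is hypothetical; the standard deviation is the one for the distance covariate used later in this Lab):

```{r}
# Hypothetical substantive bound: group means within 0.05 miles count as equivalent
substantive_eps <- 0.05
sd_distance <- 0.2772                       # sd of the distance covariate (see below)
(eps_std <- substantive_eps / sd_distance)  # ~0.18 in standardized units
```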

For the equivalence t-test, we are interested in \(H_1: \epsilon_L < \frac{\mu_T-\mu_C}{\sigma} < \epsilon_U\). For more information concerning acceptable \(\epsilon\) inputs, refer to the equiv.t.test documentation or Hartman & Hidalgo (2018).

We now implement Example 6.1 from Wellek (2010). In summary, we wish to compare two treatments using a nonsymmetric equivalence range.

# Wellek p 124

x=c(10.3,11.3,2,-6.1,6.2,6.8,3.7,-3.3,-3.6,-3.5,13.7,12.6)
y=c(3.3,17.7,6.7,11.1,-5.8,6.9,5.8,3,6,3.5,18.7,9.6)

t.test(x,y)

    Welch Two Sample t-test

data:  x and y
t = -1.0862, df = 21.915, p-value = 0.2892
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -8.826339  2.759672
sample estimates:
mean of x mean of y 
 4.175000  7.208333 
res=equiv.t.test(x,y,eps_std=c(.5,1), alpha = .05)
summary(res)
Equivalence t-test 
Input: eps_std,  SE = 2.793
T-statistic critical interval: 0.28 to 0.931 
Substantive equivalence CI: NA to NA 
Standardized equivalence CI: NA to NA 
Reject the null hypothesis? FALSE, p-value of NA
# Compare with t.test

The data do not allow us to reject the null hypothesis of nonequivalence of treatment A and treatment B. Note the contrast with the standard t-test above: it also fails to reject its null of no difference (p = 0.29), but that non-significance is not, by itself, evidence of equivalence.

How does the equiv.t.test function work?

```{r}
#| code-fold: true
equiv.t.test <- function(x, y, alpha = .05, epsilon = .2, std.err = "nominal", cluster.x = NULL, cluster.y = NULL) {

  # Remove NAs from the data
  x = x[!is.na(x)]
  y = y[!is.na(y)]

  # Calculate the difference in means
  dbar <- mean(x) - mean(y)

  # Get the sample sizes as doubles
  m <- as.double(length(x))
  n <- as.double(length(y))
  N <- m+n

  # Calculate the variances of each group
  x.var <- var(x)
  y.var <- var(y)

  # Calculate the non-centrality parameter for the power calculation
  non.cent <- (m*n*epsilon^2)/N

  # Calculate the critical value for the t-statistic based on the non-centrality parameter
  critical.const <- sqrt(qf(alpha,1,N-2,non.cent))

  # Calculate the standard error of the difference in means
  se = sqrt((m-1)*x.var + (n-1)*y.var) / sqrt(m*n * (N-2)/N)

  # Calculate the degrees of freedom
  df = N - 2

  # Calculate the t-statistic
  t.stat <- dbar / se

  # Calculate the p-value
  p = pf(abs(t.stat)^2, 1, df , non.cent)

  # Calculate the observed standardized mean difference
  obs_smd = (mean(x) - mean(y)) / sd(y)

  # Invert the test: find the smallest epsilon at which the observed t-statistic
  # would just reject the null of nonequivalence (the equivalence CI bound)
  inverted <- try(
    uniroot(
      function(eps) {
        pf(abs(t.stat)^2, 1, N - 2, ncp = (m * n * eps^2) / N) -
          ifelse(pf(abs(t.stat)^2, 1, N - 2, ncp = 0) < alpha,
                 pf(abs(t.stat)^2, 1, N - 2, ncp = (m * n * obs_smd^2) / N),
                 alpha)
      },
      c(0, 10 * abs(t.stat)), tol = 0.0001
    )$root,
    silent = TRUE
  )

  # If the uniroot function throws an error, set the inverted test statistic to NA
  if (inherits(inverted, "try-error")) {
    inverted = NA
  }

  # Determine if the null hypothesis should be rejected
  rej = abs(t.stat) <= critical.const

  # Return the test results
  return(list(t.stat = t.stat, critical.const = critical.const, power = 2*pt(critical.const, N-2)-1, rej = rej, p = p, inverted = inverted))

}
```

The code defines a function equiv.t.test that performs an equivalence t-test.

  • The function takes two vectors of data (x and y) as input, along with several optional parameters.

  • The function first calculates the difference in means between the two groups, the standard error of the difference, the degrees of freedom, and the t-statistic.

  • It then calculates the p-value for the test.

  • The function also calculates the power of the test, which is the probability of rejecting the null hypothesis when it is false.

  • Finally, the function returns a list of results, including the t-statistic, the critical value, the power, the rejection decision, the p-value, and the inverted test statistic.
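
As a quick illustration, we can call this simplified function on the Wellek vectors x and y from above. Note that, unlike the packaged version, it takes a single symmetric epsilon (the value 1 here is arbitrary, chosen only for demonstration):

```{r}
# Illustrative call, reusing x and y from the Wellek example above
out <- equiv.t.test(x, y, epsilon = 1)
out$t.stat          # observed t-statistic
out$critical.const  # equivalence declared when |t| <= this critical value
out$rej             # TRUE = reject the null of nonequivalence
```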

  1. Equivalence Testing: A statistical test used to determine if two groups are similar within a predefined margin, rather than significantly different.

  2. t-test: A statistical test used to compare the means of two groups.

  3. Alpha (\(\alpha\)): The significance level, typically set at 0.05, representing the probability of rejecting the null hypothesis when it is true (Type I error).

  4. Epsilon (\(\epsilon\)): The equivalence margin, defining the maximum difference between two groups considered practically insignificant.

  5. Degrees of Freedom (df): The number of values in a statistical calculation free to vary.

  6. p-value: The probability of obtaining the observed results (or more extreme) if the null hypothesis were true.

  7. Standard Error (SE): A measure of the variability of a sample mean.

  8. Critical Constant: The threshold value used to determine whether to reject the null hypothesis.

  9. Power: The probability of correctly rejecting the null hypothesis when it is false.

  10. Uniroot Function: An R function used to find the root (solution) of an equation.

  11. Non-centrality Parameter (ncp): A parameter in non-central distributions, like the non-central t-distribution, that measures the departure from the null hypothesis.

  12. Standardized Mean Difference (SMD): A measure of effect size, calculated as the difference between two means divided by the standard deviation.


Illustration used in Hartman and Hidalgo (2018)

Using data from Brady and McNulty (2011)

Abstract:

Could changing the locations of polling places affect the outcome of an election by increasing the costs of voting for some and decreasing them for others? The consolidation of voting precincts in Los Angeles County during California’s 2003 gubernatorial recall election provides a natural experiment for studying how changing polling places influences voter turnout. Overall turnout decreased by a substantial 1.85 percentage points: A drop in polling place turnout of 3.03 percentage points was partially offset by an increase in absentee voting of 1.18 percentage points. Both transportation and search costs caused these changes. Although there is no evidence that the Los Angeles Registrar of Voters changed more polling locations for those registered with one party than for those registered with another, the changing of polling places still had a small partisan effect because those registered as Democrats were more sensitive to changes in costs than those registered as Republicans. The effects were small enough to allay worries about significant electoral consequences in this instance (e.g., the partisan effect might be decisive in only about one in two hundred contested House elections), but large enough to make it possible for someone to affect outcomes by more extensive manipulation of polling place locations.

From page 119:

“Those who had their polling place changed in 2003 had to go an average distance of 0.354 miles in 2002, whereas those who did not have their polling place changed had to go only 0.320 miles—a difference of 0.034 miles.”

The following code is from equivalence_replication_file.R in the H&H (2018) replication packet, available here.

## From figure 1 -- 3045206 voters, assuming roughly equal split between treatment and control
# Difference of means between the two groups
dbar <- 0.034

# Sample sizes for group 1 and group 2 (half of the total sample size)
m <- (3045206)/2
n <- (3045206)/2

# Total sample size
N <- m + n

# Variances for group 1 and group 2 (both equal here)
x.var <- (.2772)^2
y.var <- (.2772)^2

# Tolerance level of 0.2 standard deviations
epsilon <- 0.2 # As per H&H, page 18

# Significance level (alpha = 5%)
alpha <- 0.05

# Non-centrality parameter (NCP)
non.cent <- (m * n * epsilon^2) / N
# This measures how far the true difference is from the null hypothesis under the alternative hypothesis

# Critical constant for the F-distribution (inverse CDF)
critical.const <- sqrt(qf(alpha, 1, N - 2, non.cent))
# The critical constant determines the boundary value for hypothesis testing

# T-statistic calculation
t.stat <- sqrt(m * n * (N - 2) / N) * dbar / sqrt((m - 1) * x.var + (n - 1) * y.var)
# The t-statistic is used to test whether the observed difference in means is statistically significant

# P-value calculation using the F-distribution CDF
p = pf(abs(t.stat)^2, 1, N - 2, non.cent)
# The p-value indicates the probability of observing such a t-statistic under the null hypothesis

# Finding the root for the equivalence confidence interval
inverted <- uniroot(function(x) pf(abs(t.stat)^2, 1, N - 2, ncp = (m * n * x^2) / N) - alpha, 
                    c(0, 2 * abs(t.stat)), tol = 0.0001)$root
# 'uniroot()' is used to find the boundary where the p-value equals alpha (i.e., the confidence interval)

# Output the p-value
p # Prints the p-value to assess significance
[1] 0
# Output the equivalence confidence interval in standardized terms
inverted # Prints the confidence interval in terms of standardized differences
[1] 0.1245523
# Convert the standardized confidence interval to real terms (based on variance)
inverted * sqrt(y.var) # Prints the confidence interval in real terms
[1] 0.0345259
# Calculate the inverted value in real terms (one side of the CI)
inverted_real <- inverted * sqrt(y.var)

# Calculate the lower and upper bounds of the confidence interval
(lower_bound <- dbar - inverted_real)
[1] -0.0005259025
(upper_bound <- dbar + inverted_real)
[1] 0.0685259

So far we have run the equivalence test for only one covariate, distance. The p-value is essentially zero, so we reject the null of nonequivalence at the 0.2 standard-deviation range; the smallest supported equivalence range (the ECI) is about ±0.125 standardized units, or about 0.035 miles.

To see how to run this for multiple covariates at once (a minimal sketch of the looping logic appears after this list):

  1. Download the replication packet for Hartman and Hidalgo (2018) from here.

  2. Open and follow the file equivalence_replication_file.R.
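
A minimal sketch of the looping logic (this is not H&H’s run_equiv; the data frame dat, its 0/1 indicator treat, and the covariate names are all hypothetical):

```{r}
# Hypothetical data frame 'dat' with a 0/1 treatment indicator 'treat';
# the covariate names below are placeholders
covs <- c("distance", "age", "income")
results <- lapply(covs, function(v) {
  equiv.t.test(dat[dat$treat == 1, v], dat[dat$treat == 0, v], epsilon = 0.2)
})
names(results) <- covs
sapply(results, function(r) r$rej)  # TRUE = equivalence established at the 0.2 bound
```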

How to run equivalence tests on your own data?

Since the equivtest package from Hartman and Hidalgo is not on CRAN, downloading and installing it can create issues on some systems.

The following steps run the same test without installing the package.

  1. Download the equiv-t-test.R script from here.

  2. Run the whole script after opening it in the same project window as your replication assignment. This will load three functions into your environment: equiv.t.test, generate_plot, and run_equiv (see the snippet after this list).

These functions are from the replication packet of Hartman and Hidalgo (2018).

  3. Open the replication packet for Hartman and Hidalgo (2018) from here. Go through the file equivalence_replication_file.R from this packet to understand how the functions loaded in step 2 are used.
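
Assuming equiv-t-test.R sits in your project directory, loading the functions is just:

```{r}
# Load the three helper functions from the downloaded script
source("equiv-t-test.R")
ls()  # should now include equiv.t.test, generate_plot, and run_equiv
```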

Glossary of Key Terms in H&H 2018

  • Balance: Refers to the similarity of the distributions of pretreatment covariates between the treatment and control groups in a study.

  • Bias: Systematic error in the estimation of a causal effect, often arising from confounding factors.

  • Causal Empiricism: An approach to research that emphasizes the importance of testing the plausibility of causal assumptions using empirical data.

  • Confounding: Occurs when a variable is associated with both the treatment and the outcome, making it difficult to isolate the treatment’s true effect.

  • Equivalence: In the context of statistical testing, equivalence implies that the difference between two groups or parameters is within a predefined range considered substantively unimportant.

  • Exchangeability: The idea that the treatment and control groups are sufficiently similar that they could have been interchanged without affecting the outcome of interest.

  • Identification Assumption: An untestable assumption about the data-generating process that is necessary to estimate a causal effect.

  • Natural Experiment: A study where the assignment of treatment is “as-if” random, occurring due to external factors or policy changes, rather than through researcher manipulation.

  • Null Hypothesis: A statement of no effect or no difference, often used as the baseline for statistical testing. In equivalence testing, the null hypothesis typically posits a meaningful difference.

  • Observational Study: Research where the researcher observes and measures variables without directly manipulating the treatment or exposure.

  • Placebo Effect: An observed effect on an outcome that is due to the act of receiving a treatment or intervention itself, rather than the treatment’s specific active ingredients.

  • Power: The probability of correctly rejecting the null hypothesis when it is false. Higher power indicates a greater ability to detect a true effect or difference.

  • Randomization: The process of randomly assigning units to treatment and control groups, which helps to ensure that the groups are similar on average.

  • Sensitivity Analysis: A method for examining how sensitive the results of an analysis are to changes in assumptions, such as the presence of unobserved confounding.

  • Standardized Effect Size: A measure of the magnitude of an effect (e.g., difference between groups) that is standardized to a common scale, often using standard deviations, to allow for comparisons across studies or variables with different units.

  • Statistical Significance: The likelihood of observing the data or more extreme results if the null hypothesis were true. Often determined by a p-value.

  • Substantive Significance: Whether a statistically significant finding is meaningful or important in the context of the research question and the real world.

  • Type I Error: Incorrectly rejecting a true null hypothesis. In equivalence testing, this would mean incorrectly concluding equivalence when a meaningful difference exists.

  • Type II Error: Incorrectly failing to reject a false null hypothesis. In equivalence testing, this would mean failing to conclude equivalence when the groups are actually equivalent within the defined range.


  1. This section is adapted from the code and documentation given on https://github.com/ekhartman/equivtest↩︎