Star Wars: Method Matters

Econometric techniques are the standard way for economists to estimate economic relationships and test economic theories. By convention, we use stars to mark statistically significant results. However, the credibility of empirical results, especially those on causal inference, has been increasingly questioned in recent years, out of concern that data analysis is being misused to produce statistically significant effects where the expected effect does not exist. Technically, this is often called “p-hacking.” For most researchers, though, it feels more like a star war.

As empirical researchers, among the many issues in this war, we are most interested in how p-hacking affects publication bias in economics journals, especially the leading ones. We would also like to know how different empirical methods relate to the extent of p-hacking. If methods differ in how much p-hacking they exhibit, the next question is which method is more reliable and suffers least from it.

Brodeur et al. (2020) (hereafter BCH) address these questions by re-examining the hypothesis tests reported in all articles published between 2015 and 2018 in 25 top economics journals (ranked by RePEc’s Simple Impact Factor) that employ at least one of the four most widely used empirical methods: randomized controlled trials (RCT), difference-in-differences (DID), instrumental variables (IV), and regression discontinuity design (RDD). In total, they collected 21,740 test statistics (all tests are two-tailed and the samples are large enough) and plotted their distribution as a bar chart, shown in Figure 1.

Figure 1: Z-statistics in 25 top economics journals

Source: Brodeur et al. (2020)

Strikingly, the distribution has two humps, with the second lying between 1.65 and 2.5. Given that 1.645 is the two-tailed critical value at the 10% significance level (and 1.96 at the 5% level), the figure suggests that many effects are significant simply because we choose either 10% or 5% as the significance level. More importantly, if the underlying distribution of the test statistics (for any method) were continuous and infinitely differentiable, then with such a large sample we would expect to observe a single hump, not two. Together, these observations give the impression that p-hacking does exist in economics journals.
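To make the thresholds concrete, here is a minimal sketch that recovers the two-tailed critical values of a standard normal test statistic and converts a z-statistic back into a p-value. It uses only Python's standard library; the function names are my own and are not taken from BCH.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def two_tailed_critical_value(alpha: float) -> float:
    """Smallest |z| that is significant at level alpha in a two-tailed test."""
    return STD_NORMAL.inv_cdf(1 - alpha / 2)


def two_tailed_p_value(z: float) -> float:
    """Two-tailed p-value of a z-statistic under the standard normal."""
    return 2 * (1 - STD_NORMAL.cdf(abs(z)))


# One star, two stars, three stars: the conventional significance levels.
for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f} -> |z| >= {two_tailed_critical_value(alpha):.3f}")
```

The second hump in Figure 1 sits just to the right of the first two of these values, which is where a result picks up its first and second star.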

The next question is whether different empirical methods show different degrees of p-hacking. BCH further divide the sample into groups by method and plot Figure 2. Surprisingly, the shapes differ across methods. The second hump around 2 is evident for DID and IV. In contrast, RCT and RDD show a roughly monotonically decreasing distribution, apart from a very small local maximum around 2 for RCT. So the answer appears to be yes: BCH suggest that RCT and RDD are subject to much less p-hacking than DID and IV.

Figure 2: Z-statistics by methods

Source: Brodeur et al. (2020)

BCH then apply various complementary tests to examine these conjectures formally. The results are consistent with one another and provide further evidence of more p-hacking in DID and IV than in RCT and RDD. Interestingly, they also compare the results with other disciplines such as political science and sociology, and find that economics journals exhibit much less p-hacking than journals in those fields.
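One simple way to formalize "too many statistics just above the threshold" is a caliper-style comparison: count the z-statistics in a narrow band just below and just above a critical value, then ask how likely such an imbalance would be if a statistic were equally likely to fall on either side. The sketch below illustrates that idea only and is not a description of BCH's exact procedure. The band width of 0.25, the function names, and the sample numbers are all my own assumptions.

```python
from math import comb


def caliper_counts(z_stats, threshold=1.96, width=0.25):
    """Count |z| values just below and just above the threshold."""
    below = sum(1 for z in z_stats if threshold - width <= abs(z) < threshold)
    above = sum(1 for z in z_stats if threshold <= abs(z) <= threshold + width)
    return below, above


def binomial_upper_tail(successes, trials, p=0.5):
    """P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


# Made-up z-statistics, for illustration only (not data from BCH).
sample = [1.80, 1.90, 1.97, 2.00, 2.10, 2.20, 3.50, 0.50]
below, above = caliper_counts(sample)
p_value = binomial_upper_tail(above, below + above)
print(f"just below: {below}, just above: {above}, one-sided p-value: {p_value:.3f}")
```

A small p-value would indicate more mass just above the threshold than a smooth distribution would produce. In a real application the result can be sensitive to the band width chosen.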

BCH also explore the channels. First, they explain why IV tends to be more associated with p-hacking, attributing it to weak instruments. They show that first-stage F-statistics in IV studies are usually unreported or, when reported, below 10. They also find evidence that, for relatively weak instruments, second-stage z-statistics are more likely to fall around the conventional thresholds. In other words, the weaker the instrument, the greater the extent of p-hacking. Moreover, by comparing each published paper with its earlier working-paper version, BCH show that the editorial process does little to change the degree of p-hacking, as the two distributions are almost the same.
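For readers less familiar with the first-stage F-statistic, the sketch below computes it for the simplest case: one endogenous regressor, one instrument, and an intercept. In that case the F-statistic equals the square of the t-statistic on the instrument in the first-stage regression. The function names and the toy data are hypothetical, and the cutoff of 10 is the rule of thumb mentioned above.

```python
def first_stage_f(instrument, regressor):
    """F-statistic from regressing the endogenous regressor on one instrument.

    Assumes a single instrument plus an intercept and homoskedastic errors,
    so F is the squared t-statistic of the slope.
    """
    n = len(instrument)
    z_bar = sum(instrument) / n
    x_bar = sum(regressor) / n
    s_zz = sum((z - z_bar) ** 2 for z in instrument)
    s_zx = sum((z - z_bar) * (x - x_bar) for z, x in zip(instrument, regressor))
    slope = s_zx / s_zz
    intercept = x_bar - slope * z_bar
    rss = sum((x - intercept - slope * z) ** 2 for z, x in zip(instrument, regressor))
    residual_variance = rss / (n - 2)           # two estimated parameters
    slope_variance = residual_variance / s_zz   # squared standard error of the slope
    return slope**2 / slope_variance


def looks_weak(f_stat, cutoff=10.0):
    """Rule-of-thumb flag for a weak instrument."""
    return f_stat < cutoff


# Toy data, for illustration only.
f_stat = first_stage_f([0, 1, 2, 3], [0, 1, 1, 2])
print(f"first-stage F = {f_stat:.1f}, weak instrument: {looks_weak(f_stat)}")
```

With several instruments or additional controls, the relevant statistic is the joint F-test on the excluded instruments, which this single-instrument sketch does not cover.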

Researchers like to use eye-catching stars to highlight their main findings. However, think twice before trusting them completely. Not all star stories are inherently convincing, although some may be flawed and still deliver insights. After all, as Mark Twain once said, “Facts are stubborn things, but statistics are pliable.”

Chaoyi Chen

References:

Brodeur, A., Cook, N., & Heyes, A. (2020). Method Matters: p-Hacking and Publication Bias in Causal Analysis in Economics. American Economic Review, 110(11), 3634-3660.

Cover image source: pixabay.com