Rachel Ward is an assistant professor in the department of mathematics and a researcher at ICES.

Classic “goodness-of-fit” statistical tests compare deviations between observed and expected data with predictable distributions that are well known, widely used and well defined. The problem is that these tests, which researchers depend on to confirm or reject change, are, in many ways, not so good, says ICES researcher Rachel Ward.

The way to remedy the shortcomings of old tests, and to find new ways of evaluating data altogether, lies in computationally enabled statistical tests, says Ward. Her recent research makes that case by comparing how different methods fare at detecting evidence for evolution in a theoretical population. At its core, it’s a match-up between the classical, well-known chi-square test and the computer-driven, less studied root-mean-square test.

“There is a lot of beautiful mathematical theory for the classical goodness-of-fit tests, but unfortunately the theory is asymptotic, holding only in the unrealistic setting where we have access to infinite data,” said Ward, an assistant professor in UT's mathematics department. “Now that we have the computational power to do more accurate non-asymptotic tests, we have an obligation to do so even if the theory is not as elegant.”

Her research, conducted with co-author Raymond Carroll, director of the Texas A&M Institute for Applied Mathematics and Computational Science, was published in November in the journal Biostatistics.

Finding evidence for evolution usually starts with finding the Hardy-Weinberg deviation for a population, a value that describes the difference between expected frequency for an allele and the actual observed value. If a deviation is sufficiently large, it’s good evidence that a population’s gene frequency has actually changed—the definition of evolution—rather than the difference being a product of expected sample-size fluctuation.
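As a rough illustration of the comparison described above, the sketch below estimates allele frequencies from genotype counts and computes the genotype proportions expected under Hardy-Weinberg equilibrium (the square of the allele frequency for a homozygote, twice the product of the two allele frequencies for a heterozygote). The sample counts, the function names and the two-allele setup are assumptions made for illustration; they are not taken from Ward's study.

```python
from collections import Counter
from itertools import combinations_with_replacement


def allele_frequencies(genotype_counts):
    """Estimate allele frequencies from counts keyed by (allele, allele)."""
    totals = Counter()
    for (a, b), n in genotype_counts.items():
        # Each individual carries two alleles, so both are counted.
        totals[a] += n
        totals[b] += n
    n_alleles = sum(totals.values())
    return {allele: count / n_alleles for allele, count in totals.items()}


def hardy_weinberg_expected(freqs):
    """Expected genotype proportions under Hardy-Weinberg equilibrium."""
    expected = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        product = freqs[a] * freqs[b]
        # Homozygotes: p^2. Heterozygotes: 2pq.
        expected[(a, b)] = product if a == b else 2 * product
    return expected


# Hypothetical sample of 100 individuals at a two-allele locus.
# Keys are written in sorted order so they match the expected genotypes.
observed = {("A", "A"): 30, ("A", "a"): 40, ("a", "a"): 30}

freqs = allele_frequencies(observed)
expected = hardy_weinberg_expected(freqs)
n = sum(observed.values())

# Deviation: observed proportion minus expected proportion, per genotype.
deviation = {g: observed.get(g, 0) / n - expected[g] for g in expected}
print(deviation)
```

In this made-up sample, heterozygotes are less common than equilibrium predicts. Whether a gap of that size is meaningful is the question the statistical tests are meant to answer.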

Epidemiologists use these Hardy-Weinberg based methods to study evolution in deadly viruses like HIV and influenza; ecologists use similar techniques to understand genetic diversity, such as color variety in poison frogs, or detect mingling of different populations, such as between wild and farm-raised salmon.

But before any conclusions can be drawn, statistical tests need to be applied to ensure that the deviation is significant enough to indicate a high probability for genetic change.

From the earliest days, the go-to “goodness of fit” test for evaluating deviations from Hardy-Weinberg equilibrium has been the chi-square test.

This statistical test measures the likelihood that a deviation is simply due to chance by expressing the Hardy-Weinberg deviation as a chi-square value, which can be plotted on a pre-defined distribution curve. The exact curve the value is plotted on depends upon the number of alleles being evaluated, a known parameter of the test. Nevertheless, all of the curves have a long right tail that approaches zero probability as the chi-square values increase, so that high chi-square values are associated with a low probability of the results being a chance finding and a high likelihood of evolution.
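A minimal sketch of that calculation follows, assuming a single locus with two alleles, for which the Hardy-Weinberg chi-square test has one degree of freedom. In that special case the tail probability can be written with the complementary error function, so no statistics library is needed. The counts are hypothetical and the function names are illustrative.

```python
import math


def chi_square_statistic(observed, expected):
    """Sum of squared deviations, each divided by its expected count."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def chi_square_p_value_df1(x):
    """Right-tail probability of a chi-square variable with 1 degree of freedom.

    Uses the identity P(Z^2 > x) = erfc(sqrt(x / 2)) for a standard normal Z.
    Other degrees of freedom need a different formula or a lookup table.
    """
    return math.erfc(math.sqrt(x / 2.0))


# Hypothetical genotype counts (AA, Aa, aa) for 100 individuals, and the
# counts expected under equilibrium with both allele frequencies at 0.5.
observed_counts = [30, 40, 30]
expected_counts = [25.0, 50.0, 25.0]

x2 = chi_square_statistic(observed_counts, expected_counts)
p_value = chi_square_p_value_df1(x2)
print(x2, p_value)
```

With these made-up counts the statistic is 4 and the p-value falls just under the conventional 5 percent cut-off.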

Because these curves are derived from an explicit formula, people can easily approximate the significance of a chi-square value with no advanced computing, said Ward, a necessary feature in the early days of the test.

“The chi-square statistic approaches a limiting function that is independent of all other underlying parameters in the population. It’s a universal limit,” said Ward. “And this means you can write down in a table the various values of the function, at different significance levels, and do the goodness-of-fit test by hand, without the help of a computer. This was necessary a century ago when goodness-of-fit statistics were being developed, but not anymore.”

But, to ensure that the calculated chi-square value and those on the chi-square distribution curve are comparable, the test includes steps that “normalize” the final chi-square value. The process is similar to how the three-dimensional form of the Earth is transposed to fit on a two-dimensional paper map. And just as map transposition can create scaling errors (on the popular Mercator map, Greenland and Africa appear about the same size despite Africa having about 14 times the area), the normalization of the chi-square statistic has the troublesome side effect of making the chi-square value overly influenced by small outliers.

“The chi-square statistic can act very unpredictably with respect to changes in rare alleles,” said Ward. “It makes the test somewhat blind to important deviations in the more frequent alleles in the population.”

That means a relatively small sampling error that overestimates the frequency for a rare allele can throw off the entire test, finding evidence for evolution when there is none.

For example, in her research, Ward drew a random sample of allele pairings from a model data set of allele combinations, a procedure akin to ecologists surveying the different genotypes of a population, and computed the Hardy-Weinberg deviation using the chi-square test. She found strong evidence for deviation from equilibrium: the probability that the results could be obtained in a non-evolving population was about 1 percent.

However, by simply removing one draw of a single rare allele, the results completely changed, bumping up the probability of observing the results in a non-evolving population to 20 percent, more than four times as much as the widely accepted 5 percent error cut-off.
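The effect can be sketched with a toy calculation. The counts below are invented and are not Ward's data; they only show how dividing each squared deviation by its expected count, as the chi-square statistic does, lets a single observation of a rare genotype outweigh much larger deviations among common genotypes, while the plain squared deviations tell the opposite story.

```python
def chi_square_terms(observed, expected):
    """Per-category chi-square contributions: (O - E)^2 / E."""
    return [(o - e) ** 2 / e for o, e in zip(observed, expected)]


def squared_deviation_terms(observed, expected):
    """Per-category squared deviations with no division by E."""
    return [(o - e) ** 2 for o, e in zip(observed, expected)]


# Hypothetical expected counts: two common genotypes and one very rare one.
expected = [60.0, 39.9, 0.1]
# Hypothetical sample of 100 in which the rare genotype was drawn once.
observed = [65, 34, 1]

chi_terms = chi_square_terms(observed, expected)
plain_terms = squared_deviation_terms(observed, expected)

for label, c, s in zip(["common 1", "common 2", "rare"], chi_terms, plain_terms):
    print(f"{label:9s} chi-square term = {c:6.3f}   squared deviation = {s:6.2f}")
```

In this example the rare category supplies most of the chi-square total even though it is off by less than one individual, while it supplies the smallest share of the plain squared deviations.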

“It’s a worrisome property of the chi-square test,” said Ward. “It implies that it is not difficult to steer the results of the test to your liking by perturbing the data slightly.”

To combat the problems inherent to the chi-square test, Ward proposes the use of the root-mean-square test.

Unlike the chi-square statistic, there’s no “normalization” of values to fit a certain distribution. Instead, it’s the values themselves that determine the distribution.

“The idea is that there is some underlying probability distribution, but I don’t know explicitly what that distribution is, and I don’t need to if I’m using a computer to compute [the probability that the results indicate significant change],” said Ward.

Similar to the chi-square test, the outcome of the root-mean-square test is a root-mean-square value that has an associated p-value: the probability that the deviation could be observed in a population not undergoing evolution.

However, lacking an explicit formula for the curve, the root-mean-square test finds its underlying distribution, and in effect the root-mean-square value, by sampling the data over and over. The technique, called the Monte Carlo method, is computationally enabled and queries the data, in Ward’s case 16 million times, until a good approximation of the underlying values emerges.
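The general idea can be sketched in a few lines. This is a simplified illustration, not the procedure used in Ward and Carroll's paper: it assumes the model proportions are fully specified in advance, it uses 2,000 simulated samples rather than millions, and all names and counts are hypothetical. The p-value is estimated as the fraction of simulated samples, drawn from the model itself, whose root-mean-square deviation is at least as large as the one observed.

```python
import math
import random
from collections import Counter


def rms_statistic(counts, probs):
    """Root-mean-square gap between observed and model proportions."""
    n = sum(counts)
    squared = [(c / n - p) ** 2 for c, p in zip(counts, probs)]
    return math.sqrt(sum(squared) / len(probs))


def monte_carlo_p_value(counts, probs, n_sims=2000, seed=0):
    """Estimate the p-value by repeatedly sampling from the model."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = sum(counts)
    categories = list(range(len(probs)))
    observed_stat = rms_statistic(counts, probs)
    hits = 0
    for _ in range(n_sims):
        # Draw a sample of the same size from the model distribution.
        draws = Counter(rng.choices(categories, weights=probs, k=n))
        sim_counts = [draws.get(i, 0) for i in categories]
        if rms_statistic(sim_counts, probs) >= observed_stat:
            hits += 1
    return hits / n_sims


# Hypothetical model proportions for genotypes AA, Aa, aa.
model_probs = [0.25, 0.5, 0.25]

p_good = monte_carlo_p_value([25, 50, 25], model_probs)  # matches the model
p_poor = monte_carlo_p_value([50, 30, 20], model_probs)  # far from the model
print(p_good, p_poor)
```

No table of critical values is involved: the reference distribution is built from the simulated samples, which is why the approach depends on computing power rather than on a closed-form curve.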

By letting the distribution depend on the data, the root-mean-square test avoids being overly influenced by outliers in a small portion of the curve, and relies instead on more commonly occurring values to determine deviation significance, said Ward.

In other words, when analyzing with the root-mean-square test it takes more than an errant rare allele, or alleles, to skew the data.

However, the root-mean-square test has its own point of weakness: a sensitivity to data that makes up the bulk of the distribution, or the most common alleles. The data in this portion of the distribution has the most influence on the root-mean-square value—which can misrepresent evidence for evolution in the case of errors that affect the distribution as a whole.

This sensitivity to the most common values is the complete opposite of the chi-square test, says Ward. Thus, the strengths of each test essentially offset the weaknesses of the other when both tests are applied to the same data.

“Data is most thoroughly evaluated when both tests are used together. And a disparity in either of the resulting p-values is a red flag that errors may exist in all or parts of the data,” said Ward. “To apply both tests [to data] would be the recommendation.”

Although Ward chose Hardy-Weinberg deviation as the frame for evaluating the chi-square and root-mean-square tests, statistics that measure deviation probability are useful across many fields. Uli Sauerland, a researcher at the Centre for General Linguistics in Berlin, Germany, is applying Ward’s root-mean-square methods to study patterns of pronouns used across languages.

“I am working with data from several hundred languages where for each it is coded which pattern of identities the pronouns have. The main goal is to explain why many patterns occur either very rarely or not at all,” said Sauerland.

Sauerland is applying the Monte Carlo method, used by Ward to draw genotypes, to find possible pronoun patterns beyond those that exist in languages spoken today, and then using the root-mean-square test to test if deviations between pronoun patterns are significant.

“The number of logically possible patterns, about 8,000, is slightly greater than current estimates of the number of languages spoken on Earth, about 6,000. So, Dr. Ward's work on Monte Carlo methods using root-mean-square distance has been central for my research,” said Sauerland.

Ward said she plans on continuing research into applications of computationally enabled statistics, such as the root-mean-square test. Although computers can be used to improve and enhance older methods, it’s the development of completely new ones that most interests Ward.

“It will be interesting to see how [statistics] will evolve,” said Ward. “It fascinates me how we still limit ourselves in statistics in many ways that we do not have to anymore because of computers.”

*Written by Monica Kortsha*