Worked statistics problem and a path to more value from your data

In a previous post I went through the details of experiments in psychology and political science, and the steps needed to do an ANOVA. I thought I’d now do a worked problem with real numbers. At the end I also want to discuss what 95% confidence may mean, why it might not mean what you think, what you can do about this, and an additional statistic to check. If you have not read that post, go back and read it first.

The worked problem: Assume you have two groups of people tested for racism, or political views, or some allergic reaction. One group is given nothing more than the test; the other group is given some prompt: an advertisement, a drug, a lecture… We want to know whether the prompt had a significant effect at 95% confidence. Here are the test scores for both groups, on a scale from 0 to 3.

Control group: 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3.  These are the X°’s; there are 11 of them.

Prompted group: 0, 1, 1, 1, 2, 2, 2, 2, 3, 3.  These are the X*’s; there are 10 of them.

X°-bar = 1.364; X*-bar = 1.7. The difference between these two is 0.336. To find whether this is significant at 95% certainty, we calculate an average SD:

SD°² = 0.777, so SD° = √0.777 = 0.881

SD*² = 0.81, so SD* = √0.81 = 0.9

The average of the two is SD = 0.891. We now calculate SV = SD × (1/10 + 1/11) = 0.1700.

T-table (figure)

We now go to the T-table for df = 19 and find that the critical value is 2.093. Since 2.093 times 0.1700 = 0.356 is more than 0.336, we do not have statistical significance. That looks bad. Thesis down the drain. Our prompt does not work, or so we think.
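For those who’d rather let a computer do the arithmetic, here is a minimal sketch of the same comparison in Python using SciPy’s standard pooled-variance t-test. Note that the textbook formula uses the square root of (1/10 + 1/11), so the intermediate numbers differ a bit from the shorthand above, but the verdict is the same: not significant at 95%.

```python
# Minimal sketch: standard pooled-variance (Student's) t-test on the
# worked-example scores. The textbook SE uses sqrt(1/n1 + 1/n2), so the
# intermediate numbers differ slightly from the shorthand in the post.
from scipy import stats

control  = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]   # the X° scores, n = 11
prompted = [0, 1, 1, 1, 2, 2, 2, 2, 3, 3]      # the X* scores, n = 10

t, p = stats.ttest_ind(prompted, control, equal_var=True)
print(f"difference of means = {sum(prompted)/10 - sum(control)/11:.3f}")  # 0.336
print(f"t = {t:.2f}, two-sided p = {p:.2f}")   # p is well above 0.05: not significant
```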

But don’t give up so fast: if we just took more data, e.g. testing another 21 students, and it came out similar to the above, we’d find the SV would be smaller by a factor of √2 = 1.414, and the critical T value would be smaller too. Then, for sure, we would have shown the effectiveness of our prompt to 95% certainty. Now we would find that our prompt is effective. Quite effective. We can now publish.
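Here is a quick sketch of that effect, assuming the doubled data really do look like the original (same means and SDs): the critical T value drops as the degrees of freedom grow, and the standard error of the difference shrinks by roughly √2.

```python
# Sketch: how doubling the sample size tightens the 95% test,
# assuming the new scores look just like the old ones.
from scipy import stats

for n1, n2 in [(11, 10), (22, 20)]:
    df = n1 + n2 - 2
    t_crit = stats.t.ppf(0.975, df)    # two-sided 95% critical value
    print(f"n = {n1} + {n2}: df = {df}, critical T = {t_crit:.3f}")
# df = 19 gives about 2.093; df = 40 gives about 2.021, and the standard
# error of the difference shrinks by roughly sqrt(2) = 1.414.
```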

The above is an example of how to do the ANOVA, and it is also an indicator of the problems inherent in it.

Most studies test 60-80 subjects, and because they do so, they tend to find a difference between many, many groups at the 95% confidence level. They thus find they can use ANOVA to prove almost anything, but just to be sure, research groups tend to test several different prompts, and several doses of the same prompt too. To some extent this is legitimate; once you’ve brought in the students to test, and you’re paying for them to answer questions, you might as well ask a few extra questions. But be careful if some of these differences show up as significant at the 95% level and others do not. If weak prompts are significant at the 95% level but strong ones are not, you may be fooling yourself (and others) about the significance. The genetically modified food folks did not have the money to test lots of rats, so they fed 20 different doses of GM grain and looked for one that made health worse at the 95% certainty level. Of course, at some doses the rats were healthier, but this group just dismissed those results. If you are at all honest, you are not free to toss such insights and just use ANOVA as the genetic food clowns did. The dose-response is a key insight; be sure to save this data, and use all of it.
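To see why testing many prompts and doses, each at the 95% level, is dangerous, here is a small sketch of the multiple-comparisons arithmetic, assuming the tests are independent.

```python
# Sketch: chance of at least one false positive when running k
# independent tests, each at the 95% confidence level.
for k in (1, 5, 10, 20):
    p_false = 1 - 0.95 ** k
    print(f"{k:2d} tests: P(at least one spurious 'significant' result) = {p_false:.2f}")
# With 20 tests, as in the GM-grain example, that chance is about 64%.
```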

There will always be some difference between groups, even without a prompt. The fact that you find some difference that is bigger than random chance at the 95% level suggests that something is happening, but if there is no reasonable correlation with the dose of the prompt, all you can say is that you are measuring a difference whose cause may be something other than the prompt. The cause may be that one group was tested early in the morning and the other later. Thus, your main cause might be early-bird personality. Did one group talk to another? You might be testing the effect of priming. Did you test in different locations? All these things will affect the difference X*-bar minus X°-bar, and of course the use of psychology students as your subjects means that the subjects often suspect they know what you want to find. The researcher rarely has the ability to notice how significant these effects are in a particular study. A graduate student takes the data and, with the help of the professor, uses the math above to show a difference at 95% certainty. Both think they have shown that the movie or other prompt has a strong effect on how people look at war, or flying, or racism. But more often than not they have instead shown the small difference between psychology students on the west campus and those on the east.

A next test is R-squared. It is a critically important addition to the ANOVA, and this is where you’ll get to use all of the prompt-dose data. Plot all your data on a graph, not just the averages but all the raw data, with dose as the horizontal axis and test scores as the vertical on a two-dimensional graph. Include here the scores from those who’ve seen the movie-prompt twice (you might want to show it twice on purpose too). Use the scores for those who’ve seen half a prompt too. Plot a best linear fit through the data and look at the value for R-squared as calculated by your computer. For two prompt doses, 0 and 1, the slope of the best-fit line should come out equal to X*-bar minus X°-bar. Here, though, despite a 95% significance, you are likely to find that the R-squared value is small, often less than 10%. That is to say, the slope, the difference you observed, explains less than 10% of what is happening in your data. You’ll have to look a lot deeper to get the rest. A lot of the additional insight comes from the few subjects who saw the movie twice, or half of it, or in different locations and at different times of the day. To be confident in your result, you want a high-confidence ANOVA score and an R-squared above 50%, I’d say.
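As a sketch of this, here is the straight-line fit of score against dose for the worked-example data above (dose 0 for the control group, dose 1 for the prompted group). The slope equals X*-bar minus X°-bar, yet the R-squared is only a few percent.

```python
# Sketch: best linear fit of score vs. prompt dose for the worked example,
# and the R-squared it yields.
from scipy import stats

control  = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]   # dose 0
prompted = [0, 1, 1, 1, 2, 2, 2, 2, 3, 3]      # dose 1

doses  = [0] * len(control) + [1] * len(prompted)
scores = control + prompted

fit = stats.linregress(doses, scores)
print(f"slope     = {fit.slope:.3f}")           # 0.336 = X*-bar minus X°-bar
print(f"R-squared = {fit.rvalue ** 2:.3f}")     # about 0.03: the dose explains ~3%
```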

---- Let’s discuss a real paper in political science and psychology ----

How the strength of the security prompt affects the desirability of “the charismatic candidate”, from the study by Gillath and Hart (figure).

The chart at right presents a linear plot of data from a psychological paper analyzing candidate choice and policy opinion: “The effects of psychological security and insecurity on political attitudes and leadership preferences” by O. Gillath and J. Hart, Eur. J. Soc. Psychol. 40, 122–134 (2010). The authors took different groups of students and asked them about political candidates. There was no zero-prompt data, but there were three levels of prompt. The study asked the subjects (psychology students) to envision a close friend, a less-close friend, or an acquaintance. The authors called this a security prompt, since the close friend was considered a source of security. Since the paper didn’t give raw data, only averages and standard deviations, I was forced to make up the data on the chart assuming that 1/3 of the students scored at the average, 1/3 scored one SD above the average, and 1/3 one SD below. The R-squared results would come out about the same assuming a normal distribution.
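For readers who want to repeat this kind of reconstruction, here is a sketch of the procedure. The means, SDs, and group sizes below are made-up placeholders, not the values from Gillath and Hart.

```python
# Sketch: rebuild stand-in raw data from reported group means and SDs
# (1/3 of each group at the mean, 1/3 one SD above, 1/3 one SD below),
# then fit a line against prompt strength. All numbers below are
# hypothetical placeholders, not the values from the paper.
from scipy import stats

def three_point_group(mean, sd, n):
    """n scores: roughly a third at the mean, a third at mean+sd, a third at mean-sd."""
    k = n // 3
    return [mean] * (n - 2 * k) + [mean + sd] * k + [mean - sd] * k

# prompt strength 1 = acquaintance, 2 = less-close friend, 3 = close friend
groups = {1: (5.0, 1.2, 40), 2: (4.6, 1.1, 38), 3: (4.2, 1.3, 42)}  # (mean, SD, n)

doses, scores = [], []
for dose, (mean, sd, n) in groups.items():
    vals = three_point_group(mean, sd, n)
    doses  += [dose] * len(vals)
    scores += vals

fit = stats.linregress(doses, scores)
print(f"slope = {fit.slope:.3f}, R-squared = {fit.rvalue ** 2:.3f}")
```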

With 34 to 47 subjects in each group, the researchers found that, at better than the 95% confidence level (P < .05), students were less drawn to the “charismatic candidate” when given the strongest security prompt rather than the weakest; that is, when instructed to think about the very close friend rather than the acquaintance. They were drawn to the other candidates less too, but the effect was not as strong as with the charismatic candidate. It’s a published study and a pretty interesting result, if true. I checked the data using the math above (they used a computer) and it checks out. But then I checked it again using a graph and R-squared. R-squared is only 7.6%. This is to say that this effect is only about 7.6% of what’s going on here (or, perhaps you should use R; by that measure it’s 27.6%). In any case, most of what’s going on is not reflected by saying that thoughts of security strongly affect candidate choice.

The reason the results are so different with R-squared is that R-squared does not give you any benefit from a large n. Nine data points arranged as above will have the same R-squared value as 90. What R-squared is picking up on is that the SDs are large relative to the slope. By adding some extra experiments and analyzing by R-squared, you can find out why the standard deviation is so big. This is a way to find out what else is happening. As an example, you’d want to repeat the test with students who are given no prompt at all, just off the street, and then repeat it after the prompt. The first data sets will be particularly telling. If the scores don’t match the pattern above, it may mean that you are giving something away in the way you present the prompt. If this data does match the above pattern, the R-squared value will go up, and it will increase your confidence that you’re on to something. Similarly, you’ll want to see what happens if you give two strong prompts. Also, you will want to check whether the data is normally distributed (I’ve written about how that affects things and how you check for it). If the scores are not normal, you will have to adjust them in some legitimate way to make them normal. This is an art. To quote Bob Dylan, the current understanding of this paper is, in my opinion: “Something is happening here, and you don’t know what it is.”
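As a sketch of that normality check, the Shapiro-Wilk test (or a normal probability plot) is one common way to do it; here it is applied to the worked-example control scores from earlier in this post.

```python
# Sketch: quick normality check on a group's scores before leaning on the
# t-test and R-squared. Uses the worked-example control group.
from scipy import stats

control = [0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3]

W, p = stats.shapiro(control)
print(f"Shapiro-Wilk W = {W:.3f}, p = {p:.3f}")
# A small p (below 0.05, say) is evidence the scores are not normal, and
# some legitimate adjustment would be needed before trusting the t-test.
```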

Robert Buxbaum, March 18, 2019
