Some Thoughts on Likert Scale Data

I keep running into the problem of how to handle Likert scale data, partly because almost all the students I consult with collect data from questionnaires with items such as 1 = strongly disagree, 2, 3, 4, up to 5 = strongly agree. This kind of question could not be more common in survey sampling. But if we treat these data as interval or ratio type, there are some potential problems.

1. Many statistical techniques require the data to follow a normal distribution, or at least to be asymptotically normal. It is hard for Likert scale data to follow a normal distribution, for at least two reasons from my experience. First, respondents' answers often show floor or ceiling effects, which means the number of people who choose 5 is rarely close to the number who choose 1. In other words, if most people choose 5, then probably very few choose 1. People's answers tend in the same direction. None of this leads to a bell-shaped, or even symmetrical, distribution. The other reason is that the normal distribution describes continuous data. How can you estimate a normal curve from just three or five discrete vertical bars? Ridiculous.

2. The interval between two points may not make sense. I agree with this when the ordinal data are 1st, 2nd, or 3rd place in a 100-meter race. The distance between 1 and 2 makes no sense there (who can tell me whether 1.3th place wins the gold medal or the silver?). However, although I admit that Likert scale data are ordinal in nature, I think the distance between two points does mean something, more or less. For example, if the mean of two people's answers is 1.5, at least we cannot conclude they have a positive attitude, right? I have read some articles discussing this. I suggest to the students who come to see me that they treat the data as interval if they are using a Likert scale with at least five points. But be aware that statisticians are still debating this. If you show my article to another statistician, they may say what I wrote is bullshit.
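To illustrate the first point, here is a minimal Python sketch of a ceiling effect. The 100 responses are made up for illustration, not taken from any real survey. It prints a crude text histogram and the skewness, which would be close to 0 for a symmetric distribution.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical answers to one 5-point item with a ceiling effect:
# half of the respondents choose 5, very few choose 1 or 2.
responses = [5] * 50 + [4] * 30 + [3] * 10 + [2] * 5 + [1] * 5


def skewness(values):
    """Population skewness (third standardized moment)."""
    m = mean(values)
    s = pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


counts = Counter(responses)
for point in range(1, 6):
    # One '#' per respondent gives a rough picture of the shape.
    print(point, "#" * counts[point])

print("mean:", mean(responses))
print("skewness:", round(skewness(responses), 2))
```

A clearly negative skewness means the bars pile up on the right, which is exactly the situation where a bell curve is a poor description.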

Does it matter whether the data are ordinal or interval? The answer is definitely YES. Each type of measurement has its own appropriate statistical techniques. Take gender, a nominal variable, as an extreme example: we cannot take its mean score. (Whose gender is half/half?? I guess there are some!) Likewise, for strictly ordinal data we cannot use the t-test, ANOVA, mean scores, and many other techniques; only the mode, median, and frequencies can be applied.
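As a small sketch of the summaries that remain available for ordinal data, the Python snippet below computes frequencies, the median, and the mode. The answers and the function name `ordinal_summary` are hypothetical.

```python
from collections import Counter
from statistics import median, mode

# Hypothetical answers to one 5-point item.
responses = [1, 2, 2, 3, 4, 4, 4, 5, 5]


def ordinal_summary(values):
    """Return the summaries that only rely on the ordering of the values."""
    counts = Counter(values)
    return {
        "frequency": dict(sorted(counts.items())),
        "median": median(values),
        "mode": mode(values),
    }


summary = ordinal_summary(responses)
print(summary)
```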

Oh, that is too sad. Aren't there any exceptions? Yes, there are, thanks to the Central Limit Theorem. For analyses based on sample means, the Central Limit Theorem lets us largely set aside the normality problem I described above, IF the sample size is LARGE.

Sounds good? Unfortunately, how large is large? A common rule-of-thumb cutoff is 100. But what if your sample size is 90? HAHA. Yeah, this is statistics. The only way now is to judge for yourself.
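To get a feeling for "how large is large", here is a rough simulation sketch in Python. The response distribution (the `weights`), the number of replications, and the seed are all my own assumptions. Keep in mind that the Central Limit Theorem is a statement about the distribution of the sample mean, not about the individual answers.

```python
import random
from statistics import pstdev


def skewness(values):
    """Population skewness (third standardized moment)."""
    m = sum(values) / len(values)
    s = pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


def sample_mean_skewness(n, replications=2000, seed=1):
    """Skewness of the means of many samples of size n.

    Each sample is drawn from a made-up, strongly skewed 5-point
    distribution (ceiling effect: half of the answers are 5).
    """
    rng = random.Random(seed)
    points = [1, 2, 3, 4, 5]
    weights = [5, 5, 10, 30, 50]  # hypothetical response percentages
    means = []
    for _ in range(replications):
        sample = rng.choices(points, weights=weights, k=n)
        means.append(sum(sample) / n)
    return skewness(means)


for n in (5, 30, 100):
    print("n =", n, "skewness of sample means:", round(sample_mean_skewness(n), 2))
```

The skewness of the sample means should shrink toward 0 as n grows, even though the individual answers stay just as skewed. Where to draw the line is still a judgment call.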

Yes, it is true. Your case may always be imperfect; life is not always ideal, and the statistical methods used are not always the most appropriate.

Statistics, I love it and I hate it.

Posted in Statistics

Multiple Imputation by SPSS

Recently, when I read some articles about missing value analysis, most of them said multiple imputation is the better way to deal with missing values. So I decided to change my mind and take a look at what it is.

In multiple imputation, two terms are very important: MCAR and MAR. Missing completely at random (MCAR) means the missingness does not depend on any other values. Missing at random (MAR) means the pattern of missing data is related only to the observed data. When is deletion better than multiple imputation? My answer: if there are few missing values, usually less than 5%, and the missingness does not depend on other values (MCAR), then deletion is relatively “safe”.

There is a test called Little’s MCAR test for checking whether the missing data are MCAR. The null hypothesis is that the missing data are MCAR, so a significant result suggests that they are not (they may be MAR instead). The Multiple Imputation procedure then provides multiple versions of the dataset (5 by default), each containing its own set of imputed values. When doing statistical analysis in SPSS, the results from all of the imputed datasets are pooled, which is more accurate than deletion or a single imputation.

There are several methods for estimating missing values: listwise, pairwise, regression, and EM. The first three require that the missing data be MCAR. Under that condition, they give consistent and unbiased estimates of the correlations and covariances. The EM method, however, only requires that the missing data be MAR, so it can be used when MCAR is violated. When the missing data are neither MCAR nor MAR, which is uncommon, none of these methods is appropriate.

After several “complete” datasets are created, “Analytic procedures that work with multiple imputation datasets produce output for each complete dataset, plus pooled output that estimates what the results would have been if the original dataset had no missing values.” Fortunately, most of the frequently used techniques support pooling, such as descriptives, t-tests, ANOVA, linear models, and so on.
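To make "pooled output" a bit more concrete, here is a minimal Python sketch of the standard combining rules for multiple imputation (usually called Rubin's rules). I am assuming SPSS pools along these lines; the quote above does not spell out the formula. The function name `pool_estimates` and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_estimates(estimates, variances):
    """Combine one parameter estimate across m imputed datasets.

    estimates: the estimate (e.g. a mean) from each imputed dataset
    variances: the squared standard error from each imputed dataset
    Returns (pooled_estimate, total_variance).
    """
    m = len(estimates)
    pooled = mean(estimates)           # average of the m estimates
    within = mean(variances)           # average within-imputation variance
    between = variance(estimates)      # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical results from 5 imputed datasets.
pooled, total = pool_estimates(
    [3.41, 3.38, 3.45, 3.40, 3.36],
    [0.010, 0.011, 0.010, 0.012, 0.011],
)
print("pooled estimate:", round(pooled, 3))
print("pooled standard error:", round(total ** 0.5, 3))
```

The between-imputation term is what a single imputation cannot provide: it carries the extra uncertainty that comes from not knowing the missing values.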

In summary, the steps for doing Multiple Imputation by SPSS are:

  1. Run descriptive statistics to describe the pattern of missing data.
  2. Run Little’s MCAR Test to confirm the conclusion we drew from the descriptive statistics.
  3. When the missing data are MCAR, deletion is relatively safe if less than 5% of the values are missing, or use the listwise, pairwise, or regression methods. When the missing data are MAR, use the EM method for estimation.
  4. After multiple imputation, run the desired statistical techniques on the created “complete” datasets, and look at the output for each dataset as well as the pooled output.
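Outside SPSS, steps 1 and 3 above can be sketched in a few lines of Python. This is only an illustration: the data are made up, `None` marks a missing value, and the function names `missing_summary` and `suggest_handling` are hypothetical. The result of Little's MCAR test is not computed here; it is passed in as the flag `looks_mcar`.

```python
def missing_summary(rows, variables):
    """Describe the missing-data pattern.

    rows: list of dicts, one per respondent; None marks a missing value.
    Returns (fraction missing per variable, fraction of incomplete cases).
    """
    n = len(rows)
    per_variable = {
        v: sum(row[v] is None for row in rows) / n for v in variables
    }
    incomplete = sum(
        any(row[v] is None for v in variables) for row in rows
    ) / n
    return per_variable, incomplete


def suggest_handling(incomplete_fraction, looks_mcar, threshold=0.05):
    """Encode the rule of thumb from step 3 (5% is the cutoff used in this post)."""
    if looks_mcar and incomplete_fraction < threshold:
        return "deletion is relatively safe"
    if looks_mcar:
        return "listwise, pairwise, or regression estimates"
    return "EM estimation (assumes MAR)"


# Hypothetical answers from four respondents to two items.
rows = [
    {"q1": 5, "q2": None},
    {"q1": 4, "q2": 3},
    {"q1": None, "q2": 2},
    {"q1": 3, "q2": 4},
]
per_variable, incomplete = missing_summary(rows, ["q1", "q2"])
print(per_variable, incomplete)
print(suggest_handling(incomplete, looks_mcar=True))
```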

Reference:

  1. PASW Missing Values 18
Posted in Statistics

My Thoughts About Handling Missing Values

In my experience, missing data are absolutely inevitable in students’ data sets. To my surprise, most of the students are not very concerned about missing values. In their minds, the way to handle cases with missing values is just to collect more respondents and delete the incomplete cases directly, which makes me a bit uncomfortable. Therefore, I read some papers about handling missing values. In this post I focus especially on missing values in Likert scale data, since most of the data I see are of this type.

At least in my opinion, deleting missing values is not a good approach, because the data set may lose representativeness after deletion: the missing data may depend on some other variables. In other words, the missingness is not random. If the sample size is large enough and the missing values are few, for example less than 5%, deleting them may not seriously affect the trend or pattern of the data. In that situation, deletion can be considered an option. When this criterion cannot be met, the results could be biased.
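Here is a rough Python simulation of why deletion can bias the results when the missingness depends on another variable. Everything in it is assumed for illustration: two equally sized groups, a satisfied group that answers high and rarely skips the item, and an unsatisfied group that answers low and skips it half of the time.

```python
import random


def simulate(n=4000, seed=7):
    """Compare the mean of all answers with the mean after deletion.

    Returns (full_mean, complete_case_mean, fraction_missing).
    """
    rng = random.Random(seed)
    points = [1, 2, 3, 4, 5]
    full, observed = [], []
    for _ in range(n):
        satisfied = rng.random() < 0.5
        # Hypothetical response percentages for each group.
        weights = [5, 5, 10, 30, 50] if satisfied else [50, 30, 10, 5, 5]
        answer = rng.choices(points, weights=weights)[0]
        full.append(answer)
        # Missingness depends on the group, not on pure chance.
        p_missing = 0.05 if satisfied else 0.50
        if rng.random() >= p_missing:
            observed.append(answer)
    full_mean = sum(full) / len(full)
    cc_mean = sum(observed) / len(observed)
    return full_mean, cc_mean, 1 - len(observed) / len(full)


full_mean, cc_mean, fraction_missing = simulate()
print("mean of all answers:    ", round(full_mean, 2))
print("mean after deletion:    ", round(cc_mean, 2))
print("fraction of answers lost:", round(fraction_missing, 2))
```

In this setup the unsatisfied respondents are under-represented after deletion, so the remaining answers look more positive than the full sample really is.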

Besides that, there are several other ways to deal with missing values in practice, and each has advantages and drawbacks.

1. Neutral value substitution. For example, on a 5-point scale, substitute 3 for all missing values. This is a very simple method, but the drawbacks are obvious. It does not take the survey population into account at all and depends only on the type of scale. The mean of the non-missing values might be far above or far below 3.

2. Mean or median substitution. This is very convenient, but using the mean or median will reduce the true variance of the variable. Moreover, there are two possible ways to compute the mean or median. One is to substitute for a missing response the mean of that person’s other responses. This looks reasonable, since if one particular person has a negative attitude towards the survey, the general responses may not represent this person’s answer. But, as Tom Knapp, Professor Emeritus, University of Rochester, writes: “The main reason it is controversial is that it tends to artificially increase the internal-consistency reliability of the measuring instrument- it makes the items look like they hang together better than they do.” The second way is to use the mean of the other persons’ non-missing values for the same variable. As discussed above, this will reduce the true variance of the variable.

3. Multiple imputation. This method guesses the most likely missing values based on the values found in other variables. I found a 77-page article about imputation in SPSS.
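The first two substitution methods above are simple enough to sketch in Python. The small data matrix is made up (rows are persons, columns are items, `None` marks a missing answer), and the function names are hypothetical. The sketch assumes every person and every item has at least one observed answer.

```python
from statistics import mean, pvariance

# Hypothetical 5-point responses: 4 persons (rows) by 4 items (columns).
data = [
    [5, 4, None, 5],
    [1, 2, 1, None],
    [4, None, 5, 4],
    [2, 1, 2, 2],
]


def neutral_substitution(rows, neutral=3):
    """Method 1: replace every missing answer with the scale midpoint."""
    return [[neutral if v is None else v for v in row] for row in rows]


def item_mean_substitution(rows):
    """Method 2, second way: use the mean of the other persons on the same item."""
    n_items = len(rows[0])
    item_means = [
        mean([row[j] for row in rows if row[j] is not None])
        for j in range(n_items)
    ]
    return [
        [item_means[j] if v is None else v for j, v in enumerate(row)]
        for row in rows
    ]


def person_mean_substitution(rows):
    """Method 2, first way: use the mean of the same person's other answers."""
    filled = []
    for row in rows:
        person_mean = mean([v for v in row if v is not None])
        filled.append([person_mean if v is None else v for v in row])
    return filled


# Variance of the second item before and after item-mean substitution.
observed_item = [row[1] for row in data if row[1] is not None]
filled_item = [row[1] for row in item_mean_substitution(data)]
print("variance, observed answers only:", round(pvariance(observed_item), 3))
print("variance, after substitution:   ", round(pvariance(filled_item), 3))
print("person-mean version:", person_mean_substitution(data))
```

The filled-in value sits exactly at the item mean, so it adds a case without adding any spread. That is the variance reduction mentioned under method 2.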

As a young statistics student, I am supposed to read the entire article and figure it all out. However, this is time consuming, and I should mainly assist the students with their research questions, not with handling missing values.

In summary, weighing all the advantages and drawbacks, I prefer the straightforward method: using the mean of the same person’s other responses, since it is justifiable and simple. Although this is controversial, I think we would be OK for a purely predictive purpose.

References:
1. Colldata blog http://colldata.wordpress.com
2. PASW Missing Values 18.
3. Some other web articles.

Posted in Diary

The Coming Decade: What Will Happen?

Ten years ago, maybe you were learning to send email.

Now, you are already tired of browsing the internet on an iPhone.

This video will show you what will happen in the next decade.

Posted in Diary

Thoughts on Overview of SPSS

First of all, thanks so much for coming to hear my presentation. I hope it was at least somewhat helpful to someone.

As we know, SPSS is not the most powerful statistical software, but I think it is the most convenient to use, especially for beginners. Many statisticians do not like SPSS because there are some statistical bugs or errors in it. Moreover, SPSS is like a black box: all we can do is input the data and choose the methods or techniques we want to use. I agree that the output sometimes looks lovely, at least to non-stats majors. BUT we cannot know the EXACT PROCESS or PROCEDURE in SPSS. In other words, we do not know the details of how SPSS deals with our data, which is terrible sometimes.

Why do many statisticians not like SPSS? I think there are several reasons:

1. There are some statistical errors in SPSS.

2. They do not like the interface of SPSS.

3. Hypocrisy. They think, “I am a statistician, so I should use advanced software. I will lose face if someone knows I actually use SPSS at home.”

Anyway, even though there are some errors in SPSS, we have to agree that SPSS is powerful, and especially useful and convenient for social science studies because it does not require typing complicated syntax.

On December 2nd, I will conduct the third and last workshop of this semester, Overview of SAS. You are welcome to register. In this workshop, we will use a lot of syntax, which may confuse you. That is OK; you are not supposed to learn everything about SAS in just 3 hours. But you can get familiar with it and explore it by yourself in the future.

Thanks so much to all of today’s participants. I appreciate it.

Posted in Statistics

Stay Hungry. Stay Foolish

“If you live each day as if it was your last, someday you’ll most certainly be right.”

This is from the text of the commencement address by Steve Jobs, and it inspires me every day.

I would like to share it with you.

http://news.stanford.edu/news/2005/june15/jobs-061505.html

Steve Jobs

Posted in Diary

Summary of the Workshop on Statistical Concepts

Generally, the workshop was not bad. I covered most of what I had prepared. I probably prepared too much material and had to rush at the end. Students with some stats background may have felt a bit disappointed, since the workshop did not meet their expectations.

Anyway, I appreciate everyone who came to the presentation. The next one, Overview of SPSS, will definitely be much better than this one.

Posted in Diary

Overview of Statistical Concepts

Today I will give a three-hour presentation introducing statistical concepts.

Good luck to myself! ^_^

Posted in Diary

My first Personal Website!

Welcome again to Haolai’s Personal Website!

On my website, I will try to post articles often about my major and my thoughts.

Thank Yihui sooooo much! I could not have set up this site without his help. His website (http://yihui.name) is a really fun place to visit. He led me to the future of my research.

At JSM 2009, he introduced me to so many famous scholars, which was a wonderful experience.

Welcome once again, and have fun here ^_^

Plus, I am a gooner! Arsenal is my team forever!

Posted in Diary