My Thoughts About Handling Missing Values

In my experience, missing data is inevitable in students’ data sets. To my surprise, most students are not very concerned about missing values. In their minds, the way to handle them is simply to collect more respondents and delete the incomplete cases directly, which makes me a bit uncomfortable. So I read some papers about handling missing values. Here I focus on missing values in Likert-scale items, since most of the data I see is of this type.

In my opinion, deleting cases with missing values is not a good approach: the remaining data may no longer be representative, because whether a value is missing may depend on other variables. In other words, the missingness is not random. If the sample is large enough and the share of missing values is small, for example less than 5%, deleting them may not seriously distort the trends or patterns in the data, so deletion can be considered an option. When this criterion is not met, deletion can bias the results.
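A minimal sketch of the two checks described above, using only the Python standard library. The data, the item names q1 and q2, and the helper functions are all made up for illustration, and the 5% rule is applied here to the share of respondents with at least one missing answer, which is only one way to read it.

```python
from statistics import mean

# Hypothetical data: each tuple is one respondent's answers (q1, q2)
# on a 5-point Likert scale; None marks a missing answer.
responses = [
    (5, 4), (4, 4), (5, 5), (4, 5),
    (2, None), (1, None), (3, 3), (4, 4),
]

def missing_row_share(rows):
    """Fraction of respondents with at least one missing answer."""
    incomplete = sum(1 for row in rows if any(v is None for v in row))
    return incomplete / len(rows)

def listwise_delete(rows):
    """Keep only respondents who answered every item."""
    return [row for row in rows if all(v is not None for v in row)]

def mean_by_missingness(rows, target, other):
    """Mean of item `other`, split by whether item `target` is observed or missing."""
    observed = [r[other] for r in rows
                if r[target] is not None and r[other] is not None]
    missing = [r[other] for r in rows
               if r[target] is None and r[other] is not None]
    return (mean(observed) if observed else None,
            mean(missing) if missing else None)

share = missing_row_share(responses)
q1_when_q2_observed, q1_when_q2_missing = mean_by_missingness(
    responses, target=1, other=0
)

print(f"share of incomplete respondents: {share:.0%}")
print(f"mean q1 when q2 observed: {q1_when_q2_observed:.2f}")
print(f"mean q1 when q2 missing:  {q1_when_q2_missing:.2f}")

# Rule of thumb from the text: delete only if under 5% of cases are affected.
if share < 0.05:
    cleaned = listwise_delete(responses)
else:
    cleaned = None  # too much is missing; deletion could bias the results
```

In this toy data the respondents who skipped q2 also gave much lower answers on q1, so dropping them would push the remaining q1 mean upward. That is the kind of dependence that makes deletion risky.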

Besides deletion, there are several other ways to deal with missing values in practice, each with its own advantages and drawbacks.

1. Neutral value substitution. For example, on a 5-point scale, substitute 3 for all missing values. This is a very simple method, but the drawbacks are obvious. It ignores the survey population entirely and depends only on the type of scale. The mean of the non-missing values might be far above or far below 3.

2. Mean or median substitution. This is very convenient, and there are two possible ways to compute the mean or median. The first is to substitute for a missing response the mean of that person’s other responses. This looks reasonable: if one particular person has a negative attitude towards the survey, the other respondents’ answers may not represent this person’s answer. But, as Tom Knapp, Professor Emeritus, University of Rochester, writes: “The main reason it is controversial is that it tends to artificially increase the internal-consistency reliability of the measuring instrument- it makes the items look like they hang together better than they do.” The second is to use the mean of the other respondents’ non-missing values for the same variable. The problem here is that it reduces the variance of the variable below its true value, because every imputed value sits exactly at the center of the observed distribution.

3. Multiple imputation. This method fills in each missing value several times with plausible values predicted from the other variables, analyzes each completed data set, and pools the results. I found a 77-page document about imputation in SPSS.
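The drawbacks of the two simple substitution methods (items 1 and 2 in the list) can be seen on a small made-up example. This is only a sketch: the item values are hypothetical, `substitute` is a helper written for this illustration, and the mean used is that of the other respondents' answers to the same item.

```python
from statistics import mean, variance

# Hypothetical answers to one 5-point Likert item; None marks a missing answer.
item = [5, 4, None, 5, 4, None, 5, 4]

observed = [v for v in item if v is not None]

def substitute(values, fill):
    """Replace every missing answer with the same fill value."""
    return [float(fill) if v is None else float(v) for v in values]

neutral_filled = substitute(item, 3)            # scale midpoint
mean_filled = substitute(item, mean(observed))  # mean of the observed answers

print("mean, observed only:  ", mean(observed))
print("mean, neutral filled: ", mean(neutral_filled))
print("mean, mean filled:    ", mean(mean_filled))
print("variance, observed:   ", variance(observed))
print("variance, mean filled:", variance(mean_filled))
```

The respondents in this example answer well above the midpoint, so filling with 3 drags the item mean down. Filling with the observed mean leaves the mean unchanged but shrinks the sample variance, since the filled values add no spread while the sample size grows.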

As a young statistics student, I am supposed to read the entire SPSS article and figure it out. However, that would be time consuming, and my main job is to assist students with their research questions, not with handling missing values.

In summary, having compared the advantages and drawbacks, I prefer the straightforward method: using the mean of the same person’s other responses, since it is both justifiable and simple. Although it is controversial, I think it is acceptable when the purpose is purely predictive.
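A rough sketch of this person-mean substitution in plain Python. It assumes all items belong to the same scale and point in the same direction (reverse-coded items already recoded). The function name and the `min_answered` cutoff are my own additions for illustration, not part of any cited method.

```python
from statistics import mean

def person_mean_impute(row, min_answered=2):
    """Fill a respondent's missing answers with the mean of their own answers.

    `min_answered` is a hypothetical safeguard: with fewer answered items
    than this, the row is returned unchanged instead of being filled.
    """
    answered = [v for v in row if v is not None]
    if len(answered) < min_answered:
        return list(row)
    fill = mean(answered)
    return [fill if v is None else v for v in row]

# Hypothetical survey: one row per respondent, four items on a 5-point scale.
survey = [
    [4, 5, None, 4],
    [1, None, 2, None],
    [None, None, None, 3],
]

imputed = [person_mean_impute(row) for row in survey]
for row in imputed:
    print(row)
```

The filled values are usually not whole numbers, so they fall between the scale points. Whether to round them depends on the analysis that follows. The third respondent answered only one item and is left untouched, since a single answer says little about the rest.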

References:
1. Colldata blog http://colldata.wordpress.com
2. PASW Missing Values 18.
3. Some other web articles.

About Lincoln

Welcome to Haolai(Lincoln)'s Website! I am now a doctoral student in Statistics at Western Michigan University, USA. Have fun here!

One Response to My Thought About handling Missing Values

  1. ChuiLing says:

    Hi Lincoln! Thank god I found you here!

    I'm a master's student. Honestly, I have very little knowledge of statistics. There is one big question in my head right now, regarding zero values.

    I am collecting secondary data in which 2 of the variables have > 15% zero values. These are true zeros, i.e. the company has no bank loan or no fixed assets, so they should not be considered “missing values”. But how should I deal with these zero values?

    I am now in the process of cleaning my data, and later I will analyze it using logistic regression. I really hope you can give me some guidance so that I can move on. Thanks a lot!
