Categorical Variables in R

Kate · May 2, 2022, 11:26pm

Hi All,
We have a dataset that comes yearly as a text file. For the past 10-15 years we have used SPSS syntax to read it in and to clean and code the variables. In SPSS a variable can have a numeric value with a character label attached to that value. So you could have something like:
low_birth_weight
1 (label “Yes”)
0 (label “No”)

This means that when you graph or make a table out of low_birth_weight, it will show up with the Yes and No labels, not the 1 and 0.

We are converting this syntax to R. For variables like this, would it be best to code them as factors? Or is there something I am missing, and should we code them as a different class?
Thanks,
Kate

mcewenkhundi · May 3, 2022, 7:15am

Hi Kate,

I agree with what you have said. I would code SPSS variables with value labels as factors in R.

Using your example, this would be: factor(low_birth_weight, levels = c(0, 1), labels = c("No", "Yes"))
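As a minimal sketch of that call in context, assuming a toy data frame named births with made-up values standing in for the real yearly file:

```r
# Toy data standing in for the real file (name and values are made up)
births <- data.frame(low_birth_weight = c(1, 0, 0, 1, 0))

# Recode the 0/1 column as a factor carrying the "No"/"Yes" labels
births$low_birth_weight <- factor(
  births$low_birth_weight,
  levels = c(0, 1),
  labels = c("No", "Yes")
)

# Tables and plots now report the labels instead of 0 and 1
table(births$low_birth_weight)

# The labels, in the order they will appear in output
levels(births$low_birth_weight)
```

One thing to keep in mind: a factor stores its own internal integer codes starting at 1, so the original 0/1 values are not kept inside the factor. If the raw codes matter, it may be worth keeping the numeric column alongside the factor.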

Another option is to import the SPSS .sav data directly into R using
dat_raw <- haven::read_sav("path/data.sav") and then use the labelled package to convert the value-labelled variables to factors. This blog post has a more detailed explanation of this approach: PIPING HOT DATA: Leveraging labelled data in R
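A rough sketch of the conversion step, assuming the haven and labelled packages are installed. To keep it runnable without a .sav file, it builds a small labelled vector by hand (made-up values) instead of calling read_sav; after a real import, the labelled columns should look similar:

```r
# Build a value-labelled vector like the ones read_sav() returns
# (made-up values; this stands in for a column of the imported data)
lbw <- haven::labelled(
  c(1, 0, 0, 1, 0),
  labels = c(No = 0, Yes = 1)
)

# Inspect the value labels that came with the variable
labelled::val_labels(lbw)

# Convert the labelled vector to an ordinary R factor
lbw_factor <- labelled::to_factor(lbw)
levels(lbw_factor)
```

Until the conversion, the labelled vector keeps both the 0/1 codes and their labels, which can be handy while cleaning.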

neale · May 6, 2022, 5:35pm

Hi Kate,
I agree with everything McEwen said about how you can use factors, or the {labelled} package.

I would just add that in casual use, most R users tend to attach more reader-friendly labels for plots and tables directly in the command that produces the plot or table (e.g. in the scale functions of a ggplot). Factors are most often used to specify the order of appearance in a graphic.
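For illustration, here is one way this could look in ggplot2, assuming a toy data frame named births (made-up values) where the variable is still stored as 0/1:

```r
library(ggplot2)

# Toy data (made up); the variable is still numeric 0/1 here
births <- data.frame(low_birth_weight = c(1, 0, 0, 1, 0))

# Treat the codes as discrete and attach the display labels in the scale,
# leaving the underlying data untouched
p <- ggplot(births, aes(x = factor(low_birth_weight))) +
  geom_bar() +
  scale_x_discrete(
    name = "Low birth weight",
    labels = c("0" = "No", "1" = "Yes")
  )

p
```

The labels here live only in the plot, so anyone reading the data itself still sees 0 and 1.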

Kate · May 6, 2022, 5:50pm

Thanks McEwen and Neale. This is helpful.
What we want is for future users to understand what the underlying 1 or 0 means for the variable, so I think factors are the way to go.
And yes, we will also want to arrange how they appear in graphs.
-Kate