5 min read

Contrasts in R

In this post I discuss how to create custom contrasts for factor variables in R. First lets create some simulated data. Create the data, and factor Disease status:


Disease <- c(rep("RA", 5), rep("SLE", 5), rep("Scleroderma", 5), 
             rep("Myositis", 5), rep("Control", 5))
set.seed(1234)
sex <-  rbinom(25,1, 0.5)
age <-  rnorm(25, 40, 5)
y <- rnorm(25, 0.5, 0.12)
data <- data.frame(y,sex,age,Disease=factor(Disease))
str(data)

## 'data.frame':	25 obs. of  4 variables:
##  $ y      : num  0.506 0.323 0.552 0.492 0.513 ...
##  $ sex    : int  0 1 1 1 1 1 0 0 1 1 ...
##  $ age    : num  44.4 46.9 31.6 36.9 40.1 ...
##  $ Disease: Factor w/ 5 levels "Control","Myositis",..: 3 3 3 3 3 5 5 5 5 5 ...

We want the following contrasts:

  • Control versus all 4 diseases combined
  • RA versus the combination of (SLE, Scleroderma, Myositis), leaving out the Controls

Default settings

Let $x_1,x_2,x_3,x_4$ be the indicators for Myositis, RA, Scleroderma and SLE, respectively. The standard linear model R will fit is given by (for simplicity I am ignoring age and sex, but it won’t make a difference when you add them in the model): $$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4$$


summary(fit <- lm(y ~ Disease, data=data))

## 
## Call:
## lm(formula = y ~ Disease, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18718 -0.05324  0.02859  0.07511  0.12395 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         0.60943    0.04756  12.814 4.22e-11 ***
## DiseaseMyositis    -0.24835    0.06726  -3.692  0.00144 ** 
## DiseaseRA          -0.13225    0.06726  -1.966  0.06330 .  
## DiseaseScleroderma -0.15540    0.06726  -2.310  0.03165 *  
## DiseaseSLE         -0.05650    0.06726  -0.840  0.41081    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1063 on 20 degrees of freedom
## Multiple R-squared:  0.4452,	Adjusted R-squared:  0.3342 
## F-statistic: 4.012 on 4 and 20 DF,  p-value: 0.01507

This is the default contrast matrix with unordered factor variables:


contrasts(data$Disease)
Myositis RA Scleroderma SLE
Control 0 0 0 0
Myositis 1 0 0 0
RA 0 1 0 0
Scleroderma 0 0 1 0
SLE 0 0 0 1

This compares the mean of the response for the Controls to the mean of the response for Myositis, RA, Scleroderma, and SLE separately. The table can be read by column, and the numbers in the columns represent the weight of the regression coefficient, e.g. in the first column Myositis is being compare to Control.

Custom Contrats

Since we want only two contrasts, we want R to fit the following model: $$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2$$

where $\beta_1$ represents the contrast estimate for the comparison between controls and all other diseases, and $\beta_2$ represents the contrast estimate of RA versus the combination of SLE, Scleroderma, Myositis.

To create custom contrasts, we must specify the contrast matrix as follows:

Control_vs_All RA_vs_Myos_Scle_SLE
Control 0.8 0.0000000
Myositis -0.2 -0.3333333
RA -0.2 1.0000000
Scleroderma -0.2 -0.3333333
SLE -0.2 -0.3333333

Again we look at the above table, column by column. The variables we want to contrast should have opposite signs and the columns should sum to 0. This contrast matrix leads to the following mean response equations for each of the groups:

$$\begin{align} \mu_{control} & = \beta_0 + 0.8 \beta_1\\ \mu_{myos} & = \beta_0 - 0.2 \beta_1 - \frac{1}{3} \beta_2 \\ \mu_{ra} & = \beta_0 - 0.2 \beta_1 + \beta_2 \\ \mu_{scler} & = \beta_0 - 0.2 \beta_1 - \frac{1}{3} \beta_2 \\ \mu_{sle} & = \beta_0 - 0.2 \beta_1 - \frac{1}{3} \beta_2 \\ \end{align}$$

To solve for $\beta_0$ we can add up all the equations to get

$$ \begin{align} \mu_{control}+\mu_{myos}+\mu_{ra}+\mu_{scler}+\mu_{sle} & = 5 \beta_0 \\ \beta_0 & = \frac{\mu_{control}+\mu_{myos}+mu_{ra}+\mu_{scler}+\mu_{sle}}{5} \end{align} $$

To solve for $\beta_1$ we substract $\mu_{control}$ from the combined mean of $\mu_{myos},\mu_{ra},\mu_{scler}$ and $\mu_{sle}$ which gives:

$$ \begin{align} \mu_{control}-\frac{\mu_{myos}+\mu_{ra}+\mu_{scler}+\mu_{sle}}{4} & = \beta_0 + 0.8 \beta_1 - \frac{4\beta_0 -0.8\beta_1}{4}\\ & = \beta_0 + 0.8 \beta_1 - \beta_0 + 0.2 \beta_1 \\ & = \beta_1 \end{align} $$

To solve for $\beta_2$ we substract $\mu_{ra}$ from the combined mean of $\mu_{myos},\mu_{scler}$ and $\mu_{sle}$ which gives:

$$ \begin{align} \mu_{ra}-\frac{\mu_{myos}+\mu_{scler}+\mu_{sle}}{3} & = \beta_0 - 0.2 \beta_1 + \beta_2 - \frac{3\beta_0 -0.6\beta_1 - \beta_2}{3}\\ & = \beta_0 - 0.2 \beta_1 + \beta_2 - \beta_0 + 0.2 \beta_1 +\frac{1}{3}\beta_2 \\ & = \frac{4}{3} \beta_2 \\ \beta_2 & = \frac{3}{4} \left( \mu_{ra}-\frac{\mu_{myos}+\mu_{scler}+\mu_{sle}}{3} \right) \end{align} $$

First we create the contrast matrix with appropriate row and column names for clarity:


my.contr <- matrix(c( 4/5, -1/5, -1/5, -1/5, -1/5,
                 0,-1/3,1,-1/3,-1/3),
              ncol = 2, dimnames = list(c("Control", "Myositis", "RA","Scleroderma","SLE"),
    c("Control_vs_All","RA_vs_Myos_Scle_SLE")))

Then we store the contrasts attribute to the Disease variable. The how.many argument specifies how many contrasts we want, therefore this should correspond to the number of columns in the contrast matrix.


contrasts(data$Disease,how.many=2) <- my.contr
contrasts(data$Disease)

##             Control_vs_All RA_vs_Myos_Scle_SLE
## Control                0.8           0.0000000
## Myositis              -0.2          -0.3333333
## RA                    -0.2           1.0000000
## Scleroderma           -0.2          -0.3333333
## SLE                   -0.2          -0.3333333

summary(fit <- lm(y ~ Disease, data=data))

## 
## Call:
## lm(formula = y ~ Disease, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.18778 -0.09120  0.01436  0.06365  0.19806 
## 
## Coefficients:
##                            Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                 0.49093    0.02405  20.410 8.72e-16 ***
## DiseaseControl_vs_All       0.14812    0.06013   2.463   0.0221 *  
## DiseaseRA_vs_Myos_Scle_SLE  0.01587    0.04658   0.341   0.7365    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1203 on 22 degrees of freedom
## Multiple R-squared:  0.2194,	Adjusted R-squared:  0.1484 
## F-statistic: 3.092 on 2 and 22 DF,  p-value: 0.06557

Here we check to make sure that the lm fit is giving the same result as the formulas derived above:


#' group level means
mu.control <- mean(data[which(data$Disease=="Control"),"y"])
mu.myos <- mean(data[which(data$Disease=="Myositis"),"y"])
mu.ra <- mean(data[which(data$Disease=="RA"),"y"])
mu.scler <- mean(data[which(data$Disease=="Scleroderma"),"y"])
mu.sle <- mean(data[which(data$Disease=="SLE"),"y"])

#' beta0
mean(c(mu.control,mu.myos,mu.ra,mu.scler,mu.sle))

## [1] 0.4909311

#' beta1
mu.control - mean(c(mu.myos,mu.ra,mu.scler,mu.sle))

## [1] 0.1481237

#' beta2
(mu.ra - mean(c(mu.myos,mu.scler,mu.sle)))*(3/4)

## [1] 0.01587466

References