Command Line Arguments in R

In this post, I explain how I sometimes conduct simulation studies on a compute cluster. In this example, I simulate a toy dataset, fit a linear mixed model, and collect the resulting regression coefficients.

There are three main objectives which motivated this example:

I want to run many replications for a given simulation scenario while leveraging the many CPUs available on a compute cluster
I want to alter the simulation parameters with minimal effort
I want to collect the results from each simulation scenario into a single data.frame

In this example, I alter both the sample size and the noise variance. Each take 3 different values for a total combination of 9 simulation scenarios. We want to be able to run each of these 9 scenarios as a different job (i.e. not in the same R session) so that they may run in parallel. The different combinations of simulation parameters will be stored in a data.frame. They are referenced by a command line argument (outside of R, at the job submission command), e.g.:

for i in {1..9..1} ; do qsub -v index=$i mclapplyExample.sh ; done

This is a simple loop which passes the value of i to the bash script mclapplyExample.sh which subsequently passes it to the R script mclapplyExample2.R. Below I provide the three scripts required for this example.

The main simulation script

The following script generates a data.frame of all the simulation parameters. It then reads the command line argument which corresponds to the simulation scenario. The rest of the script runs the jobs using mcmapply. i saved the following script to an R file called mclappyExample2.R. Carefully read the notes at the top of the script.

##################################
# R source code file for conducting simulations using the parallel package with 
# command line arguments. The main idea is to define a data.frame of simulation 
# parameters, and then use command line arguments to reference a particular 
# combination of simulation parameters
# Created: July 27, 2019
# Updated: July 29, 2019
# Notes: 
# qsub command to cycle through all 9 combinations of simulation parameters:
# for i in {1..9..1} ; do qsub -v index=$i mclapplyExample2.sh ; done
# use mclapplyExample2.sh to submit to compute cluster
# use collectResults.R to combine results in one data.frame
##################################


# load libraries ----------------------------------------------------------

library(parallel)
library(lme4)
#detectCores()


# Define a data.frame of simulation parameters ----------------------------

parametersDf <- base::expand.grid(sampleSize = c(200, 500, 1000),
                                  noiseSD = c(1,2,3),
                                  replications = 10,
                                  stringsAsFactors = FALSE)


# get simulation scenario index from command line args --------------------

parameterIndex <- base::as.numeric(base::as.character(base::commandArgs(trailingOnly = T)[1]))

# select simulation parameters based on command line argument
simulationParameters <- parametersDf[parameterIndex,, drop = F]
replications <- 1:simulationParameters[["replications"]]
sampleSize <- simulationParameters[["sampleSize"]]
noiseSD <- simulationParameters[["noiseSD"]]

# Simulation function definition ------------------------------------------

f <- function(i, n, sd, scenario = parameterIndex) {

  # simulate data
  x <- stats::runif(n*5)
  z <- stats::rbinom(n*5,1,x)
  ai <- base::rep(stats::rnorm(n), each = 5)
  
  # generate response with noise
  y <- ai + 1 + x + z + stats::rnorm(n*5, sd = sd)
  
  # generate cluster ID
  id <- base::rep(1:n, each = 5)

  # fit linear mixed model and get summary
  fit <- lme4::lmer(y ~ x + z + (1 | id))
  sfit <- summary(fit)
  
  
  # return list of simulation parameters and lmm coefficients
  return(list(scenario = scenario,
              replicate = i, 
              sampleSize = n, 
              noiseSD = sd, 
              b0 = sfit$coefficients[1,1], 
              bx = sfit$coefficients[2,1],
              bz = sfit$coefficients[3,1]))
}



# Execute simulation ------------------------------------------------------

MyOutput <- parallel::mcmapply(FUN = f, 
                               i = replications,
                               MoreArgs = list(n = sampleSize,
                                               sd = noiseSD),
                               SIMPLIFY = FALSE)


# Collect results in a matrix ---------------------------------------------

results <- base::do.call(base::rbind, MyOutput)


# Write results to file ---------------------------------------------------

# create filename with random pattern extension. assumes you have a directory called results
# change tmpdir to the directory where you want to store results
# depending on system, the PBS_O_WORKDIR might be set and can be used via
# paste(Sys.getenv("PBS_O_WORKDIR"),"results", sep="/")
filename <- base::tempfile(pattern = base::sprintf("scenario_%1.0f_sampleSize_%1.0f_noiseSD_%1.0f_",
                                                   parameterIndex, 
                                                   sampleSize, 
                                                   noiseSD),
                           tmpdir = "results",
                           fileext = ".txt")

utils::write.table(results,
                   file = filename,
                   quote = FALSE,
                   row.names = FALSE,
                   col.names = FALSE)

The bash script

This is the bash script which I saved as mclappyExample2.sh.

#!/bin/bash
#PBS -l walltime=1:30:00
#PBS -o log/
#PBS -e log/
#PBS -N sim1
#PBS -m ea
#PBS -l mem=12G
#PBS -l vmem=12G

cd $PBS_O_WORKDIR

SRC=$PBS_O_WORKDIR

Rscript ${SRC}/mclappyExample2.R ${index}

Collect Results

The following is simply to collect all results and merge into a single data.frame. Carefully read the notes at the top of the script.

##################################
# R source code file for gathering simulation results into one dataset
# Created: July 27, 2019
# Updated: July 29, 2019
# Notes: 
# This script is to be used only once mclapplyExample2.sh has been executed
# Assumes results are stored in a directory called results. 
# Change the path argument in list.files function accordingly
##################################

# To read in all the results and combine into 1 data.frame ----------------

# list all files starting with the word scenario in the results directory
# include the full path 
files <- base::list.files(path = "results", 
                          pattern = "scenario",
                          full.names = TRUE)

# read in all the results into a list
resultsList <- base::lapply(files, utils::read.table)


# combine results into a single data.frame
resultsDF <- base::do.call(base::rbind, resultsList)
colnames(resultsDF) <- c("scenario","replicate", "sampleSize", "noiseSD", "b0", "bx", "bz")