Outline

This step-by-step guide should help to quickly generate (~30 minutes) the UK Biobank probability weights using the prediction model derived as part of the paper Participation bias in the UK Biobank distorts genetic associations and downstream analyses.

To derive the model, we combined the harmonized UKBB data with the data from the reference sample (HSE) and then applied LASSO regression in glmnet, to predict UKBB participation (with UKBB=1; HSE=0). This tutorial illustrates how the glmnet LASSO regression output cvfit.rda can be used to derive the UK Biobank participation probabilities without the need to access external reference data (e.g., HSE, Census data etc).

The only required data input for the sampling weights is the raw UKBB data file, containing all the variable IDs listed in obtainWeights.xlsx. The code shown below can then be used to download the LASSO regression output (i.e., the glmnet object cvfit.rda, plus additional metadata and functions), prepare the raw UKBB data (so the variable coding matches the coding of the variables included in the glmnet model), and select the UKBB participants (e.g., individuals from Scotland are excluded since HSE data is only collected in England). Going through the pre-processing and data handling steps is important as the individual participation probabilities can only be correctly obtained if the input data is formatted as the data used to curate the prediction model in the paper.

This tutorial uses the R-script obtainWeights.R, which can be accessed and downloaded via the GitHub repository.

The only part that needs to be modified in the script is the raw UKBB data input file (cf., section “Read in UK Biobank data”).

Install packages

# load (and install libraries)
library('fastDummies')
library("glmnet")
library('data.table')
library('plyr’)

Downlaod data from study repository

# download and import functions
system("wget https://raw.github.com/TabeaSchoeler/TS2021_UKBBweighting/main/tutorial/functionsWeights.R")
source("functionsWeights.R")

# Download list with variable names
system("wget https://raw.github.com/TabeaSchoeler/TS2021_UKBBweighting/main/data/obtainWeights.xlsx")
varsAll <- as.data.frame(readxl::read_excel("obtainWeights.xlsx", na="NA"))
varsLabel = subset(varsAll, is.na(varsAll$ID)==FALSE)

Read in UK Biobank data

Replace the ‘ukbXXXXXXX.csv’ file with your own raw UKBB data file that contains all the variable IDs needed to obtain the predicted probabilities.

# Read in raw UKBB data
UKBBall=fread("ukbXXXXXXX.csv")

Pre-processing of the data

# Select and rename relevant variables
UKBBs=subset(UKBBall, select = c("eid", paste0(varsLabel$ID, "-0.0")))
colnames(UKBBs) = c("eid", varsLabel$label)

# recode variables
UKBBrL=lapply(varsAll$label, function(x) recodeVar(df=UKBBs, var=x, varInfo=varsAll))
UKBBr=Reduce(function(x,y) dplyr::full_join(x = x, y = y, by = "eid", suffix=c("", "") ),  UKBBrL)

# Apply age restriction
UKBBe1=subset(UKBBr, age >=40 & age  <=69)

# Exclude individuals outside England
UKBBe2=subset(UKBBe1, assessment_center != 11004 & assessment_center != 11005 & assessment_center !=  11003 & assessment_center !=  11022 & assessment_center !=  11023) 
# 11004: Glasgow, 11005: Edinburgh , 11003: Cardiff, 11022: Swansea,  11023: Wrexham 

# Include white ethnicity
# | 1: White | 1001: British | 1002: Irish | 1003: Any other white background 
UKBB=subset(UKBBe2, ethnic_background == 1 | ethnic_background == 1001 | ethnic_background == 1002 | ethnic_background == 1003)

Generate the sampling weights using the glmnet object


# Download and load glmnet output
system("wget https://raw.github.com/TabeaSchoeler/TS2021_UKBBweighting/main/data/cvfit.rda")
load("cvfit.rda")

# Create dummy variables
vars = subset(varsAll, includePrediction=="yes")
UKBBd=createDummy(df=UKBB, varInc=vars)
#UKBBd=sample_n(UKBBd, size=5000)
UKBBc=UKBBd[complete.cases(UKBBd),]

# Download variable names for glmnet model
system("wget https://raw.github.com/TabeaSchoeler/TS2021_UKBBweighting/main/data/varsModel.rds")
varPred=readRDS("varsModel.rds")

# include all possible two-way interaction terms among the dummy and continuous variables
f <- as.formula( ~ .*.)
xVal <- model.matrix(f, subset(UKBBc, select = varPred))

# get predicted probabilities and sampling weights
UKBBc$probs <- predict(cvfit, newx=xVal, s="lambda.min", type = "response")[,1]
UKBBc$IPSW=(1-UKBBc$probs)/UKBBc$probs # inverse probability weights
UKBBc$IPSWnorm=UKBBc$IPSW/mean(UKBBc$IPSW) # normalized inverse probability weights

head(subset(UKBBc, select=c(eid, probs, IPSW)))

      eid     probs       IPSW
4 10000XX 0.8639087 0.15752975
5 10000XX 0.9265383 0.07928624
6 10000XX 0.9678410 0.03322759
7 10000XX 0.9723823 0.02840206
8 10000XX 0.9718159 0.02900151
9 10000XX 0.9694545 0.03150794

Any feedback?

Please reach out if there are any questions or comments!

Generation of UKBB sampling weights

Table of Contents