Weighted genome-wide analysis

Table of Contents

Project setup



Project folder

  • Create project folder and set up the folder structure as follows:
    # Define project directory
    PFOLDER="your/project"

    # Create subfolders
    mkdir $PFOLDER/programs
    mkdir $PFOLDER/data
    mkdir $PFOLDER/output
  • Folder structure:
PFOLDER/
   | wGWA.R
   ├── data/
   │   └── GenData.bed
   │   └── GenData.bim
   │   └── GenData.fam
   │   └── phenoDat
   ├── output/
   ├── programs/
       └── ldak


Download the toy data

  1. Download here

  2. store data in $PFOLDER


Download LDAK

  1. Download here

  2. Rename and store LDAK as follows: $PFOLDER/programs/ldak


Dowload the R script to perform all analyses

  1. Download here

  2. Store R-script as follows: $PFOLDER/wGWA.R



Application of sampling weights to phenotypic data

  • Load R-packages and define working directory
if (!require("hudson")) install.packages("hudson")
if (!require("data.table")) install.packages("data.table")
PFOLDER="your/project"
  • Load functions
# =========== FUNCTIONS ==========================
# get weighted mean
meanW=function(df, pheno){
  weights=df$weights
  variable=df[[pheno]]
  meanW=sum(variable*weights)/sum(weights)
  mean=mean(variable, na.rm=T)
  return(data.frame(pheno=pheno, meanS=round(mean,1), meanW=round(meanW,1)))
}


# get weighted and standard correlations
get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

getWcor=function(df, pheno){
  weighted_corr <- cov.wt(subset(df, select=c(pheno)), wt = df$weights, cor = TRUE)
  corr_matrix <- weighted_corr$cor
  cor_w_tri <- get_lower_tri(corr_matrix)
  corMelt_w=melt(cor_w_tri)
  corOutW=subset(corMelt_w, Var1!=Var2 & is.na(value)==F )
  
  cor_sample=cor(subset(df, select=c(pheno)))
  cor_tri <- get_lower_tri(cor_sample)
  corMelt=melt(cor_tri)
  corOut=subset(corMelt, Var1!=Var2 & is.na(value)==F )
  
  corDF=data.frame(cor=paste0(corOut$Var1, "~", corOut$Var2), corS=corOut$value, corW=corOutW$value)
  return(corDF)
}



Phenotype analyses

Read in phenotype data

dat=fread(paste0(PFOLDER, "/data/PhenoDat"))
head(dat)
FID IID age bmi education weights
FAM0001 IID0001 56.42 25.91 17.60 0.51
FAM0001 IID0002 50.17 21.66 20.16 0.10
FAM0001 IID0003 41.63 38.89 14.57 20.44
FAM0001 IID0004 55.46 29.34 19.52 0.30
  • The sampling weight are stored in the column dat$weights

  • The sampling weights were designed to bring the non-representative (simulated) sample in line with a (simulated) representative reference sample. We have the following means in the simulated datasets:

Phenotype Mean in reference sample Mean in non-representative sample
age 53.0 57.0
bmi 28.0 26.5
education 16.5 17.5
  • The estimated weights were re-scaled so that they sum up to the total sample, where a re-scaled weight >1 means that an individual carries more weight in the weighted population than in the original population, and a re-scaled weight <1 means that an individual carries less weight.


Spread of the sampling weights

summary(dat$weights)
Sampling weights
mean min max
1 0.01 36.19


The effective sample size of the weighted sample

The effective sample size (nEFF) for the weighed sample is calculated using the formula:

\[\textrm{nEFF} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}\]

nEFF=round((sum(dat$weights)^2)/sum(dat$weights^2),0)
Sample size
Total sample size: 6000
Effective sample size: 1684
  • Weighting will always lead to a reduction in the effective sample size (nEff), provided there is variability among sampling weights. The degree of sample size reduction is a function of the variability among sampling weights – the larger the spread, the larger the reduction in nEff.
  • For example, consider a sample of n=2 where individual one has a normalized weight of wn1=0.5 (i.e., is < 1, so will be down-weighted) and individual two has a normalized weight of wn2=1.5 (i.e., is > 1, so will be up-weighted). Applying the formula used for effective sample size estimation [ sum(wn)^2/sum(wn^2)] gives us an effective sample size of nEff=1.6, highlighting that weighting led to a 20% reduction in sample size. Sampling weights with a lower spread would result in a larger effective sample size [e.g.,wn1=1.1; wn2=0.9; nEff=1.98] and sampling weights with no variability will recover the original sample size of n=2 [e.g.,wn1=1.1; wn2=1; nEff=2].


Estimate weighted mean

\[Weighted Mean = \frac{1}{N}\sum_{i=1}^{N}w_i x_i\]


pheno=c("age", "bmi", "education")
outWL=lapply(pheno, function(x) meanW(df=dat, pheno=x))
outW=do.call(rbind, outWL)
Phenotye mean (non-representative sample) mean (weighted sample) mean (reference sample)
age 56.9 53.1 53.0
bmi 26.3 28.3 28.0
education 17.5 16.2 16.5


Estimate weighted correlations

corW=getWcor(df=dat, pheno=pheno)
Correlation (non-representive sample) Correlation (weighted sample) Correlation (reference sample)
bmi~age 0.1 0.24 0.3
education~age 0.1 0.24 0.3
education~bmi 0.1 0.17 0.3

The following correlations were used to simulate the data:

  • Correlations among all phenotypes in the non-representative sample: \(r=0.1\)

  • Correlations among all phenotypes in the reference sample: \(r=0.3\)



Genotype analyses

Weighted genome-wide analyses


Prepare data for LDAK

Phenotype data

# prepare phenotype data
write.table(subset(dat, select=c(FID, IID, bmi)),
            file= paste0(PFOLDER, "/output/phenoLDAK" ),
            sep="\t",
            row.names = F,
            col.names=F,
            quote=F)

print(head(subset(dat, select=c(FID, IID, bmi)), n=3))
##        FID     IID      bmi
## 1: FAM0001 IID0001 25.91042
## 2: FAM0001 IID0002 21.66125
## 3: FAM0001 IID0003 38.89009

Sampling weights

write.table(subset(dat, select=c(FID, IID, weights)),
            file= paste0(PFOLDER, "/output/weightsLDAK" ),
            sep="\t",
            row.names = F,
            col.names=F,
            quote=F)

print(head(subset(dat, select=c(FID, IID, weights)), n=3))
##        FID     IID     weights
## 1: FAM0001 IID0001  0.50752976
## 2: FAM0001 IID0002  0.09965894
## 3: FAM0001 IID0003 20.44120488


Run standard GWA in LDAK

  • A detailed documentation is available on the LDAK website

  • Parameters standard linear regression:

    • --linear <outfile>: perform linear regression
    • --pheno <phenofile>: to specify the phenotypes (in PLINK format)
    • --covar <covarfile>: to provide covariates (in PLINK format)
    • --bfile: to specify the genetic data files
# give execute permissions
system(paste0("chmod a+x ",PFOLDER, "/programs/ldak"))
# run linear regression
start_time <- Sys.time()
system(paste0(PFOLDER, "/programs/ldak --linear ",  PFOLDER, "/output/bmi --pheno ", PFOLDER, "/output/phenoLDAK --bfile ",PFOLDER, "/data/GenData"))
end_time <- Sys.time()
## [1] "Running time standard GWAS: 1.5 seconds"

Results

  • The main results are saved in bmi.assoc
  • The a summary version of the results is saved in bmi.summaries
  • The p-values are saved in bmi.pvalues


Run weighted GWA in LDAK

  • To run weighted linear regression, set:

    • --sandwich YES: use sandwich estimator to get the standard errors (Huber-White Sandwich Estimator)
    • --sample-weights <sampleweightfile>: perform weighted linear regression. The <sampleweightfile> should have three columns, where each row provides two sample IDs followed by the sampling weights
# run LDAK
start_time <- Sys.time()
system(paste0(PFOLDER, "/programs/ldak --linear ",  PFOLDER, "/output/bmiW --pheno ", PFOLDER, "/output/phenoLDAK --sandwich YES --sample-weights ", PFOLDER, "/output/weightsLDAK --bfile ",PFOLDER, "/data/GenData"))
end_time <- Sys.time()
## [1] "Running time weighted GWAS: 1.78 seconds"

Results

  • The main results are saved in bmiW.assoc
  • The a summary version of the results is saved in bmiW.summaries
  • The p-values are saved in bmiW.pvalues


Process LDAK output

list.files(paste0(PFOLDER, "/output"))
##  [1] "bmi.assoc"      "bmi.coeff"      "bmi.progress"   "bmi.pvalues"   
##  [5] "bmi.score"      "bmi.summaries"  "bmiW.assoc"     "bmiW.coeff"    
##  [9] "bmiW.progress"  "bmiW.pvalues"   "bmiW.score"     "bmiW.summaries"
## [13] "phenoLDAK"      "weightsLDAK"
# Weighted GWA
effectW=fread(paste0(PFOLDER, "/output/bmiW.assoc"), header=T)
gwaW=data.frame(SNP=effectW$Predictor, CHR=effectW$Chromosome, POS=effectW$Basepair, BETA=effectW$Effect, SE=effectW$SD)
pvalw=fread(paste0(PFOLDER, "/output/bmiW.pvalues"), header=T)
gwaW$pvalue=pvalw$P
Weighted estimates
SNP CHR POS BETA SE pvalue
rs0001 1 10090 0.00 0.19 0.98
rs0002 1 10133 0.01 0.19 0.97
rs0003 1 10247 -0.22 0.33 0.51
effect=fread(paste0(PFOLDER, "/output/bmi.assoc"), header=T)
gwa=data.frame(SNP=effect$Predictor, CHR=effect$Chromosome, POS=effect$Basepair, BETA=effect$Effect, SE=effect$SD)
pval=fread(paste0(PFOLDER, "/output/bmi.pvalues"), header=T)
gwa$pvalue=pval$P
Standard estimates
SNP CHR POS BETA SE pvalue
rs0001 1 10090 0.09 0.09 0.35
rs0002 1 10133 0.09 0.09 0.35
rs0003 1 10247 -0.14 0.14 0.32


Plot the results

setwd(paste0(PFOLDER))
manH=gmirror(top=subset(gwa, select=c(SNP, CHR, POS, pvalue)), 
             bottom=subset(gwaW, select=c(SNP, CHR, POS, pvalue)), tline=0.05, bline=0.05, 
             toptitle="Standard GWAS", 
             bottomtitle = "Weighted GWAS", 
             highlight_p = c(0.05,0.05),
             highlighter="darkviolet",
             file="/output/ManH")
## [1] "Saving plot to /output/ManH.png"