Project setup
Project folder
- Create project folder and set up the folder structure as follows:
# Define project directory
PFOLDER="your/project"
# Create subfolders
mkdir $PFOLDER/programs
mkdir $PFOLDER/data
mkdir $PFOLDER/output- Folder structure:
PFOLDER/
| wGWA.R
├── data/
│ └── GenData.bed
│ └── GenData.bim
│ └── GenData.fam
│ └── phenoDat
├── output/
├── programs/
└── ldakDownload the toy data
store data in
$PFOLDER
Download LDAK
Rename and store LDAK as follows:
$PFOLDER/programs/ldak
Dowload the R script to perform all analyses
Store R-script as follows:
$PFOLDER/wGWA.R
Application of sampling weights to phenotypic data
- Load R-packages and define working directory
if (!require("hudson")) install.packages("hudson")
if (!require("data.table")) install.packages("data.table")
PFOLDER="your/project"- Load functions
# =========== FUNCTIONS ==========================
# get weighted mean
meanW=function(df, pheno){
weights=df$weights
variable=df[[pheno]]
meanW=sum(variable*weights)/sum(weights)
mean=mean(variable, na.rm=T)
return(data.frame(pheno=pheno, meanS=round(mean,1), meanW=round(meanW,1)))
}
# get weighted and standard correlations
get_lower_tri<-function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
getWcor=function(df, pheno){
weighted_corr <- cov.wt(subset(df, select=c(pheno)), wt = df$weights, cor = TRUE)
corr_matrix <- weighted_corr$cor
cor_w_tri <- get_lower_tri(corr_matrix)
corMelt_w=melt(cor_w_tri)
corOutW=subset(corMelt_w, Var1!=Var2 & is.na(value)==F )
cor_sample=cor(subset(df, select=c(pheno)))
cor_tri <- get_lower_tri(cor_sample)
corMelt=melt(cor_tri)
corOut=subset(corMelt, Var1!=Var2 & is.na(value)==F )
corDF=data.frame(cor=paste0(corOut$Var1, "~", corOut$Var2), corS=corOut$value, corW=corOutW$value)
return(corDF)
}Phenotype analyses
Read in phenotype data
dat=fread(paste0(PFOLDER, "/data/PhenoDat"))
head(dat)| FID | IID | age | bmi | education | weights |
|---|---|---|---|---|---|
| FAM0001 | IID0001 | 56.42 | 25.91 | 17.60 | 0.51 |
| FAM0001 | IID0002 | 50.17 | 21.66 | 20.16 | 0.10 |
| FAM0001 | IID0003 | 41.63 | 38.89 | 14.57 | 20.44 |
| FAM0001 | IID0004 | 55.46 | 29.34 | 19.52 | 0.30 |
The sampling weight are stored in the column
dat$weightsThe sampling weights were designed to bring the non-representative (simulated) sample in line with a (simulated) representative reference sample. We have the following means in the simulated datasets:
| Phenotype | Mean in reference sample | Mean in non-representative sample |
|---|---|---|
| age | 53.0 | 57.0 |
| bmi | 28.0 | 26.5 |
| education | 16.5 | 17.5 |
- The estimated weights were re-scaled so that they sum up to the total sample, where a re-scaled weight >1 means that an individual carries more weight in the weighted population than in the original population, and a re-scaled weight <1 means that an individual carries less weight.
Spread of the sampling weights
summary(dat$weights)| mean | min | max |
|---|---|---|
| 1 | 0.01 | 36.19 |
The effective sample size of the weighted sample
The effective sample size (nEFF) for the weighed sample is calculated using the formula:
\[\textrm{nEFF} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}\]
nEFF=round((sum(dat$weights)^2)/sum(dat$weights^2),0)| Sample size |
|---|
| Total sample size: 6000 |
| Effective sample size: 1684 |
- Weighting will always lead to a reduction in the effective sample size (nEff), provided there is variability among sampling weights. The degree of sample size reduction is a function of the variability among sampling weights – the larger the spread, the larger the reduction in nEff.
- For example, consider a sample of n=2 where individual one has a
normalized weight of wn1=0.5 (i.e., is < 1, so will be down-weighted)
and individual two has a normalized weight of wn2=1.5 (i.e., is > 1,
so will be up-weighted). Applying the formula used for effective sample
size estimation [
sum(wn)^2/sum(wn^2)] gives us an effective sample size of nEff=1.6, highlighting that weighting led to a 20% reduction in sample size. Sampling weights with a lower spread would result in a larger effective sample size [e.g.,wn1=1.1; wn2=0.9; nEff=1.98] and sampling weights with no variability will recover the original sample size of n=2 [e.g.,wn1=1.1; wn2=1; nEff=2].
Estimate weighted mean
\[Weighted Mean = \frac{1}{N}\sum_{i=1}^{N}w_i x_i\]
pheno=c("age", "bmi", "education")
outWL=lapply(pheno, function(x) meanW(df=dat, pheno=x))
outW=do.call(rbind, outWL)| Phenotye | mean (non-representative sample) | mean (weighted sample) | mean (reference sample) |
|---|---|---|---|
| age | 56.9 | 53.1 | 53.0 |
| bmi | 26.3 | 28.3 | 28.0 |
| education | 17.5 | 16.2 | 16.5 |
Estimate weighted correlations
corW=getWcor(df=dat, pheno=pheno)| Correlation (non-representive sample) | Correlation (weighted sample) | Correlation (reference sample) | |
|---|---|---|---|
| bmi~age | 0.1 | 0.24 | 0.3 |
| education~age | 0.1 | 0.24 | 0.3 |
| education~bmi | 0.1 | 0.17 | 0.3 |
The following correlations were used to simulate the data:
Correlations among all phenotypes in the non-representative sample: \(r=0.1\)
Correlations among all phenotypes in the reference sample: \(r=0.3\)
Genotype analyses
Weighted genome-wide analyses
Prepare data for LDAK
Phenotype data
# prepare phenotype data
write.table(subset(dat, select=c(FID, IID, bmi)),
file= paste0(PFOLDER, "/output/phenoLDAK" ),
sep="\t",
row.names = F,
col.names=F,
quote=F)
print(head(subset(dat, select=c(FID, IID, bmi)), n=3))## FID IID bmi
## 1: FAM0001 IID0001 25.91042
## 2: FAM0001 IID0002 21.66125
## 3: FAM0001 IID0003 38.89009
Sampling weights
write.table(subset(dat, select=c(FID, IID, weights)),
file= paste0(PFOLDER, "/output/weightsLDAK" ),
sep="\t",
row.names = F,
col.names=F,
quote=F)
print(head(subset(dat, select=c(FID, IID, weights)), n=3))## FID IID weights
## 1: FAM0001 IID0001 0.50752976
## 2: FAM0001 IID0002 0.09965894
## 3: FAM0001 IID0003 20.44120488
Run standard GWA in LDAK
A detailed documentation is available on the LDAK website
Parameters standard linear regression:
--linear <outfile>: perform linear regression--pheno <phenofile>: to specify the phenotypes (in PLINK format)--covar <covarfile>: to provide covariates (in PLINK format)--bfile: to specify the genetic data files
# give execute permissions
system(paste0("chmod a+x ",PFOLDER, "/programs/ldak"))
# run linear regression
start_time <- Sys.time()
system(paste0(PFOLDER, "/programs/ldak --linear ", PFOLDER, "/output/bmi --pheno ", PFOLDER, "/output/phenoLDAK --bfile ",PFOLDER, "/data/GenData"))
end_time <- Sys.time()## [1] "Running time standard GWAS: 1.5 seconds"
Results
- The main results are saved in
bmi.assoc - The a summary version of the results is saved in
bmi.summaries - The p-values are saved in
bmi.pvalues
Run weighted GWA in LDAK
To run weighted linear regression, set:
--sandwich YES: use sandwich estimator to get the standard errors (Huber-White Sandwich Estimator)--sample-weights <sampleweightfile>: perform weighted linear regression. The<sampleweightfile>should have three columns, where each row provides two sample IDs followed by the sampling weights
# run LDAK
start_time <- Sys.time()
system(paste0(PFOLDER, "/programs/ldak --linear ", PFOLDER, "/output/bmiW --pheno ", PFOLDER, "/output/phenoLDAK --sandwich YES --sample-weights ", PFOLDER, "/output/weightsLDAK --bfile ",PFOLDER, "/data/GenData"))
end_time <- Sys.time()## [1] "Running time weighted GWAS: 1.78 seconds"
Results
- The main results are saved in
bmiW.assoc - The a summary version of the results is saved in
bmiW.summaries - The p-values are saved in
bmiW.pvalues
Process LDAK output
list.files(paste0(PFOLDER, "/output"))## [1] "bmi.assoc" "bmi.coeff" "bmi.progress" "bmi.pvalues"
## [5] "bmi.score" "bmi.summaries" "bmiW.assoc" "bmiW.coeff"
## [9] "bmiW.progress" "bmiW.pvalues" "bmiW.score" "bmiW.summaries"
## [13] "phenoLDAK" "weightsLDAK"
# Weighted GWA
effectW=fread(paste0(PFOLDER, "/output/bmiW.assoc"), header=T)
gwaW=data.frame(SNP=effectW$Predictor, CHR=effectW$Chromosome, POS=effectW$Basepair, BETA=effectW$Effect, SE=effectW$SD)
pvalw=fread(paste0(PFOLDER, "/output/bmiW.pvalues"), header=T)
gwaW$pvalue=pvalw$P| SNP | CHR | POS | BETA | SE | pvalue |
|---|---|---|---|---|---|
| rs0001 | 1 | 10090 | 0.00 | 0.19 | 0.98 |
| rs0002 | 1 | 10133 | 0.01 | 0.19 | 0.97 |
| rs0003 | 1 | 10247 | -0.22 | 0.33 | 0.51 |
effect=fread(paste0(PFOLDER, "/output/bmi.assoc"), header=T)
gwa=data.frame(SNP=effect$Predictor, CHR=effect$Chromosome, POS=effect$Basepair, BETA=effect$Effect, SE=effect$SD)
pval=fread(paste0(PFOLDER, "/output/bmi.pvalues"), header=T)
gwa$pvalue=pval$P| SNP | CHR | POS | BETA | SE | pvalue |
|---|---|---|---|---|---|
| rs0001 | 1 | 10090 | 0.09 | 0.09 | 0.35 |
| rs0002 | 1 | 10133 | 0.09 | 0.09 | 0.35 |
| rs0003 | 1 | 10247 | -0.14 | 0.14 | 0.32 |
Plot the results
setwd(paste0(PFOLDER))
manH=gmirror(top=subset(gwa, select=c(SNP, CHR, POS, pvalue)),
bottom=subset(gwaW, select=c(SNP, CHR, POS, pvalue)), tline=0.05, bline=0.05,
toptitle="Standard GWAS",
bottomtitle = "Weighted GWAS",
highlight_p = c(0.05,0.05),
highlighter="darkviolet",
file="/output/ManH")## [1] "Saving plot to /output/ManH.png"