Project setup
Project folder
- Create project folder and set up the folder structure as follows:
# Define project directory
PFOLDER="your/project"
# Create subfolders
mkdir $PFOLDER/programs
mkdir $PFOLDER/data
mkdir $PFOLDER/output
- Folder structure:
PFOLDER/
| wGWA.R
├── data/
│ └── GenData.bed
│ └── GenData.bim
│ └── GenData.fam
│ └── phenoDat
├── output/
├── programs/
└── ldak
Download the toy data
store data in
$PFOLDER
Download LDAK
Rename and store LDAK as follows:
$PFOLDER/programs/ldak
Dowload the R script to perform all analyses
Store R-script as follows:
$PFOLDER/wGWA.R
Application of sampling weights to phenotypic data
- Load R-packages and define working directory
if (!require("hudson")) install.packages("hudson")
if (!require("data.table")) install.packages("data.table")
="your/project" PFOLDER
- Load functions
# =========== FUNCTIONS ==========================
# get weighted mean
=function(df, pheno){
meanW=df$weights
weights=df[[pheno]]
variable=sum(variable*weights)/sum(weights)
meanW=mean(variable, na.rm=T)
meanreturn(data.frame(pheno=pheno, meanS=round(mean,1), meanW=round(meanW,1)))
}
# get weighted and standard correlations
<-function(cormat){
get_lower_triupper.tri(cormat)] <- NA
cormat[return(cormat)
}
=function(df, pheno){
getWcor<- cov.wt(subset(df, select=c(pheno)), wt = df$weights, cor = TRUE)
weighted_corr <- weighted_corr$cor
corr_matrix <- get_lower_tri(corr_matrix)
cor_w_tri =melt(cor_w_tri)
corMelt_w=subset(corMelt_w, Var1!=Var2 & is.na(value)==F )
corOutW
=cor(subset(df, select=c(pheno)))
cor_sample<- get_lower_tri(cor_sample)
cor_tri =melt(cor_tri)
corMelt=subset(corMelt, Var1!=Var2 & is.na(value)==F )
corOut
=data.frame(cor=paste0(corOut$Var1, "~", corOut$Var2), corS=corOut$value, corW=corOutW$value)
corDFreturn(corDF)
}
Phenotype analyses
Read in phenotype data
=fread(paste0(PFOLDER, "/data/PhenoDat"))
dathead(dat)
FID | IID | age | bmi | education | weights |
---|---|---|---|---|---|
FAM0001 | IID0001 | 56.42 | 25.91 | 17.60 | 0.51 |
FAM0001 | IID0002 | 50.17 | 21.66 | 20.16 | 0.10 |
FAM0001 | IID0003 | 41.63 | 38.89 | 14.57 | 20.44 |
FAM0001 | IID0004 | 55.46 | 29.34 | 19.52 | 0.30 |
The sampling weight are stored in the column
dat$weights
The sampling weights were designed to bring the non-representative (simulated) sample in line with a (simulated) representative reference sample. We have the following means in the simulated datasets:
Phenotype | Mean in reference sample | Mean in non-representative sample |
---|---|---|
age | 53.0 | 57.0 |
bmi | 28.0 | 26.5 |
education | 16.5 | 17.5 |
- The estimated weights were re-scaled so that they sum up to the total sample, where a re-scaled weight >1 means that an individual carries more weight in the weighted population than in the original population, and a re-scaled weight <1 means that an individual carries less weight.
Spread of the sampling weights
summary(dat$weights)
mean | min | max |
---|---|---|
1 | 0.01 | 36.19 |
The effective sample size of the weighted sample
The effective sample size (nEFF) for the weighed sample is calculated using the formula:
\[\textrm{nEFF} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}\]
=round((sum(dat$weights)^2)/sum(dat$weights^2),0) nEFF
Sample size |
---|
Total sample size: 6000 |
Effective sample size: 1684 |
- Weighting will always lead to a reduction in the effective sample size (nEff), provided there is variability among sampling weights. The degree of sample size reduction is a function of the variability among sampling weights – the larger the spread, the larger the reduction in nEff.
- For example, consider a sample of n=2 where individual one has a
normalized weight of wn1=0.5 (i.e., is < 1, so will be down-weighted)
and individual two has a normalized weight of wn2=1.5 (i.e., is > 1,
so will be up-weighted). Applying the formula used for effective sample
size estimation [
sum(wn)^2/sum(wn^2)
] gives us an effective sample size of nEff=1.6, highlighting that weighting led to a 20% reduction in sample size. Sampling weights with a lower spread would result in a larger effective sample size [e.g.,wn1=1.1; wn2=0.9; nEff=1.98] and sampling weights with no variability will recover the original sample size of n=2 [e.g.,wn1=1.1; wn2=1; nEff=2].
Estimate weighted mean
\[Weighted Mean = \frac{1}{N}\sum_{i=1}^{N}w_i x_i\]
=c("age", "bmi", "education")
pheno=lapply(pheno, function(x) meanW(df=dat, pheno=x))
outWL=do.call(rbind, outWL) outW
Phenotye | mean (non-representative sample) | mean (weighted sample) | mean (reference sample) |
---|---|---|---|
age | 56.9 | 53.1 | 53.0 |
bmi | 26.3 | 28.3 | 28.0 |
education | 17.5 | 16.2 | 16.5 |
Estimate weighted correlations
=getWcor(df=dat, pheno=pheno) corW
Correlation (non-representive sample) | Correlation (weighted sample) | Correlation (reference sample) | |
---|---|---|---|
bmi~age | 0.1 | 0.24 | 0.3 |
education~age | 0.1 | 0.24 | 0.3 |
education~bmi | 0.1 | 0.17 | 0.3 |
The following correlations were used to simulate the data:
Correlations among all phenotypes in the non-representative sample: \(r=0.1\)
Correlations among all phenotypes in the reference sample: \(r=0.3\)
Genotype analyses
Weighted genome-wide analyses
Prepare data for LDAK
Phenotype data
# prepare phenotype data
write.table(subset(dat, select=c(FID, IID, bmi)),
file= paste0(PFOLDER, "/output/phenoLDAK" ),
sep="\t",
row.names = F,
col.names=F,
quote=F)
print(head(subset(dat, select=c(FID, IID, bmi)), n=3))
## FID IID bmi
## 1: FAM0001 IID0001 25.91042
## 2: FAM0001 IID0002 21.66125
## 3: FAM0001 IID0003 38.89009
Sampling weights
write.table(subset(dat, select=c(FID, IID, weights)),
file= paste0(PFOLDER, "/output/weightsLDAK" ),
sep="\t",
row.names = F,
col.names=F,
quote=F)
print(head(subset(dat, select=c(FID, IID, weights)), n=3))
## FID IID weights
## 1: FAM0001 IID0001 0.50752976
## 2: FAM0001 IID0002 0.09965894
## 3: FAM0001 IID0003 20.44120488
Run standard GWA in LDAK
A detailed documentation is available on the LDAK website
Parameters standard linear regression:
--linear <outfile>
: perform linear regression--pheno <phenofile>
: to specify the phenotypes (in PLINK format)--covar <covarfile>
: to provide covariates (in PLINK format)--bfile
: to specify the genetic data files
# give execute permissions
system(paste0("chmod a+x ",PFOLDER, "/programs/ldak"))
# run linear regression
<- Sys.time()
start_time system(paste0(PFOLDER, "/programs/ldak --linear ", PFOLDER, "/output/bmi --pheno ", PFOLDER, "/output/phenoLDAK --bfile ",PFOLDER, "/data/GenData"))
<- Sys.time() end_time
## [1] "Running time standard GWAS: 1.5 seconds"
Results
- The main results are saved in
bmi.assoc
- The a summary version of the results is saved in
bmi.summaries
- The p-values are saved in
bmi.pvalues
Run weighted GWA in LDAK
To run weighted linear regression, set:
--sandwich YES
: use sandwich estimator to get the standard errors (Huber-White Sandwich Estimator)--sample-weights <sampleweightfile>
: perform weighted linear regression. The<sampleweightfile>
should have three columns, where each row provides two sample IDs followed by the sampling weights
# run LDAK
<- Sys.time()
start_time system(paste0(PFOLDER, "/programs/ldak --linear ", PFOLDER, "/output/bmiW --pheno ", PFOLDER, "/output/phenoLDAK --sandwich YES --sample-weights ", PFOLDER, "/output/weightsLDAK --bfile ",PFOLDER, "/data/GenData"))
<- Sys.time() end_time
## [1] "Running time weighted GWAS: 1.78 seconds"
Results
- The main results are saved in
bmiW.assoc
- The a summary version of the results is saved in
bmiW.summaries
- The p-values are saved in
bmiW.pvalues
Process LDAK output
list.files(paste0(PFOLDER, "/output"))
## [1] "bmi.assoc" "bmi.coeff" "bmi.progress" "bmi.pvalues"
## [5] "bmi.score" "bmi.summaries" "bmiW.assoc" "bmiW.coeff"
## [9] "bmiW.progress" "bmiW.pvalues" "bmiW.score" "bmiW.summaries"
## [13] "phenoLDAK" "weightsLDAK"
# Weighted GWA
=fread(paste0(PFOLDER, "/output/bmiW.assoc"), header=T)
effectW=data.frame(SNP=effectW$Predictor, CHR=effectW$Chromosome, POS=effectW$Basepair, BETA=effectW$Effect, SE=effectW$SD)
gwaW=fread(paste0(PFOLDER, "/output/bmiW.pvalues"), header=T)
pvalw$pvalue=pvalw$P gwaW
SNP | CHR | POS | BETA | SE | pvalue |
---|---|---|---|---|---|
rs0001 | 1 | 10090 | 0.00 | 0.19 | 0.98 |
rs0002 | 1 | 10133 | 0.01 | 0.19 | 0.97 |
rs0003 | 1 | 10247 | -0.22 | 0.33 | 0.51 |
=fread(paste0(PFOLDER, "/output/bmi.assoc"), header=T)
effect=data.frame(SNP=effect$Predictor, CHR=effect$Chromosome, POS=effect$Basepair, BETA=effect$Effect, SE=effect$SD)
gwa=fread(paste0(PFOLDER, "/output/bmi.pvalues"), header=T)
pval$pvalue=pval$P gwa
SNP | CHR | POS | BETA | SE | pvalue |
---|---|---|---|---|---|
rs0001 | 1 | 10090 | 0.09 | 0.09 | 0.35 |
rs0002 | 1 | 10133 | 0.09 | 0.09 | 0.35 |
rs0003 | 1 | 10247 | -0.14 | 0.14 | 0.32 |
Plot the results
setwd(paste0(PFOLDER))
=gmirror(top=subset(gwa, select=c(SNP, CHR, POS, pvalue)),
manHbottom=subset(gwaW, select=c(SNP, CHR, POS, pvalue)), tline=0.05, bline=0.05,
toptitle="Standard GWAS",
bottomtitle = "Weighted GWAS",
highlight_p = c(0.05,0.05),
highlighter="darkviolet",
file="/output/ManH")
## [1] "Saving plot to /output/ManH.png"