Project setup

Project folder

Create project folder and set up the folder structure as follows:

    # Define project directory
    PFOLDER="your/project"

    # Create subfolders
    mkdir $PFOLDER/programs
    mkdir $PFOLDER/data
    mkdir $PFOLDER/output

Folder structure:

PFOLDER/
   | wGWA.R
   ├── data/
   │   └── GenData.bed
   │   └── GenData.bim
   │   └── GenData.fam
   │   └── phenoDat
   ├── output/
   ├── programs/
       └── ldak

Download the toy data

Download here
store data in $PFOLDER

Download LDAK

Download here
Rename and store LDAK as follows: $PFOLDER/programs/ldak

Dowload the R script to perform all analyses

Download here
Store R-script as follows: $PFOLDER/wGWA.R

Application of sampling weights to phenotypic data

Load R-packages and define working directory

if (!require("hudson")) install.packages("hudson")
if (!require("data.table")) install.packages("data.table")
PFOLDER="your/project"

Load functions

# =========== FUNCTIONS ==========================
# get weighted mean
meanW=function(df, pheno){
  weights=df$weights
  variable=df[[pheno]]
  meanW=sum(variable*weights)/sum(weights)
  mean=mean(variable, na.rm=T)
  return(data.frame(pheno=pheno, meanS=round(mean,1), meanW=round(meanW,1)))
}


# get weighted and standard correlations
get_lower_tri<-function(cormat){
  cormat[upper.tri(cormat)] <- NA
  return(cormat)
}

getWcor=function(df, pheno){
  weighted_corr <- cov.wt(subset(df, select=c(pheno)), wt = df$weights, cor = TRUE)
  corr_matrix <- weighted_corr$cor
  cor_w_tri <- get_lower_tri(corr_matrix)
  corMelt_w=melt(cor_w_tri)
  corOutW=subset(corMelt_w, Var1!=Var2 & is.na(value)==F )
  
  cor_sample=cor(subset(df, select=c(pheno)))
  cor_tri <- get_lower_tri(cor_sample)
  corMelt=melt(cor_tri)
  corOut=subset(corMelt, Var1!=Var2 & is.na(value)==F )
  
  corDF=data.frame(cor=paste0(corOut$Var1, "~", corOut$Var2), corS=corOut$value, corW=corOutW$value)
  return(corDF)
}

Phenotype analyses

Read in phenotype data

dat=fread(paste0(PFOLDER, "/data/PhenoDat"))
head(dat)


FID	IID	age	bmi	education	weights
FAM0001	IID0001	56.42	25.91	17.60	0.51
FAM0001	IID0002	50.17	21.66	20.16	0.10
FAM0001	IID0003	41.63	38.89	14.57	20.44
FAM0001	IID0004	55.46	29.34	19.52	0.30

The sampling weight are stored in the column dat$weights
The sampling weights were designed to bring the non-representative (simulated) sample in line with a (simulated) representative reference sample. We have the following means in the simulated datasets:


Phenotype	Mean in reference sample	Mean in non-representative sample
age	53.0	57.0
bmi	28.0	26.5
education	16.5	17.5

The estimated weights were re-scaled so that they sum up to the total sample, where a re-scaled weight >1 means that an individual carries more weight in the weighted population than in the original population, and a re-scaled weight <1 means that an individual carries less weight.

Spread of the sampling weights

summary(dat$weights)

Sampling weights
mean	min	max
1	0.01	36.19

The effective sample size of the weighted sample

The effective sample size (nEFF) for the weighed sample is calculated using the formula:

\[\textrm{nEFF} = \frac{(\sum_{i=1}^n w_i)^2}{\sum_{i=1}^n w_i^2}\]

nEFF=round((sum(dat$weights)^2)/sum(dat$weights^2),0)


Sample size
Total sample size: 6000
Effective sample size: 1684

Weighting will always lead to a reduction in the effective sample size (nEff), provided there is variability among sampling weights. The degree of sample size reduction is a function of the variability among sampling weights – the larger the spread, the larger the reduction in nEff.
For example, consider a sample of n=2 where individual one has a normalized weight of wn1=0.5 (i.e., is < 1, so will be down-weighted) and individual two has a normalized weight of wn2=1.5 (i.e., is > 1, so will be up-weighted). Applying the formula used for effective sample size estimation [ sum(wn)^2/sum(wn^2)] gives us an effective sample size of nEff=1.6, highlighting that weighting led to a 20% reduction in sample size. Sampling weights with a lower spread would result in a larger effective sample size [e.g.,wn1=1.1; wn2=0.9; nEff=1.98] and sampling weights with no variability will recover the original sample size of n=2 [e.g.,wn1=1.1; wn2=1; nEff=2].

Estimate weighted mean

\[Weighted Mean = \frac{1}{N}\sum_{i=1}^{N}w_i x_i\]

pheno=c("age", "bmi", "education")
outWL=lapply(pheno, function(x) meanW(df=dat, pheno=x))
outW=do.call(rbind, outWL)


Phenotye	mean (non-representative sample)	mean (weighted sample)	mean (reference sample)
age	56.9	53.1	53.0
bmi	26.3	28.3	28.0
education	17.5	16.2	16.5

Estimate weighted correlations

corW=getWcor(df=dat, pheno=pheno)


	Correlation (non-representive sample)	Correlation (weighted sample)	Correlation (reference sample)
bmi~age	0.1	0.24	0.3
education~age	0.1	0.24	0.3
education~bmi	0.1	0.17	0.3

The following correlations were used to simulate the data:

Correlations among all phenotypes in the non-representative sample: $r=0.1$
Correlations among all phenotypes in the reference sample: $r=0.3$

Genotype analyses

Weighted genome-wide analyses

Prepare data for LDAK

Phenotype data

# prepare phenotype data
write.table(subset(dat, select=c(FID, IID, bmi)),
            file= paste0(PFOLDER, "/output/phenoLDAK" ),
            sep="\t",
            row.names = F,
            col.names=F,
            quote=F)

print(head(subset(dat, select=c(FID, IID, bmi)), n=3))

##        FID     IID      bmi
## 1: FAM0001 IID0001 25.91042
## 2: FAM0001 IID0002 21.66125
## 3: FAM0001 IID0003 38.89009

Sampling weights

write.table(subset(dat, select=c(FID, IID, weights)),
            file= paste0(PFOLDER, "/output/weightsLDAK" ),
            sep="\t",
            row.names = F,
            col.names=F,
            quote=F)

print(head(subset(dat, select=c(FID, IID, weights)), n=3))

##        FID     IID     weights
## 1: FAM0001 IID0001  0.50752976
## 2: FAM0001 IID0002  0.09965894
## 3: FAM0001 IID0003 20.44120488

Run standard GWA in LDAK

A detailed documentation is available on the LDAK website
Parameters standard linear regression:
- --linear <outfile>: perform linear regression
- --pheno <phenofile>: to specify the phenotypes (in PLINK format)
- --covar <covarfile>: to provide covariates (in PLINK format)
- --bfile: to specify the genetic data files

# give execute permissions
system(paste0("chmod a+x ",PFOLDER, "/programs/ldak"))
# run linear regression
start_time <- Sys.time()
system(paste0(PFOLDER, "/programs/ldak --linear ",  PFOLDER, "/output/bmi --pheno ", PFOLDER, "/output/phenoLDAK --bfile ",PFOLDER, "/data/GenData"))
end_time <- Sys.time()

## [1] "Running time standard GWAS: 1.5 seconds"

Results

The main results are saved in bmi.assoc
The a summary version of the results is saved in bmi.summaries
The p-values are saved in bmi.pvalues

Run weighted GWA in LDAK

To run weighted linear regression, set:
- --sandwich YES: use sandwich estimator to get the standard errors (Huber-White Sandwich Estimator)
- --sample-weights <sampleweightfile>: perform weighted linear regression. The <sampleweightfile> should have three columns, where each row provides two sample IDs followed by the sampling weights

# run LDAK
start_time <- Sys.time()
system(paste0(PFOLDER, "/programs/ldak --linear ",  PFOLDER, "/output/bmiW --pheno ", PFOLDER, "/output/phenoLDAK --sandwich YES --sample-weights ", PFOLDER, "/output/weightsLDAK --bfile ",PFOLDER, "/data/GenData"))
end_time <- Sys.time()

## [1] "Running time weighted GWAS: 1.78 seconds"