Merging HGDP and 1000G

This repo takes you through the steps to merge the HGDP and 1000G reference files into a single set of plink binary files. Once you have downloaded the repo, make sure that you meet all of the requirements and download the necessary files into their respective folders. To do that, first go to the DataBases folder and read the README files in the HGDP and 1000G folders, which indicate what needs to be downloaded into each. After everything has been downloaded, you can run the script, located in the Code folder. It is a Python notebook, so you can run it interactively and modify it to your needs. In summary, the script follows these steps:

Transform the HGDP into plink files
LiftOver the HGDP from hg18 to hg19
Extract only the SNPs found in the HGDP from the 1000G vcf files
Concatenate the different chromosomes and export to plink files
Merge the HGDP and 1000G
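
As a rough sketch, steps 2 to 5 above could map onto command lines like the ones built below. This is not the notebook's actual code: it only assembles and prints the commands, and it assumes vcftools for the extraction and bcftools for the concatenation. Every file name (hgdp, hgdp_hg19, kg, the per-chromosome VCF names, the position and ID lists) is a hypothetical placeholder. Step 1 is left out because it depends on the layout of the HGDP download.

```python
import shlex

CHROMOSOMES = range(1, 23)  # autosomes only; an assumption for this sketch


def build_commands(chain="hg18ToHg19.over.chain.gz"):
    """Return command lines for steps 2-5 as lists of arguments.

    All file names are hypothetical placeholders, not the notebook's own.
    """
    cmds = []
    # Step 2: lift the HGDP SNP positions (BED format) from hg18 to hg19,
    # then drop unlifted SNPs and write the new positions into the fileset.
    cmds.append(["liftOver", "hgdp_hg18.bed", chain,
                 "hgdp_hg19.bed", "hgdp_unlifted.bed"])
    cmds.append(["plink", "--bfile", "hgdp",
                 "--exclude", "hgdp_unlifted_ids.txt",
                 "--update-map", "hgdp_hg19_positions.txt",
                 "--make-bed", "--out", "hgdp_hg19"])
    # Step 3: keep only the HGDP sites in each 1000G chromosome VCF.
    # vcftools writes its result to <out>.recode.vcf.
    for c in CHROMOSOMES:
        cmds.append(["vcftools", "--gzvcf", f"1000G_chr{c}.vcf.gz",
                     "--positions", "hgdp_hg19_sites.txt",
                     "--recode", "--out", f"kg_chr{c}"])
    # Step 4: concatenate the chromosomes and export to plink binary files.
    cmds.append(["bcftools", "concat", "-O", "z", "-o", "kg_hgdp_sites.vcf.gz"]
                + [f"kg_chr{c}.recode.vcf" for c in CHROMOSOMES])
    cmds.append(["plink", "--vcf", "kg_hgdp_sites.vcf.gz",
                 "--make-bed", "--out", "kg"])
    # Step 5: merge the lifted HGDP fileset with the 1000G fileset.
    cmds.append(["plink", "--bfile", "hgdp_hg19",
                 "--bmerge", "kg.bed", "kg.bim", "kg.fam",
                 "--make-bed", "--out", "hgdp_1000g"])
    return cmds


if __name__ == "__main__":
    for cmd in build_commands():
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

Printing the commands instead of running them makes it easy to review each call before passing it to subprocess.run or pasting it into a terminal.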

Requirements

This script was run on a Linux machine, using Ubuntu 18.04. You will need the following programs:

Python 3.x: I recommend installing Python 3.x using Anaconda.

You can install the following programs through Anaconda using the bioconda channel. To do that, once you’ve installed Anaconda, follow the instructions here. The script assumes that all of the following programs are on your PATH.

Plink: to install it using bioconda, use the following command: conda install plink
Vcftools: to install it using bioconda, use the following command: conda install vcftools
Bcftools: to install it using bioconda, use the following command: conda install bcftools
UCSC liftOver: to install it using bioconda, use the following command: conda install ucsc-liftover
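
Before running the notebook, it can help to confirm that the four programs are reachable. The snippet below is a minimal sketch that is not part of the repo. It assumes the executables are named plink, vcftools, bcftools and liftOver, which is how the bioconda packages are expected to install them. The function name missing_tools is hypothetical.

```python
import shutil

# Executable names assumed to be provided by the bioconda packages above
# (in particular, ucsc-liftover is assumed to install a binary "liftOver").
REQUIRED_TOOLS = ["plink", "vcftools", "bcftools", "liftOver"]


def missing_tools(tools=REQUIRED_TOOLS, path=None):
    """Return the tools that cannot be found on the search path.

    When `path` is None, shutil.which falls back to the PATH variable.
    """
    return [tool for tool in tools if shutil.which(tool, path=path) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required programs were found.")
```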

To ease the process, there is a conda environment file in Code/mergeref.yml. With Anaconda already installed, you can create the same environment that was used to run the script:

conda env create -f mergeref.yml
source activate mergeref

Files to download

In the DataBases folder you’ll need to download the respective files. In each folder (HGDP and 1000G) there is a README file with the same information.

HGDP

Download the following files and place them in the DataBases/HGDP folder. The HGDP Stanford files can be downloaded from here. You will also need to download the Sample Information from here. Finally, you’ll need to download the chain file that tells liftOver how to convert from hg18 to hg19 from here.
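
The chain file is used by liftOver, which reads positions in UCSC BED format (0-based start, exclusive end, chromosome names such as chr1). As a minimal sketch, assuming the HGDP SNPs are already in a plink .bim file, the conversion to BED and back could look like the following. The function names and the chromosome-code mapping are illustrative assumptions, not the notebook's code.

```python
# plink chromosome codes mapped to UCSC-style names (assumed mapping;
# code 25, the pseudo-autosomal region, is placed on chrX here).
PLINK_TO_UCSC = {"23": "X", "24": "Y", "25": "X", "26": "M",
                 "XY": "X", "MT": "M"}


def bim_line_to_bed(line):
    """Convert one plink .bim line to a four-column UCSC BED line."""
    # .bim columns: chromosome, SNP ID, genetic distance, 1-based position, ...
    chrom, snp_id, _cm, pos = line.split()[:4]
    chrom = PLINK_TO_UCSC.get(chrom, chrom)
    pos = int(pos)
    # BED uses a 0-based start and an exclusive end, so a SNP at 1-based
    # position p covers the interval [p - 1, p).
    return f"chr{chrom}\t{pos - 1}\t{pos}\t{snp_id}"


def bed_line_to_position(line):
    """Read a lifted BED line back as (snp_id, 1-based position)."""
    _chrom, _start, end, snp_id = line.split()[:4]
    return snp_id, int(end)


if __name__ == "__main__":
    bed = bim_line_to_bed("1\trs123\t0\t1000\tA\tG")
    print(bed)
    print(bed_line_to_position(bed))
```

Written out as two columns (SNP ID and new position), the pairs returned by bed_line_to_position have the layout that plink's --update-map option reads.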

1000G

Download the following files and place them in the DataBases/1000G folder. The 1000G Phase 3 files can be downloaded from here.

Steps

Here you can see the steps for merging the HGDP and 1000G databases.