Merging the HGDP and 1000 Genomes reference samples
This repo will take you through the steps to merge the HGDP and 1000G reference files into a single plink binary file.
Once the repo has been downloaded, make sure that you meet all of the requirements and download the necessary files into their respective folders.
To do that, first go to the DataBases folder and read the README files indicating what needs to be downloaded in each (both the HGDP and the 1000G folders).
After everything has been downloaded, you can start running the script, located in the Code folder.
This is a Python notebook, so you can run it interactively and modify it to your needs.
In summary the script will follow these steps:
This script was run on a Linux machine running Ubuntu 18.04. You will need the following programs:
For the following programs, you can use the bioconda channel to install them through Anaconda. To do that, once you’ve installed Anaconda, follow the instructions here. The script assumes that all of the following programs are in your path.
conda install plink
conda install vcftools
conda install bcftools
conda install ucsc-liftover
To ease the process, there is a conda environment file in Code/mergeref.yml.
With Anaconda already installed, you can create the same environment used to run the script:
conda env create -f mergeref.yml
source activate mergeref
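Since the script assumes the programs are in your path, it can help to check that before running the notebook. The following is a minimal sketch, not part of the repo: the helper name `missing_tools` is hypothetical, and the executable names (in particular `liftOver` for the ucsc-liftover package) are assumptions about what the conda packages install.

```python
import shutil

# Executable names assumed to be provided by the conda packages above.
REQUIRED_TOOLS = ["plink", "vcftools", "bcftools", "liftOver"]


def missing_tools(tools, which=shutil.which):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if which(tool) is None]


missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found in PATH:", ", ".join(missing))
else:
    print("All required programs were found in PATH.")
```

Run it from a terminal where the mergeref environment is active, otherwise the programs installed by conda will not be visible.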
In the DataBases folder you’ll need to download the respective files.
In each folder (HGDP and 1000G) there is a README file with the same information.
Download the following files and paste them in the DataBases/HGDP folder.
The HGDP Stanford files can be downloaded from here.
You will also need to download the Sample Information from here.
Finally, you’ll need to download the chain file that tells liftOver how to convert from hg18 to hg19 from here.
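To illustrate what the chain file is used for, here is a rough sketch of how SNP positions could be prepared for liftOver. It is not the notebook's actual code. It assumes the SNP positions are 1-based, writes them in the 0-based, half-open BED format that liftOver reads, and builds the standard `liftOver oldFile map.chain newFile unMapped` call. The function names and all file names are hypothetical.

```python
def to_bed_line(chrom, pos, snp_id):
    """Format a 1-based SNP position as a BED line (0-based start, exclusive end)."""
    chrom = str(chrom)
    # UCSC chain files name chromosomes with a "chr" prefix.
    # Numeric codes for sex chromosomes (e.g. 23 for X) would need extra mapping.
    if not chrom.startswith("chr"):
        chrom = "chr" + chrom
    return f"{chrom}\t{pos - 1}\t{pos}\t{snp_id}"


def liftover_command(bed_in, chain_file, bed_out, unmapped_out):
    """Build the argument list for: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", bed_in, chain_file, bed_out, unmapped_out]


# Hypothetical file names, for illustration only.
cmd = liftover_command("hgdp_hg18.bed", "hg18_to_hg19.chain", "hgdp_hg19.bed", "unmapped.bed")
print(to_bed_line(1, 1000, "rs0000"))
print(" ".join(cmd))
```

The command list can be passed to `subprocess.run(cmd, check=True)` once the input BED file exists. SNPs that liftOver cannot convert end up in the unmapped file and would have to be dropped before merging.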
Download the following files and paste them in the DataBases/1000G folder.
The 1000G Phase 3 files can be downloaded from here.
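Before starting the notebook, you may want to confirm that the downloads landed in the right folders. This is a small sketch under the assumption that it is run from the root of the repo; the helper `downloaded_files` is hypothetical and only lists whatever is in each folder besides the README, without checking for specific file names.

```python
from pathlib import Path


def downloaded_files(folder):
    """List the files in a folder, ignoring README files. Returns [] if the folder is missing."""
    folder = Path(folder)
    if not folder.is_dir():
        return []
    return sorted(
        p.name
        for p in folder.iterdir()
        if p.is_file() and not p.name.upper().startswith("README")
    )


for name in ("HGDP", "1000G"):
    found = downloaded_files(Path("DataBases") / name)
    print(name, "->", found if found else "nothing downloaded yet")
```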
Here you can see the steps for merging the HGDP and 1000G databases.
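As an illustration of the final merge only, the sketch below builds a plink call that merges two binary filesets into one. It is not taken from the notebook: the function name and the file prefixes are hypothetical, and it assumes both datasets have already been converted to plink binary format on the same genome build.

```python
def plink_merge_command(first_prefix, second_prefix, out_prefix):
    """Build a plink call that merges two binary filesets into a new one."""
    return [
        "plink",
        "--bfile", first_prefix,
        # The second fileset is given as its three files: .bed, .bim and .fam.
        "--bmerge", second_prefix + ".bed", second_prefix + ".bim", second_prefix + ".fam",
        "--make-bed",
        "--out", out_prefix,
    ]


# Hypothetical prefixes, for illustration only.
merge_cmd = plink_merge_command("hgdp_hg19", "1000g_phase3", "hgdp_1000g")
print(" ".join(merge_cmd))
```

To actually run the merge, pass the list to `subprocess.run(merge_cmd, check=True)` from the folder that contains the input files.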