Merging the HGDP and 1000 Genomes reference samples
This repo will take you through the steps to merge the HGDP and 1000G reference files into a single plink binary file.
Once you have downloaded the repo, make sure that you meet all of the requirements and download the necessary files into their respective folders.
To do that, first go to the DataBases folder and read the README files indicating what needs to be downloaded into each subfolder (both the HGDP and the 1000G folders).
After everything has been downloaded, you can start running the script, located in the Code folder.
The script is a Python notebook, so you can run it interactively and modify it to suit your needs.
In summary the script will follow these steps:
This script was run on a Linux machine using Ubuntu 18.04. You will need the following programs:
You can install the following programs through Anaconda using the bioconda channel. To do that, once you’ve installed Anaconda, follow the instructions here. The script assumes that all of these programs are in your path.
conda install plink
conda install vcftools
conda install bcftools
conda install ucsc-liftover
To ease the process, there is a conda environment file in Code/mergeref.yml.
With Anaconda already installed, you can create the same environment used to run the script:
conda env create -f mergeref.yml
source activate mergeref
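Because the script assumes that every program is in your path, a quick check before opening the notebook can save some debugging. The following is a minimal sketch, not part of the repo: the executable names (plink, vcftools, bcftools, liftOver) are assumptions based on what the conda packages above usually install, and missing_tools is a hypothetical helper name.

```python
import shutil

# Assumed executable names; adjust if your installation uses different ones.
REQUIRED_TOOLS = ["plink", "vcftools", "bcftools", "liftOver"]


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required programs were found.")
```

Run it inside the activated mergeref environment, so that the lookup sees the same path the notebook will use.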
In the DataBases folder you’ll need to download the respective files.
In each folder (HGDP and 1000G) there is a README file with the same information.
Download the following files and place them in the DataBases/HGDP folder.
The HGDP Stanford files can be downloaded from here.
You will also need to download the Sample Information from here.
Finally, you’ll need to download the chain file that tells liftOver how to convert coordinates from hg18 to hg19 from here.
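To illustrate how the chain file is used, here is a small sketch of how a liftOver call could be assembled from Python. It follows the standard liftOver argument order (input file, chain file, output file, unmapped file). The function name and all file paths are hypothetical placeholders, and the notebook in the Code folder may do this differently.

```python
import subprocess


def build_liftover_command(bed_in, chain_file, bed_out, unmapped_out):
    """Build the argument list for: liftOver oldFile map.chain newFile unMapped."""
    return ["liftOver", str(bed_in), str(chain_file), str(bed_out), str(unmapped_out)]


def run_liftover(bed_in, chain_file, bed_out, unmapped_out):
    """Run liftOver and raise CalledProcessError if it exits with an error."""
    cmd = build_liftover_command(bed_in, chain_file, bed_out, unmapped_out)
    subprocess.run(cmd, check=True)
```

Positions that cannot be converted are written to the unmapped file, so it is worth inspecting that file before merging.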
Download the following files and place them in the DataBases/1000G folder.
The 1000G Phase 3 files can be downloaded from here.
Here you can see the steps for merging the HGDP and 1000G databases.