Population structure analysis

This document describes the pipeline used for the population structure analysis of the samples from Shriver’s lab. A list of the samples used can be seen here.

Pipeline summary

We first ran a QC procedure on each dataset. After harmonizing the datasets, we merged them with one another, and finally merged the result with the reference samples from 1000 Genomes (1000G) and the Human Genome Diversity Project (HGDP). The pipeline for merging those two reference samples can be found here.

We then applied LD pruning to generate appropriate input files for ADMIXTURE and PCA.
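
As a minimal sketch of the LD pruning step, the two plink 1.9 calls could be assembled as below. The window size, step size, and r^2 threshold (50, 5, 0.2) and all file prefixes are assumptions for illustration; this section does not state the values actually used.

```python
import shlex


def ld_prune_commands(bfile, out, window=50, step=5, r2=0.2):
    """Build two plink 1.9 argument lists for LD pruning.

    The default window/step/r2 values are illustrative assumptions,
    not the settings used in the pipeline.
    """
    # First call: write <out>.prune.in and <out>.prune.out
    prune = ["plink", "--bfile", bfile,
             "--indep-pairwise", str(window), str(step), str(r2),
             "--out", out]
    # Second call: keep only the variants listed in <out>.prune.in
    extract = ["plink", "--bfile", bfile,
               "--extract", f"{out}.prune.in",
               "--make-bed", "--out", f"{out}_pruned"]
    return prune, extract


if __name__ == "__main__":
    # "merged" and "merged_ld" are hypothetical file prefixes
    for cmd in ld_prune_commands("merged", "merged_ld"):
        print(shlex.join(cmd))
```

The commands are only printed here; passing each list to `subprocess.run(cmd, check=True)` would execute them if plink is on the path.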

QC procedure

Our QC procedure was done using plink 1.9 for each dataset, both before and after merging across platforms, in the following order:

1. Remove non-founders, that is, individuals with at least one parent in the dataset, and retain only autosomal chromosomes
2. Remove SNPs with missing call rates higher than 0.1
3. Remove SNPs with minor allele frequencies below 0.05
4. Remove SNPs with Hardy-Weinberg equilibrium p-values less than 1e-50
5. Remove samples with missing call rates higher than 0.1
6. Remove one arbitrary individual from each pair with PI_HAT >= 0.25 in an IBD estimation run after LD pruning
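
The relatedness filter in the last item of the list can be sketched as follows. This is an illustrative reading of "remove one arbitrary individual", not the pipeline's actual script: it reads the `IID1`, `IID2`, and `PI_HAT` columns that plink's `--genome` report contains, and for every pair at or above the threshold it drops the second individual unless one member of the pair has already been dropped. The function names are hypothetical.

```python
def read_genome_pairs(lines):
    """Yield (iid1, iid2, pi_hat) from the lines of a plink .genome report."""
    it = iter(lines)
    header = next(it).split()
    i1 = header.index("IID1")
    i2 = header.index("IID2")
    ip = header.index("PI_HAT")
    for line in it:
        fields = line.split()
        if fields:  # skip blank lines
            yield fields[i1], fields[i2], float(fields[ip])


def pick_related_to_remove(pairs, threshold=0.25):
    """Return the set of individual IDs to drop so no related pair remains."""
    removed = set()
    for iid1, iid2, pi_hat in pairs:
        if pi_hat < threshold:
            continue
        if iid1 in removed or iid2 in removed:
            continue  # this pair is already broken by an earlier removal
        removed.add(iid2)  # arbitrary choice: drop the second individual
    return removed
```

The resulting IDs would then be written, together with their family IDs, to a file passed to plink's `--remove`.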

After merging platforms, the QC procedure was repeated from steps 2 to 5.
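
The first five filters listed above map onto standard plink 1.9 flags. The sketch below builds one plink call per filter, on the assumption that the listed order should be enforced explicitly, since plink applies filters given in a single call in its own fixed order. The file prefixes and the `first_step` parameter are hypothetical; `first_step=2` mimics the post-merge repeat.

```python
import shlex

# (label, plink flags) for the first five QC filters, in the listed order
QC_STEPS = [
    ("founders_autosomes", ["--filter-founders", "--autosome"]),
    ("geno", ["--geno", "0.1"]),
    ("maf", ["--maf", "0.05"]),
    ("hwe", ["--hwe", "1e-50"]),
    ("mind", ["--mind", "0.1"]),
]


def qc_commands(bfile, out, first_step=1):
    """Build chained plink 1.9 calls; each reads the previous call's output."""
    commands = []
    current = bfile
    for number, (label, flags) in enumerate(QC_STEPS, start=1):
        if number < first_step:
            continue
        target = f"{out}_{number}_{label}"
        commands.append(["plink", "--bfile", current, *flags,
                         "--make-bed", "--out", target])
        current = target
    return commands


if __name__ == "__main__":
    # "dataset" and "dataset_qc" are hypothetical file prefixes
    for cmd in qc_commands("dataset", "dataset_qc"):
        print(shlex.join(cmd))
```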

Merging platforms

Because our datasets were genotyped on different platforms, we harmonized them before attempting the merge, using 1000 Genomes Phase 3 (1000G) as the reference sample, to increase the chances of a successful merge. In doing so, we resolved unknown strand issues, updated variant IDs, and updated the reference alleles. We kept all SNPs from each dataset during harmonization and removed problematic SNPs during the merging steps.
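
To illustrate the kind of allele comparison that strand harmonization involves, here is a minimal sketch that classifies one SNP against the reference alleles. It is not the tool used in the pipeline, which this section does not name, and the category labels are hypothetical. Real harmonization tools may also resolve A/T and C/G SNPs with additional information; this sketch only flags them.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_snp(alleles, ref_alleles):
    """Compare a dataset allele pair with the reference allele pair.

    alleles:     (a1, a2) as coded in the dataset
    ref_alleles: (ref, alt) as coded in the reference sample
    """
    a = tuple(x.upper() for x in alleles)
    r = tuple(x.upper() for x in ref_alleles)
    if any(x not in COMPLEMENT for x in a + r):
        return "unsupported"  # indels, missing codes, and so on
    if COMPLEMENT[a[0]] == a[1]:
        # A/T and C/G SNPs look identical on both strands
        return "ambiguous"
    flipped = tuple(COMPLEMENT[x] for x in a)
    if a == r:
        return "match"
    if a == r[::-1]:
        return "swap"       # same strand, allele order reversed
    if flipped == r:
        return "flip"       # opposite strand
    if flipped == r[::-1]:
        return "flip_swap"  # opposite strand and reversed order
    return "mismatch"       # alleles disagree with the reference
```

For example, `classify_snp(("T", "C"), ("A", "G"))` returns `"flip"`, since complementing both alleles reproduces the reference pair.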

Pipeline scripts

Initial dataset split and QC
Harmonize genotypes, uploaded to Penn State HPC infrastructure
Merging datasets and QC
Phasing genotypes, using Penn State HPC infrastructure
FineStructure, using Penn State HPC infrastructure