Population structure analysis for Shriver's lab genotype data
Here is the pipeline for the population structure analysis done for the samples from Shriver’s lab. A list of the samples used can be seen here.
We first ran a QC procedure on each dataset. After harmonizing the datasets, we merged them together, and finally merged the result with the reference samples from 1000 Genomes (1000G) and the Human Genome Diversity Project (HGDP). The pipeline for merging those two reference samples can be found here.
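As a rough illustration of the merge step, the sketch below builds a plink 1.9 merge command for two binary filesets. The fileset prefixes (`shriver_merged`, `ref_1000g_hgdp`, `shriver_with_ref`) are hypothetical placeholders, not the names used in this pipeline, and the function only assembles the command without running it.

```python
def plink_merge_cmd(base_prefix, other_prefix, out_prefix):
    """Build a plink 1.9 command that merges two binary filesets.

    All prefixes are hypothetical; substitute the real .bed/.bim/.fam prefixes.
    """
    return [
        "plink",
        "--bfile", base_prefix,     # first fileset (study samples)
        "--bmerge", other_prefix,   # second fileset, given as a single prefix
        "--make-bed",               # write the merged result as a binary fileset
        "--out", out_prefix,
    ]


cmd = plink_merge_cmd("shriver_merged", "ref_1000g_hgdp", "shriver_with_ref")
print(" ".join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine where plink 1.9 is installed. When a merge fails because some variants have incompatible alleles, plink 1.9 writes the offending IDs to `<out>-merge.missnp`, which can then be passed to `--exclude` before retrying.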
We then applied LD pruning to generate appropriate input files for Admixture and PCA.
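A minimal sketch of how LD pruning is commonly done with plink 1.9 follows. The window size, step, and r² threshold (50, 5, 0.2) are assumed example values, not the settings used in this analysis, and the file prefixes are hypothetical. The function only builds the two commands; it does not execute them.

```python
def ld_prune_cmds(in_prefix, out_prefix, window=50, step=5, r2=0.2):
    """Build the two plink 1.9 commands for LD pruning.

    window, step and r2 are assumed example values, not the pipeline's settings.
    """
    prune_prefix = out_prefix + "_prune"
    # Step 1: find variants in approximate linkage equilibrium.
    # plink writes <prune_prefix>.prune.in and <prune_prefix>.prune.out.
    find_cmd = [
        "plink", "--bfile", in_prefix,
        "--indep-pairwise", str(window), str(step), str(r2),
        "--out", prune_prefix,
    ]
    # Step 2: keep only the retained variants in a new binary fileset.
    extract_cmd = [
        "plink", "--bfile", in_prefix,
        "--extract", prune_prefix + ".prune.in",
        "--make-bed", "--out", out_prefix,
    ]
    return [find_cmd, extract_cmd]


commands = ld_prune_cmds("shriver_with_ref", "shriver_with_ref_pruned")
for c in commands:
    print(" ".join(c))
```

The pruned `.bed` file produced by the second command is the kind of input that both Admixture and plink's `--pca` accept.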
Our QC procedure was done using plink 1.9 for each dataset, both before and after merging across platforms, in the following order:
After merging platforms, the QC procedure was repeated from steps 2 to 5.
Because our datasets were genotyped on different platforms, we harmonized them against the 1000 Genomes Phase 3 (1000G) reference sample before attempting to merge them, to increase the chances of a successful merge. In doing so, we resolved unknown strand issues, updated variant IDs, and updated the reference alleles. We kept all SNPs from each dataset at this stage and removed problematic SNPs during the merging steps.
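To make the strand and reference-allele checks concrete, here is a minimal sketch of how a single biallelic SNP can be compared against a reference panel. It illustrates the general technique only and is not the harmonization tool used here; the function name and the category labels are hypothetical. Note that this sketch simply flags A/T and C/G SNPs as ambiguous, whereas the harmonization described above resolved unknown strands.

```python
# Watson-Crick complements, used to test for opposite-strand coding.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_snp(a1, a2, ref, alt):
    """Compare a study SNP's alleles (a1, a2) with the reference (ref, alt).

    Returns one of: "match", "swap", "flip", "flip_swap",
    "ambiguous", "mismatch", "unresolved".
    """
    study = (a1.upper(), a2.upper())
    reference = (ref.upper(), alt.upper())
    # Indels or missing alleles cannot be checked with this simple logic.
    if any(base not in COMPLEMENT for base in study + reference):
        return "unresolved"
    # A/T and C/G SNPs look identical on both strands.
    if COMPLEMENT[study[0]] == study[1]:
        return "ambiguous"
    if study == reference:
        return "match"        # same strand, same allele order
    if study == reference[::-1]:
        return "swap"         # same strand, reference allele must be updated
    flipped = (COMPLEMENT[study[0]], COMPLEMENT[study[1]])
    if flipped == reference:
        return "flip"         # opposite strand
    if flipped == reference[::-1]:
        return "flip_swap"    # opposite strand and swapped allele order
    return "mismatch"         # alleles incompatible with the reference


for alleles in [("A", "G"), ("G", "A"), ("T", "C"), ("C", "T"), ("A", "C")]:
    print(alleles, classify_snp(*alleles, ref="A", alt="G"))
```

SNPs classified as "flip" or "flip_swap" would be candidates for plink's `--flip`, and those needing a reference allele update for `--a1-allele` or `--a2-allele`; "mismatch" SNPs are the kind that typically have to be dropped at merge time.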