
May 2018: Exome sequencing data from 7465 individuals - East London, Birmingham, Bradford

The updated files below contain lists of predicted loss of function (LoF) and functional (missense or inframe indel) variants in the current Genes & Health callset, released in November 2017. Variants were called in 3781 East London Genes & Health volunteers (Bangladeshi and Pakistani, with self-stated related parents), 2624 Bradford volunteers (Pakistani, mostly self-stated or DNA autozygous individuals) and 1060 Birmingham volunteers (Pakistani, unselected). Bradford and Birmingham samples are as described in Narasimhan et al., Science 2016, with additional new Bradford samples.

Only variants that were present as homozygotes in at least 1 Genes & Health sample have been included in these files, which contain detailed variant annotation (from Ensembl Variant Effect Predictor) as well as allele frequencies in other populations and in ExAC/gnomAD.

Mapping was done with bwa-mem and variant calling was carried out with GATK HaplotypeCaller. We removed variant sites for which the following was true:

SNPs: "QD < 2.0 || FS > 30 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"

Indels: "QD < 2.0 || FS > 30 || ReadPosRankSum < -20.0" 
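As an illustration only, the two site-level expressions above can be restated as a small Python function. This is a sketch, not the pipeline code: the function name, the dictionary layout and the choice to let a missing annotation never trigger a filter are all assumptions made for the example.

```python
from typing import Mapping, Optional


def _below(value: Optional[float], threshold: float) -> bool:
    # Assumption: a missing annotation never triggers the filter.
    return value is not None and value < threshold


def _above(value: Optional[float], threshold: float) -> bool:
    return value is not None and value > threshold


def fails_site_filter(info: Mapping[str, float], is_snp: bool) -> bool:
    """Return True if a site would be removed by the expressions quoted above.

    `info` is a hypothetical dict of site annotations keyed by their
    GATK names (QD, FS, MQ, MQRankSum, ReadPosRankSum).
    """
    qd = info.get("QD")
    fs = info.get("FS")
    read_pos = info.get("ReadPosRankSum")
    if is_snp:
        return (
            _below(qd, 2.0)
            or _above(fs, 30.0)
            or _below(info.get("MQ"), 40.0)
            or _below(info.get("MQRankSum"), -12.5)
            or _below(read_pos, -8.0)
        )
    # Indels: no MQ or MQRankSum term, and a looser ReadPosRankSum cutoff.
    return _below(qd, 2.0) or _above(fs, 30.0) or _below(read_pos, -20.0)
```

Note that the same annotations can fail as a SNP and pass as an indel, because the indel expression omits the mapping-quality terms and uses -20.0 rather than -8.0 for ReadPosRankSum.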

 

All files contain genotype counts. In Files 2, 4 and 6 the variants have been through the basic variant-level filtering only. In Files 1, 3 and 5 the variants have subsequently also been through genotype-level filtering, in which genotypes with GQ<20 or allele balance p-value <0.01 are set to missing. Caution: This genotype-level filtering is not optimal (being probably too strict on homozygotes in low-coverage regions), so will be improved at a later stage.
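The genotype-level step can be sketched as follows. The page does not say which test produces the allele balance p-value or which genotypes it is applied to, so this example assumes a two-sided exact binomial test against an expected 50:50 read ratio, applied to heterozygous calls only. The function names and the "0/1" genotype encoding are likewise hypothetical.

```python
from math import comb
from typing import Optional


def allele_balance_p(alt_reads: int, total_reads: int) -> float:
    """Two-sided exact binomial p-value against a 50:50 ratio (assumed test)."""
    if total_reads == 0:
        return 1.0
    denom = 2 ** total_reads
    probs = [comb(total_reads, i) / denom for i in range(total_reads + 1)]
    observed = probs[alt_reads]
    # Sum every outcome that is no more likely than the observed one.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def filter_genotype(gt: str, gq: int, ref_reads: int, alt_reads: int,
                    min_gq: int = 20, min_ab_p: float = 0.01) -> Optional[str]:
    """Return the genotype unchanged, or None if it is set to missing."""
    if gq < min_gq:
        return None
    # Assumption: allele balance is only checked for heterozygous calls.
    if gt == "0/1":
        if allele_balance_p(alt_reads, ref_reads + alt_reads) < min_ab_p:
            return None
    return gt
```

For example, under these assumptions `filter_genotype("0/1", 99, 20, 1)` is set to missing because 1 alternate read out of 21 is very unlikely for a true heterozygote, while a homozygous call supported by only a few reads can still be lost through the GQ threshold alone.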

Please also note that whilst some samples have been sequenced to a read depth of ~40X, others have been sequenced only to ~20X.

Files containing only predicted loss of function variants (Files 1, 2, 3 and 4) are in pairs. The *all_transcripts_printed.txt files contain the annotations across all transcripts for which the variant is a predicted loss of function. The *annotation_not_in_last_exon_and_present_in_all_transcripts.txt files contain just the predicted loss of function variants which are present in all transcripts of a gene and are not located in the last exon.
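The relationship between the two files in each pair can be sketched as a filter over per-transcript annotations. The record layout here is hypothetical (see File 7 for the real column headings). The sketch assumes exon positions are given in the "number/total" style used by Ensembl VEP, and it reads "not located in the last exon" as excluding a variant if it falls in the last exon of any transcript, which is one possible interpretation.

```python
from typing import Iterable, Mapping


def in_last_exon(exon_field: str) -> bool:
    """True if an exon field such as "12/12" points at the final exon.

    An empty field (for example an intronic splice variant) counts as
    not being in the last exon. A range such as "3-4/10" is judged by
    its last exon number. Both rules are assumptions for this sketch.
    """
    if not exon_field or "/" not in exon_field:
        return False
    number, total = exon_field.split("/")
    return int(number.split("-")[-1]) == int(total)


def passes_transcript_filter(lof_annotations: Iterable[Mapping[str, str]],
                             gene_transcripts: Iterable[str]) -> bool:
    """Keep a variant only if it is LoF in every transcript of the gene
    and is not in the last exon of any of them."""
    annotations = list(lof_annotations)
    lof_transcripts = {a["transcript"] for a in annotations}
    if not set(gene_transcripts) <= lof_transcripts:
        return False
    return not any(in_last_exon(a.get("exon", "")) for a in annotations)
```

Here `lof_annotations` stands for the rows a variant has in an *all_transcripts_printed.txt file, and `gene_transcripts` for the full set of annotated transcripts of that gene. Both names and the transcript identifiers used to call the function are placeholders.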

For more information about the files see File 7. 

ADDED 20 JULY 2018:

Files 8 and 9: Variants that were present in at least 1 sample from any cohort (East London Genes and Health, Bradford, Birmingham) have been included in these files.

 

Files 

File 1. ** MOST USERS WILL PROBABLY WANT TO USE THIS FILE ** - Filtered list of predicted loss of function variants with basic GATK filtering (gatk_PASS) followed by more stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01), containing only those variants which will be present in all annotated Ensembl transcripts of a gene and are also not located in the last exon. There is only one transcript for each variant listed. 

all_LoFs.gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01.LoFs.missingness_lt_0.genotype_counts.present_in_ELGH.n_transcripts_corrected.all_transcripts_printed.annotation_not_in_last_exon_and_present_in_all_transcripts.txt

Caution: This genotype-level filtering is not optimal (being probably too strict on homozygotes in low-coverage regions), so will be improved at a later stage. 

 

File 2. Filtered list of predicted loss of function variants with basic GATK filtering (gatk_PASS) containing only those variants which will be present in all annotated Ensembl transcripts of a gene and are also not located in the last exon. There is only one transcript for each variant listed.

all_LoFs.gatk_PASS.LoFs.missingness_lt_0.genotype_counts.present_in_ELGH.n_transcripts_corrected.all_transcripts_printed.annotation_not_in_last_exon_and_present_in_all_transcripts.txt  

 

File 3.  List of all predicted loss of function variants with basic GATK filtering (gatk_PASS), followed by stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01) showing annotations across all the transcripts within a gene for which the variant is a predicted loss of function. The variant annotations with respect to all transcripts are printed.

all_LoFs.gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01.LoFs.missingness_lt_0.genotype_counts.present_in_ELGH.n_transcripts_corrected.all_transcripts_printed.txt

Caution: This genotype-level filtering is not optimal (being probably too strict on homozygotes in low-coverage regions), so will be improved at a later stage.

 

File 4. List of all predicted loss of function variants with basic GATK filtering (gatk_PASS) showing annotations across all the transcripts within a gene for which the variant is a predicted loss of function. The variant annotations with respect to all transcripts are printed.

all_LoFs.gatk_PASS.LoFs.missingness_lt_0.genotype_counts.present_in_ELGH.n_transcripts_corrected.all_transcripts_printed.txt

 

 

 File 5. List of all predicted functional variants (missense or inframe indels) with basic GATK filtering, followed by more stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01). The variant annotations with respect to all transcripts are printed.

all_functional.gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01.functional.missingness_lt_0.genotype_counts.present_in_ELGH.n_transcripts_corrected.txt.zip

Caution: This genotype-level filtering is not optimal (being probably too strict on homozygotes in low-coverage regions), so will be improved at a later stage.

 

File 6. List of all predicted functional variants (missense or inframe indels) with basic GATK filtering (gatk_PASS). The variant annotations with respect to all transcripts are printed.

all_functional.gatk_PASS.functional.missingness_lt_0.genotype_counts.present_in_ELGH.n_transcripts_corrected.txt.zip 

 

File 7. UPDATED 13 JUNE 2018 - Word document giving more information about file column headings and processes.

README_UPDATED_0.docx

 

 File 8. ADDED 20 JULY 2018 List of all predicted loss of function variants with basic GATK filtering (gatk_PASS), followed by stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01) showing annotations across all the transcripts within a gene for which the variant is a predicted loss of function. The variant annotations with respect to all transcripts are printed. Variants present in at least 1 sample from any cohort (East London Genes and Health, Bradford, Birmingham).

all_chrs.LoFs.GATK_PASS.FS_30.DP_0.GQ_20.AB_0.01.LoFs.missingness_lt_0.genotype_counts.present_in_any_sample.n_transcripts_corrected.all_transcripts_printed.txt

 

File 9. ADDED 20 JULY 2018 List of all predicted functional variants (missense or inframe indels) with basic GATK filtering, followed by more stringent genotype level filtering (gatk_PASS.FS_30.DP_0.GQ_20.AB_0.01). Variants present in at least 1 sample from any cohort (East London Genes and Health, Bradford, Birmingham).

all_chrs.functional.GATK_PASS.FS_30.DP_0.GQ_20.AB_0.01.functional.missingness_lt_0.genotype_counts.present_in_any_sample.n_transcripts_corrected.txt.gz
