
Sept 2019: SUMMARY FILES, Exome sequencing loss of function variant data from 8921 individuals (East London, Birmingham, Born In Bradford)


September 2019: Exome sequencing data from 8921 individuals - East London, Birmingham, Bradford

The updated files below list predicted loss of function (LoF) and functional (missense or inframe indel) variants in the current Genes & Health callset, released in September 2019. Variants were called in 5236 East London Genes & Health volunteers (Bangladeshi and Pakistani, with self-reported related parents), 2624 Bradford volunteers (Pakistani, mostly individuals who are autozygous by self-report or by DNA) and 1061 Birmingham volunteers (Pakistani, unselected). The Bradford and Birmingham samples are as described in Narasimhan et al., Science 2016, with additional Bradford samples added in 2017.

Data download

Data can be downloaded here.

An explanation of the column titles can be downloaded here.

The files also include a simple list of gene names where knockouts are observed. We can recall volunteers (with their consent) from Genes & Health and Born In Bradford, but unfortunately not from Birmingham.

We expect a substantial false positive rate (an estimated 15-20%) in knockout genotype calls. These false positives arise mostly from genome annotation issues rather than sequencing errors. Please manually review genotypes of interest.

Technical details

Reads were mapped with bwa-mem and variants were called with GATK HaplotypeCaller. We removed variant sites matching the following filter expressions:

SNPs: "QD < 2.0 || FS > 30 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"

Indels: "QD < 2.0 || FS > 30 || ReadPosRankSum < -20.0" 
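As an illustration only, the two expressions above can be read as the following checks. This is a minimal sketch, not the pipeline that produced the callset: the function name and the `ann` dictionary of site annotations are hypothetical, and treating a missing annotation as "does not fail" is an assumption.

```python
def fails_site_filter(ann, is_snp):
    """Return True if a site matches the filter expression quoted above.

    `ann` is a hypothetical dict of site annotations (QD, FS, MQ, ...).
    Assumption: a missing annotation never triggers a filter term.
    """
    def below(key, threshold):
        value = ann.get(key)
        return value is not None and value < threshold

    def above(key, threshold):
        value = ann.get(key)
        return value is not None and value > threshold

    if is_snp:
        # SNPs: QD < 2.0 || FS > 30 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0
        return (below("QD", 2.0) or above("FS", 30) or below("MQ", 40.0)
                or below("MQRankSum", -12.5) or below("ReadPosRankSum", -8.0))
    # Indels: QD < 2.0 || FS > 30 || ReadPosRankSum < -20.0
    return below("QD", 2.0) or above("FS", 30) or below("ReadPosRankSum", -20.0)


site = {"QD": 10.0, "FS": 5.0, "MQ": 60.0, "MQRankSum": 0.0, "ReadPosRankSum": -10.0}
snp_removed = fails_site_filter(site, is_snp=True)     # ReadPosRankSum is below -8.0
indel_removed = fails_site_filter(site, is_snp=False)  # -10.0 is not below -20.0
```

The same annotation values can therefore remove a SNP but keep an indel, because the indel expression uses a looser ReadPosRankSum threshold and has no MQ terms.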

These files contain detailed variant annotation (from Ensembl Variant Effect Predictor v95) as well as allele frequencies from ExAC/gnomAD and from the different cohorts (ELGH, BiB = Born in Bradford mothers, Birm = Birmingham). 

Note that the files come in pairs: two for high-confidence LoFs annotated by LOFTEE, and two for functional variants (missense variants or inframe indels). For each of these, there is one file labelled “before genotype filtering” and one “after genotype filtering”. The genotype filtering set genotypes with GQ<20 or allele balance p-value <0.01 to missing. Caution: this genotype-level filtering is not optimal (it is probably too strict on homozygotes in low-coverage regions) and will be improved at a later stage. We only provide the “after genotype filtering” files here. The other files can be requested, but we suggest using them with caution.

Please also note that some samples have been sequenced to a read depth of ~40X, while others have a read depth of only ~20X.
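The genotype-level rule can be sketched as follows. The text does not say how the allele balance p-value was computed, so this example assumes an exact two-sided binomial test of alt-read count against an expected fraction of 0.5, applied to heterozygous calls only. Both function names and the read-count arguments are hypothetical.

```python
from math import comb


def binom_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value: total probability of all outcomes
    that are no more likely than the observed count k."""
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


def filter_genotype(gt, gq, ref_reads, alt_reads):
    """Return the genotype, or None (missing) if GQ < 20 or the
    allele balance p-value is < 0.01."""
    if gq < 20:
        return None
    depth = ref_reads + alt_reads
    # Assumption: the allele balance test is applied to heterozygotes only.
    if gt == "0/1" and depth > 0:
        if binom_two_sided_p(alt_reads, depth) < 0.01:
            return None
    return gt


kept = filter_genotype("0/1", gq=50, ref_reads=10, alt_reads=10)     # balanced het
low_gq = filter_genotype("0/1", gq=10, ref_reads=10, alt_reads=10)   # fails GQ
skewed = filter_genotype("0/1", gq=99, ref_reads=28, alt_reads=2)    # fails balance
```

Under this reading, a call made from only a few reads is lost mainly through the GQ threshold, since low depth tends to give low genotype quality.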

In the “functional variants” files, each variant is listed only once. In the high-confidence LoF (HC_LoF) files, however, variants are listed multiple times, once for every transcript in which they are annotated as HC LoF by LOFTEE. Users wishing to remove this redundancy and pull out one transcript per gene can restrict to lines that have top.transcript==1; these correspond to variants that are high-confidence LoFs in all protein-coding transcripts for that gene, and that are not in the last exon or intron.
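A minimal sketch of the top.transcript restriction, using only the Python standard library. Only the `top.transcript` column name comes from the description above. The tab delimiter, the other column names and the example rows are assumptions made for illustration, so check the column explanation file for the real layout.

```python
import csv
import io


def top_transcript_rows(handle, delimiter="\t"):
    """Keep only the rows where top.transcript equals 1.

    Assumption: the file is delimited text with a header row and the
    flag is written as the literal string "1".
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [row for row in reader if row["top.transcript"] == "1"]


# Made-up rows standing in for an HC_LoF file.
example = io.StringIO(
    "variant\tgene\ttranscript\ttop.transcript\n"
    "1:100:A:T\tGENE1\tTX1\t1\n"
    "1:100:A:T\tGENE1\tTX2\t0\n"
    "2:200:G:C\tGENE2\tTX3\t0\n"
)
rows = top_transcript_rows(example)
```

In this toy input the GENE2 variant disappears entirely, which matches the description: a variant that is not HC LoF in every protein-coding transcript of its gene has no row with top.transcript==1.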

