WES vs WGS: why the exome isn’t the whole story (and sometimes when it’s better)

By John Brunstein, PhD

In this month’s installment we’re going to revisit in a bit more depth a topic that’s been touched on in this space before: the differences between a whole genome sequence (WGS) and a whole exome sequence (WES). On the surface the differences are simple and explicit in the names. WGS provides the sequence of the genomic (nuclear) DNA from a sample, including all sorts of noncoding regions such as centromeres, telomeres, long repetitive stretches of “junk” DNA, and various untranscribed control regions which influence the activity of the actual genes. For a human, a whole genome is approximately 3.3 billion base pairs haploid, so 6.6 billion base pairs to capture the whole diploid complement per cell. The exome, by contrast, is just the portion of the genome represented in expressed RNAs (including both coding mRNAs and noncoding functional RNAs, which can be everything from rRNA ribosomal components to tRNAs essential for protein expression to things like miRNAs important for gene silencing and post-transcriptional regulation). The human exome is roughly 30 million base pairs in total size, or only about one percent of the genome.

Sequencing either a genome or an exome requires collecting a significant “overage” of data, or “sequencing depth.” This is done for two reasons: one is to improve accuracy (a single read may misrepresent a particular base pair, so a consensus of multiple reads over the same spot is more accurate), and the other is that building up full chromosome-length sequences from short reads requires “tiling,” or overlap between reads, so we can generate long contiguous sequences. Since the predominant next generation sequencing (NGS) technologies produce individual read lengths much shorter than many RNA transcripts, tiling is as much a requirement for WES as it is for WGS. Overall then, while there are a lot of nuances we won’t go into, and while both WGS and WES require a lot of data to be generated and processed by bioinformatic pipelines, a WES is to a first approximation 30-fold less data than a WGS (you’re excused for expecting that to be 100-fold, but WGS tends to be run at ~30x depth and WES at ~100x, to allow for capture of rare variants; more on that below). Obviously then, WES has one immediate advantage over WGS in that it’s faster and cheaper to obtain and analyze.

We generally think of doing some form of NGS in a clinical context as a means to try to uncover the root cause of a particular physical manifestation: a phenotype. We’ll ignore the inconvenient reality that some phenotypic behavior arises from complex polygenic traits and assume for simplicity that in this hypothetical example it’s a simple monogenic Mendelian cause. Cost and time factors aside, what are the pros and cons of using either a WGS or WES approach to tackling this?
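
The roughly 30-fold data difference quoted above can be checked with some back-of-envelope arithmetic. The sketch below is only an illustration: it uses the approximate sizes and typical depths given in the text, measures depth against the haploid genome size (an assumption on our part), and ignores real-world factors such as off-target reads and uneven capture. The function and variable names are made up for the example.

```python
# Back-of-envelope comparison of raw sequencing data for WGS vs WES.
# Sizes and depths are the approximate figures quoted in the text;
# all names here are illustrative, not from any sequencing toolkit.

HAPLOID_GENOME_BP = 3.3e9  # ~3.3 billion base pairs
EXOME_BP = 30e6            # ~30 million base pairs
WGS_DEPTH = 30             # typical WGS depth (~30x)
WES_DEPTH = 100            # typical WES depth (~100x)


def raw_bases(target_bp, depth):
    """Approximate bases sequenced: target size times mean depth."""
    return target_bp * depth


wgs_bases = raw_bases(HAPLOID_GENOME_BP, WGS_DEPTH)
wes_bases = raw_bases(EXOME_BP, WES_DEPTH)
exome_fraction = EXOME_BP / HAPLOID_GENOME_BP
fold_difference = wgs_bases / wes_bases

print(f"Exome as a fraction of the genome: {exome_fraction:.1%}")
print(f"WGS: {wgs_bases:.2e} bases, WES: {wes_bases:.2e} bases")
print(f"WGS/WES data ratio: about {fold_difference:.0f}-fold")
```

With these inputs the size ratio alone is about 110-fold, and the higher depth used for WES brings the data ratio down to roughly 33-fold, in line with the “30-fold” first approximation.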

Surprise #1: for complete exon coverage, WGS beats WES

Within protein coding sequences, mutations can in some cases be known pathogenic from other examples, or they may be novel but of readily apparent impact such as stop codons, significant insertions/deletions, or frame shifts. Even less readily interpretable amino acid substitutions may in some cases be scrutinized against known or computer predicted protein structures with a reasonable chance of spotting significantly disruptive changes (putting a proline in the middle of that critical α-helix probably isn’t a good thing)! While you might think that mutations in coding regions should be equally observable in both WES and WGS approaches, it’s been observed that that’s not quite true; in particular, GC-rich gene sequences appear more accurately captured by WGS than WES. WGS also scores better for completeness among preselected panels of disease relevant genes, where WES is reported to miss between 0.42 percent and a whopping 24.44 percent of exonic data as captured in a PCR-free WGS strategy. (For a more in-depth look at these numbers, see e.g. [1]). If complete coverage even just of exons is your goal, then WGS edges out WES.

Meaningful mutations can also occur outside of exons, in regulatory elements such as transcriptional promoters, enhancers, and suppressors thereby altering expression level and/or location. Similarly, mutations within introns can influence splice site selection and lead to inappropriate expression of particular splice variant isoforms of a gene which is otherwise expressed at an overall appropriate level. Since these by their very nature occur in non-transcribed sections of the genome (or at least not retained in mature transcripts), an immediate expectation might be that these will be captured in WGS and not in WES. Strictly speaking that’s true; a WGS data set will include all of these sorts of regions but a challenge comes when we try to interpret. Like with exons, in some cases there are very specific variations such as SNPs (single nucleotide polymorphisms) in non-exonic regions which have a known phenotypic impact (or lack thereof). As databases get filled with more and more example human genomes with clinical correlates, the library of known variations becomes bigger. At present, however, compared to the size of the human genome and the frequency with which variations from reference genomes are seen, this known library is small and in
