Repeat Expansion Detection with ExpansionHunter

Short tandem repeats (STRs) are regions of the genome consisting of repetitions of short DNA segments called repeat units. STRs can expand beyond their normal length range; these mutations are called repeat expansions. Repeat expansions are responsible for many diseases, including Fragile X syndrome, amyotrophic lateral sclerosis, and Huntington's disease.

DRAGEN includes a repeat expansion detection method called ExpansionHunter. ExpansionHunter performs sequence-graph-based realignment of reads that originate inside and around each target repeat. It then genotypes the length of the repeat in each allele based on these graph alignments. More information and analysis are available in the following ExpansionHunter papers:

• ExpansionHunter (http://www.genome.org/cgi/doi/10.1101/gr.225672.117)
• Graph ExpansionHunter (https://doi.org/10.1101/572545)

These methods work only for human whole-genome samples prepared with PCR-free libraries. A repeat is genotyped only if the coverage at its locus is at least 10x.

Repeat Expansion Detection Options

To enable DRAGEN repeat expansion detection, the following command-line options are required.

• --repeat-genotype-enable=true
• --repeat-genotype-specs=<path to spec file>

In addition, the sex of the sample should be set using the --sample-sex option.

The following options are optional.

• --repeat-genotype-region-extension-length=<length of region around repeat to examine> (default 1000 bp)
• --repeat-genotype-min-baseq=<minimum base quality for ‘high confidence’ bases> (default 20)

For more information on the spec file specified by the --repeat-genotype-specs option, see Repeat Expansion Specification Files.
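When DRAGEN is driven from a pipeline script, the options above can be assembled programmatically. The following is a minimal sketch only: the helper name build_repeat_options and the example spec path are hypothetical, the accepted values for --sample-sex are not listed in this section (the value "female" is an assumption and is passed through unchanged), and all other DRAGEN arguments such as the reference and input files are deliberately left out.

```python
def build_repeat_options(spec_path, sample_sex,
                         region_extension_length=1000, min_baseq=20):
    """Return the repeat-genotyping options as a list of CLI arguments.

    Hypothetical helper: it only formats the options documented in this
    section and does not run DRAGEN.
    """
    return [
        "--repeat-genotype-enable=true",
        f"--repeat-genotype-specs={spec_path}",
        # Assumption: --sample-sex accepts the same option=value form.
        f"--sample-sex={sample_sex}",
        f"--repeat-genotype-region-extension-length={region_extension_length}",
        f"--repeat-genotype-min-baseq={min_baseq}",
    ]


# Placeholder path and an assumed sex value, for illustration only.
options = build_repeat_options("/path/to/repeat-specs", "female")
print(" ".join(options))
```

Leaving the two optional parameters at their defaults reproduces the documented default behavior (1000 bp region extension, minimum base quality 20).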

The main output of repeat expansion detection is a VCF file, containing the variants found via this analysis.

Repeat Expansion Detection Output Files

VCF Output File

The results of repeat genotyping are output as a separate VCF file, giving the length of each allele at each callable repeat defined in the repeat-specification catalog file. The name is <outputPrefix>.repeats.vcf (.gz).

The VCF output file begins with the following fields.

Core VCF Fields
Field	Description
CHROM	Chromosome identifier
POS	Position of the first base before the repeat region in the reference
ID	Always .
REF	The reference base at position POS
ALT	List of repeat alleles in format <STRn> where n is the number of repeat units
QUAL	Always .
FILTER	LowDepth filter is applied when the overall locus depth is below 10x or the number of reads that span one or both breakends is below 5.

Additional INFO Fields
Field	Description
SVTYPE	Always STR
END	Position of the last base of the repeat region in the reference
REF	Number of repeat units spanned by the repeat in the reference
RL	Reference length in bp
RU	Repeat unit in the reference orientation
REPID	Repeat id from the repeat-specification file

GENOTYPE (Per Sample) Fields
Field	Description
GT	Genotype
SO	Type of reads that support the allele; can be SPANNING, FLANKING, or INREPEAT meaning that the reads span, flank, or are fully contained in the repeat
CN	Number of repeat units spanned by each allele
CI	Confidence interval for the called repeat length of each allele
AD_SP	Number of spanning reads consistent with the allele
AD_FL	Number of flanking reads consistent with the allele
AD_IR	Number of in-repeat reads consistent with the allele

For example, the following VCF entry describes the state of the C9orf72 repeat in a sample with ID LP6005616-DNA_A03.

#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	LP6005616-DNA_A03
chr9	27573526	.	C	<STR2>,<STR349>	.	PASS	SVTYPE=STR;END=27573544;REF=3;RL=18;RU=GGCCCC;REPID=ALS	GT:SO:CN:CI:AD_SP:AD_FL:AD_IR	1/2:SPANNING/INREPEAT:2/349:2-2/323-376:19/0:3/6:0/459

In this example, the first allele spans 2 repeat units while the second allele spans 349 repeat units. The repeat unit is GGCCCC (RU INFO field), so the sequence of the first allele is GGCCCCGGCCCC and the sequence of the second allele is GGCCCC x 349. The repeat spans three repeat units in the reference (REF INFO field).

The length of the short allele was estimated from spanning reads (SPANNING), while the length of the expanded allele was estimated from in-repeat reads (INREPEAT). The confidence interval for the size of the expanded allele is (323,376). There are 19 spanning and 3 flanking reads consistent with the repeat allele of size 2 (that is, 19 reads fully contain the repeat of size 2 and 3 flanking reads overlap at most 2 repeat units). Also, there are 6 flanking and 459 in-repeat reads consistent with the repeat allele of size 349.
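The interpretation above can be reproduced by splitting the example record into its fields. The sketch below is illustrative rather than a reference parser: the function name parse_repeat_record and the output dictionary keys are hypothetical, and it assumes a single-sample record in which every per-sample field other than GT holds one "/"-separated value per allele, as in the example.

```python
# The example record from above, written as one tab-separated line.
RECORD = (
    "chr9\t27573526\t.\tC\t<STR2>,<STR349>\t.\tPASS\t"
    "SVTYPE=STR;END=27573544;REF=3;RL=18;RU=GGCCCC;REPID=ALS\t"
    "GT:SO:CN:CI:AD_SP:AD_FL:AD_IR\t"
    "1/2:SPANNING/INREPEAT:2/349:2-2/323-376:19/0:3/6:0/459"
)


def parse_repeat_record(line):
    """Split one repeat VCF record into INFO and per-allele genotype fields."""
    cols = line.rstrip("\n").split("\t")
    # INFO is a ';'-separated list of KEY=VALUE pairs.
    info = dict(item.split("=", 1) for item in cols[7].split(";"))
    # FORMAT keys and sample values are ':'-separated and pair up by position.
    sample = dict(zip(cols[8].split(":"), cols[9].split(":")))
    # Assumption: each field except GT carries one value per allele.
    per_allele = {k: v.split("/") for k, v in sample.items() if k != "GT"}
    alleles = []
    for i, units in enumerate(per_allele["CN"]):
        low, high = per_allele["CI"][i].split("-")
        alleles.append({
            "repeat_units": int(units),
            # Allele length in bp = repeat units x repeat unit length.
            "length_bp": int(units) * len(info["RU"]),
            "support": per_allele["SO"][i],
            "ci": (int(low), int(high)),
            "spanning": int(per_allele["AD_SP"][i]),
            "flanking": int(per_allele["AD_FL"][i]),
            "in_repeat": int(per_allele["AD_IR"][i]),
        })
    return info, alleles


info, alleles = parse_repeat_record(RECORD)
for allele in alleles:
    print(allele)
```

For production use, an established VCF library is likely a better choice than manual string splitting; the sketch only makes the field layout explicit.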

Additional Output Files

The sequence-graph alignments of reads in the targeted repeat regions are output in a BAM file. You can use the specialized GraphAlignmentViewer tool (github.com/Illumina/GraphAlignmentViewer/) to visualize the alignments. Programs like Integrative Genomics Viewer (IGV) are not designed for displaying graph-aligned reads and cannot visualize these BAMs.

The BAMs store graph alignments in custom XG tags using the format <LocusName>,<StartPosition>,<GraphCIGAR>.

• LocusName: A locus identifier that matches the corresponding entry in the repeat expansion specification file.
• StartPosition: The starting alignment position of a read on the first graph node.
• GraphCIGAR: The alignment of a read against the graph starting from that position. GraphCIGAR consists of a sequence of graph node identifiers and linear CIGARs describing the alignment of the read to each node.
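The XG tag layout described above can be unpacked with a few lines of code. This is a sketch under stated assumptions: the exact textual syntax of GraphCIGAR is not given in this section, so the code assumes each node is written as a numeric node identifier followed by its linear CIGAR in square brackets (for example 0[38M]1[6M]). The function name parse_xg_tag and the sample tag value are made up for illustration.

```python
import re

# Assumed GraphCIGAR syntax: <node id>[<linear CIGAR>], repeated per node.
NODE_RE = re.compile(r"(\d+)\[([^\]]*)\]")


def parse_xg_tag(value):
    """Parse '<LocusName>,<StartPosition>,<GraphCIGAR>' into its parts."""
    # Split from the right so a locus name containing commas stays intact.
    locus, start, graph_cigar = value.rsplit(",", 2)
    # Each match yields (node identifier, linear CIGAR on that node).
    nodes = [(int(node), cigar) for node, cigar in NODE_RE.findall(graph_cigar)]
    return locus, int(start), nodes


# Hypothetical tag value, not taken from real DRAGEN output.
locus, start, nodes = parse_xg_tag("C9ORF72,12,0[38M]1[6M]1[6M]2[50M]")
print(locus, start, nodes)
```

In this made-up example the read starts at position 12 on node 0, passes through node 1 twice, and ends on node 2.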

Quality scores in the BAM file are binary. High-scoring bases are assigned a score of 40, and low-scoring bases are assigned a score of 0.