An Alu Insertion in F8 Intron 18 Causes Exon Skipping and Severe Hemophilia A
Can deep learning models detect that a 331 bp Alu insertion abolishes splicing — and can we decompose what they've learned?
An Alu retrotransposon insertion in Factor VIII intron 18 causes severe hemophilia A by triggering complete exon 19 skipping. This well-characterized case — with an identified insertion, a proposed mechanism, and a clinically confirmed splicing outcome — provides an ideal ground truth for asking what deep learning models like SpliceAI and AlphaGenome actually learn from sequence context.
331 Base Pairs That Broke Factor VIII
In 2003, Ganguly et al. described a patient with severe hemophilia A caused by a de novo Alu Yb9 insertion into intron 18 of the Factor VIII (F8) gene. The 331 bp insertion (318 bp Alu + 13 bp target site duplication) landed in antisense orientation, 19 bp upstream of exon 19. RT-PCR detected exon 19 skipping, and the authors proposed that the insertion displaces a putative branch point, which they inferred at position −85, beyond the range of lariat formation.
The Alu lands between the putative branch point (inferred at pos −85 by Ganguly et al.) and exon 19. If this branch point location is correct, the BP is pushed >300 bp from the acceptor AG — likely too far for lariat formation. The acceptor AG dinucleotide itself is intact; Ganguly et al. proposed that splicing fails due to branch point displacement, though this was not directly tested.
Reference genome
Patient genome — Alu Yb9 insertion
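The displacement argument is simple arithmetic, and a minimal sketch makes it explicit. The function name and the convention of expressing offsets as positive distances upstream of exon 19 are assumptions made for illustration. The values 85, 19, 200, and 331 come from the text above and from the position control described later in the notebook.

```python
def branch_point_distance(bp_offset: int, insert_offset: int, insert_len: int) -> int:
    """Distance (bp) from the putative branch point to the exon boundary after an insertion.

    Offsets are positive distances upstream of the exon's first base, so the
    branch point inferred at -85 is bp_offset=85 and an insertion at -19 is
    insert_offset=19. (Hypothetical helper, not part of either model.)
    """
    # An insertion between the branch point and the exon lengthens the gap.
    if insert_offset < bp_offset:
        return bp_offset + insert_len
    # An insertion upstream of the branch point leaves the gap unchanged.
    return bp_offset


disease = branch_point_distance(bp_offset=85, insert_offset=19, insert_len=331)
control = branch_point_distance(bp_offset=85, insert_offset=200, insert_len=331)
print(disease, control)  # 416 85
```

Under these assumptions the disease insertion moves the putative branch point to roughly 416 bp from the exon, while an insertion at −200 leaves it at 85 bp.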
Can SpliceAI and AlphaGenome predict that this insertion is pathogenic — and if so, do they understand why? In this notebook (and notebooks to follow), we dissect the models' internal logic to see if they have learned higher-order biological principles or if they are simply memorizing sequences or co-occurrences.
Experimental Design
We construct a panel of 11 synthetic variants, each inserting 331 bp (318 bp element + 13 bp target site duplication) at the same position (intron 18, position −19), and score them with SpliceAI and AlphaGenome. Controls are designed to isolate what the models respond to: GC content, Alu-specific motifs, orientation, and position.
Figure 2 · Variant Panel
| Variant | Purpose | Expected Effect |
|---|---|---|
| Wild type | Baseline — reference genome | Normal splicing |
| Alu Yb9 antisense (−19) | Disease-causing insertion | Exon 19 skipping |
| AT-rich insert | Length + low-GC control | Disruption |
| Random 56% GC | GC-matched to Alu (56%) | Disruption |
| Scrambled Alu (−19, antisense) ×5 | Motif-destruction, same composition (Payer et al. 2019) | Variable |
| Alu Yb9 sense (−19) | Orientation control | Also disrupts (less?) |
| Alu Yb9 antisense (−200) | Position control — past branch point | Normal splicing |
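A panel like this can be assembled with a few string operations. The sketch below is illustrative only: the helper names are hypothetical, the reference is a toy string, and the "element" is a random placeholder rather than the real Alu Yb9 sequence, which is not reproduced in this notebook text. It assumes the element is supplied in sense orientation, so the antisense variant is its reverse complement.

```python
import random

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Antisense orientation of a sense-strand sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def gc_fraction(seq: str) -> float:
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)


def shuffled(seq: str, rng: random.Random) -> str:
    """Scramble control: same length and base composition, motifs destroyed."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)


def random_with_gc(length: int, gc: float, rng: random.Random) -> str:
    """Random control with a fixed GC fraction (rounded to whole bases)."""
    n_gc = round(length * gc)
    bases = [rng.choice("GC") for _ in range(n_gc)]
    bases += [rng.choice("AT") for _ in range(length - n_gc)]
    rng.shuffle(bases)
    return "".join(bases)


def insert_upstream(ref: str, exon_start: int, offset: int, insert: str) -> str:
    """Insert so that `offset` reference bases remain between insert and exon."""
    cut = exon_start - offset
    return ref[:cut] + insert + ref[cut:]


rng = random.Random(0)
ref = "A" * 400 + "G" * 50                  # toy reference; exon starts at index 400
element = random_with_gc(331, 0.56, rng)    # placeholder, NOT the real Alu Yb9
panel = {
    "antisense@-19": insert_upstream(ref, 400, 19, reverse_complement(element)),
    "sense@-19": insert_upstream(ref, 400, 19, element),
    "scrambled@-19": insert_upstream(ref, 400, 19, shuffled(reverse_complement(element), rng)),
    "antisense@-200": insert_upstream(ref, 400, 200, reverse_complement(element)),
}
print({name: len(seq) for name, seq in panel.items()})
```

Each variant would then be passed to the scoring models; that step is omitted here because it depends on the specific model interfaces.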
If models capture the biology, exon 19 acceptor scores should follow:
WT > Alu@−200 ≫ { any 331 bp insertion at −19 } ≫ Alu antisense
The relative ordering within the middle group is less certain — it depends on whether inserted sequences accidentally recreate compensating features like branch points within the insert that could partially rescue lariat formation.
Results: Both Models Predict the Pathogenic Effect
Both SpliceAI and AlphaGenome correctly predict near-complete destruction of the exon 19 acceptor when the Alu is inserted at position −19. The disease variant ranks dead last (or nearly so) in both models, while the wild type scores highest. The position control at −200 (past the branch point) is predicted as benign.
Figure 3 · Exon 19 Acceptor — Prediction Hierarchy
Acceptor scores for 11 variants at the exon 19 splice site. Each variant inserts 331 bp at position −19 in intron 18 (except WT and the −100 and −200 position controls). Both models rank the disease-causing Alu Yb9 at or near the bottom. The −200 control scores nearly identically to WT, consistent with the models being sensitive to where the insertion falls relative to the putative branch point. Bars are colored by variant class: green = WT, red = disease, amber = position controls (POS CTRL), gray = other controls.
SpliceAI (10kb context)
AlphaGenome (16kb context)
Both models rank the disease-causing Alu insertion at or near the bottom of all 11 variants. The predicted hierarchy closely matches biological expectations: any 331 bp insertion at position −19 causes severe disruption, and the actual Alu sequence is the most damaging.
Did models learn exon definition?
While acceptor loss is the expected direct consequence of the insertion, the most revealing result is the simultaneous silencing of the exon 19 donor. Despite the donor motif and its immediate context remaining identical to wild type, its score collapses from 0.87 to 0.06 (SpliceAI) and to 0.03 (AlphaGenome). If the models evaluated splice sites independently, the donor — which lies over 130 bp downstream — should remain active (leading to intron retention). The fact that it does not strongly suggests exon definition (Robberson et al. 1990; Berget 1995): the models appear to recognize internal exons as discrete units, requiring both flanking splice sites for inclusion. Disrupting the acceptor breaks the definition of the exon unit, effectively silencing the downstream donor.
Figure 4 · Exon Definition — Schematic
The donor GT dinucleotide and its flanking sequence are identical between WT and disease. The donor falls from 0.87 to 0.06 (SpliceAI) and from 0.87 to 0.03 (AlphaGenome).
Panels: normal splicing (WT); with the 331 bp Alu insertion (disease), the acceptor is abolished, the donor is silenced, and exon 19 is skipped.
Ganguly et al.'s observation of complete exon 19 skipping is consistent with exon definition. Whether intron retention also occurs at some level is harder to rule out: their nested PCR design favored shorter amplicons, and any retained-intron transcript would likely carry a premature termination codon from the Alu insert, making it a target for nonsense-mediated decay — in which case the only observable consequence would be reduced overall expression rather than a detectable band (see Q2). What we can say is that the dominant splicing outcome matches what exon definition predicts, and both models recapitulate it.
But when we look across the full variant panel, as shown in Figure 5 below, the picture becomes more complicated.
Figure 5 · Four Splice Sites Across All Variants
Predictions for the four splice sites flanking exon 19. Cell color encodes score magnitude. The outer flanking sites (E18 donor, E20 acceptor) remain stable across all variants. At exon 19, insertions at position −19 destroy the acceptor, whereas position controls (at −200 and −100) preserve it. The E19 donor score generally tracks with the acceptor (exon definition), except where cryptic sites within the insert restore the splicing path (e.g., Scrambled Alu 1).
A · SpliceAI (10 kb context)
| Variant | E19 acceptor | E19 donor | E18 donor | E20 acceptor |
|---|---|---|---|---|
B · AlphaGenome (16 kb context)
| Variant | E19 acceptor | E19 donor | E18 donor | E20 acceptor |
|---|---|---|---|---|
New acceptors restore donor recognition
While the silencing of the exon 19 donor generally parallels exon 19 acceptor loss, consistent with exon definition, the variant panel reveals an exception. In the Scrambled Alu 1 variant, both models score the acceptor at 0.00 yet predict a donor score comparable to wild type (0.89). By contrast, the intact Alu insertion silences the donor completely. Despite identical acceptor disruption, the two variants yield opposite donor predictions, suggesting that specific sequence features within the scrambled insert are actively modulating the splicing landscape.
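One way to make this contrast systematic is to label each variant by which of the two exon 19 sites stays above a cut-off. The sketch below is an assumption-laden illustration: the function and label names are hypothetical, and the 0.3 cut-off is borrowed from the donor-retention threshold used in the shuffle analysis of this notebook. The intact Alu's acceptor is entered as 0.00 because the text describes the acceptor disruption as identical in both variants.

```python
def classify_exon_call(acceptor: float, donor: float, cutoff: float = 0.3) -> str:
    """Label one variant by which exon 19 splice sites score above `cutoff`."""
    acceptor_on = acceptor > cutoff
    donor_on = donor > cutoff
    if acceptor_on and donor_on:
        return "exon_defined"
    if not acceptor_on and not donor_on:
        return "exon_skipped"
    if donor_on:
        # Donor active without its annotated acceptor: look for a new partner.
        return "orphan_donor"
    return "orphan_acceptor"


# Scores quoted in the text above (donor 0.89 vs 0.06, acceptor 0.00 in both).
calls = {
    "Scrambled Alu 1": classify_exon_call(acceptor=0.00, donor=0.89),
    "Alu Yb9 antisense": classify_exon_call(acceptor=0.00, donor=0.06),
}
print(calls)
```

Variants that land in the `orphan_donor` class are the ones worth scanning for a cryptic acceptor inside the insert.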
Scrambled Alu 1 contains a functional acceptor
Scanning the Scrambled Alu 1 insert revealed that the shuffled sequence had accidentally created an acceptor site recognized by both models.
| Position | Offset | SpliceAI | AlphaGenome | Potential exon with E19 donor |
|---|---|---|---|---|
| 12540 (AG) | +188 bp | 0.60 | 0.72 | 265 bp |
| 12363 (AG) | +11 bp | 0.74 | 0.06 | 442 bp |
Offset is measured from the start of the inserted sequence. Both models score the AG dinucleotide at +188 bp as an acceptor (SpliceAI 0.60, AlphaGenome 0.72), predicting a 265 bp exon between this site and the E19 donor. A second AG at +11 bp scores highly in SpliceAI (0.74) but not in AlphaGenome (0.06), indicating different learned sensitivities to local sequence context. In either case, the predicted acceptor explains the maintained E19 donor score: the models recognize a new upstream acceptor partner, consistent with exon definition logic.
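The bookkeeping in the table can be reproduced with a short sketch. Two coordinates are assumptions inferred from the table's own arithmetic rather than stated in the notebook: an insert start at 12352 (since 12540 − 188 = 12363 − 11) and an E19 donor coordinate of 12805 (since 12540 + 265 = 12363 + 442). The 0.5 cut-off for calling a site is also an illustrative choice, not a threshold taken from the analysis.

```python
from dataclasses import dataclass

INSERT_START = 12352  # assumption: 12540 - 188 and 12363 - 11 both give 12352
E19_DONOR = 12805     # assumption: 12540 + 265 and 12363 + 442 both give 12805


@dataclass(frozen=True)
class CandidateAcceptor:
    position: int        # coordinate of the AG, as listed in the table
    spliceai: float
    alphagenome: float

    @property
    def offset(self) -> int:
        """Distance from the start of the inserted sequence."""
        return self.position - INSERT_START

    @property
    def exon_length(self) -> int:
        """Length of the exon this acceptor would form with the E19 donor."""
        return E19_DONOR - self.position

    def called_by_both(self, cutoff: float = 0.5) -> bool:
        return self.spliceai >= cutoff and self.alphagenome >= cutoff


candidates = [
    CandidateAcceptor(12540, 0.60, 0.72),
    CandidateAcceptor(12363, 0.74, 0.06),
]
for c in candidates:
    print(c.position, c.offset, c.exon_length, c.called_by_both())
```

With these inputs only the site at +188 bp is supported by both models, matching the description above.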
Most scrambles rescue the donor — the real Alu is the outlier
Scrambled Alu 1 isn't the only scramble with an elevated donor. To quantify this, we expanded the simulation to generate 5,000 independent random shuffles of the antisense Alu insert sequence (preserving length and base composition) and scored each with both SpliceAI and AlphaGenome.
| Metric (E19 donor) | SpliceAI | AlphaGenome |
|---|---|---|
| Mean score | 0.367 | 0.405 |
| Median score | 0.350 | 0.363 |
| Retained (donor score >0.3) | 3,149/5,000 (63.0%) | 3,042/5,000 (60.8%) |
| Both models retained | 2,558/5,000 (51.2%) | |
| Alu antisense score | 0.061 | 0.030 |
| Alu antisense percentile | 0.02% (1 lower) | 0.00% (0 lower) |
The real Alu antisense insertion scores below almost every randomized permutation (Figure 6). In AlphaGenome, it sits at the absolute bottom (0/5,000 lower); in SpliceAI, only a single random shuffle out of 5,000 managed to suppress the donor more effectively.
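The percentile and retention figures in the table follow from a standard empirical-null calculation, sketched below. The function names are hypothetical, and `toy_score` is a deliberately meaningless stand-in for a model call (it just counts AG dinucleotides); it is not an approximation of SpliceAI or AlphaGenome.

```python
import random
from typing import Callable, List, Sequence


def shuffle_null(insert: str, score: Callable[[str], float], n: int, seed: int = 0) -> List[float]:
    """Score `n` composition-preserving shuffles of `insert`."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n):
        bases = list(insert)
        rng.shuffle(bases)
        scores.append(score("".join(bases)))
    return scores


def percent_lower(null_scores: Sequence[float], observed: float) -> float:
    """Percentage of null scores strictly below the observed score."""
    lower = sum(s < observed for s in null_scores)
    return 100.0 * lower / len(null_scores)


def fraction_retained(null_scores: Sequence[float], cutoff: float = 0.3) -> float:
    """Fraction of shuffles whose score exceeds the retention cut-off."""
    return sum(s > cutoff for s in null_scores) / len(null_scores)


def toy_score(seq: str) -> float:
    # Stand-in scorer for illustration only; replace with a real model call.
    return seq.count("AG") / len(seq)


null = shuffle_null("ACGT" * 80, toy_score, n=200)
print(len(null), percent_lower(null, toy_score("ACGT" * 80)))

# The reported SpliceAI percentile: 1 of 5,000 shuffles scored lower.
print(100.0 * 1 / 5000)  # 0.02
```

Counting strictly lower scores matches the "(1 lower)" and "(0 lower)" annotations in the table.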
Figure 6 · Scrambled Alu Score Distributions (N=5,000)
Distribution of E19 acceptor (left column) and donor (right column) scores across 5,000 random shuffles of the antisense Alu insert for SpliceAI (top) and AlphaGenome (bottom). Left: Most scrambles destroy the acceptor (median < 0.06), but a long right tail represents scrambles that create strong acceptors. Right: Donor scores are broadly distributed, with the majority retaining significant activity. The real Alu insertion (red dashed line) is a hyper-repressive outlier, scoring far below the null distribution in both models. WT reference scores shown in green.
This variability aligns with Payer et al. (2019), who previously reported inconsistent splicing effects when using scrambled Alus as experimental controls. Our N=5,000 simulation highlights the scale of this challenge: random Alu-composition DNA frequently generates acceptor motifs (as seen in Scrambled Alu 1) that restore exon definition.
The real Alu falls at the 0th percentile of its own compositional null distribution — not merely lacking splice sites by chance, but scoring well below what random sequence of identical composition produces. Both models assign minimal splice scores to the intact element while readily predicting splice sites in randomized permutations of the same sequence. This differential is consistent with the models having learned sequence features associated with Alu-mediated splicing repression, though the specific learned features driving these predictions remain to be identified.
Takeaways
1. Insertion position is the primary determinant of predicted pathogenicity. The Alu placed at −200 is scored as benign, while the same insertion at −19 is scored as pathogenic. This suggests the models have learned spatial constraints consistent with branch point displacement.
2. The models have learned exon definition. The donor score collapses when the acceptor is disrupted, confirming cooperative exon-level logic. The one apparent counterexample (Scrambled Alu 1, donor = 0.89 despite acceptor = 0.00) turned out to be confirmation: the shuffled sequence created a strong cryptic acceptor inside the insert, giving the donor a new partner.
3. The models appear to recognize a signature of evolutionary silencing. Most scrambled controls (60%+) retain the E19 donor, in at least some cases because the shuffle creates a cryptic acceptor. The intact Alu, despite having the same base composition, sits at the 0th percentile of this distribution (splice-silent). This suggests the models are not just scoring splice-site motifs but may have learned sequence features associated with the repression that masks Alu elements from the splicing machinery.
The Next Challenge
This case establishes a baseline: deep learning models can detect well-characterized disruptions and recapitulate complex logic like exon definition. The next challenge is to test whether sequence-only models can account for 3D RNA structure (e.g., dsRNA formation in inverted repeats), or whether structure is a missing factor that will require model improvements to capture cases where structural dynamics, rather than sequence motifs alone, drive splicing outcomes.
Open Questions & Upcoming Analysis
Can insertion mutagenesis be used to map branch points?
Our finding that the intact Alu is uniquely splice-silent suggests it may be the ideal tool for this task. Unlike random sequences (which we showed create confounding cryptic sites), the Alu acts as a pure "displacer"—pushing the branch point away without modifying the local motif landscape. While tools like LaBranchoR rely on single-nucleotide changes, a systematic positional walk with this "silent spacer" could precisely map the effective branch point distance tolerance encoded in each model—testing the spatial limits of lariat formation without the noise of de novo sites.
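A positional walk of this kind could be generated as sketched below. Everything here is hypothetical scaffolding: the function name, the toy reference, the `N`-filled spacer standing in for the Alu, and the 10 bp step are illustrative assumptions, and the model-scoring step is left out because it depends on each model's interface.

```python
from typing import Iterable, Iterator, Tuple


def positional_walk(
    ref: str, exon_start: int, spacer: str, offsets: Iterable[int]
) -> Iterator[Tuple[int, str]]:
    """Yield (offset, variant) pairs with `spacer` inserted `offset` bp upstream of the exon."""
    for offset in offsets:
        cut = exon_start - offset
        if not 0 <= cut <= exon_start:
            raise ValueError(f"offset {offset} falls outside the upstream sequence")
        yield offset, ref[:cut] + spacer + ref[cut:]


ref = "ACGT" * 75 + "G" * 100        # toy reference; exon starts at index 300
spacer = "N" * 331                   # placeholder for the splice-silent element
variants = dict(positional_walk(ref, 300, spacer, range(10, 210, 10)))

# Each variant would be scored at the exon's acceptor; plotting score against
# offset would then show where the predicted acceptor recovers.
print(sorted(variants)[:3], len(variants))
```

The offset at which the acceptor score returns to wild-type levels would give an estimate of the distance tolerance each model has encoded.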
Why exon skipping and not intron retention?
The models predict that destroying the acceptor also silences the downstream donor, leading to exon skipping. This aligns with the exon definition model. However, experimentally verifying this specific mechanism is challenging: the original nested PCR assay favored shorter products (skipping), and any retained-intron transcript would likely be degraded by nonsense-mediated decay (NMD).
An important open question is whether this outcome is universal. Do these models always predict skipping when an acceptor is destroyed, or can they identify specific genomic contexts (e.g., short introns, specific motif environments) where intron retention becomes the dominant failure mode?
Coming in future notebooks
The next phase of RetroSpect will test cases where the biology is subtler. Planned analyses include:
- Generalizing the Mechanism: Is the model's success specific to F8 intron 18? We will analyze additional pathogenic Alu insertions—in F8 and other clinically relevant genes.
- Decomposing the Silencing Signal: Do the models recognize the global Alu architecture, or just specific motifs? We will map the minimal unit of repression by testing isolated sub-elements (e.g., poly-U tracts, left/right arms) alongside the rare "super-silencer" scrambles to identify features that function independently of the full element.
- Alu Subfamily Divergence: Do the models distinguish between active threats and ancient fossils? We will compare the silencing efficiency of the modern AluY family against older, diverged subfamilies (AluS, AluJ) to test if the predicted repression scales with evolutionary age.
Discussion
This notebook is intended as a living document. Suggestions, alternative interpretations, and ideas for follow-up experiments are welcome.