RetroSpect · Case Study 02

In silico Alu Mutagenesis Pinpoints the F8 Intron 18 Branch Point

24 Alu insertions scored by SpliceAI and AlphaGenome place the branch point at −27, consistent with exon 19 skipping in three unrelated patients

Author: Ida Shinder
Published: February 2026
Gene: F8 · chrX · strand −
Transcript: NM_000132

A 331 bp Alu retrotransposon insertion in intron 18 of the Factor VIII gene causes severe hemophilia A by triggering complete exon 19 skipping. The original report (Ganguly et al. 2003) proposed that the insertion disrupts a branch point at position −85, but that location was inferred, not experimentally mapped. In a companion study, we showed that this Alu is uniquely splice-silent — it creates zero cryptic splice sites, acting as a pure spacer. That property makes it an ideal positional probe: by inserting the same element at varying distances from the exon and asking two deep learning models where splicing breaks, we can map the true branch point with single-nucleotide resolution.

Background — The Assumed Branch Point

In a companion study, we showed that both SpliceAI and AlphaGenome correctly predict that a 331 bp Alu Yb9 insertion 19 bp upstream of exon 19 in the F8 gene abolishes splicing, consistent with the complete exon skipping and severe hemophilia A reported by Ganguly et al. (2003). That report proposed that the mechanism was displacement of a putative branch point at position −85, which would break lariat formation.

The companion study also showed that the intact Alu is uniquely splice-silent: it creates zero cryptic splice sites, while nearly half of 5,000 composition-matched shuffles do. The Alu acts as a pure spacer — adding distance between the 3′ splice site and the branch point without introducing competing signals. That property makes it an ideal tool for mapping the branch point: if we insert the same splice-silent element at varying distances from the exon, the position where splicing breaks should mark the branch point location.

What is a branch point?

The branch point is a conserved adenine within the intron whose 2′-OH group attacks the 5′ splice site, forming the lariat intermediate that initiates splicing. In humans, the branch point is typically 18–40 nt upstream of the 3′ splice site.

Why exon 19?

Wang et al. (2020) noted that the natural 3′ss of F8 exon 19 is inefficiently recognized; NNSPLICE and NetGene2 did not predict it at default cutoffs. We scored the splice sites of all 26 F8 exons with MaxEntScan (25 acceptors and 25 donors, since exon 1 has no acceptor and exon 26 has no donor). Exon 19 ranks 6th weakest of 25 for both its acceptor (7.10) and its donor (6.97), and both scores fall below the value of about 8 typically associated with strong splice sites (Yeo & Burge 2004). This leaves the exon potentially more dependent on proper branch point function than most.

Key question

If we insert the same Alu element at varying distances from the exon, can we locate the true branch point by finding the exact position where splicing transitions from disrupted to preserved?

Experimental Design — Position Scan

We inserted the identical Alu Yb9 element (antisense, disease orientation) at 24 positions across intron 18, ranging from −19 (the disease position) to −200. From −19 to −30, we used single-nucleotide resolution to precisely map the disruption boundary. Beyond −30, we used coarser spacing.
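The construction of the scan sequences can be sketched as follows. This is a minimal illustration, not the notebook's actual pipeline: the helper `insert_upstream`, the coordinate convention (an insertion at −19 leaves 19 intron bases between the inserted element and the exon), and all sequences are assumptions made for the example.

```python
# Toy sketch: insert one fixed element at varying distances upstream of an exon.
# All sequences are made-up placeholders, NOT the real F8 intron 18, exon 19, or Alu Yb9.

def insert_upstream(intron: str, exon: str, position: int, element: str) -> str:
    """Insert `element` so that `-position` intron bases remain between it and the exon.

    Assumed convention: position = -19 leaves 19 intron bases downstream of the insert.
    """
    if position >= 0:
        raise ValueError("position must be negative (intron coordinates)")
    k = -position
    if k > len(intron):
        raise ValueError("position lies outside the supplied intron fragment")
    cut = len(intron) - k
    return intron[:cut] + element + intron[cut:] + exon


intron = "ACGT" * 60   # 240 nt stand-in for the 3' end of the intron
exon = "GTGCAT"        # stand-in for the start of the exon
alu = "N" * 331        # stand-in for the 331 bp element

# Single-nucleotide steps from -19 to -30, as in the fine-resolution part of the scan
fine_scan = list(range(-19, -31, -1))
mutants = {pos: insert_upstream(intron, exon, pos, alu) for pos in fine_scan}

print(len(fine_scan), len(mutants[-19]))
```

Each mutant differs from the others only in where the same element sits, which is what lets a change in predicted splicing be attributed to position rather than to the inserted sequence.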

Each insertion was scored by two deep learning models:

  • SpliceAI — predicts the probability that each position is a splice acceptor or donor (0–1)
  • AlphaGenome — predicts splice site usage and junction-level PSI (percent spliced in, 0–1)
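For readers unfamiliar with PSI, the usual junction-based definition is inclusion support divided by inclusion plus skipping support. The sketch below uses that common definition with hypothetical junction values; how the notebook derives PSI from AlphaGenome's junction output is not described here, so the function name and its inputs are assumptions.

```python
from typing import Optional


def junction_psi(inc_upstream: float, inc_downstream: float, skip: float) -> Optional[float]:
    """PSI = I / (I + S), where I is the mean of the two inclusion junctions
    (upstream exon -> target exon, target exon -> downstream exon) and S is the
    skipping junction (upstream exon -> downstream exon)."""
    inclusion = (inc_upstream + inc_downstream) / 2
    total = inclusion + skip
    if total == 0:
        return None  # no junction support either way, so PSI is undefined
    return inclusion / total


# Hypothetical junction strengths, for illustration only
print(junction_psi(90, 86, 12))  # exon mostly included
print(junction_psi(0, 0, 40))    # exon fully skipped
print(junction_psi(0, 0, 0))     # undefined
```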

Table 1. Splice site scores for Alu insertion position scan

| Position | SpliceAI E19 acceptor | AlphaGenome E19 acceptor | AlphaGenome PSI | Interpretation |
|---|---|---|---|---|
| −19 | 0.0013 | 0.0024 | 0.0000 | Complete skipping (disease) |
| −20 through −24 | ... | ... | ... | All near-zero (complete skipping) |
| −25 | 0.0013 | 0.0049 | 0.0000 | Complete skipping |
| −26 | 0.0019 | 0.0049 | 0.0000 | Complete skipping |
| −27 | 0.0072 | 0.0093 | 0.0000 | Complete skipping |
| −28 | 0.0080 | 0.0147 | 0.0000 | Complete skipping |
| −29 | 0.0077 | 0.0122 | 0.0000 | Complete skipping |
| −30 | 0.5430 | 0.5117 | 0.8771 | Partial recovery |
| −40 | 0.7051 | 0.6543 | 0.9644 | Near-normal splicing |
| −85 | 0.7228 | 0.7500 | 0.9825 | Preserved (Ganguly BP) |
| −100 | 0.7215 | 0.7461 | 0.9832 | Preserved |
| −200 | 0.6811 | 0.7188 | 0.9844 | Preserved |

Key: disrupted (score < 0.1) · transition · preserved (score > 0.5)

Every insertion from −19 through −29 produces near-zero acceptor scores and zero PSI. But at position −30, splicing abruptly recovers: SpliceAI jumps from 0.008 to 0.543, AlphaGenome from 0.012 to 0.512, and PSI from 0.000 to 0.877.
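The boundary can be located mechanically from the rows listed in Table 1. The sketch below is one simple way to do it, assuming the table's own thresholds (disrupted below 0.1, preserved above 0.5); the function name is illustrative and the elided rows for −20 through −24 are omitted.

```python
# (position, SpliceAI acceptor, AlphaGenome acceptor, AlphaGenome PSI), copied from Table 1
scan = [
    (-19, 0.0013, 0.0024, 0.0000),
    (-25, 0.0013, 0.0049, 0.0000),
    (-26, 0.0019, 0.0049, 0.0000),
    (-27, 0.0072, 0.0093, 0.0000),
    (-28, 0.0080, 0.0147, 0.0000),
    (-29, 0.0077, 0.0122, 0.0000),
    (-30, 0.5430, 0.5117, 0.8771),
    (-40, 0.7051, 0.6543, 0.9644),
    (-85, 0.7228, 0.7500, 0.9825),
    (-100, 0.7215, 0.7461, 0.9832),
    (-200, 0.6811, 0.7188, 0.9844),
]


def find_boundaries(rows, low=0.1, high=0.5):
    """Return adjacent position pairs where every metric goes from < low to > high."""
    ordered = sorted(rows, key=lambda r: -r[0])  # nearest to the exon first
    hits = []
    for near, far in zip(ordered, ordered[1:]):
        if all(v < low for v in near[1:]) and all(v > high for v in far[1:]):
            hits.append((near[0], far[0]))
    return hits


print(find_boundaries(scan))  # [(-29, -30)]
```

Requiring all three metrics to cross together is a deliberately strict choice: a pair is reported only when both models and the PSI estimate agree.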

This abrupt transition, across a single base pair between −29 and −30, is the signature expected of a branch point. The branch point motif AAATAAT spans positions −32 to −26. An Alu inserted at −29 lands within the motif, splitting it and disrupting lariat formation, whereas an Alu inserted at −30 lands upstream of the core bases around the −27 adenine, leaving them intact. Both models agree independently, and the boundary is consistent with a branch point adenine at −27.

An Alu insertion at position −85, the putative branch point from Ganguly et al., shows no comparable disruption. PSI is 0.9825, essentially the wild-type value (0.9843), and the acceptor scores (SpliceAI 0.72, AlphaGenome 0.75) sit close to wild type (0.86 and 0.78). This argues against −85 as the functional branch point.

Figure 1 — E19 acceptor and donor scores vs Alu insertion position

Exon 19 acceptor scores (top) and donor scores (bottom) for each Alu insertion site. SpliceAI (blue) and AlphaGenome (red) both show an abrupt transition between positions −29 and −30. Green line marks the branch point adenine at −27; dashed purple line marks the putative branch point at −85 (Ganguly et al.). Dashed horizontal lines indicate wild-type scores.

[Figure 1 plot. Two stacked panels sharing the x-axis "Alu Insertion Position (relative to 3' end of intron 18)", −200 to −25. Top panel y-axis: E19 Acceptor Score (0–1); bottom panel y-axis: E19 Donor Score (0–1). Series: SpliceAI, AlphaGenome. Reference lines: WT SpliceAI (0.86), WT AlphaGenome (0.78), Actual BP (−27), Putative BP (−85, Ganguly et al.). Regions labeled "Downstream of BP" and "Upstream of BP".]

Figure 2 — Combined acceptor score and PSI vs insertion position

SpliceAI acceptor (blue circles), AlphaGenome acceptor (red squares), and AlphaGenome junction-based PSI (green diamonds) plotted together. All three metrics show the same sharp boundary between −29 and −30. PSI, which directly quantifies exon inclusion, jumps from 0.000 to 0.877 across a single base pair. Green line marks the branch point adenine at −27; dashed purple line marks the putative branch point at −85 (Ganguly et al.).

[Figure 2 plot. x-axis: Alu Insertion Position (relative to intron 3' end), −200 to −25; y-axis: Score / PSI (0–1). Series: SpliceAI E19 Acceptor, AlphaGenome E19 Acceptor, AlphaGenome PSI (junctions). Reference lines: Actual BP (−27), Putative BP (−85, Ganguly et al.). Regions labeled "BP displaced (splicing disrupted)" and "BP preserved (splicing intact)".]

Key Numbers

−29/−30: disruption boundary (1 bp)
−27: actual branch point
AAATAAT: branch point motif
0.88: PSI recovery at −30

Independent Confirmation — Wang et al. (2020)

Our in silico mapping of the branch point to position −27 is independently corroborated by a clinical genetics study. Wang et al. (2020) reported two unrelated patients with moderate hemophilia A (FVIII:C 1.2% and 2%) caused by the intronic variant c.5999-27A>G, a single nucleotide change at the exact position we identified as the branch point. Sanger sequencing confirmed exon 19 skipping, and SVM-BPfinder predicted branch point disruption. Wang et al. also found that four standard splice site prediction algorithms showed no impact. This variant class is therefore invisible to traditional splice site predictors, yet both deep learning models flag it, consistent with their having implicitly learned branch point requirements.

Clinical validation

Wang X, Hu Q, Tang N, Lu Y, Deng J. "Deep intronic F8 c.5999-27A>G variant causes exon 19 skipping and leads to moderate hemophilia A." Blood Coagul Fibrinolysis. 2020;31(7):476–480. PMID: 32833809

Three unrelated patients now demonstrate exon 19 skipping attributable to disruption of the branch point at −27, through two entirely different mutational mechanisms: a point mutation at −27 (Wang et al.) and an Alu insertion at −19 that displaces it (Ganguly et al.).

In Silico Validation — BP A>G Mutation

To further confirm position −27, we directly tested the branch point A>G point mutation (c.5999-27A>G) — the exact variant found in Wang et al.'s patients — with both models.

Branch point A>G mutation results

SpliceAI: E19 acceptor drops from 0.86 (WT) to 0.48

AlphaGenome: E19 acceptor drops from 0.78 (WT) to 0.36

AlphaGenome PSI: drops from 0.98 (WT) to 0.70 — moderate loss, consistent with the moderate hemophilia A phenotype

A single nucleotide change at −27 causes a larger splice disruption than Alu insertions at many distant positions (e.g., −150 or −200). This is strong evidence that −27 is the functional branch point: even a point mutation here substantially impairs splicing, while a 331 bp Alu insertion far upstream has no effect.

Figure 3 — Branch point A>G mutation vs Alu insertions

Acceptor scores (left) and PSI (right) for five conditions. The A>G point mutation (PSI 0.70) matches moderate hemophilia; the Alu at −19 (PSI 0.00) matches severe hemophilia — quantitative genotype-phenotype correspondence across two different mutational mechanisms.

[Figure 3 plot, titled "F8 Exon 19: Branch Point A>G Mutation vs Alu Insertions". Left panel "Acceptor Score": E19 Acceptor Score (0–1) for SpliceAI and AlphaGenome. Right panel "Junction-Based PSI": PSI (Exon 19 Inclusion) (0–1). Conditions on the x-axis: WT, BP A>G (c.5999-27A>G), Alu −27 (insertion), Alu −30 (insertion), Alu −19 (disease).]

Systematic Adenine Mutagenesis — Is −27 Really THE Branch Point?

The branch point motif at this locus is AAATAAT (positions −32 to −26), which contains five adenines. Is position −27 uniquely the functional branch point, or could another adenine in the motif serve as the lariat-forming nucleotide?
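Mapping the motif onto its stated coordinates makes the candidate adenines explicit. A minimal sketch, assuming only what the text gives (the motif AAATAAT starting at −32):

```python
motif = "AAATAAT"
start = -32  # coordinate of the first motif base, relative to the 3' splice site

# Assign each base its intron coordinate: -32, -31, ..., -26
positions = {start + offset: base for offset, base in enumerate(motif)}
adenines = [pos for pos, base in positions.items() if base == "A"]

print(positions)
print(adenines)  # [-32, -31, -30, -28, -27]
```

The only non-adenine positions are −29 and −26, both T, which is why the mutagenesis below skips them.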

We tested 11 point mutations in three groups:

  1. All three substitutions at −27 (A>G, A>C, A>T) — if −27 is the BP, all should disrupt splicing
  2. A>G at other motif adenines (−32, −31, −30, −28) — should not disrupt if −27 is the BP
  3. A>G outside the motif (−37, −36, −35, −34) — negative controls

Table 2. Adenine mutagenesis results

| Mutation | SpliceAI acceptor | AlphaGenome acceptor | AlphaGenome PSI |
|---|---|---|---|
| WT (reference) | 0.8586 | 0.7812 | 0.9843 |
| −27 A>G | 0.4820 | 0.3594 | 0.7000 |
| −27 A>C | 0.6616 | 0.6211 | 0.9684 |
| −27 A>T | 0.7041 | 0.7109 | 0.9776 |
| −32 A>G | 0.8448 | 0.7695 | 0.9837 |
| −31 A>G | 0.7762 | 0.7500 | 0.9797 |
| −30 A>G | 0.8547 | 0.7617 | 0.9836 |
| −28 A>G | 0.7934 | 0.7422 | 0.9826 |
| −37 A>G | 0.8444 | 0.7656 | 0.9837 |
| −36 A>G | 0.8257 | 0.7500 | 0.9834 |
| −35 A>G | 0.8337 | 0.7773 | 0.9842 |
| −34 A>G | 0.8414 | 0.7773 | 0.9840 |

WT reference: SpliceAI = 0.8586, AlphaGenome = 0.7812, PSI = 0.9843. Key: substitutions at −27 · other positions (minimal effect)

The results unambiguously identify position −27 as the branch point:

  • Position −27 is uniquely sensitive — all three substitutions cause disruption, with A>G being the most severe
  • A>G > A>C > A>T in disruption severity. Every ribonucleotide carries a 2′-OH, so the difference is unlikely to lie in the attacking hydroxyl itself; a more plausible reading is that the spliceosome recognizes and positions a guanine at the branch position least well, while pyrimidines (C, T) may retain partial capacity. That both models reproduce this severity ordering without explicit training on branch point biochemistry suggests they have learned genuine sequence-function relationships rather than merely memorized common variants
  • No other adenine in the motif is a branch point — A>G at positions −32, −31, −30, and −28 show less than 10% effect on both models
  • Outside controls confirm specificity — A>G mutations at −37, −36, −35, and −34 have negligible effects
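The size of each effect can be expressed as a relative drop from wild type, (WT − mutant) / WT, using the acceptor scores in Table 2. This is a small illustrative calculation, not part of the original analysis; the 10% cutoff is the one quoted in the list above.

```python
# Acceptor scores copied from Table 2: (SpliceAI, AlphaGenome)
WT = (0.8586, 0.7812)
MUTANTS = {
    "-27 A>G": (0.4820, 0.3594),
    "-27 A>C": (0.6616, 0.6211),
    "-27 A>T": (0.7041, 0.7109),
    "-32 A>G": (0.8448, 0.7695),
    "-31 A>G": (0.7762, 0.7500),
    "-30 A>G": (0.8547, 0.7617),
    "-28 A>G": (0.7934, 0.7422),
    "-37 A>G": (0.8444, 0.7656),
    "-36 A>G": (0.8257, 0.7500),
    "-35 A>G": (0.8337, 0.7773),
    "-34 A>G": (0.8414, 0.7773),
}


def relative_drop(wt: float, mutant: float) -> float:
    """Fraction of the wild-type score lost in the mutant."""
    return (wt - mutant) / wt


drops = {
    name: tuple(relative_drop(w, m) for w, m in zip(WT, scores))
    for name, scores in MUTANTS.items()
}

for name, (spliceai, alphagenome) in drops.items():
    print(f"{name:8s}  SpliceAI {spliceai:6.1%}  AlphaGenome {alphagenome:6.1%}")
```

On these numbers, every mutation away from −27 stays under 10% on both models (the largest is −31 A>G on SpliceAI, at about 9.6%), while −27 A>G loses roughly 44% (SpliceAI) and 54% (AlphaGenome).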

Figure 4 — Systematic adenine mutagenesis across the branch point motif

SpliceAI acceptor (top), AlphaGenome acceptor (middle), and PSI (bottom) for 19 variants. Red: substitutions at −27. Orange: other motif adenines. Gray: outside controls. Only −27 shows significant disruption.

[Figure 4 plot. Three panels: A. SpliceAI Acceptor (E19 Acceptor Score, 0–1; WT 0.859), B. AlphaGenome Acceptor (E19 Acceptor Score, 0–1; WT 0.781), C. AlphaGenome PSI (PSI, Exon 19 Inclusion, 0–1; WT 0.984). x-axis variants: A>G, A>C, and A>T at each of −27, −32, −31, −30, and −28, plus A>G at −37, −36, −35, and −34.]

The Branch Point Motif — GC Replacement

The branch point consensus is YNYYRAY (7 nt), where Y = pyrimidine, R = purine, N = any base, and A = branch point adenine. At F8 intron 18, positions −32 to −26, the sequence is AAATAAT, where the sixth base (the A at −27) is the branch point adenine.
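How closely a 7-mer follows the consensus can be checked position by position with standard IUPAC codes. The sketch below is illustrative only; the helper name `consensus_matches` is an assumption, and a simple match count is a much cruder measure than a trained branch point predictor.

```python
# Standard IUPAC nucleotide codes needed for the YNYYRAY consensus
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "R": "AG", "Y": "CT", "N": "ACGT"}


def consensus_matches(seq: str, consensus: str = "YNYYRAY") -> list:
    """Return one boolean per position: does the base satisfy the consensus code?"""
    if len(seq) != len(consensus):
        raise ValueError("sequence and consensus must have the same length")
    return [base in IUPAC[code] for base, code in zip(seq.upper(), consensus)]


flags = consensus_matches("AAATAAT")
print(flags)       # [False, True, False, True, True, True, True]
print(sum(flags))  # 5
```

For AAATAAT the two mismatches are the purines at the first and third positions; the branch adenine and its immediate neighbors match, so the motif is a partial rather than a perfect fit to the consensus.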

To test whether the entire motif is required (not just the branch point adenine), we replaced all 7 nucleotides with a GC-rich sequence:

Full motif replacement results

SpliceAI: E19 acceptor drops from 0.86 to 0.13 — the strongest acceptor score reduction of any point-level perturbation

AlphaGenome: E19 acceptor drops from 0.78 to 0.04 — almost fully abolished

AlphaGenome PSI: N/A — the E18→E19 inclusion junction was not predicted, indicating the normal 3′ss is non-functional

The GC replacement produces the strongest acceptor score reduction of any substitution tested (only the Alu insertions at −29 and closer to the exon score lower), consistent with complete branch point destruction. PSI is not computable because AlphaGenome's junction model does not predict the E18→E19 splice when the branch point motif is absent; the 3′ splice site is effectively non-functional.

However, AlphaGenome's junction predictions reveal that an alternative junction appears with an acceptor 39 bp upstream of the normal position — extending exon 19 into the intron. This alternative acceptor likely uses a different branch point further upstream. The E19 donor retains a relatively high score (SpliceAI: 0.57, AlphaGenome: 0.84) because it still has a paired acceptor via exon definition, just not the canonical one.

Is there a plausible branch point at this upstream position? The intron 18 sequence at approximately −66 contains AACTAAC, a strong match to the YNYYRAY consensus (6 of 7 positions; only the leading A departs from Y). A branch point near −66 would sit roughly 27 nt upstream of the alternative acceptor, which lies 39 bp upstream of the canonical one, and so falls inside the typical 18–40 nt window. This suggests the model may have learned that when the primary branch point is destroyed, a secondary consensus-like sequence further upstream can serve as a backup. Whether this alternative branch point is genuinely functional or a model artifact remains an open question.

Figure 5 — Full perturbation comparison including GC replacement

All perturbation types compared: WT, branch point A>G, GC motif replacement, and Alu insertions. GC replacement produces the strongest acceptor reduction of the substitutions, but PSI is N/A (E18→E19 junction not predicted).

[Figure 5 plot. Three panels: A. E19 Acceptor Score and B. E19 Donor Score (y-axis: Score, 0–1; series: SpliceAI, AlphaGenome), and C. Junction-Based PSI (y-axis: PSI, Exon 19 Inclusion, 0–1). Conditions on the x-axis: WT, BP A>G (point mut), BP motif → GC (7 nt), Alu −30 (insertion), Alu −19 (disease). Panel C shows N/A for the GC replacement (E18→E19 junction not found).]

Memorized or understood?

Neither SpliceAI nor AlphaGenome has explicit branch point modules. They learn the branch point location implicitly from sequence patterns that surround functional splice acceptors in their training data. The sharp −29/−30 boundary demonstrates that both models have learned precise spatial requirements for lariat formation.

Could this be memorization? A model that merely memorized common splice patterns might get the rough distance right, since branch points 18–40 nt upstream are common in training data. But our experimental design makes pure memorization insufficient: we insert a 331 bp Alu element at positions where no such insertion exists in the genome, creating sequences with no natural analog, so a memorizing model would have no template to match. That both models independently recover the same boundary, and reproduce the A>G > A>C > A>T severity ordering at the branch point adenine, argues they have learned the underlying sequence-function relationship, not just statistical regularities.

Summary

Three independent lines of evidence converge on position −27 as the branch point in F8 intron 18:

  1. Position scan: The disruption boundary falls between −29 and −30 (24 Alu insertions, 2 models)
  2. Point mutagenesis: Only position −27 is sensitive to substitution (11 mutations, 3 substitution types)
  3. Clinical genetics: c.5999-27A>G causes hemophilia A via exon 19 skipping (Wang et al. 2020)

Table 3. Summary of all perturbations

| Condition | SpliceAI E19 acceptor | AlphaGenome E19 acceptor | AlphaGenome PSI | Interpretation |
|---|---|---|---|---|
| WT | 0.8586 | 0.7812 | 0.9843 | Normal |
| BP −27 A>G | 0.4820 | 0.3594 | 0.7000 | Strong disruption |
| BP −27 A>C | 0.6616 | 0.6211 | 0.9684 | Moderate disruption |
| BP −27 A>T | 0.7041 | 0.7109 | 0.9776 | Mild disruption |
| −28 A>G (neighbor) | 0.7934 | 0.7422 | 0.9826 | No significant effect |
| BP motif → GC (7 nt)* | 0.1326 | 0.0381 | N/A* | Severe acceptor loss |
| Alu at −30 | 0.5430 | 0.5117 | 0.8771 | Partial recovery (BP preserved) |
| Alu at −29 | 0.0077 | 0.0122 | 0.0000 | Complete disruption (BP displaced) |
| Alu at −19 (disease) | 0.0013 | 0.0024 | 0.0000 | Complete disruption |

*PSI not computable for the GC motif replacement: the E18→E19 inclusion junction was not predicted by AlphaGenome, indicating the normal 3′ss is non-functional. An alternative junction 39 bp upstream was detected.

Supplementary — MaxEntScan Splice Site Analysis

Full MaxEntScan scoring of all F8 exon boundaries, supporting the observation that exon 19 has inherently weak splice sites (see Background).

Figure 6 — MaxEntScan splice site scores for all F8 exon boundaries

Ranked bar charts showing MaxEntScan scores for all 25 F8 acceptor sites (top) and 25 donor sites (bottom). Exon 19 (highlighted) ranks 6th weakest for both.

[Figure 6 plot, captioned "Verifying Wang et al. 2020: Exon 19 has weak splice sites". Panel A. Acceptor (3'ss) Strength, All F8 Exons: MaxEntScan Score by exon, weakest to strongest: E21, E22, E12, E24, E11, E19, E8, E10, E20, E25, E4, E16, E9, E3, E18, E23, E7, E2, E13, E15, E14, E6, E5, E17, E26; Exon 19 (7.1) ranks 6/25. Panel B. Donor (5'ss) Strength, All F8 Exons: MaxEntScan Score by exon, weakest to strongest: E7, E17, E2, E21, E20, E19, E13, E12, E22, E5, E23, E16, E10, E1, E3, E14, E18, E6, E8, E15, E25, E9, E11, E24, E4; Exon 19 (7.0) ranks 6/25.]

Table 4. MaxEntScan acceptor (3′ss) scores ranked weakest to strongest

| Rank | Exon | Score | Motif |
|---|---|---|---|
| 1 | E21 | 2.20 | GTTTCTTTACTTGGGCAAAGGAC |
| 2 | E22 | 5.34 | AGTGTTATTTTAATTGGTAGGTG |
| 3 | E12 | 5.41 | TCAATATATGTAATTAACAGATA |
| 4 | E24 | 6.42 | TCTTTTTTCTTTTCTTTGAGGTT |
| 5 | E11 | 6.83 | ATGGTTTTGCTTGTGGGTAGGTG |
| 6 | E19 | 7.10 | TTCCTGTTGGTTTTTATAAGGTG |
| 7 | E8 | 7.11 | TTTGGTTTGTCTGACTCCAGATG |
| ... | 17 exons | 7.32–12.72 | ... |
| 25 | E26 | 13.53 | TGCTATCTTTCCTCTTTCAGGTT |

Table 5. MaxEntScan donor (5′ss) scores ranked weakest to strongest

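| Rank | Exon | Score | Motif |
|---|---|---|---|
| 1 | E7 | 5.05 | ATGGTAATA |
| 2 | E17 | 5.05 | ATGGTAATA |
| 3 | E2 | 5.68 | TGGGTAATG |
| 4 | E21 | 6.12 | AAGGTTAGA |
| 5 | E20 | 6.41 | ATGGTAAAT |
| 6 | E19 | 6.97 | ATAGTGAGT |
| 7 | E13 | 8.02 | CAGGTTAGT |
| ... | 17 exons | 8.09–10.77 | ... |
| 25 | E4 | 11.00 | AAGGTAAGT |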
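The motif lengths in Tables 4 and 5 fit the windows MaxEntScan conventionally scores: a 23-mer for acceptors (20 intronic bases followed by 3 exonic bases) and a 9-mer for donors (3 exonic bases followed by 6 intronic bases). Assuming that layout applies here, the sketch below splits the exon 19 motifs and checks the canonical AG and GT dinucleotides; the helper names are illustrative and nothing is scored.

```python
def split_acceptor_window(window: str) -> tuple:
    """Split a 23-mer 3'ss window into (20 intronic bases, 3 exonic bases)."""
    if len(window) != 23:
        raise ValueError("acceptor window must be 23 nt")
    return window[:20], window[20:]


def split_donor_window(window: str) -> tuple:
    """Split a 9-mer 5'ss window into (3 exonic bases, 6 intronic bases)."""
    if len(window) != 9:
        raise ValueError("donor window must be 9 nt")
    return window[:3], window[3:]


# Exon 19 motifs copied from Tables 4 and 5
acc_intron, acc_exon = split_acceptor_window("TTCCTGTTGGTTTTTATAAGGTG")
don_exon, don_intron = split_donor_window("ATAGTGAGT")

print(acc_intron[-2:], acc_exon)  # AG GTG
print(don_exon, don_intron[:2])   # ATA GT
```

Under this layout the acceptor window reaches only 20 nt into the intron, so the branch point region around −27 lies outside it; the MaxEntScan acceptor score therefore reflects the polypyrimidine tract and AG, not the branch point.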
