Autism SLiM Discovery: Pipeline Report

Swati Swati #Autism#Bioinformatics#Motif Discovery#MEME Suite#Python#SLiMs#SFARI

Systematic discovery of conserved short linear motifs (SLiMs) in intrinsically disordered regions of high-confidence autism risk genes (SFARI Category 1) vs. length-matched controls.

Overview Data & Methods Results Figures

Pipeline Overview

242SFARI Category 1 Genes
251UniProt Sequences
1,502Autism IDRs
1,130Control IDRs
10MEME Motifs
20,431Swiss-Prot Pool
  • Status: Pipeline COMPLETE
  • 3 significant motifs found (poly-H, poly-Q, structured pattern)
  • Motif 3 validated by TOMTOM (DEG_SPOP_SBC_1, p=0.005) and bootstrap (98.6% stability)
  • Shuffled control: 0 motifs
  • Positive control: all 4 spiked SLiMs recovered
  • Key enrichment (q<0.05): poly-Q OR=4.5 (q=0.00022), poly-H OR=inf (q=0.0059)

Phase Progress

Phase Description Status
1 Data Acquisition (SFARI, UniProt) Done
2 Quality Control Done
3 IDR Prediction (IUPred3) Done
4 Control Set Construction Done
5 MEME Motif Discovery Done
6 Motif Validation (TOMTOM, bootstrap, controls) Done
7 Statistical Analysis (Fisher’s, FDR) Done
8 Documentation & Reproducibility Done

Key Design Decisions

Decision Detail
Discriminative MEME Uses control IDRs as negative set + position-specific priors instead of default E-value filter
Length-matched controls Non-SFARI Swiss-Prot proteins within ±10–50% length of each autism protein (seed=42)
Shuffled negative control Each IDR shuffled independently preserving AA composition (seed=99)
Positive control spike-in 100 synthetic IDRs with known SLiMs (SH3, 14-3-3, PDZ) spiked into each set
Deduplication Exact duplicate IDR sequences removed (8 autism, 22 control)
Fisher’s exact test One-sided enrichment test + Benjamini-Hochberg FDR (q < 0.05)

Data Sources

SFARI Gene List

Property Value
Source gene.sfari.org / human-gene
Rows 1,277 genes + 1 header
Filter gene-score == 1 (Category 1)
Category 1 genes 245 gene symbols
Protein-coding 242 (3 RNA genes: RNU2-2, RNU4-2, RNU5B-1)

UniProt Sequences (Autism)

Property Value
Database UniProtKB/Swiss-Prot (reviewed, human)
Query gene:{symbol} AND organism_id:9606 AND reviewed:true
Endpoint rest.uniprot.org/uniprotkb/search
Sequences 251 (242/242 protein-coding genes covered)
Skipped 0 protein-coding; 3 RNA genes excluded (expected)

Human Proteome (Control Pool)

Property Value
Source rest.uniprot.org/uniprotkb/stream
Query (reviewed:true) AND (organism_id:9606)
Entries 20,431 reviewed human proteins

IUPred3 (IDR Prediction)

Property Value
Tool IUPred3 (long disorder mode)
Endpoint iupred3.elte.hu/iupred3/{accession}
Method REST API per protein accession

MEME Suite

Property Value
Tool MEME v5.5.9
Mode Discriminative (positive vs. negative)
Server meme-suite.org
Job ID appMEME_5.5.91781622487213525159131

Quality Control

Sequence Statistics

Metric Value
Sequences 251
Min Length 123 AA
Max Length 4,911 AA
Mean Length 1,281 AA
Median Length 927 AA
Invalid Characters 0

Length Distribution

Percentile SFARI (AA) Control (AA)
P5 391 390
P25 612 603
P50 (median) 927 922
P75 1,710 1,716
P95 3,047 3,044
Mean 1,281 1,265
Std 932 917

Length matching between autism and control sets is significant with median difference of just 5 AA (0.5%).

Validation Decisions

Decision Rationale
RNA genes excluded RNU2-2, RNU4-2, RNU5B-1 have no protein product and thus correctly excluded from protein-level analysis
Score/seq mismatches excluded 7 autism + 7 control proteins with IUPred3 score array != FASTA length (isoform variants)
IDR length cutoff ≥ 15 AA Standard in IDR literature (≥15 consecutive disordered residues is biologically meaningful)
Deduplication 8/1510 autism + 22/1152 control exact duplicates removed to avoid inflation of motif enrichment

IDR Prediction

IUPred3 long disorder mode · threshold: score > 0.5 · minimum length: 15 AA

Metric Value
Raw Autism IDRs 1,510
After Dedup 1,502 (8 removed)
Proteins w/ IDR 217 of 242 = 89.7%
Proteins w/o IDR 25 (10.3%)
Mean IDR Length 74 AA (Median: 38 AA)
Proline in IDRs 11.6% (vs 7.4% full-length)

Autism IDR Length Distribution

Bin (AA) Count Percentage
0–20 217 14.4%
20–30 375 25.0%
30–50 322 21.4%
50–100 305 20.3%
100–200 156 10.4%
200–500 110 7.3%
500–1,000 16 1.1%
1,000+ 1 0.1%

Majority of IDRs are short (<50 AA), consistent with known IDR length distributions. Range: 15–1,223 AA. Total residues: 111,499 AA.

Control Set Construction

For each of the 251 SFARI sequences, a non-SFARI human Swiss-Prot protein was randomly selected with similar length (±10% initial, expanding to ±50% as needed). Seed: 42.

Metric Value
Control Proteins 251
Raw Control IDRs 1,152
After Dedup 1,130 (22 removed)
Proteins w/ IDR 196 of 251 = 78.1%
Mean IDR Length 66 AA (Median: 32 AA)
Length Gap SFARI 927 vs Ctrl 922 (5 AA, 0.5%)

Control IDRs are shorter on average (66 AA vs 74 AA), and a lower fraction of control proteins have any IDRs (78.1% vs 89.7%). This suggests autism proteins may be intrinsically more disordered overall which makes it a consistent finding with published literature.

MEME Discriminative Motif Discovery

Job ID: appMEME_5.5.91781622487213525159131 Status: COMPLETE MEME EM: 2,715s (~45 min) · MAST: 0.31s

Parameter Value
Mode Discriminative (positive vs negative)
Sequence type protein
Motif width min 6, max 15
Number of motifs 10
Model zoops (Zero Or One Occurrence Per Sequence)
Objective function classic
Markov order 0
Max time 14,362 seconds
PSP Position-specific priors (generated by psp-gen)

Discovered Motifs

# E-value Sites Width Consensus Status
1 3.1e-031 12 14 XHH[HQ]HHHHHHHHHH Significant
2 7.1e-025 38 11 QQQQQQQQQQQ Significant
3 6.6e-006 6 15 M[SA]T[TS][IV]METTTT[ML]AT[TS] Significant
4 2.4e+000 6 15 DESRNYISNSAQSNG NS
5 1.8e+001 2 15 TDDEDFYTTFPLVTD NS
6 5.7e+000 4 13 VASAECPSDDED[IL] NS
7–10 > 2.2e+003 NS

Motifs 1–2 are composition-biased (poly-H, poly-Q). Motif 3 is the strongest candidate for a functional SLiM with a structured alternating pattern. NS = not significant (E-value > 0.05).

TOMTOM Search vs ELM Database

Motif 3 submitted via TOMTOM against ELM 2024 database. Motifs 1–2 (poly-H, poly-Q) skipped as composition artifacts with no meaningful ELM matches.

Rank ELM ID ELM Class p-value Description
1 ELME000388 DEG_SPOP_SBC_1 5.03e-03 SPOP-binding degron
2 ELME000336 MOD_NEK2_1 6.92e-03 NEK2 phosphorylation site
3 ELME000444 MOD_Plk_4 1.12e-02 Polo-like kinase site
4 ELME000121 LIG_Dynein_DLC8_1 3.53e-02 Dynein light chain binding
5 ELME000438 LIG_Vh1_VBS_1 3.57e-02 VH1 phosphatase binding
6 ELME000337 MOD_NEK2_2 3.81e-02 NEK2 alt. phosphorylation site

Best hit: SPOP-binding degron (DEG_SPOP_SBC_1, p=0.005). SPOP is a ubiquitin ligase that targets substrates for degradation. This suggests Motif 3 may mediate SPOP-dependent turnover of autism-risk proteins.

Validation Controls

Bootstrap Stability Test (Motif 3)

80% of autism IDRs resampled 1,000× with replacement. Score against Motif 3 PWM at bayes_threshold. Seed: 42.

Metric Value
Detection Rate 98.6% (threshold: ≥90%)
Mean Sites/Bootstrap 4.8 (SD=2.2, full set=6)
≥3 Sites/Bootstrap 85.0%

Verdict: STABLE (Motif 3 is robust and not dependent on outlier sequences)

Negative Control: Shuffled Sequences

Each IDR sequence independently shuffled preserving exact AA composition. MEME discriminative with identical parameters.

Result: 0 significant motifs (all E > 10⁴) · Run time: 46 min

Conclusion: Original MEME motifs are not due to random composition noise.

Positive Control: Spike-in SLiMs

100 synthetic IDR-like sequences with known SLiMs (25 each of SH3 Class I/II, 14-3-3, PDZ) spiked into both sets.

Result: 4/4 spiked SLiM classes recovered (E < 0.05) · Run time: 46 min

Conclusion: PIPELINE VALIDATES (MEME discriminative mode successfully detected all 4 spiked SLiM classes)

Statistical Analysis

For each of 10 MEME motifs, scanned all 1,502 autism + 1,130 control IDRs using the motif PWM at its bayes_threshold. One-sided Fisher’s exact test + Benjamini-Hochberg FDR.

Results (q < 0.05)

# Consensus Autism Control OR Fold p-value q-value
2 QQQQQQQQQQQ 41/1,502 (2.7%) 7/1,130 (0.6%) 4.5 4.4× 9.9e-05 0.00022
1 XHH[HQ]HHHHHHHHHH 12/1,502 (0.8%) 0/1,130 (0.0%) 0.00054 0.0059
3 M[SA]T[TS][IV]METTTT[ML]AT[TS] 6/1,502 (0.4%) 0/1,130 (0.0%) 0.034 0.086 (NS)

Motif 2 (poly-Q) shows the strongest enrichment: 4.4-fold increase in autism IDRs with q=0.00022. Motif 1 (poly-H) is exclusive to autism (12 vs 0) but lower prevalence. Motif 3, despite being the best SLiM candidate, does not survive FDR correction (q=0.086) due to low site count (6).

Draft Findings

Claim 1: Poly-Glutamine Tract (Motif 2)

We found evidence that an 11-residue poly-glutamine (poly-Q) motif is significantly enriched in the intrinsically disordered regions of proteins encoded by high-confidence autism risk genes (SFARI Category 1, n=242) compared to length-matched controls (n=251 non-SFARI human proteins).

Enrichment: 41/1,502 autism IDRs (2.7%) vs. 7/1,130 control IDRs (0.6%), odds ratio = 4.5, enrichment fold = 4.4×, FDR q = 0.00022 (Benjamini-Hochberg).

Interpretation: Poly-glutamine tracts are well-known in transcriptional regulation and neurological disorders. The enrichment suggests poly-Q tracts in IDRs may be a shared property of autism-risk transcriptional regulators (e.g., MED13L, KMT2C, CHD2, CHD8, SETD5, ASH1L, TBL1XR1), potentially modulating protein-protein interaction networks via homotypic Q/N-rich phase separation.

Claim 2: Poly-Histidine Tract (Motif 1)

We found evidence that a 14-residue poly-histidine motif (consensus EHHHHHHHHHHHHH) is significantly enriched in the intrinsically disordered regions of high-confidence autism risk proteins compared to length-matched controls.

Enrichment: 12/1,502 autism IDRs (0.8%) vs. 0/1,130 control IDRs (0.0%), odds ratio = ∞, Fisher’s exact p = 0.0012, FDR q = 0.0059.

Interpretation: Poly-histidine tracts are rare in the human proteome and are primarily found in zinc-finger transcription factors and DNA-binding proteins. The complete absence in controls suggests poly-H repeats may be a specific signature of autism-related transcriptional machinery, though the low absolute count (12 hits) warrants cautious interpretation.

Claim 3: SPOP-Binding Degron-like Motif (Motif 3)

We found evidence that a 15-residue motif (consensus M[SA]T[TS][IV]METTTT[ML]AT[TS], best ELM match: DEG_SPOP_SBC_1, SPOP-binding degron, p = 0.005) is enriched in the intrinsically disordered regions of high-confidence autism risk proteins.

Enrichment: 6/1,502 autism IDRs (0.4%) vs. 0/1,130 control IDRs (0.0%), odds ratio = ∞, p = 0.034, FDR q = 0.086 (not significant).

Bootstrap stability (1000×, 80% resample): Mean sites = 4.8 (SD = 2.2), 98.6% detected ≥1 site, 85.0% detected ≥3 sites.

Interpretation: Despite falling short of FDR significance, this motif is the strongest structured SLiM candidate. The SPOP-binding degron match is notable because SPOP is an E3 ubiquitin ligase adaptor, and SPOP mutations are linked to autism. The six proteins containing this motif are strong candidates for SPOP-mediated regulation.

Summary Table

Motif Consensus Width E-value OR q-value FDR sig? ELM hit
1 (poly-H) EHHHHHHHHHHHHH 14 3.1e-31 0.0059 Yes
2 (poly-Q) QQQQQQQQQQQ 11 7.1e-25 4.5 0.00022 Yes
3 (structured) M[SA]T[TS][IV]METTTT[ML]AT[TS] 15 6.6e-06 0.086 No DEG_SPOP_SBC_1

Figures

Figure 1: IDR Length Distribution

Overlaid histogram of autism vs. control IDR lengths. Autism IDRs skewed toward longer regions (mean 42.4 vs 40.5).

IDR Length Distribution

Figure 2: Motif Enrichment

Top: Bar plot of autism vs. control site counts per motif. Bottom: Volcano plot (−log10 p-value vs. enrichment fold).

Motif Enrichment

Figure 3: Bootstrap Distribution

Histogram of Motif 3 site counts across 1,000 bootstrap resamples (80% autism IDRs, with replacement). Mean=4.8, SD=2.2.

Bootstrap Distribution

Motif Logos

Motif 1 (Poly-H)

Motif 1

Motif 2 (Poly-Q)

Motif 2

Motif 3 (SPOP degron-like)

Motif 3

Motif 4 (NS)

Motif 4

Motif 5 (NS)

Motif 5

Motif 6 (NS)

Motif 6

Motif 7 (NS)

Motif 7

Motif 8 (NS)

Motif 8

Motif 9 (NS)

Motif 9

Motif 10 (NS)

Motif 10