Gene assemblers
velvet/oases : v1.2.10 2013;
idba-tran : v.1.1.1 2013;
soap-trans : v.1.03 2013;
trinity : trinityrnaseq_r20140717 (v2.1.1)
Velvet/Oases remains the single best gene assembler, but note that each assembler contributes some uniquely best genes. The majority of genes are most accurately assembled with a kmer (read shred size) at or above 1/2 the read length of 100 bp. Trinity is less capable, in part due to its restricted kmer choice and its lack of scaffolding with read pairs.
Anopheles albimanus, Longest 10K genes
------------------------------
Count       Unique       Method
assembler
4622,46.2%  1464,14.6%u  idba
2900,29.0%   352, 3.5%u  soap
2408,24.1%   305, 3.1%u  trin
7636,76.4%  4492,44.9%u  velv
kmer
2219,22.2%    80, 0.8%u  k05
3903,39.0%   811, 8.1%u  k25
5130,51.3%  1255,12.6%u  k35
4897,49.0%   771, 7.7%u  k45
4553,45.5%   492, 4.9%u  k55
4764,47.6%   779, 7.8%u  k65
4341,43.4%   544, 5.4%u  k75
3920,39.2%   375, 3.8%u  k85
3460,34.6%   168, 1.7%u  k95
------------------------------
Anopheles funestus, Longest 10K genes
------------------------------
Count       Unique       Method
assembler
4092,40.9%  2450,24.5%u  idba
2059,20.6%   682, 6.8%u  soap
1754,17.5%   561, 5.6%u  trin
6122,61.2%  4505,45.1%u  velv
kmer
1495,15.0%   263, 2.6%u  k05
2785,27.9%  1156,11.6%u  k25
4047,40.5%  2077,20.8%u  k35
3053,30.5%  1112,11.1%u  k45
2831,28.3%   983, 9.8%u  k55
2173,21.7%   680, 6.8%u  k65
1520,15.2%   399, 4.0%u  k75
1117,11.2%   378, 3.8%u  k85
 719, 7.2%   213, 2.1%u  k95
------------------------------
Anopheles albimanus, Highly conserved (BUSCO_drosmel 2561 genes)
------------------------------
Count       Unique       Method
assembler
1082,42.2%   309,12.1%u  idba
 692,27.0%    75, 2.9%u  soap
 569,22.2%    50, 2.0%u  trin
2089,81.6%  1285,50.2%u  velv
kmer
 458,17.9%    28, 1.1%u  k05
 957,37.4%   174, 6.8%u  k25
1177,46.0%   251, 9.8%u  k35
1169,45.6%   200, 7.8%u  k45
1085,42.4%   133, 5.2%u  k55
1203,47.0%   266,10.4%u  k65
1070,41.8%   192, 7.5%u  k75
 950,37.1%   136, 5.3%u  k85
 787,30.7%    70, 2.7%u  k95
------------------------------
Anopheles funestus, Highly conserved (BUSCO_drosmel 2648 genes)
------------------------------
Count       Unique       Method
assembler
1269,47.9%   700,26.4%u  idba
 686,25.9%   178, 6.7%u  soap
 515,19.4%    90, 3.4%u  trin
1655,62.5%  1054,39.8%u  velv
kmer
 494,18.7%   107, 4.0%u  k05
 822,31.0%   261, 9.9%u  k25
1089,41.1%   465,17.6%u  k35
 925,34.9%   293,11.1%u  k45
 883,33.3%   245, 9.3%u  k55
 731,27.6%   165, 6.2%u  k65
 540,20.4%   103, 3.9%u  k75
 411,15.5%   119, 4.5%u  k85
 240, 9.1%    68, 2.6%u  k95
------------------------------
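The Count and Unique columns in the tables above are not defined in the text. One reading, assumed here and not confirmed by the source, is that Count is the number of genes for which a method produced a best assembly (ties allowed, which would explain why the percentages sum to more than 100%), and Unique is the number of genes for which only that method did. Under that assumption, a tally could be sketched as below. The function name `tally_best` and the input layout (gene ID mapped to a set of method labels) are hypothetical, chosen for illustration.

```python
from collections import Counter


def tally_best(best_by_gene):
    """Tally Count and Unique per method.

    best_by_gene: dict mapping gene ID -> set of method labels
    (e.g. {"velv", "idba"}) that produced a best assembly for that gene.
    This input layout is an assumption, not Evigene's own format.

    Returns a list of (method, count, count_pct, unique, unique_pct),
    sorted by method label, with percentages relative to all genes.
    """
    total = len(best_by_gene)
    count = Counter()
    unique = Counter()
    for methods in best_by_gene.values():
        for method in methods:
            count[method] += 1
        # A gene is "unique" to a method when no other method ties for best.
        if len(methods) == 1:
            unique[next(iter(methods))] += 1
    return [
        (m, count[m], 100.0 * count[m] / total,
         unique[m], 100.0 * unique[m] / total)
        for m in sorted(count)
    ]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    example = {
        "gene1": {"velv", "idba"},
        "gene2": {"velv"},
        "gene3": {"soap"},
        "gene4": {"velv", "trin"},
    }
    for method, n, pct, nu, upct in tally_best(example):
        print(f"{n:4d},{pct:4.1f}%  {nu:4d},{upct:4.1f}%u  {method}")
```

The same function works for the kmer rows if the sets hold kmer labels such as "k35" instead of assembler names.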
There are various comparison papers out there, contradicting each other, on how to pick the best gene assembler. One reason for those contradictions is that some comparisons use only one kmer setting, which misses genes that assemble best at other kmer sizes, or use error-prone ways of merging multiple gene assemblies. The Evigene way is to produce millions of gene assemblies and assess them for coding-sequence quality, pulling the most complete genes out of that huge pile of chaff.
Many gene assembler comparison papers focus on technical measures like "N50" length of transcripts, or "reads-mapped-back" counts of gene fragments recovered. These are not biological accuracy measures. I can easily construct transcripts that map all reads and that are longer than anyone else's, but these are not biological transcripts; they are artifacts. A simple, meaningful gene-quality replacement for N50 transcript length is the average length of the 1000 longest proteins, which has biological maxima, is quick and easy to calculate, and usefully compares gene sets of the same and related species.
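The average length of the 1000 longest proteins can be computed in a few lines. The following is a minimal sketch, not Evigene's own code: it assumes the gene set is available as a protein FASTA file, and the function names `read_fasta_lengths` and `mean_top_protein_length` are made up for this example. A trailing `*` stop symbol is not counted toward length.

```python
import sys


def read_fasta_lengths(lines):
    """Yield (seq_id, length) for each record in FASTA-formatted lines."""
    seq_id, length = None, 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, length
            parts = line[1:].split()
            seq_id, length = (parts[0] if parts else ""), 0
        elif line and seq_id is not None:
            # Do not count a trailing stop symbol as a residue.
            length += len(line.rstrip("*"))
    if seq_id is not None:
        yield seq_id, length


def mean_top_protein_length(lengths, top_n=1000):
    """Return the mean of the top_n largest values in lengths.

    If fewer than top_n proteins are given, all of them are averaged;
    an empty input returns 0.0.
    """
    longest = sorted(lengths, reverse=True)[:top_n]
    if not longest:
        return 0.0
    return sum(longest) / len(longest)


if __name__ == "__main__":
    # Usage: python top1000.py proteins.fasta
    with open(sys.argv[1]) as handle:
        lengths = [n for _, n in read_fasta_lengths(handle)]
    print(f"proteins: {len(lengths)}")
    print(f"mean length of longest 1000: {mean_top_protein_length(lengths):.1f}")
```

Running this on the protein sets from two assemblies of the same or a related species gives one number per gene set to compare, in place of transcript N50.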
-- Don Gilbert, 2016 March