EvidentialGene Annotator Guide

Example for Killifish gene set, 2012-Aug

The Killifish gene set kfish1_best9f (2012-Aug) is a good first-draft set to accompany the first draft genome assembly. Ortholog gene content is high, in the top tier of related fish genomes, and overall gene quality is good, with full-length proteins and substantial EST and homology support.

Improving and annotating this gene set means distinguishing bad models from good ones. A majority are good, if not yet perfect, models of the available gene evidence. Expect some learning time before you can recognize mistakes in modeling gene evidence, by looking at genome map displays and at examples of bad versus good models.

Plans for improved Killifish genes

Common gene model errors

Identifying problems

Resolving problems

Annotate locus dialog

Each gene on the GBrowse map has a clickable information dialog. This includes
  • a. A brief summary of the gene: name and quality.
  • b. A detailed report, via the View Gene Details .. link.
  • c. The Update Choices .. dialog, to promote or demote this model.
  • d. A table of all updates made by others, with links to their locations (the View Changes table).
Examine the Gene Details for further information, including protein and cDNA sequences and homology links.

View the Changes table to see what other problems have been corrected, as examples of what to look for.

Use this dialog to replace poor models with better ones, or to enter a note about any problem. Among the Predictions tracks, choose p. AUGepir1 .. AUGepi9 or xmbest7pu if they look best; these are more likely to be complete models. Otherwise choose the EST asm.PASA or EST asm.Newbler assemblies if they are complete. If need be, choose a good fish gene to fit the best model.

Update choices                       use  [X] at top to cancel
  [x] No change . .  [ ] Alternate transcript . .  [ OK ] to change
  [ ] Best model . . [ ] None are good at locus
  [ ] Drop model . . [ ] Skip locus, no gene here
  Note: [ENTER Your Comments Here] 

Examples

Gene join (easy case: reversed)

Here two genes on opposite strands are joined as Funhe5EG000103t1. Both the introns and the mapped fish genes show that these genes are reversed relative to each other. The de novo EST assembly joined these neighbors without accounting for the reversal, which is found only by mapping introns to the genome. Joins of un-reversed genes also occur; they are similar but harder to spot.
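
The intron check described above can be sketched in a few lines. This is only an illustration, not part of the EvidentialGene tools: the Intron record, the function name and the coordinates in the usage note are hypothetical, and it assumes you already have the mapped introns (with strand) that fall inside one gene model.

```python
from typing import List, NamedTuple, Optional, Tuple


class Intron(NamedTuple):
    """Hypothetical record for one intron mapped to the genome."""
    start: int
    end: int
    strand: str  # "+" or "-"


def strand_switch_point(introns: List[Intron]) -> Optional[Tuple[int, int]]:
    """Return the interval between the first pair of neighboring introns
    that lie on opposite strands, or None if all introns agree.

    A switch inside a single gene model suggests a join of reversed genes.
    """
    ordered = sorted(introns)  # sorts by start, then end
    for left, right in zip(ordered, ordered[1:]):
        if left.strand != right.strand:
            return (left.end, right.start)
    return None
```

For example, introns at 100-200 (+), 300-400 (+), 900-1000 (-) and 1100-1200 (-) would report the interval (400, 900) as the place to look for the boundary between the two genes. A model with no strand switch can still be a join of un-reversed genes, so a None result does not clear the model.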


Join or Split? Ambiguous

In this case of a possible gene split, the evidence is inconsistent.
Funhe5EG000114t1 and Funhe5EG000115t1 match stickleback, tetraodon and zebrafish genes, and EST assemblies. But human, medaka and tilapia models join these two. The critical point for me is that EST expression is fairly strong over the two parts, while the middle, at the split, lacks expression joining them. Orthologs of the joining models above may clarify.
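
One way to make the "joining expression" test concrete is to look for ESTs that reach across the split. The sketch below assumes EST alignments are available as simple (start, end) genome spans; the function name and all coordinates are hypothetical, and real EST evidence would also need strand and intron information.

```python
from typing import List, Tuple


def spanning_ests(est_spans: List[Tuple[int, int]],
                  left_end: int,
                  right_start: int) -> List[Tuple[int, int]]:
    """Return the EST spans that overlap both the left gene part
    (ending at left_end) and the right gene part (starting at right_start).

    An empty result means no EST joins the two parts.
    """
    return [(start, end) for start, end in est_spans
            if start <= left_end and end >= right_start]
```

With parts ending at 500 and starting at 900, the spans (100, 500), (450, 480) and (900, 1500) support only one side each, while (400, 1000) would count as joining evidence.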


toosmall FISH10117 s1013

The example gene, BRCA2_HUMAN, is split at a genome gap and has no accurate alternate model. It was found by examining "tootiny" orthologs from EvidentialGene/killifish/project/orthomcl/killifish_outlier.tootiny.tab


toobig FISH10096 addalt_s567

This case, Funhe5EG013406 from killifish_outlier.toobig.tab, is a Killifish protein that is bigger than its orthologs. Inspection of the genome map suggests the current model is likely accurate and not a join, as introns and ESTs support the joining of all exons. However, a shorter alternate transcript exists and matches other fish genes.
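
The idea behind the "tootiny" and "toobig" outlier tables can be illustrated by comparing a killifish protein length with the lengths of its orthologs. How the killifish_outlier tables were actually built is not stated here, so the median comparison and the 0.5 and 1.5 ratio cutoffs below are assumptions chosen only for illustration.

```python
from statistics import median
from typing import Sequence


def size_outlier(killifish_aa: int,
                 ortholog_aa: Sequence[int],
                 low: float = 0.5,    # assumed cutoff, not from the guide
                 high: float = 1.5) -> str:  # assumed cutoff, not from the guide
    """Classify a killifish protein by its length relative to the
    median ortholog length: "tootiny", "toobig" or "ok"."""
    if not ortholog_aa:
        return "no_orthologs"
    ratio = killifish_aa / median(ortholog_aa)
    if ratio < low:
        return "tootiny"
    if ratio > high:
        return "toobig"
    return "ok"
```

As the two examples above show, an outlier call is only a prompt to inspect the genome map: a "tootiny" gene may be split at a genome gap, and a "toobig" gene may still be an accurate model.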


Too many paralogs

A killifish gene family with more paralogs than other fish have may include fragment genes. Inspecting these gene families will show that some contain fragments and some contain true paralogs. One way to find such groups is with the gene group search below. You can also reverse this search to find groups with too few killifish paralogs, then hunt for the missing paralog with blast or a genome map search of other fish gene IDs.
	      
    EvidentialGene/killifish/project/
    Search gene families, 
    >Limit to genes in these taxa:  Fish: 1 or 0
    >Limit to genes in these species: Killifish: 2+
    Search for Vertebrata with above constraints : No. matches = 1225
Ignore groups with ntaxa below 4, and 274 groups remain. Or use this search link
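
If you have the gene family table offline, the same constraints can be applied in a short script. This is a sketch under several assumptions: the GeneFamily record and species labels are hypothetical, "Fish: 1 or 0" is read as at most one gene in total from the other fish species, and ntaxa is read as the number of species with at least one gene in the family.

```python
from dataclasses import dataclass
from typing import Dict

# Hypothetical species labels; the search form's real taxon lists are not given here.
OTHER_FISH = {"medaka", "stickleback", "tetraodon", "tilapia", "zebrafish"}


@dataclass
class GeneFamily:
    family_id: str
    gene_counts: Dict[str, int]  # species label -> number of genes in the family


def has_extra_killifish_paralogs(fam: GeneFamily,
                                 max_fish: int = 1,
                                 min_killifish: int = 2,
                                 min_taxa: int = 4) -> bool:
    """Apply the search constraints: few other-fish genes, two or more
    killifish genes, and at least min_taxa species in the family."""
    fish_genes = sum(n for sp, n in fam.gene_counts.items() if sp in OTHER_FISH)
    killifish_genes = fam.gene_counts.get("killifish", 0)
    ntaxa = sum(1 for n in fam.gene_counts.values() if n > 0)
    return (fish_genes <= max_fish
            and killifish_genes >= min_killifish
            and ntaxa >= min_taxa)
```

Setting max_fish to 0 corresponds to the "Fish: None" form of the search.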

Look at the gene group pages for these groups. If the killifish gene IDs are consecutive, or nearly so, the genes are next to each other on the genome and may be splits of one gene.
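
Scanning a group for consecutive IDs is easy to automate. The sketch below assumes IDs of the form Funhe5EG<number>t<number>, as used on this page; the function names and the max_gap parameter are hypothetical. A run of consecutive IDs is only a candidate for inspection, since neighbors can also be true paralogs.

```python
import re
from typing import Iterable, List

# Assumed ID layout: Funhe5EG + locus number + t + transcript number.
ID_PATTERN = re.compile(r"^Funhe5EG(\d+)t\d+$")


def locus_number(gene_id: str) -> int:
    """Extract the numeric locus part of a killifish gene ID."""
    match = ID_PATTERN.match(gene_id)
    if match is None:
        raise ValueError(f"unexpected gene ID format: {gene_id}")
    return int(match.group(1))


def consecutive_runs(gene_ids: Iterable[str], max_gap: int = 1) -> List[List[int]]:
    """Return runs of two or more locus numbers whose neighbors differ
    by at most max_gap (use max_gap=2 to allow 'nearly' consecutive IDs)."""
    numbers = sorted({locus_number(g) for g in gene_ids})
    runs: List[List[int]] = []
    current: List[int] = []
    for n in numbers:
        if current and n - current[-1] > max_gap:
            if len(current) > 1:
                runs.append(current)
            current = []
        current.append(n)
    if len(current) > 1:
        runs.append(current)
    return runs
```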

E.g. FISH8_G2208 Toll-like receptor 2, Funhe5EG021539t1, Funhe5EG021540t1
These are NOT splits, but 2 true paralogs, as each matches complete tilapia/medaka genes that have two mappings.

E.g. FISH8_G2508 Vertebrate protein tyrosine phosphatase, Funhe5EG000402t1 .. 403t1 .. 404t1
These are 3 parts of 1 gene. ESTs exist only for the middle part, but the mapped proteins all cover all 3 parts, and orthology indicates these killifish genes are too short.



Plectin paralogs : big gene problems

There are more paralogs of Plectin, or FISH8_G231 Microtubule-actin crosslinking factor, for Killifish than for other species. This is a big gene, 7000 aa per paralog. Most of the 4 putative common vertebrate paralogs are split in Killifish, at the 4 locations below. The 2nd, plectin3_q_sc759, looks complete, but fragmentary alternate exons exist in its middle.

locus P1: many fragments, corrected w/ other model
locus P2: accurate full gene, with alternate exon fragments
locus P3: two big parts, no full model, alternate p-AUGepi9 covers most of gene
locus P4: two big parts, could be 2 genes, evidence disagrees


Extra credit

Other gene family searches pull out interesting Killifish groups. Of particular interest are families with no other fish species and without good functional information. While the largest of these tend to be uninteresting Transposon groups, there is one very intriguing group of 99 Killifish genes. You can find it with the Orthomcl family search below.
	      
    EvidentialGene/killifish/project/
    Search gene families, 
    >Limit to genes in these taxa:  Fish: None
    >Limit to genes in these species: Killifish: 2+
    Search for Vertebrata with above constraints : No. matches = 829
The group at the top, FISH8_G28, has 99 Killifish genes, and my inspection of ~10 of them shows EST expression without transposon overlap. Some of these have weak homology to other genes; a zinc finger motif is the most common among the weak homologs. Can anyone win the prize of determining a family function for this large group, from those weak homologies or other analyses? Judging by the Funhe IDs, the genes occur in tandem clusters of 2-3 paralogs, but they do not appear to be fragments of longer genes.

Other groups of uniquely killifish genes from the above search may include Transposon genes yet to be assigned. Check for expression: some of these other groups are expressed in killifish, such as FISH8_G321 with 31 paralogs.