*CaTA = chickpea transcriptome assembly (http://www.icrisat.org/gt-bt/ICGGC/GenomeSequencing.htm), composed of 48,668 transcript assembly contigs (TACs) with an N50 of 1,671 bp and a longest TAC of 15,644 bp.
Protein sequences from six sequenced eudicot species (Medicago, Lotus, pigeonpea, soybean, Arabidopsis thaliana and grape) were used for homology-based prediction, one species at a time. We mapped them to the genome assembly using TblastN (E-value cutoff 1e-5), and then aligned the homologous genome sequences against the matching proteins using GeneWise (version 2.2) to obtain accurate spliced alignments. For de novo prediction, Augustus and GlimmerHMM were used to predict genes with parameters trained on Arabidopsis. In the third approach, we aligned transcribed sequences, that is, 48,688 TACs (see above) and 17,540 assembled unique transcripts (PUTs) (GenBank release 181, http://www.plantgdb.org/), against the genome assembly using BLAT to generate spliced alignments, and then filtered overlaps and linked the spliced alignments using PASA. As a result, 52,512 genes were defined. The two de novo gene sets (Augustus: 30,334; GlimmerHMM: 33,203), the six homology-based gene sets (29,920 to 62,939 gene models) and the transcript-based gene set (52,512) were then integrated using the GLEAN program, yielding the GLEAN gene set (referred to as the G-set; 29,025). At this stage, we also filtered out 769 genes that had a coding sequence length of <150 bp or N content >10%, or that had an internal stop codon, frame shift, no start codon, no stop codon and/or overlap with other genes.<br> |
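
The sequence-level filters in the last step can be illustrated with a minimal Python sketch. It is not the pipeline used for the G-set: the function name `cds_filter_reasons` and the reason labels are hypothetical, a frame shift is approximated here as a coding sequence whose length is not a multiple of three, the standard stop codons (TAA, TAG, TGA) and an ATG start are assumed, and the overlap check is omitted because it requires gene coordinates rather than sequence alone.

```python
# Standard stop codons (assumption: standard nuclear genetic code).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def cds_filter_reasons(cds, min_len=150, max_n_fraction=0.10):
    """Return the reasons a CDS would be filtered out; an empty list means keep.

    Thresholds follow the text: CDS length < 150 bp or N content > 10%.
    """
    seq = cds.upper()
    reasons = []

    if len(seq) < min_len:
        reasons.append("short_cds")
    if seq and seq.count("N") / len(seq) > max_n_fraction:
        reasons.append("high_n_content")
    # Assumption: a length that is not a multiple of 3 stands in for a frame shift.
    if len(seq) % 3 != 0:
        reasons.append("frame_shift")

    # Split into complete codons, ignoring any trailing partial codon.
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]

    if not codons or codons[0] != "ATG":
        reasons.append("no_start_codon")
    if not codons or codons[-1] not in STOP_CODONS:
        reasons.append("no_stop_codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        reasons.append("internal_stop_codon")

    return reasons


# Toy sequences for illustration only (not chickpea data).
examples = {
    "intact": "ATG" + "GCT" * 50 + "TAA",
    "too_short": "ATGGCTTAA",
    "internal_stop": "ATG" + "GCT" * 25 + "TAG" + "GCT" * 25 + "TAA",
}
for name, cds in examples.items():
    print(name, cds_filter_reasons(cds))
```

A gene model would be retained only when the returned list is empty. Reporting every reason, rather than stopping at the first failure, makes it easier to tabulate how many models fail each criterion.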