Supplementary Table 7. Prediction of protein-coding genes in chickpea

Gene set              Number  Average transcript  Average CDS  Average exons  Average exon  Average intron
                              length (bp)         length (bp)  per gene       length (bp)   length (bp)
De novo  AUGUSTUS     30,334  2,534.88            1,098.25     4.88           225.10        370.36
         GlimmerHMM   33,203  2,324.07            933.13       4.35           214.73        415.75
Homolog  Medicago     60,352  1,809.60            736.85       2.79           263.74        598.01
         Lotus        62,939  1,419.83            661.07       2.57           256.94        482.41
         Pigeonpea    48,940  2,099.77            814.32       3.32           245.57        555.01
         Soybean      36,574  3,261.78            1,208.28     4.18           289.44        646.86
         Arabidopsis  29,920  2,841.28            996.23       4.14           240.75        587.97
         Grape        38,003  2,582.92            873.65       3.69           236.72        635.26
*CaTA                 52,512  2,433.40            808.02       3.22           250.98        732.32
GLEAN                 28,269  3,055.39            1,166.44     4.93           236.51        480.43

*CaTA = chickpea transcriptome assembly (http://www.icrisat.org/gt-bt/ICGGC/GenomeSequencing.htm), composed of 48,668 transcript assembly contigs (TACs) with an N50 of 1,671 bp and a longest TAC of 15,644 bp.

Three approaches were used for gene prediction. For homology-based prediction, protein sequences from six sequenced eudicot species, namely Medicago, Lotus, pigeonpea, soybean, Arabidopsis thaliana, and grape, were used, taking one species at a time. The proteins were mapped to the genome assembly using TblastN (E-value cutoff 1e-5), and the homologous genome sequences were then aligned against the matching proteins using GeneWise (version 2.2) to obtain accurate spliced alignments. For de novo prediction, AUGUSTUS and GlimmerHMM were used with parameters trained on Arabidopsis. In the third approach, transcribed sequences, that is, 48,688 TACs (see above) and 17,540 assembled unique transcripts (PUTs) (GenBank release 181, http://www.plantgdb.org/), were aligned against the genome assembly using BLAT to generate spliced alignments; overlaps were then filtered and the spliced alignments linked using PASA. As a result, 52,512 genes were defined. Finally, the two de novo gene sets (AUGUSTUS, 30,334; GlimmerHMM, 33,203), the six homology-based gene sets (29,920 to 62,939 genes), and the transcript-based gene set (52,512) were integrated using the GLEAN program, yielding the GLEAN gene set (referred to as the G-set; 29,025 genes). At this stage, we also filtered out 769 genes that had a coding sequence length of <150 bp or an N content of >10%, and genes that had an internal stop codon, a frame shift, no start codon, no stop codon, and/or overlap with other genes.
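
The sequence-level part of the final filtering step can be illustrated with a minimal Python sketch. It is not the code used in this study: the function name `passes_filter`, the use of ATG as the only start codon, the standard stop codons, and the use of "CDS length not a multiple of three" as a stand-in for a frame shift are all assumptions made for illustration. The check for overlap with other genes needs genomic coordinates and is not covered.

```python
# Illustrative sketch only; names and thresholds' defaults mirror the text,
# but the implementation details are assumptions, not the study's pipeline.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed: standard genetic code


def passes_filter(cds: str, min_len: int = 150, max_n_frac: float = 0.10) -> bool:
    """Return True if a CDS survives the length, N-content, and codon checks."""
    seq = cds.upper()
    if len(seq) < min_len:
        return False  # coding sequence shorter than 150 bp
    if seq.count("N") / len(seq) > max_n_frac:
        return False  # N content above 10%
    if len(seq) % 3 != 0:
        return False  # assumed proxy for a frame shift
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    if codons[0] != "ATG":
        return False  # no start codon (assumed: ATG only)
    if codons[-1] not in STOP_CODONS:
        return False  # no stop codon
    if any(codon in STOP_CODONS for codon in codons[1:-1]):
        return False  # internal stop codon
    return True


# Hypothetical toy sequences, not taken from the chickpea assembly.
examples = {
    "intact": "ATG" + "GCT" * 60 + "TAA",
    "too_short": "ATG" + "GCT" * 10 + "TAA",
    "internal_stop": "ATG" + "GCT" * 30 + "TAG" + "GCT" * 30 + "TAA",
    "n_rich": "ATG" + "NNN" * 20 + "GCT" * 40 + "TAA",
}
kept = [name for name, seq in examples.items() if passes_filter(seq)]
print(kept)
```

Each criterion is a separate early return so that individual thresholds can be changed or disabled without affecting the others.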