NURD: a new algorithm to infer isoform expression

NURD is a new algorithm for inferring isoform expression. It takes the non-uniform distribution of reads into account to improve the inference. Details and usage of the program follow; they can also be found in readme.txt.

 

The latest version is v1.1.1. You can download the software here: program.

 

Function:

         This program estimates isoform expression from a read mapping result.

Input: read mapping file, gene annotation.

Output: isoform expression estimates.

 

Environment:

         Linux.

 

Usage:

NURD [options] <-G annotation.gtf>|<-R annotation.refflat> <-S mapping_file.sam>

-G: annotation.gtf: gene annotation in GTF format.

-R: annotation.refflat: gene annotation in refFlat format.

-S: mapping_file.sam: read mapping result in SAM format.

Optional:

-A: the weight of GBC when mixing the GBC and LBC into one gene structure matrix. It is a floating-point number between 0 and 1. Default: 0.5

-O: output_dir: the directory in which the estimation results are written. Default: current directory.
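
The role of the -A weight can be illustrated with a small sketch. This text does not give the formula NURD uses, so the weighted average below is only one plausible reading of "the weight of GBC", and the function name mix_bias_curves and the input values are hypothetical.

```python
def mix_bias_curves(gbc, lbc, alpha=0.5):
    """Blend two equal-length curves; alpha is the weight given to gbc.

    Assumption: a plain weighted average. NURD's actual mixing step
    may differ; this only illustrates what a weight in [0, 1] means.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be between 0 and 1")
    if len(gbc) != len(lbc):
        raise ValueError("curves must have the same length")
    return [alpha * g + (1.0 - alpha) * l for g, l in zip(gbc, lbc)]


# Placeholder curves, not real NURD data.
mixed = mix_bias_curves([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], alpha=0.5)
```

Under this reading, a weight of 1 would use the GBC alone and a weight of 0 the LBC alone; the default of 0.5 gives both equal weight.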

 

Example:

The following example shows how to use the software. You can get the example data here: data

 

If the read mapping result is in the file sample_10000.sam and the annotation is in the GTF file annotation.gtf, the command is:

NURD -O output_dir -A alpha -G annotation.gtf -S sample_10000.sam

 

The estimation results are saved in the specified output directory as two files:

1: RPKM result: sample_10000.sam.nurd.rpkm

2: read count result: sample_10000.sam.nurd.rdcnt
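
For scripted runs, the invocation and the output file names described above could be assembled as in the following sketch. The helper names build_nurd_command and expected_outputs are hypothetical, only the flags listed under Usage are used, and it is an assumption that the output names are built from the base name of the SAM file.

```python
import os


def build_nurd_command(sam_path, gtf_path=None, refflat_path=None,
                       alpha=None, output_dir=None):
    """Return the NURD argument list for one annotation and one SAM file."""
    # Usage requires either -G or -R, not both.
    if (gtf_path is None) == (refflat_path is None):
        raise ValueError("give exactly one of gtf_path or refflat_path")
    cmd = ["NURD"]
    if output_dir is not None:
        cmd += ["-O", output_dir]
    if alpha is not None:
        cmd += ["-A", str(alpha)]
    if gtf_path is not None:
        cmd += ["-G", gtf_path]
    else:
        cmd += ["-R", refflat_path]
    cmd += ["-S", sam_path]
    return cmd


def expected_outputs(sam_path, output_dir="."):
    """Return the (rpkm, read count) paths, assuming base-name naming."""
    base = os.path.basename(sam_path)
    return (os.path.join(output_dir, base + ".nurd.rpkm"),
            os.path.join(output_dir, base + ".nurd.rdcnt"))


cmd = build_nurd_command("sample_10000.sam", gtf_path="annotation.gtf",
                         alpha=0.5, output_dir="output_dir")
rpkm_path, rdcnt_path = expected_outputs("sample_10000.sam", "output_dir")
# To run it (requires NURD to be installed and on PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.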

 

A line of the RPKM output looks like:

SLCO2B1 3       856     NM_001145211,NM_007256,NM_001145212,    0,19.4646,128.761,     148.226

The file is tab-delimited (\t), and the columns are:

1: gene name

2: number of isoforms of this gene

3: number of reads located in this gene, which gives a rough indication of whether the sequencing is deep enough for the estimate to be trusted

4: isoform names, comma-delimited

5: isoform expression in RPKM, comma-delimited, one value per isoform in the order of column 4

6: gene expression in RPKM, which is the sum of all the isoform expressions
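
The column layout above can be read with a few lines of Python. This is a minimal sketch: the function name parse_rpkm_line and the dictionary keys are hypothetical, and the sample line is the one shown above, rewritten with explicit tabs.

```python
def parse_rpkm_line(line):
    """Parse one tab-delimited line of a .nurd.rpkm file into a dict."""
    cols = line.rstrip("\n").split("\t")
    # Columns 4 and 5 end with a trailing comma, so drop empty items.
    names = [n for n in cols[3].strip().split(",") if n]
    values = [float(v) for v in cols[4].strip().split(",") if v]
    return {
        "gene": cols[0].strip(),
        "n_isoforms": int(cols[1]),
        "n_reads": int(cols[2]),
        "isoform_rpkm": dict(zip(names, values)),
        "gene_rpkm": float(cols[5]),
    }


line = ("SLCO2B1\t3\t856\tNM_001145211,NM_007256,NM_001145212,"
        "\t0,19.4646,128.761,\t148.226\n")
rec = parse_rpkm_line(line)

# Column 6 should equal the sum of column 5, up to rounding in the file.
total = sum(rec["isoform_rpkm"].values())
consistent = abs(total - rec["gene_rpkm"]) < 1e-3
```

The tolerance in the last line allows for the rounding of the printed values; for the sample line the isoform values sum to 148.2256 against a reported 148.226.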

 

A line of the read count output looks like:

SLCO2B1 3       856     NM_001145211,NM_007256,NM_001145212,    0,109.325,746.675,    

The file is tab-delimited (\t), and the columns are:

1: gene name

2: number of isoforms of this gene

3: number of reads located in this gene, which gives a rough indication of whether the sequencing is deep enough for the estimate to be trusted

4: isoform names, comma-delimited

5: isoform expression in read counts, comma-delimited, one value per isoform in the order of column 4
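
The read count file can be parsed the same way, for example to turn the estimated counts into per-isoform fractions. The names parse_rdcnt_line and isoform_fractions are hypothetical. In the sample line the isoform counts add up to the gene read count in column 3 (856); whether this holds for every gene is not stated here, so the sketch does not rely on it.

```python
def parse_rdcnt_line(line):
    """Parse one tab-delimited line of a .nurd.rdcnt file.

    Returns (gene name, gene read count, {isoform name: estimated count}).
    """
    cols = line.rstrip("\n").split("\t")
    # Columns 4 and 5 end with a trailing comma, so drop empty items.
    names = [n for n in cols[3].strip().split(",") if n]
    counts = [float(c) for c in cols[4].strip().split(",") if c]
    return cols[0].strip(), int(cols[2]), dict(zip(names, counts))


def isoform_fractions(counts):
    """Share of the estimated reads assigned to each isoform."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: c / total for name, c in counts.items()}


line = ("SLCO2B1\t3\t856\tNM_001145211,NM_007256,NM_001145212,"
        "\t0,109.325,746.675,\n")
gene, n_reads, counts = parse_rdcnt_line(line)
fractions = isoform_fractions(counts)
```

The estimated counts are fractional because reads shared between isoforms are split among them by the estimate, so they are parsed as floats rather than integers.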