DSGseq: a useful tool for identifying differentially spliced genes from two groups of RNA-seq samples

Version 0.1.0 (updated 2012/5/14)

 

Download:

Software  DSG-0.1.0.tar

Data for testing  data.tar

 

This is a beta release; more functions will be added. Please contact Zhiyi Qin (e-mail: qzy06@mails.tsinghua.edu.cn) if you have any problems or suggestions.

 

 

Description:

      This program identifies differentially spliced genes from two groups of RNA-seq samples.

Input: read count (".count") files.

Output: differences in the relative abundance of the isoforms of each gene in the annotation.

 

 

Environment:

      R (version 2.10 or higher)

      g++ compiler

      Linux (command line).
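
Before installing, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch and is not part of the DSGseq package: the helper name check_tools is hypothetical, and the tool list (Rscript, g++, make, tar) is taken from the requirements and installation steps in this document. It does not check the R version.

```shell
#!/bin/sh
# check_tools (hypothetical helper, not shipped with DSGseq):
# report each named command that cannot be found on the PATH.
# Returns 0 if all are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Tools named in this README; add bamToBed if you start from BAM files.
check_tools Rscript g++ make tar || echo "Install the tools listed above before building."
```

The R version can be checked separately with "R --version".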

 

 

Installation:

      1. Unpack the DSG package and the test data:

           tar xvf DSG-0.1.0.tar

           tar xvf data.tar

      2. Change to the DSG-0.1.0/SeqExpress directory:

           cd DSG-0.1.0/SeqExpress

      3. Build SeqExpress:

           make

 

 

Usage:

      1. Generate the merged annotation file. This step converts biological exons into mathematical exons.

           Rscript merge.R <annotation_file.refFlat> <annotation_file.merge.refFlat>

      2. Generate the ".count" files from ".BED6", ".BED12", or ".BAM" input files.

           2.1 Generate a ".count" file from a ".BED6" file

                 /DSGSeq/SeqExpress/SeqExpress count XXX.bed XXX.merge.refFlat XXX.count

           2.2 Generate a ".count" file from a ".BED12" file

                 cut -f1-6 XXX.bed > XXX.bed6

                 /DSGSeq/SeqExpress/SeqExpress count XXX.bed6 XXX.merge.refFlat XXX.count

           2.3 Generate a ".count" file from a ".BAM" file (requires bedtools)

                 bamToBed -i XXX.bam > XXX.bed

                 /DSGSeq/SeqExpress/SeqExpress count XXX.bed XXX.merge.refFlat XXX.count
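
When there are many samples, step 2.3 can be repeated in a loop. The sketch below only prints the commands for each BAM file so that they can be reviewed before running. The function name bam_to_count_cmds, the SEQEXPRESS and ANNOTATION variables with their defaults, and the BAM file names are placeholders for illustration, not names defined by DSGseq.

```shell
#!/bin/sh
# Placeholder locations; adjust both to your installation and annotation.
SEQEXPRESS="${SEQEXPRESS:-/DSGSeq/SeqExpress/SeqExpress}"
ANNOTATION="${ANNOTATION:-annotation.merge.refFlat}"

# bam_to_count_cmds (hypothetical helper): print the two commands of
# step 2.3 for one BAM file, without executing them.
bam_to_count_cmds() {
    bam="$1"
    base="${bam%.bam}"   # strip the .bam suffix to derive output names
    echo "bamToBed -i $bam > $base.bed"
    echo "$SEQEXPRESS count $base.bed $ANNOTATION $base.count"
}

# Example with two hypothetical BAM files.
for f in Liver_1.bam Kidney_1.bam; do
    bam_to_count_cmds "$f"
done
```

After checking the printed commands, the output can be piped to sh to execute them.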

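      3. Identify differentially spliced genes by calculating the NB-statistic. List all case ".count" files after the case group size, then all control ".count" files after the control group size.

           Rscript DSGNB.R <case_group_size> <case_filenames...> <control_group_size> <control_filenames...> <output>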

           Example: Rscript DSGNB.R 7 ./Liver_1.count ./Liver_2.count ./Liver_3.count ./Liver_4.count ./Liver_6.count ./Liver_7.count ./Liver_8.count 7 ./Kidney_1.count ./Kidney_2.count ./Kidney_3.count ./Kidney_4.count ./Kidney_6.count ./Kidney_7.count ./Kidney_8.count ./Liver_Kidney
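
Because each group size must match the number of files that follow it, it is easy to miscount when typing the command by hand. The sketch below assembles the command line in the argument order shown in the example above and derives the control group size from the remaining files. The function name dsgnb_cmd and its calling convention are hypothetical; it only prints the command and does not run DSGNB.R.

```shell
#!/bin/sh
# dsgnb_cmd (hypothetical helper): print a DSGNB.R command line.
# Usage: dsgnb_cmd <output> <case_group_size> <case files...> <control files...>
# The control group size is computed as the number of remaining files.
dsgnb_cmd() {
    out="$1"
    n_case="$2"
    shift 2
    n_ctrl=$(( $# - n_case ))
    cmd="Rscript DSGNB.R $n_case"
    i=0
    for f in "$@"; do
        # Insert the control group size just before the first control file.
        if [ "$i" -eq "$n_case" ]; then
            cmd="$cmd $n_ctrl"
        fi
        cmd="$cmd $f"
        i=$((i + 1))
    done
    echo "$cmd $out"
}

# Two case files and two control files, named as in the example above.
dsgnb_cmd ./Liver_Kidney 2 ./Liver_1.count ./Liver_2.count ./Kidney_1.count ./Kidney_2.count
```

The helper does not handle file names that contain spaces.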

 

 

Output:

      The output file reports the differences in the relative abundance of the isoforms of each gene in the annotation.

      Each file has the following format:

      Column  Column name      Example                                  Description
      1       gene_name        ENSA                                     A unique ID for each gene.
      2       ID               NM_207042                                The gene_name or one of the transcript IDs used for testing.
      3       num_exons        8                                        The number of mathematical exons.
      4       exons_length     737,2363,167,48,420,126,73,209,          The length of each mathematical exon.
      5       gene_count_mean  380.785714285714                         The average read count of the gene across all samples.
      6       is_filtered      FALSE                                    FALSE (the gene has enough read counts for testing) or TRUE (not enough for testing).
      7       NB_stat          51.4237559767432                         Difference in the relative abundance of the isoforms of the gene between the two groups of samples.
      8       NB_stat_exon     156.2,88.39,36.59,NA,0.06,8.69,NA,18.61  Differences in the relative abundance of each exon of the gene between the two groups of samples.
      9       diff_exon        1                                        The position (order) of the exon with the most significant difference.
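
A common follow-up is to list the testable genes ordered by NB_stat. The sketch below is one way to do that with awk and sort. It assumes the output file is tab-separated with the nine columns in the order described above; this README does not state the delimiter, so check your file first. The function name rank_dsg is hypothetical.

```shell
#!/bin/sh
# rank_dsg (hypothetical helper): print gene_name and NB_stat for genes
# with is_filtered == FALSE, largest NB_stat first.
# Assumption: tab-separated columns in the order documented above.
# A header line, if present, is skipped because its sixth field is not "FALSE".
rank_dsg() {
    awk -F '\t' '$6 == "FALSE" { print $1 "\t" $7 }' "$1" |
        sort -t "$(printf '\t')" -k2,2gr
}

# Usage (replace the placeholder with your DSGNB.R output file):
#   rank_dsg <output_file> | head -n 20
```

The -g key option of sort is used instead of -n so that values written in scientific notation are ordered correctly.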