Step 1: Map reads to the reference genome allowing long gaps

# (1) simulate 100,000 100-nt paired-end reads from chr22 (the minimum exon is set to 50-nt long).
# ./output_seqsaw100_2/test_1.fa and ./output_seqsaw100_2/test_2.fa will be generated.

[user@node18 seqsaw]$ cd examples
[user@node18 examples]$ mkdir output_seqsaw100_2
[user@node18 examples]$ ../bin/simulateReads_long_reads.pl ./chr22.fa 100 100000 50 ./output_seqsaw100_2/test


# (2) run SeqSaw with configuration file example100_PE.cfg (provided as an example in example directory)
# map simulated spliced reads to the reference chr22.fa

[user@node18 examples]$ ../bin/seqsaw mapper example100_PE.cfg


# (3) Or you can run SeqSaw with 6 CPU cores simultaneously.

[user@node18 examples]$ ../bin/runSeqSaw.pl mapper example100_PE.cfg 6


#(4) Or you can run SeqSaw with 6 servers using 8 CPU cores on each server (these servers should be in a cluster).

[user@node18 examples]$ ../bin/runSeqSaw2.pl mapper example100_PE.cfg 6 8

Total 6 shell scripts (server0.sh, server1.sh, ...) will be generated under current path.
Please start a new session on each server and run server*.sh respectively!

Note: The simulated reads are used to show the usage of SeqSaw. However, the "sequencing depth" is too low to predict junctions.


Step 2: Predict splice junctions upon the mapping results

# (1) perform SeqSaw on the file test.sam containing mapping results
# the alignments with mismatches more than 1 are not used to predict junctions
# output is raw_junctions.bed in BED format

[user@node18 examples]$ ../bin/seqsaw predictor chr22.fa test.sam raw_junction.bed 1


# (2) filter the raw junctions with two different criteria

[user@node18 examples]$ ../bin/seqsaw filter1 raw_junctions.bed best_junctions_f1.bed
[user@node18 examples]$ ../bin/seqsaw filter2 raw_junctions.bed best_junctions_f2.bed

Note: This step can filter out quite a few junctions, which is necessary to achieve high specificity.
           Filter1 is for single-end reads and filter2 is for paired-end reads.


Step 3: Remap reads only allowing spanning predicted junctions (optional)

# map RNA-seq reads with junction database
# In this optional step, we remap the RNA-seq reads only allowing spanning predicted junctions

[user@node18 examples]$ ../bin/seqsaw remapper example100_PE.cfg best_junctions_f2.bed

# Or you can run SeqSaw with 6 CPU cores simultaneously.

[user@node18 examples]$ ../bin/runSeqSaw.pl remapper example100_PE.cfg best_junctions_f2.bed 6


Input and output format

# a file in FASTA format that contains all chromosomes of the reference genome

File name: hg18.fa (fasta)
Content:
>chr1
...
...
>chr2
...
...

# files in FASTA format that contain sequencing reads (the order of reads in each file must be the same for paired-end reads)

File name: test_1.fa (fasta)
Content:
>HWUSI-EAS627_1:5:1:43:586/1
CCACACCCTGCTAATTTTTGTATTTTTAGTAGAGACAGAGTTTCACCCAGGGCTGCAGATGGAGGTGTGAGCCAC
>HWUSI-EAS627_1:5:1:43:682/1
TGAGTGAGCCCTACGACACTGGCCAATCAGAGCAGGCTGGGCTGGGGAGGCAGGAGGAAGAGGCATCCCCCTCCC
...

File name:test_2.fa (fasta)
Content:
>HWUSI-EAS627_1:5:1:43:586/2
GCATTCGCCTCCACGCCTAAGGGTGAGGGGCGGGGAAGAGGCGGCCCTGCAGGGGAGGCAGAAGCAAAAGGAGGC
>HWUSI-EAS627_1:5:1:43:682/2
GATCAACCCTCCAGAGTGTGTCATCTGACCTGGACCCTCAGGCCACTCCTGCTTCATGGCACTTAGCACTGAGGG
...

# mapping results in SAM format
Example:
HWUSI-EAS627_1:5:1:43:484/2 131 chr11 62054285 255 25M768N37M267794N13M = 62055160 360190 TGGGCATGCTGAACTTGGGCATTTTCATCTTGGGTATTTTCAGGCTCCAGTCCCGGCCTTGGCCCTCCCCCCCTT * MD:Z:C52A12G NM:i:3 NH:i:10 XS:Z:|0:2|0:1|

# optional field: XS:Z:|left1:right1|left2:right2|
# valid sliding lengthes to left and right for gaps (see Section 4 in Method.pdf for more details)
# Note: In this example, the read span two putative splice junctions (i.e., with two gaps).

# splice junctions in BED format
Example:
track name=junctions description="SeqSaw junctions" itemRgb="On"
chr1 2040 2507 (0/0/1/1/4/6/2/1)[13_38~0:1]:3 50 + 2040 2507 0,0,0 2 50,50 0,417
chr1 2040 2525 (0/1/1/1/4/6/2/1)[28_30~0:2]:3 50 + 2040 2525 0,0,0 2 50,50 0,435

#Note: The fourth field contains reliability factors for optional filtering. The format is as follows:

(nUM/nBR/nNR/nR/nUp/nDown/nNb/PE)[LW_RW~LS:RS]:SK


Reliability factors for optional filtering

nUM: number of uniquely mapped reads that support (i.e. span) this junction
nBR: number of reads that support this junction with their best alignment
nNR: number of non-redundant supporting reads *
* While several reads support the same junction with the same cover length on each side, this number will only increase one.
nR: number of reads supporting this junction
nUp: number of reads that are mapped to the 40-bp upstream adjacent region of this junction
nDown: number of reads that are mapped to the 40-bp downstream adjacent region of this junction
nNb: 0: There are no mapped reads both at the 80-bp upstream region and the 80-bp downstream region.
1: There are mapped reads at the 80-bp upstream region or the 80-bp downstream region, but not both.
2: There are mapped reads both at the 80-bp upstream region and the 80-bp downstream region.
PE: Is this junction supported by any paired-end reads? 1: TRUE, 0: FALSE.
LW: the number of nucleotides on the left side of the junction that are covered by any supporting read
RW: the number of nucleotides on the right side of the junction that are covered by any supporting read
LS: valid sliding length to left (see Section 4 in Method.pdf for more details)
RS: valid sliding length to right (see Section 4 in Method.pdf for more details)
SK: 0: non-canonical junctions, 1: AT-AC splicing sites, 2:GC-AG splicing sites, 3:GT-AG splicing sites.

Anyone can use the source codes and executable files of SeqSaw free of charge for non-commercial use. Valid XHTML 1.0 Strict Valid CSS!