ulfasQTL: an Ultra-Fast Method of Splicing QTL Analysis

 

ulfasQTL is an ultra-fast method of splicing QTL analysis. In this method, it test the relationship between gene splicing ratios and gene variants in a way of matrix form. So the program can analysis at speed thousands times faster than existing method. Following is the detail information and usage of this program. You can also found them in READ.ME.

The latest version is v0.1. You can find the software from here.

 

Function:

The function of this program is to test the correlation between gene splicing ratios and gene variants.

input: transcript expression, genotype file.

output: significantly correlated gene-snp pairs(q value<0.001).

 

Installation:

You can download the R package from here and run the command in R:

install.packages(<source_code_file>, repos=NULL, type="source")

Prerequisite: Rcpp, qvalue, MatrixEQTL, vegan

 

Usage:

ulfasQTL:: preprocess.gene (originalFileName)

This function preprocesses the transcripts expression file, for example remove low expressed gene. It takes transcripts expression file as input and will write a file named "gene_after_preprocess" to the disk. This part use some functions of sQTLseekeR, written by Jean Monlong.

 

ulfasQTL::findSqtl(genoTypeFileName, geneObjectName = "gene_after_preprocess", snpFileCate = 1)

This function changes the transcript expressions to sphere axis and calls MatrixEQTL to get the z statistics of transcript-snp pairs. Then It calculates s statistic of gene-snp pairs as well as p-values of each pairs. It takes genotype file and "gene_after_preprocess" file produced by the preprocess.gene() as input and will output a data.frame which reports significantly correlated gene-snp pairs. The third parameter indicates which kind of snp file format is input, you can check it in the following part.

 

Input format:

transcript expression file:

The transcript expression file should be in the format that the first two columns are "trId" and "geneId", and followed with sample names. For example:

trId  geneId      HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00104 HG00105 HG00106         HG00108 HG00109 HG00110

ENST00000450668.1        ENSG00000233664.1       0       0       0       0       0       0       0       0.12907   0       0       0       0       0

ENST00000423995.1        ENSG00000220785.3       0.00114   0.29424   0.016095 0.15047   0       0.07491   0.2555      0.09586         0.02413   0.35814   0.28826   0.048335 0.07126

ENST00000426597.2        ENSG00000220785.3       0       0       0       0       0       0       0       0.02873   0       0       0       0.024525 0         0

 

genotype file:

The genotype file can be in two formats. The first one’s columns should be "chr","start","ID" and sample names. For example:

chr   START       ID     HG00105 HG00107 HG00115 HG00132 HG00145 HG00157 HG00181 HG00308 HG00365 HG00371         HG00379 HG00380 HG01789

1       10583       snp_1_10583   1       1       -1      2       1       0       0       1       1       1       2       0       -1

1       10611       snp_1_10611   0       0       1       2       1       0       2       1       1       1       0       0       1

1       13302       snp_1_13302   1       1       2       1       2       1       2       0       0       1       2       0       1

In this format, -1 stands for unknown genotype, and 0,1,2 represent three different kinds of genotype. If you choose this format, snpFileCate in findSqtl should be 1.

 

Or you can directly use data download from 1000 Genomes Project as input. An example of genotype in the project is as follow:

#CHROM POS  ID     REF  ALT   QUAL        FILTER      INFO         FORMAT  HG00105 HG00107 HG00115 HG00132 HG00145         HG00157 HG00181 HG00308 HG00365 HG00371

1       10583       snp_1_10583   G      A       100.00      PASS AA=.;AC=314;AF=0.14;AFR_AF=0.04;AMR_AF=0.17;AN=2184;ASN_AF=0.13;AVGPOST=0.7707;DAF_GLOBAL=.;ERATE=0.0161;EUR_AF=0.21;GERP=.;LDAF=0.2327;RSQ=0.4319;SNPSOURCE=LOWCOV;THETA=0.0046;VT=SNP;ANNOTATION_CLASS=ACTIVE_CHROM;CELL=GM12878;CHROM_STATE=15         GT:GL:DS:PP:BD       .:.:.:.:.       0|0:-0.18,-0.47,-2.42:0.200:.:.        0|0:-0.24,-0.44,-1.16:0.150:.:.         0|0:-0.15,-0.54,-3.12:0.150:.:.        0|1:-0.48,-0.48,-0.48:0.600:.:.        0|0:-0.48,-0.48,-0.48:0.550:.:.         0|1:-1.92,-0.01,-2.50:0.950:.:.        0|0:-0.05,-0.93,-5.00:0.050:.:.        0|0:-0.11,-0.66,-4.22:0.100:.:.         0|1:-0.28,-0.43,-0.96:0.550:.:.

The program will treat 0|0 as 0, 0|1 or 1|0 as 1, and 1|1 as 2. If you choose this format, snpFileCate in findSqtl should be 2.

 

Test data:

You can find test data here.

Transcript expression file

Genotype file (format 1)

Genotype file (format 2)

You can get the result by simply enter the commands in R:

ulfasQTL::preprocess.gene("gene_example")

ulfasQTL::findSqtl("snp.example.1", snpFileCate=1)  Or  ulfasQTL::findSqtl("snp.example.2",snpFileCate=2)