nuMetaSim (non-uniform Metagenomic sequencing Simulation system)

Created by Shansong Liu, Kui Hua, Sijie Chen, Xuegong Zhang, MOE Key Lab of Bioinformatics, Bioinformatics Division, TNLIST and Department of Automation, Tsinghua University, Beijing 100084, China

Introduction

nuMetaSim is a sequencing data simulation system designed for comprehensive metegenome data simulation by Illumina platform, including nearly all possible features such as sequencing error, quality score, coverage bias, contamination source like human and unknown reads in realistic human microbiome samples.

nuMetaSim is developed with Python. Both single-end and pair-end data generation are supported using nuMetaSim.py and nuMetaSim_pe.py, respectively. Other related tools are also provided in scripts in the nuMetaSim/tools/ folder.

The author suggests running nuMetaSim in a widely-used Python distribution – Anaconda, because most of nuMetaSim’s module dependencies are satisfied in Anaconda environment. The only required missing third-party package that you need to install manually is intervaltree. It can be downloaded at https://pypi.python.org/pypi/intervaltree.

nuMetaSim is free for academic use. If you would like to use it for commercial purposes, please contact zhangxg@tsinghua.edu.cn.

Usage

Main scripts

Note:nuMetaSim_pe.py usage is nearly the same with nuMetaSim.py. It only has one more parameter -g, "gap_length", which controls the insertion length in pair-end simulated data, than nuMetaSim.py. So here only nuMetaSim.py will be introduced.

Generate fasta data (no sequencing error allowed)

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastaFile2Write_prefix -n read_number -L read_length

Generate fastq data with sequencing error based on a quality score profile

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile

Generate fastq data with sequencing error and a specific sequencing error rate

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile --err_adj_flag 1 --err_rate 0.01

Generate fastq data with user-input read distribution table

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile --dis_flag 1 --distri read_distribution_table

Generate fastq data with human reads and unknown reads using even relative abundance

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile --h_flag 1 --h_ref_dir human_reference_folder_path --h_ratio 0.1 --unk_flag 1 --unk_ref_dir unknown_reference_folder_path unk_ratio 0.1

Generate fastq data with human reads and unknown reads using user-defined relative abundance

python nuMetaSim.py --index microbes_index --propor microbes_proportion_table -F fastqFile2Write_prefix -n read_number -L read_length --fa_fq_flag 1 --qual quality_score_profile --h_flag 1 --h_ref_dir human_reference_folder_path --h_ratio 0.1 --h_index human_reference_index --h_propor human_proportion_table --unk_flag 1 --unk_ref_dir unknown_reference_folder_path unk_ratio 0.1 --unk_index unknown_reference_index --unk_propor unknown_proportion_table

Other parameters

Tool scripts

tools/index

microbe_index.py
python microbe_index.py -d reference_genome_folder

The output file is named as index, storing in reference_genome_folder/../genome_index/.

humanOrUnknown_index.py
python humanOrUnknown_index.py -d reference_sequence_folder

The output file is named as noisy_data_index, storing in reference_sequence_folder/../genome_index/.

tools/read distribution

Direct mapping
cal_genome_distribution.py
python cal_genome_distribution.py --fq_path fastq_sample_path --geo_path reference_genome_folder -o output_folder -b bin_size

The output file is named as distribution.txt, storing in output_folder.

GC-content-based coverage bias
cal_GC_coverage_relation.py
python cal_GC_coverage_relation.py --fq_path single_genome_sequencing_samples --geo_path corresponding_reference_genomes -o output_folder -b bin_size

The output file is named as GC_coverage.txt, storing in output_folder.

polyfit_GC_coverage.py
python polyfit_GC_coverage.py -f GC_content_based_coverage_bias -p polyfit_order

The output fitting coefficient file is named as polyfit_coefficient.txt, storing in output_folder. A fitting curve named as “GC_coverage.pdf” is generated in output_folder too.

cal_genome_distribution_by_GC.py
python cal_genome_distribution_by_GC.py -f GC_fitting_coefficient --geo_path reference_genome_folder -b bin_size

The output file is named as distribution.txt, storing in the same path where the fitting coefficient file is saved.

tools/unknown reference

genome_shuffle.py
python genome_shuffle.py -d reference_genome_folder
cal_genome_variation.py
python cal_genome_variation.py --ref reference_genome --qry query_genome -d output_folder

The output file is named as genome_variation.txt, storing in output_folder.

gen_genome_strain.py
python gen_genome_strain.py --geo_va genome_variation_file --geo genome_file -o output_folder -m running_mode_flag -t variation_rates -n genome_number --args normal_distribution_arguments -v variable_random_seed_flag -r random_seed

tools/others

extract_quality_score_distribution.py
python extract_quality_score_distribution.py -i real_sequencing_sample -o output_folder

The output file is named as quality_score.txt, storing in output_folder.

change_read_header.py

This tool script is needed when the benchmarked software adopted by the researcher needs a distinct header in each read header information line.

python change_read_header.py -i input_sample_file -o output_file_name -l fasta_fastq_flag