SGSGeneLoss

From Applied Bioinformatics Group

What does SGSGeneLoss depend on?

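SGSGeneLoss depends on the following:

  • Java, to run SGSGeneLoss.jar
  • R, to run the plotting scripts
  • picard-tools (SGSGeneLoss was tested with picard-tools 1.89)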

Download

  • Latest Version 0.1 (29/04/2014):
    • SGSGeneLoss.v0.1.tar.gz should contain
      • three main programs: SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R
      • readme file
      • folder with source code

For brevity, the rest of this manual refers to SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R and graph_circles.v0.1.R as SGSGeneLoss.jar, graph_chromosomes.R and graph_circles.R.

When you run the programs you have to use the full file names: SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R

How to install?

  • Download SGSGeneLoss.tar.gz
  • Unpack SGSGeneLoss.tar.gz and place SGSGeneLoss.jar and all the R scripts in a directory (or directories) of your choice, for example ./my_geneloss
  • Move into ./my_geneloss and create the SGSGeneLoss_lib directory (on Linux: cd ./my_geneloss, then mkdir SGSGeneLoss_lib)
    • The name of the lib directory is the name of the .jar file without .jar extension + _lib, so if you are using SGSGeneLoss.v0.1.jar the lib directory is: SGSGeneLoss.v0.1_lib
    • The lib directory has to be in the same folder as the .jar file
  • Download picard-tools (SGSGeneLoss was tested with picard-tools 1.89)
  • Place picard-1.89.jar and sam-1.89.jar in ./my_geneloss/SGSGeneLoss_lib
  • Now you are ready to run SGSGeneLoss
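
The naming rule for the lib directory can be checked with a short script before the first run. The following is a minimal Python sketch, not part of SGSGeneLoss; the function names lib_dir_for and missing_libraries are made up for illustration, and the two required file names are the ones listed in the steps above.

```python
from pathlib import Path

# File names taken from the installation steps above.
REQUIRED_JARS = ("picard-1.89.jar", "sam-1.89.jar")


def lib_dir_for(jar_path):
    """Return the lib directory expected next to the given .jar file."""
    jar = Path(jar_path)
    # Path.stem drops only the final ".jar", so "SGSGeneLoss.v0.1.jar"
    # becomes "SGSGeneLoss.v0.1" and the version number is kept.
    return jar.with_name(jar.stem + "_lib")


def missing_libraries(jar_path, required=REQUIRED_JARS):
    """List the required .jar files that are not yet in the lib directory."""
    lib_dir = lib_dir_for(jar_path)
    return [name for name in required if not (lib_dir / name).is_file()]
```

An empty list from missing_libraries("./my_geneloss/SGSGeneLoss.v0.1.jar") would indicate that both picard files are where the steps above say they should be.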

Input and output files for SGSGeneLoss.jar

  • Input files:
    • Sorted, indexed .bam file with sequencing reads mapped to the reference genome sequence; multiple .bam files can be provided as a comma separated list
    • GFF3 file with the reference genome annotation; it has to contain gene, mRNA and exon fields
  • Output files
    • Result files for each chromosome separately - .excov
    • File with overall stats - stats.txt
    • File with summary for all the chromosomes used - chrs.csv (this file is used by one of the R scripts)
    • File with list of genes lost for all the chromosomes - graph.csv (this file is used by one of the R scripts)

Command line options for SGSGeneLoss.jar

Required:

bamPath - path to the directory with your .bam file(s), has to end with / or \, for example bamPath=/home/my_bams/

bamFileList - a single .bam file or a comma separated list of files (file names only); the .bam files and the corresponding .bai files have to be in the directory given in bamPath, for example bamFileList=bam1.bam,bam2.bam

gffFile - location of the gff3 file, for example gffFile=/home/my_gffs/annot.gff3

outDirPath - location of the output directory, has to end with / or \, for example outDirPath=/home/my_results/

Optional:

minCov - minimal coverage threshold to consider position covered [minCov=1]

chromosomeList - comma separated list of chromosomes to be used for analysis; use all to analyse all chromosomes [chromosomeList=all]

lostCutoff - coverage cutoff to consider gene as lost for calculating stats [lostCutoff=0.0]

covCats - coverage categories for visualization [covCats=0,10,20,30,40,70]

extendedFmt - use extended format, additional info is included in output files [regular format]

To see help run: java -jar SGSGeneLoss.jar help

Sample command

  • Move into directory where SGSGeneLoss.jar is
  • Please make sure that the directory paths you supply (bamPath, outDirPath) end with / or \
java -Xmx4g -jar SGSGeneLoss.jar bamPath=/home/uqagnieszka/bams/ bamFileList=arabidopsis.sorted.bam gffFile=/home/gff_files/Athaliana_167_gene_exons.gff3 outDirPath=/home/uqagnieszka/results/ chromosomeList=all
java -Xmx4g -jar SGSGeneLoss.jar bamPath=/home/uqagnieszka/bams/ bamFileList=arabidopsis.sorted.bam,arabidopsis2.sorted.bam gffFile=/home/gff_files/Athaliana_167_gene_exons.gff3 outDirPath=/home/uqagnieszka/results/ chromosomeList=Chr1,Chr2 minCov=2 lostCutoff=0.05 covCats=0,2,5,10,20 extendedFmt
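
When many samples are processed, it can help to assemble the command in a script and check the trailing / or \ before starting Java. The following is a minimal Python sketch under that assumption; build_geneloss_command is a hypothetical helper, not something shipped with SGSGeneLoss, and it only builds the argument list from the options described above.

```python
def build_geneloss_command(jar, bam_path, bam_files, gff_file, out_dir_path,
                           java_memory="4g", extended_fmt=False, **optional):
    """Assemble the argument list for one SGSGeneLoss run."""
    # The manual requires directory paths to end with / or \.
    for name, value in (("bamPath", bam_path), ("outDirPath", out_dir_path)):
        if not value.endswith(("/", "\\")):
            raise ValueError(f"{name} has to end with / or \\: {value!r}")

    command = [
        "java", f"-Xmx{java_memory}", "-jar", jar,
        f"bamPath={bam_path}",
        "bamFileList=" + ",".join(bam_files),
        f"gffFile={gff_file}",
        f"outDirPath={out_dir_path}",
    ]
    # Optional settings such as minCov=2 or covCats="0,2,5,10,20" are
    # passed through unchanged as key=value pairs.
    command += [f"{key}={value}" for key, value in optional.items()]
    if extended_fmt:
        command.append("extendedFmt")  # a bare flag without a value
    return command
```

The returned list can be passed to subprocess.run(command, check=True) from the directory where the .jar file is located.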

Output files format

All the output files are comma separated text files.

  • .excov files - files with results for each chromosome (files use chromosome names as in .bam files), files come in two formats basic (default) or extended (extendedFmt)
    • basic format: chromosome,ID,is_lost,start_position,end_postion,frac_exons_covered,frac_gene_covered,ave_cov_depth_exons,cov_cat,ave_cove_depth_gene
    • extended format: contains additional columns with information about each of the exons
  • stats.csv - file with summary information about all genes
  • chrs.csv - file with summary information about chromosomes
    • chr,start,end,len
  • graph.csv - file with list of genes lost as determined by lostCutoff
    • chr,id,start,end
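
Because the output files are plain comma separated text, they can be inspected with a few lines of code. The following Python sketch counts the genes listed in graph.csv for each chromosome. It assumes that the first line of graph.csv is the header chr,id,start,end, which this manual does not state explicitly; the function name and the example rows are made up for illustration.

```python
import csv
import io
from collections import Counter


def lost_genes_per_chromosome(handle):
    """Count the rows of a graph.csv file for each chromosome."""
    # Assumption: the first line is the header chr,id,start,end.
    reader = csv.DictReader(handle)
    return Counter(row["chr"] for row in reader)


# Made-up rows, for illustration only.
example = io.StringIO(
    "chr,id,start,end\n"
    "Chr1,geneA,100,900\n"
    "Chr1,geneB,1500,2300\n"
    "Chr2,geneC,50,700\n"
)
counts = lost_genes_per_chromosome(example)
print(dict(counts))
```

For a real run, pass open("graph.csv", newline="") instead of the in-memory example.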

Plotting results

Results are visualized using R scripts.

Two ways of visualization are possible:

  • results per chromosome
  • results for all chromosomes as a circular graph

Results per chromosome:

What you need:

  • script graph_chromosomes.R
  • .excov files (either basic or extended) with results from SGSGeneLoss.jar: Chr1.excov, Chr2.excov etc.
  • directory (location) where files with results from SGSGeneLoss.jar: Chr1.excov, Chr2.excov etc. can be found

graph_chromosomes.R takes three arguments in this order:

1. location of directory where .excov files are located

2. gene loss cutoff

3. output path ending with /

Rscript --vanilla graph_chromosomes.R /home/uqagnieszka/results 0.0 /home/uqagnieszka/graphs/

Summary results for all chromosomes, possibly multiple samples:

What you need:

  • script graph_circles.R
  • graph.csv from SGSGeneLoss.jar run
  • chrs.csv from SGSGeneLoss.jar run
  • file assigning a numeric order to chromosomes (needed because some chromosomes have complicated names and sorting in ASCII order does not always work); chromosome names will be replaced by the corresponding numbers. The file should look like this:
chr,no
chr1,1
chr2,2
chr10,10
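
For genomes with many chromosomes the order file can be generated rather than typed by hand. The following is a minimal Python sketch, assuming you already know which number each chromosome should get; chromosome_order_csv is a hypothetical helper name and is not part of the R scripts.

```python
def chromosome_order_csv(order):
    """Render a name -> number mapping as the chromosome order file."""
    lines = ["chr,no"]
    # Sort by the assigned number, not by name, so chr10 follows chr2.
    for name, number in sorted(order.items(), key=lambda item: item[1]):
        lines.append(f"{name},{number}")
    return "\n".join(lines) + "\n"


text = chromosome_order_csv({"chr10": 10, "chr1": 1, "chr2": 2})
print(text, end="")
```

Writing the returned text to a file, for example chrs_order.csv, gives a file in the layout shown above.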

graph_circles.R takes five arguments in this order:

1. file with chromosome info - chrs.csv from SGSGeneLoss.jar run

2. file with chromosome order

3. file with genes lost - graph.csv from SGSGeneLoss.jar run; it can be a comma separated list of multiple files (for example one file per sample). Circles are drawn in the order of the files, starting from the inside: for graph1.csv,graph2.csv,graph3.csv the innermost circle is graph1.csv and the outermost circle is graph3.csv

4. output path ending with /

5. output file name

Rscript --vanilla graph_circles.R chrs.csv chrs_order.csv graph1.csv,graph2.csv,graph3.csv /home/results/graphs/ out.png
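
The five arguments can also be assembled in a script, which keeps the order of the circles explicit when several samples are plotted. The following is a minimal Python sketch; build_circles_command is a hypothetical helper that only builds the Rscript call described above.

```python
def build_circles_command(script, chrs_file, order_file, graph_files,
                          out_path, out_file):
    """Assemble the Rscript call for graph_circles.R."""
    if not out_path.endswith("/"):
        raise ValueError(f"output path has to end with /: {out_path!r}")
    # graph_files is ordered from the innermost circle outwards.
    return [
        "Rscript", "--vanilla", script,
        chrs_file, order_file, ",".join(graph_files), out_path, out_file,
    ]
```

As with the Java command, the returned list can be passed to subprocess.run(command, check=True).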

FAQ

  • If memory consumption is a problem please consider increasing -Xmx or splitting your .bam files
  • Please cite Golicz, A.A., Martinez, P.A., Zander, M., Patel, D.A., Van De Wouw, A.P., Visendi, P., Fitzgerald, T.L. et al. (2015) Gene loss in the fungal canola pathogen Leptosphaeria maculans. Funct. Integr. Genomics, 15, 189–196.
