SGSGeneLoss

From Applied Bioinformatics Group
Revision as of 03:11, 18 June 2014 by Agnieszka (talk | contribs)

What does SGSGeneLoss depend on?

SGSGeneLoss depends on the following:

Download

  • Latest Version 0.1 (29/04/2014):
    • SGSGeneLoss.v0.1.tar.gz should contain
      • three main programs: SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R
      • readme file
      • folder with source code

From now on in this manual SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R are referred to as SGSGeneLoss.jar, graph_chromosomes.R, graph_circles.R

To run the programs you have to use full names SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R

How to install?

  • Download SGSGeneLoss.tar.gz
  • Unpack SGSGeneLoss.tar.gz and place SGSGeneLoss.jar and all the R scripts in chosen directory/directories, for example ./my_geneloss
  • Move into ./my_geneloss and create the SGSGeneLoss_lib directory (on Linux: cd ./my_geneloss, then mkdir SGSGeneLoss_lib)
    • The name of the lib directory is the name of the .jar file without .jar extension + _lib, so if you are using SGSGeneLoss.v0.1.jar the lib directory is: SGSGeneLoss.v0.1_lib
    • The lib directory has to be in the same folder as the .jar file
  • Download picard-tools (SGSGeneLoss was tested with picard-tools 1.89)
  • Place picard-1.89.jar and sam-1.89.jar in ./my_geneloss/SGSGeneLoss_lib
  • Now you are ready to run SGSGeneLoss
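
The lib directory naming rule above can be checked with a short script before the first run. This is only a sketch: the function names lib_dir_for and missing_libs are made up for illustration, and the two required file names are the picard-tools 1.89 jars listed in the steps above.

```python
from pathlib import Path


def lib_dir_for(jar_path):
    """Return the lib directory expected next to the given .jar file."""
    jar = Path(jar_path)
    if jar.suffix != ".jar":
        raise ValueError(f"not a .jar file: {jar}")
    # Path.stem drops only the last extension:
    # "SGSGeneLoss.v0.1.jar" -> "SGSGeneLoss.v0.1"
    return jar.with_name(jar.stem + "_lib")


def missing_libs(jar_path, required=("picard-1.89.jar", "sam-1.89.jar")):
    """List the required library jars that are not present in the lib directory."""
    lib_dir = lib_dir_for(jar_path)
    return [name for name in required if not (lib_dir / name).is_file()]


if __name__ == "__main__":
    jar = "./my_geneloss/SGSGeneLoss.v0.1.jar"
    print("expected lib directory:", lib_dir_for(jar))
    print("missing libraries:", missing_libs(jar))
```

An empty "missing libraries" list means both jars were found in the expected place.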

Input and output files for SGSGeneLoss.jar

  • Input files:
    • Sorted, indexed .bam file with sequencing reads mapped to the reference genome sequence; multiple .bam files can be provided as a comma separated list
    • Gff3 file with reference genome annotation, has to contain gene, mRNA and exon fields
  • Output files
    • Result files for each chromosome separately - .excov
    • File with overall stats - stats.txt
    • File with summary for all the chromosomes used - chrs.csv (this file is used by one of the R scripts)
    • File with list of genes lost for all the chromosomes - graph.csv (this file is used by one of the R scripts)

Command line options for SGSGeneLoss.jar

Required:

bamPath - path to the directory with your bam file/files, has to end with / or \. Example: bamPath=/home/my_bams/

bamFileList - a single .bam file or a comma separated list (no spaces), file names only; the bam files and the corresponding .bai files have to be in the directory provided in bamPath. Example: bamFileList=bam1.bam,bam2.bam

gffFile - location of the gff3 file. Example: gffFile=/home/my_gffs/annot.gff3

outDirPath - location of the output directory, has to end with / or \. Example: outDirPath=/home/my_results/
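
The requirements on bamPath and bamFileList can be verified before starting a long run. The following is a minimal sketch, not part of SGSGeneLoss: check_bam_inputs is a hypothetical helper, and it assumes the index is named either sample.bam.bai or sample.bai (both conventions are in common use; the manual does not say which one SGSGeneLoss expects).

```python
import os


def check_bam_inputs(bam_path, bam_file_list):
    """Return a list of problems found in the bamPath / bamFileList values."""
    problems = []
    if not bam_path.endswith(("/", "\\")):
        problems.append("bamPath must end with / or \\")
    for name in bam_file_list.split(","):
        # Spaces around a comma would split the argument on the command line.
        if not name or name != name.strip():
            problems.append(f"bad entry in bamFileList: {name!r}")
            continue
        # bamFileList takes file names only, not paths.
        if os.path.basename(name) != name:
            problems.append(f"file name expected, got a path: {name!r}")
            continue
        bam = os.path.join(bam_path, name)
        if not os.path.isfile(bam):
            problems.append(f"missing .bam file: {bam}")
        # Assumption: the index sits next to the bam under one of these names.
        candidates = [bam + ".bai", os.path.splitext(bam)[0] + ".bai"]
        if not any(os.path.isfile(c) for c in candidates):
            problems.append(f"missing .bai index for: {bam}")
    return problems


if __name__ == "__main__":
    for problem in check_bam_inputs("/home/my_bams/", "bam1.bam,bam2.bam"):
        print(problem)
```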

Optional:

minCov - minimal coverage threshold to consider position covered [minCov=1]

chromosomeList - comma separated list of chromosomes to be used for analysis; use all for all chromosomes [chromosomeList=all]

lostCutoff - coverage cutoff to consider gene as lost for calculating stats [lostCutoff=0.0]

covCats - coverage categories for visualization [covCats=0,10,20,30,40,70]

extendedFmt - use extended format, additional info included in output files [regular format]

To see help run: java -jar SGSGeneLoss.jar help

Sample command

  • Move into directory where SGSGeneLoss.jar is
  • Please make sure that all your supplied paths end with / or \
java -Xmx4g -jar SGSGeneLoss.jar bamPath=/home/uqagnieszka/bams/ bamFileList=arabidopsis.sorted.bam gffFile=/home/gff_files/Athaliana_167_gene_exons.gff3 outDirPath=/home/uqagnieszka/results/ chromosomeList=all

java -Xmx4g -jar SGSGeneLoss.jar bamPath=/home/uqagnieszka/bams/ bamFileList=arabidopsis.sorted.bam,arabidopsis2.sorted.bam gffFile=/home/gff_files/Athaliana_167_gene_exons.gff3 outDirPath=/home/uqagnieszka/results/ chromosomeList=Chr1,Chr2 minCov=2 lostCutoff=0.05 covCats=0,2,5,10,20 extendedFmt
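
When SGSGeneLoss is run from a pipeline, the command can be assembled programmatically so that list values never pick up stray spaces. This sketch only builds the argument list in the key=value form shown above; build_command and its parameter names are hypothetical, and the option names are the ones documented in this manual.

```python
def build_command(jar, bam_path, bam_files, gff_file, out_dir,
                  heap="4g", extended=False, **options):
    """Build the java command line for SGSGeneLoss as a list of arguments."""
    for label, value in (("bamPath", bam_path), ("outDirPath", out_dir)):
        if not value.endswith(("/", "\\")):
            raise ValueError(f"{label} has to end with / or \\: {value}")
    cmd = [
        "java", f"-Xmx{heap}", "-jar", jar,
        f"bamPath={bam_path}",
        "bamFileList=" + ",".join(bam_files),  # comma separated, no spaces
        f"gffFile={gff_file}",
        f"outDirPath={out_dir}",
    ]
    # Optional settings such as minCov, lostCutoff, covCats, chromosomeList.
    for key, value in options.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(v) for v in value)
        cmd.append(f"{key}={value}")
    if extended:
        cmd.append("extendedFmt")  # flag without a value
    return cmd


if __name__ == "__main__":
    cmd = build_command(
        "SGSGeneLoss.v0.1.jar",
        "/home/uqagnieszka/bams/",
        ["arabidopsis.sorted.bam", "arabidopsis2.sorted.bam"],
        "/home/gff_files/Athaliana_167_gene_exons.gff3",
        "/home/uqagnieszka/results/",
        extended=True,
        chromosomeList=["Chr1", "Chr2"], minCov=2, lostCutoff=0.05,
    )
    print(" ".join(cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) from the directory that contains the .jar file.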

Output files format

All the output files are comma separated text files.

  • .excov files - files with results for each chromosome (file names use the chromosome names as in the .bam files); they come in two formats: basic (default) or extended (extendedFmt)
    • basic format: chromosome,ID,is_lost,start_position,end_postion,frac_exons_covered,frac_gene_covered,ave_cov_depth_exons,cov_cat,ave_cove_depth_gene
    • extended format: contains additional columns with information about each of the exons
  • stats.csv - file with summary information about all genes
  • chrs.csv - file with summary information about chromosomes
    • chr,start,end,len
  • graph.csv - file with list of genes lost as determined by lostCutoff
    • chr,id,start,end
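
As a quick check of a run, the genes listed in graph.csv can be counted per chromosome. This is a sketch under assumptions: the column order chr,id,start,end is taken from the list above, but whether the file starts with a header row is not stated in this manual, so the has_header switch is a guess that should be checked against a real file. The function name and the gene IDs in the test data are invented.

```python
import csv
from collections import Counter


def lost_genes_per_chromosome(lines, has_header=True):
    """Count rows per chromosome in graph.csv content (columns: chr,id,start,end)."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the assumed header row
    counts = Counter()
    for row in reader:
        if not row:
            continue  # ignore blank lines
        counts[row[0]] += 1  # first column is the chromosome name
    return counts


if __name__ == "__main__":
    # Invented example rows, not real SGSGeneLoss output.
    example = [
        "chr,id,start,end",
        "Chr1,geneA,100,900",
        "Chr1,geneB,1500,2100",
        "Chr2,geneC,50,700",
    ]
    for chrom, n in sorted(lost_genes_per_chromosome(example).items()):
        print(chrom, n)
```

For a real file, pass an open file object instead of the list, for example open("graph.csv", newline="").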

Plotting results

Results are visualized using R scripts.

Two ways of visualization are possible:

  • results per chromosome
  • results for all chromosomes as a circular graph

Results per chromosome:

What you need:

  • script graph_chromosomes.R
  • .excov files (either basic or extended) with results from SGSGeneLoss.jar: Chr1.excov, Chr2.excov etc.
  • directory (location) where files with results from SGSGeneLoss.jar: Chr1.excov, Chr2.excov etc. can be found

graph_chromosomes.R takes two arguments in this order:

1. location of directory where .excov files are located

2. gene loss cutoff

Rscript --vanilla graph_chromosomes.R /home/uqagnieszka/results 0.0

Summary results for all chromosomes, possibly multiple samples:

What you need:

  • script graph_circles.R
  • graph.csv from SGSGeneLoss.jar run
  • chrs.csv from SGSGeneLoss.jar run
  • file assigning numeric order to chromosomes (this is done because some chromosomes have complicated names and sorting in ASCII order does not always work) - file should look like this, chromosome names will be replaced by corresponding numbers
chrs,no
chr1,1
chr2,2
chr10,10
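
For genomes with many chromosomes the order file can be generated instead of typed by hand. The sketch below is one possible approach, not part of SGSGeneLoss: chromosome_order_lines is a hypothetical helper that uses the number at the end of each name (chr10 becomes 10) and gives names without a trailing number the next free numbers. It does not handle two names that end in the same number.

```python
import re


def chromosome_order_lines(names):
    """Return the lines of a chromosome order file (header: chrs,no)."""
    numbered, unnumbered = [], []
    for name in names:
        match = re.search(r"(\d+)$", name)  # digits at the end of the name
        if match:
            numbered.append((int(match.group(1)), name))
        else:
            unnumbered.append(name)
    numbered.sort()  # numeric order, so chr2 comes before chr10
    next_free = max((n for n, _ in numbered), default=0) + 1
    lines = ["chrs,no"]
    lines += [f"{name},{n}" for n, name in numbered]
    # Names without a number are appended in alphabetical order.
    for offset, name in enumerate(sorted(unnumbered)):
        lines.append(f"{name},{next_free + offset}")
    return lines


if __name__ == "__main__":
    with open("chrs_order.csv", "w") as handle:
        handle.write("\n".join(chromosome_order_lines(["chr1", "chr2", "chr10"])) + "\n")
```

The chromosome names should match those used in chrs.csv.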

graph_circles.R takes four arguments in this order:

1. file with chromosome info - chrs.csv from SGSGeneLoss.jar run

2. file with chromosome order

3. file with genes lost - graph.csv from SGSGeneLoss.jar run; it can be a comma separated list of multiple files (for example multiple samples). Circles are drawn in the order of the list, starting from the inside: with graph1.csv,graph2.csv,graph3.csv, graph1.csv is the innermost circle and graph3.csv the outermost

4. output file

Rscript --vanilla graph_circles.R chrs.csv chrs_order.csv graph1.csv,graph2.csv,graph3.csv out.png

FAQ

  • If memory consumption is a problem please consider increasing -Xmx or splitting your .bam files

