SGSGeneLoss

From Applied Bioinformatics Group

Latest revision as of 10:51, 14 February 2020

What does SGSGeneLoss depend on?

SGSGeneLoss depends on the following:

  • Java 1.6 or higher (http://www.java.com/en/)
  • R/3.1.0 (http://www.r-project.org/)
  • picard-tools v1.89 (http://sourceforge.net/projects/picard/files/picard-tools/ or directly from https://sourceforge.net/projects/picard/files/picard-tools/1.89/)
  • ggplot2 (http://ggplot2.org/)
  • ggbio (http://www.bioconductor.org/packages/release/bioc/html/ggbio.html)

Download

  • Latest Version 0.1 (29/04/2014):
    • SGSGeneLoss.v0.1.tar.gz should contain
      • three main programs: SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R
      • readme file
      • folder with source code

From now on in this manual SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R are referred to as SGSGeneLoss.jar, graph_chromosomes.R, graph_circles.R

To run the programs you have to use full names SGSGeneLoss.v0.1.jar, graph_chromosomes.v0.1.R, graph_circles.v0.1.R

How to install?

  • Download SGSGeneLoss.tar.gz
  • Unpack SGSGeneLoss.tar.gz and place SGSGeneLoss.jar and all the R scripts in chosen directory/directories, for example ./my_geneloss
  • Move into ./my_geneloss and create the SGSGeneLoss_lib directory (on Linux: cd ./my_geneloss; mkdir SGSGeneLoss_lib)
    • The name of the lib directory is the name of the .jar file without .jar extension + _lib, so if you are using SGSGeneLoss.v0.1.jar the lib directory is: SGSGeneLoss.v0.1_lib
    • The lib directory has to be in the same folder as the .jar file
  • Download picard-tools (SGSGeneLoss was tested with picard-tools 1.89)
  • Place picard-1.89.jar and sam-1.89.jar in ./my_geneloss/SGSGeneLoss_lib
  • Now you are ready to run SGSGeneLoss
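
The naming rule for the lib directory can be checked with a few lines of Python before the first run. This is only a sketch: the helper names lib_dir_for and missing_libs are hypothetical and not part of SGSGeneLoss, and the list of required jars is taken from the steps above.

```python
from pathlib import Path


def lib_dir_for(jar_path):
    """Return the lib directory expected next to the given .jar file."""
    jar = Path(jar_path)
    if jar.suffix != ".jar":
        raise ValueError(f"expected a .jar file, got {jar.name!r}")
    # Jar file name without the .jar extension, plus _lib, in the same folder.
    return jar.with_name(jar.stem + "_lib")


def missing_libs(jar_path, required=("picard-1.89.jar", "sam-1.89.jar")):
    """List the required library jars that are not yet in the lib directory."""
    lib_dir = lib_dir_for(jar_path)
    return [name for name in required if not (lib_dir / name).is_file()]


if __name__ == "__main__":
    jar = "my_geneloss/SGSGeneLoss.v0.1.jar"
    print("lib directory:", lib_dir_for(jar))
    print("still missing:", missing_libs(jar))
```

An empty "still missing" list means both picard-tools jars are where the steps above say they should be.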

Input and output files for SGSGeneLoss.jar

  • Input files:
    • Sorted, indexed .bam file with sequencing reads mapped to the reference genome sequence, multiple .bam files can be provided as comma separated list
    • Gff3 file with reference genome annotation, has to contain gene, mRNA and exon fields
  • Output files
    • Result files for each chromosome separately - .excov
    • File with overall stats - stats.csv
    • File with summary for all the chromosomes used - chrs.csv (this file is used by one of the R scripts)
    • File with list of genes lost for all the chromosomes - graph.csv (this file is used by one of the R scripts)

Command line options for SGSGeneLoss.jar

Required:

bamPath - path to your .bam file or files, has to end with / or \. Example: bamPath=/home/my_bams/

bamFileList - a single .bam file or a comma separated list, file names only; the .bam files and the corresponding .bai files have to be in the directory provided in bamPath. Example: bamFileList=bam1.bam,bam2.bam

gffFile - location of the gff3 file. Example: gffFile=/home/my_gffs/annot.gff3

outDirPath - location of the output directory, has to end with / or \. Example: outDirPath=/home/my_results/

Optional:

minCov - minimal coverage threshold to consider position covered [minCov=1]

chromosomeList - comma separated list of chromosomes to be used for analysis, use all, for all chromosomes [chromosomeList=all]

lostCutoff - coverage cutoff to consider gene as lost for calculating stats [lostCutoff=0.0]

covCats - coverage categories for visualization [covCats=0,10,20,30,40,70]

extendedFmt - use extended format, additional info included in output files [regular format]

To see help run: java -jar SGSGeneLoss.jar help

Sample command

  • Move into directory where SGSGeneLoss.jar is
  • Please make sure that all your supplied paths end with / or \
java -Xmx4g -jar SGSGeneLoss.jar bamPath=/home/uqagnieszka/bams/ bamFileList=arabidopsis.sorted.bam gffFile=/home/gff_files/Athaliana_167_gene_exons.gff3 outDirPath=/home/uqagnieszka/results/ chromosomeList=all

java -Xmx4g -jar SGSGeneLoss.jar bamPath=/home/uqagnieszka/bams/ bamFileList=arabidopsis.sorted.bam,arabidopsis2.sorted.bam gffFile=/home/gff_files/Athaliana_167_gene_exons.gff3 outDirPath=/home/uqagnieszka/results/ chromosomeList=Chr1,Chr2 minCov=2 lostCutoff=0.05 covCats=0,2,5,10,20 extendedFmt
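
When the command is assembled from a script, the trailing-separator rule is easy to get wrong. The following is a minimal sketch of a hypothetical helper, build_command, that checks the rule and returns an argument list in the key=value style shown above; it does not run anything and is not part of SGSGeneLoss. The default heap size of 4g simply mirrors the sample commands.

```python
def build_command(bam_path, bam_files, gff_file, out_dir,
                  jar="SGSGeneLoss.jar", heap="4g", **options):
    """Build the java argument list for one SGSGeneLoss run."""
    # bamPath and outDirPath have to end with / or \ according to the manual.
    for name, value in (("bamPath", bam_path), ("outDirPath", out_dir)):
        if not value.endswith(("/", "\\")):
            raise ValueError(f"{name} has to end with / or \\: {value!r}")

    cmd = [
        "java", f"-Xmx{heap}", "-jar", jar,
        f"bamPath={bam_path}",
        "bamFileList=" + ",".join(bam_files),
        f"gffFile={gff_file}",
        f"outDirPath={out_dir}",
    ]

    # extendedFmt is a bare flag; every other option is written as key=value.
    extended = options.pop("extendedFmt", False)
    for key, value in options.items():
        if isinstance(value, (list, tuple)):
            value = ",".join(str(item) for item in value)
        cmd.append(f"{key}={value}")
    if extended:
        cmd.append("extendedFmt")
    return cmd


if __name__ == "__main__":
    command = build_command(
        "/home/my_bams/", ["bam1.bam", "bam2.bam"],
        "/home/my_gffs/annot.gff3", "/home/my_results/",
        chromosomeList=["Chr1", "Chr2"], minCov=2, extendedFmt=True,
    )
    print(" ".join(command))
```

The returned list can be passed to subprocess.run from the directory that contains the .jar file.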

Output files format

All the output files are comma separated text files.

  • .excov files - results for each chromosome (file names use the chromosome names as in the .bam files); the files come in two formats: basic (default) or extended (extendedFmt)
    • basic format: chromosome,ID,is_lost,start_position,end_postion,frac_exons_covered,frac_gene_covered,ave_cov_depth_exons,cov_cat,ave_cove_depth_gene
    • extended format: contains additional columns with information about each of the exons
  • stats.csv - file with summary information about all genes
  • chrs.csv - file with summary information about chromosomes
    • chr,start,end,len
  • graph.csv - file with list of genes lost as determined by lostCutoff
    • chr,id,start,end
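
Because the outputs are plain comma separated text, they can be read with any CSV parser. The sketch below reads a basic-format .excov file using the column names listed above. It rests on several assumptions: the file has no header row, the two sample rows are invented for illustration (including the is_lost value "no"), and the threshold filter is only an example of post-processing, not necessarily the rule SGSGeneLoss applies for lostCutoff.

```python
import csv
import io

# Column names of the basic format, spelled exactly as in this manual.
BASIC_COLUMNS = [
    "chromosome", "ID", "is_lost", "start_position", "end_postion",
    "frac_exons_covered", "frac_gene_covered", "ave_cov_depth_exons",
    "cov_cat", "ave_cove_depth_gene",
]


def read_excov(handle):
    """Read .excov rows into dictionaries keyed by the basic column names.

    Assumes there is no header row. Extra columns of the extended
    format, if present, are collected in a list under the key "extra".
    """
    reader = csv.DictReader(handle, fieldnames=BASIC_COLUMNS, restkey="extra")
    return list(reader)


def genes_with_low_exon_coverage(rows, threshold):
    """Return IDs of genes whose fraction of exons covered is <= threshold."""
    return [row["ID"] for row in rows
            if float(row["frac_exons_covered"]) <= threshold]


if __name__ == "__main__":
    # Invented example rows; real files come from an SGSGeneLoss.jar run.
    sample = (
        "Chr1,geneA,no,100,900,0.0,0.1,0.0,0,0.2\n"
        "Chr1,geneB,no,1500,2500,0.95,0.9,12.5,10,11.0\n"
    )
    rows = read_excov(io.StringIO(sample))
    print(genes_with_low_exon_coverage(rows, 0.05))
```

For a real file, open it with open("Chr1.excov", newline="") and pass the handle to read_excov.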

Plotting results

Results are visualized using R scripts.

Two ways of visualization are possible:

  • results per chromosome
  • results for all chromosomes as a circular graph

Results per chromosome:

What you need:

  • script graph_chromosomes.R
  • .excov files (either basic or extended) with results from SGSGeneLoss.jar: Chr1.excov, Chr2.excov etc.
  • directory (location) where files with results from SGSGeneLoss.jar: Chr1.excov, Chr2.excov etc. can be found

graph_chromosomes.R takes three arguments in this order:

1. location of directory where .excov files are located

2. gene loss cutoff

3. output path ending with /

Rscript --vanilla graph_chromosomes.R /home/uqagnieszka/results 0.0 /home/uqagnieszka/graphs/

Summary results for all chromosomes, possibly multiple samples:

What you need:

  • script graph_circles.R
  • graph.csv from SGSGeneLoss.jar run
  • chrs.csv from SGSGeneLoss.jar run
  • file assigning numeric order to chromosomes (this is done because some chromosomes have complicated names and sorting in ASCII order does not always work) - file should look like this, chromosome names will be replaced by corresponding numbers
chr,no
chr1,1
chr2,2
chr10,10
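
If there are many chromosomes, the order file can be generated rather than typed by hand. The sketch below sorts chromosome names so that numeric parts compare as numbers (chr2 before chr10) and numbers them consecutively from 1. Both function names are hypothetical, and consecutive numbering is one possible choice; any numbering that gives the intended order should serve the same purpose.

```python
import re


def natural_key(name):
    """Sort key that compares digit runs as numbers, so chr2 < chr10."""
    # re.split with a capturing group alternates text and digit runs,
    # always starting with a (possibly empty) text run.
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def chromosome_order_csv(names):
    """Return the text of an order file with the header chr,no."""
    ordered = sorted(set(names), key=natural_key)
    lines = ["chr,no"]
    lines += [f"{name},{number}" for number, name in enumerate(ordered, start=1)]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(chromosome_order_csv(["chr10", "chr2", "chr1"]), end="")
```

The chromosome names passed in should match the names in chrs.csv.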

graph_circles.R takes five arguments in this order:

1. file with chromosome info - chrs.csv from SGSGeneLoss.jar run

2. file with chromosome order

3. file with genes lost - graph.csv from SGSGeneLoss.jar run; it can be a comma separated list of multiple files (for example multiple samples). Circles are drawn in the order of the files, starting from the inside: the first file in the list is the innermost circle, so graph1.csv,graph2.csv,graph3.csv gives graph1.csv as the innermost circle and graph3.csv as the outermost

4. Output path ending with /

5. output file

Rscript --vanilla graph_circles.R chrs.csv chrs_order.csv graph1.csv,graph2.csv,graph3.csv /home/results/graphs/ out.png
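
For several samples, the argument list can be put together programmatically. This is a sketch of a hypothetical helper, circles_command, that only mirrors the five arguments described above and checks that the output path ends with /; it does not call R itself.

```python
def circles_command(chrs_file, order_file, graph_files, out_path, out_file,
                    script="graph_circles.R"):
    """Build the Rscript argument list for graph_circles.R.

    graph_files is listed from the innermost circle to the outermost.
    """
    if not out_path.endswith("/"):
        raise ValueError(f"output path has to end with /: {out_path!r}")
    return [
        "Rscript", "--vanilla", script,
        chrs_file, order_file,
        ",".join(graph_files),  # comma separated, no spaces
        out_path, out_file,
    ]


if __name__ == "__main__":
    command = circles_command(
        "chrs.csv", "chrs_order.csv",
        ["graph1.csv", "graph2.csv", "graph3.csv"],
        "/home/results/graphs/", "out.png",
    )
    print(" ".join(command))
```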

FAQ

  • If memory consumption is a problem please consider increasing -Xmx or splitting your .bam files
  • Please cite Golicz, A.A., Martinez, P.A., Zander, M., Patel, D.A., Van De Wouw, A.P., Visendi, P., Fitzgerald, T.L. et al. (2015) Gene loss in the fungal canola pathogen Leptosphaeria maculans. Funct. Integr. Genomics, 15, 189–196.
