SkimGBS

From Applied Bioinformatics Group
Revision as of 06:06, 9 September 2014 by Philippbayer (talk | contribs) (SNP-calling)
Jump to: navigation, search

This page collects the scripts and a manual for our SkimGBS genotyping by sequencing pipeline.

What SkimGBS depends on

For the Perl-scripts, just BioPerl. For the Python-scripts, BioPython and at least Python 2.7. we haven't tested the pipeline much with Python 3 but it should work.

Download

How to install

Download the package from above, then just extract the files to a directory of your choosing. Remember the directory, since you'll need that in the next step.

How to run SkimGBS

Data

As input data, you need BAM alignments for your two parental individuals, and one BAM alignment for each individual of the DH or RIL population.

As a general overview: First align all reads, then call SNPs for the parental individuals using SGSautoSNP, then call genotypes for the population using snp_genotyping_all.pl, then filter SNPs using removeMonomorphicSNPs.py, then impute using imputeFlapjackSNPs.py, then visualize using Flapjack: http://ics.hutton.ac.uk/flapjack/

Alignments

You can filter reads based on quality first, however, since SGSautoSNP doesn't allow heterozygous SNPs, receiving a false-positive SNP from two reads that have exactly the same sequencing error seems unlikely.

Important: SGSautoSNP determines which read comes from which cultivar by checking the read name and expects either to see a P (for paired) or S (for single) followed by the number of the library. In practise, it's enough to use P1- for both cultivars. Example: We used Tapidor and Ningyou, so we added 'TP1-' and 'NP1-' to both read collections. You can do that either before you run the alignments, or after the alignments, here it is after the alignments using SOAPaligner:

   sed 's/^/TP1-/' all.soap > all_renamed.soap

We use SOAPaligner with the option -r 0 to exclude homeologous SNPs from reads aligning in several places, alternatively, filter BWA or Bowtie for quality scores, something above 20 or 30 should work.

   samtools view -q 30 mappings.bam > mappings_filtered.bam

It is best to use only reads that align in pairs in order to increase the accuracy of called SNPs, so depending on your aligner you will have to filter unpaired reads. In SOAPaligner, just use the paired output files.

We merged all files for the two parents:

  samtools merge TapidorAndNingyou.bam Tapidor.bam Ningyou.bam

Remove clones using Picard's MarkDuplicates:

  java -jar `which MarkDuplicates.jar` INPUT=TapidorAndNingyou.bam OUTPUT=TapidorAndNingyou_cleaned.bam METRICS_FILE=stats.txt REMOVE_DUPLICATES=true

This overfilters a bit, since by chance you will find some reads align from different libraries in exactly the same position, so they may be treated as clones. Alternatively, you can run MarkDuplicates for each library itself and merge the results.

To make sure everything worked fine, use Picard again:

  java -jar `which ValidateSamFile.jar` INPUT=your_output_file

SGSautoSNP can either run one process per reference chromosome using multiple threads, or you can run many SGSautoSNP runs, one per chromosome. I prefer the latter, so I split the merged parental reads by chromosomes using bamtools. Both give you exactly the same results, running one SGSautoSNP per chromosome is easier to do when you have job arrays as with PBS.

  bamtools split -in TapidorAndNingyou_cleaned.bam -reference

This will result in one BAM file per chromosome which includes the data of both parental cultivars.

SNP-calling

SGSautoSNP can produce SNPs for FASTA files containing one or several references. To make sure that it offsets correctly, you have to supply a gff3 file of contigs, even if it's only one chromomosome. For that, we have supplied the fast_multiple_to_single_fasta.py script, which will produce a gff3 file for each chromosome.

Usage:

   fast_multiple_to_single_fasta.py chromosome_a1.fasta

Will create chromosome_a1_pseudo.gff3, this file is needed for SGSautoSNP.

To run SGSautoSNP:

--bam BA01m.bam

--chr_offset BA01_v4.0_contig.gff3 (from fast_multiple_to_single_fasta.py)

--fasta chromosome_m.fasta (the chromosome - we just need this to extract the sequence ID)

--cultivars T, N (list of your cultivars you renamed to, so you should have NP1-, TP1- or something in your bam-file)

--contig_output Tapidor_Ningyou_SNPs (BA_01_output)

--chr_output Tapidor_Ningyou_SNPs_chr (also output - because SGSautoSNP is slightly buggy this will be identical to the above)

--cpu 1 (amount of CPUs used, 1 per chromosome)

Output files are the SNPs in snp, vcf and gff3 format.

Usage:

   python SGSautoSNP.py --bam BA01.bam --fasta BA01.fasta --contig_output BA_01_output --chr_offset BA01.gff3 --chr_output BA_01_output_chr --cultivars "T,N" --cpu 1

FAQ

To come.


Reference

Back to Main_Page