Difference between revisions of "SkimGBS"

From Applied Bioinformatics Group
Jump to: navigation, search
(Genotyping)
 
(18 intermediate revisions by the same user not shown)
Line 1: Line 1:
This page collects the scripts and a manual for our SkimGBS genotyping by sequencing pipeline.
+
This page collects the scripts and a manual for our SkimGBS genotyping by sequencing pipeline as published in Bayer et al. 2015: http://www.ncbi.nlm.nih.gov/pubmed/25754422
  
 
== Dependencies ==
 
== Dependencies ==
  
For the Perl-scripts, just BioPerl. For the Python-scripts, BioPython and at least Python 2.7. we haven't tested the pipeline much with Python 3 but it should work.
+
For the Perl-scripts, just BioPerl. For the Python-scripts, BioPython, Pysam and at least Python 2.7. we haven't tested the pipeline much with Python 3 but it should work.
  
 
== Download ==
 
== Download ==
 
* Latest Version 1.0:
 
* Latest Version 1.0:
** [http://appliedbioinformatics.com.au/download/PAKAP_1.7.zip PAKAP_1.7.zip]
+
** [http://appliedbioinformatics.com.au/download/Skim_GBS_pipeline_v1.zip Skim_GBS_pipeline_v1.zip]
  
 
== How to install ==
 
== How to install ==
  
Download the package from above, then just extract the files to a directory of your choosing. Remember the directory, since you'll need that in the next step.
+
Download the package from above, then just extract the files to a directory of your choosing. The pipeline is a collection of Perl and Python scripts so no need to compile anything.
 +
 
 +
The pipeline does work with BAM files so you need one short read aligner like SOAPaligner, BWA or Bowtie2, and you need some additional tools like Picard, samtools and bamtools.
  
 
== How to run SkimGBS ==
 
== How to run SkimGBS ==
  
 
As a general overview: First align all reads, then call SNPs for the parental individuals using SGSautoSNP, then call genotypes for the population using snp_genotyping_all.pl, then filter SNPs using removeMonomorphicSNPs.py, then impute using imputeFlapjackSNPs.py, then visualize using Flapjack: http://ics.hutton.ac.uk/flapjack/  
 
As a general overview: First align all reads, then call SNPs for the parental individuals using SGSautoSNP, then call genotypes for the population using snp_genotyping_all.pl, then filter SNPs using removeMonomorphicSNPs.py, then impute using imputeFlapjackSNPs.py, then visualize using Flapjack: http://ics.hutton.ac.uk/flapjack/  
 +
Lastly, call recombinations.
  
 
=== Data ===
 
=== Data ===
  
 
As input data, you need BAM alignments for your two parental individuals, and one BAM alignment for each individual of the DH or RIL population.
 
As input data, you need BAM alignments for your two parental individuals, and one BAM alignment for each individual of the DH or RIL population.
 +
 +
=== Reference sequence ===
 +
 +
It's best to run the pipeline with a concatenated pseudo-molecule instead of thousands of contigs.
 +
 +
The script fast_multiple_to_single.pl makes such a sequence out of a fasta file of contigs. Let's say you have 8 fasta files for 8 pseudo-molecules, each fasta file containing all contigs, ending in ".fasta"
 +
 +
<pre>
 +
for l in *fasta
 +
do
 +
  fast_multiple_to_single_fasta.py -v ${l%%.fasta} -f $l -s fasta -r ? -n 100
 +
done
 +
</pre>
 +
 +
This will put 100 Ns in between each contig and generate several output files, from which we need the ones ending in ".pseudo_contig.gff3" and "_pseudo.fasta"
  
 
=== Alignments ===
 
=== Alignments ===
Line 54: Line 72:
  
 
This will result in one BAM file per chromosome which includes the data of both parental cultivars.
 
This will result in one BAM file per chromosome which includes the data of both parental cultivars.
 +
 +
Warning: There seems to be a really rare error when running bamtools on a cluster, but I've never encountered it: https://github.com/pezmaster31/bamtools/issues/32
 +
Check your output files whether reads suddenly appear more times.
  
 
=== SNP-calling ===
 
=== SNP-calling ===
Line 91: Line 112:
 
Usage:
 
Usage:
  
     perl snp_genotyping_all.pl -b TN10_BC0FH8ACXX_GGCTAC_L008_R1.fastq.par_sorted.bam -gff Contigs_IsobellC1_snp.gff3 -off Contigs_IsobellC1_pseudo_contig.gff3 -f Contigs_IsobellC1.fasta -o IsobellC1_TN10_GBS -c1 T -c2 N -h TN -s 1 > TN10_GBS.out
+
     perl snp_genotyping_all.pl -b TN10_BC0FH8ACXX_GGCTAC_L008_R1.fastq.par_sorted.bam -gff Contigs_C1_snp.gff3 -off Contigs_C1_pseudo_contig.gff3 -f Contigs_C1.fasta -o C1_TN10_GBS -c1 T -c2 N -h TN -s 1 > TN10_GBS.out
  
 
Explanation:
 
Explanation:
  
    TN10_BC0FH8ACXX_GGCTAC_L008_R1.fastq.par_sorted.bam - is the BAM  file for the individual TN10 of the DH populatioin mapped against the whole reference.
+
-b TN10_BC0FH8ACXX_GGCTAC_L008_R1.fastq.par_sorted.bam (the BAM  file for the individual TN10 of the DH population mapped against the whole reference)
    Contigs_C1_snp.gff3 - is the gff3 file from SGSautoSNP for chromosome C1 containing all SNPs
+
 
    Contigs_C1_pseudo_contig.gff3 - is the offset file to map each chromosome contig to a global chromosome location from  
+
-gff Contigs_C1_snp.gff3 (the gff3 file from SGSautoSNP for chromosome C1 containing all SNPs)
    Contigs_C1.fasta - the chromosome fasta file
+
 
    T - the letter for the first cultivar
+
-off Contigs_C1_pseudo_contig.gff3 (the offset file to map each chromosome contig to a global chromosome location from fast_multiple_to_single_fasta.py)
    C1_TN10_GBS - the output file name
+
 
 +
-f Contigs_C1.fasta (the chromosome fasta file from fast_multiple_to_single_fasta.py)
 +
 
 +
-c1 T, -c2 N, -h TN (the letter for the first cultivar, second cultivar, and the combination of both)
 +
 
 +
-s 1 (minimum score for each cultivar. There have to be more than, or equal to, 1 read to support an allele )
 +
 
 +
> C1_TN10_GBS (the output file name)
  
 
It's best practise to have one output folder per chromosome since this step will generate many files - 3 files per individual per chromosome, so for 20 chromosomes and 92 individuals that means 92 * 3 * 20 = 5520 files.
 
It's best practise to have one output folder per chromosome since this step will generate many files - 3 files per individual per chromosome, so for 20 chromosomes and 92 individuals that means 92 * 3 * 20 = 5520 files.
Line 112: Line 140:
 
This will take all TN samples in the GBS_C8 directory and output both .genotype and .map files for importing to flapjack. The .genotype file is identical to the .dat file Flapjack exports, so you may want to rename it if you like consistency.
 
This will take all TN samples in the GBS_C8 directory and output both .genotype and .map files for importing to flapjack. The .genotype file is identical to the .dat file Flapjack exports, so you may want to rename it if you like consistency.
  
'''Important:''' This script assumes that the first element of your _csv-file is the name of the individual. Example: chrUnrandom_TN99_GBS_TN.genotyping_csv is correct for the individual TN99.
+
'''Important:''' This script assumes that the first element of your _csv-file is the name of the individual. Example: chrUnrandom_TN99_GBS_TN.genotyping_csv is correct for the individual TN99. chr_Unrandom_TN99_GBS_TN.genotyping_csv would give you "Unrandom" as the name of all individuals.
  
 
If your chromosome-names contain underscores (i.e., "chrA01_random") this script will generate garbage output where the names of individuals are replaced by "random" or whatever your second element is. Replace the "_" in your chromosome names first before running.
 
If your chromosome-names contain underscores (i.e., "chrA01_random") this script will generate garbage output where the names of individuals are replaced by "random" or whatever your second element is. Replace the "_" in your chromosome names first before running.
Line 127: Line 155:
  
 
     grep -v 'TN99' dat_file > cleaned_dat_file
 
     grep -v 'TN99' dat_file > cleaned_dat_file
 
To impute the cleaned results:
 
 
  python imputeFlapjackSNPs.py some_flapjack_outputfile.dat
 
  
 
=== Sideways imputation ===
 
=== Sideways imputation ===
Line 148: Line 172:
  
 
=== Calling recombinations ===
 
=== Calling recombinations ===
 +
Usage:
 +
 +
1-makeGCAndLDMapPerIndividual.py dat_file map_file
 +
 +
This will make two files:
 +
 +
    dat_file_crossovers.txt
 +
    dat_file_gene_conversions.txt
 +
 +
The format of the file is like this, one block per individual:
 +
 +
<pre>
 +
TN21
 +
#Allele Start SNP End SNP Start position End position Length Outer End SNP Outer End Position Outer End Length\
 +
T M1 M3 1 3 2 M5 7 6
 +
</pre>
 +
 +
dat_file_gene_conversions.txt contains all recombinations >20bp and <10kb, crossovers contains everything bigger 10kb.
 +
 +
2-plotCOs.py is an example script to plot the distribution of GCs and COs.
 +
 +
=== Filtering GCs and COs ===
 +
 +
Since so many GCs and COs overlap, I wrote a script that removes COs whose outer end, inner end or outer start and inner start overlap. Here's some ASCII art to show the difference between outer and inner end:
 +
 +
<pre>
 +
AAAAA------BBBBBBB--------AAAAAA
 +
    |    |    |      |
 +
    o    i    i      o
 +
</pre>
 +
 +
Genotypes are A and B, - is missing. The outer start of that B recombination is the first SNP after the last SNP of the first A block (first |, o), the inner start is the first B, the inner end is the last B, the outer end is the last SNP before the next A.
 +
 +
Here's an example for a filtered GC in two individuals:
 +
 +
<pre>
 +
TN1 AAAAAA-----------BBBBBBBBBBBBBBBBBB
 +
TN3 AAAA-------BBBBBBBBBBBBBBBBBBBBBBBB
 +
</pre>
 +
 +
This CO overlaps so we ignore it.
 +
 +
python 3-fuzzyFilter.py table_to_filter CO/GC
 +
 +
You have to specify CO or GC - adjacent COs are merged, adjacent GCs are left alone.
 +
 +
This will create output files with the 'filtered' suffix.
 +
 +
In the zip archive are also a couple of different plotting and table making scripts I used for the publication.
  
 
== FAQ ==
 
== FAQ ==
To come.
 
  
 +
- *I get many "Use of uninitialized value $snp_name in hash.." etc. when running snp_genotyping_all.pl*
 +
 +
- These are "just" warnings, feel free to ignore them.
  
 
== Reference ==
 
== Reference ==

Latest revision as of 03:16, 17 February 2016

This page collects the scripts and a manual for our SkimGBS genotyping by sequencing pipeline as published in Bayer et al. 2015: http://www.ncbi.nlm.nih.gov/pubmed/25754422

Dependencies

For the Perl-scripts, just BioPerl. For the Python-scripts, BioPython, Pysam and at least Python 2.7. we haven't tested the pipeline much with Python 3 but it should work.

Download

How to install

Download the package from above, then just extract the files to a directory of your choosing. The pipeline is a collection of Perl and Python scripts so no need to compile anything.

The pipeline does work with BAM files so you need one short read aligner like SOAPaligner, BWA or Bowtie2, and you need some additional tools like Picard, samtools and bamtools.

How to run SkimGBS

As a general overview: First align all reads, then call SNPs for the parental individuals using SGSautoSNP, then call genotypes for the population using snp_genotyping_all.pl, then filter SNPs using removeMonomorphicSNPs.py, then impute using imputeFlapjackSNPs.py, then visualize using Flapjack: http://ics.hutton.ac.uk/flapjack/ Lastly, call recombinations.

Data

As input data, you need BAM alignments for your two parental individuals, and one BAM alignment for each individual of the DH or RIL population.

Reference sequence

It's best to run the pipeline with a concatenated pseudo-molecule instead of thousands of contigs.

The script fast_multiple_to_single.pl makes such a sequence out of a fasta file of contigs. Let's say you have 8 fasta files for 8 pseudo-molecules, each fasta file containing all contigs, ending in ".fasta"

for l in *fasta
do
  fast_multiple_to_single_fasta.py -v ${l%%.fasta} -f $l -s fasta -r ? -n 100
done

This will put 100 Ns in between each contig and generate several output files, from which we need the ones ending in ".pseudo_contig.gff3" and "_pseudo.fasta"

Alignments

You can filter reads based on quality first, however, since SGSautoSNP doesn't allow heterozygous SNPs, receiving a false-positive SNP from two reads that have exactly the same sequencing error seems unlikely.

Important: SGSautoSNP determines which read comes from which cultivar by checking the read name and expects either to see a P (for paired) or S (for single) followed by the number of the library. In practise, it's enough to use P1- for both cultivars. Example: We used Tapidor and Ningyou, so we added 'TP1-' and 'NP1-' to both read collections. You can do that either before you run the alignments, or after the alignments, here it is after the alignments using SOAPaligner:

   sed 's/^/TP1-/' all.soap > all_renamed.soap

We use SOAPaligner with the option -r 0 to exclude homeologous SNPs from reads aligning in several places, alternatively, filter BWA or Bowtie for quality scores, something above 20 or 30 should work.

   samtools view -q 30 mappings.bam > mappings_filtered.bam

It is best to use only reads that align in pairs in order to increase the accuracy of called SNPs, so depending on your aligner you will have to filter unpaired reads. In SOAPaligner, just use the paired output files.

We merged all files for the two parents:

  samtools merge TapidorAndNingyou.bam Tapidor.bam Ningyou.bam

Remove clones using Picard's MarkDuplicates:

  java -jar `which MarkDuplicates.jar` INPUT=TapidorAndNingyou.bam OUTPUT=TapidorAndNingyou_cleaned.bam METRICS_FILE=stats.txt REMOVE_DUPLICATES=true

This overfilters a bit, since by chance you will find some reads align from different libraries in exactly the same position, so they may be treated as clones. Alternatively, you can run MarkDuplicates for each library itself and merge the results.

To make sure everything worked fine, use Picard again:

  java -jar `which ValidateSamFile.jar` INPUT=your_output_file

SGSautoSNP can either run one process per reference chromosome using multiple threads, or you can run many SGSautoSNP runs, one per chromosome. I prefer the latter, so I split the merged parental reads by chromosomes using bamtools. Both give you exactly the same results, running one SGSautoSNP per chromosome is easier to do when you have job arrays as with PBS.

  bamtools split -in TapidorAndNingyou_cleaned.bam -reference

This will result in one BAM file per chromosome which includes the data of both parental cultivars.

Warning: There seems to be a really rare error when running bamtools on a cluster, but I've never encountered it: https://github.com/pezmaster31/bamtools/issues/32 Check your output files whether reads suddenly appear more times.

SNP-calling

SGSautoSNP can produce SNPs for FASTA files containing one or several references. To make sure that it offsets correctly, you have to supply a gff3 file of contigs, even if it's only one chromomosome. For that, we have supplied the fast_multiple_to_single_fasta.py script, which will produce a gff3 file for each chromosome.

Usage:

   fast_multiple_to_single_fasta.py C1.fasta

Will create C1_pseudo.gff3, this file is needed for SGSautoSNP.

To run SGSautoSNP:

--bam C1.bam

--chr_offset C1_v4.0_contig.gff3 (from fast_multiple_to_single_fasta.py)

--fasta chromosome_m.fasta (the chromosome - we just need this to extract the sequence ID)

--cultivars T, N (list of your cultivars you renamed to, so you should have NP1-, TP1- or something in your bam-file)

--contig_output Tapidor_Ningyou_SNPs (BA_01_output)

--chr_output Tapidor_Ningyou_SNPs_chr (also output - because SGSautoSNP is slightly buggy this will be identical to the above)

--cpu 1 (amount of CPUs used, 1 per chromosome)

Output files are the SNPs in snp, vcf and gff3 format.

Usage:

   python SGSautoSNP.py --bam BA01.bam --fasta BA01.fasta --contig_output BA_01_output --chr_offset BA01.gff3 --chr_output BA_01_output_chr --cultivars "T,N" --cpu 1

Genotyping

Next, we run the genotyping script, one run per individual of the DH or RIL mapping population and per chromosome.

Usage:

   perl snp_genotyping_all.pl -b TN10_BC0FH8ACXX_GGCTAC_L008_R1.fastq.par_sorted.bam -gff Contigs_C1_snp.gff3 -off Contigs_C1_pseudo_contig.gff3 -f Contigs_C1.fasta -o C1_TN10_GBS -c1 T -c2 N -h TN -s 1 > TN10_GBS.out

Explanation:

-b TN10_BC0FH8ACXX_GGCTAC_L008_R1.fastq.par_sorted.bam (the BAM file for the individual TN10 of the DH population mapped against the whole reference)

-gff Contigs_C1_snp.gff3 (the gff3 file from SGSautoSNP for chromosome C1 containing all SNPs)

-off Contigs_C1_pseudo_contig.gff3 (the offset file to map each chromosome contig to a global chromosome location from fast_multiple_to_single_fasta.py)

-f Contigs_C1.fasta (the chromosome fasta file from fast_multiple_to_single_fasta.py)

-c1 T, -c2 N, -h TN (the letter for the first cultivar, second cultivar, and the combination of both)

-s 1 (minimum score for each cultivar. There have to be more than, or equal to, 1 read to support an allele )

> C1_TN10_GBS (the output file name)

It's best practise to have one output folder per chromosome since this step will generate many files - 3 files per individual per chromosome, so for 20 chromosomes and 92 individuals that means 92 * 3 * 20 = 5520 files.

Flapjack visualization

To display the genotype data in flapjack, use the script parseGenotype.pl, for example for a directory called GBS_C8 containing the results from the above script:

  parseGenotype.pl -out GBS_C8 GBS_C8/*_csv

This will take all TN samples in the GBS_C8 directory and output both .genotype and .map files for importing to flapjack. The .genotype file is identical to the .dat file Flapjack exports, so you may want to rename it if you like consistency.

Important: This script assumes that the first element of your _csv-file is the name of the individual. Example: chrUnrandom_TN99_GBS_TN.genotyping_csv is correct for the individual TN99. chr_Unrandom_TN99_GBS_TN.genotyping_csv would give you "Unrandom" as the name of all individuals.

If your chromosome-names contain underscores (i.e., "chrA01_random") this script will generate garbage output where the names of individuals are replaced by "random" or whatever your second element is. Replace the "_" in your chromosome names first before running.

Example: Rename "./GBS_chrUn_random/chrUn_random_TN99_GBS_TN.genotyping_csv" into "./GBS_chrUn_random/chrUnrandom_TN99_GBS_TN.genotyping_csv"

Removing monomorphic SNPs

Some parental SNPs behave monomorphically in the population, either because they are false positives or because the coverage in the population is too low to observe the second allele.

   python removeMonomorphicSNPs.py map-file dat-file

In case there are individuals that have a too low coverage, use 'grep -v' to remove them:

   grep -v 'TN99' dat_file > cleaned_dat_file

Sideways imputation

To impute the cleaned results:

  python imputeFlapjackSNPs.py some_flapjack_outputfile.dat

This step will put out a .dat-file ending in _imputed.dat.

Imputation follows these simple rules for the two genotypes N and T, assuming that recombinations are so rare so that we can ignore them:

   N, , , N -> N, n, n, N
   T, , , T -> T, t, t, T
   T, , , N -> T, , , N
   , , , T -> , , , T  

Calling recombinations

Usage:

1-makeGCAndLDMapPerIndividual.py dat_file map_file

This will make two files:

   dat_file_crossovers.txt
   dat_file_gene_conversions.txt

The format of the file is like this, one block per individual:

TN21
#Allele	Start SNP	End SNP	Start position	End position	Length	Outer End SNP	Outer End Position	Outer End Length\
T	M1	M3	1	3	2	M5	7	6

dat_file_gene_conversions.txt contains all recombinations >20bp and <10kb, crossovers contains everything bigger 10kb.

2-plotCOs.py is an example script to plot the distribution of GCs and COs.

Filtering GCs and COs

Since so many GCs and COs overlap, I wrote a script that removes COs whose outer end, inner end or outer start and inner start overlap. Here's some ASCII art to show the difference between outer and inner end:

AAAAA------BBBBBBB--------AAAAAA
     |     |     |       |
     o     i     i       o

Genotypes are A and B, - is missing. The outer start of that B recombination is the first SNP after the last SNP of the first A block (first |, o), the inner start is the first B, the inner end is the last B, the outer end is the last SNP before the next A.

Here's an example for a filtered GC in two individuals:

TN1 AAAAAA-----------BBBBBBBBBBBBBBBBBB
TN3 AAAA-------BBBBBBBBBBBBBBBBBBBBBBBB

This CO overlaps so we ignore it.

python 3-fuzzyFilter.py table_to_filter CO/GC

You have to specify CO or GC - adjacent COs are merged, adjacent GCs are left alone.

This will create output files with the 'filtered' suffix.

In the zip archive are also a couple of different plotting and table making scripts I used for the publication.

FAQ

- *I get many "Use of uninitialized value $snp_name in hash.." etc. when running snp_genotyping_all.pl*

- These are "just" warnings, feel free to ignore them.

Reference

Back to Main_Page