University of Pretoria SNP Identification and Probe Designs Includes transcriptome work Madison.Caballero@uconn.edu ################################################################################################################################################################### README layout: 1. Background 2. Transcriptomes 3. Alignment of exome capture to transcriptomes 4. Calling SNPs from those alignments 5. Designing and initial filtering of probes 6. Frame selection of transcripts 7. Genomethreader alignment of transcriptomes to loblolly genome 8. Probe correction using those alignments - final sets 9. Files of interest ################################################################################################################################################################### 1. BACKGROUND Species: Pinus taeda (Pita) --> reference used for mapping, Loblolly Pine. Native to south eastern United States from central Texas to east Florida and north to Delaware. Genome size is roughly 22 billion bp. Pinus greggii (Pigr) --> A species of pine tree native to eastern Mexico, in the Sierra Madre Oriental. Can hybridize with Pinus patula. File Locations: http://tgwebdev.cam.uchc.edu/FTP/temp/Pita_Genotyping/Pretoria/Pigr/ Pinus oocarpa (Pioo) --> A species of pine tree native to Mexico and Central America including western Mexico, Guatemala and the higher elevations of Honduras, El Salvador and northwestern Nicaragua. File Locations: http://tgwebdev.cam.uchc.edu/FTP/temp/Pita_Genotyping/Pretoria/Pioo/ Pinus patula (Pipa) --> A species of pine tree native to the highlands of Mexico. Can hybridize with Pinus greggii. File Locations: http://tgwebdev.cam.uchc.edu/FTP/temp/Pita_Genotyping/Pretoria/Pipa/ Pinus tecunumanii (Pite) --> A species of pine tree native to Mexico and Central America. It grows from the highlands of Chiapas and Oaxaca through Guatemala, Belize, El Salvador, Honduras to Nicaragua. Closer to Pinus oocarpa. File Locations: http://tgwebdev.cam.uchc.edu/FTP/temp/Pita_Genotyping/Pretoria/Pite/ RAPiD Genomics contricted 5 and 10-tree bulks of one provenance of each of the four species (see above greggii-tecunumanii). There are 8 libraries, 2 for each species. Reads were captured using the 80K capture probe set designed for Pinus taeda. Information on the libraries: RG_Sample_Code Customer_Code Project_Code Initial_gDNA_Conc Initial_gDNA_Ng_Used Lib_DNA_Conc Total_Aligned_Reads_Percentage Uniquely_Aligned_Mates_Percentage Number_Probes Median_Probe_Depth UPT_116401_P01_WC02 P. patula var patula -10samples UPT_116401 16.66998918 500 37.45743404 66.7 62.56666667 75395 239.75 UPT_116401_P01_WC03 P. greggii (South)-10samples UPT_116401 16.66998918 500 34.12748039 60.56666667 56.2 73432 167.69 UPT_116401_P01_WC04 P. tecunumanii LE-10samples UPT_116401 16.66998918 500 26.88594568 64.26666667 59.63333333 75592 217.645 UPT_116401_P01_WC05 P. oocarpa-10samples UPT_116401 16.66998918 500 21.03594894 64.83333333 60.36666667 75574 190.535 UPT_116401_P01_WC06 P. patula var patula-5samples UPT_116401 16.66998918 500 42.71031117 65.33333333 60.86666667 75011 239.4 UPT_116401_P01_WC07 P. greggii (South)-5samples UPT_116401 16.66998918 500 33.07675546 61.5 57.46666667 73371 220.88 UPT_116401_P01_WC08 P. tecunumanii LE-5samples UPT_116401 16.66998918 500 39.76021282 64.23333333 59.6 75184 184.98 UPT_116401_P01_WC09 P. oocarpa-5samples UPT_116401 16.66998918 500 33.04211615 61.13333333 56.43333333 75426 178.51 ################################################################################################################################################################### 2. TRANSCRIPTOMES - sizes Greggii: 38,179 transcripts Oocarpa: 35,092 transcripts Patula: 52,735 transcripts Tecunumanii: 28,621 transcripts ################################################################################################################################################################### 3. ALIGNMENT Exome capture reads from above are trimmed with sickle and aligned to their respective transcriptome. The transcriptomes are indexed with bwa mem. Sample command: bwa index -p greggii_TSA greggii_TSA.fasta bwa mem -t 32 greggii_TSA \ /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/UniPretora_RAPiD/trimmed_fastq/trimmed_RAPiD-Genomics_HJ2FJBBXX_UPT_116401_P01_WC03_i5-510_i7-78_S16_L006_1.fastq.gz \ /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/UniPretora_RAPiD/trimmed_fastq/trimmed_RAPiD-Genomics_HJ2FJBBXX_UPT_116401_P01_WC03_i5-510_i7-78_S16_L006_2.fastq.gz \ > greggii_1.sam Alignment statistics: Greggii 1: 11,905,265 reads mapped of 38,523,068 (30.90%) 2: 15,067,689 reads mapped of 47,992,776 (31.96%) Oocarpa 1: 13,364,645 reads mapped of 42,819,676 (31.21%) 2: 13,584,955 reads mapped of 44,178,222 (30.75%) Patula 1: 17,559,110 reads mapped of 53,124,628 (33.05%) 2: 17,908,833 reads mapped of 53,644,884 (33.38%) Tecunumanii 1: 15,051,010 reads mapped of 50,041,512 (30.07%) 2: 12,734,881 reads mapped of 42,657,526 (29.85%) The sam files are then converted to bam and sorted. Sample command: samtools view -@ 32 -b -o greggii_1.bam greggii_1.sam samtools sort -@ 32 -o greggii_1.sorted.bam greggii_1.bam ################################################################################################################################################################### 4. CALLING SNPS SNPs are called using Freebayes in a strict and generic setting. Sample command: freebayes -f greggii_TSA.fasta -i -u -m 0 -q 15 -F .2 --min-coverage 8 -C 2 greggii_1.sorted.bam > greggii_1.standard.vcf freebayes -f greggii_TSA.fasta -i -u -m 20 -q 25 -F .2 --min-coverage 10 -C 6 greggii_1.sorted.bam > greggii_1.strict.vcf SNP counts: (strict and standard) Greggii 1: 162,269 and 225,597 2: 170,000 and 233,738 Oocarpa 1: 180,135 and 245,904 2: 187,961 and 258,954 Patula 1: 159,301 and 211,383 2: 159,912 and 212,463 Tecunumanii 1: 165,965 and 218,309 2: 157,870 and 207,684 The vcf files are then merged by species (2 per species). Sample command: bcftools merge --force-samples -o greggii.strict.vcf greggii_1.strict.vcf.gz greggii_2.strict.vcf.gz bcftools merge --force-samples -o greggii.standard.vcf greggii_1.standard.vcf.gz greggii_2.standard.vcf.gz SNP counts: (strict and standard) Greggii: 218,658 and 318,711 Oocarpa: 246,018 and 353,542 Patula: 206,492 and 288,290 Tecunumanii: 208,856 and 288,456 ################################################################################################################################################################### 5. DESIGNING and FILTERING PROBES 1. Done with 2_probe_design/70_mer_probes_for_transcriptome.pl : Probe designing is done in two steps. The first is to take a variant and evaluate its distances on the left and right to the nearest SNP (or beginning or end of the scaffold). IUPACs and codes are given and stored at this stage. A capitol letter IUPAC is given when the call is of the strict quality from Freebayes, and lower case for the standard. All multi-allelic SNPs are considered standard to prevent probe creation around them. The second step prints every probe for every strict quality bi-allelic SNP. The flanking sequence will have any number of IUPAC SNP bases. 2. Done with 2_probe_design/MQ_probe_filter.pl : These probes are then filtered so that only probes without SNPs in the first 20 bases on either side of the SNP remain. The other 15 outer sequence may have SNPs. Number of probes that pass MQ filter: Greggii: 15,557 Oocarpa: 25,905 Patula: 22,224 Tecunumanii: 23,842 ################################################################################################################################################################### 5. FRAME SELECTION with GENEMARKS-ET Transcripts are frame-selected. Sample command: perl ~/GeneMarks/GeneMarks-T/gmst.pl -faa -fnn greggii_TSA.fasta Transcripts Retained: Greggii: 38,143 (99.90%) Oocarpa: 35,010 (99.76%) Patula: 52,418 (99.36%) Tecunumanii: 28,604 (99.94%) --> These results suggest that the transcripts were already frame selected. ################################################################################################################################################################### 7. FRAME SELECTED PROTEIN TRANSLATED TRANSCRIPTS ALIGNED with GENOMETHREADER To increase number of alignments, the protein translation will be used for alignment to the Pinus taeda genome. Sample command: gth \ -genomic ../pita.v2.0.1.masked3k2.fa \ -protein ../../genemarks/greggii/greggii_TSA.fasta.faa \ -force \ -gff3out\ -showseqnums \ -startcodon \ -finalstopcodon \ -dpminexonlen 20 \ -dpminintronlen 9 \ -skipalignmentout \ -o greggii_TSA.gth.faa.gff3 Since genome threader does combination alignments, with one gene prediction being made up of multiple transcripts or with one transcript aligning to multiple places in the genome, statistics on alignments must consider different ideas. 1. How many transcripts are aligning including the fact that one transcript can align an unlimited amount of time and one gene can be made up of an unlimited number of transcript evidences. 2. Then, how many unique transcripts are aligning? Gff3 output statistics: Greggii Net alignments: 1,901,009 Unique number of transcripts that aligned: 17,769 --> saved as probe_correction_new/greggii/aligning_transcripts.txt Oocarpa Net alignments: 2,260,797 Unique number of transcripts that aligned: 19,058 --> saved as probe_correction_new/oocarpa/aligning_transcripts.txt Patula Net alignments: 1,467,907 Unique number of transcripts that aligned: 15,012 --> saved as probe_correction_new/patula/aligning_transcripts.txt Tecunumanii Net alignments: 1,436,257 Unique number of transcripts that aligned: 14,511 --> saved as probe_correction_new/tecunumanii/aligning_transcripts.txt ################################################################################################################################################################### 8. ALIGNMENT IS USED TO INFORM PROBES A "Rosetta stone" of transcriptome to genome positions is very difficult because: 1. The gff3 file does not tell you what exons or parts of a gene correspond to which transcript if more than one is used as evidence. Solution: Only allow gene models where only 1 transcript is the evidence. 2. Alignments can have insertions or deletions. Solution: Create a threshold for alignment length differences. (Decided on +- 5nt) 3. You cannot know where those insertions or deletions are. Solution: Create a blacklist "range" of positions where a splice site may be. (10nt range given the threshold above) Here are the statistics on those filters: Greggii: Unique single mapping transcripts: 13,059 That also align +- 5nt to genome: 8,760 --> saved as probe_correction_new/greggii/single_evicence_5nt_unique_aligning_transcripts.txt Oocarpa Unique single mapping transcripts: 13,926 That also align +- 5nt to genome: 9,246 --> saved as probe_correction_new/oocarpa/single_evicence_5nt_unique_aligning_transcripts.txt Patula Unique single mapping transcripts: 13,105 That also align +- 5nt to genome: 8,468 --> saved as probe_correction_new/patula/single_evicence_5nt_unique_aligning_transcripts.txt Tecunumanii Unique single mapping transcripts: 12,946 That also align +- 5nt to genome: 8,425 --> saved as probe_correction_new/tecunumanii/single_evicence_5nt_unique_aligning_transcripts.txt Using the information of transcript positions where a splice site may be, probes are reevaluated. If a probe has at least one of those positions, it is removed. A second filter can also be created that removes probes that were built on transcripts that did not align at all. (See the --> saved as probe_correction_new/[species]/aligning_transcripts.txt from above. A third filter is also created from probes that have positive evidence of passing. These are from the 5nt sized, unique, single mapping transcripts that can be found from --> saved as probe_correction_new/tecunumanii/single_evicence_5nt_unique_aligning_transcripts.txt from above. Here are the statistics on those filters: Greggii Original MQ probes: 15,557 (MQ_greggii_SNP_probes.txt) After correction: 14,844 (corrected_MQ_greggii_SNP_probes.txt) Built on aligned transcriptome: 8,617 (mapping_corrected_MQ_greggii_SNP_probes.txt) Built on single mapping, +-5nt sized, unique aligned transcriptome: 5,067 (mapping_single_evicence_5nt_unique_corrected_MQ_greggii_SNP_probes.txt) Oocarpa Original MQ probes: 25,905 After correction: 25,012 Built on aligned transcriptome: 13,695 Built on single mapping, +-5nt sized, unique aligned transcriptome: 7,316 Patula Original MQ probes: 22,224 After correction: 21,264 Built on aligned transcriptome: 11,431 Built on single mapping, +-5nt sized, unique aligned transcriptome: 6,685 Tecunumanii Original MQ probes: 23,842 After correction: 22,854 Built on aligned transcriptome: 11,924 Built on a single mapping, +-5nt sized, unique aligned transcriptome: 7,004 All of these steps are done with the scripts at 5_probe_correction_new/ level. ################################################################################################################################################################### 9. FILES OF INTEREST I name files in a very predictable manner. I am only going to show greggii file names as the other three species follow the same naming theme. SNP files: 1_alignments/greggii/greggii.standard.vcf.gz 1_alignments/greggii/greggii.strict.vcf.gz Probe Files: Unfiltered probes: 2_probe_design/greggii/greggii_SNP_probes.txt MQ filtered probes: 2_probe_design/greggii/MQ_greggii_SNP_probes.txt Genome threader gff3 files: 3_genomethreader/greggii/greggii_gmet.gth.faa.gff3 Final filtered probe files (in the order they appear above): 5_probe_correction_new/greggii/corrected_MQ_greggii_SNP_probes.txt 5_probe_correction_new/greggii/mapping_corrected_MQ_greggii_SNP_probes.txt 5_probe_correction_new/greggii/mapping_single_evicence_5nt_unique_corrected_MQ_greggii_SNP_probes.txt <-- has the least amount of probes ################################################################################################################################################################### Thanks for reading me!