University of Pretoria SNP Identification and Probe Designs
Includes transcriptome work
Madison.Caballero@uconn.edu

###################################################################################################################################################################

README layout:
   1. Background
   2. Transcriptomes
   3. Alignment of exome capture to transcriptomes
   4. Calling SNPs from those alignments
   5. Designing and initial filtering of probes
   6. Frame selection of transcripts 
   7. Genomethreader alignment of transcriptomes to loblolly genome
   8. Probe correction using those alignments - final sets
   9. Files of interest 

###################################################################################################################################################################

1. BACKGROUND

Species:
Pinus taeda (Pita) --> reference used for mapping, Loblolly Pine. Native to south eastern United States from central Texas to east Florida and north to Delaware.
        Genome size is roughly 22 billion bp.

Pinus greggii (Pigr) --> A species of pine tree native to eastern Mexico, in the Sierra Madre Oriental. Can hybridize with Pinus patula.
        File Locations: http://tgwebdev.cam.uchc.edu/FTP/temp/Pita_Genotyping/Pretoria/Pigr/
Pinus oocarpa (Pioo) --> A species of pine tree native to Mexico and Central America including western Mexico, Guatemala and the higher elevations of Honduras,
        El Salvador and northwestern Nicaragua.
        File Locations: http://tgwebdev.cam.uchc.edu/FTP/temp/Pita_Genotyping/Pretoria/Pioo/
Pinus patula (Pipa) --> A species of pine tree native to the highlands of Mexico. Can hybridize with Pinus greggii.
        File Locations: http://tgwebdev.cam.uchc.edu/FTP/temp/Pita_Genotyping/Pretoria/Pipa/
Pinus tecunumanii (Pite) --> A species of pine tree native to Mexico and Central America. It grows from the highlands of Chiapas and Oaxaca through Guatemala,
        Belize, El Salvador, Honduras to Nicaragua. Closer to Pinus oocarpa.
        File Locations: http://tgwebdev.cam.uchc.edu/FTP/temp/Pita_Genotyping/Pretoria/Pite/


RAPiD Genomics contricted 5 and 10-tree bulks of one provenance of each of the four species (see above greggii-tecunumanii). There are 8 libraries, 2 for
each species. Reads were captured using the 80K capture probe set designed for Pinus taeda.

Information on the libraries:

RG_Sample_Code  Customer_Code   Project_Code    Initial_gDNA_Conc       Initial_gDNA_Ng_Used    Lib_DNA_Conc    Total_Aligned_Reads_Percentage  Uniquely_Aligned_Mates_Percentage       Number_Probes
 Median_Probe_Depth
UPT_116401_P01_WC02     P. patula var patula -10samples UPT_116401      16.66998918     500     37.45743404     66.7            62.56666667     75395   239.75
UPT_116401_P01_WC03     P. greggii (South)-10samples    UPT_116401      16.66998918     500     34.12748039     60.56666667     56.2            73432   167.69
UPT_116401_P01_WC04     P. tecunumanii LE-10samples     UPT_116401      16.66998918     500     26.88594568     64.26666667     59.63333333     75592   217.645
UPT_116401_P01_WC05     P. oocarpa-10samples            UPT_116401      16.66998918     500     21.03594894     64.83333333     60.36666667     75574   190.535
UPT_116401_P01_WC06     P. patula var patula-5samples   UPT_116401      16.66998918     500     42.71031117     65.33333333     60.86666667     75011   239.4
UPT_116401_P01_WC07     P. greggii (South)-5samples     UPT_116401      16.66998918     500     33.07675546     61.5            57.46666667     73371   220.88
UPT_116401_P01_WC08     P. tecunumanii LE-5samples      UPT_116401      16.66998918     500     39.76021282     64.23333333     59.6            75184   184.98
UPT_116401_P01_WC09     P. oocarpa-5samples             UPT_116401      16.66998918     500     33.04211615     61.13333333     56.43333333     75426   178.51

###################################################################################################################################################################

2. TRANSCRIPTOMES - sizes

	Greggii: 38,179 transcripts
	Oocarpa: 35,092 transcripts
	Patula: 52,735 transcripts
	Tecunumanii: 28,621 transcripts

###################################################################################################################################################################
3. ALIGNMENT

	Exome capture reads from above are trimmed with sickle and aligned to their respective transcriptome. The transcriptomes are indexed with bwa mem.

	Sample command:

	bwa index -p greggii_TSA greggii_TSA.fasta

	bwa mem -t 32 greggii_TSA \
        /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/UniPretora_RAPiD/trimmed_fastq/trimmed_RAPiD-Genomics_HJ2FJBBXX_UPT_116401_P01_WC03_i5-510_i7-78_S16_L006_1.fastq.gz \
        /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/UniPretora_RAPiD/trimmed_fastq/trimmed_RAPiD-Genomics_HJ2FJBBXX_UPT_116401_P01_WC03_i5-510_i7-78_S16_L006_2.fastq.gz \
        > greggii_1.sam

		Alignment statistics:

			Greggii
				1: 11,905,265 reads mapped of 38,523,068 (30.90%)
				2: 15,067,689 reads mapped of 47,992,776 (31.96%)
			Oocarpa
				1: 13,364,645 reads mapped of 42,819,676 (31.21%)
				2: 13,584,955 reads mapped of 44,178,222 (30.75%)
			Patula
				1: 17,559,110 reads mapped of 53,124,628 (33.05%)
				2: 17,908,833 reads mapped of 53,644,884 (33.38%)
			Tecunumanii
				1: 15,051,010 reads mapped of 50,041,512 (30.07%)
				2: 12,734,881 reads mapped of 42,657,526 (29.85%)

	The sam files are then converted to bam and sorted. 

	Sample command:
	
	samtools view -@ 32 -b -o greggii_1.bam greggii_1.sam
	samtools sort -@ 32 -o greggii_1.sorted.bam greggii_1.bam


###################################################################################################################################################################

4. CALLING SNPS

	SNPs are called using Freebayes in a strict and generic setting.
	
	Sample command:

	freebayes -f greggii_TSA.fasta -i -u -m 0 -q 15 -F .2 --min-coverage 8 -C 2 greggii_1.sorted.bam > greggii_1.standard.vcf
	freebayes -f greggii_TSA.fasta -i -u -m 20 -q 25 -F .2 --min-coverage 10 -C 6 greggii_1.sorted.bam > greggii_1.strict.vcf

	SNP counts: (strict and standard)
		Greggii
                       1: 162,269 and 225,597
                       2: 170,000 and 233,738
          	Oocarpa
                       1: 180,135 and 245,904
                       2: 187,961 and 258,954
            Patula
                       1: 159,301 and 211,383
                       2: 159,912 and 212,463
            Tecunumanii
                       1: 165,965 and 218,309
                       2: 157,870 and 207,684


	The vcf files are then merged by species (2 per species).

	Sample command:

	bcftools merge --force-samples -o greggii.strict.vcf greggii_1.strict.vcf.gz greggii_2.strict.vcf.gz
	bcftools merge --force-samples -o greggii.standard.vcf greggii_1.standard.vcf.gz greggii_2.standard.vcf.gz

	SNP counts: (strict and standard) 
		Greggii: 218,658 and 318,711
		Oocarpa: 246,018 and 353,542
		Patula: 206,492 and 288,290
		Tecunumanii: 208,856 and 288,456


###################################################################################################################################################################

5. DESIGNING and FILTERING PROBES

  1. Done with 2_probe_design/70_mer_probes_for_transcriptome.pl :

	Probe designing is done in two steps. The first is to take a variant and evaluate its distances on the left and right to the nearest SNP (or beginning
        or end of the scaffold). IUPACs and codes are given and stored at this stage. A capitol letter IUPAC is given when the call is of the strict quality 
	from Freebayes, and lower case for the standard. All multi-allelic SNPs are considered standard to prevent probe creation around them.

	The second step prints every probe for every strict quality bi-allelic SNP. The flanking sequence will have any number of IUPAC SNP bases. 

  2. Done with 2_probe_design/MQ_probe_filter.pl :

	These probes are then filtered so that only probes without SNPs in the first 20 bases on either side of the SNP remain. The other 15 outer sequence may 
	have SNPs.

	Number of probes that pass MQ filter:
		Greggii: 15,557
		Oocarpa: 25,905
		Patula: 22,224
		Tecunumanii: 23,842		


###################################################################################################################################################################

5. FRAME SELECTION with GENEMARKS-ET

	Transcripts are frame-selected.

	Sample command:
		perl ~/GeneMarks/GeneMarks-T/gmst.pl -faa -fnn greggii_TSA.fasta

	Transcripts Retained:
		Greggii: 38,143 (99.90%)
		Oocarpa: 35,010 (99.76%)
		Patula: 52,418 (99.36%)
		Tecunumanii: 28,604 (99.94%)

		--> These results suggest that the transcripts were already frame selected. 


###################################################################################################################################################################

7. FRAME SELECTED PROTEIN TRANSLATED TRANSCRIPTS ALIGNED with GENOMETHREADER
	
	To increase number of alignments, the protein translation will be used for alignment to the Pinus taeda genome. 

	Sample command:
		gth \
		-genomic ../pita.v2.0.1.masked3k2.fa \
		-protein ../../genemarks/greggii/greggii_TSA.fasta.faa \
		-force \
		-gff3out\
		-showseqnums \
		-startcodon \
		-finalstopcodon \
		-dpminexonlen 20 \
		-dpminintronlen 9 \
		-skipalignmentout \
		-o greggii_TSA.gth.faa.gff3

	Since genome threader does combination alignments, with one gene prediction being made up of multiple transcripts or with one transcript aligning to
	multiple places in the genome, statistics on alignments must consider different ideas.
		1. How many transcripts are aligning including the fact that one transcript can align an unlimited amount of time and one gene can be made up of an 
			unlimited number of transcript evidences. 
		2. Then, how many unique transcripts are aligning? 

	Gff3 output statistics:
		Greggii
			Net alignments: 1,901,009
			Unique number of transcripts that aligned: 17,769 --> saved as probe_correction_new/greggii/aligning_transcripts.txt
		Oocarpa
			Net alignments: 2,260,797
			Unique number of transcripts that aligned: 19,058 --> saved as probe_correction_new/oocarpa/aligning_transcripts.txt
		Patula
			Net alignments: 1,467,907
			Unique number of transcripts that aligned: 15,012 --> saved as probe_correction_new/patula/aligning_transcripts.txt
		Tecunumanii
			Net alignments: 1,436,257
			Unique number of transcripts that aligned: 14,511 --> saved as probe_correction_new/tecunumanii/aligning_transcripts.txt


###################################################################################################################################################################

8. ALIGNMENT IS USED TO INFORM PROBES

	A "Rosetta stone" of transcriptome to genome positions is very difficult because:
		1. The gff3 file does not tell you what exons or parts of a gene correspond to which transcript if more than one is used as evidence.
			Solution: Only allow gene models where only 1 transcript is the evidence.  

		2. Alignments can have insertions or deletions.
			Solution: Create a threshold for alignment length differences. (Decided on +- 5nt)

		3. You cannot know where those insertions or deletions are.
			Solution: Create a blacklist "range" of positions where a splice site may be. (10nt range given the threshold above) 
		
	Here are the statistics on those filters:
		
		Greggii:
			Unique single mapping transcripts: 13,059 
			That also align +- 5nt to genome: 8,760 --> saved as probe_correction_new/greggii/single_evicence_5nt_unique_aligning_transcripts.txt
		Oocarpa 
			Unique single mapping transcripts: 13,926
			That also align +- 5nt to genome: 9,246 --> saved as probe_correction_new/oocarpa/single_evicence_5nt_unique_aligning_transcripts.txt
		Patula 
			Unique single mapping transcripts: 13,105
			That also align +- 5nt to genome: 8,468 --> saved as probe_correction_new/patula/single_evicence_5nt_unique_aligning_transcripts.txt
		Tecunumanii 
			Unique single mapping transcripts: 12,946
			That also align +- 5nt to genome: 8,425 --> saved as probe_correction_new/tecunumanii/single_evicence_5nt_unique_aligning_transcripts.txt


	Using the information of transcript positions where a splice site may be, probes are reevaluated. If a probe has at least one of those positions, it
	is removed. A second filter can also be created that removes probes that were built on transcripts that did not align at all. (See the 
	--> saved as probe_correction_new/[species]/aligning_transcripts.txt from above.

	A third filter is also created from probes that have positive evidence of passing. These are from the 5nt sized, unique, single mapping transcripts that
	can be found from --> saved as probe_correction_new/tecunumanii/single_evicence_5nt_unique_aligning_transcripts.txt from above. 

	Here are the statistics on those filters:

	   Greggii
            Original MQ probes: 15,557 (MQ_greggii_SNP_probes.txt)
            After correction: 14,844 (corrected_MQ_greggii_SNP_probes.txt)
            Built on aligned transcriptome: 8,617 (mapping_corrected_MQ_greggii_SNP_probes.txt)
            Built on single mapping, +-5nt sized, unique aligned transcriptome: 5,067 (mapping_single_evicence_5nt_unique_corrected_MQ_greggii_SNP_probes.txt)
        Oocarpa
            Original MQ probes: 25,905
            After correction: 25,012
            Built on aligned transcriptome: 13,695
            Built on  single mapping, +-5nt sized, unique aligned transcriptome: 7,316
        Patula
            Original MQ probes: 22,224
            After correction: 21,264
            Built on aligned transcriptome: 11,431
            Built on single mapping, +-5nt sized, unique aligned transcriptome: 6,685
        Tecunumanii
            Original MQ probes: 23,842 
            After correction: 22,854
            Built on aligned transcriptome: 11,924
            Built on a single mapping, +-5nt sized, unique aligned transcriptome: 7,004


	All of these steps are done with the scripts at 5_probe_correction_new/ level. 

###################################################################################################################################################################

9. FILES OF INTEREST

I name files in a very predictable manner. I am only going to show greggii file names as the other three species follow the same naming theme.


SNP files:
	1_alignments/greggii/greggii.standard.vcf.gz 
	1_alignments/greggii/greggii.strict.vcf.gz

Probe Files:
	Unfiltered probes: 2_probe_design/greggii/greggii_SNP_probes.txt
	MQ filtered probes: 2_probe_design/greggii/MQ_greggii_SNP_probes.txt

Genome threader gff3 files:
	3_genomethreader/greggii/greggii_gmet.gth.faa.gff3

Final filtered probe files (in the order they appear above):
	5_probe_correction_new/greggii/corrected_MQ_greggii_SNP_probes.txt
	5_probe_correction_new/greggii/mapping_corrected_MQ_greggii_SNP_probes.txt
	5_probe_correction_new/greggii/mapping_single_evicence_5nt_unique_corrected_MQ_greggii_SNP_probes.txt <-- has the least amount of probes


###################################################################################################################################################################

Thanks for reading me!