README FOR LOBLOLLY GENOTYPING ASSAY Contact: Madison.caballero@uconn.edu ############################################################################################################################################################### Note: Pita = Pinus taeda = Loblolly pine What this README entails: 1. Background 1. UF Exome Capture 2. TAMU Exome Capture 3. NCSU GBS 4. VA Tech GBS 2. Read processing (quality control) 3. Alignment 4. SNP/Indel calling 5. CTGN SNP addition 6. Kmer testing and probe design 7. Highlights of statistics 8. File locations and extra READMEs 9. Update 8/23/18 : section on formatting for Thermo ############################################################################################################################################################### 1. BACKGROUND - Where the data is coming from. 1. UF EXOME CAPUTRE: SNP probes are designed from exome capture of 24 Loblolly (Pinus taeda, Pita) and 24 Slash (Pinus elliotii, Piel). Samples were taken from haploid tissue (megagametophyte) and probes were designed from the original transcripts. For SNPs, only samples corresponding to Pita are used. Library was prepared from 8 individuals and pooled after ligation of adapters containing barcodes. The pooled sample was used for capture hybridization and sequence. See publication for detailed protocol. Raw sequencing data for each lane was demultiplexed using tool fastx_barcode_splitter.pl from fastx_toolkit(version 11sep2008) accepting no mismatches. During the demultiplexing process, some mates of the same fragment are lost. The data provided has been synchronized using custom scripts to keep only fragments with correct paired-end structure. Sequencing was done on Illumina HiSeq 2000 paired-end (2x100bp). Coverage assumed to be 30x. 2. TAMU EXOME CAPTURE: SNP probes are designed exome capture from a population of 375 trees using NimbleGen probes and then sequenced using the Illumina HiSeq 2500 platform. Coverage also assumed to be 30x. 3. NCSU GBS: 1,562 samples of tree phloem were collected from the Plant Seed Source Selection Study (PSSSS) located in Screven County, Georgia. The PSSSS consists of 140 pollen mix families planted in 20 test locations representing much of the loblolly pine range. Height and diameter at breast height (DBH) measurements were taken at this site for ages 4 and 8 years. Tissue samples were collected using 19mm diameter leather hole punches. The phloem was peeled off the bark disk and placed in a pre-labeled 2.0mL tube with 1mL of storage solution containing 10mM EDTA (pH 8.0) and 10mM sodium bisulfite dissolved in water. The tissue samples were stored at room temperature for up to 4 months before DNA was extracted. One half of each phloem disk was used per individual extraction. The disks were smashed with the bottom of a test tube prior to grinding to increase yields. DNA extractions were performed using a modified protocol from the Canadian Center for DNA Barcoding (Ivanova, Dewaard, & Hebert, 2006) with the insect lysis buffer and the following parameters: Tissue samples were ground in 96-sample deep-well plates from Fisher Scientific using a ceramic bead in the bottom of each well, followed by the tissue sample, and topped with a ceramic cylinder. 600µl of warmed Insect Lysis buffer was used per sample. Samples were ground in a Mixer Mill at a speed of 30 Hz for 6 minutes on each side. A 1.5 hour of incubation at 60°C followed, with shaking at 30 Hz for 1 minute at every 30 minute interval of the 1.5 hour incubation. 250µl of crude lysate was taken from each sample and allowed to bind with the Plant Binding Buffer for 15 minutes before transferring to the glass filter plates. Both Pall and Nunc glass filter plates were used successfully. A 15 minute spin after the final wash step was done in place of the 1.5 hour drying in a pre-heated oven. Samples were eluted using a low salt solution and allowed to stand for 5 minutes before centrifugation. DNA was quantified using a fluorescent dye-binding assay called AccuBlue (Biotium). DNA yields were lower than expected so three different concentration methods were used. Samples were either dried down to concentration in a large scale plate speed vacuum, dried-down individually in a single tube speed vacuum, or were pooled at different volumes (differential pooling) during the first plate pooling process. The restriction digest, ligations, and sample pooling were performed according to the published protocol (Poland et al., 2012) with the following modifications: Approximately 700ng of DNA was digested with PstI and MspI restriction enzymes, and then ligation using T7 or T4 ligase was performed overnight (~18 hours). No 65°C kill step was performed. 96 of the designed 384 barcode adapters from the Poland et al., 2012 protocol were used. Ligated samples from each plate were pooled together using 5 µl of each ligation reaction (unless the sample started with less than 700ng; if so, the volume pooled was adjusted appropriately). After ligation, samples were size selected for 200-500base pair (BP) range using a PippinPrep instrument (Sage Science). Each pooled plate was separated into 6 PCR reactions. A total of 12 indexing primers were used. Three different primers per plate were used, allowing for a total of 4 plates to be pooled together. The PCR reactions were done using Q5 polymerase (New England Biolabs). After amplification, all plates were quantified using PicoGreen (Life Technologies) and the plates were normalized to the same concentration and pooled together, producing one library with up to 384 samples each (4 plates). A total of 4 pooled libraries were produced. Each of the libraries were sequenced for 100bp paired end reads over 4 flowcell lanes (~96 samples/lane) on an Illumina HiSeq2000 using V3 chemistry. A total of 16 flowcell lanes were sequenced by the Beijing Genomics Institute service center in Philadelphia. Here are the 16 lanes: Ar1_FCC21M0ACXX_s_5_fastq Ar1_FCC21WAACXX_s_7_fastq Ar1_FCD23ETACXX_s_5_fastq Ar1_FCD23ETACXX_s_6_fastq Ar2_FCC21M0ACXX_s_5_fastq Ar2_FCC21WAACXX_s_7_fastq Ar2_FCD23ETACXX_s_5_fastq Ar2_FCD23ETACXX_s_6_fastq Br1_FCC21M0ACXX_s_1_fastq Br1_FCC21WAACXX_s_8_fastq Br1_FCD23ETACXX_s_7_fastq Br1_FCD23ETACXX_s_8_fastq Br2_FCC21M0ACXX_s_1_fastq Br2_FCC21WAACXX_s_8_fastq Br2_FCD23ETACXX_s_7_fastq Br2_FCD23ETACXX_s_8_fastq Cr1_FCC21M0ACXX_s_2_fastq Cr1_FCD23J4ACXX_s_1_fastq Cr1_FCD23J4ACXX_s_2_fastq Cr1_FCD23J4ACXX_s_3_fastq Cr2_FCC21M0ACXX_s_2_fastq Cr2_FCD23J4ACXX_s_1_fastq Cr2_FCD23J4ACXX_s_2_fastq Cr2_FCD23J4ACXX_s_3_fastq Dr1_FCD23J4ACXX_s_4_fastq Dr1_FCD23J4ACXX_s_5_fastq Dr1_FCD23J4ACXX_s_6_fastq Dr2_FCD23J4ACXX_s_4_fastq Dr2_FCD23J4ACXX_s_5_fastq Dr2_FCD23J4ACXX_s_6_fastq Each starting letter is the pool and the 1 or 2 in Ar1 or Ar2, for example, specifies the forward and reverse. Data in each of the raw fastq.gz data files from the four lanes within a pool were split into the four plates present based on the index sequences of the adapters, using a Linux command line script. These are in 6nt sequences that identify the plates from the pools. The barcode is found in the read header and begins with a # and always ends with AT. Therefore, there are 8nt with the first 6 being the index (excluding errors). Index sequences: name seq read index1 ATCACG index2 CGATGT index3 TTAGGC index4 TGACCA index5 ACAGTG index6 GCCAAT index7 CAGATC index8 ACTTGA index9 GATCAG index10 TAGCTT index11 GGCTAC index12 CTTGTA The assignments of indexes to plates within pools A, B, C and D was as follows: Pool Plate # indexes A Plate 2 1,2,3 Plate 3 4,5,6 Plate 13 7,8,9 Plate 14 10,11,12 B Plate 4 1,2,3 Plate 5 4,5,6 Plate 6 7,8,9, Plate 7 10,11,12 C Plate 8 1,2,3, Plate 9 4,5,6 Plate 10 7,8,9, Plate 11 10,11,12 D Plate 15 1,2,3 Plate 16 4,5,6 Plate 17 7,8,9 Plate 1 none - so indexes CTTGTA, GGCTAC, and TAGCTT are not found in Pool D headers. Instead, a much wider variety of sequences that don't match any index (even at Hamming distance 1) are found in the header lines. Presumably those reads are from plate 1. Each lane must first have the plates separated as each lane has a mix of four plates. A grep command pulls out the reads (grep -EA3) by matching to the 6nt sequences in the header following the # symbol. The sequences from each lane are put into separate folders still separated by their lane pool and forward and reverse reads (8 files in total in each folder). For plate 1, a perl script was written to pull out non-matching indices but is presumed to have index sequences with errors that may belong to other plates. The 4 lanes of the forward were concatenated into a single file. The forward reads were run through FastX toolkit barcode splitter which can only read single ends. For the reverse, a script was created to pull the corresponding reverse reads for each sample from the reverse reads in lanes merged. In each plate, at the end, there should be 96 samples forward and 96 samples reverse. The number of reads in each sample should be similar but rarely perfectly so. Example of Fastx command: zcat ../reads/forward_reads.fastq.gz | ../../fastx_barcode_splitter.pl --bcfile ../../barcodes.txt --prefix 1_ -bol Separating the plates by barcode:The provided barcodes: A01 TGACGCCATG A02 CAGATATGCA A03 GAAGTGTGCA A04 TAGCGGATTG A05 TATTCGCATT A06 ATAGATTGCA A07 CCGAACATGC A08 GGAAGACATT A09 GGCTTATGCA A10 AACGCACATT A11 GAGCGACATT A12 CCTTGCCATT B01 GGTATATGCA B02 TCTTGGTGCA B03 GGTGTTGCAG B04 GGATATGCAG B05 CTAAGCATGC B06 ATTATTGCAG B07 GCGCTCATGC B08 ACTGCGATTG B09 TTCGTTTGCA B10 ATATAATGCA B11 TGGCAACAGA B12 CTCGTCGTGC C01 GCCTACCTTG C02 CACCATGCAG C03 AATTAGTGCA C04 GGAACGATGC C05 ACAACTTGCA C06 ACTGCTTGCA C07 CGTGGACAGT C08 TGGCACAGAT C09 TGCTTTGCAG C10 GCAAGCCATT C11 CGCACCAATT C12 CTCGCGGTGC D01 AACTGGTGCA D02 ATGAGCAATG D03 CTTGATGCAG D04 GCGTCCTTGC D05 ACCAGGATGC D06 CCACTCATGC D07 TCACGGAAGT D08 TATCATGCAG D09 TAGCCAATGC D10 ATATCGCCAT D11 CTCTATGCAG D12 GGTGCACATT E01 CTCTCGCATT E02 CAGAGGTTGC E03 GCGTACAATT E04 ACGCGCGTGC E05 GTCGCCTTGC E06 AATAACCAAT E07 AATGAACGAT E08 CGTCGCCACT E09 ATGGCAATGC E10 GAAGCATGCA E11 AACGTGCCTT E12 CCTCGTGCAG F01 CTCATTGCAG F02 ACGGTACTTG F03 GCGCCGTGCA F04 CAAGTTGCAG F05 TCCGAGTGCA F06 TAGATGATGC F07 TGGCCAGTGC F08 GCACGATTGC F09 TTGCTGTGCA F10 CGCAACCAGT F11 TCACTGTGCA F12 ACAGTTGCAG G01 GGAGTCAAGT G02 TGAATTGCAG G03 CATATTGCAG G04 GTGACACATT G05 TATGTTGCAG G06 CAGTGCCATT G07 ACAACCAACT G08 TGCAGATGCA G09 CATCTGCCGT G10 GGACAGTGCA G11 ATCTGTTGCA G12 AAGACGCTTG H01 GAATGCAATA H02 TAGCAGTGCA H03 ATCCGTGCAG H04 CTTAGTGCAG H05 TTATTACATT H06 GCCAACAAGA H07 TGCCGCATTG H08 CGTGTCATGC H09 CAACCACACA H10 GCTCCGATGC H11 TCAGAGATTG H12 CGTTCATGCA These barcodes are variable in length but are filled out to 10bp from the remnant of the PstI restriction site (TGCAG) for fastx barcode splitter (fastx tools). Then, as mentioned before, a personal script removes the reverse matching sample. 4. VA TECH DATA: Files were delivered already demultiplexed with stacks process_radtags. Libraries were constructed from double digests (PstI/MspI) of gDNA with a target size range of 300-500bp (verified by bioanalyzer). Target depth is fairly low - 96 samples per lane in 2x100bp format on a HiSeq 2500. There is some overlap of families but different individuals are planted at the two sites (when compared to NCSU). Radtags were processed with stacks. Barcodes for samples were provided in a file. Adapters are from command: /work/newriver/jah1/software/stacks-1.44/process_radtags -1 forward_reads \ -2 reverse_reads -b barcodes.txt -P -c -q -r -o out_dir \ --inline_null --renz_1 pstI -i gzfastq \ --adapter_1 AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT \ --adapter_2 CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT Samples were decompressed, but 5 were corrupted and not used for SNP steps: A17-46.fq.gz2 B10-01.fq.gz2 B13-50.fq.gz2 B4-28.fq.gz2 B5-56.fq.gz2 ############################################################################################################################################################### 2. READ PROCESSING AND QUALITY CONTROL All reads are trimmed with sickle pe (version 1.33). Quality and length thresholds were set to 30 and 50 respectively. Sample command: sickle pe \ -f ../samples/1_A01.fastq \ -r ../samples/2_A01.fastq \ -t illumina \ # Or sanger depending on read type -o trimmed_1_A01.fastq \ -p trimmed_2_A01.fastq \ -s trimmed_single_A01.fastq \ -q 30 \ -l 50 ############################################################################################################################################################### 3. ALIGNMENT Alignment is always done using BWA (0.7.15) and all reads are aligned against the 2.01 3k version of the Pita Genome. Sample command: bwa mem \ -t 32 \ genome/pita.v2.0.1.masked.3k \ /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/PRJNA320114_TAMU/trimmed_fastq/trimmed_SRR3540310_1.fastq.gz \ /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/PRJNA320114_TAMU/trimmed_fastq/trimmed_SRR3540310_1.fastq.gz \ > sam/SRR3540310.sam Once alignment sam files are produced, they are converted to bam and then sorted using samtools. Sample commands: /isg/shared/apps/samtools/1.3.1/bin/samtools view -@ 32 -b -o bam/SRR3540310.bam sam/SRR3540310.sam /isg/shared/apps/samtools/1.3.1/bin/samtools sort -@ 32 -o sorted_bam/SRR3540310.bam bam/SRR3540310.bam ############################################################################################################################################################### 4. SNP/INDEL CALLING SNP and indel calling is always done with Freebayes (version 1.0.2). Two types of vcf files are produced, one run with standard, generic setting, and another with more stringent settings. This is to have high quality SNPs to build probes around while considering other SNPs in the surrounding sequence. Sample commands: freebayes -f genome/pita.v2.0.1.masked.3k.fasta -u -m 0 -q 15 -F .2 --min-coverage 8 -C 2 sorted_bam/SRR3540314.bam > vcf/standard_SRR3540314.vcf freebayes -f genome/pita.v2.0.1.masked.3k.fasta -u -m 20 -q 25 -F .2 --min-coverage 10 -C 6 sorted_bam/SRR3540314.bam > vcf/strict_SRR3540314.vcf VCF files are merged with bcftools by their respective set and then merged into a super set of all strict and all standard variants. These files are then annotated with SNPeff (4.3i) and the new 2.01 Pita annotation. Number of SNPs and Indels after project merges: TAMU Strict: 7,702,804 Standard: 25,216,501 UF Strict: 1,516,877 Standard: 8,338,718 VA TECH Strict: 261,768 Standard: 828,122 NCSU Strict: 1,105,218 Standard: 2,056,210 Number of SNPs and Indels after final merge: Strict: 10,079,809 Strict bi-allelic: 8,272,630 # This number is produced by the 70_mer_probes.pl script. 3+ allele variants are ignored. Standard: 35,196,703 All vcf files are available in a .tar file. This includes all individual files of both qualities and full merges of both qualities. ############################################################################################################################################################### 5. CTGN SNP addition A set of 4,862 probes that had previous genotypes should be included in the final set of probes. However, these probe sequences were developed off the original versionof the P. taeda genome and must be reestablished. The overall project of CTGN SNP inclusion follows these steps: 1. Convert sequences to a format that can be aligned to the P. taeda genome. 2. Identify the matches by their scaffold and position. 3. Merge these sequences into the set of SNPs and Indels from the main project. 4. Redesign probes around the SNPs again. PROCESS: 1. Convert sequences to a format that can be aligned to the P. taeda genome. The probes came in a csv format that included variable probe sequences, their name, and associated genotypes. There were 4,862 sequences. This file was converted to a fasta format with the header including relevant information and the probe as a sequence with a IUPAC code for the SNP. This was done with the script csv_to_fasta.pl. These fasta files are aligned using Bowtie2 and the sam is converted to a bed file. 1,294 reads mapped (26.61%). 2. Identify the matches by their scaffold and position. This had to be dealt with carefully. Since bed files give a range of the sequence map, the true location of the SNP can only be accurately determined if the entire sequence is a match without insertions or deletions. The SNP location can then be given a scaffold and position. These sequences are turned into a pseudo vcf file by vcf_from_bed.pl. This is not a real vcf file and cannot merge because the numbers are innacurate. I cannot tell what is the reference base or alternate base and the support, read depth, and other information cannot be filled out. 3. Merge these sequences into the set of SNPs and Indels from the main project. This pseudo vcf of 1,181 sequences is concatenated into the full vcf from the main project. A true bcftools merge could not occur due to that formatting issue. Overlap and reference/alternate base issues will be resolved by my probe creation script. 4. Redesign probes around the SNPs again. The psuedo vcfs: pseudo_vcf_standard.vcf and pseudo_vcf_strict.vcf are going to be used in kmer testing for final probe creation. ############################################################################################################################################################### 6. KMER TESTING and PROBE DESIGN To understand probe mapping to the genome, the kmers that make up the probes are to be run across the genome for mapping frequency. First, probes need to be created using the pseudo vcfs from before, however the probes must be printed WITHOUT IUPAC codes. This is because my simplistic matching will not work if non-nucleotide codes are present in the sequence. A modified version of my probe creation script called 70_mer_probes_just_print.pl is used to create probes as they match the genome sequence. The SNP is also just the reference genome base. The 8,273,868 probes can then be split into 4 18-nt kmers like this: GTTTTTCTGCAGGTTCCAGGAGAGTCGACTTAGTT G CTTCTGGGTCTTTCTACGACCTGTTAATTTGGGGT to GTTTTTCTGCAGGTTCCA GGAGAGTCGACTTAGTG GCTTCTGGGTCTTTCTAC GACCTGTTAATTTGGGGT Notice that the SNP was used twice just because the full sequence is 71 nt long. All kmers produced are thrown into a hash, and duplicates are already lost. There were 24,875,817 unique kmers in the probe sets, but 33,087,264 net total. Since keys in a hash must be unique, the duplicates are replaced. This suggests there already is a large amount of overlap within the sequences. The entire P. taeda genome is then run with an 18nt sliding window. Mathes to the hatch are tallied so that each kmer has a corresponding hit-rate to look like this: Kmers hash (all same case): GTTTGGGGTTTTCTGCAG 12 GTTCCAGGAGAGTCGACA 108 ATGTCATTTGCTGGATCT 1 ATGTCATTTGCTGGATCT 4 The kmers with the highest mapping rate were, predictably, simple repeat sequences like tatta or gtgt. The average kmer had 14.6 hits, but the median was 1 hit. The probes are then reevaluated with this information. Since each probe has 4 kmers, the four scores of each kmers in how it maps to the genome is added up. This is not the most accurate or telling way, but it gives a probe a good sense of its contents. The lowest score for a probe would be 4 if each kmer only has a hit-rate of 1. The record holder for highest scoring probe: 4,667,554 Average: 285.91385289969 Median: 8 So, in order to set a cutoff that kept a lot of viable probes, the average was used as a cut off for SNPs and probes that mapped too much for comfort (286). For probe redesign (with IUPAC codes), SNPs with a mapping score of more than 286 were barred as HIGHMAP. This flag was put onto the scored_SNPs.txt file and act like the STANDARD FLAG to tell the script to not build a probe around it, but to recognize its presence with an upper-case IUPAC. This is done with the scripts inform_SNPs_with_kmer_table.pl and 70_mer_probes_from_adjusted_scores.pl using the file scored_SNPs_mapping_adjusted.txt. Probe designing is done in two steps. The first is to take a variant and evaluate its distances on the left and right to the nearest SNP/Indel (or beginning or end of the scaffold). IUPACs and codes (I for an indel) are given and stored at this stage. A capitol letter IUPAC is given when the call is of the strict quality from Freebayes, and lower case for the standard. All multi-allelic SNPs/Indels are considered standard to prevent probe creation around them. The second step prints every probe for every strict quality bi-allelic SNP or indel. The flanking sequence will have any number of IUPAC SNP or indel bases. These probes are then filtered so that only probes without SNPs or Indels in the first 20 bases on either side of the SNP remain. The other 15 outer sequence may have SNPs, but no indels. Duplicates are also removed. The final result is 194,397 probes with 22% in or near genes. ############################################################################################################################################################### 8. STATISTICS The final result is 194,397 probes with 22% in or near genes. Final stats highlights: Total number of probes: 194397 Number of probes annotated near/in genes 42749 (21.99%) intergenic_region 151648 (78.01%) synonymous_variant 5787 (2.98%) missense_variant 6297 (3.24%) downstream_gene_variant 7025 (3.61%) upstream_gene_variant 6363 (3.27%) intron_variant 15698 (8.08%) ############################################################################################################################################################### 9. File locations and extra READMEs trimmed_fastq/ --> reads trimmed with sickle from every fastq raw file. sam_files/ --> output of bwa alignments. bam_files/ --> locations of all .bam conversions. (BWA output is sam, samtools converts it to .bam. sorted_bam_files/ --> locations of all sorted .bam files of above done with bamtoos. vcf_files/ --> location of individual vcf files and their merges. A file explanation README is available within that folder. CTGN_SNP_stuff/ --> folder that contains everything done for the past array addition. A file explanation README is available within that folder. kmer_testing/ --> kmer testing was done here. A file explanation README is available within that folder. probe_creation/ without_IUPACS_second_run/ --> files required for kmer testing. A file explanation README is available within that folder. with_IUPACS_second_run/ --> final printed probes. A file explanation README is available within that folder. final_PITA_probes.txt --> Final 194,397 probes. ############################################################################################################################################################### ############################################################################################################################################################### UPDATE 8/23/18: Formatting for thermo To allow for a greater selection of probes, the 194,397 probes from above were re-evaluated. First, the original set of 8,272,630 unfiltered probes with their IUPAC codes were evaluated. The probes were filtered using Aug18_probe_filter.pl so that they: 1. Had no flanking indels (but the probe could still be designed around an indel). 2. Had no SNPs on one arm of the probe. (The other arm could have any number of SNPs). SNPs that passed the filter: 485,016 (file: out_noSNP_oneside_probes.txt) The CTGN SNPs were removed leaving 485,016 probes. These probes were then cut before their IUPAC using process.pl. All probes then have one full 35-mer flanking sequence and the other sequence somewhere between 0 and 35 nt. All probes are also given a SNP identifier. Final formatted probes are in final_Pita_probes.txt. all_CTGN_probes.txt contains all 2,840 that had sequence trimmed to a 75-mer length with no flanking SNPs incorporated. The high quality SNPs removed above are in CTGN_probes_HQ.txt ############################################################################################################################################################### ###############################################################################################################################################################