README FOR LOBLOLLY GENOTYPING ASSAY ############################################################################################################################################################### Note: Pita = Pinus taeda = Loblolly pine What this README entails: 1. Background 1. UF Exome Capture 2. TAMU Exome Capture 3. NCSU GBS 4. VA Tech GBS 2. Read processing (quality control) 3. Alignment 4. SNP/Indel calling 5. Probe design 6. File locations 7. Highlights of statistics ############################################################################################################################################################### 1. BACKGROUND 1. UF EXOME CAPUTRE: SNP probes are designed from exome capture of 24 Loblolly (Pinus taeda, Pita) and 24 Slash (Pinus elliotii, Piel). Samples were taken from haploid tissue (megagametophyte) and probes were designed from the original transcripts. For SNPs, only samples corresponding to Pita are used. Library was prepared from 8 individuals and pooled after ligation of adapters containing barcodes. The pooled sample was used for capture hybridization and sequence. See publication for detailed protocol. Raw sequencing data for each lane was demultiplexed using tool fastx_barcode_splitter.pl from fastx_toolkit(version 11sep2008) accepting no mismatches. During the demultiplexing process, some mates of the same fragment are lost. The data provided has been synchronized using custom scripts to keep only fragments with correct paired-end structure. Sequencing was done on Illumina HiSeq 2000 paired-end (2x100bp). Coverage assumed to be 30x. 2. TAMU EXOME CAPTURE: SNP probes are designed exome capture from a population of 375 trees using NimbleGen probes and then sequenced using the Illumina HiSeq 2500 platform. 3. NCSU GBS: 1,562 samples of tree phloem were collected from the Plant Seed Source Selection Study (PSSSS) located in Screven County, Georgia. The PSSSS consists of 140 pollen mix families planted in 20 test locations representing much of the loblolly pine range. Height and diameter at breast height (DBH) measurements were taken at this site for ages 4 and 8 years. Tissue samples were collected using 19mm diameter leather hole punches. The phloem was peeled off the bark disk and placed in a pre-labeled 2.0mL tube with 1mL of storage solution containing 10mM EDTA (pH 8.0) and 10mM sodium bisulfite dissolved in water. The tissue samples were stored at room temperature for up to 4 months before DNA was extracted. One half of each phloem disk was used per individual extraction. The disks were smashed with the bottom of a test tube prior to grinding to increase yields. DNA extractions were performed using a modified protocol from the Canadian Center for DNA Barcoding (Ivanova, Dewaard, & Hebert, 2006) with the insect lysis buffer and the following parameters: Tissue samples were ground in 96-sample deep-well plates from Fisher Scientific using a ceramic bead in the bottom of each well, followed by the tissue sample, and topped with a ceramic cylinder. 600µl of warmed Insect Lysis buffer was used per sample. Samples were ground in a Mixer Mill at a speed of 30 Hz for 6 minutes on each side. A 1.5 hour of incubation at 60°C followed, with shaking at 30 Hz for 1 minute at every 30 minute interval of the 1.5 hour incubation. 250µl of crude lysate was taken from each sample and allowed to bind with the Plant Binding Buffer for 15 minutes before transferring to the glass filter plates. Both Pall and Nunc glass filter plates were used successfully. A 15 minute spin after the final wash step was done in place of the 1.5 hour drying in a pre-heated oven. Samples were eluted using a low salt solution and allowed to stand for 5 minutes before centrifugation. DNA was quantified using a fluorescent dye-binding assay called AccuBlue (Biotium). DNA yields were lower than expected so three different concentration methods were used. Samples were either dried down to concentration in a large scale plate speed vacuum, dried-down individually in a single tube speed vacuum, or were pooled at different volumes (differential pooling) during the first plate pooling process. The restriction digest, ligations, and sample pooling were performed according to the published protocol (Poland et al., 2012) with the following modifications: Approximately 700ng of DNA was digested with PstI and MspI restriction enzymes, and then ligation using T7 or T4 ligase was performed overnight (~18 hours). No 65°C kill step was performed. 96 of the designed 384 barcode adapters from the Poland et al., 2012 protocol were used. Ligated samples from each plate were pooled together using 5 µl of each ligation reaction (unless the sample started with less than 700ng; if so, the volume pooled was adjusted appropriately). After ligation, samples were size selected for 200-500base pair (BP) range using a PippinPrep instrument (Sage Science). Each pooled plate was separated into 6 PCR reactions. A total of 12 indexing primers were used. Three different primers per plate were used, allowing for a total of 4 plates to be pooled together. The PCR reactions were done using Q5 polymerase (New England Biolabs). After amplification, all plates were quantified using PicoGreen (Life Technologies) and the plates were normalized to the same concentration and pooled together, producing one library with up to 384 samples each (4 plates). A total of 4 pooled libraries were produced. Each of the libraries were sequenced for 100bp paired end reads over 4 flowcell lanes (~96 samples/lane) on an Illumina HiSeq2000 using V3 chemistry. A total of 16 flowcell lanes were sequenced by the Beijing Genomics Institute service center in Philadelphia. Here are the 16 lanes: Ar1_FCC21M0ACXX_s_5_fastq Ar1_FCC21WAACXX_s_7_fastq Ar1_FCD23ETACXX_s_5_fastq Ar1_FCD23ETACXX_s_6_fastq Ar2_FCC21M0ACXX_s_5_fastq Ar2_FCC21WAACXX_s_7_fastq Ar2_FCD23ETACXX_s_5_fastq Ar2_FCD23ETACXX_s_6_fastq Br1_FCC21M0ACXX_s_1_fastq Br1_FCC21WAACXX_s_8_fastq Br1_FCD23ETACXX_s_7_fastq Br1_FCD23ETACXX_s_8_fastq Br2_FCC21M0ACXX_s_1_fastq Br2_FCC21WAACXX_s_8_fastq Br2_FCD23ETACXX_s_7_fastq Br2_FCD23ETACXX_s_8_fastq Cr1_FCC21M0ACXX_s_2_fastq Cr1_FCD23J4ACXX_s_1_fastq Cr1_FCD23J4ACXX_s_2_fastq Cr1_FCD23J4ACXX_s_3_fastq Cr2_FCC21M0ACXX_s_2_fastq Cr2_FCD23J4ACXX_s_1_fastq Cr2_FCD23J4ACXX_s_2_fastq Cr2_FCD23J4ACXX_s_3_fastq Dr1_FCD23J4ACXX_s_4_fastq Dr1_FCD23J4ACXX_s_5_fastq Dr1_FCD23J4ACXX_s_6_fastq Dr2_FCD23J4ACXX_s_4_fastq Dr2_FCD23J4ACXX_s_5_fastq Dr2_FCD23J4ACXX_s_6_fastq Each starting letter is the pool and the 1 or 2 in Ar1 or Ar2, for example, specifies the forward and reverse. Data in each of the raw fastq.gz data files from the four lanes within a pool were split into the four plates present based on the index sequences of the adapters, using a Linux command line script. These are in 6nt sequences that identify the plates from the pools. The barcode is found in the read header and begins with a # and always ends with AT. Therefore, there are 8nt with the first 6 being the index (excluding errors). Index sequences: name seq read index1 ATCACG index2 CGATGT index3 TTAGGC index4 TGACCA index5 ACAGTG index6 GCCAAT index7 CAGATC index8 ACTTGA index9 GATCAG index10 TAGCTT index11 GGCTAC index12 CTTGTA The assignments of indexes to plates within pools A, B, C and D was as follows: Pool Plate # indexes A Plate 2 1,2,3 Plate 3 4,5,6 Plate 13 7,8,9 Plate 14 10,11,12 B Plate 4 1,2,3 Plate 5 4,5,6 Plate 6 7,8,9, Plate 7 10,11,12 C Plate 8 1,2,3, Plate 9 4,5,6 Plate 10 7,8,9, Plate 11 10,11,12 D Plate 15 1,2,3 Plate 16 4,5,6 Plate 17 7,8,9 Plate 1 none - so indexes CTTGTA, GGCTAC, and TAGCTT are not found in Pool D headers. Instead, a much wider variety of sequences that don't match any index (even at Hamming distance 1) are found in the header lines. Presumably those reads are from plate 1. Each lane must first have the plates separated as each lane has a mix of four plates. A grep command pulls out the reads (grep -EA3) by matching to the 6nt sequences in the header following the # symbol. The sequences from each lane are put into separate folders still separated by their lane pool and forward and reverse reads (8 files in total in each folder). For plate 1, a perl script was written to pull out non-matching indices but is presumed to have index sequences with errors that may belong to other plates. The 4 lanes of the forward were concatenated into a single file. The forward reads were run through FastX toolkit barcode splitter which can only read single ends. For the reverse, a script was created to pull the corresponding reverse reads for each sample from the reverse reads in lanes merged. In each plate, at the end, there should be 96 samples forward and 96 samples reverse. The number of reads in each sample should be similar but rarely perfectly so. Example of Fastx command: zcat ../reads/forward_reads.fastq.gz | ../../fastx_barcode_splitter.pl --bcfile ../../barcodes.txt --prefix 1_ -bol Separating the plates by barcode:The provided barcodes: A01 TGACGCCATG A02 CAGATATGCA A03 GAAGTGTGCA A04 TAGCGGATTG A05 TATTCGCATT A06 ATAGATTGCA A07 CCGAACATGC A08 GGAAGACATT A09 GGCTTATGCA A10 AACGCACATT A11 GAGCGACATT A12 CCTTGCCATT B01 GGTATATGCA B02 TCTTGGTGCA B03 GGTGTTGCAG B04 GGATATGCAG B05 CTAAGCATGC B06 ATTATTGCAG B07 GCGCTCATGC B08 ACTGCGATTG B09 TTCGTTTGCA B10 ATATAATGCA B11 TGGCAACAGA B12 CTCGTCGTGC C01 GCCTACCTTG C02 CACCATGCAG C03 AATTAGTGCA C04 GGAACGATGC C05 ACAACTTGCA C06 ACTGCTTGCA C07 CGTGGACAGT C08 TGGCACAGAT C09 TGCTTTGCAG C10 GCAAGCCATT C11 CGCACCAATT C12 CTCGCGGTGC D01 AACTGGTGCA D02 ATGAGCAATG D03 CTTGATGCAG D04 GCGTCCTTGC D05 ACCAGGATGC D06 CCACTCATGC D07 TCACGGAAGT D08 TATCATGCAG D09 TAGCCAATGC D10 ATATCGCCAT D11 CTCTATGCAG D12 GGTGCACATT E01 CTCTCGCATT E02 CAGAGGTTGC E03 GCGTACAATT E04 ACGCGCGTGC E05 GTCGCCTTGC E06 AATAACCAAT E07 AATGAACGAT E08 CGTCGCCACT E09 ATGGCAATGC E10 GAAGCATGCA E11 AACGTGCCTT E12 CCTCGTGCAG F01 CTCATTGCAG F02 ACGGTACTTG F03 GCGCCGTGCA F04 CAAGTTGCAG F05 TCCGAGTGCA F06 TAGATGATGC F07 TGGCCAGTGC F08 GCACGATTGC F09 TTGCTGTGCA F10 CGCAACCAGT F11 TCACTGTGCA F12 ACAGTTGCAG G01 GGAGTCAAGT G02 TGAATTGCAG G03 CATATTGCAG G04 GTGACACATT G05 TATGTTGCAG G06 CAGTGCCATT G07 ACAACCAACT G08 TGCAGATGCA G09 CATCTGCCGT G10 GGACAGTGCA G11 ATCTGTTGCA G12 AAGACGCTTG H01 GAATGCAATA H02 TAGCAGTGCA H03 ATCCGTGCAG H04 CTTAGTGCAG H05 TTATTACATT H06 GCCAACAAGA H07 TGCCGCATTG H08 CGTGTCATGC H09 CAACCACACA H10 GCTCCGATGC H11 TCAGAGATTG H12 CGTTCATGCA These barcodes are variable in length but are filled out to 10bp from the remnant of the PstI restriction site (TGCAG) for fastx barcode splitter (fastx tools). Then, as mentioned before, a personal script removes the reverse matching sample. 4. VA TECH DATA: Files were delivered already demultiplexed with stacks process_radtags. Libraries were constructed from double digests (PstI/MspI) of gDNA with a target size range of 300-500bp (verified by bioanalyzer). Target depth is fairly low - 96 samples per lane in 2x100bp format on a HiSeq 2500. There is some overlap of families but different individuals are planted at the two sites (when compared to NCSU). Radtags were processed with stacks. Barcodes for samples were provided in a file. Adapters are from command: /work/newriver/jah1/software/stacks-1.44/process_radtags -1 forward_reads \ -2 reverse_reads -b barcodes.txt -P -c -q -r -o out_dir \ --inline_null --renz_1 pstI -i gzfastq \ --adapter_1 AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT \ --adapter_2 CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT Samples were decompressed, but 5 were corrupted and not used for SNP steps: A17-46.fq.gz2 B10-01.fq.gz2 B13-50.fq.gz2 B4-28.fq.gz2 B5-56.fq.gz2 ############################################################################################################################################################### 2. READ PROCESSING AND QUALITY CONTROL All reads are trimmed with sickle pe (version 1.33). Quality and length thresholds were set to 30 and 50 respectively. Sample command: sickle pe \ -f ../samples/1_A01.fastq \ -r ../samples/2_A01.fastq \ -t illumina \ # Or sanger depending on read type -o trimmed_1_A01.fastq \ -p trimmed_2_A01.fastq \ -s trimmed_single_A01.fastq \ -q 30 \ -l 50 ############################################################################################################################################################### 3. ALIGNMENT Alignment is always done using BWA (0.7.15) and all reads are aligned against the 2.01 3k version of the Pita Genome. Sample command: bwa mem \ -t 32 \ genome/pita.v2.0.1.masked.3k \ /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/PRJNA320114_TAMU/trimmed_fastq/trimmed_SRR3540310_1.fastq.gz \ /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/PRJNA320114_TAMU/trimmed_fastq/trimmed_SRR3540310_1.fastq.gz \ > sam/SRR3540310.sam Once alignment sam files are produced, they are converted to bam and then sorted using samtools. Sample commands: /isg/shared/apps/samtools/1.3.1/bin/samtools view -@ 32 -b -o bam/SRR3540310.bam sam/SRR3540310.sam /isg/shared/apps/samtools/1.3.1/bin/samtools sort -@ 32 -o sorted_bam/SRR3540310.bam bam/SRR3540310.bam ############################################################################################################################################################### 4. SNP/INDEL CALLING SNP and indel calling is always done with Freebayes (version 1.0.2). Two types of vcf files are produced, one run with standard, generic setting, and another with more stringent settings. This is to have high quality SNPs to build probes around while considering other SNPs in the surrounding sequence. Sample commands: freebayes -f genome/pita.v2.0.1.masked.3k.fasta -u -m 0 -q 15 -F .2 --min-coverage 8 -C 2 sorted_bam/SRR3540314.bam > vcf/standard_SRR3540314.vcf freebayes -f genome/pita.v2.0.1.masked.3k.fasta -u -m 20 -q 25 -F .2 --min-coverage 10 -C 6 sorted_bam/SRR3540314.bam > vcf/strict_SRR3540314.vcf VCF files are merged with bcftools by their respective set and then merged into a super set of all strict and all standard variants. These files are then annotated with SNPeff (4.3i) and the new 2.01 Pita annotation. Number of SNPs and Indels after project merges: TAMU Strict: 7,702,804 Standard: 25,216,501 UF Strict: 1,516,877 Standard: 8,338,718 VA TECH Strict: 261,768 Standard: 828,122 NCSU Strict: 1,105,218 Standard: 2,056,210 Number of SNPs and Indels after final merge: Strict: 10,079,809 Strict bi-allelic: 8,272,630 Standard: 35,196,703 ############################################################################################################################################################### 5. PROBE DESIGN Done with 70_mer_probes.pl, overlap_filter.pl, MQ_probe_filter.pl, and preference.pl. Probe designing is done in two steps. The first is to take a variant and evaluate its distances on the left and right to the nearest SNP/Indel (or beginning or end of the scaffold). IUPACs and codes (I for an indel) are given and stored at this stage. A capitol letter IUPAC is given when the call is of the strict quality from Freebayes, and lower case for the standard. All multi-allelic SNPs/Indels are considered standard to prevent probe creation around them. The second step prints every probe for every strict quality bi-allelic SNP or indel. The flanking sequence will have any number of IUPAC SNP or indel bases. These probes are then filtered so that only probes without SNPs or Indels in the first 20 bases on either side of the SNP remain. The other 15 outer sequence may have SNPs, but no indels. Duplicates are also removed. Prior to printing, there were 8,272,630 probes. When the filter above is applied, 249,603 remain. Since only 200,000 probes may be used in this assay, here is the order of preference: 1. All annotated (non intergenic, as defined by SNPeff as existing < 5000bp away from a gene) SNPs. 2. All annotated indel probes. 3. All SNP probes that have NO flanking SNPs. 4. Rest of the space filled with any SNP probe. ############################################################################################################################################################### 6. FILE LOCATIONS bam_files/ sorted_bam_files/ sam_files/ trimmed_fastq/ --> These are all just tar files of each of the four samples sources. vcf_files/standard/ --> vcf files of each sample sources. vcf_files/standard/Pita_merge_standard.vcf.gz --> the merged vcf without SNPeff annotation. vcf_files/standard/Pita_merge_standard_snpeff.vcf.gz --> the merged vcf with SNPeff annotation. (This is the same for strict) probes/70_mer_probes.pl --> script made to create the probes. probes/35bp_probe_stats.pl --> creates statistics on probe SNPeff annotations. probes/Pita_final_all_potential_probes_Wegrzyn.txt.gz --> the 8,272,630 possible, unfiltered probes. probes/Pita_final_200K_probes_Wegrzyn.txt.gz --> the final 200,000 probes. probes/final_200k_probes_stats.txt --> statistics on the final 200,000 probes. ############################################################################################################################################################### 7. HIGHLIGHTS OF STATISTICS Total number of probes: 200000 Number of probes annotated near/in genes 48054 (24.03%) intergenic_region 151946 (75.97%) intron_variant 18574 (9.29%) synonymous_variant 5873 (2.94%) splice_region_variant&synonymous_variant 60 (0.12%) missense_variant 6433 (3.22%) missense_variant&splice_region_variant 73 (0.15%) stop_lost 3 (0.00%) stop_gained 149 (0.31%) start_lost 9 (0.00%) frameshift_variant 133 (0.28%) For more statistics, see probes/final_200k_probes_stats.txt.