README FOR LOBLOLLY GENOTYPING ASSAY
Contact: Madison.caballero@uconn.edu
###############################################################################################################################################################

Note: Pita = Pinus taeda = Loblolly pine

What this README entails:
 1. Background
        1. UF Exome Capture
        2. TAMU Exome Capture
        3. NCSU GBS
        4. VA Tech GBS
 2. Read processing (quality control)
 3. Alignment
 4. SNP/Indel calling
 5. CTGN SNP addition
 6. Kmer testing and probe design
 7. Highlights of statistics
 8. File locations and extra READMEs

 9. Update 8/23/18 : section on formatting for Thermo

###############################################################################################################################################################

1. BACKGROUND - Where the data is coming from. 

 	 1. UF EXOME CAPUTRE: SNP probes are designed from exome capture of 24 Loblolly (Pinus taeda, Pita) and 24 Slash (Pinus elliotii, Piel).
                Samples were taken from haploid tissue (megagametophyte) and probes were designed from the original transcripts. For SNPs, only samples
                corresponding to Pita are used.

                Library was prepared from 8 individuals and pooled after ligation of adapters containing barcodes. The pooled sample was used for capture
                hybridization and sequence. See publication for detailed protocol. Raw sequencing data for each lane was demultiplexed using tool
                fastx_barcode_splitter.pl from fastx_toolkit(version 11sep2008) accepting no mismatches. During the demultiplexing process, some mates of
                the same fragment are lost. The data provided has been synchronized using custom scripts to keep only fragments with correct paired-end structure.

                Sequencing was done on Illumina HiSeq 2000 paired-end (2x100bp). Coverage assumed to be 30x.

        2. TAMU EXOME CAPTURE: SNP probes are designed exome capture from a population of 375 trees using NimbleGen probes and then sequenced using
                the Illumina HiSeq 2500 platform. Coverage also assumed to be 30x. 

 	3. NCSU GBS: 1,562 samples of tree phloem were collected from the Plant Seed Source Selection Study (PSSSS) located in Screven County, Georgia. The  PSSSS
                consists of 140 pollen mix families planted in 20 test locations representing much of the loblolly pine range. Height and diameter at breast height
                (DBH) measurements were taken at this site for ages 4 and 8 years.

                Tissue samples were collected using 19mm diameter leather hole punches. The phloem was peeled off the bark disk and placed in a pre-labeled 2.0mL
                tube with 1mL of storage solution containing 10mM EDTA (pH 8.0) and 10mM sodium bisulfite dissolved in water. The tissue samples were stored at
                room temperature for up to 4 months before DNA was extracted.

                One half of each phloem disk was used per individual extraction. The disks were smashed with the bottom of a test tube prior to grinding to
                increase yields. DNA extractions were performed using a modified protocol from the Canadian Center for DNA Barcoding (Ivanova, Dewaard, & Hebert,
                2006) with the insect lysis buffer and the following parameters:

                                Tissue samples were ground in 96-sample deep-well plates from Fisher Scientific using a ceramic bead in the bottom of each well,
                                followed by the tissue sample, and topped with a ceramic cylinder. 600µl of warmed Insect Lysis buffer was used per sample.
                                Samples were ground in a Mixer Mill at a speed of 30 Hz for 6 minutes on each side. A 1.5 hour of incubation at 60°C followed,
                                with shaking at 30 Hz for 1 minute at every 30 minute interval of the 1.5 hour incubation. 250µl of crude lysate was taken from
                                each sample and allowed to bind with the Plant Binding Buffer for 15 minutes before transferring to the glass filter plates.
                                Both Pall and Nunc glass filter plates were used successfully. A 15 minute spin after the final wash step was done in place of the
                                1.5 hour drying in a pre-heated oven. Samples were eluted using a low salt solution and allowed to stand for 5 minutes before
                                centrifugation.


		 DNA was quantified using a fluorescent dye-binding assay called AccuBlue (Biotium). DNA yields were lower than expected so three different
                concentration methods were used. Samples were either dried down to concentration in a large scale plate speed vacuum, dried-down individually in a
                single tube speed vacuum, or were pooled at different volumes (differential pooling) during the first plate pooling process. The restriction digest,
                ligations, and sample pooling were performed according to the published protocol (Poland et al., 2012) with the following modifications:

                                Approximately 700ng of DNA was digested with PstI and MspI restriction enzymes, and then ligation using T7 or T4 ligase was performed
                                overnight (~18 hours). No 65°C kill step was performed. 96 of the designed 384 barcode adapters from the Poland et al., 2012
                                protocol were used. Ligated samples from each plate were pooled together using 5 µl of each ligation reaction (unless the sample
                                started with less than 700ng; if so, the volume pooled was adjusted appropriately). After ligation, samples were size selected for
                                200-500base pair (BP) range using a PippinPrep instrument (Sage Science). Each pooled plate was separated into 6 PCR reactions.
                                A total of 12 indexing primers were used. Three different primers per plate were used, allowing for a total of 4 plates to be
                                pooled together. The PCR reactions were done using Q5 polymerase (New England Biolabs). After amplification, all plates were
                                quantified using PicoGreen (Life Technologies) and the plates were normalized to the same concentration and pooled together,
                                producing one library with up to 384 samples each (4 plates). A total of 4 pooled libraries were produced.

                Each of the libraries were sequenced for 100bp paired end reads over 4 flowcell lanes (~96 samples/lane) on an Illumina HiSeq2000 using V3
                chemistry. A total of 16 flowcell lanes were sequenced by the Beijing Genomics Institute service center in Philadelphia. Here are the 16 lanes:

   		Ar1_FCC21M0ACXX_s_5_fastq
                Ar1_FCC21WAACXX_s_7_fastq
                Ar1_FCD23ETACXX_s_5_fastq
                Ar1_FCD23ETACXX_s_6_fastq
                Ar2_FCC21M0ACXX_s_5_fastq
                Ar2_FCC21WAACXX_s_7_fastq
                Ar2_FCD23ETACXX_s_5_fastq
                Ar2_FCD23ETACXX_s_6_fastq
                Br1_FCC21M0ACXX_s_1_fastq
                Br1_FCC21WAACXX_s_8_fastq
                Br1_FCD23ETACXX_s_7_fastq
                Br1_FCD23ETACXX_s_8_fastq
                Br2_FCC21M0ACXX_s_1_fastq
                Br2_FCC21WAACXX_s_8_fastq
                Br2_FCD23ETACXX_s_7_fastq
                Br2_FCD23ETACXX_s_8_fastq
                Cr1_FCC21M0ACXX_s_2_fastq
                Cr1_FCD23J4ACXX_s_1_fastq
                Cr1_FCD23J4ACXX_s_2_fastq
                Cr1_FCD23J4ACXX_s_3_fastq
                Cr2_FCC21M0ACXX_s_2_fastq
                Cr2_FCD23J4ACXX_s_1_fastq
                Cr2_FCD23J4ACXX_s_2_fastq
                Cr2_FCD23J4ACXX_s_3_fastq
                Dr1_FCD23J4ACXX_s_4_fastq
                Dr1_FCD23J4ACXX_s_5_fastq
                Dr1_FCD23J4ACXX_s_6_fastq
                Dr2_FCD23J4ACXX_s_4_fastq
                Dr2_FCD23J4ACXX_s_5_fastq
                Dr2_FCD23J4ACXX_s_6_fastq

	Each starting letter is the pool and the 1 or 2 in Ar1 or Ar2, for example, specifies the forward and reverse.

                Data in each of the raw fastq.gz data files from the four lanes within a pool were split into the four plates present based on the index
                sequences of the adapters, using a Linux command line script.

                These are in 6nt sequences that identify the plates from the pools. The barcode is found in the read header and begins with a # and always
                ends with AT. Therefore, there are 8nt with the first 6 being the index (excluding errors).

                Index sequences:

                name    seq read
                index1  ATCACG
                index2  CGATGT
                index3  TTAGGC
                index4  TGACCA
                index5  ACAGTG
                index6  GCCAAT
                index7  CAGATC
                index8  ACTTGA
                index9  GATCAG
                index10 TAGCTT
                index11 GGCTAC
                index12 CTTGTA

	The assignments of indexes to plates within pools A, B, C and D was as follows:

	Pool    Plate #    indexes

	A     Plate 2    1,2,3
                Plate 3    4,5,6
                Plate 13   7,8,9
                Plate 14   10,11,12

        B     Plate 4    1,2,3
                Plate 5    4,5,6
                Plate 6    7,8,9,
                Plate 7    10,11,12

        C     Plate 8    1,2,3,
                Plate 9    4,5,6
                Plate 10   7,8,9,
                Plate 11   10,11,12

        D     Plate 15    1,2,3
                Plate 16    4,5,6
                Plate 17    7,8,9
                Plate 1    none - so indexes CTTGTA, GGCTAC, and TAGCTT are not found in Pool D headers. Instead, a much wider variety of sequences that don't
                match any index (even at Hamming distance 1) are found in the header lines. Presumably those reads are from plate 1.

                Each lane must first have the plates separated as each lane has a mix of four plates. A grep command pulls out the reads (grep -EA3) by matching
                to the 6nt sequences in the header following the # symbol. The sequences from each lane are put into separate folders still separated by their
                lane pool and forward and reverse reads (8 files in total in each folder). For plate 1, a perl script was written to pull out non-matching
                indices but is presumed to have index sequences with errors that may belong to other plates.

                The 4 lanes of the forward were concatenated into a single file. The forward reads were run through FastX toolkit barcode splitter which can
                only read single ends. For the reverse, a script was created to pull the corresponding reverse reads for each sample from the reverse reads
                in lanes merged. In each plate, at the end, there should be 96 samples forward and 96 samples reverse. The number of reads in each sample should
                be similar but rarely perfectly so.

                Example of Fastx command:
                zcat ../reads/forward_reads.fastq.gz | ../../fastx_barcode_splitter.pl --bcfile ../../barcodes.txt --prefix 1_ -bol

	Separating the plates by barcode:The provided barcodes:

        A01     TGACGCCATG
        A02     CAGATATGCA
        A03     GAAGTGTGCA
        A04     TAGCGGATTG
        A05     TATTCGCATT
        A06     ATAGATTGCA
        A07     CCGAACATGC
        A08     GGAAGACATT
        A09     GGCTTATGCA
        A10     AACGCACATT
        A11     GAGCGACATT
        A12     CCTTGCCATT
        B01     GGTATATGCA
        B02     TCTTGGTGCA
        B03     GGTGTTGCAG
        B04     GGATATGCAG
        B05     CTAAGCATGC
        B06     ATTATTGCAG
        B07     GCGCTCATGC
        B08     ACTGCGATTG
        B09     TTCGTTTGCA
        B10     ATATAATGCA
        B11     TGGCAACAGA
        B12     CTCGTCGTGC
        C01     GCCTACCTTG
        C02     CACCATGCAG
        C03     AATTAGTGCA
        C04     GGAACGATGC
        C05     ACAACTTGCA
        C06     ACTGCTTGCA
        C07     CGTGGACAGT
        C08     TGGCACAGAT
        C09     TGCTTTGCAG
        C10     GCAAGCCATT
        C11     CGCACCAATT
        C12     CTCGCGGTGC
        D01     AACTGGTGCA
        D02     ATGAGCAATG
 	D03     CTTGATGCAG
        D04     GCGTCCTTGC
        D05     ACCAGGATGC
        D06     CCACTCATGC
        D07     TCACGGAAGT
        D08     TATCATGCAG
        D09     TAGCCAATGC
        D10     ATATCGCCAT
        D11     CTCTATGCAG
        D12     GGTGCACATT
        E01     CTCTCGCATT
        E02     CAGAGGTTGC
        E03     GCGTACAATT
        E04     ACGCGCGTGC
        E05     GTCGCCTTGC
        E06     AATAACCAAT
        E07     AATGAACGAT
        E08     CGTCGCCACT
        E09     ATGGCAATGC
        E10     GAAGCATGCA
        E11     AACGTGCCTT
        E12     CCTCGTGCAG
        F01     CTCATTGCAG
        F02     ACGGTACTTG
        F03     GCGCCGTGCA
        F04     CAAGTTGCAG
        F05     TCCGAGTGCA
        F06     TAGATGATGC
        F07     TGGCCAGTGC
        F08     GCACGATTGC
        F09     TTGCTGTGCA
        F10     CGCAACCAGT
        F11     TCACTGTGCA
        F12     ACAGTTGCAG
        G01     GGAGTCAAGT
        G02     TGAATTGCAG
        G03     CATATTGCAG
        G04     GTGACACATT
        G05     TATGTTGCAG
        G06     CAGTGCCATT
	G07     ACAACCAACT
        G08     TGCAGATGCA
        G09     CATCTGCCGT
        G10     GGACAGTGCA
        G11     ATCTGTTGCA
        G12     AAGACGCTTG
        H01     GAATGCAATA
        H02     TAGCAGTGCA
        H03     ATCCGTGCAG
        H04     CTTAGTGCAG
        H05     TTATTACATT
        H06     GCCAACAAGA
        H07     TGCCGCATTG
        H08     CGTGTCATGC
        H09     CAACCACACA
        H10     GCTCCGATGC
        H11     TCAGAGATTG
        H12     CGTTCATGCA

                These barcodes are variable in length but are filled out to 10bp from the remnant of the PstI restriction site (TGCAG) for fastx barcode
                splitter (fastx tools). Then, as mentioned before, a personal script removes the reverse matching sample.


	4. VA TECH DATA: Files were delivered already demultiplexed with stacks process_radtags.

        	Libraries were constructed from double digests (PstI/MspI) of gDNA with a target size range of 300-500bp (verified by bioanalyzer).
                Target depth is fairly low - 96 samples per lane in 2x100bp format on a HiSeq 2500. There is some overlap of families but different
                individuals are planted at the two sites (when compared to NCSU).

                        Radtags were processed with stacks. Barcodes for samples were provided in a file. Adapters are from command:

                        /work/newriver/jah1/software/stacks-1.44/process_radtags -1 forward_reads  \
                                -2 reverse_reads  -b barcodes.txt -P -c -q -r -o out_dir \
                                --inline_null --renz_1 pstI -i gzfastq \
                                --adapter_1 AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT \
                                --adapter_2 CAAGCAGAAGACGGCATACGAGATCGGTCTCGGCATTCCTGCTGAACCGCTCTTCCGATCT

                                Samples were decompressed, but 5 were corrupted and not used for SNP steps:
                                        A17-46.fq.gz2
                                        B10-01.fq.gz2
                                        B13-50.fq.gz2
                                        B4-28.fq.gz2
                                        B5-56.fq.gz2


###############################################################################################################################################################
2. READ PROCESSING AND QUALITY CONTROL

        All reads are trimmed with sickle pe (version 1.33). Quality and length thresholds were set to 30 and 50 respectively.

        Sample command:
        sickle pe \
        -f  ../samples/1_A01.fastq \
        -r ../samples/2_A01.fastq \
        -t illumina \                                                   # Or sanger depending on read type
        -o trimmed_1_A01.fastq \
        -p trimmed_2_A01.fastq \
        -s trimmed_single_A01.fastq \
        -q 30 \
        -l 50

###############################################################################################################################################################
3. ALIGNMENT

        Alignment is always done using BWA (0.7.15) and all reads are aligned against the 2.01 3k version of the Pita Genome.

        Sample command:
                bwa mem \
                -t 32 \
                genome/pita.v2.0.1.masked.3k \
                /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/PRJNA320114_TAMU/trimmed_fastq/trimmed_SRR3540310_1.fastq.gz \
                /UCHC/LABS/Wegrzyn/ConiferGenomes/Pita/PRJNA320114_TAMU/trimmed_fastq/trimmed_SRR3540310_1.fastq.gz \
                > sam/SRR3540310.sam

        Once alignment sam files are produced, they are converted to bam and then sorted using samtools.

        Sample commands:
                /isg/shared/apps/samtools/1.3.1/bin/samtools view -@ 32 -b -o bam/SRR3540310.bam sam/SRR3540310.sam
                /isg/shared/apps/samtools/1.3.1/bin/samtools sort -@ 32 -o sorted_bam/SRR3540310.bam bam/SRR3540310.bam


###############################################################################################################################################################
4. SNP/INDEL CALLING

        SNP and indel calling is always done with Freebayes (version 1.0.2). Two types of vcf files are produced, one run with standard, generic setting,
        and another with more stringent settings. This is to have high quality SNPs to build probes around while considering other SNPs in the surrounding
        sequence.

        Sample commands:
        freebayes -f genome/pita.v2.0.1.masked.3k.fasta  -u -m 0 -q 15 -F .2 --min-coverage 8 -C 2 sorted_bam/SRR3540314.bam > vcf/standard_SRR3540314.vcf
        freebayes -f genome/pita.v2.0.1.masked.3k.fasta  -u -m 20 -q 25 -F .2 --min-coverage 10 -C 6 sorted_bam/SRR3540314.bam > vcf/strict_SRR3540314.vcf

        VCF files are merged with bcftools by their respective set and then merged into a super set of all strict and all standard variants. These files are
        then annotated with SNPeff (4.3i) and the new 2.01 Pita annotation.

        Number of SNPs and Indels after project merges:
                TAMU
                        Strict: 7,702,804
                        Standard: 25,216,501
                UF
                        Strict: 1,516,877
                        Standard: 8,338,718
                VA TECH
                        Strict: 261,768
                        Standard: 828,122
                NCSU
                        Strict: 1,105,218
                        Standard: 2,056,210

        Number of SNPs and Indels after final merge:
                        Strict: 10,079,809
                                Strict bi-allelic: 8,272,630  		# This number is produced by the 70_mer_probes.pl script. 3+ allele variants are ignored.
                        Standard: 35,196,703

	All vcf files are available in a .tar file. This includes all individual files of both qualities and full merges of both qualities. 

###############################################################################################################################################################
5. CTGN SNP addition

A set of 4,862 probes that had previous genotypes should be included in the final set of probes. However, these probe sequences were developed off the original 
versionof the P. taeda genome and must be reestablished. The overall project of CTGN SNP inclusion follows these steps:
        1. Convert sequences to a format that can be aligned to the P. taeda genome.
        2. Identify the matches by their scaffold and position.
        3. Merge these sequences into the set of SNPs and Indels from the main project.
        4. Redesign probes around the SNPs again.

        PROCESS:

        1. Convert sequences to a format that can be aligned to the P. taeda genome.

                The probes came in a csv format that included variable probe sequences, their name, and associated genotypes. There were 4,862 sequences. This file
                was converted to a fasta format with the header including relevant information and the probe as a sequence with a IUPAC code for the SNP. This was
                done with the script csv_to_fasta.pl.

                These fasta files are aligned using Bowtie2 and the sam is converted to a bed file. 1,294 reads mapped (26.61%).

        2. Identify the matches by their scaffold and position.

                This had to be dealt with carefully. Since bed files give a range of the sequence map, the true location of the SNP can only be accurately determined
                if the entire sequence is a match without insertions or deletions. The SNP location can then be given a scaffold and position. These sequences are
                turned into a pseudo vcf file by vcf_from_bed.pl. This is not a real vcf file and cannot merge because the numbers are innacurate. I cannot tell what
                is the reference base or alternate base and the support, read depth, and other information cannot be filled out.

        3. Merge these sequences into the set of SNPs and Indels from the main project.

                This pseudo vcf of 1,181 sequences is concatenated into the full vcf from the main project. A true bcftools merge could not occur due to that
                formatting issue. Overlap and reference/alternate base issues will be resolved by my probe creation script.

        4. Redesign probes around the SNPs again.

                The psuedo vcfs: pseudo_vcf_standard.vcf and pseudo_vcf_strict.vcf are going to be used in kmer testing for final probe creation. 

###############################################################################################################################################################
6. KMER TESTING and PROBE DESIGN

        To understand probe mapping to the genome, the kmers that make up the probes are to be run across the genome for mapping frequency. First, probes need to be
        created using the pseudo vcfs from before, however the probes must be printed WITHOUT IUPAC codes. This is because my simplistic matching will not work if
        non-nucleotide codes are present in the sequence.

        A modified version of my probe creation script called 70_mer_probes_just_print.pl is used to create probes as they match the genome sequence. The SNP is also
        just the reference genome base. The 8,273,868 probes can then be split into 4 18-nt kmers like this:

                GTTTTTCTGCAGGTTCCAGGAGAGTCGACTTAGTT     G       CTTCTGGGTCTTTCTACGACCTGTTAATTTGGGGT

                                                        to
                GTTTTTCTGCAGGTTCCA    GGAGAGTCGACTTAGTG     GCTTCTGGGTCTTTCTAC   GACCTGTTAATTTGGGGT

        Notice that the SNP was used twice just because the full sequence is 71 nt long. All kmers produced are thrown into a hash, and duplicates are already lost.
        There were 24,875,817 unique kmers in the probe sets, but 33,087,264 net total. Since keys in a hash must be unique, the duplicates are replaced. This suggests
        there already is a large amount of overlap within the sequences.

        The entire P. taeda genome is then run with an 18nt sliding window. Mathes to the hatch are tallied so that each kmer has a corresponding hit-rate to look
        like this:

                Kmers hash (all same case):
                GTTTGGGGTTTTCTGCAG                      12
                GTTCCAGGAGAGTCGACA                      108
                ATGTCATTTGCTGGATCT                      1
                ATGTCATTTGCTGGATCT                      4

        The kmers with the highest mapping rate were, predictably, simple repeat sequences like tatta or gtgt. The average kmer had 14.6 hits, but the median was 1
        hit.

        The probes are then reevaluated with this information. Since each probe has 4 kmers, the four scores of each kmers in how it maps to the genome is added up.
        This is not the most accurate or telling way, but it gives a probe a good sense of its contents. The lowest score for a probe would be 4 if each kmer only
        has a hit-rate of 1.

        The record holder for highest scoring probe:    4,667,554
        Average:                                        285.91385289969
        Median:                                         8

        So, in order to set a cutoff that kept a lot of viable probes, the average was used as a cut off for SNPs and probes that mapped too much for comfort (286).
        For probe redesign (with IUPAC codes), SNPs with a mapping score of more than 286 were barred as HIGHMAP. This flag was put onto the scored_SNPs.txt file
        and act like the STANDARD FLAG to tell the script to not build a probe around it, but to recognize its presence with an upper-case IUPAC. This is done
        with the scripts inform_SNPs_with_kmer_table.pl and 70_mer_probes_from_adjusted_scores.pl using the file scored_SNPs_mapping_adjusted.txt.

	Probe designing is done in two steps. The first is to take a variant and evaluate its distances on the left and right to the nearest SNP/Indel (or beginning
        or end of the scaffold). IUPACs and codes (I for an indel) are given and stored at this stage. A capitol letter IUPAC is given when the call is of
        the strict quality from Freebayes, and lower case for the standard. All multi-allelic SNPs/Indels are considered standard to prevent probe creation around
        them.

        The second step prints every probe for every strict quality bi-allelic SNP or indel. The flanking sequence will have any number of IUPAC SNP or
        indel bases. These probes are then filtered so that only probes without SNPs or Indels in the first 20 bases on either side of the SNP remain.
        The other 15 outer sequence may have SNPs, but no indels. Duplicates are also removed. 

	The final result is 194,397 probes with 22% in or near genes.

###############################################################################################################################################################       
8. STATISTICS

The final result is 194,397 probes with 22% in or near genes. Final stats highlights:

Total number of probes: 194397
Number of probes annotated near/in genes        42749  (21.99%)
intergenic_region       151648  (78.01%)
synonymous_variant      5787  (2.98%)
missense_variant        6297  (3.24%)
downstream_gene_variant 7025  (3.61%)
upstream_gene_variant   6363  (3.27%)
intron_variant  15698  (8.08%)


###############################################################################################################################################################
9. File locations and extra READMEs

trimmed_fastq/ --> reads trimmed with sickle from every fastq raw file. 
sam_files/ --> output of bwa alignments.
bam_files/ --> locations of all .bam conversions. (BWA output is sam, samtools converts it to .bam. 
sorted_bam_files/ --> locations of all sorted .bam files of above done with bamtoos. 
vcf_files/ --> location of individual vcf files and their merges. A file explanation README is available within that folder. 
CTGN_SNP_stuff/ --> folder that contains everything done for the past array addition. A file explanation README is available within that folder.
kmer_testing/ --> kmer testing was done here. A file explanation README is available within that folder.
probe_creation/
	without_IUPACS_second_run/ --> files required for kmer testing. A file explanation README is available within that folder.
	with_IUPACS_second_run/ --> final printed probes. A file explanation README is available within that folder.
		final_PITA_probes.txt --> Final 194,397 probes. 

###############################################################################################################################################################
###############################################################################################################################################################

UPDATE 8/23/18: Formatting for thermo

To allow for a greater selection of probes, the 194,397 probes from above were re-evaluated. 

First, the original set of 8,272,630 unfiltered probes with their IUPAC codes were evaluated. The probes were filtered using Aug18_probe_filter.pl 
so that they:
   1. Had no flanking indels (but the probe could still be designed around an indel).
   2. Had no SNPs on one arm of the probe. (The other arm could have any number of SNPs).

SNPs that passed the filter: 485,016 (file: out_noSNP_oneside_probes.txt)

The CTGN SNPs were removed leaving 485,016 probes. These probes were then cut before their IUPAC using process.pl. All probes then have one full 35-mer flanking
sequence and the other sequence somewhere between 0 and 35 nt. All probes are also given a SNP identifier. Final formatted probes are in final_Pita_probes.txt. 

all_CTGN_probes.txt contains all 2,840 that had sequence trimmed to a 75-mer length with no flanking SNPs incorporated. The high quality SNPs removed above 
are in CTGN_probes_HQ.txt

###############################################################################################################################################################
###############################################################################################################################################################