ThermoPhyl: a software tool for selecting phylogenetically optimized conventional and quantitative-PCR taxon-targeted assays for use with complex samples

Brian B. Oakley , Scot E. Dowd , Kevin J. Purdy
DOI: http://dx.doi.org/10.1111/j.1574-6941.2011.01079.x 17-27 First published online: 1 July 2011

Abstract

The ability to specifically and sensitively target genotypes of interest is critical for the success of many PCR-based analyses of environmental or clinical samples that contain multiple templates. Next-generation sequence data clearly show that such samples can harbour hundreds to thousands of operational taxonomic units, a richness that precludes the manual evaluation of candidate assay specificity and sensitivity using multiple sequence alignments. To solve this problem, we have developed and validated a free software tool that automates the identification of PCR assays targeting specific genotypes in complex samples. ThermoPhyl uses user-defined target and nontarget sequence databases to assess the phylogenetic sensitivity and specificity of thermodynamically optimized candidate assays derived from primer design software packages. ThermoPhyl derives its name from its central premise of testing Thermodynamically optimal assays for Phylogenetic specificity and sensitivity and can be used for two-primer (traditional PCR) or two-primer plus internal probe (e.g. TaqMan® qPCR) applications, and potentially for oligonucleotide probes. Here, we describe the use of ThermoPhyl for traditional PCR and qPCR assays. PCR assays selected using ThermoPhyl were validated using 454 pyrosequencing of a traditional specific PCR assay and with a set of four genotype-specific qPCR assays applied to estuarine sediment samples.

Keywords
  • PCR primer design
  • specificity
  • sensitivity
  • phylogenetic assays
  • qPCR
  • PCR

Introduction

A basic task for many environmental and clinical researchers is to target a specific genotype in a complex sample to understand, for example, the abundance of a particular taxon or the expression of a particular gene. For conventional PCR, users have traditionally designed PCR assays manually, starting with a visual comparison of the alignments of multiple sequences. This approach is at best laborious and, even when a phylogenetically optimal assay (i.e. maximally sensitive and specific) can be identified, empirical tests are time-consuming and can still produce poor PCR results. Software is available that can facilitate primer and probe design and analysis, but none of these completely meet the needs of designing PCR assays to detect target clades specifically. In particular, there is no software presently available to the community that can assess the probe and primer sets required for quantitative-PCR (qPCR) assays using an internal probe, the most accurate and specific qPCR technique. Given the wealth of sequence data rapidly accumulating from high-throughput sequencing and the diversity such methods are detecting in environmental and clinical samples (Sogin et al., 2006; Acosta-Martinez et al., 2008; Andersson et al., 2008; Biddle et al., 2008; Dowd et al., 2008; Hamady et al., 2008), it is clear that high-throughput PCR primer and probe design approaches are required to specifically and sensitively target genotypes in a complex sample.

The software we describe here, ThermoPhyl, exploits the output of proprietary software that can produce large numbers of thermodynamically optimized candidate assays for PCR and qPCR [e.g. Primer Express (Applied Biosystems (ABI), Warrington, UK), BatchPrimer3 (You et al., 2008)] and subsequently assesses each individual assay for sensitivity (proportion of target group perfectly matched) and specificity (number of nontarget organisms perfectly matched) against user-defined target and nontarget sequences in a local database. ThermoPhyl is built around a simple pattern-matching script and is designed for applications in which a user wishes to target a taxonomic group of interest in a complex sample. The name ThermoPhyl derives from its central goal, which is to test thermodynamically optimal PCR assays for phylogenetic sensitivity and specificity to arrive at an assay that is both thermodynamically and phylogenetically optimal. ThermoPhyl is designed to analyse a very large number of candidate assays simultaneously and to facilitate primer and probe set choice by providing an output that clearly summarizes the specificity and sensitivity of each candidate assay.

ThermoPhyl complements several software tools currently available such as NCBI Primer-Blast, probeCheck (Loy et al., 2008) and arb (Ludwig et al., 2004), which meet a range of user needs related to determining the phylogenetic specificity of targeted primers and probes. ThermoPhyl performs the high-throughput assessments of assays with one (e.g. FISH), two (conventional PCR) or three (e.g. TaqMan® qPCR) oligonucleotides, which no other available software can perform. Additionally, ThermoPhyl is installed locally, which allows users to utilize personal databases and fully control processing speed and throughput. ThermoPhyl can efficiently harness the power of large datasets by rapidly performing a large number of comparisons that are summarized in output sorted by specificity and sensitivity.

In this paper, we describe the rationale behind and the use of ThermoPhyl, compare it with other commonly used PCR primer and probe design and assessment programs and validate its use in the design of both conventional PCR and qPCR assays. For conventional PCR, we used ThermoPhyl to select a PCR primer set to amplify the mcrA gene of the methanogenic archaeal genus Methanosaeta and experimentally tested its specificity and sensitivity using pyrosequencing. For qPCR, we used ThermoPhyl to select four Desulfobulbus genotype-specific qPCR assays and compared the data produced by this analysis with existing data. The program, a user's manual, and a training dataset are made freely available to the research community at http://go.warwick.ac.uk/thermophyl.

Materials and methods

Program validation: traditional PCR primer set

Sequence database and target group definition

Using mcrA sequences from the functional gene pipeline/repository (http://fungene.cme.msu.edu/), a database was constructed in ARB (Ludwig et al., 2004). Based on maximum-likelihood phylogenetic reconstructions, a monophyletic clade containing 19 Methanosaeta sequences, including sequences from the three characterized isolates Methanosaeta concilii, Methanosaeta harundinacea, and Methanosaeta thermophila, was identified as a target group. BatchPrimer3 (You et al., 2008) was used with default parameters to design 50 primer pairs for each of these target sequences with amplicon lengths set between 400 and 500 bp.

After comparison with the target sequences, two primer pairs ranked highly by ThermoPhyl, F1-1044 (5′-CTACATGTCCGGYGGTGTC-3′) and R1-1507 (5′-TAGTTRGCGCCYCTCAKCTC-3′), and F2-1060 (5′-GTCGGWTTCACMCAGTACGC-3′) and R2-1470 (5′-TGCCCTCGTCKGACTGGTA-3′), were chosen for empirical testing after the inclusion of several degeneracies based on a manual comparison with the sequence database.
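The effect of adding degeneracies can be checked quickly by expanding the IUPAC codes in a primer into a regular expression. The following is only an illustrative Python sketch of such a manual comparison, not part of ThermoPhyl itself; the function names are hypothetical.

```python
import re

# IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def degenerate_to_regex(primer):
    """Compile a primer containing IUPAC codes into a regular expression."""
    parts = []
    for base in primer.upper():
        options = IUPAC[base]
        parts.append(options if len(options) == 1 else "[" + options + "]")
    return re.compile("".join(parts))


def degeneracy(primer):
    """Number of distinct non-degenerate oligonucleotides represented."""
    total = 1
    for base in primer.upper():
        total *= len(IUPAC[base])
    return total


f1 = "CTACATGTCCGGYGGTGTC"    # F1-1044, as given in the text
r1 = "TAGTTRGCGCCYCTCAKCTC"   # R1-1507, as given in the text
f1_pattern = degenerate_to_regex(f1)
print(f1_pattern.pattern, degeneracy(f1), degeneracy(r1))
```

Each added degenerate position multiplies the number of oligonucleotides in the primer mix, which is one reason degeneracies are kept to a minimum.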

Primer assessments

Primer specificity, sensitivity, and amplification efficiency were evaluated empirically using genomic DNA from M. concilii (DSM6752), M. harundinacea (DSM 17206), M. thermophila (DSM4774), Methanosarcina mazei (DSM2053), Methanosarcina acetivorans (DSM2834), and environmental DNA from sediment in the Colne Estuary, UK. DNA was extracted from actively growing cultures using the DNeasy Blood and Tissue kit (Qiagen, Crawley, UK) and environmental DNA was extracted from sediment samples representing marine and freshwater conditions [sites 1 and 10 in Oakley et al. (2010)] as described previously (Purdy et al., 1996; Purdy, 2005). The first round of PCR was performed in 25 μL volumes containing 1 × EpiCentre FailSafe Master Mix G (EpiCentre, Madison, WI), 600 nM each primer F1 and R1, and 1.25 U EpiCentre FailSafe Enzyme Mix. Touchdown PCR was performed with an initial denaturation of 96 °C for 2 min, followed by 10 cycles of 94 °C for 30 s, 52 °C (−1 °C per cycle) for 30 s, and 72 °C for 40 s, followed by 22 cycles with 42 °C annealing. Because of the low amounts of Methanosaeta DNA present in the environmental samples, 1 μL of these PCR products was used in a second round of PCR containing 1 × Promega PCR buffer, 2.5 mM MgCl2, 10 μg bovine serum albumin, 600 nM each primer F2 and R2, 200 nM dNTPs, and 1 U Promega Taq polymerase. Thermal cycling consisted of 96 °C for 2 min, 30 cycles of 94 °C for 30 s, 50 °C for 30 s, 72 °C for 40 s, and a final extension at 72 °C for 10 min. Triplicate PCRs from each of three biological replicates [sediment samples taken within 50 cm at each site (Hawkins & Purdy, 2007)] were cleaned using the QiaQuick Gel Extraction kit (Qiagen) and pooled after normalization based on the quantification of PCR products using QuantIt PicoGreen (Invitrogen, Paisley, UK) as per the manufacturer's instructions; fluorescence was measured using a Perkin-Elmer Wallac Victor2 1420 plate reader.
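The touchdown annealing profile described above can be written out cycle by cycle. The short Python sketch below assumes that the first touchdown cycle anneals at 52 °C and that each later touchdown cycle is 1 °C lower; the function name and defaults are illustrative only.

```python
def touchdown_annealing(start_c=52.0, step_c=-1.0, touchdown_cycles=10,
                        final_c=42.0, final_cycles=22):
    """Return the annealing temperature (deg C) used in each PCR cycle."""
    # Touchdown phase: the temperature drops by step_c after every cycle.
    temps = [start_c + step_c * i for i in range(touchdown_cycles)]
    # Fixed phase: remaining cycles at the final annealing temperature.
    temps += [final_c] * final_cycles
    return temps


schedule = touchdown_annealing()
print(len(schedule), schedule[0], schedule[9], schedule[10])
```

Under this reading the last touchdown cycle anneals at 43 °C, so the profile steps smoothly into the 22 cycles at 42 °C.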

Pyrosequencing methods

Pyrosequencing was performed at Research and Testing Laboratory (Lubbock, TX: http://www.researchandtesting.com) using tagged amplicon methods similar to those described previously (Dowd et al., 2008) modified for titanium chemistry (Roche, Indianapolis). In brief, concatamer primers were synthesized using the construct ‘5′ 454TitaniumLinkerA-tag-primer 3′’, where the 454 linker A was biotin labeled and based on the Roche amplicon sequencing Titanium Linker A, and the tag was a random 10mer (GC content 40–60%) used to bin out sequences resulting from a specific sample and the primers used for this study. The reverse concatamer was in the format ‘5′ 454TitaniumLinkerB-primer’, where the primer was the appropriate reverse primer for the reaction. Twenty cycles of PCR were utilized (94 °C for 30 s, 50 °C for 30 s, 72 °C for 40 s), with a final extension at 72 °C for 10 min to incorporate the linkers and tag. Pyrosequencing based on titanium bulk sequencing methods was performed according to the manufacturer's protocols by introducing the amplicon into the steps in the protocol following library creation. A 200 flow Titanium sequencing run was performed according to the Roche protocols with amplicon signal processing. Following the sequencing and image processing, the sequences were binned out based on tag sequence into individual multi-fasta files and used for data analysis.
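Binning reads by tag amounts to a lookup on the first 10 bases of each read. The Python sketch below is an illustration only: it assumes the linker has already been removed so that each read starts with its tag, and the tag sequences and sample names are invented for the example.

```python
from collections import defaultdict


def bin_reads_by_tag(reads, tag_to_sample, tag_length=10):
    """Group (name, sequence) reads by the sample their leading tag encodes.

    Reads whose tag is not recognised are collected under 'unassigned'.
    Reads are stored unchanged; tag trimming is left to a later step.
    """
    bins = defaultdict(list)
    for name, seq in reads:
        tag = seq[:tag_length].upper()
        sample = tag_to_sample.get(tag, "unassigned")
        bins[sample].append((name, seq))
    return dict(bins)


# Hypothetical tags; the real tag sequences are not given in the text.
tags = {"ACGTACGTAC": "site1", "TGCATGCATG": "site10"}
reads = [
    ("r1", "ACGTACGTACGTCGGATTCAC"),
    ("r2", "TGCATGCATGGTCGGTTTCAC"),
    ("r3", "NNNNNNNNNNGTCGGATTCAC"),
]
binned = bin_reads_by_tag(reads, tags)
```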

Pyrosequence data analysis

Raw sequence data were edited using a series of custom perl and bioperl scripts, which performed the following initial steps: trimming of pyrosequencing tag sequences, removal of sequences with one or more ambiguous base calls, and removal of sequences shorter than 410 bp. Sequences were screened for the presence of both forward and reverse primer sequences and then translated in all three forward frames and screened for the presence of a conserved motif (VGF) within the forward primer region; translated sequences without stop codons and no more than one unknown amino acid passed the screen and leading and trailing nucleotides were trimmed to complete codons in the appropriate frame. A total of 1745 sequences from site 10 (out of 7661) and 4517 sequences from site 1 (out of 10874) passed all screens. The minimum, median, and maximum sequence lengths were 411, 415, and 419 bp respectively.
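The screening logic can be illustrated with a short Python sketch. This is not the authors' perl/bioperl code: it omits the primer-presence check, uses the standard genetic code, and assumes that 'within the forward primer region' can be approximated by looking for the motif in the first 10 translated residues.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(seq, frame):
    """Translate seq in the given forward frame (0, 1 or 2)."""
    seq = seq.upper()
    end = len(seq) - (len(seq) - frame) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(frame, end, 3))


def passes_screen(seq, min_length=410, motif="VGF", motif_window=10):
    """Return (frame, protein) for a read that passes, otherwise None."""
    seq = seq.upper()
    if len(seq) < min_length:
        return None                      # too short
    if set(seq) - set("ACGT"):
        return None                      # ambiguous base call present
    for frame in range(3):
        protein = translate(seq, frame)
        if "*" in protein or protein.count("X") > 1:
            continue                     # stop codon or unknown residues
        if motif in protein[:motif_window]:
            return frame, protein
    return None


# Synthetic read: starts like primer F2 (GTC GGA TTC = V G F), then filler.
good_read = "GTCGGATTCACACAGTACGC" + "GCT" * 130
result = passes_screen(good_read)
```

Note that the first three codons of primer F2 (GTC GGW TTC) encode V-G-F, which is why the motif marks the correct reading frame.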

To determine the identity of sequences, a blastp analysis was performed by querying each translated sequence against a custom database of 44 mcrA reference sequences from pure cultures including all known Methanosaeta strains. The only changes to default blastp parameters were the use of soft-masking (-F ‘m S’) to enable filtering for low-complexity subsequences during the word seeding phase, but not the extension phase of the blastp algorithm.

Sequences were aligned with muscle (Edgar, 2004) invoked from a bioperl shell, which first appended an anchoring oligo sequence (5′-ACCACACAAAAACCCACA-3′) to both the 5′ and the 3′ ends of the alignment and then randomly split all sequences into subsets of 1000 sequences and aligned these first to each other, and then to a single reference sequence from M. concilii (AF313802). Each alignment of 1000 sequences was then appended to the previous, and finally, the entire alignment was aligned as a profile to the reference sequence. Alterations to default muscle parameters were gap-open and gap-extend penalties of −500. This alignment strategy was arrived at empirically by optimization of a training data set of pyrosequencing data derived from a clonal M. concilii sample and resulted in significantly increased accuracy and reduced computational time (c. 1 vs. 66 h) relative to the default command line invocation of muscle (data not shown).
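The preparation step of this strategy (anchoring and random splitting) can be sketched as follows. The sketch assumes the anchor is added to both ends of every sequence and stops before any aligner is invoked; the function name and the fixed random seed are illustrative choices, not details from the original scripts.

```python
import random

ANCHOR = "ACCACACAAAAACCCACA"  # anchoring oligo given in the text


def anchor_and_split(seqs, subset_size=1000, seed=0):
    """Add the anchor to both ends of each sequence, then split the
    shuffled sequences into subsets of at most subset_size."""
    anchored = [(name, ANCHOR + seq + ANCHOR) for name, seq in seqs]
    rng = random.Random(seed)   # seeded so the split is reproducible
    rng.shuffle(anchored)
    return [anchored[i:i + subset_size]
            for i in range(0, len(anchored), subset_size)]


demo = [("read%d" % i, "GTCGGATTCACACAGTACGC") for i in range(2500)]
subsets = anchor_and_split(demo)
print([len(s) for s in subsets])
```

Each subset would then be aligned separately before the subset alignments are combined and aligned to the reference.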

Distance matrices were constructed with phylip (Felsenstein, 1989) using the Kimura two-parameter model and sequences were grouped into operational taxonomic units (OTUs) using the furthest-neighbour method of dotur (Schloss & Handelsman, 2005). perl was used to generate a custom import filter to incorporate sequences into ARB (Ludwig et al., 2004) including a data table of sequence membership for each OTU. Phylogenetic trees were built with maximum-likelihood algorithms in ARB using representative sequences from each site for the 27 OTUs identified by a 20% sequence dissimilarity cutoff conservatively based on a pairwise nucleotide difference between M. concilii and M. harundinacea of 25% for the amplicon region.
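For reference, the Kimura two-parameter distance between two aligned sequences can be computed from the proportions of transitions (P) and transversions (Q) as d = −½ ln[(1 − 2P − Q)√(1 − 2Q)]. The Python sketch below implements this textbook closed form; it is not phylip's code, and phylip's own estimate may differ slightly.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq1, seq2):
    """Kimura two-parameter distance between two aligned sequences.

    Columns containing a gap or ambiguous base in either sequence are skipped.
    """
    transitions = transversions = compared = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        compared += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable alignment columns")
    p = transitions / compared
    q = transversions / compared
    inner = (1 - 2 * p - q) * math.sqrt(1 - 2 * q)
    if inner <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log(inner)


d_example = k2p_distance("AAAAAAAAAA", "GAAAAAAAAA")  # one transition in ten
```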

Program validation: qPCR methods

Probe and primer sets were designed using Primer Express (ABI), which outputs Taqman probe and primer sets that should function effectively using ABI's standard qPCR conditions. Multiple potential assays were derived from all of the target sequences that were available and all candidate assays were tested using ThermoPhyl.

qPCR assays were performed using Applied Biosystems TaqMan® gene expression master mix with MGB probes as described previously (Oakley et al., 2010). The primer and probe sequences for assays M, Mh, FW1, and FW2 are listed in Table 1. In brief, PCR conditions were as per the manufacturer's recommendations on an ABI 7000 or ABI 7500. Each assay was optimized with titrations of the primer and probe and the sensitivity and specificity were validated with plasmids from representative target and closely related nontarget clones. Optimized assays contained 300 nM (600 nM for FW2) each of the forward and reverse primers, 200 nM MGB probe, and 1 × ABI GeneExpression Master Mix in 25 μL reactions. Assays were considered valid when all no-template controls were negative, calibration-curve R² values were >98%, and the amplification efficiency was between 90% and 115%. All replicates from each sampling site were run on a single plate for consistency. Assays have been utilized and analysed previously in Oakley et al. (2010).
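The validity criteria above can be expressed as a simple check. The sketch below assumes the commonly used relationship between the slope of the calibration curve (Ct against log10 template quantity) and amplification efficiency, E = 10^(−1/slope) − 1; the function names and input layout are hypothetical.

```python
import math


def amplification_efficiency(slope):
    """Efficiency from the calibration-curve slope (Ct vs log10 quantity).

    A slope of about -3.32 corresponds to 100% efficiency (E = 1.0).
    """
    return 10 ** (-1.0 / slope) - 1.0


def assay_is_valid(slope, r_squared, ntc_detected):
    """Apply the three criteria from the text.

    ntc_detected: iterable of booleans, True where a no-template
    control produced a signal.
    """
    efficiency = amplification_efficiency(slope)
    return (not any(ntc_detected)
            and r_squared > 0.98
            and 0.90 <= efficiency <= 1.15)


perfect_slope = -1.0 / math.log10(2)   # slope for exact doubling per cycle
print(round(amplification_efficiency(perfect_slope), 3))
```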

Table 1. Desulfobulbus clade-specific qPCR assay primer and probe sequences selected using ThermoPhyl

Assay | Forward primer, reverse primer, probe (concatenated in this order) | Position
M | TGATTGACCACACCCGTATTACCGCCGTTCACCTCAGCCTTAGATCTCTGCTTGTCCGCTC | 673–783
Mh | CGCTGTTCATGCTTCCGATAGATCGATCATCGGCGGTTTCCTCGGTGTGCATCG | 620–681
FW1 | TCGCCATTCTCGGTATCCATCCGGTGATCCGGTCGTTCAAACCGCCGATGAT | 640–704
FW2 | CCGGTTAAGGCGGTTATGGCGCCGGCAAGGTCATGTGATCTGTTCGAGTATTTTGGTT | 595–653

Results and discussion

Program structure and function

Conventional methods of PCR primer design for the analysis of complex communities rely on an initial visual comparison of sequence alignments and then a painstaking comparison of potential primers. While programs such as ARB and Primer-Blast can perform these functions and do produce good PCR primers and probes, performing comparisons between different primer sets is laborious and does not guarantee success. Furthermore, designing a phylogenetically coherent, thermodynamically optimized Taqman-like qPCR assay is simply not possible in any of the presently available programs. Primer design programs can produce large numbers of thermodynamically optimized primer (Primer3, http://www.sourceforge.net) or primer/probe (Primer Express, ABI) sets, but only do so for individual sequences and so cannot produce assays that specifically target only the members of a clearly defined clade without the need for very laborious comparisons of individual candidate assays. ThermoPhyl fills this gap by comparing the output of high-quality primer design software with sequence data derived from a coherent phylogeny. A flow diagram showing the steps required to use ThermoPhyl is given in Fig. 1. ThermoPhyl is a simple pattern-matching perl script that compares primers and probes with two user-defined datasets: the ‘target group’ and the ‘nontarget group’. For any number of possible primer/probe sets, ThermoPhyl determines assay sensitivity, that is, how many of the ‘target group’ are a perfect match for each primer/probe set, and specificity, how many of the ‘nontarget group’ are a perfect match for the primer/probe sets, for each individual assay. ThermoPhyl then outputs a ‘sorted’ assay file detailing assays in order of the highest sensitivity and specificity first (see Fig. 2a). It also outputs a ‘raw-data file’ showing which members of the ‘target group’ and ‘nontarget group’ matched with each assay (Fig. 2b). 
From this output, it is possible to determine whether the addition of degeneracies in the primers might improve assay sensitivity, although such changes may adversely affect the assay as a whole.

Fig. 1. Flow diagram showing how ThermoPhyl is used in comparison with existing standard practice.

Fig. 2. Output files from ThermoPhyl (text files opened in Excel). (a) Results sorted according to the highest number of ‘target group’ hits for each individual assay. The F1/R1 primers have been moved to the top of the table and highlighted in grey. (b) Raw data, showing how each assay accumulates matches with target and nontarget sequences. The F1/R1 primers have been moved to the top of the table and highlighted in grey.

Program requirements

ThermoPhyl, a freeware perl script program (download perl from http://www.activestate.com/activeperl), is a simple matching program that runs on both Windows and Unix machines. In tests on moderately fast WinXP machines (e.g. 2 GHz Pentium CPU with 3 GB of RAM), ThermoPhyl tested 5000 candidate qPCR assays against a database of 5000 taxa, making the required 25 million comparisons in about 2.5 h. However, most users will have many fewer comparisons than this and for most applications ThermoPhyl generally produces output in seconds to minutes.
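The benchmark figures imply a rough throughput that can be used to estimate run times for smaller jobs. The Python sketch below simply rescales the reported numbers and assumes run time is proportional to the number of assay-by-sequence comparisons, which is an approximation rather than a measured property of the program.

```python
BENCH_ASSAYS = 5000
BENCH_TAXA = 5000
BENCH_HOURS = 2.5

comparisons = BENCH_ASSAYS * BENCH_TAXA          # 25 million comparisons
rate = comparisons / (BENCH_HOURS * 3600)        # comparisons per second


def estimated_seconds(n_assays, n_sequences, rate_per_second=rate):
    """Rough run-time estimate, assuming time scales with comparisons."""
    return n_assays * n_sequences / rate_per_second


# Example: 700 candidate assays against a 1000-sequence database.
print(round(rate), round(estimated_seconds(700, 1000)))
```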

To run ThermoPhyl, three input files are required: first, a fasta file that contains all of the desired target and nontarget sequences, all with unique names, thought to be present in the samples of interest. This file should contain as many representative sequences as possible (typically 100–50 000 depending on the application) to maximize confidence in distinguishing between target and nontarget groups. Because many public databases [e.g. GreenGenes (DeSantis et al., 2006) and Silva (Pruesse et al., 2007) for 16S rRNA genes] now contain many very similar sequences, users may want to reduce these databases to representative sequences.

The second file required is a text file containing only the names of the target sequences. The names must correspond exactly to those in the fasta file above and should be unique, such as a GenBank accession number.

Finally, a list of candidate assays based on the target sequences must be provided. These can be produced by a number of primer design programs. For traditional PCR BatchPrimer3 (You et al., 2008), http://probes.pw.usda.gov/cgi-bin/batchprimer3/batchprimer3.cgi, can provide specific candidate assays for a number of different target sequences from a single fasta file and is a good high-throughput solution to creating candidate assays to test. For qPCR, software such as ABI's Primer Express can quickly generate a list of candidate assays to test and so allow the use of standardized protocols for qPCR. Using these approaches, we have typically generated 50 candidate assays per target sequence, which are then compiled into a single tab-delimited text file, the candidate-assay file.
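To make the three inputs concrete, the Python sketch below reads a fasta file, a list of target names, and a tab-delimited candidate-assay file. The column layout of the candidate-assay file (name, forward primer, reverse primer, optional probe) is an assumption made for illustration; the user's manual should be consulted for the format ThermoPhyl actually expects.

```python
import csv
import io


def read_fasta(handle):
    """Return {name: sequence}; names must be unique."""
    parts = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            if name in parts:
                raise ValueError("duplicate sequence name: " + name)
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(p).upper() for n, p in parts.items()}


def read_target_names(handle, sequences):
    """Return the set of target names, checking each exists in the fasta."""
    names = {line.strip() for line in handle if line.strip()}
    missing = names - set(sequences)
    if missing:
        raise ValueError("targets not in fasta file: " + ", ".join(sorted(missing)))
    return names


def read_candidate_assays(handle):
    """Read tab-delimited rows: name, forward, reverse, optional probe."""
    assays = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or not row[0].strip():
            continue
        probe = row[3] if len(row) > 3 and row[3] else None
        assays.append({"name": row[0], "forward": row[1],
                       "reverse": row[2], "probe": probe})
    return assays


# In-memory stand-ins for the three files.
seqs = read_fasta(io.StringIO(">seqA demo\nACGT\nacgt\n>seqB\nTTTT\n"))
targets = read_target_names(io.StringIO("seqA\n\n"), seqs)
candidates = read_candidate_assays(
    io.StringIO("a1\tACG\tCGT\na2\tACG\tCGT\tTTT\n"))
```

Checking that every target name is present in the fasta file up front reflects the requirement that names correspond exactly between the two files.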

Comparison of ThermoPhyl with other primer analysis programs

Several programs have been developed to allow users to design or assess probes and primers to determine whether they are specific and sensitive. However, ThermoPhyl was designed when it became apparent that there was no software available that was designed to assess whether a Taqman-like qPCR probe/primer set was specific and sensitive. To highlight the differences and similarities between these programs, ThermoPhyl is compared with several other commonly used programs in Table 2. The primary advantage that ThermoPhyl has over all the other programs is that it is the only program that is capable of assessing qPCR probe/primer sets and that is designed to assess many candidate assays at one time. The other programs all have additional limitations, which makes ThermoPhyl a valuable tool in the design of probes and primers. One of ARB's strengths is its oligonucleotide probe design capability, which works with reference to its sequence database and allows the user to assess the newly designed probes visually. However, ARB's PCR primer design is much more limited. It utilizes just a single target sequence to produce candidate assays and offers no simple possibility of assessing these primer sets against the database. Primer-Blast also designs PCR primer sets (using Primer3) and, like ARB, is presently limited to using a single target sequence. Primer sets are then compared with a choice of databases, including the entire nucleotide database, but with a strong recommendation to use nonredundant databases such as Refseq RNA. The user can set the target clade; however, it is then entirely dependent on the GenBank taxonomy to define the target clade and there is no option to use a user-defined database. PRIMROSE (Ashelford et al., 2002) does allow the use of user-defined databases, although it is designed to work with the RDP database. Its major drawbacks are that it only designs oligonucleotides, not PCR primer sets, and each oligonucleotide needs to be individually assessed against the database to determine exactly what it matches, which is time-consuming. Therefore, after a search for potential oligonucleotides, potential primer sets will need to be further assessed for thermodynamic suitability. Finally, probeCheck is designed to assess previously designed oligonucleotides, but does so individually and not as a primer set. Therefore, ThermoPhyl, along with programs such as Primer Express and BatchPrimer3, allows the very rapid design and assessment of very many thermodynamically optimized PCR and qPCR assays against any dataset that the user chooses.

Table 2. Comparison of ThermoPhyl with four other commonly used oligonucleotide analysis programs

Characteristics | ThermoPhyl | ARB | Primer-Blast | PRIMROSE | probeCheck
Types of assay
qPCR probe/primer sets | Yes | No | No | No | NA
Standard PCR | Yes | Yes | Yes | No | NA
Oligonucleotide | Yes | Yes | No | Yes | Yes
Design primers/oligos | No | Yes | Yes | Yes | No
No. of sequences assays designed against | User defined (external program) | User defined (oligos); 1 (primers) | 1 | User defined | User defined (external program)
Assessment
Phylogenetic basis | Yes | Yes (oligos); No (primers) | Yes | Yes | Yes
Databases | User defined | 16/23S | RNA ref Seq | User defined | 16S
No. of target sequences assessed | User defined | User defined (oligos); 1 (primers) | User defined | User defined | NA
Target/nontarget definition | Yes | Yes | Yes | Yes | Yes
Defines nontarget hits | Yes | Yes (oligo); No (primers) | No | Yes | Yes
Degenerate bases allowed? | No | Yes | No | Yes | Yes
No. of assays assessed | No limit | User defined (oligos); 0 (primers) | ≤2000 | NA | 10
Output | Text | Text | Graphic html | Specific | Text
Local/Remote | Local | Local | Remote | Local | Remote
Reference | This study | Ludwig et al., 1997 | NCBI website | Ashelford et al., 2002 | Loy et al., 2008
  • * ARB designs and compares oligonucleotide probes with reference to the entire database, but does not do so for PCR primer sets, which are designed with reference to a single sequence only and the subsequent primer sets cannot be directly compared against the database.

  • † PRIMROSE designs and assesses single oligonucleotides, not PCR primer sets.

  • ‡ probeCheck is designed to check single oligonucleotide specificity.

  • § Limit is based on the processing time, which depends on the speed of the local computer, the number of candidate assays, and the sizes of the target and nontarget databases.

Validation of ThermoPhyl-selected PCR and qPCR assays

Validation of a ThermoPhyl-selected conventional PCR assay

As the first empirical test of ThermoPhyl, conventional PCR primers specifically targeting the α-subunit of the methyl-coenzyme M reductase gene (mcrA) of Methanosaeta were selected following the scheme set out in Fig. 1. Over 700 candidate primer sets were designed using BatchPrimer3 from a range of available Methanosaeta mcrA sequences from both isolates and environmental clone sequences. As this particular taxon generally represents a small proportion of DNA in the estuarine sediments analysed here, a nested PCR approach was adopted and so two primer sets were selected. After routine manual optimization of thermal cycling conditions, the primers selected by ThermoPhyl produced a strong single band from both genomic DNA prepared from the three available Methanosaeta isolates (M. concilii, M. harundinacea and M. thermophila) and environmental DNA preparations, but did not amplify DNA from the closely related M. mazei or M. acetivorans (Fig. 3).

Fig. 3. PCR amplification of the Methanosaeta mcrA gene from Colne estuary sediment using the primers F2 and R2 as described in the text. Lanes: 1, Methanosaeta concilii; 2, Methanosaeta harundinacea; 3, Methanosaeta thermophila; 4, Methanosarcina mazei; 5, Methanosarcina acetivorans; 6, site 1 sediment DNA; 7, site 10 sediment DNA; and 8, no-template control.

Using these nested primer sets, Methanosaeta-specific PCR products were amplified from DNA extracted directly from the two contrasting environmental sediment samples (marine-dominated site 1 and freshwater-dominated site 10) from the River Colne, Essex, UK (Hawkins & Purdy, 2007; Oakley et al., 2010). These amplicons were analysed using only a small proportion (∼1%) of a 454 pyrosequence read. After screening to remove poor-quality sequences as described above, 6262 high-quality sequences remained (4517 sequences from site 1 and 1745 sequences from site 10). These sequences were checked using a local blastp analysis to a database of 44 mcrA sequences, and across the two sampling sites, 99.9% (6257/6262) of sequences were most closely related to Methanosaeta (Table 3). Rarefaction analysis of these sequences showed that, at a sequence dissimilarity of 20% (the difference between the mcrA genes of the two mesophilic Methanosaeta isolates, M. concilii and M. harundinacea, is 25%; therefore, 20% is a reasonable species-level definition), the Methanosaeta communities at both sites have been completely sampled (Fig. 4). Pyrosequencing of these environmental amplicons also revealed an extensive novel diversity within the Methanosaeta clade (Fig. 5). Twenty-seven OTUs were defined at a 20% cut-off, all falling within the Methanosaeta clade, and yet many clearly represent novel lineages affiliated with Methanosaeta. Therefore, these nested PCR primer sets are both specific and sensitive and show the value of using ThermoPhyl in primer selection.
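Whether a community has been sampled to saturation can be judged from a rarefaction curve. The Python sketch below computes the expected number of OTUs in a random subsample using the standard analytical formula; it is offered only as an illustration, since the resampling procedure used by dotur may differ, and the OTU counts shown are invented.

```python
from math import comb


def rarefied_richness(otu_counts, n):
    """Expected number of OTUs observed in a random subsample of n reads.

    otu_counts: number of reads assigned to each OTU.
    """
    total = sum(otu_counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total read count")
    all_subsamples = comb(total, n)
    # For each OTU, 1 minus the probability that none of its reads is drawn.
    return sum(1 - comb(total - c, n) / all_subsamples for c in otu_counts)


# Hypothetical OTU abundances for illustration only.
counts = [500, 300, 150, 40, 8, 2]
curve = [(n, rarefied_richness(counts, n)) for n in (10, 100, 500, 1000)]
```

A curve that flattens well before the full sample size is reached indicates that further sequencing is unlikely to recover additional OTUs at that dissimilarity cutoff.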

Table 3. Summary of the local blastp analysis of the Methanosaeta mcrA sequence data

Top blastp hit | Site 1 | Site 10
Targets
Methanosaeta concilii AF313802 | 2563 | 1674
Methanosaeta harundinacea AY970348 | 1825 | 13
Methanosaeta concilii VeAc9 AF313803 | 115 | 48
Methanosaeta thermophila PT gb|ABK14360.1 | 9 | 10
Nontargets
Methanothermobacter marburgensis X07794 | 2 | 0
Methanothermus fervidus J03375 | 1 | 0
Methanococcus jannaschii mrtA U67465 | 1 | 0
Methanococcoides burtonii U22234 | 1 | 0
Total | 4517 | 1745
  • Pyrosequence data were analysed by blastp run locally on a custom database containing 44 pure culture mcrA sequences, including all Methanosaeta strains. The only changes to default parameters were the use of soft-masking (-F ‘m S’) to enable filtering for low-complexity subsequences during the word seeding phase, but not the extension phase of the blastp algorithm.

Fig. 4. Rarefaction analysis of Methanosaeta mcrA pyrosequence data with OTUs defined at 10%, 15%, 20%, and 25% sequence dissimilarity from (a) site 1 and (b) site 10. Curves marked with * were saturated for OTU definitions. The sequence dissimilarity between Methanosaeta concilii and Methanosaeta harundinacea is 25% for the amplicon region.

Fig. 5. Phylogenetic representation of mcrA sequence diversity recovered by ThermoPhyl-generated primers. The tree is a maximum-likelihood phylogenetic reconstruction based on alignment of nucleotides restricted to the amplicon region. Sequences are labelled with either S1 or S10, indicating whether they are from site 1 or site 10, respectively, and those shown are representatives of the 27 OTUs defined as described in the text.

Validation of ThermoPhyl-selected qPCR assays

The second validation test was to use ThermoPhyl to select Taqman qPCR primer and probe sets, designed using Primer Express (ABI), that targeted four Desulfobulbus genotypes detected in the Colne estuary, UK (data presented previously in Oakley et al., 2010). The four Desulfobulbus clades were selected for qPCR analysis because they exhibited a differential distribution along the estuary based on an initial denaturing gradient gel electrophoresis analysis, a distribution that was subsequently supported by clone sequence data (Fig. 6). Candidate assays were designed to target each of the four genotypes; the specificity and sensitivity of these assays were determined using ThermoPhyl and the best assays were selected. Data from these four assays supported our previous data showing that all four genotypes have a restricted distribution along the estuary (Fig. 6), indicating that the assays were targeting the correct genotypes. However, these data do not prove that the assays are specific. It can be reasoned that a ‘good’ qPCR assay should produce PCR products that, if sequenced and analysed phylogenetically, form a monophyletic clade, with the caveat that as qPCR assays usually produce very short fragments, the resultant trees are unlikely to be very robust. We therefore cloned and sequenced ∼12 amplicons for each assay, and all four assays produced monophyletic groups after sequence analysis (data not shown). ThermoPhyl was thus successful in selecting highly specific and sensitive qPCR primers and probes from a large number of thermodynamically optimized candidate assays.

Fig. 6. Identity and distributions of Desulfobulbus-affiliated dsrB ecotypes (adapted from fig. 3 in Oakley et al., 2010). (a) Phylogenetic positions of the four assayed clades within Desulfobulbus. Tree was reconstructed from amino acid informed DNA alignment using the maximum-likelihood algorithm AxML. (b) Distributions of these four genotypes across the estuary as assessed by denaturing gradient gel electrophoresis. Values represent peak heights normalized within each lane to control for loading differences. (c) Distributions of four genotypes as assessed by clade-specific qPCR assays. Values represent means of three biological replicates (error bars=1 SEM).

These two validation tests show that ThermoPhyl is capable of analysing large numbers of potential PCR and qPCR assays for specificity and sensitivity using a user-defined sequence database and thus will allow the user to make a phylogenetically informed choice about which primer sets to use for a specific target group. This is particularly powerful with qPCR assays as no presently available phylogenetic program is capable of assessing the validity of even a single qPCR primer/probe set, let alone many hundreds of candidate assays. Therefore, while ThermoPhyl is in itself a simple pattern-matching program, it fills a gap in the available software by linking a wholly user-defined dataset to powerful PCR and qPCR primer design software.

Potential pitfalls and specific recommendations

To use ThermoPhyl effectively, target groups must form a natural phylogenetic group. Before using ThermoPhyl, sequences should be properly placed in some sort of a phylogenetic tree to evaluate this and to designate target and nontarget sequences in a way that reflects the evolutionary history of the gene in question. If the target sequences do not form a coherent phylogenetic group, it will be difficult to design an accurate assay, although it is possible that different sequence data (e.g. another gene) for the same taxa could still be used in such a case.

Additionally, the more sequence data available for both target and nontarget groups, the better. The strength of ThermoPhyl, and in fact its central goal, is to summarize a very large number of comparisons to arrive at a single ‘best’ assay. However, users should be aware that some genes or clades may prove more challenging than others, especially if the targeted gene is highly variable or does not carry a strong phylogenetic signal. Additional guidance is provided in the user's manual, and common questions are listed in the FAQ, both accessible via the ThermoPhyl website.

While ThermoPhyl can perform the most laborious aspects of selecting primer sets, it is necessary for the user to engage with the ThermoPhyl output to determine how well the ‘best’ assays suit their purpose. We have found that using ThermoPhyl's output within programs such as ARB can rapidly confirm the potential value of a primer set and highlight where degenerate bases could improve sensitivity without unduly compromising specificity, although this is not recommended for qPCR probe and primer sets unless absolutely necessary.

Conclusions

ThermoPhyl can utilize large sequence datasets now commonly available to identify phylogenetically specific and sensitive assays for traditional and qPCR. ThermoPhyl is run locally on a user's computer, avoiding the constraints of internet data transmission, and allowing for customized, personal databases. ThermoPhyl can provide a high-throughput data-driven solution to the problem of targeted assay design in complex samples and is made available free to the research community at http://go.warwick.ac.uk/thermophyl/.

Acknowledgements

This work is part of the Marie Curie Excellence Grant for Teams project, MicroComXT (MEXT-CT-2005-024112), funded by the European Commission under FP6.

References
