How to Design Primers Using Gene Runner
Abstract
Bioinformatics has been one of the fastest growing scientific areas over the last decade. It focuses on the use of informatics tools for the organization and analysis of biological data. One sign of the importance of these tools is that dozens of software programs are now available for genomic and proteomic studies. Thus, there is a growing field (private and academic) with a need for bachelor of science students with bioinformatics skills. In consideration of this need, described here is a problem-based class in which students are asked to design a set of intrageneric primers for PCR. The exercise is divided into five classes of 1 h each, in which students use freeware bioinformatics tools and databases available through the Internet. Besides designing the set of primers, the students will learn the significance and use of the major bioinformatics procedures, such as searching a database, conducting and analyzing sequence multialignment, comparing sequences with a database, and selecting primers.
PCR is a technique used to amplify specific nucleic acid sequences [1]. It utilizes two synthetic oligonucleotides called primers, which are complementary to two regions of the target DNA to be amplified. The reaction mixture, which contains template DNA, primers, deoxyribonucleoside triphosphates, buffer, and Taq DNA polymerase, is submitted to a series of amplification cycles in which it is exposed to three different temperatures. First, template DNA is denatured (94 °C); then the primers anneal to it (50–65 °C); and finally, a new DNA strand is polymerized (72 °C). DNA fragments matching both primers are amplified exponentially. By this technique, a short stretch of DNA (<10 kb) can be amplified about a millionfold so that one can determine its characteristics, such as size and nucleotide sequence. This makes PCR a very useful molecular biology tool: it yields enough DNA for subcloning and expression for protein production, for use as a DNA probe in gene cloning, or for template quantification, among other applications [2].
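The "millionfold" figure follows from the doubling that ideally occurs in each cycle. The arithmetic can be sketched as below, assuming perfect doubling per cycle (real reactions are less efficient and eventually plateau); the function names and the efficiency parameter are illustrative and do not come from any PCR software.

```python
import math


def fold_amplification(cycles: int, efficiency: float = 1.0) -> float:
    """Fold increase after `cycles` rounds.

    Each cycle multiplies the copy number by (1 + efficiency);
    efficiency = 1.0 is the idealized case of perfect doubling.
    """
    return (1.0 + efficiency) ** cycles


def cycles_needed(fold: float, efficiency: float = 1.0) -> int:
    """Smallest whole number of cycles that reaches at least `fold`."""
    return math.ceil(math.log(fold) / math.log(1.0 + efficiency))


if __name__ == "__main__":
    print(fold_amplification(20))   # ideal doubling over 20 cycles
    print(cycles_needed(1e6))       # cycles required for a millionfold increase
```

Under this idealization, about 20 cycles are enough to pass a millionfold increase.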
The design of such primers is a key step in successful PCR. It is a straightforward procedure when the sequence of the target DNA is known. In this case, the primers may be designed just by applying rules for a few of their characteristics (length, composition, melting temperature (Tm), duplex formation, hairpin formation, etc.), which can be easily calculated using dedicated software [3, 4]. However, the design may become complex when a pair of primers is required to amplify a specific gene in different species of a particular genus. This type of intrageneric primer is of great interest when cloning a new gene for which sequence information is available only for related species [5, 6]. It may also be applied for phylogenetic purposes [7, 8] and diagnosis [9]. One way to design intrageneric primers is through the use of a set of bioinformatics software programs. These tools have arisen, mainly in the last two decades, from a merging of informatics and biological sciences to help investigators organize and analyze the huge amount of data (DNA and protein sequences) generated. Today, more than 40 million sequences have been deposited in GenBank™ at the National Center for Biotechnology Information [10]. Thus, there is a great deal of information being generated, and it needs to be analyzed to produce more useful information. As a result, there is a growing market for developers and users of bioinformatics tools. Therefore, it is clear that a molecular biology course, or any related science program, would be a good place to introduce undergraduate students to some of the main bioinformatics tools and databases. This way, students would have an opportunity to learn about the importance of current bioinformatics tools and how to use them.
A problem-based approach is proposed to introduce bioinformatics into the undergraduate curriculum, in which the goal would be to design a set of intrageneric primers to amplify bacterial genes. The problem-based method is an educational strategy that has already been described as being efficient in teaching bioinformatics concepts and tools, such as protein structure and motifs, gene mining and comparison, reference search, etc. [11, 12]. To design the primers, students are guided through a series of five sessions, in which they will be required to use various bioinformatics tools through the Internet. Besides designing the primers, by the end of the exercise, the students will be capable of conducting some analysis, for instance data base search, sequence multialignment/comparison, and primer selection/evaluation. It is important to consider that this procedure is dependent on the availability of target gene sequences, as well as on a minimum of identity among the analyzed sequences. For this reason, the method presented is not considered infallible. However, it increases the possibilities of designing an efficient intrageneric primer and expanding the knowledge of bioinformatics among undergraduate science students.
EXPERIMENTAL PROCEDURES
The procedures to design intrageneric primers are divided into five sessions of 1 h each as follows: (a) search string establishment; (b) data bank construction; (c) alignment and consensus sequence determination; (d) BLAST (Basic Local Alignment Search Tool; see footnote 1) and conserved region identification; and (e) primer design and in silico testing. To exemplify these activities, the design of a set of intrageneric primers for α-amylases from Streptomyces species is presented under "Results and Discussion." Before starting the project, it is suggested that instructors motivate the students by emphasizing the applicability of the result that they will generate. For this purpose, it is recommended that students consider a problem situation in which the researcher (student) needs to determine the best set of intrageneric primers for a concrete use: for example, to develop a diagnostic test for a disease, to establish a fast system to determine whether a given isolate has some gene of biotechnological importance, to clone fragments of a gene, or to determine phylogenetic relationships. In any situation, it is recommended that students be given the opportunity to decide what genus and gene they want to consider.
Search String Establishment
The first step in designing a set of intrageneric primers is the construction of a specific database that compiles all the nucleotide sequences available in GenBank at the National Center for Biotechnology Information (NCBI) [10] that code for a given protein in a particular genus. To select those sequences, it is necessary to establish a good search strategy, in which correct keywords and Boolean operators are fundamental. To obtain these keywords, the search begins with the enzyme nomenclature database at the Expasy website [13, 14] to determine all the synonyms for a given target protein. To search the database, one uses the name of the enzyme or the Enzyme Commission number of the target protein. All the names retrieved are then copied to a keyword data bank text file, and this information is used later for the sequence search. Besides the enzyme information, another important keyword is the organism identification number (Taxonomy ID), which can be obtained at NCBI. By using this identification, it is possible to search GenBank for a whole genus without needing to add all the species names to the query line. The Taxonomy ID is obtained by going to the NCBI website [10], choosing the TaxBrowser link, filling the box with the target genus, and clicking on the genus name when the results return. Having both sets of information, i.e. enzyme names and Taxonomy ID, the user is ready to design the search string.
In any data mining procedure, the aim is to obtain the maximum number of positive hits and the minimum of useless information. To achieve this, most of the public databases can be searched using Boolean operators. These operators are of two types: the positive ones (OR and AND), which connect the words that we want to be present in the result, and the negative one (NOT), which excludes entries from the results. In some databases, as in the NCBI GenBank, the user may be able to specify which field should be searched (for example, endoglucanase[protein name] AND bacillus[organism]). The NCBI Handbook can be consulted for further information on Boolean operators, fields, or other NCBI tools [15]. At the end of this class, the student should be able to establish his/her own search string, which should contain the selected keywords (protein synonyms and Taxonomy ID) organized with Boolean operators in such a way as to retrieve the appropriate results. Besides the mentioned keywords, it is possible to enhance the quality of the results (full-length gene sequences) by including other expressions in the search string, such as cds (coding sequence; GenBank titles of full-length genes usually read "complete cds") or ORF (open reading frame). Also, one may eliminate irrelevant entries, for example, accessions that include vector in the title. These are just some suggestions for a search strategy, and certainly other terms can be added or omitted.
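The assembly of a search string from the two keyword sets can be sketched in a few lines. This is a minimal sketch, not an NCBI tool: the function name and its parameters are hypothetical, and the field tags simply mirror those of the α-amylase search expression reported under "Results and Discussion."

```python
def build_search_string(taxonomy_id, synonyms,
                        require=("cds",), exclude_title=("vector",)):
    """Combine a genus Taxonomy ID and protein synonyms into one query.

    taxonomy_id   -- NCBI Taxonomy ID of the genus (e.g. 1883)
    synonyms      -- protein names and/or EC number, joined with OR
    require       -- extra terms that must be present (joined with AND)
    exclude_title -- words that must not appear in the title (NOT)
    """
    organism = f'("txid{taxonomy_id}"[Organism])'
    names = " OR ".join(f"{name}[protein name]" for name in synonyms)
    parts = [organism, f"({names})"]
    parts += [f"({term}[All Fields])" for term in require]
    query = " AND ".join(parts)
    for word in exclude_title:
        query += f" NOT ({word}[title])"
    return query


if __name__ == "__main__":
    print(build_search_string(
        1883,
        ["alpha-amylase", "1,4-alpha-D-glucan glucanohydrolase",
         "Glycogenase", "3.2.1.1"],
    ))
```

Keeping the synonyms in a list makes it easy to reuse the same string for another genus by changing only the Taxonomy ID.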
Data Bank Construction
To proceed to the search step, the user needs to access the NCBI homepage [10] and select nucleotide in the search list. The search string is then entered in the box. Among the results displayed, the student selects those of interest, because some "garbage" is often retrieved. The sequences of interest are those that correspond to the target gene, belong to the desired genus, and contain the complete DNA sequence. An entry is selected by ticking the checkbox next to it. In most cases, a unique entry will appear for each DNA sequence. However, if duplicates show up, only the more recent one is chosen. To find out which one is newer, both are opened by clicking on Accession (the alphanumeric code that precedes the organism and/or protein name) and then examining the date field. Also, some entries, such as the complete genome ones, have thousands of genes presented in just one accession. In this case, the user must open the entry and find the desired gene sequence by searching for the gene name (Edit bar, Find On This Page). All enzyme synonyms are tried until the gene of interest is found.
Once the student has determined which gene sequences are useful, the sequences need to be copied in fasta format to a text document, which will be the specific gene data bank. The fasta format can be read by most of the bioinformatics tools and comprises basically two parts. The first is a tag line that starts with the ">" symbol followed by the gene name or sequence identification. The second is the DNA sequence per se, which usually appears on a new line, right after its identification label. An example of sequences organized in fasta format is presented in Fig. 1. To obtain the sequences ready to be copied in the fasta format, the user may select FASTA in the Display list and then select Send to Text. All selected sequences (filled checkbox) will be displayed in the desired format and will be ready to be copied. This procedure may only be performed with those Accessions that encode just one sequence. Those genes included in Accessions that encode more than one sequence must be copied in fasta format individually to the user data bank. As a result of this session, the student will be capable of searching a data base and retrieving the desired sequences.
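The two-part structure of the fasta format (a tag line starting with ">" followed by the sequence) can be illustrated with a small reader. This is a minimal sketch for teaching purposes; the function name is hypothetical and the short sequences in the usage example are made up, not taken from GenBank.

```python
def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from fasta text.

    A record starts at a line beginning with ">"; every following
    line up to the next ">" is joined into one uppercase sequence.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:                # close the last record
        records.append((header, "".join(chunks)))
    return records


if __name__ == "__main__":
    example = ">geneA example\nacgt\nTTGA\n>geneB\nGGCC\n"   # made-up data
    for name, seq in parse_fasta(example):
        print(name, len(seq), seq)
```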
Alignment and Consensus Sequence Determination
Once the students have organized (fasta format) all the nucleotide sequences available for the desired gene, they will be able to proceed to the alignment and consensus sequence (CS) determination step. In this class, all the data bank information will be analyzed together to determine the CS. The CS is a unique sequence that represents the most conserved nucleotides among the aligned genes. It will be used for further BLAST analysis and priming position selection. Depending on the tool used for CS determination, the result obtained may be a gapped sequence [16, 17], where the gaps represent the sections in which the nucleotides aligned do not reach an identity threshold (e.g. 50%). On the other hand, one may retrieve a continuous sequence [18, 19] with supplementary information regarding the percentage of identity for each nucleotide. In either case, the degree of identity among the aligned genes will determine the pattern of the CS. For instance, several sequences with high identity will give a representative and consistent CS. However, when evaluating diverse genes, the CS will be less informative with perhaps just a few blocks with high identity. The sections where CS has a high identity may be called conserved domains or regions. Conserved domains are regions where DNA has been less modified during species evolution. Usually these segments encode fundamental amino acids for protein function and for this reason, they are maintained [20].
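How an identity threshold turns an alignment into a gapped CS can be sketched as follows. This is an illustration of the idea only, not the algorithm used by MultAlin or VisCoSe: the function name, the treatment of alignment gaps, and the toy alignment are all assumptions.

```python
from collections import Counter


def consensus(aligned, threshold=0.5, gap="-"):
    """Column-wise consensus of equal-length aligned sequences.

    For each column, the most frequent nucleotide is kept when its
    share of all sequences reaches `threshold`; otherwise a gap is
    written. Returns the consensus string and the per-column identity.
    """
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("need aligned sequences of equal length")
    letters, identities = [], []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column if base != gap)
        if counts:
            base, count = counts.most_common(1)[0]
            identity = count / len(column)
        else:                              # column made only of gaps
            base, identity = gap, 0.0
        identities.append(identity)
        letters.append(base if identity >= threshold else gap)
    return "".join(letters), identities


if __name__ == "__main__":
    toy = ["ACGTA", "ACGAA", "ACCTA", "A-GCA"]      # made-up alignment
    print(consensus(toy, threshold=0.5))
    print(consensus(toy, threshold=0.75))
```

Raising the threshold leaves only the most conserved columns, which is the pattern expected when diverse genes are aligned.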
To determine the CS, the students will be required to use an alignment program. For teaching purposes, we chose VisCoSe (Visualization and Comparison of Consensus Sequences) [18, 19] because it is fast, easy to handle, and can support several accesses at the same time. However, any multialignment program that determines a CS may be used. Before going to the VisCoSe home page, students must select from the data bank the sequences to be aligned. Care must be taken to exclude from the analysis repeated sequences (same GI number) and those that are extremely long (3–4 times the average length) or very short (one-fourth the average length). The sequences are entered in the box at VisCoSe, and the data are submitted for processing. Shortly thereafter, the results are available as hyperlinks, including the consensus_src_alignment.htm link, which opens a page displaying two blocks of information. The first (on the top) is the alignment of the sequences submitted. The second (on the bottom) is the CS and its percentage of identity among the aligned sequences (Fig. 2). The user can then select the CS (by double-clicking on it) and copy it to proceed to the next step.
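The selection of sequences before alignment can be sketched as a simple filter. This is a minimal sketch under stated assumptions: the function name is hypothetical, the cutoffs (3 times and one-fourth of the mean length) are one reading of the ranges given above, and the identifier is taken to be whatever label uniquely marks a sequence (for example, the GI number).

```python
def filter_for_alignment(records, max_ratio=3.0, min_ratio=0.25):
    """Drop repeated identifiers and sequences of unusual length.

    records -- list of (identifier, sequence) pairs
    A sequence is kept when its length lies between min_ratio and
    max_ratio times the mean length of the de-duplicated set.
    """
    seen, unique = set(), []
    for identifier, sequence in records:
        if identifier not in seen:         # keep first copy only
            seen.add(identifier)
            unique.append((identifier, sequence))
    if not unique:
        return []
    mean_length = sum(len(seq) for _, seq in unique) / len(unique)
    return [(identifier, seq) for identifier, seq in unique
            if min_ratio * mean_length <= len(seq) <= max_ratio * mean_length]


if __name__ == "__main__":
    data = [("a", "A" * 100), ("b", "A" * 110), ("a", "A" * 100),
            ("c", "A" * 90), ("d", "A" * 20), ("e", "A" * 1000)]  # made-up
    print([identifier for identifier, _ in filter_for_alignment(data)])
```

Because extreme outliers also shift the mean, a median-based cutoff may be preferable for small data banks; the mean is used here only because the text refers to the average length.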
BLAST and Conserved Region Identification
As mentioned before, the best sections for primer annealing are those located in the conserved regions of the CS. An easy way to determine these regions is to compare the full-length CS with the sequences deposited in GenBank (NCBI). In doing so, the best regions will be those where more sequences are aligned. This is a very straightforward procedure and is common to any bioinformatics task in which the goal is to compare a given sequence with all those deposited at the NCBI. For this purpose, the Basic Local Alignment Search Tool (BLAST) [21] is used, which is available at NCBI. At the NCBI BLAST web page [22], students choose among the nucleotide tools the one called Search for short, nearly exact matches. The CS is copied to the Search box, and before proceeding with the analysis (BLAST! button), the database to be searched is limited by selecting Bacteria[ORGN] in the Limit results by entrez query at the Options section. Right after the analysis request, a new window will come up, where students should hit the Format! button. The results will be displayed in seconds to minutes, depending on how busy the NCBI server is at the time. The results are shown both graphically and as text. The graphical format facilitates the "macro" view of the alignment. It gives a clear picture of which sequences deposited in GenBank have homology to the CS and where. In this view, the query sequence (CS) is represented as a wider red bar (with the nucleotide numbers underneath), and the GenBank sequences are presented as narrower ones. Their colors depend on their size and identity to the target. Colors and their significance are presented in the results window. Considering the BLAST results, the students then search for two sections of the CS to which the largest number of high identity sequences are aligned. Besides this characteristic, these blocks should be at least 30–40 bp long and no less than 150–200 bp apart from each other.
Once the best segments are determined, the user needs to get the nucleotide sequence from that specific CS block. For this purpose, one may click on the bar that represents the aligned region and copy from it the nucleotide sequence of the query. It is important to copy the correct segment (by checking its nucleotide numbers) instead of other sequences presented for the same organism.
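The visual search for two well-covered blocks can also be expressed as a small calculation over the hit positions listed in the BLAST text output. The sketch below is an assumption-laden illustration, not part of BLAST: the function names, the coverage cutoff `min_hits`, and the hit coordinates in the usage example are all hypothetical, and "apart" is read as the number of bases between the two blocks.

```python
def coverage_blocks(query_length, hits, min_hits, min_block=30):
    """Find stretches of the query covered by at least `min_hits` hits.

    hits -- (start, end) positions on the query, 1-based and inclusive,
            all assumed to lie within 1..query_length
    Returns (start, end) blocks that are at least `min_block` bp long.
    """
    change = [0] * (query_length + 2)      # coverage changes per position
    for start, end in hits:
        change[start] += 1
        change[end + 1] -= 1
    blocks, depth, block_start = [], 0, None
    for pos in range(1, query_length + 1):
        depth += change[pos]
        if depth >= min_hits:
            if block_start is None:
                block_start = pos
        elif block_start is not None:
            if pos - block_start >= min_block:
                blocks.append((block_start, pos - 1))
            block_start = None
    if block_start is not None and query_length - block_start + 1 >= min_block:
        blocks.append((block_start, query_length))
    return blocks


def block_pairs(blocks, min_gap=150):
    """Pairs of blocks separated by at least `min_gap` bases."""
    return [(a, b) for i, a in enumerate(blocks) for b in blocks[i + 1:]
            if b[0] - a[1] - 1 >= min_gap]


if __name__ == "__main__":
    hits = [(1, 60), (10, 50), (5, 55), (250, 300),
            (255, 320), (240, 310), (100, 120)]     # made-up coordinates
    blocks = coverage_blocks(400, hits, min_hits=3)
    print(blocks, block_pairs(blocks))
```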
Primer Design and in Silico Testing
Considering the two blocks selected from the consensus sequence, the next step is to design a primer based on each of them. This may be performed using the Gene Runner v. 3.05 software (Hastings Software, Inc.), which can be downloaded for free [3]. Once the software is installed and opened, the analysis/oligo bar is selected, and the forward sequence is then pasted in the appropriate oligo box. Afterward, some nucleotides are removed from both ends until the oligonucleotide shows the characteristics of a good primer. Among the main properties of an efficient primer, one may consider a GC content of 35–60%, a Tm of 55–65 °C with only a small difference (0–3 °C) between the two primers, and no tendency to form primer dimers, hairpins, or loops. By using the switch oligo button, it is possible to fill in the oligo box for the antisense primer and perform the same steps as for the sense or forward primer. Finally, the forward and reverse (antisense) primer sequences and characteristics should be saved, as well as their annealing positions on the CS (expected amplicon size). The efficiency of these primers in gene amplification may be determined by using the BLAST tool (Search for short, nearly exact matches; against the bacterial database) at the NCBI, as done with the consensus sequence [22]. However, to conduct this analysis for both primers at the same time, they should be submitted with an "N" between the sequences. The "N" specifies that any distance may occur between the primers. More efficient primers will lead to a higher number of positive hits (sequences specific to the target gene and genus) in the BLAST results. Besides obtaining graphical and text data, it is also possible to obtain a taxonomy report, which facilitates the determination of how many retrieved sequences belong to the target genus. Finally, the students should compare the BLAST results, i.e. the sequences that match the specifications (genus and gene), with those already organized in the data bank. They should discuss the conformity or discrepancies between the two sets of information by considering the differences among the sequences and the quality (number of aligned sequences) of the block selected at the CS.
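Some of the primer checks described above can be reproduced by hand or with a few lines of code. The sketch below is not Gene Runner's method, which is not described here: the Tm estimate uses the simple rule of thumb of 2 °C per A/T and 4 °C per G/C, swapped in only for illustration, so its values will differ from those of dedicated software and the 55–65 °C window should be checked with the software itself. All function names are hypothetical; the two primers are those reported under "Results and Discussion."

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def gc_content(primer: str) -> float:
    """Percentage of G and C nucleotides in the primer."""
    primer = primer.upper()
    return 100.0 * (primer.count("G") + primer.count("C")) / len(primer)


def rule_of_thumb_tm(primer: str) -> int:
    """Rough Tm estimate: 2 degrees per A/T plus 4 degrees per G/C."""
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def reverse_complement(primer: str) -> str:
    """Sequence of the opposite strand, written 5' to 3'."""
    return primer.upper().translate(COMPLEMENT)[::-1]


def blast_query(forward: str, reverse: str) -> str:
    """Join both primers with an 'N' for a single BLAST submission."""
    return f"{forward}N{reverse}"


if __name__ == "__main__":
    sense = "TCAACGACCTGCTCTCG"
    antisense = "GCCGGTCAGCTACAAGAT"
    for name, primer in (("sense", sense), ("antisense", antisense)):
        print(name, len(primer), round(gc_content(primer), 1),
              rule_of_thumb_tm(primer))
    print(blast_query(sense, antisense))
```

Both example primers fall within the 35–60% GC range, and their rule-of-thumb estimates differ by 2 °C.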
RESULTS AND DISCUSSION
All the experimental procedures presented here have already been implemented in actual bioinformatics classes. An example of their potential is presented here in the design of intrageneric primers for Streptomyces α-amylase. First, the search string was established after obtaining the synonymous enzyme names as well as the genus Taxonomy ID. The search expression obtained was: ("txid1883"[Organism]) AND (alpha-amylase[protein name] OR 1,4-alpha-D-glucan glucanohydrolase[protein name] OR Glycogenase[protein name] OR 3.2.1.1[protein name]) AND (cds[All Fields]) NOT (vector[title]). The use of the above string retrieved 10 positive accessions at the NCBI GenBank. They were organized in fasta format, and the CS was determined as described. By analyzing the CS using the BLAST tool, it was possible to identify two blocks where the sequences were preferentially aligned (Fig. 3A). The CS sequences at those blocks were recovered for primer property analysis. Finally, the sense (5′-TCAACGACCTGCTCTCG-3′) and antisense (5′-GCCGGTCAGCTACAAGAT-3′) primers were tested in silico. BLAST analysis returned 36 positive hits for the Streptomyces genus, which were distributed among eight species and one unknown species. Among those hits, 20% were attributed to α-amylase, covering five species. Among the original α-amylase sequences present in the data bank, two accessions of Streptomyces lividans were not recovered by our primers. Further analysis (data not shown) indicated that those sequences had less identity to the others, which was probably the reason the primers were not representative of them.
During the exercises, the students' performance was subjectively evaluated. Usually, more than two-thirds of them were able to conduct the whole exercise following the above protocol without much difficulty. The other one-third had problems with information availability, which were easily solved by suggesting other targets. Nonetheless, it may be necessary to address problems in the consensus analysis step, in which the enormous biological diversity may result in dispersed conserved regions and consequently in a CS that is not representative. This may be solved by including in the search strategy some keyword that better specifies the desired gene family, class, or subtype. Also, a phylogenetic analysis may be included before the CS determination. This will allow the students to identify and select the most closely related sequences, in other words, those that are likely to give a good alignment and therefore a representative CS. This could be done using the CLUSTAL W tool [23] of the European Bioinformatics Institute [24], with further tree visualization by TreeView [25]. Despite the difficulties mentioned, all students were excited about the potential of the information generated. Furthermore, they could make use of this application with other genes. It is important to mention that the whole exercise was designed for first-time users. Therefore, a total of 5 h is recommended to complete the project. However, if the students already have some experience, this may be reduced to 2–3 h, or the project may even be conducted on their own as an exercise outside the class. Another option is to conduct a guided project with the mentioned example (Streptomyces/amylase) in class and ask the students to do another one by themselves with a different combination (genus/gene).
This methodology does not profess to be an infallible way to design intrageneric primers. However, it is a procedure that may indicate the most probable set of oligonucleotides for a given target. The establishment of a good search string allows this process to be easily implemented with various genera by just switching the organisms of interest. However, as mentioned before, its efficiency is dependent on the availability of information regarding the gene/genus of interest. Nevertheless, the objective here was to use this methodology as a problem-based approach to introduce bioinformatics into the undergraduate curriculum. In this situation, the professor may want to avoid the problem of a lack of information for a given gene/genus chosen by the student. Therefore, he or she may provide names of genes and genera for which more information is already available at GenBank, which will allow students to get a functional result at the end of the exercise. The following genera are suggested for this assignment: Bacillus, Streptomyces, Pseudomonas, Vibrio, Shewanella, Salmonella, Xanthomonas, Staphylococcus, Lactobacillus, Streptococcus, Clostridium, and Mycoplasma. With regard to the genes, the professor may suggest genes that encode proteins of biotechnological importance (amylases, cellulases, proteases, and lipases) or proteins that are active in carbon source utilization, transport, energy uptake, etc. Alternatively, the professor may ask the student to design a set of intergeneric primers, which follows almost the same protocol except that other genera are added to the search string. In conclusion, this hands-on approach allows students to learn the significance of some of the main bioinformatics procedures employed today (database search, sequence manipulation and analysis, and primer design) and how to use them. This information is very important in enhancing their knowledge in this growing area and may provide another opportunity to learn some aspects of biochemistry, genetics, and molecular biology as well.
Acknowledgements
We thank Dr. Albert Leyva for reading the manuscript.
REFERENCES
- 1 R. K. Saiki, D. H. Gelfand, S. Stoffel, S. J. Scharf, R. Higuchi, G. T. Horn, K. B. Mullis, H. A. Erlich (1988) Primer-directed enzymatic amplification of DNA with a thermostable DNA polymerase, Science 239, 487–491.
- 2 J. Sambrook, D. W. Russell (2001) in Molecular Cloning: A Laboratory Manual, 3rd Ed. (J. Sambrook and D. W. Russell, eds.) pp. 8.04–8.102, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY.
- 3 Gene Runner for Windows, available on-line at www.generunner.com.
- 4 S. Rozen, H. Skaletsky (2000) in Bioinformatics Methods and Protocols: Methods in Molecular Biology (S. Krawetz and S. Misener, eds.) pp. 365–386, Humana Press, Totowa.
- 5 F. Goedegebuur, T. Fowler, J. Phillips, P. Van der Kley, P. Van Solingen, L. Dankmeyer, S. D. Power (2002) Cloning and relational analysis of 15 novel fungal endoglucanases from family 12 glycosyl hydrolase, Curr. Genet. 41, 89–98.
- 6 A. Sunna, P. L. Bergquist (2003) A gene encoding a novel extremely thermostable 1,4-β-xylanase isolated directly from an environmental DNA sample, Extremophiles 7, 63–70.
- 7 E. Lieckfeldt, Y. Cavignac, C. Fekete, T. Börner (2000) Endochitinase gene-based phylogenetic analysis of Trichoderma, Microbiol. Res. 155, 7–15.
- 8 S. Yamamoto, H. Kasai, D. L. Arnold, R. W. Jackson, A. Vivian, S. Harayama (2000) Phylogeny of the genus Pseudomonas: intrageneric structure reconstructed from the nucleotide sequences of gyrB and rpoD genes, Microbiology (N. Y.) 146, 2385–2394.
- 9 A. Kamiya, A. Kikuchi, Y. Tomita, T. Kanbe (2004) PCR and PCR-RFLP techniques targeting the DNA topoisomerase II gene for rapid clinical diagnosis of the etiologic agent of dermatophytosis, J. Dermatol. Sci. 34, 35–48.
- 10 NCBI, available on-line at www.ncbi.nlm.nih.gov.
- 11 A. L. Feig, E. Jabri (2002) Incorporation of Bioinformatics Exercises into the Undergraduate Biochemistry Curriculum, Biochem. Mol. Biol. Educ. 30, 224–231.
- 12 J. A. Boyle (2004) Bioinformatics in undergraduate education, Biochem. Mol. Biol. Educ. 32, 236–238.
- 13 ENZYME, Enzyme nomenclature database, available on-line at ca.expasy.org/enzyme.
- 14 A. Bairoch (2000) The ENZYME database in 2000, Nucleic Acids Res. 28, 204–205.
- 15 The NCBI Handbook, available on-line at www.ncbi.nlm.nih.gov/books/bv.fcgi?call=bv.View.ShowTOC&rid=handbook.TOC&depth=2.
- 16 MultAlin, Multiple sequence alignment by Florence Corpet, available on-line at prodes.toulouse.inra.fr/multalin.
- 17 F. Corpet (1988) Multiple sequence alignment with hierarchical clustering, Nucleic Acids Res. 16, 10881–10890.
- 18 VisCoSe, Visualization and Comparison of Consensus Sequences, available on-line at viscose.ifg.uni-muenster.de.
- 19 M. Spitzer, G. Fuellen, P. Cullen, S. Lorkowski (2004) VisCoSe: visualization and comparison of consensus sequences, Bioinformatics (Oxf.) 20, 433–435.
- 20 D. Higgins, W. Taylor (2000) Bioinformatics: Sequence, Structure and Databanks, A Practical Approach, pp. 143–165, Oxford University Press, Oxford.
- 21 S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, D. J. Lipman (1997) Gapped BLAST and PSI-BLAST: A new generation of protein database search programs, Nucleic Acids Res. 25, 3389–3402.
- 22 BLAST (Basic Local Alignment Search Tool), available on-line at www.ncbi.nlm.nih.gov/BLAST.
- 23 J. D. Thompson, D. G. Higgins, T. J. Gibson (1994) CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Res. 22, 4673–4680.
- 24 CLUSTAL W, available on-line at www.ebi.ac.uk/clustalw.
- 25 TreeView, available on-line at taxonomy.zoology.gla.ac.uk/rod/treeview.html.
Footnote 1. The abbreviations used are: BLAST, Basic Local Alignment Search Tool; NCBI, National Center for Biotechnology Information; CS, consensus sequence; VisCoSe, Visualization and Comparison of Consensus Sequences.
Source: https://iubmb.onlinelibrary.wiley.com/doi/full/10.1002/bmb.2006.494034052641