Bioinformatics

for Non-Biologists

by Nikolai Shokhirev


Introduction

There is a lot of information about the genetic code, chromosomes, genes, DNA and RNA in books, papers and on the Internet. Therefore, I do not explain all the terms here. If necessary, you can find detailed explanations, beautiful pictures and 3D models.

On the other hand, many explanations are overloaded with unnecessary detail or suffer from a lack of clarity.

Below I present the minimal version that I figured out for myself.

 

Structure

The molecules of DNA and RNA consist of a sugar-phosphate backbone with attached nucleobases. The backbone is not symmetrical: it has a so-called 5' end and a 3' end:

 

 One DNA or RNA strand

It also can be said that DNA and RNA are composed of nucleotide subunits (monomers):

 

Nucleotide 

The nucleotides can have one of five bases attached: adenine (A), cytosine (C), guanine (G), thymine (T) and uracil (U). U is rarely found in DNA but RNA usually contains U in place of T.

   Base = A, C, G, T, U
   Sugar (deoxyribose in DNA, ribose in RNA)
   Phosphate 

Usually a whole nucleotide is denoted by the letter of its base. From the bioinformatics standpoint, such polymer molecules are just long text strings over a 4-letter alphabet (A, C, G, T for DNA and A, C, G, U for RNA).
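The string view can be sketched in a few lines of Python. This is only an illustration: the function name classify and its return labels are my own choices, not part of any standard tool.

```python
# The two 4-letter alphabets described above.
DNA_ALPHABET = set("ACGT")
RNA_ALPHABET = set("ACGU")


def classify(sequence: str) -> str:
    """Label a nucleotide string by the alphabet it fits (hypothetical helper)."""
    letters = set(sequence.upper())
    if not letters:
        return "invalid"
    is_dna = letters <= DNA_ALPHABET
    is_rna = letters <= RNA_ALPHABET
    if is_dna and is_rna:
        # Neither T nor U occurs, so the letters alone cannot decide.
        return "DNA or RNA"
    if is_dna:
        return "DNA"
    if is_rna:
        return "RNA"
    return "invalid"


print(classify("ACTG"))  # DNA
print(classify("acug"))  # RNA
```

A string that mixes T and U, or contains any other letter, is reported as invalid.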

Unfortunately, the reality is not so simple.

 

DNA Double Helix

 DNA consists of two individual molecules (strands) running in opposite directions and connected by hydrogen bonds between the nucleobases: 

(5')..A..C..T..G..(3')
      |  |  |  |
(3')..T..G..A..C..(5')

Double stranded DNA 

A is always connected with T and C with G. A-T and G-C base pairs can occur in any order within DNA molecules.
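The pairing rule makes the second strand fully determined by the first. A minimal sketch in Python, assuming upper- or lower-case input over A, C, G, T only (the function names are illustrative):

```python
# Watson-Crick pairing: A with T, C with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Partner strand, written in the same left-to-right order (so it reads 3' to 5')."""
    return "".join(PAIR[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Partner strand rewritten in the conventional 5' to 3' direction."""
    return complement(strand)[::-1]


top = "ACTG"                    # (5')..A..C..T..G..(3')
print(complement(top))          # TGAC, the lower strand of the diagram above
print(reverse_complement(top))  # CAGT, the same strand read from its 5' end
```

Because the two strands run in opposite directions, the partner strand has to be reversed before it can be read 5' to 3' like any other sequence.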

 

Cell Division (Mitosis)

Before a cell divides, its DNA is copied (replicated). This is a relatively simple process (supported by a very complex molecular mechanism): the two strands of DNA separate ("unzip") and a complementary strand is synthesized for each of them. The result is two DNA molecules identical to the original one, because each strand fully determines its partner and so both strands carry the same information.
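The copying step can be mimicked with strings. In this sketch a double-stranded DNA is a pair (top, bottom), with the bottom strand written 3' to 5' so that it lines up under the top one; this representation and the name replicate are assumptions made for the example.

```python
# Translation table implementing the A-T and C-G pairing.
PAIR = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Base-by-base partner of a strand, in the same left-to-right order."""
    return strand.upper().translate(PAIR)


def replicate(duplex):
    """Unzip a (top, bottom) duplex and rebuild the missing partner of each strand."""
    top, bottom = duplex
    daughter_1 = (top, complement(top))        # old top strand + new bottom strand
    daughter_2 = (complement(bottom), bottom)  # new top strand + old bottom strand
    return daughter_1, daughter_2


original = ("ACTG", "TGAC")
first, second = replicate(original)
print(first == original and second == original)  # True
```

Each daughter keeps one old strand and receives one newly built strand, yet both are identical to the original duplex.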

Transcription

Although both strands contain the same information, only one is used in protein synthesis *). The information is read chemically by synthesizing mRNA on the complementary (template) DNA strand:

 

Transcription

mRNA stands for messenger RNA. It is complementary to DNA in the following sense:

RNA

(5')..A..C..U..G..(3')
      |  |  |  |
(3')..T..G..A..C..(5')

DNA 

After transcription, the mRNA is identical to the corresponding part of the informational strand of DNA, except that T is replaced with U.
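Both statements about mRNA can be checked on the four-letter example. The sketch below is a string-level illustration only; the function name transcribe and the convention of passing the template strand in 3' to 5' order are my assumptions.

```python
# RNA base placed opposite each DNA base of the template strand.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(template_3_to_5: str) -> str:
    """mRNA (5' to 3') built on a template strand given in 3' to 5' order."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5.upper())


informational = "ACTG"  # (5')..A..C..T..G..(3')
template = "TGAC"       # (3')..T..G..A..C..(5')

mrna = transcribe(template)
print(mrna)                                     # ACUG
print(mrna == informational.replace("T", "U"))  # True
```

In practice this means that mRNA can be obtained from the informational strand by a plain T to U substitution, without looking at the template strand at all.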

 

Genetic Code

The synthesis of mRNA is only the beginning of a complex multi-step process of protein synthesis (translation). Proteins are built of amino acids, and each amino acid is encoded by a 3-letter word (codon) in the {A, C, G, U} alphabet. However, if we take two steps back and return to the informational strand, the genetic code can be expressed in the {A, C, G, T} alphabet, which is usually the case (see e.g. the references below). Codons are case-insensitive :-) .
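As a sketch, the standard genetic code fits into a 64-entry dictionary. The compact layout below (bases in the order T, C, A, G and one amino-acid letter per codon, with "*" for stop) is a common way of writing the standard table; the helper name amino_acid is hypothetical, and other genetic codes (e.g. mitochondrial) differ in a few entries.

```python
from itertools import product

BASES = "TCAG"
# One-letter amino acid codes for TTT, TTC, TTA, TTG, TCT, ... GGG; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def amino_acid(codon: str) -> str:
    """Look up a codon given in either alphabet (T or U) and in any letter case."""
    return CODON_TABLE[codon.upper().replace("U", "T")]


print(amino_acid("ATG"))  # M
print(amino_acid("aug"))  # M
print(amino_acid("TAA"))  # *
```

Converting U to T and upper-casing before the lookup lets one table serve both the DNA and the RNA spelling of a codon.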

 

Translation and Open Reading Frame

Experimental nucleotide sequences are not perfect. As a rule, they represent only DNA fragments, and neither the positions where the words (codons) begin nor even the reading direction is usually known. One must also decide at which nucleotide translation starts and where it stops; the stretch between these two points is called an open reading frame (ORF).

Because a codon is a three-letter word, every region of DNA has six possible reading frames, three in each direction. The reading frame that is used determines which amino acids will be encoded by a gene. An open reading frame starts with an ATG (Met) in most species and ends with a stop codon (TAA, TAG or TGA).
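The six frames can be listed mechanically: three offsets on the given strand and three on its reverse complement. The sketch below is one possible way to do it; the labels "+1" to "-3" and the function name are my own conventions.

```python
# Translation table implementing the A-T and C-G pairing.
PAIR = str.maketrans("ACGT", "TGCA")


def reading_frames(dna: str) -> dict:
    """Split a DNA string into codons for all six reading frames."""
    forward = dna.upper()
    backward = forward.translate(PAIR)[::-1]  # reverse complement, read 5' to 3'
    frames = {}
    for label, strand in (("+", forward), ("-", backward)):
        for offset in range(3):
            # Stop early enough that every codon has exactly three letters.
            end = len(strand) - (len(strand) - offset) % 3
            frames[f"{label}{offset + 1}"] = [
                strand[i:i + 3] for i in range(offset, end, 3)
            ]
    return frames


for name, codons in reading_frames("ATGCCCAAG").items():
    print(name, " ".join(codons))
```

Incomplete codons at the end of a frame are simply dropped, which is why frames 2 and 3 are one codon shorter here.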

In the following example, the three reading frames in the forward direction are shown with the translated amino acids below each DNA sequence. Frame 1 starts with the "a", Frame 2 with the "t" and Frame 3 with the "g". Stop codons are indicated by an "*" in the protein sequence.

   5'                                                   3'
   atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa

 1 atg ccc aag ctg aat agc gta gag ggg ttt tca tca ttt gag gac gat gta taa
    M   P   K   L   N   S   V   E   G   F   S   S   F   E   D   D   V   *
 2  tgc cca agc tga ata gcg tag agg ggt ttt cat cat ttg agg acg atg tat
     C   P   S   *   I   A   *   R   G   F   H   H   L   R   T   M   Y
 3   gcc caa gct gaa tag cgt aga ggg gtt ttc atc att tga gga cga tgt ata
      A   Q   A   E   *   R   R   G   V   F   I   I   *   G   R   C   I

Three forward reading frames

The longest ORF is in Frame 1. Check this frame and the three reverse-direction frames with the SequenceTransform program.
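The forward frames of the example can also be checked with a short script. This is a sketch under stated assumptions: it uses the standard genetic code, treats an ORF as an ATG followed by the first in-frame stop codon, and looks only at the three forward frames. The names translate and longest_orf are illustrative and unrelated to SequenceTransform.

```python
import re
from itertools import product

# Standard genetic code, bases in the order T, C, A, G; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product("TCAG", repeat=3), AMINO_ACIDS)
}

SEQUENCE = "atgcccaagctgaatagcgtagaggggttttcatcatttgaggacgatgtataa"


def translate(dna: str, offset: int = 0) -> str:
    """Translate complete codons of a DNA string, starting at the given offset."""
    seq = dna.upper()
    return "".join(
        CODON_TABLE[seq[i:i + 3]] for i in range(offset, len(seq) - 2, 3)
    )


def longest_orf(protein: str) -> str:
    """Longest stretch that starts with M and runs to the next stop ('*')."""
    return max(re.findall(r"M[^*]*\*", protein), key=len, default="")


for frame in range(3):
    protein = translate(SEQUENCE, frame)
    print(frame + 1, protein, "| longest ORF:", longest_orf(protein) or "none")
```

With these definitions only Frame 1 contains a complete ORF; Frame 2 has an ATG near the end but no stop codon after it, and Frame 3 has no ATG at all.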

   

Downloads

While I am preparing the next installment, you can use my program SequenceTransform for Windows to play with sequences and genetic codes. Download the program and sample data here (zipped directory ~ 230 KB) or from my download section.

References


*) This is similar to speech: a magnetic tape (or a WAV file) contains the same information regardless of the direction in which it is played. However, we can understand it only when it is played forward.


Please e-mail me at nikolai@shokhirev.com

©Nikolai Shokhirev, 2001-2010