Scientific programming

Numerical objects

by Nikolai V. Shokhirev

Up ABC tutorials

Up: Scientific Programming
Previous: Platforms, Languages, Tools
Next: Shaman introduction to C++

Numerical methods

"Computer Methods for Mathematical Computations by George Forsythe, Michael Malcolm and Cleve Moler is one of the great classic textbooks of numerical methods for scientists and engineers" (Ralph Carmichael). I also used this book[1-3] as a reference for my numeric projects. I translated the code from this book to Pascal (a part of PasMatLib) and C++ (in preparation). The book inspired me for a tutorial on numerical methods (see also the links below).

The book covers the following subjects:

  1. Matrix triangularization by Gaussian elimination
  2. Solution of linear systems
  3. Numerical integration
  4. Integration of a system of first order differential equations
  5. Spline interpolation
  6. Singular value decomposition, SVD, of a real, rectangular matrix
  7. Function minimization (optimization)
  8. Random number generation
  9. Root (zero) search

The above subjects represent a bare minimum of programming tools. Nevertheless these methods allow solution of a variety of computational problems. However, I would extend the list with

  1. Visualization (plotting)
  2. String and text management
  3. Input/Output
  4. Fourier and signal processing
  5. Statistics
  6. Partial differential equations

Precision choice

Integer numbers are used for counting, indices, enumerations and sets. They are never used for numerical calculations (e.g. complex<int> makes little sense) . There are several integer formats. I use 32-bit integers as universal integer numbers (in particular, to avoid conversion issues). They can store ~109 values.

Floating-point numbers model continuous real numbers in computers. This is rather poor representation of real numbers. In general, all arithmetic operations produce so called round-off errors. The measure of the relative effects of rounding errors is the machine epsilon [1, 4 - 6], macheps. It is defined as the smallest value so that

1.0 + macheps > 1.0

It measures the effects of rounding errors made when adding, subtracting, multiplying, or dividing two numbers. For single precision floating-point numbers macheps ~ 10 -7,  which is not enough for extensive numeric computations. For double precision floating-point numbers macheps ~ 10 -16. With properly selected algorithms, this accuracy is sufficient even for extensive (≥106 operations) calculations. Moreover, it should be a warning signal if an algorithm requires extended (long double) precision.

Integer and float precision.
Type C++ VS 2005  g++ MinGW  C++Builder 6
 int  4  4  4
 long  4  4  4
 long long  8  8  8
 double  8  8  8
 long double  8 12 10

In my PasMatLib I use the following type definition:

    type TFloat = double;
and in CppMatLib
    typedef double real;

Linear algebra

Unlike Fortran, many languages (C++, Pascal, Java, C#) lack of  built-in vector, matrix and complex types. However they allow creation of objects with the necessary functionality. 

We can formulate some requirements for such objects:

Indices. The object should support arbitrary index limits in order to model naturally real objects. Example 1: Angular momentum projection l, -LlL . Example 2: Negative indices can label historic prices.

Compatibility with good old Fortran 1-based algorithms is an additional important reason.

Integer arrays. They should be used to store indices, for example, permutations and pivot elements indices. Arithmetic operations (e.g. multiplication) between integer arrays or arrays and scalars have limited use and are not required. The comparison operators (e.g. > or ==) on entire objects do not make much sense as well. However, fast search by value, swap, index shift can be useful. Some specific functionality can be implemented in external functions or derived objects.    

Float arrays. For float arrays such quantities as the dot product of vectors or Norm, matrix-vector or tensor products make sense. Some operations can be implemented by overloading arithmetic operators (C++. C#, Delphi.Net). This makes your code more compact and readable, but this approach should be used with caution because creation of temporary objects can be expensive (especially for big matrices) [7].

Comparison operators for real-valued arrays are useless and its overloading is an example of the speculative generality. The parameterless equality operator == should be replaced with a method implementing a distance between objects (for a specified norm) and some tolerance level.

Important note about the dot (scalar) product. It is defined as a sum of the products of vector components:

It was pointed out [4] that the accumulation of products is a possible source of round-off errors and should be performed with higher accuracy. It can be especially important in calculation of matrix rotations and reflections [4]. Usually double precision is sufficient, but one should consider using long double (extended) precision for this purpose.

Complex numbers can be considered as a special case of two-component floating-point vectors. The above considerations are also applicable to complex numbers. In particular, the use of integers, increment (++), decrement (--) and comparison operations is senseless.


Cool code - About programming style

C++  is a general-purpose programming language. C++ is regarded as a mid-level language, as it comprises a combination of both high-level and low-level language features [8]. It is overburdened with backward compatibility with a low level C language. C++ provides with ample potentiality for doing "cool" (or stupid) things. I would recommend the book "C++ Gotchas" [9 ] , in order to avoid using such tricks. For example, in Gotcha #7 there are three variants of the code:

  1.   if (a < b)
        a = d;
      else if (b < c)
        b = d;
        c = d;
  2.   (a < b) ? (a = d) : (b < c) ? (b = d) : (c = d);
  3.   (a<b?a:b<c?b:c) = d;

The variants 2 and 3 are less readable (especially the last one), but most importantly they do not bring any value. Such "cool" code is error-prone and can be implementation-dependent. For example, the variant 3 is equivalent to the variants 1 and 2 in C++Builder 6 and Visual Studio 2005. This is not the case with gcc version 3.4.5 (both on MinGW and Linux).

Modern compilers are extremely efficient: keep your code simple, readable and let the compiler do the rest.


  1. George E. Forsythe, Mike A. Malcolm, and Cleve B. Moler, Computer Methods for Mathematical Computations, Englewood Cliffs, N.J.: Prentice Hall, 1977.
  2. The original FMM at NetLib
  3. Fortran 77 version of FMM library
  4.  J. H. Wilkinson, C. Reinsch, Handbook for Automatic Computation, Volume II, Linear Algebra, Springer-Verlag, New York, Heidelberg, Berlin, 1971
  5. Machine epsilon, From Wikipedia, the free encyclopedia
  6. Floating-point arithmetics  by Nikolai Shokhirev
  7. John R. Berryhill. "C++ Scientific Programming: Computational Recipes at a Higher Level".
  8. C++ (Wikipedia article).
  9. C++ Gotchas: Avoiding Common Problems in Coding and Design by Stephen C. Dewhurst



Up ABC tutorials

Up: Scientific Programming
Previous: Platforms, Languages, Tools
Next: Shaman introduction to C++

Home | Resumé | |  Computing |  Links Publications

©Nikolai V. Shokhirev, 2004-2008