How many iterations are needed for the bisection algorithm?

<TL;DR>

  • The bisection algorithm is a very simple algorithm for finding the root of a 1-D function.
  • The number of iterations required to locate the root to within a specified tolerance can be worked out from a very simple little hack, which I explain here.
  • Things get more interesting when we consider variants of the bisection algorithm, where we cut an interval into unequal portions.

</TL;DR>

A little while ago a colleague mentioned that they were repeatedly using an off-the-shelf bisection algorithm to find the root of a function. The algorithm required the user to specify the number of iterations to run the bisection for. Since my colleague was running the algorithm repeatedly they wanted to set the number of iterations efficiently and also to achieve a guaranteed level of accuracy, but they didn’t know how to do this.

I mentioned that it was very simple to do this and it was a couple of lines of arithmetic in a little hack that I’d used many times. Then I realised that the hack was obvious and known to me because I was old – I’d been doing this sort of thing for years. My colleague hadn’t. So I thought the hack would be a good subject for a short blog post.

The idea behind a bisection algorithm is simple and illustrated in Figure 1 below.

How the bisection algorithm works
Figure 1: Schematic of how the bisection algorithm works

At each iteration we determine whether the root is to the right of the current mid-point, in the right-hand interval, or to the left of the current mid-point, in the left-hand interval. In either case, the range within which we locate the root halves. We have gone from knowing it was in the interval [x_{lower}, x_{upper}], which has width x_{upper}-x_{lower}, to knowing it is in an interval of width \frac{1}{2}(x_{upper}-x_{lower}). So with every iteration we reduce our uncertainty of where the root is located by half. After N iterations we have reduced our initial uncertainty by (1/2)^{N}. Given our initial uncertainty is determined by the initial bracketing of the root, i.e.  an interval of width (x_{upper}^{(initial)}-x_{lower}^{(initial)}), we can now work out that after N iterations we have narrowed down the root to an interval of width {\rm initial\;width} \times \left ( \frac{1}{2}\right ) ^{N}. Now if we want to locate the root to within a tolerance {\rm tol}, we just have to keep iterating until the uncertainty reaches {\rm tol}. That is, we run for N iterations where N satisfies,

\displaystyle N\;=\; -\frac{\ln({\rm initial\;width/tol})}{\ln\left (\frac{1}{2} \right )}

Strictly speaking we need to run for \lceil N \rceil iterations. Usually I will add on a few extra iterations, e.g. 3 to 5, as an engineering safety factor.
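To make the hack concrete, here is a minimal Python sketch (the function name and the default safety factor are my own choices),

import math

def bisection_iterations(initial_width, tol, safety=3):
    # number of halvings needed to shrink initial_width down to tol,
    # rounded up, plus a small engineering safety factor
    n = -math.log(initial_width / tol) / math.log(0.5)
    return math.ceil(n) + safety

# e.g. an initial bracket of width 1.0 and a tolerance of 1e-6
# gives ceil(19.93) + 3 = 23 iterations
print(bisection_iterations(1.0, 1e-6))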

As a means of quickly and easily determining the number of iterations to run a bisection algorithm, the calculation above is simple, easy to understand, and a great little hack to remember.

Is bisection optimal?

The bisection algorithm works by dividing in two the current interval in which we know the root lies. Dividing the interval in two is efficient. It is like we are playing the childhood game “Guess Who”, where we ask questions about the characters’ features in order to eliminate them.

Asking about a feature that approximately half the remaining characters possess is the most efficient – it has a reasonable probability of applying to the target character and eliminates roughly half of the remaining characters. If we have a single question with a binary outcome, and a probability p of one of those outcomes, then the choice p = \frac{1}{2} maximizes the expected information (the entropy), -p\ln (p)\;-\; (1-p)\ln(1-p).
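If you want to convince yourself of that, a couple of lines of numpy will do it,

import numpy as np

# binary entropy H(p) = -p*ln(p) - (1-p)*ln(1-p) on a grid of p values
p = np.linspace(0.01, 0.99, 99)
H = -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)
print(p[np.argmax(H)])     # 0.5
print(H.max(), np.log(2))  # maximum is ln(2), attained at p = 1/2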

Dividing the interval unequally

When we first played “Guess Who” as kids we learnt that asking questions with a much lower probability p of being correct didn’t win the game. Is the same true for our root finding algorithm? If instead we divide each interval into unequal portions is the root finding less efficient than when we bisect the interval?

Let’s repeat the derivation but with a different cut-point, e.g. 25% along the current interval bracketing the root. In general we can test whether the root is to the left or right of a point that is a proportion \phi along the current interval, meaning the cut-point is x_{lower} + \phi (x_{upper}-x_{lower}). At each iteration we don’t know which side of the cut-point the root lies on until we test for it, so in trying to determine in advance the number of iterations we need to run, we have to assume the worst-case scenario, namely that the root is always in the larger of the two sub-intervals. The worst-case reduction in uncertainty per iteration is then {\rm max}\{\phi, 1-\phi\}. Repeating the derivation we find that we have to run at least,

\displaystyle N_{Worst\;Case}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ({\rm max}\{\phi, 1 - \phi \}\right )}

iterations to be guaranteed that we have located the root to within tol.
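In code, the worst-case version of the earlier hack is only a one-line change (again just a sketch, with names of my own choosing),

import math

def worst_case_iterations(initial_width, tol, phi):
    # worst case: assume the root always lands in the larger of the two segments
    shrink = max(phi, 1.0 - phi)
    return math.ceil(-math.log(initial_width / tol) / math.log(shrink))

print(worst_case_iterations(1.0, 1e-6, 0.5))    # 20, i.e. ordinary bisection
print(worst_case_iterations(1.0, 1e-6, 0.25))   # 49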

Now to determine the cut-point \phi that minimizes the upper bound on the number of iterations required, we simply differentiate the expression above with respect to \phi. Doing so we find,

\displaystyle \frac{\partial N_{Worst\;Case}}{\partial \phi} \;=\; -\frac{\ln({\rm initial\;width/tol})}{ (1-\phi) \left ( \ln (1 - \phi) \right )^{2}} \;\;,\;\; \phi < \frac{1}{2}

and

\displaystyle \frac{\partial N_{Worst\;Case}}{\partial \phi} \;=\; \frac{\ln({\rm initial\;width/tol})}{\phi \left ( \ln (\phi) \right)^{2}} \;\;,\;\; \phi > \frac{1}{2}

The minimum of N_{Worst\;Case} is at \phi =\frac{1}{2}, although \phi=\frac{1}{2} is not a stationary point of the upper bound N_{Worst\;Case}, as N_{Worst\;Case} has a discontinuous gradient there.

That is the behaviour of the worst-case scenario. A similar analysis can be applied to the best-case scenario – we simply replace max with min in the formulae above. That is, in the best-case scenario the number of iterations required is given by,

\displaystyle N_{Best\;Case}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ({\rm min}\{\phi, 1 - \phi \}\right )}

Here, the maximum of the best-case number of iterations occurs when \phi = \frac{1}{2}.

That’s the worst-case and best-case scenarios, but how many iterations do we expect to use on average? Let’s look at the expected reduction in uncertainty in the root location after N iterations. In a single iteration a root that is randomly located within our interval will lie, with probability \phi, in the segment to the left of our cut-point, leading to a reduction in the uncertainty by a factor of \phi. Similarly, we get a reduction in uncertainty of 1-\phi with probability 1-\phi if our randomly located root is to the right of the cut-point. So after N iterations the expected reduction in uncertainty is,

\displaystyle {\rm Expected\;reduction}\;=\;\left ( \phi^{2}\;+\;(1-\phi)^{2}\right )^{N}

Using this as an approximation to determine the typical number of iterations, we get,

\displaystyle N_{Expected\;Reduction}\;=\;-\frac{\ln({\rm initial\;width/tol})}{\ln\left ( \phi^{2} + (1-\phi)^{2} \right )}
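The corresponding estimate in code (same caveats as before) is,

import math

def expected_reduction_iterations(initial_width, tol, phi):
    # approximate count based on the expected per-iteration reduction factor
    shrink = phi**2 + (1.0 - phi)**2
    return math.ceil(-math.log(initial_width / tol) / math.log(shrink))

print(expected_reduction_iterations(1.0, 1e-6, 0.5))    # 20
print(expected_reduction_iterations(1.0, 1e-6, 0.25))   # 30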

This still isn’t the expected number of iterations, but to see how it compares, Figure 2 below shows simulation estimates of \mathbb{E}\left ( N \right ) plotted against \phi when the root is randomly and uniformly distributed within the original interval.

The number of iterations needed for the bisection algorithm
Figure 2: Number of iterations required for the different root finding methods.

For Figure 2 we have set w = ({\rm tol/initial\;width}) = 0.01, i.e. we are asking to shrink the initial bracketing interval by a factor of 100. Also plotted in Figure 2 are our three theoretical estimates, \lceil N_{Worst\;Case}\rceil, \lceil N_{Best\;Case}\rceil, \lceil N_{Expected\;Reduction}\rceil. The stepped structure in these three integer quantities is clearly apparent, as is how many more iterations are required under the worst-case method when \phi \neq \frac{1}{2}.
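For anyone wanting to reproduce the simulation behind Figure 2, a minimal Monte Carlo sketch of \mathbb{E}(N) (my own quick version, with parameter names of my own choosing, rather than the exact code used for the figure) is,

import numpy as np

def simulated_mean_iterations(phi, w=0.01, n_trials=20000, seed=42):
    # root uniform on [0, 1); initial width 1, tolerance w;
    # cut a proportion phi along the current bracketing interval each iteration
    rng = np.random.default_rng(seed)
    counts = np.empty(n_trials)
    for t in range(n_trials):
        root = rng.random()
        lower, upper = 0.0, 1.0
        n = 0
        while (upper - lower) > w:
            cut = lower + phi * (upper - lower)
            if root < cut:
                upper = cut
            else:
                lower = cut
            n += 1
        counts[t] = n
    return counts.mean()

print(simulated_mean_iterations(0.5))    # 7.0 for ordinary bisection
print(simulated_mean_iterations(0.25))   # lies between the best and worst cases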

The expected number of iterations required, \mathbb{E}( N ), actually shows a rich structure that isn’t clear unless you zoom in. Some aspects of that structure were unexpected, but require some more involved mathematics to understand. I may save that for a follow-up post at a later date.

© 2022 David Hoyle. All Rights Reserved

Faa di Bruno and derivatives of an iterated function

I have recently needed to do some work evaluating high-order derivatives of composite functions. Namely, given a function f(t), evaluate the n^{th} derivative of the composite function  \underbrace{\left (f\circ f\circ f \circ \ldots\circ f \right )}_{l\text{ terms}}(t). That is, we define f_{l}(t) to be the function obtained by iterating the base function f(t)=f_{1}(t) l-1 times. One approach is to make recursive use of the Faa di Bruno formula,

\displaystyle \frac{d^{n}}{dx^{n}}f(g(x))\;=\;\sum_{k=1}^{n}f^{(k)}(g(x))B_{n,k}\left (g'(x), g''(x), \ldots, g^{(n-k+1)}(x) \right )      Eq.(1)

The fact that the exponential partial Bell polynomials B_{n,k}\left (x_{1}, x_{2},\ldots, x_{n-k+1} \right ) are available within the {\tt sympy} symbolic algebra Python package makes this initially an attractive route to evaluating the required derivatives. In particular, I am interested in evaluating the derivatives at t=0 and I am focusing on odd functions of t, for which t=0 is obviously a fixed point. This means I only have to supply numerical values for the derivatives of my base function f(t) evaluated at t=0, rather than supplying a function that evaluates derivatives of f(t) at any point t.
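For example, the partial Bell polynomial B_{4,2} can be obtained directly (a minimal usage sketch),

import sympy

x1, x2, x3 = sympy.symbols('x1 x2 x3')
# B_{4,2}(x1, x2, x3) = 3*x2**2 + 4*x1*x3
print(sympy.bell(4, 2, (x1, x2, x3)))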

Given the Taylor expansion of f(t) about t=0 we can easily write code to implement the Faa di Bruno formula using sympy. A simple bit of pseudo-code to represent an implementation might look like,

  1. Generate symbols.
  2. Generate and store partial Bell polynomials up to known required order using the symbols from step 1.
  3. Initialize coefficients of Taylor expansion of the base function.
  4. Substitute numerical values of derivatives from previous iteration into symbolic representation of polynomial.
  5. Sum required terms to get numerical values of all derivatives of current iteration.
  6. Repeat steps 4 & 5.

I show Python code snippets below implementing the idea. First we generate and cache the Bell polynomials,

import sympy

# symbols x1, x2, ... standing in for g', g'', ... (nMax = highest order needed)
symbols_tmp = sympy.symbols('x1:' + str(nMax + 1))

# generate and cache the partial Bell polynomials
bellPolynomials = {}
for n in range(1, nMax+1):
    for k in range(1, n+1):
        bp_tmp = sympy.bell(n, k, symbols_tmp)
        bellPolynomials[str(n) + '_' + str(k)] = bp_tmp

Then we iterate over the levels of function composition, substituting the numerical values of the derivatives of the previous function iterate into the Bell polynomials and combining them with the derivatives of the base function,

for iteration in range(nIterations):
    if( verbose ):
        print( "Evaluating derivatives for function iteration " + str(iteration+1) )

    for n in range(1, nMax+1):
        sum_tmp = 0.0
        for k in range(1, n+1):
            # kth derivative of the base function, evaluated at the fixed point
            f_k_tmp = derivatives_atFixedPoint_tmp[0, k-1]

            # substitute the derivatives of the previous function iterate
            # into the partial Bell polynomial B_{n,k}
            bellPolynomials_key = str( n ) + '_' + str( k )
            bp_tmp = bellPolynomials[bellPolynomials_key]
            replacements = [ ( symbols_tmp[i],
                               derivatives_atFixedPoint_tmp[iteration, i] )
                             for i in range(n-k+1) ]
            sum_tmp = sum_tmp + ( f_k_tmp * bp_tmp.subs(replacements) )

        derivatives_atFixedPoint_tmp[iteration+1, n-1] = sum_tmp

Okay – this isn’t really making use of true recursion, merely looping, but the principle is the same. The problem one encounters is that manipulation of the symbolic representation of the polynomials is slow, and run-times increase significantly for n > 15.

However, the n^{th} derivative can alternatively be expressed as a sum over partitions of n as,

\displaystyle \frac{d^{n}}{dx^{n}}f(g(x))\;=\;\sum \frac{n!}{m_{1}!m_{2}!\ldots m_{n}!} f^{(m_{1}+m_{2}+\ldots+m_{n})}\left ( g(x)\right )\prod_{j=1}^{n}\left ( \frac{g^{(j)}(x)}{j!}\right )^{m_{j}}   Eq.(2)

where the sum is taken over all tuples of non-negative integers m_{1}, m_{2},\ldots, m_{n} that satisfy 1\cdot m_{1}+ 2\cdot m_{2}+\ldots+ n\cdot m_{n}\;=\; n. That is, the sum is taken over all partitions of n. Fairly obviously, the Faa di Bruno formula is just a re-arrangement of the above equation, made by collecting together terms involving f^{(k)}(g(x)), and that rearrangement gives the fundamental definition of the partial Bell polynomial.

I’d shied away from the more fundamental form of Eq.(2) in favour of Eq.(1) as I believed the fact that a number of terms had already been collected together in the form of the Bell polynomial would make any implementation that used them quicker. However, the advantage of the form in Eq.(2) is that the summation can be entirely numeric, provided an efficient generator of partitions of n is available to us. Fortunately, sympy also contains a method for iterating over partitions. Below are code snippets that implement the evaluation of f_{l}^{(n)}(0) using Eq.(2). First we generate and store the partitions,

from sympy.utilities.iterables import partitions

# store the partitions of 1, ..., n (n = highest derivative order required)
pStore = {}
for k in range( n ):
    # copy each partition because sympy's iterator reuses the same dict
    pIterator = partitions(k+1)
    pStore[k] = [p.copy() for p in pIterator]
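It is worth noting that sympy returns each partition as a dictionary mapping a part to its multiplicity (exactly the m_{j} appearing in Eq.(2)), which is what the p.keys() and p[i] look-ups below rely on. For example,

from sympy.utilities.iterables import partitions

# each partition of 4 as {part: multiplicity},
# i.e. the five partitions 4, 3+1, 2+2, 2+1+1, 1+1+1+1
for p in partitions(4):
    print(p)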

After initializing arrays to hold the derivatives of the current function iteration, we then loop over each iteration, retrieving each partition and evaluating the product in the summand of Eq.(2). To avoid overflow it is easiest to work on the log scale, tracking signs separately, as shown in the code snippet below,

import numpy as np
from scipy.special import gammaln

# loop over function iterations
for iteration in range( nIterations ):

    if( verbose ):
        print( "Evaluating derivatives for function iteration " + str(iteration+1)  )

    for k in range( n ):
        faaSumLog = float( '-Inf' )
        faaSumSign = 1

        # get partitions
        partitionsK = pStore[k]
        for pidx in range( len(partitionsK) ):
            p = partitionsK[pidx]
            sumTmp = 0.0
            sumMultiplicity = 0
            parityTmp = 1
            for i in p.keys():
                value = float(i)
                multiplicity = float( p[i] )
                sumMultiplicity += p[i]
                sumTmp += multiplicity * currentDerivativesLog[i-1]
                sumTmp -= gammaln( multiplicity + 1.0 )
                sumTmp -= multiplicity * gammaln( value + 1.0 )
                parityTmp *= np.power( currentDerivativesSign[i-1], multiplicity )

            sumTmp += baseDerivativesLog[sumMultiplicity-1]
            parityTmp *= baseDerivativesSign[sumMultiplicity-1]

            # now update faaSum on the log scale (a signed log-sum-exp update)
            if( sumTmp > float( '-Inf' ) ):
                if( faaSumLog > float( '-Inf' ) ):
                    diffLog = sumTmp - faaSumLog
                    if( np.abs(diffLog) < thresholdForExp ):
                        if( diffLog >= 0.0 ):
                            faaSumLog = sumTmp
                            faaSumLog += np.log( 1.0 + (float(parityTmp*faaSumSign) * np.exp( -diffLog )) )
                            faaSumSign = parityTmp
                        else:
                            faaSumLog += np.log( 1.0 + (float(parityTmp*faaSumSign) * np.exp( diffLog )) )
                    else:
                        # the new term dwarfs the running sum; otherwise the sum is unchanged
                        if( diffLog > thresholdForExp ):
                            faaSumLog = sumTmp
                            faaSumSign = parityTmp
                else:
                    faaSumLog = sumTmp
                    faaSumSign = parityTmp

        nextDerivativesLog[k] = faaSumLog + gammaln( float(k+2) )
        nextDerivativesSign[k] = faaSumSign

Now let’s run both implementations, evaluating up to the 15th derivative for 4 function iterations. Here my base function is f(t) =1 -\frac{2}{\pi}\arccos t. A plot of the base function is shown below in Figure 1.

Figure 1: Plot of the base function f(t)\;=\;1\;-\;\frac{2}{\pi}\arccos(t)

The base function has a relatively straightforward Taylor expansion about t=0,

\displaystyle f(t)\;=\;\frac{2}{\pi}\sum_{k=0}^{\infty}\frac{\binom{2k}{k}t^{2k+1}}{4^{k}\left ( 2k+1 \right )}\;\;\;,\;\;\;|t| \leq 1 \;\;,    Eq.(3)

and so supplying the derivatives, f^{(k)}(0), of the base function is easy. The screenshot below shows a comparison of f_{l}^{(15)}(0) for l\in \{2, 3, 4, 5\}. As you can see we obtain identical output whether we use sympy’s Bell polynomials or sympy’s partition iterator.

FaaDiBruno_TimingComparison (screenshot comparing the output and run-times of the two implementations)
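For completeness, the derivative values f^{(k)}(0) that both implementations take as input can be read straight off Eq.(3), since f^{(n)}(0)\;=\;n!\,a_{n} where a_{n} is the coefficient of t^{n} in the expansion. A minimal sketch (the helper name is my own choice),

import math

def base_derivatives_at_zero(n_max):
    # f(t) = 1 - (2/pi)*arccos(t): even-order derivatives vanish at t = 0,
    # and f^(2k+1)(0) = (2k+1)! * (2/pi) * C(2k, k) / (4**k * (2k+1))
    derivs = []
    for n in range(1, n_max + 1):
        if n % 2 == 0:
            derivs.append(0.0)
        else:
            k = (n - 1) // 2
            a_n = (2.0 / math.pi) * math.comb(2 * k, k) / (4**k * (2 * k + 1))
            derivs.append(math.factorial(n) * a_n)
    return derivs

print(base_derivatives_at_zero(5))   # [2/pi, 0, 2/pi, 0, 18/pi] numerically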

The comparison of the implementations is not really a fair one. One implementation is generating a lot of symbolic representations that aren’t really needed, whilst the other is keeping to entirely numeric operations. However, it did highlight several points to me,

  • Directly working with partitions, even up to moderate values of n, e.g. n=50, can be tractable using the sympy package in Python.
  • Sometimes implementing the more concisely expressed representation (in this case, in terms of Bell polynomials) can lead to significantly longer run-times, even if it can be written in fewer lines of code.
  • The history of the Faa di Bruno formula, and the various associated polynomials and equivalent formalisms (such as the Jabotinsky matrix formalism), is a fascinating one.

I’ve put the code for both methods of evaluating the derivatives of an iterated function as a gist on GitHub.

At the moment the functions take an array of Taylor expansion coefficients, i.e. they assume the point at which derivatives are requested is a fixed point of the base function. At some point I will add methods that take a user-supplied function for evaluating the k^{th} derivative, f^{(k)}(t), of the base function at any point t and will return the derivatives, f_{l}^{(k)}(t) of the iterated function.

I haven’t yet explored whether, for reasonable values of n (say n \leq 50), I need to work on the log scale, or whether direct evaluation of the summand will be sufficiently accurate and not result in overflow error.