# 4.8 Statistics

The efficient generation and accurate analysis of data is central to all Monte Carlo simulations. In this section the statistical techniques used to compute the expected values of observables and their error bars are described. The techniques are similar to those used in other Monte Carlo and molecular dynamics methods, and the observables, though normally energies, can in principle be any quantity.

When analysing the data, two factors that must be taken into account are the equilibration (or relaxation) time of the observable, and the likely presence of serial correlations in the data. Compensating for correlations is essential in order to obtain an accurate error bar.

For the purposes of this section, a set of $N$ sequentially observed data values, $\{O_1, O_2, \ldots, O_N\}$, will be considered. The values $O_i$ correspond to simulation times $t_i = i\tau$, where $\tau$ is the Monte Carlo timestep between samples.

The mean or average value of the observable is given straightforwardly,

$$\bar{O} = \frac{1}{N} \sum_{i=1}^{N} O_i \;, \qquad (4.49)$$

however, evaluation of the error bar (the standard deviation of the mean) is more involved due to the likelihood of serial correlations. The correlation time, $\tau_{\mathrm{corr}}$, is a measure of how long it takes the system to evolve between states that are independent. The correlation time is conventionally defined via the autocorrelation function of the data, $A(t)$,

$$A(t) = \frac{\langle O_{t_0} O_{t_0+t} \rangle - \langle O \rangle^2}{\langle O^2 \rangle - \langle O \rangle^2} \;. \qquad (4.50)$$

The autocorrelation function is defined such that $A(t) = 0$ for completely uncorrelated data and $A(t) = 1$ for completely correlated data. At long times, Markovian processes are expected to lead to an exponential decay in $A(t)$,

$$A(t) \sim e^{-t/\tau_{\mathrm{corr}}} \;. \qquad (4.51)$$

The autocorrelation function, equation 4.50, is estimated by

$$A(t_j) = \frac{\frac{1}{N-j} \sum_{i=1}^{N-j} O_i O_{i+j} - \bar{O}^2}{\frac{1}{N} \sum_{i=1}^{N} O_i^2 - \bar{O}^2} \;. \qquad (4.52)$$

Assuming the central limit theorem holds and following the conventional statistical definitions, the best estimate of the standard deviation of the mean is given by

$$\sigma = \sqrt{\frac{1 + 2\tau_{\mathrm{corr}}/\tau}{N-1} \left( \overline{O^2} - \bar{O}^2 \right)} \;. \qquad (4.53)$$
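
The estimators of equations 4.52 and 4.53 can be sketched in a few lines of Python. This is a minimal illustration, not the thesis's own code; in particular, truncating the autocorrelation sum at its first negative value is one common convention for suppressing noise, chosen here for simplicity.

```python
import numpy as np

def autocorrelation(data, max_lag):
    """Estimate the normalised autocorrelation A(t_j) of a 1-D series,
    in the spirit of equation 4.52."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    mean = data.mean()
    var = data.var()  # <O^2> - <O>^2
    acf = np.empty(max_lag + 1)
    for j in range(max_lag + 1):
        # <O_i O_{i+j}> - <O>^2, averaged over the N - j available pairs
        acf[j] = np.mean(data[: n - j] * data[j:]) - mean**2
    return acf / var

def correlated_error(data, max_lag=100):
    """Error bar on the mean, inflated by the integrated correlation
    time (cf. equation 4.53)."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    acf = autocorrelation(data, max_lag)
    # Truncate the sum where the estimated ACF first goes negative,
    # to avoid accumulating statistical noise (illustrative choice).
    first_neg = int(np.argmax(acf < 0)) if np.any(acf < 0) else len(acf)
    tau_int = 0.5 + acf[1:first_neg].sum()
    return np.sqrt(2.0 * tau_int * data.var() / n)
```

For uncorrelated data this reduces to the familiar $\sqrt{\mathrm{var}/N}$; for serially correlated data the error bar is inflated by the factor $\sqrt{2\tau_{\mathrm{int}}}$.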

A simpler approach based on a reblocking of the data can be used in practice. In the reblocking scheme the data are grouped into blocks over a series of block lengths, and the standard deviation is obtained by computing the variance of the block averages. For each block length, an estimate of the standard deviation of the mean can be shown to be [16,75]

$$\sigma_B = \sqrt{\frac{1}{N_B (N_B - 1)} \sum_{b=1}^{N_B} \left( \bar{O}_b - \bar{O} \right)^2} \;, \qquad (4.54)$$

where $\bar{O}_b$ is the average over block $b$, $\bar{O}$ is the average over all blocks, and $N_B$ is the number of blocks. The standard deviation of the mean is proportional to the square root of the mean squared deviation in block averages, assuming a sufficient number of blocks and samples are included in the analysis. The estimated standard deviation increases with block length until a limiting value is reached, and this value corresponds to the true standard deviation of the mean. Where there is an insufficient volume of data, a limiting value is not reached. Other methods for computing the standard deviation of the mean, such as the jackknife and bootstrap procedures, give similar results. An example of the reblocking analysis is shown in figure 4.4.

## 4.8.1 VMC applications

In VMC calculations an accurate estimate of the correlation length can be used to improve the efficiency of the simulation. In VMC calculations an energy measurement is costly compared with a Metropolis move. It is therefore efficient to measure energies only at timesteps separated by approximately the correlation time. The correlation time is system and wavefunction dependent but may be accurately determined using only short Monte Carlo runs. This criterion is also applicable when configurations are generated for wavefunction optimisation (see section 4.7).

The equilibration time of an observable in a VMC calculation is best determined by graphing the observable as a function of time. Equivalently, trends in block averages may be used, where the data is divided into short blocks. After equilibrating, the observable fluctuates around its final mean value. A further check, used in all calculations in this thesis, is to divide the complete set of data into halves or thirds. For a sufficiently long simulation, the averages of each block should agree to within error bars.
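
The reblocking estimate of equation 4.54 and the halves consistency check described above can be sketched as follows. This is an illustrative sketch, not the thesis's code; the block-length choice inside `halves_agree` is an arbitrary assumption.

```python
import numpy as np

def reblock_error(data, block_size):
    """Standard error of the mean from averages over blocks of length
    block_size (equation 4.54); discards any trailing partial block."""
    data = np.asarray(data, dtype=float)
    n_blocks = len(data) // block_size
    blocks = data[: n_blocks * block_size].reshape(n_blocks, block_size)
    block_means = blocks.mean(axis=1)
    # Variance of the block means, divided by the number of blocks.
    return np.sqrt(block_means.var(ddof=1) / n_blocks)

def halves_agree(data, n_sigma=2.0):
    """Consistency check: do the averages of the two halves of the run
    agree to within combined error bars?"""
    half = len(data) // 2
    a, b = data[:half], data[half : 2 * half]
    block = max(1, half // 50)  # illustrative block-length choice
    err = np.hypot(reblock_error(a, block), reblock_error(b, block))
    return abs(np.mean(a) - np.mean(b)) <= n_sigma * err
```

In a full reblocking analysis one would scan `block_size` over successive powers of two and look for the plateau in `reblock_error`, as in figure 4.4.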

## 4.8.2 DMC applications

In DMC calculations an accurate estimate of the correlation length is essential, as the calculations are typically one order of magnitude more expensive than VMC, partly due to the increased correlation length observed in DMC calculations. The increased correlation length results from the shorter timestep used in DMC calculations, which is necessary to limit the errors in the short-time approximation to the Green's function.

The ensemble average energy is computable at every timestep in DMC, as the individual walker energies are required to compute the branching factors. Equilibration is best judged by examining a graph of the ensemble average energy and DMC walker population against time. For observables other than the energy, it is not necessary to measure the observable at every step and the techniques used in VMC are applicable. Reblocking of the DMC data after equilibration gives accurate error bars.
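
A simple numerical diagnostic for long-term trends in an equilibrated DMC energy trace is to track the running (cumulative) average and compare it with the overall average. The following sketch is illustrative only; the tail fraction and tolerance are assumptions, not values from the text.

```python
import numpy as np

def running_average(data):
    """Cumulative mean of the first i samples, as a function of i."""
    data = np.asarray(data, dtype=float)
    return np.cumsum(data) / np.arange(1, len(data) + 1)

def has_long_term_trend(data, tolerance):
    """Flag a possible drift: does the running average in the final
    quarter of the run stray from the overall mean by more than
    `tolerance` (e.g. a couple of reblocked error bars)?"""
    run = running_average(data)
    tail = run[3 * len(run) // 4 :]
    return np.max(np.abs(tail - np.mean(data))) > tolerance
```

A persistent offset between the running average and the overall mean in the tail of the run is the signature of the long-term drift discussed below.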

In addition to the checks described for VMC calculations, comparing running averages with the overall average is a further useful method to check that there are no long-term trends in the data. A long-term downward trend in energy can result from a poor guiding wavefunction, which effectively increases the equilibration and correlation times for the system. The DMC walker population provides a further guide: the population is stable when the reference energy is equal to the fixed-node ground state energy. Figure 4.5 shows an example DMC calculation.

## 4.8.3 Systematic errors

In the previous two sections, statistical techniques for the analysis of QMC data were presented. These techniques are unable to detect any systematic errors which could be present in the simulation.

Potential systematic errors include:

• Insufficient exploration of configuration space results if a simulation is run for too short a period of time, or if the chosen timestep severely limits the number of accepted Monte Carlo moves. (The typical ratio of accepted to proposed moves differs substantially between VMC and DMC calculations.)
• Inclusion of unequilibrated data in final averaged data gives a systematic bias to the averages obtained.
• Updating the trial energy or renormalising the number of walkers in the ensemble too frequently results in population control errors, and the true ground state distribution and energy are not obtained.
• In DMC calculations, a systematic timestep error will always be present due to the approximation used for the Green's function. For every system, DMC calculations must be repeated with progressively smaller timesteps until results agree to within error bars at the final level of statistical accuracy desired.
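
The stopping criterion in the last point, agreement of successive timesteps to within error bars, can be sketched as a small helper. The function names and the two-sigma default are illustrative assumptions.

```python
import math

def agree_within_error_bars(e1, sigma1, e2, sigma2, n_sigma=2.0):
    """Do two energies, each with its own statistical error bar,
    agree to within n_sigma combined standard errors?"""
    return abs(e1 - e2) <= n_sigma * math.hypot(sigma1, sigma2)

def timestep_converged(results, n_sigma=2.0):
    """Given (timestep, energy, error) tuples ordered from largest to
    smallest timestep, report whether the two smallest timesteps
    already agree -- the stopping criterion described above."""
    if len(results) < 2:
        return False
    (_, e1, s1), (_, e2, s2) = results[-2], results[-1]
    return agree_within_error_bars(e1, s1, e2, s2, n_sigma)
```

If the check fails, the calculation is repeated with a still smaller timestep and appended to `results`.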

Systematic errors can be avoided by careful benchmarking of new systems. When doing so it is important to consider possible changes in wavefunction quality as system size and geometry are changed. The cancellation of remaining timestep and other systematic errors in energy differences may be encouraged by choosing a fixed timestep and block size (period between population renormalisations) for all calculations. However, the calculations should always be set up such that these errors are not significant even if no cancellation takes place.