{ "cells": [ { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\nModel Selection using lmfit and emcee\n=====================================\n\nFIXME: this is a useful examples; however, it doesn't run correctly anymore as\nthe PTSampler was removed in emcee v3...\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`lmfit.emcee` can be used to obtain the posterior probability distribution\nof parameters, given a set of experimental data. This notebook shows how it\ncan be used for Bayesian model selection.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import matplotlib.pyplot as plt\nimport numpy as np\n\nimport lmfit" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define a Gaussian lineshape and generate some data:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def gauss(x, a_max, loc, sd):\n return a_max * np.exp(-((x - loc) / sd)**2)\n\n\nx = np.linspace(3, 7, 250)\nnp.random.seed(0)\ny = 4 + 10 * x + gauss(x, 200, 5, 0.5) + gauss(x, 60, 5.8, 0.2)\ndy = np.sqrt(y)\ny += dy * np.random.randn(y.size)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plot the data:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "plt.errorbar(x, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Define the normalised residual for the data:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def residual(p, just_generative=False):\n v = p.valuesdict()\n generative = v['a'] + v['b'] * x\n M = 0\n while 'a_max%d' % M in v:\n generative += gauss(x, v['a_max%d' % M], v['loc%d' % M], v['sd%d' % M])\n M += 1\n\n if just_generative:\n return generative\n return (generative - y) / dy" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a Parameter set for the initial guesses:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def initial_peak_params(M):\n p = lmfit.Parameters()\n\n # a and b give a linear background\n a = np.mean(y)\n b = 1\n\n # a_max, loc and sd are the amplitude, location and SD of each Gaussian\n # component\n a_max = np.max(y)\n loc = np.mean(x)\n sd = (np.max(x) - np.min(x)) * 0.5\n\n p.add_many(('a', a, True, 0, 10), ('b', b, True, 1, 15))\n\n for i in range(M):\n p.add_many(('a_max%d' % i, 0.5 * a_max, True, 10, a_max),\n ('loc%d' % i, loc, True, np.min(x), np.max(x)),\n ('sd%d' % i, sd, True, 0.1, np.max(x) - np.min(x)))\n return p" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Solving with `minimize` gives the Maximum Likelihood solution.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "p1 = initial_peak_params(1)\nmi1 = lmfit.minimize(residual, p1, method='differential_evolution')\n\nlmfit.printfuncs.report_fit(mi1.params, min_correl=0.5)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From inspection of the data above we can tell that there is going to be more\nthan 1 Gaussian component, but how many are there? A Bayesian approach can\nbe used for this model selection problem. 
{ "cell_type": "markdown", "metadata": {}, "source": [ "From inspection of the data above we can tell that there is going to be more\nthan 1 Gaussian component, but how many are there? A Bayesian approach can\nbe used for this model selection problem. We can do this with `lmfit.emcee`,\nwhich uses the `emcee` package to do a Markov Chain Monte Carlo sampling of\nthe posterior probability distribution. `lmfit.emcee` requires a function\nthat returns the log-posterior probability. The log-posterior probability is\nthe sum of the log-prior and log-likelihood probabilities.\n\nThe log-prior probability encodes information about what you already believe\nabout the system. `lmfit.emcee` assumes that this log-prior probability is\nzero if all the parameters are within their bounds and `-np.inf` if any of\nthe parameters are outside their bounds. As such, it is a uniform prior.\n\nThe log-likelihood function used for the sampling is given below. To use\nnon-uniform priors, you should include those terms in `lnprob`.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def lnprob(p):\n    # Gaussian log-likelihood of the data, given the generative model\n    generative = residual(p, just_generative=True)\n    return -0.5 * np.sum(((generative - y) / dy)**2 + np.log(2 * np.pi * dy**2))" ] },
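{ "cell_type": "markdown", "metadata": {}, "source": [ "If you did want a non-uniform prior, one way is to add the corresponding\nlog-prior term to the objective yourself. The next cell is only a sketch and\nis not used in the rest of this example; it assumes, purely for illustration,\na Gaussian prior on the background slope `b` with a made-up mean and width:\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Sketch only (not used below): a log-posterior with a Gaussian prior on 'b'.\n# The prior mean and width are made-up values chosen for illustration.\nb_mu, b_sigma = 10.0, 5.0\n\n\ndef lnprob_with_prior(p):\n    v = p.valuesdict()\n    # Gaussian log-prior on 'b' (up to an additive constant)\n    lnprior = -0.5 * ((v['b'] - b_mu) / b_sigma)**2\n    # add the log-likelihood defined above\n    return lnprior + lnprob(p)" ] },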
{ "cell_type": "markdown", "metadata": {}, "source": [ "To start with we have to create the minimizers and *burn* them in. We create\n4 different minimizers representing 0, 1, 2 or 3 Gaussian contributions. To\ndo the model selection we have to integrate over the log-posterior\ndistribution to see which model has the highest probability. This is done\nusing the `thermodynamic_integration_log_evidence` method of the `sampler`\nattribute contained in the `lmfit.Minimizer` object.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Work out the log-evidence for different numbers of peaks:\ntotal_steps = 310\nburn = 300\nthin = 10\nntemps = 15\nworkers = 1  # the multiprocessing does not work with sphinx-gallery\nlog_evidence = []\nres = []\n\n# set up the Minimizers\nfor i in range(4):\n    p0 = initial_peak_params(i)\n    # you can't use lnprob as a userfcn with minimize because it needs to be\n    # maximised\n    mini = lmfit.Minimizer(residual, p0)\n    out = mini.minimize(method='differential_evolution')\n    res.append(out)\n\nmini = []\n# burn in the samplers\nfor i in range(4):\n    # do the sampling\n    mini.append(lmfit.Minimizer(lnprob, res[i].params))\n    out = mini[i].emcee(steps=total_steps, ntemps=ntemps, workers=workers,\n                        reuse_sampler=False, float_behavior='posterior',\n                        progress=False)\n    # get the evidence\n    print(i, total_steps, mini[i].sampler.thermodynamic_integration_log_evidence())\n    log_evidence.append(mini[i].sampler.thermodynamic_integration_log_evidence()[0])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Once we've burned in the samplers, we have to do a collection run. We thin\nout the MCMC chain to reduce autocorrelation between successive samples.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "for j in range(6):\n    total_steps += 100\n    for i in range(4):\n        # do the sampling\n        res = mini[i].emcee(burn=burn, steps=100, thin=thin, ntemps=ntemps,\n                            workers=workers, reuse_sampler=True, progress=False)\n        # get the evidence\n        print(i, total_steps, mini[i].sampler.thermodynamic_integration_log_evidence())\n        log_evidence.append(mini[i].sampler.thermodynamic_integration_log_evidence()[0])\n\n\nplt.plot(log_evidence[-4:])\nplt.ylabel('Log-evidence')\nplt.xlabel('number of peaks')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The Bayes factor is related to the exponential of the difference between the\nlog-evidence values. Thus, 0 peaks is not very likely compared to 1 peak,\n1 peak is not as good as 2 peaks, and 3 peaks is not much better than 2\npeaks.\n\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": false }, "outputs": [], "source": [ "r01 = np.exp(log_evidence[-4] - log_evidence[-3])\nr12 = np.exp(log_evidence[-3] - log_evidence[-2])\nr23 = np.exp(log_evidence[-2] - log_evidence[-1])\n\nprint(r01, r12, r23)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These numbers tell us that zero peaks is 0 times as likely as one peak. Two\npeaks is 7e49 times more likely than one peak. Three peaks is 1.1 times more\nlikely than two peaks. With this data one would say that two peaks is\nsufficient. Caution has to be taken with these values. The log-priors for\nthis sampling are uniform but improper, i.e. they are not normalised properly.\nInternally the lnprior probability is calculated as 0 if all parameters are\nwithin their bounds and `-np.inf` if any parameter is outside the bounds.\nThe `lnprob` function defined above is the log-likelihood alone. Remember\nthat the log-posterior probability is equal to the sum of the log-prior and\nlog-likelihood probabilities. Extra terms can be added to the `lnprob`\nfunction to calculate the normalised log-probability. For a uniform prior\nthese terms would look something like:\n\n\\begin{align}\\log (\\prod_i \\frac{1}{max_i - min_i})\\end{align}\n\nwhere $max_i$ and $min_i$ are the upper and lower bounds for the\nparameter. Other types of prior are possible; for example, you might expect\nthe prior to be Gaussian.\n\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "nbformat": 4, "nbformat_minor": 0 }