The math that explains why bell curves are everywhere

(quantamagazine.org)

80 points | by ibobev 2 days ago

15 comments

  • mikrl 4 hours ago
    Great article. Personally I have been learning more about the mathematics of beyond-CLT scenarios (fat tails, infinite variance etc)

    The great philosophical question is why the CLT applies so universally. The article explains it well as a consequence of the averaging process.

    Alternatively, I’ve read that natural processes tend to exhibit Gaussian behaviour because there is a tendency towards equilibrium: forces, homeostasis, central potentials and so on and this equilibrium drives the measurable into the central region.

    For processes such as prices in financial markets, with complicated feedback loops and reflexivity (in the Soros sense), the probability mass tends to end up in the non-central region, where the CLT does not apply.

    • parpfish 4 hours ago
      As to ye philosophy of “why” the CLT gives you normals, my hunch is that it’s because there’s some connection between:

      a) the CLT requires samples drawn from a distribution with finite mean and variance

      and b) the Gaussian is the maximum entropy distribution for a particular mean and variance

      I’d be curious about what happens if you start making assumptions about higher order moments in the distro

      • ramblingrain 2 hours ago
        It is the not knowing, the unknown unknowns and known unknowns which result in the max entropy distribution's appearance. When we know more, it is not Gaussian. That is known.
        • mitthrowaway2 1 hour ago
          Exactly this. From this perspective, the CLT then can be restated as: "it's interesting that when you add up a sufficiently large number of independent random variables, then even if you have a lot of specific detailed knowledge about each of those variables, in the end all you know about their sum is its mean and variance. But at least you do reliably know that much."
          • D-Machine 48 minutes ago
            Came here basically looking to see this explanation. Normal dist is [approximately] common when summing lots of things we don't understand, otherwise, it isn't really.
      • orangemaen 2 hours ago
        The standard framing defines the Gaussian as this special object with a nice PDF, then presents the CLT as a surprising property it happens to have. But convolution of densities is the fundamental operation. If you keep convolving any finite-variance distribution with itself, the shape converges, and we called the limit "normal." The Gaussian is a fixed point of iterated convolution under √n rescaling. It earned its name by being the thing you inevitably get, not by having elegant closed-form properties.

        The most interesting assumptions to relax are the independence assumptions. They're way more permissive than the textbook version suggests. You need dependence to decay fast enough, and mixing conditions (α-mixing, strong mixing) give you exactly that: correlations that die off let the CLT go through essentially unchanged. Where it genuinely breaks is long-range dependence: fractionally integrated processes, Hurst parameter above 0.5, where autocorrelations decay hyperbolically instead of exponentially. There the √n normalization is wrong, you get different scaling exponents, and sometimes non-Gaussian limits.

        There are also interesting higher order terms. The √n is specifically the rate that zeroes out the higher-order cumulants. Skewness (third cumulant) decays at 1/√n, excess kurtosis at 1/n, and so on up. Edgeworth expansions formalize this as an asymptotic series in powers of 1/√n with cumulant-dependent coefficients. So the Gaussian is the leading term of that expansion, and Edgeworth tells you the rate and structure of convergence to it.
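        The fixed-point claim above is easy to check numerically. Below is a hypothetical sketch (the grid size and n = 8 are my own choices, not from the comment): the n-fold self-convolution of a Uniform(0,1) density is compared against the Gaussian with matching mean and variance.

```python
import numpy as np

# Hypothetical sketch of the fixed-point claim: the n-fold self-convolution
# of a Uniform(0,1) density (the Irwin-Hall distribution) approaches the
# Gaussian with the same mean (n/2) and variance (n/12).
dx = 0.001
f = np.ones(1000)                      # Uniform(0,1) density on a grid

density, n = f, 8
for _ in range(n - 1):
    density = np.convolve(density, f) * dx   # convolution = density of the sum

grid = np.arange(len(density)) * dx          # support of the sum: [0, n]
mean, var = n / 2, n / 12
gauss = np.exp(-(grid - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

err = np.max(np.abs(density - gauss))        # sup-norm distance to the Gaussian
print(f"max pointwise error after {n}-fold convolution: {err:.4f}")
```

        Already at n = 8 the convolved density sits within about a percent of the Gaussian everywhere; increasing n shrinks the gap further, consistent with the Edgeworth rates discussed below.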

      • derbOac 3 hours ago
        IIRC the third moment defines a maxent distribution under certain conditions, and with a fourth moment it becomes undefined? It's been a while though.

        If I'm remembering it correctly it's interesting to think about the ramifications of that for the moments.

      • sobellian 3 hours ago
        IIRC there's a video by 3b1b that talks about that, and it is important that gaussians are closed under convolution.
        • gowld 1 hour ago
          That makes it an equilibrium point in function space, but the other half is why it's a global attractor.
    • benmaraschino 4 hours ago
      You (and others) may enjoy going down the rabbit hole of universality. Terence Tao has a nice survey article on this which might be a good place to start: https://direct.mit.edu/daed/article/141/3/23/27037/E-pluribu...
  • bicepjai 14 minutes ago
    This is one of my favorite philosophical questions to ponder. I always ask it in interviews as a warmup to get people's thoughts. I’ve noticed that interviewees often curl up, thinking it’s a technical question, so I’ve been gradually modifying it to make it less scary. The interviews are for data scientist roles.
    • Buttons840 2 minutes ago
      I haven't read the article, but my understanding is that a normal curve results from summing several samples from most common probability distributions, and also a normal curve results from summing many normal curves.

      All summation roads lead to normal curves. (There might be an exception for weird probability distributions that do not have a mean; I was surprised when I learned these exist.)

      Life is full of sums. Height? That's a sum of genetics and nutrition, and both of those can be broken down into other sums.

      I'm not a data scientist. I'm just a programmer that works with piles of poorly designed business logic.

      How did I do in my interview? (I am looking for a job.)
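      The "all summation roads" claim above can be sketched in a few lines of numpy. This is a hypothetical illustration (the exponential "contributions" and sample sizes are my own choices, not from the comment): summing 100 strongly skewed draws already gives nearly Gaussian moments.

```python
import numpy as np

# Hypothetical illustration: each "measurement" is the sum of 100 strongly
# skewed contributions (exponential draws, skewness 2, kurtosis 9).
rng = np.random.default_rng(0)
n_terms, n_samples = 100, 50_000
sums = rng.exponential(1.0, size=(n_samples, n_terms)).sum(axis=1)

# Standardize and compare moments against the standard normal's (0, 1, 0, 3).
z = (sums - sums.mean()) / sums.std()
skew = np.mean(z ** 3)    # theory: 2 / sqrt(100) = 0.2
kurt = np.mean(z ** 4)    # theory: 3 + 6 / 100 = 3.06
print(f"skewness ~ {skew:.3f}, kurtosis ~ {kurt:.3f}")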

    • hilliardfarmer 6 minutes ago
      A lot of times I can't tell if I'm the idiot or if everyone else is. To me this isn't an interesting question at all, and the article was horrible. I studied data science for a few years, and while I'm no expert, it seems pretty obvious that if you make a series of 50/50 choices randomly, that's the shape you end up with, and there's really nothing more interesting about it than that.
  • jibal 1 minute ago
    https://en.wikipedia.org/wiki/Central_limit_theorem

    > suppose that a large sample of observations is obtained, each observation being randomly produced in a way that does not depend on the values of the other observations, and the average (arithmetic mean) of the observed values is computed. If this procedure is performed many times, resulting in a collection of observed averages, the central limit theorem says that if the sample size is large enough, the probability distribution of these averages will closely approximate a normal distribution.

  • AxEy 30 minutes ago
    I remember seeing one of these

    https://en.wikipedia.org/wiki/Galton_board

    at the (I think) Boston Science Museum when I was a kid. They have some pretty cool videos on Youtube if you're curious.
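    A Galton board is also cheap to simulate. In this hypothetical sketch (the peg and ball counts are my own choices), each ball's final bin is just its count of rightward deflections, a Binomial(rows, 0.5) draw, and printing the counts shows the bell shape.

```python
import numpy as np

# Hypothetical Galton-board simulation: each ball hits `rows` pegs and is
# deflected left (0) or right (1) with probability 1/2, so its final bin
# is a Binomial(rows, 0.5) count -- the de Moivre-Laplace bell shape.
rng = np.random.default_rng(42)
rows, balls = 20, 100_000
bins = rng.integers(0, 2, size=(balls, rows)).sum(axis=1)

counts = np.bincount(bins, minlength=rows + 1)
for k, c in enumerate(counts):           # crude ASCII histogram
    print(f"{k:2d} {'#' * (c // 1000)}")
```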

  • fiforpg 1 hour ago
    On opening the article, I was somehow expecting a mention of the large deviations formalism, which was (is?) fashionable in the late 20th century, and gives a nice information-theoretic view of the CLT. Or something like that. There's a ton of deep math there. So having a biostatistician say "look, the CLT is cool" is a bit underwhelming.

    Edit: see eg John Baez's write-up What is Entropy? about the entropy maximization principle, where gaussians make an entrance.

  • abetusk 53 minutes ago
    Sorry, does the article actually give reasons why the bell curve is "everywhere"?

    For simplicity, take N identically distributed random variables that are uniform on the interval from [-1/2,1/2], so the probability distribution function, f(x), on the interval from [-1/2,1/2] is 1.

    The Fourier transform of f(x), F(w), is essentially sin(w)/w. Taking only the first few terms of the Taylor expansion, ignoring constants, gives (1-w^2).

    Convolution is multiplication in Fourier space, so you get (1-w^2)^n. Squinting, (1-w^2)^n ~ (1-n w^2 / n)^n ~ exp(-n w^2). The Fourier transform of a Gaussian is a Gaussian, so the result holds.

    Unfortunately I haven't worked it out myself but I've been told if you fiddle with the exponent of 2 (presumably choosing it to be in the range of (0,2]), this gives the motivation for Levy stable distributions, which is another way to see why fat-tailed/Levy stable distributions are so ubiquitous.
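    The stable-distribution point can be seen without any Fourier analysis. The standard Cauchy is the alpha = 1 stable law, so the mean of n Cauchy draws is itself standard Cauchy; this hypothetical sketch (the sample sizes are my own choices, not from the comment) checks that averaging never shrinks the spread.

```python
import numpy as np

# Hypothetical sketch: the standard Cauchy is the alpha=1 stable law, so the
# mean of n Cauchy draws is itself standard Cauchy -- averaging never helps.
rng = np.random.default_rng(1)
for n in (10, 100, 10_000):
    means = rng.standard_cauchy(size=(500, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    iqr = q75 - q25                # IQR of a standard Cauchy is exactly 2
    print(f"n={n:>6}: IQR of sample means = {iqr:.2f}")
```

    The interquartile range hovers around 2 at every n, exactly the IQR of a single standard Cauchy draw, in stark contrast to the 1/√n shrinkage a finite-variance distribution would show.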

    • WCSTombs 25 minutes ago
      It's not super hard to prove the central limit theorem, and you gave the flavor of one such proof, but it's still a bit much for the likely audience of this article, who can't be assumed to have the math background needed to appreciate the argument. And I think you're on the right track with the comment about stable distributions.
  • nsnzjznzbx 25 minutes ago
    So Abraham de Moivre was the world's first quant?
  • fritzo 4 hours ago
    Hot take: bell curves are everywhere exactly because the math is simple.

    The causal chain is: the math is simple -> teachers teach simple things -> students learn what they're taught -> we see the world in terms of concepts we've learned.

    The central limit theorem generalizes beyond simple math to hard math: Levy alpha-stable distributions when variance is not finite, the Fisher-Tippett-Gnedenko theorem and Gumbel/Fréchet/Weibull distributions regarding extreme values. Those curves are also everywhere, but we don't see them because we weren't taught them, because the math is tough.

    • atrettel 1 hour ago
      I've often described this as a bias towards easily taught ("teachable") material over more realistic but difficult to teach material. Sometimes teachers teach certain subjects because they fit the classroom well as a medium. Some subjects are just hard to teach in hour-long lectures using whiteboards and slides. They might be better suited to other media, especially self study, but that does not mean that teachers should ignore them.
    • BobbyTables2 3 hours ago
      It also took me a little while to realize “least squares” and MMSE approaches were not necessarily the “correct” way to do things but just “one thing we actually know how to do” because everything else is much harder.

      We can use Calculus to do so much but also so little…

    • gowld 1 hour ago
      Most things aren't infinite or extreme, though. Almost by definition, most phenomena aren't extreme phenomena.
      • D-Machine 4 minutes ago
        No, but when you get into the nitty gritty of most things sometimes being influenced by extremely rare things, and also that the convergence rate of the central limit theorem is not universal at all, then much of the utility (and apparent universality) of the CLT starts to evaporate.

        In practice when modeling you are almost always better off never assuming normality, and want to test models that allow the possibility of heavy tails. The CLT is an approximation, and modern robust methods or Bayesian methods that don't assume Gaussian priors are almost always better models. But this of course calls into question the very universality of the CLT (i.e. it is natural in math, but not really in nature).

    • AndrewKemendo 3 hours ago
      That’s exactly the right take and the article proves it:

      Statisticians love averages so everywhere that could be sampled as a normal distribution will be presented as one

      The median is actually more descriptive, and the power law is equally pervasive, if not more so

      • fsckboy 26 minutes ago
        combining repeated samples of any distribution* with finite variance (any probability density function, including power-law distributions whose variance is finite) will converge to the normal distribution; that's why it appears everywhere.

        * excluding bizarre degenerates like constants or impulse functions

    • orangemaen 2 hours ago
      [dead]
  • bluGill 2 hours ago
    100-year floods are not happening more often in most cases - it is just that the central limit theorem teaches us the 10-year flood is almost as high water as the 100- or even 1000-year flood.
    • gowld 1 hour ago
      Explain?

      What are "most cases"?

  • gwern 1 hour ago
    A little disappointing. All about the history of bell curves, but I don't think it does a very good job explaining why the bell curve appears or the CLT is as it is.
  • DroneBetter 4 hours ago
    I hate Quanta a lot

    a vast amount of fluff for less than a college statistics professor would (hopefully) be able to impart with a chalkboard in 10 minutes, when Quanta has the ability to prepare animated diagrams like 3Blue1Brown but chooses not to use it

    they could go down myriad paths, like how it implies that random walks on square lattices are asymptotically isotropic, or give any other simple easy-to-understand applications (like getting an asymptotic on the expected # of rolls of an n-sided die before the first recurring face), or explain what a normal distribution is, but they only want to tell a story to convey a feeling

    they are a blight upon this world for not using their opportunity to further public engagement in a meaningful way

    • KnuthIsGod 2 hours ago
      3Blue1Brown

      Seems a bit like Ted Talks. Lightweight popcorn for the simple minded.

    • tptacek 4 hours ago
      A lot of times on HN when a math topic comes up that isn't about 3b1b, someone will jump in to say "this isn't as good as 3b1b". Last time I saw that, I was moved to comment:

      https://news.ycombinator.com/item?id=45800657

      3b1b doesn't have the same goal as Quanta, or as introductory guides. It's actually not that great a teaching tool (it's truly great at what it is for, which is (a) appreciation and motivation, and (b) allowing people to signal how smart they are on message board threads by talking about how much people would get out of watching 3b1b).

      This is prose writing about math. It's something you're meant to read for enjoyment. If you don't enjoy it, fine; I don't enjoy cowboy fiction. So I don't read it. I don't so much look for opportunities to yell at how much I hate "The Ballad of Easy Breezy".

      • bmenrigh 3 hours ago
        I don’t fault Quanta (or 3b1b) for being the way they are. Each is serving their goal audience pretty well.

        My complaint is only that there should be a dozen more just like them, each competing for the best, most engaging math and science content. This would allow a broader range of audience skill levels to be reached.

        As it stands, we’re lucky even to have Quanta and 3b1b.

        I think there is hope though, quite a few new-ish creators on YouTube are following in Grant’s footsteps and producing very technically detailed and informative content at similar quality levels.

      • paulpauper 2 hours ago
        there is no getting around the fact that learning math requires actually having to buckle down and read and do math. A video will not suffice.
        • DroneBetter 1 hour ago
          well for one who does buckle down and read and do math, the expected amount of new information brought to them by a 3B1B video as supplementary material upon a topic (with the normal distribution being one that admits a direct comparison from the article) is nonzero, by merit of it possibly having ideas to convey from outside their usual purview and formal background that may be applicable to the doing of math (as has been the case for me, someone who [does math](https://oeis.org/wiki/User:Natalia_L._Skirrow)), while for Quanta fluff pieces it's zero.

          by the metric of "if this expository piece were to be taken to a time before its subject had been considered and presented to researchers, how useful would its outline be towards reproducing the theory in its totality," Quanta's writings (on both classical and research math) mostly score 0

        • tptacek 2 hours ago
          Couldn't agree more, which is why I think it's odd to suggest that a pop-sci magazine article is somehow a disservice that 3b1b would correct.
  • tsunamifury 1 hour ago
    Bell curves are everywhere because all distributions of any properties clump in some way at some level. The basics of probability show this. The result is you “seeing” bell curves everywhere. Aka clumps.

    This is a tautology to the extreme.

    • D-Machine 1 minute ago
      Yup. And in general more heavy-tailed bumps are in fact better models (assuming normality tends to lead to over-confidence). Really think the universality is strictly mathematical, and actually rare in nature.
  • WCSTombs 37 minutes ago
    It's not a bad article, but I have to point something out:

    > Laplace distilled this structure into a simple formula, the one that would later be known as the central limit theorem. No matter how irregular a random process is, even if it’s impossible to model, the average of many outcomes has the distribution that it describes. “It’s really powerful, because it means we don’t need to actually care what is the distribution of the things that got averaged,” Witten said. “All that matters is that the average itself is going to follow a normal distribution.”

    This is not really true, because the central limit theorem requires a huge assumption: that the random process has finite variance. I believe that distributions that don't satisfy that assumption, which we can call heavy-tailed distributions, are much more common in the real world than this discussion suggests. Pointing out that infinities don't exist in the real world is also missing the point, since a distribution that just has a huge but finite variance will require a correspondingly huge number of samples to start behaving like a normal distribution.
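    The "huge but finite variance" point is easy to demonstrate. In this hypothetical sketch (the lognormal with sigma = 2 is my own choice, not from the comment), the variance is finite, so the CLT formally applies, yet averages of a thousand draws are still conspicuously skewed.

```python
import numpy as np

# Hypothetical sketch: a lognormal with sigma=2 has finite (but huge) variance
# and skewness of roughly 400, so averages of even 1,000 draws are still
# visibly skewed. The CLT "applies", but too slowly to matter at this size.
rng = np.random.default_rng(7)
n, reps = 1_000, 10_000
means = rng.lognormal(mean=0.0, sigma=2.0, size=(reps, n)).mean(axis=1)

z = (means - means.mean()) / means.std()
skew = np.mean(z ** 3)                 # 0 for an actual normal distribution
print(f"skewness of standardized means at n={n}: {skew:.2f}")
```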

    Apart from the universality, the normal distribution has a pretty big practical advantage over others: it leads to mathematical models that are tractable. To go into slightly more detail: in mathematical modeling, often you define some mathematical model that approximates a real-world phenomenon but has some unknown parameters, and you want to determine those parameters to complete the model. To do that, you take measurements of the real phenomenon and find the parameter values that best fit the measurements. Crucially, the measurements don't need to be exact, but the distribution of the measurement errors is important. If you assume the errors are independent and normally distributed, then you get a relatively nice optimization problem compared to most other things. This is, in my opinion, about as responsible for the ubiquity of normal distributions in mathematical modeling as the universality from the central limit theorem.

    However, as most people who solve such problems realize, sometimes we have to contend with these things called "outliers," which by another name are really samples from a heavy-tailed distribution. If you don't account for them somehow, then Bad Things(TM) are likely to happen. So either we try to detect and exclude them, or we replace the normal distribution with something that matches the real data a bit better.
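    The effect of a single outlier on a normality-based fit, versus a robust alternative, can be sketched as follows (a hypothetical example; the Theil-Sen estimator is my stand-in for "something that matches the real data a bit better"):

```python
import numpy as np

# Hypothetical sketch: fit y = a*x + b by least squares, then corrupt one
# point with a heavy-tailed measurement error and watch the OLS slope lurch,
# while a crude robust estimator (Theil-Sen) barely moves.
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, size=x.size)

slope_clean = np.polyfit(x, y, 1)[0]   # true slope is 2.0

y_bad = y.copy()
y_bad[-1] += 100.0                     # one heavy-tailed "outlier"

slope_ols = np.polyfit(x, y_bad, 1)[0]

# Theil-Sen: the median of all pairwise slopes, insensitive to a few outliers
i, j = np.triu_indices(x.size, k=1)
slope_ts = np.median((y_bad[j] - y_bad[i]) / (x[j] - x[i]))

print(f"clean OLS: {slope_clean:.2f}, OLS with outlier: {slope_ols:.2f}, "
      f"Theil-Sen with outlier: {slope_ts:.2f}")
```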

    Anyway, to connect this all back to the central limit theorem, it's probably fair to say measurement errors tend to be the combined result of many tiny unrelated effects, but the existence of outliers is pretty strong evidence that some of those effects are heavy-tailed and thus we can't rely on the central limit theorem giving us a normal distribution.

    • D-Machine 29 minutes ago
      This is also right, I believe: normal distributions are not truly ubiquitous, just approximately so (and only if you ignore rare outliers, and close your eyes to all the things we don't actually understand at all).

      The point on convergence rates re: the central limit theorem is also a major point otherwise clever people tend to miss, and which comes up in a lot of modeling contexts. Many things which make sense "in the limit" likely make no sense in real world practical contexts, because the divergence from the infinite limit in real-world sizes is often huge.

      EDIT: Also from a modeling standpoint, say e.g. Bayesian, I often care about finding out something like the "range" of possible results for (1) a near-uniform prior, (2), a couple skewed distributions, with the tail in either direction (e.g. some beta distributions), and (3) a symmetric heavy-tailed distribution (e.g. Cauchy). If you have these, anything assuming normality is usually going to be "within" the range of these assumptions, and so is generally not anything I would care about.

      Basically, in practical contexts, you care about tails, so assuming they don't meaningfully exist is a non-starter. Looking at non-robust stats of any kind today, without also checking some robust models or stats, just strikes me as crazy.