Simple self-distillation improves code generation

(arxiv.org)

237 points | by Anon84 4 hours ago

24 comments

  • bensyverson 2 hours ago
    Really fascinating how this works; it's basically context-aware decoding. From the paper:

    > Code interleaves fork positions, where several continuations are genuinely plausible and may correspond to different solution approaches, with lock positions, where syntax and semantics leave little ambiguity but a low-probability distractor tail still remains… The best global decoding setting is therefore necessarily a compromise; we call this tension the precision-exploration conflict.

    In other words, just like us, the model needs to shift from "exploration" in "fork" mode (divergent thinking to produce a creative solution) to "precision" in "lock" mode (producing syntactically correct code).

    What this paper shows is that their simple technique (SSD) can improve the ranking of optimal tokens in both lock and fork positions, meaning the model is more likely to explore when it should be exploring, and more likely to be precise when it needs to be.
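    The fork/lock distinction can be made concrete with next-token entropy. A toy sketch (the distributions below are invented for illustration, not taken from the paper):

    ```python
    import math

    def entropy(probs):
        """Shannon entropy (bits) of a next-token distribution."""
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # Hypothetical next-token distributions at two positions:
    # a "lock" position: syntax is forced, but a low-probability
    # distractor tail remains
    lock = [0.90, 0.05, 0.03, 0.02]
    # a "fork" position: several continuations are genuinely plausible
    fork = [0.35, 0.30, 0.20, 0.15]

    print(f"lock entropy: {entropy(lock):.2f} bits")  # low: be precise
    print(f"fork entropy: {entropy(fork):.2f} bits")  # high: worth exploring
    ```

    A single global temperature has to serve both kinds of position at once, which is exactly the compromise the paper calls the precision-exploration conflict.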

    I love that we're still learning the emergent properties of LLMs!

    • user_7832 1 hour ago
      > I love that we're still learning the emergent properties of LLMs!

      TBH, this is (very much my opinion btw) the least surprising thing. LLMs (and especially their emergent properties) are still black boxes. Humans have been studying the human brain for millennia, and we are barely better at predicting how humans work (or, e.g., to what extent free will is a thing). Hell, the emergent properties of traffic were not understood or given proper attention, even though every researcher, as a driver, knows what a driver does. Right now, on the front page, is this post:

      > 14. Claude Code Found a Linux Vulnerability Hidden for 23 Years (mtlynch.io)

      So it's pretty cool we're learning new things about LLMs, sure, but it's barely surprising that we're still learning it.

      (Sorry, mini grumpy man rant over. I just wish we knew more of the world but I know that's not realistic.)

      • AlphaAndOmega0 7 minutes ago
        I'm a psychiatry resident who finds LLM research fascinating because of how strongly it reminds me of our efforts to understand the human brain/mind.

        I dare say that in some ways, we understand LLMs better than humans, or at least the interpretability tools are now superior. Awkward place to be, but an interesting one.

      • amelius 50 minutes ago
        Studies of LLMs belong in their own field of science, just as psychology is not studied in the physics department.
        • zer00eyz 10 minutes ago
          The intersection of physics isn't psychology, it's philosophy, and the same is true (at present) of LLMs.

          Much as Diogenes mocked Plato's definition of a man with a plucked chicken, LLMs revealed what "real" AI would require: continual learning. That isn't to diminish the power of LLMs (they are useful), but that limitation is a fairly hard one to overcome if true AGI is your goal.

      • bensyverson 1 hour ago
        Learning about the emergent properties of these black boxes is not surprising, but it's also not daily. I think every new insight is worth celebrating.
        • TeMPOraL 45 minutes ago
          Indeed. For me, it's also a good reminder that AI is here to stay as technology, that the hype and investment bubble don't actually matter (well, except to those that care about AI as investment vehicle, of which I'm not one). Even if all funding dried out today, even if all AI companies shut down tomorrow, and there are no more models being trained - we've barely begun exploring how to properly use the ones we have.

          We have tons of low-hanging fruit across all fields of science and engineering to be picked, in the form of different ways to apply and chain the models we have, different ways to interact with them, etc. - enough to fuel a good decade of continued progress in everything.

          • bathtub365 33 minutes ago
            AI has been here to stay for decades
            • TeMPOraL 24 minutes ago
              Maybe, but you couldn't tell that these days, casually scrolling this or any other tech-oriented discussion board.
      • Invictus0 19 minutes ago
        To say we've been studying the brain for millennia is an extreme exaggeration. Modern neuroscience is only about 50 years old.
        • timcobb 15 minutes ago
          I came here to say this :)
    • khalic 1 hour ago
      Another example of the mindf@#$ these systems are: I was doing some fine-tuning of a small model: take data fields and make a sentence out of them. I was running into mode collapse (basically when the AI simplifies too much and always outputs the same thing).

      I got unstuck by randomizing the field order for each row?!? At training time, and now I'm thinking I should do the same at inference time...
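      For anyone curious what that looks like, here's a minimal sketch of the trick (field names and serialization format are hypothetical, not from the original setup):

      ```python
      import random

      def render_example(fields: dict) -> str:
          """Serialize data fields into training text, in randomized order.

          Shuffling the field order per row adds variety so the model
          can't collapse onto a single fixed template.
          """
          items = list(fields.items())
          random.shuffle(items)  # different order on every call / training row
          return "; ".join(f"{k}: {v}" for k, v in items)

      # Hypothetical data row.
      row = {"name": "Alice", "city": "Utrecht", "role": "engineer"}
      print(render_example(row))  # e.g. "city: Utrecht; name: Alice; role: engineer"
      ```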

      • p_stuart82 16 minutes ago
        the irony of modern software engineering: we spent decades perfecting deterministic algorithms, and now we're basically just shaking a black box and hoping the magic rocks align.
      • toddmorey 31 minutes ago
        wow that's fascinating
    • stingraycharles 2 hours ago
      Seems like this is true not just for code but for all generated content? Albeit for code it’s more well-defined, but the fork / lock mechanism works for a lot more problem domains.
      • bensyverson 2 hours ago
        That would seem intuitively true; it certainly applies to written language, where a clause could go off in another direction, but at other positions the correct grammar/syntax is unambiguous.
      • bryanrasmussen 2 hours ago
        thinking - well if we think of lock as happening in a narrative, then I think we can see there can be points where "everything you know is wrong" which essentially allows you to go back into a sort of fork mode and work towards another lock.

        Completely artistic creation, creating something that does not exist and that cannot produce things out of itself, means that locking can be more diffuse, not as settled.

        • stingraycharles 2 hours ago
          I think this seems similar to what Anthropic had been doing since the latest few Opus releases, which is interleaved thinking; CoT reasoning in the middle of a message. But they operate at different layers.
    • orbital-decay 37 minutes ago
      One relevant thing is that these forks are unnaturally narrow in all models, and rather resemble locks (not quite but close). From multiple possible continuations models tend to prefer just a couple, i.e. the model is a lot less random than it should be. That's why you're seeing annoying slop in writing and instantly recognizable color schemes in vibecoded sites. Lack of diversity probably limits the usefulness of this method as well.

      >I love that we're still learning the emergent properties of LLMs!

      There are tons of low-hanging fruits there.

    • DavidPiper 1 hour ago
      Sounds just like John Cleese's "Open Mode" and "Closed Mode" - https://www.youtube.com/watch?v=Pb5oIIPO62g
    • michaelbuckbee 1 hour ago
      I don't really understand the internal mechanics of this, but my first thought was why not combine this with a linter/tests, so that it produces all the forks and only keeps the syntactically correct ones.
    • TacticalCoder 1 hour ago
      > What this paper shows is that their simple technique (SSD)

      "Simple Self-Distillation". We already had an acronym for Solid-State Drive. Don't know about that technique, but the naming sure sounds... simple?

  • wg0 2 hours ago
    After TurboQuant and Gemma 4, came across the following video[0] running Gemma on local machine at 50 token/second.

      That already looks like Sonnet 3x and 4 level capabilities to me, where the model in question (Gemma 4) sets up a whole Python project with a UI and installs Python libraries using uv etc.

      Add this Simple Self-Distillation to the picture and by 2028 I see cheaper coding-model providers with much more generous usage limits, and power users mostly running their own models anyway.

      Anyone using these models as "non-deterministic transpilers" from natural language to code (experienced engineers who can write code themselves) would probably not be paying any AI providers.

    [0] https://www.youtube.com/watch?v=-_hC-C_Drcw

    • red75prime 38 minutes ago
      > power users would be mostly running their own models

      ...with a fair amount of supervision, while frontier models would be running circles around them using project-specific memory and on-demand training (or whatever we would have by then).

      • 3abiton 16 minutes ago
        Honestly, right now it's mainly stagnation in frontier model capabilities. Most of the recent advancements are towards generation speed, compression and tool usage. The quality of the models is not improving at the same rate as before. I doubt this big gap will continue, given that open source and especially Chinese labs keep pushing well-documented frontier papers.
      • darkerside 14 minutes ago
        Those will be great for projects that look just like everybody else's. That's not a knock. We'll see plenty of new systems built by anyone who needs one.

        If you're building something groundbreaking and new, the advantage will be slim to none.

    • spiderfarmer 1 hour ago
      I always wonder how much smaller and faster models could be if they were only trained on the latest versions of the languages I use: for me that's PHP, SQL, HTML, JS, CSS, Dutch, English, plus tool use for my OS of choice (macOS).

      Right now it feels like hammering a house onto a nail instead of the other way around.

      • ACCount37 35 minutes ago
        Not very. LLMs derive a lot of their capability profile from the sheer scale.

        LLMs have something that's not entirely unlike the "g factor" in humans - a broad "capability base" that spans domains. The best of the best "coding LLMs" need both good "in-domain training" for coding specifically and a high "capability base". And a lot of where that "base" comes from is: model size and the scale of data and compute used in pre-training.

        Reducing the model scale and pruning the training data would result in a model with a lower "base". It would also hurt in-domain performance - because capabilities generalize and transfer, and pruning C code from the training data would "unteach" the model things that also apply to code in PHP.

        Thus, the pursuit of "narrow specialist LLMs" is misguided, as a rule.

        Unless you have a well-defined bar that, once cleared, makes the task solved, with no risk of scope adjustment, no benefit from any future capability improvements above that bar, and enough load to justify the engineering costs of training a purpose-specific model, a "strong generalist" LLM is typically a better bet than a "narrow specialist".

        In practice, this is an incredibly rare set of conditions to be met.

      • BarryMilo 1 hour ago
        I seem to remember that's one of the first things they tried, but the general models tended to win out. Turns out there's more to learn from all code/discussions than from just JS.
      • Someone1234 1 hour ago
        Wouldn't that mean they're bad at migration tasks? I feel like for most languages, going from [old] to [current] is a fairly to very common usage scenario.
      • nareyko 1 hour ago
        [dead]
  • augment_me 4 minutes ago
    Isn't this what DeepSeek + Kimi did to Claude?
  • ultramann 51 minutes ago
    Maybe not the thing I should be focusing on, but I was surprised this paper came from Apple. I was under the impression that Apple's AI/LLM research was far behind the curve. I get that research is a rising-tide-lifts-all-boats situation; I just thought I had seen lots of negative news about Apple's progress on this front, and heuristically haven’t seen many (any?) Apple research papers make it to the front page of Hacker News. Wondering if anyone more familiar with Apple AI research could comment on this?
  • khalic 2 hours ago
    Incredible, will translate to better coding models in the near future.

    We really need to develop better tools to understand what's happening inside these NNs. Working with high-D spaces is not something we're good at, and we're basically throwing stuff at it and seeing if it sticks.

  • fooker 19 minutes ago
    I'm excited for the long tail of techniques like this that will be discovered over the next several decades and eventually make this technology run on a toaster!
  • 0x3f 2 hours ago
    Haven't read the paper yet, but it is interesting how seemingly simple many breakthroughs in ML are. Even transformers are like that. Maybe it's hindsight bias.

    I suppose we just don't have a deeper underlying theory to lean on and help us 'design' anything.

    • christophilus 2 hours ago
      A lot of discoveries are like that. In fact, simplicity is often the hallmark of correctness, and complexity is often a sign that our understanding is incomplete and we’re still stumbling towards the right model. Not always, but often. It’s been a good rule of thumb in my programming career.
      • heeton 2 hours ago
        100%. I have a guiding approach when solving problems: keep reframing and exploring until the solution becomes obvious.

        I often find, if I've got a complicated solution, it’s because I haven’t fully examined the problem.

  • l5870uoo9y 2 hours ago
    > Our method, simple self-distillation (SSD), is embarrassingly simple: sample solutions from the base model with specified temperature and truncation, then fine-tune on those raw, unverified samples via standard cross-entropy loss.

    So you prompt the base model for an answer and then rerun the prompt with the answer from the first run?

    • ACCount37 2 hours ago
      No. There's no "answer" really.

      They use self-distillation to shift the output distribution of the model towards that of the same model, but running with different temperature/truncation settings in sampling.

      This effectively "folds" the logit tail truncation behavior into the model itself.

      In what it does, it's not entirely unlike a few "model-controlled sampling settings" schemes I've seen, but different in execution.
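      A toy numeric sketch of what that "folding in" means. The distribution and sampling settings below are invented, and a real SSD run fine-tunes a transformer with cross-entropy rather than refitting a table; this just shows the distribution shift:

      ```python
      import math, random

      random.seed(0)

      def sample_truncated(probs, temperature=0.7, top_p=0.9):
          """Sample one token id with temperature scaling + nucleus truncation."""
          # Temperature scaling: p_i^(1/T), renormalized.
          scaled = [p ** (1.0 / temperature) for p in probs]
          z = sum(scaled)
          scaled = [p / z for p in scaled]
          # Top-p truncation: keep the smallest top set with mass >= top_p.
          order = sorted(range(len(scaled)), key=lambda i: -scaled[i])
          kept, mass = [], 0.0
          for i in order:
              kept.append(i)
              mass += scaled[i]
              if mass >= top_p:
                  break
          # Sample from the renormalized kept set.
          r = random.random() * sum(scaled[i] for i in kept)
          for i in kept:
              r -= scaled[i]
              if r <= 0:
                  return i
          return kept[-1]

      # Toy "model": one next-token distribution with a distractor tail.
      base = [0.55, 0.25, 0.12, 0.05, 0.03]

      # "Distill": sample from the tempered/truncated model, then refit by
      # maximum likelihood (what cross-entropy training converges to here).
      counts = [0] * len(base)
      for _ in range(20_000):
          counts[sample_truncated(base)] += 1
      distilled = [c / sum(counts) for c in counts]

      print("base     ", base)
      print("distilled", [round(p, 3) for p in distilled])
      # The tail tokens lose all mass: the truncation behavior now
      # lives in the model's own distribution, not in the sampler.
      ```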

  • an0malous 1 hour ago
    I’d like to understand AI research better and I recall some posts a while back where someone collected all the key papers that one should read, but I don’t remember enough to be able to find it. Does anyone know what I’m talking about and could link me to that post?
  • xbmcuser 1 hour ago
    So the chances of Singularity went up.
  • vishnugupta 1 hour ago
    Can someone please eli5 this to a friend web developer? I read the abstract but couldn’t understand much.
    • unknownx113 35 minutes ago
      you're probably overcomplicating it; as the paper says, it's embarrassingly simple: given a problem set, generate a response for each problem with a fixed temperature and truncation, then fine-tune the model on the generations.

      Their hypothesis as to why this works requires a bit more knowledge about model architecture, but basically when a model generates code some positions have only one right answer and some have many valid options - but the model has to use one global confidence setting for both. Sampling with a specific temperature + a garbage-token filter, then training on those outputs, teaches the model to internalize 'be precise where there's one answer, stay open-minded where there are several' — without anyone labeling which is which.

      Note that there's a lot more nuance to this and I simplified a lot.

    • useful 19 minutes ago
      if the probability mass is on a single token, it's a precise answer, like `1 + 1 = `. If the predicted next token shares probability mass with other tokens, then there are multiple answers, like `position: `

      you can generate answers to train on by exploring, varying the length of the code generated

  • roger_ 2 hours ago
    Skimmed this but don't have an intuitive understanding of why this works and how temperature and truncation factor in.
  • antirez 47 minutes ago
    Another potentially usable trick is the following: based on the observation that a longer token budget improves model performance, one could generate solutions using a lot of thinking budget, then ask the LLM to turn the trace into a more compact one, and later SFT on that. That said, I have the feeling the result of the paper will likely be hard to apply in practice without affecting other capabilities, and/or not be superior to other techniques that provide similar improvements in sampling.
  • 4b11b4 1 hour ago
    Self-consistency meets fine-tuning?
  • drooby 2 hours ago
    Fascinating...

    This feels eerily similar to sleep consolidation or synaptic pruning

    • ACCount37 20 minutes ago
      I don't see much similarity? Unless you're looking at self-distillation in general and not just this use of it.
  • smallerize 2 hours ago
    I don't suppose they published the improved models?
  • VoqalAI 34 minutes ago
    [dead]
  • usermac 1 hour ago
    [dead]
  • pithtkn 1 hour ago
    [dead]
  • dist-epoch 2 hours ago
    [flagged]
    • avaer 2 hours ago
      I definitely pay more attention to papers affiliated with Chinese companies; the economics seem to be more conducive to doing good academic work and publishing it. I would say the same for companies like Apple (where TFA came from).

      But to filter based on author's names sounds pretty darn racist.

    • ptidhomme 2 hours ago
      I used to have the opposite rule in my signal processing field: the more Chinese names, the less innovation there was.

      They seemed like they had to be churning out papers and any little adaptation to existing research triggered a new publication.

      But it may have changed now.

    • 0x3f 2 hours ago
      That's... almost every AI paper.
    • amelius 2 hours ago
      So

      "Made in China, designed by Apple in California"

      should be:

      "Made in China, designed by Chinese people in California"?

  • jofzar 3 hours ago
    > simple self-distillation (SSD):

    Sorry apple, SSD is already taken, you can't use that acronym.

    • love2read 2 hours ago
      You're right, I offer these alternatives:

      Consistency Preservation Update (CPU)

      Guided Probability Update (GPU)

      History-aware Distillation Driving (HDD)

      Probability Smoothing Update (PSU)

    • drittich 2 hours ago
      I used to invent TLAs on the spot for fun, and when someone asked what it was, would respond, "It's a PUA", eventually revealing that meant "previously unknown acronym". It was even more annoying than it sounds.
    • ape4 2 hours ago
      ATT=All TLAs are Taken
  • politelemon 2 hours ago
    It's cringeworthy to see that the original paper itself is editorialised.

    Title should be: Simple Self-Distillation Improves Code Generation

    • StevenWaterman 2 hours ago
      "Embarrassingly" has a history as a technically meaningful word roughly equivalent to "maximally", see "Embarrassingly parallel"

      https://en.wikipedia.org/wiki/Embarrassingly_parallel

    • Aurornis 2 hours ago
      The phrase embarrassingly parallel has a history in computer science.

      Many computer science paper titles allude to past titles in other CS papers.

      Calling it “cringeworthy” is unnecessarily mean. There is context and history you don’t understand.

      • gottheUIblues 2 hours ago
        "Embarrassingly" considered harmful?
        • cbm-vic-20 1 hour ago
          "Embarrassingly" considered harmful is all you need.
          • TeMPOraL 32 minutes ago
            Programming Introduction to "Embarrassingly" considered harmful is all you need in 21 hours.
  • ape4 2 hours ago
    Shouldn't a scientific paper be using metric units (like 30T) rather than 30B?

    There are two distinct billions. https://en.wikipedia.org/wiki/Billion

    • mikkupikku 1 hour ago
      Objective one should be to communicate effectively, not confuse everybody.
      • unknownx113 34 minutes ago
        that disqualifies like 80% of papers lmao
        • mikkupikku 30 minutes ago
          Lol, you're probably not wrong. But have you ever noticed that the most important papers tend to be on the clear and readable side of things? It's as if researchers understand that being understood is important, but deemphasize that when the paper itself isn't important in the first place. (Maybe if they're only publishing to not perish, not being understood is actually a good thing from their perspective?)