How Taalas "prints" LLM onto a chip?

(anuragk.com)

84 points | by beAroundHere 13 hours ago

13 comments

thesz 16 minutes ago
8B coefficients are packed into 53B transistors, about 6.6 transistors per coefficient. A two-input NAND gate takes 4 transistors, and a register takes about the same. So one coefficient gets processed (multiplied, with the result added to a sum) with fewer than two two-input NAND gates.
I think they used block quantization: one can enumerate all possible blocks for all (sorted) permutations of coefficients and for each layer place only these blocks that are needed there. For 3-bit coefficients and block size of 4 coefficients only 330 different blocks are needed.
Matrices in Llama 3.1 are 4096x4096, 16M coefficients. They can be compressed into only 330 blocks, if we assume that all permutations of coefficients are present, plus a network that applies the correct permutations to inputs and outputs.
Assuming that blocks are the most area-consuming part, we have a per-block budget of about 250 thousand transistors, or 30 thousand two-input NAND gates per block.
250K transistors per block * 330 blocks / 16M coefficients = about 5 transistors per coefficient.
Looks very, very doable.
It does look doable even for FP4 - these are 3-bit coefficients in disguise.
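The arithmetic above can be checked with a short sketch. Counting blocks up to permutation is the same as counting multisets of 4 items drawn from the 8 values a 3-bit coefficient can take, which gives C(8+4-1, 4) = 330. All figures (53B transistors, 8B coefficients, one 4096x4096 matrix, 250K transistors per block) are the ones quoted in the comment; the variable names are only illustrative.

```python
from itertools import combinations_with_replacement
from math import comb

# Figures quoted in the comment above.
total_transistors = 53e9
total_coefficients = 8e9
transistors_per_coefficient = total_transistors / total_coefficients

# Blocks of 4 three-bit coefficients, counted up to permutation,
# are multisets of size 4 over 8 possible values.
levels = 2 ** 3
block_size = 4
n_blocks = comb(levels + block_size - 1, block_size)

# Cross-check by explicitly enumerating every sorted block.
enumerated = sum(1 for _ in combinations_with_replacement(range(levels), block_size))
assert n_blocks == enumerated

# Per-coefficient cost if one 4096x4096 matrix is served by these blocks.
matrix_coefficients = 4096 * 4096
block_budget = 250_000
per_coefficient = block_budget * n_blocks / matrix_coefficients

print(transistors_per_coefficient, n_blocks, per_coefficient)
```

This only reproduces the back-of-envelope estimate; it says nothing about the cost of the permutation network the comment also assumes.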
Hello9999901 2 hours ago
This would be a very interesting future. I can imagine Gemma 5 Mini running locally on hardware, or a hard-coded "AI core" like an ALU or media processor that supports particular encoding mechanisms like H.264, AV1, etc.
Other than the obvious costs (but Taalas seems to be bringing back the structured ASIC era so costs shouldn't be that low [1]), I'm curious why this isn't getting much attention from larger companies. Of course, this wouldn't be useful for training models but as the models further improve, I can totally see this inside fully local + ultrafast + ultra efficient processors.
[1] https://en.wikipedia.org/wiki/Structured_ASIC_platform
- roncesvalles 55 minutes ago
  Well even programmable ASICs like Cerebras and Groq give many-multiples speedup over GPUs and the market has hardly reacted at all.
owenpalmer 1 hour ago
> Kinda like a CD-ROM/Game cartridge, or a printed book, it only holds one model and cannot be rewritten.
Imagine a slot on your computer where you physically pop out and replace the chip with different models, sort of like a Nintendo DS.
- roncesvalles 52 minutes ago
  That slot is called USB-C. I can fully imagine inference ASICs coming in powerbank form factor that you'd just plug and play.
  - XorNot 46 minutes ago
    Pretty sure it'd just be a thumbdrive. Are the Taalas chips particularly large in surface area?
    - thesz 10 minutes ago
      800 mm2, about 90mm per side, if imagined as a square. Also, 250 W of power consumption.
      The form factor should be anything but thumbdrive.
      - pfortuny 2 minutes ago
        mmmhhhhh 800mm2 ~= (30mm)2, which is more like a (biggish) thumb drive.
    - dmurray 15 minutes ago
      The only product they've announced at the moment [0] is a PCI-e card. It's more like a small power bank than a big thumb drive.
      But sure, the next generation could be much smaller. It doesn't require battery cells, (much) heat management, or ruggedization, all of which put hard limits on how much you can miniaturise power banks.
      [0] https://taalas.com/the-path-to-ubiquitous-ai/
- beAroundHere 1 hour ago
  That's the kind of hardware I am rooting for, since it'll encourage open-weights models and would be much more private.
  In fact, I was wondering whether robots of the future could have such slots, so they can use different models depending on the task they're given. Like a hardware MoE.
- 8cvor6j844qw_d6 1 hour ago
  A cartridge slot for models is a fun idea. Instead of one chip running any model, you get one model or maybe a family of models per chip at (I assume) much better perf/watt. Curious whether the economics work out for consumer use or if this stays in the embedded/edge space.
- Onavo 56 minutes ago
  Yeah maybe you can call it PCIe.
cpldcpu 40 minutes ago
I wonder how well this works with MoE architectures?
For dense LLMs, like llama-3.1-8B, you profit a lot from having all the weights available close to the actual multiply-accumulate hardware.
With MoE, it is rather like a memory lookup. Instead of a 1:1 pairing of MACs to stored weights, you suddenly are forced to have a large memory block next to a small MAC block. And once this mismatch becomes large enough, there is a huge gain by using a highly optimized memory process for the memory instead of mask ROM.
At that point we are back to a chiplet approach...
- pests 6 minutes ago
  For comparison I wanted to write on how Google handles MoE archs with its TPUv4 arch.
  They use Optical Circuit Switches, operating via MEMS mirrors, to create highly reconfigurable, high-bandwidth 3D torus topologies. The OCS fabric allows 4,096 chips to be connected in a single pod, with the ability to dynamically rewire the cluster to match the communication patterns of specific MoE models.
  The 3D torus connects 64-chip cubes with 6 neighbors each. TPUv4 also contains 2 SparseCores, which specialize in handling high-bandwidth, non-contiguous memory accesses.
  Of course this is a DC level system, not something on a chip for your pc, but just want to express the scale here.
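To make the "6 neighbors each" point concrete, here is a minimal sketch of neighbor addressing in a 3D torus. The 16x16x16 shape is an assumption chosen only because it multiplies out to the 4,096 chips mentioned above; the function name and coordinate scheme are hypothetical and not taken from any Google documentation.

```python
def torus_neighbors(coord, shape):
    """Return the 6 neighbors of `coord` in a 3D torus with wraparound links."""
    x, y, z = coord
    nx, ny, nz = shape
    return [
        ((x + 1) % nx, y, z), ((x - 1) % nx, y, z),
        (x, (y + 1) % ny, z), (x, (y - 1) % ny, z),
        (x, y, (z + 1) % nz), (x, y, (z - 1) % nz),
    ]


# Assumed pod shape: 16 x 16 x 16 = 4096 chips.
pod_shape = (16, 16, 16)
corner = torus_neighbors((0, 0, 0), pod_shape)
print(corner)
```

Because of the modulo wraparound, a chip at the edge of the grid has the same number of links as one in the middle, which is what distinguishes a torus from a plain mesh.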
kinduff 1 hour ago
Very nice read, thank you for sharing this so well written.
punnerud 19 minutes ago
Could we all get bigger FPGAs and load the model onto it using the same technique?
- wmf 1 minute ago
  FPGAs have really low density so that would be ridiculously inefficient, probably requiring ~100 FPGAs to load the model. You'd be better off with Groq.
- fercircularbuf 17 minutes ago
  I thought about this exact question yesterday. Curious to know why we couldn't, if it isn't feasible. Would allow one to upgrade to the next model without fabricating all new hardware.
rustybolt 1 hour ago
Note that this doesn't answer the question in the title, it merely asks it.
- beAroundHere 1 hour ago
  Yeah, I had written the blog to wrap my head around the idea of 'how would someone even be printing Weights on a chip?' 'Or how to even start to think in that direction?'.
  I didn't explore the actual manufacturing process.
  - pixelmelt 1 hour ago
    You should add an RSS feed so I can follow it!
    - beAroundHere 1 hour ago
      I don't post blogs often, so haven't added RSS there, but will do. I mostly post to my linkblog[1], hence have RSS there.
      [1] https://www.anuragk.com/linkblog
londons_explore 55 minutes ago
So why only 30,000 tokens per second?
If the chip is designed as the article says, they should be able to do 1 token per clock cycle...
And whilst I'm sure the propagation time is long through all that logic, it should still be able to do tens of millions of tokens per second...
rustyhancock 2 hours ago
Edit: reading the below it looks like I'm quite wrong here but I've left the comment...
The single transistor multiply is intriguing.
I'd assume they are layers of FMA operating in the log domain.
But everything tells me that would be too noisy and error prone to work.
On the other hand my mind is completely biased to the digital world.
If they stay in the log domain and use a resistor network for multiplication, and the transistor is just exponentiating for the addition that seems genuinely ingenious.
Mulling it over, actually the noise probably doesn't matter. It'll average to 0.
It's essentially compute and memory baked together.
I don't know much about the area of research so can't tell if it's innovative but it does seem compelling!
- generuso 1 hour ago
  The document referenced in the blog does not say anything about the single transistor multiply.
  However, [1] provides the following description: "Taalas’ density is also helped by an innovation which stores a 4-bit model parameter and does multiplication on a single transistor, Bajic said (he declined to give further details but confirmed that compute is still fully digital)."
  [1] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
  - londons_explore 49 minutes ago
    It'll be different gates on the transistor for the different bits, and you power only one set depending on which bit of the result you wish to calculate.
    Some would call it a multi-gate transistor, whilst others would call it multiple transistors in a row...
    - hagbard_c 17 minutes ago
      That, or a resistor ladder with 4 bit branches connected to a single gate, possibly with a capacitor in between, representing the binary state as an analogue voltage, i.e. an analogue-binary computer. If it works for flash memory it could work for this application as well.
  - rustyhancock 1 hour ago
    That's much more informative, I think my original comment is quite off the mark then.
- jsjdjrjdjdjrn 14 minutes ago
  I'd expect this is analog multiplication with voltage levels being ADC'd out for the bits they want. If you think about it, it makes the whole thing very analog.
  - jsjdjrjdjdjrn 10 minutes ago
    Note: reading further down, my speculation is wrong.
abrichr 1 hour ago
ChatGPT Deep Research dug through Taalas' WIPO patent filings and public reporting to piece together a hypothesis. Next Platform notes at least 14 patents filed [1]. The two most relevant:
"Large Parameter Set Computation Accelerator Using Memory with Parameter Encoding" [2]
"Mask Programmable ROM Using Shared Connections" [3]
The "single transistor multiply" could be multiplication by routing, not arithmetic. Patent [2] describes an accelerator where, if weights are 4-bit (16 possible values), you pre-compute all 16 products (input x each possible value) with a shared multiplier bank, then use a hardwired mesh to route the correct result to each weight's location. The abstract says it directly: multiplier circuits produce a set of outputs, readable cells store addresses associated with parameter values, and a selection circuit picks the right output. The per-weight "readable cell" would then just be an access transistor that passes through the right pre-computed product. If that reading is correct, it's consistent with the CEO telling EE Times compute is "fully digital" [4], and explains why 4-bit matters so much: 16 multipliers to broadcast is tractable, 256 (8-bit) is not.
The same patent reportedly describes the connectivity mesh as configurable via top metal masks, referred to as "saving the model in the mask ROM of the system." If so, the base die is identical across models, with only top metal layers changing to encode weights-as-connectivity and dataflow schedule.
Patent [3] covers high-density multibit mask ROM using shared drain and gate connections with mask-programmable vias, possibly how they hit the density for 8B parameters on one 815mm2 die.
If roughly right, some testable predictions: performance very sensitive to quantization bitwidth; near-zero external memory bandwidth dependence; fine-tuning limited to what fits in the SRAM sidecar.
Caveat: the specific implementation details beyond the abstracts are based on Deep Research's analysis of the full patent texts, not my own reading, so could be off. But the abstracts and public descriptions line up well.
[1] https://www.nextplatform.com/2026/02/19/taalas-etches-ai-mod...
[2] https://patents.google.com/patent/WO2025147771A1/en
[3] https://patents.google.com/patent/WO2025217724A1/en
[4] https://www.eetimes.com/taalas-specializes-to-extremes-for-e...
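The "multiplication by routing" hypothesis above can be sketched in software. For each input, all 16 products are computed once by a shared bank, and each stored weight is only a 4-bit code that selects one of them. This is a sketch of the idea as described in the comment, not of Taalas' actual circuit; the function name and the signed codebook `range(-8, 8)` are assumptions, since the real 4-bit format is not stated.

```python
import random


def routed_matvec(weight_codes, inputs, codebook):
    """Matrix-vector product where weights only select pre-computed products."""
    outputs = [0] * len(weight_codes)
    for j, x in enumerate(inputs):
        # Shared multiplier bank: one product per possible weight value,
        # computed once per input and reused by every row.
        products = [x * value for value in codebook]
        for i, row in enumerate(weight_codes):
            # Per-weight "cell": no arithmetic, just pick the right product.
            outputs[i] += products[row[j]]
    return outputs


codebook = list(range(-8, 8))  # hypothetical set of 16 weight values
rng = random.Random(0)
codes = [[rng.randrange(16) for _ in range(8)] for _ in range(4)]
xs = [rng.randint(-100, 100) for _ in range(8)]

# Reference result using an ordinary multiply per weight.
direct = [sum(codebook[c] * x for c, x in zip(row, xs)) for row in codes]
assert routed_matvec(codes, xs, codebook) == direct
```

The multiply count here is 16 per input regardless of how many rows consume that input, which matches the comment's point that 16 broadcast products are tractable while 256 would not be.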
- generuso 5 minutes ago
  LSI Logic and VLSI Systems used to do such things in the 1980s: they produced a quantity of "universal" base chips, and then relatively inexpensively and quickly customized them for different uses and customers by adding a few interconnect layers on top. Like hardwired FPGAs. Such semi-custom ASICs were much less expensive than full custom designs, and one could order them in relatively small lots.
  Taalas of course builds base chips that are already closely tailored for a particular type of model. They aim to generate the final chips, with the model weights baked into ROMs, within two months of the weights becoming available. They hope that the hardware will be profitable for at least some customers, even if the model is only good enough for a year. Assuming they do get superior speed and energy efficiency, this may be a good idea.
- cpldcpu 46 minutes ago
  It could simply be bit-serial. With 4-bit weights you only need four serial addition steps, which is not an issue if the weights are stored nearby in a ROM.
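A minimal sketch of the bit-serial idea: multiplying by a 4-bit weight becomes four conditional shift-and-add steps, one per weight bit. Unsigned weights are assumed here for simplicity (a signed format would treat the top bit differently), and the function names are hypothetical.

```python
def bit_serial_multiply(x, weight, bits=4):
    """Multiply x by an unsigned `bits`-bit weight, one weight bit per step."""
    acc = 0
    for b in range(bits):
        # Step b: add the shifted input only if weight bit b is set.
        if (weight >> b) & 1:
            acc += x << b
    return acc


def bit_serial_dot(inputs, weights, bits=4):
    """Dot product built from bit-serial multiplies."""
    return sum(bit_serial_multiply(x, w, bits) for x, w in zip(inputs, weights))


print(bit_serial_dot([1, 2, 3], [15, 0, 7]))
```

No multiplier is needed at all in this scheme, only an adder and a shift per step, at the price of taking `bits` steps per weight.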
sargun 1 hour ago
Isn’t the highly connected nature of the model layers problematic to build into a physical layer?
moralestapia 48 minutes ago
>HOW NVIDIA GPUs process stuff? (Inefficiency 101)
Wow. Massively ignorant take. A modern GPU is an amazing feat of engineering, particularly in making computation more efficient (low power/high throughput).
Then proceeds to explain, wrongly, how inference is supposedly implemented and draws conclusions from there ...
- beAroundHere 39 minutes ago
  Hey, can you please point out and explain the inaccuracies in the article?
  I wrote this post to get a higher-level understanding of traditional vs. Taalas's inference, so it does abstract away lots of things.
villgax 58 minutes ago
This read itself is slop lol, literally dances around the term printing as if it's some inkjet printer