Overfitted a 900KB Transformer to Compress a 100MB CSV into 7MB

I built an experiment that uses an overfitted transformer and arithmetic coding to compress individual files.

Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.
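
For readers who have not seen the predict-then-code pipeline before, here is a minimal sketch of the idea. It is not the code from the repo. A simple adaptive byte-count model stands in for the transformer, and exact fractions replace the fixed-precision integer arithmetic a real coder would use. Every name here (`AdaptiveByteModel`, `encode`, `decode`) is made up for illustration.

```python
from fractions import Fraction
import math


class AdaptiveByteModel:
    """Stand-in predictor: adaptive order-0 byte counts (NOT a transformer)."""

    def __init__(self):
        self.counts = [1] * 256  # start uniform so every byte has nonzero probability
        self.total = 256

    def update(self, symbol):
        self.counts[symbol] += 1
        self.total += 1


def encode(data: bytes):
    """Narrow [low, low + width) once per byte; return (code, nbits)."""
    model = AdaptiveByteModel()
    low, width = Fraction(0), Fraction(1)
    for b in data:
        cum = sum(model.counts[:b])  # probability mass below this byte
        low += width * Fraction(cum, model.total)
        width *= Fraction(model.counts[b], model.total)
        model.update(b)
    # Find the shortest binary fraction code / 2**nbits inside the final interval.
    nbits = 0
    while True:
        scale = 1 << nbits
        code = math.ceil(low * scale)
        if Fraction(code, scale) < low + width:
            return code, nbits
        nbits += 1


def decode(code: int, nbits: int, length: int) -> bytes:
    """Replay the same model updates to recover the bytes (length is sent separately)."""
    model = AdaptiveByteModel()
    value = Fraction(code, 1 << nbits)
    low, width = Fraction(0), Fraction(1)
    out = bytearray()
    for _ in range(length):
        scaled = (value - low) / width * model.total  # position in [0, total)
        cum = 0
        for sym in range(256):
            if scaled < cum + model.counts[sym]:
                break
            cum += model.counts[sym]
        low += width * Fraction(cum, model.total)
        width *= Fraction(model.counts[sym], model.total)
        out.append(sym)
        model.update(sym)
    return bytes(out)


message = b"abracadabra" * 3
code, nbits = encode(message)
assert decode(code, nbits, len(message)) == message
print(f"{len(message) * 8} bits in, {nbits} bits out")
```

The better the model predicts the next byte, the less the interval shrinks per byte and the fewer bits come out. That is why swapping the count model for a network that has memorized the file can pay off.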

On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).
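
The bits/byte figures follow directly from the sizes. A quick check, assuming decimal megabytes and treating the rounded sizes above as exact (the helper name `bits_per_byte` is just for illustration):

```python
def bits_per_byte(compressed_bytes: int, original_bytes: int) -> float:
    """Average number of output bits spent per input byte."""
    return 8 * compressed_bytes / original_bytes


MB = 1_000_000  # assumption: decimal megabytes

print(bits_per_byte(7 * MB, 100 * MB))   # taxi CSV: 0.56
print(bits_per_byte(21 * MB, 100 * MB))  # enwik9 slice: 1.68
```

With exactly 7 MB the CSV works out to 0.56 bits/byte, which is in line with the rounded ~0.5 figure.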

It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).

Check out the repo: https://github.com/samyak112/pym-particles

8 points | by spidy__ 2 days ago

7 comments

tae0086 19 hours ago
Neat approach. Since the 900KB model ships with the compressed file, is there a file size below which the model overhead just eats the gains? Curious where the crossover is.
- spidy__ 13 hours ago
  The model overhead only eats into the gains when the file is fairly small, right? I assumed nobody would use this to compress anything below 100 MB.
  I tested with 100 MB files because anything larger takes a long time to evaluate. The actual target was at least 1 GB, and in that case I would use a 100 MB model (Shannon entropy rules).
  I also tried it on a 100 MB Photoshop file and compressed it down to 45 MB, whereas ZIP only got it down to 60 MB. So yeah, the gains still hold there.
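
A rough way to estimate the crossover asked about above, as a sketch only. It assumes the 45 MB Photoshop figure excludes the 900 KB model and that both compression ratios stay the same at smaller file sizes, neither of which is confirmed in the thread. `break_even_size` is a hypothetical helper.

```python
def break_even_size(model_bytes: float, ratio_model: float, ratio_baseline: float):
    """Original file size at which payload + shipped model equals the baseline output.

    Solves N * ratio_model + model_bytes == N * ratio_baseline for N.
    Returns None when the model-based ratio is not better than the baseline.
    """
    if ratio_model >= ratio_baseline:
        return None
    return model_bytes / (ratio_baseline - ratio_model)


# Assumed inputs: 900 KB model, 45/100 for the transformer, 60/100 for ZIP.
crossover = break_even_size(900_000, 0.45, 0.60)
print(f"break-even at about {crossover / 1e6:.1f} MB")
```

Under those assumptions the model overhead is paid back at around 6 MB, so anything near 100 MB is well past the crossover.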
7373737373 2 days ago
What does it compress the full 1GB file to? http://prize.hutter1.net/
- spidy__ 2 days ago
  I tried it on a 100 MB slice of enwik9 and compressed it to 20 MB, plus the 900 KB transformer, so about 21 MB.
  I know the top submission was able to get it to 13 MB.
  I'm still trying some ideas to get better compression.
purple-leafy 1 day ago
That’s so awesome! I want to try something similar. I’ve been going crazy with compression work. I reckon I can beat the prize linked above.
- spidy__ 13 hours ago
  Really?? Have you published anything so far? Can I read something? Sounds like you've got some interesting ideas.
  - purple-leafy 2 hours ago
    I will be showcasing something on Hacker News soon! Basically I found a way to “compress” a multiplayer game state from ~100KB+ to ~1KB.
    But it’s only for the game I’m building, and it’s not pure compression work. I had to do some tricky things.
    - purple-leafy 1 minute ago
      And just for comparison, my absolute best compression method managed to get down to tens of KB, but the real unlock got to the ~1KB figures.
      For context, these numbers are for a grid-based game where players can perform 4 actions per second, and the numbers I’m sharing are for 30 minutes of gameplay with anywhere from 2 to 1024+ human players playing simultaneously.
      So if you do the math, my compression feat is effectively ~99% compression relative to the naive best case.
      It sounds like absolute bullshit, I know :D
      But I will be publishing a blog post soon, once I release the game.