Instead of training the model to generalize, I train a 900KB transformer to memorize a single file and predict the next byte. Those predictions are fed into an arithmetic coder to produce the compressed output.
On a 100MB NYC taxi CSV, it compresses to about 7MB (~0.5 bits/byte). On a 100MB slice of enwik9, it compresses to about 21MB (~1.68 bits/byte).
It's pretty slow right now (roughly 20–30 minutes of training and 45 minutes each for compression and decompression on my AMD 7800XT).
Checkout the repo - https://github.com/samyak112/pym-particles
I tested with 100 MB files because anything larger takes a long time to evaluate. The actual target was at least 1 GB, and in that case I would use a 100 MB model (Shannon entropy rules).
I also tried it on a 100 MB Photoshop file and was able to compress it down to 45 MB, whereas ZIP could only get it down to 60 MB. So yeah still not losing gains.
I know the top submission was able to get it to 13 mb.
Still trying some ideas to get better compression.
But it’s only for the game I’m building and it’s not pure compression work, I had to do some tricky things
For context these numbers are for a grid based game where players can perform 4 actions per second, and the numbers I’m sharing are for 30 minutes of gameplay with anywhere from 2-1024+ players (human players) playing simultaneously
So if you do the math, my compression feat is effectively ~99% compression on naive best case.
It sounds absolutely bullshit I know :D
But I will be posting a blog post soon once I release the game