Lakebase architecture delivers faster Postgres writes

(databricks.com)

117 points | by sp_from_db 3 days ago

10 comments

hardwaresofton 18 hours ago
This is essentially a re-explanation of Neon’s architecture as a blog post.
Amazing that the Postgres ecosystem got this software for “free” (as in at least a basic version of it is F/OSS, IIRC there weren’t any core bits held back), and the extremely engineer-heavy company got to make money, AND they got bought out in true acquisition style by a larger player that truly benefits from the tech.
The Postgres ecosystem is pretty unique in its ability to produce a “boring” stable product, innovate, stay F/OSS, and create financial outcomes for participants.
- philippemnoel 12 hours ago
  This, 100 times^. The Postgres ecosystem is remarkable and has managed to strike the balance between OSS and commercial success in a way that most infra verticals have not.
- conradludgate 12 hours ago
  (Neon/databricks employee here)
  Neon also only just disabled FPWs - so there is new substance here. We published a similar blog on Neon
  https://neon.com/blog/turning-off-fpw-for-faster-writes
- xyzzy_plugh 17 hours ago
  As far as I know Neon's open source repositories are no longer being updated/maintained.
  - hardwaresofton 17 hours ago
    You can't have it all, forever -- the tech is there for anyone to fork and improve, build new businesses (!), inspect, etc. F/OSS is basically a miracle as-is.
    If we compare the current state of the world to one in which they were acquired and then continued to put out more F/OSS, things look bad (which I assume is your implication). I choose to instead make the comparison to the world where we never see this tech and it stays proprietary. Sure, eventually someone in F/OSS might have gotten around to building this solution, but they pulled forward the future and we get to see and build on the result for free.
uhoh-itsmaciek 1 day ago
>Without those periodic full page images in the log, the storage layer would have to replay an infinitely long chain of small deltas to reconstruct a page for a read request. What was once a bounded O(checkpoint frequency) replay becomes an unbounded chain, leading to a spike in read latency and resource consumption.
I don't follow: read requests are not served from the WAL. They read the current state of the page from the buffer cache, where the page is updated after the change (FPI or not) is written to the WAL.
- nikita 23 hours ago
  This applies to our storage implementation. In the Lakebase architecture, storage serves pages. It doesn't always have the most recent version of a page, so it reconstructs it on demand.
  In the past we relied on Postgres compute to periodically send a full page, so reconstructing a page was always a bounded process. Once we turned that off (and got all those perf gains) we got another problem: unbounded page reconstruction, which we had to solve separately.
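To make the bounded vs. unbounded distinction concrete, here is a minimal sketch of on-demand page reconstruction. It is a toy model, not Neon's or Lakebase's actual code: `FullImage`, `Delta`, `reconstruct`, and `compact` are all hypothetical names, and `compact` is just one plausible way a storage layer could bound the replay chain itself once compute stops sending full page images.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

PAGE_SIZE = 16  # toy page size in bytes; real Postgres pages are 8 kB by default


@dataclass
class FullImage:
    data: bytes  # complete page contents


@dataclass
class Delta:
    offset: int  # where in the page the change starts
    data: bytes  # bytes written at that offset


Record = Union[FullImage, Delta]


def reconstruct(history: List[Record]) -> Tuple[bytes, int]:
    """Rebuild the latest page version; also return how many deltas were replayed."""
    start = 0
    page = bytearray(PAGE_SIZE)  # assume an all-zero page if no image exists
    # Find the most recent full image; only deltas after it need replaying.
    for i in range(len(history) - 1, -1, -1):
        rec = history[i]
        if isinstance(rec, FullImage):
            page = bytearray(rec.data)
            start = i + 1
            break
    replayed = 0
    for rec in history[start:]:
        page[rec.offset:rec.offset + len(rec.data)] = rec.data
        replayed += 1
    return bytes(page), replayed


def compact(history: List[Record]) -> List[Record]:
    """Hypothetical storage-side fix: materialize an image so later reads stay bounded."""
    page, _ = reconstruct(history)
    return [FullImage(page)]
```

With periodic full images in the log, `replayed` is capped by how many deltas fit between two images. Without them it grows with every write to the page until something like `compact` runs.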
gavinray 22 hours ago
So, the general architecture described here is solid, and I support it, but I take issue with the "Lakebase" naming thing.
Disaggregated storage and disaggregated compute have been an open trend in DBMS development for the last half-decade. This is an obvious move with modern computing paradigms, and the academic literature has a standard name for it.
This feels like "JAMStack" from Netlify happening all over again.
I tweeted about this in 2022, as a general trend, and also from the RocksDB meetup emphasizing disaggregated storage:
- https://x.com/GavinRayDev/status/1607769112234823680
- https://x.com/GavinRayDev/status/1600666127025156096
- jeremyjh 22 hours ago
  I don't think it should be surprising that vendors are not going to lead with "disaggregated storage". I don't see that taking off either. This isn't a paper in a journal. Aurora doesn't call it that either. But yes, it is not a new idea.
  - gavinray 22 hours ago
    Avoiding industry-standard names and trying to introduce your own convention comes off as hubristic and grift-ey to me.
    "Basic literacy" -> "Prompt Engineering"
    "P2P networking" -> "Web3"
    "Service-Oriented Architecture" -> "Microservices"
    Maybe I'm old-man-yelling-at-cloud.
    - jeremyjh 21 hours ago
      It's more like old-man-yelling-at-billboard. It's just marketing. It's like complaining about the font they chose.
- nikita 22 hours ago
  Lakebase refers to the fact that, in addition to disaggregated storage, S3 is the authoritative storage for older data.
  Since data is on S3 (or a lake), you can perform direct-to-S3 operations like data loading, reading this data with engines that are not Postgres, and more.
  - gavinray 22 hours ago
    > in addition to disaggregated storage S3 is authoritative storage for older data
    Suppose a person retrieves cold data from another Object Storage protocol rather than S3. This is no longer a "Lakebase", so we have to come up with a different name to avoid confusion.
    But if you say "Disaggregated Storage on S3" then you have the flexibility to change that to "Disaggregated Storage on FOOBAR" to avoid confusion.
    - majormajor 21 hours ago
      > Suppose a person retrieves cold data from another Object Storage protocol rather than S3. This is no longer a "Lakebase", so we have to come up with a different name to avoid confusion.
      I've never seen "lake" or adjacent terminology refer to S3 specifically like that vs other object storage. A data lake on Ceph would still be a data lake.
      (My quibble would be that "lake" often refers to inconsistent or unstructured data, and itself has always been a bit handwavy compared to "warehouse," whereas this is very structured data on object storage.)
      - andyferris 20 hours ago
        Yes.
        Maybe I’m wrong, but AFAICT this is block (page) storage backed by S3, tuned for Postgres with some paxos-linked storage/caching servers sitting in front? Sounds good, but I’m not sure “lake” or “warehouse” is a word I’d choose… much closer to Litestream-with-reads, or the somewhat-famous “I ran out of RAM so I downloaded some more” blog article.
noashavit 5 hours ago
Lakebase is based on Neon; that is why it was acquired. These are the performance gains from that underlying tech.
faangguyindia 17 hours ago
Most problems with overgrowing data can be fixed by having a data deletion rule.
Many people just keep adding data and think "maybe it will be useful in the future" till their system goes down.
Much of your data is essentially useless for anything in the future.
You can simply have a data retention policy, and for most apps this ensures your data does not grow too large.
- hasyimibhar 15 hours ago
  In some cases you have no choice but to retain the data, e.g. due to compliance. But the good thing is it doesn't have to be in Postgres. You can periodically offload data to a lakehouse, then delete it from Postgres. If the table is partitioned, delete should be cheap.
  I'm guessing with Neon, since their storage is a lakehouse, you get this for free.
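A rough sketch of the "offload, then drop the partition" idea, assuming a table partitioned by month. The partition naming scheme, the `events` table, and both helper functions are made up for illustration. The functions only build the SQL strings, and exporting the data to the lakehouse is assumed to have happened before the statements are run.

```python
from datetime import date
from typing import Dict, List


def expired_partitions(partitions: Dict[str, date], today: date, keep_months: int) -> List[str]:
    """Return partitions whose month starts before the retention cutoff.

    `partitions` maps a partition table name to the first day of the month it covers.
    """
    # Step back keep_months whole months from the first day of the current month.
    months = today.year * 12 + (today.month - 1) - keep_months
    cutoff = date(months // 12, months % 12 + 1, 1)
    return sorted(name for name, start in partitions.items() if start < cutoff)


def retention_sql(parent: str, expired: List[str]) -> List[str]:
    """Build the statements to run once the data has been exported elsewhere."""
    stmts = []
    for name in expired:
        # Dropping a whole partition avoids a row-by-row DELETE.
        stmts.append(f"ALTER TABLE {parent} DETACH PARTITION {name};")
        stmts.append(f"DROP TABLE {name};")
    return stmts
```

For example, with monthly partitions for January through March 2024, `today = date(2024, 4, 15)` and `keep_months = 2`, only the January partition falls before the cutoff of 2024-02-01.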
- redwood 17 hours ago
  This is orthogonal to many classes of write throughput. In fact, in many cases deletes themselves effectively add to the write workload.
hasyimibhar 19 hours ago
How does Lakebase compare to OrioleDB[0]?
[0] https://www.orioledb.com/
erikcw 22 hours ago
How does Lakebase compare to Ducklake[0]?
[0] https://ducklake.select/
- jeremyjh 22 hours ago
  Lakebase is for transactional use cases - this is more comparable to AWS Aurora.
- nikita 22 hours ago
  Lakebase is OLTP.
nikita 1 day ago
I'm a VP at Databricks and former CEO of Neon. Happy to answer performance-related or any other questions here.
- jeremyjh 21 hours ago
  The blog article[1] that it links to says "Unified transactional and analytical workloads: Lakebase integrates seamlessly with the Lakehouse, sharing the same storage layer across OLTP and OLAP. This makes it possible to run real-time analytics, machine learning, and AI-driven optimization directly on transactional data without moving or duplicating it."
  Is the "without moving or duplicating" part actually a true statement? If the actual table state is only reconstructed by the pageserver, it's not like Spark can just read it from S3.
  [1] https://www.databricks.com/blog/what-is-a-lakebase
- weli 1 day ago
  How does it affect HA postgres? (Replicas, consensus, etc). Especially with extensions like citus.
  - nikita 23 hours ago
    This specific perf improvement is orthogonal to HA.
    However generally disaggregating storage makes HA simpler and allows for things like zero downtime patching: https://www.databricks.com/blog/zero-downtime-patching-lakeb...
    Read replicas can be "shallow". You don't need to replicate all the data to create a replica. This allows creating them very quickly (sub-second).
    All the extensions still work. We don't support Citus today, but mostly because customers are not asking for it rather than due to technical limitations. We support lots of extensions: https://docs.databricks.com/aws/en/oltp/projects/extensions
- Veelox 23 hours ago
  Thanks for offering. In the graph labeled "Prod customer throughput: (higher is better)" eyeballing it within a week you are seeing ~2k qps peak increase over the previous week.
  Operationally, how do you handle landing that large of a perf improvement? If my data store changed that much in a week it could break something.
  - nikita 20 hours ago
    Generally the more throughput the system supports the better. In this case we were hitting limits (btw each operation is many queries of different sizes) and the customer observed higher latencies which is typical if the system can't sustain the throughput required.
    After this change latencies are back to normal and throughput increased.
    - Veelox 19 hours ago
      Ahh, so it was a customer pain point of higher latency so they were happy to see latency go down and throughput go up. Good to hear.
      Great write up, cheers to the people involved.
- mixolydianagain 14 hours ago
  Hi Nikita. Can you share any of Neon's techniques for minimizing noisy-neighbor issues in the multi-tenant storage services? Thanks!
  - nikita 10 minutes ago
    * Rate limiting on proxy in front of compute fleet
    * Large tenants are broken up into shards, reducing hotspots
    * Each shard is throttled to a fixed req/s rate
    * We do not run pageservers at their redline in terms of CPU load, so there is some slack to take up bursts
    * Capacity quotas which selectively throttle write traffic to the largest databases if they are competing with others for disk space, until the larger database is migrated away.
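As an illustration of the per-shard throttling bullet above, here is a small token-bucket sketch. It is a generic technique, not Neon's implementation: `TokenBucket`, `ShardThrottle`, and the rate and capacity numbers are assumptions chosen for the example, and time is passed in explicitly to keep the behavior deterministic.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class TokenBucket:
    rate: float       # tokens added per second (the fixed req/s cap)
    capacity: float   # maximum burst size
    tokens: float     # tokens currently available
    last: float       # timestamp of the previous refill, in seconds

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class ShardThrottle:
    """One bucket per (tenant, shard), so a hot tenant only exhausts its own budget."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[Tuple[str, int], TokenBucket] = {}

    def allow(self, tenant: str, shard: int, now: float) -> bool:
        key = (tenant, shard)
        if key not in self.buckets:
            # New buckets start full.
            self.buckets[key] = TokenBucket(self.rate, self.capacity, self.capacity, now)
        return self.buckets[key].allow(now)
```

Because each shard has its own bucket, a burst from one large tenant is rejected at its own shard while requests for other tenants and shards keep passing.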
mystraline 1 day ago
I'm not a proper DBA, but I oversee some basic postgres installs (read: logging, monitoring, upgrades).
This appears to only have any effect with datalake style installs, where storage is separate from compute.
Not going to have any effect on those small postgres installs for that generic one off app.
- tempest_ 23 hours ago
  Everyone thinks they need a data lake when most people just need a data pond or data puddle. This is made worse by the industry disappearance of the DBA role and compounded by the fact that PG is not especially easy to tune.
  All of this to say that a ton of people are on some sort of managed cloud postgres where the compute is almost always separated from the storage even for the small instances.
  Neon et al. will tell you they scale, and I am sure they can, but the number of enterprises that actually exceed what can be put on a few large servers is pretty low. You gotta lock them in early so their orgs never develop the expertise to move off on the off chance they get big.
- nikita 23 hours ago
  We provide fully managed Postgres. Lots of our customers use it for lots of small instances of Postgres since using Lakebase is so lightweight.
  Small and large instances benefit from this performance optimization.
  - mystraline 4 hours ago
    Again, when the users of this app are just 3 admins looking over basic metrics from ArcGIS Monitor, I don't need to deviate from their recommended settings, tune, or do much of anything.
    This isn't some high-throughput app where every IOPS matters.
    And in my experience, most apps that need a database aren't serving 1000s of people per hour. Most are fine with a few users per hour.
    The whole "scale to scale cause you're 1 minute away from HN hug of death" is just wishful thinking and budget bloat from the hyperscalers selling 'butwhatifs'. I've called AWS to task on their 5 pillars crap, which basically says "pay double for redundancy". Gee, I wonder why AWS would recommend customers pay double (eyeroll).