We use Vega pretty heavily; it's a broad ecosystem. Where it really shines is in combination with Altair and VegaFusion: do the number-crunching on the backend and return a chart spec that can just be rendered on the front-end.
That makes it particularly useful when building interactive visualizations with a lot of data.
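For the curious, here's a minimal sketch of that pattern, assuming Altair 5+ with the vegafusion package installed (the DataFrame and column names below are made up):

```python
# Sketch: pre-aggregate with VegaFusion on the backend, ship a small spec.
import altair as alt
import pandas as pd

alt.data_transformers.enable("vegafusion")  # evaluate transforms server-side

def build_chart_spec(df: pd.DataFrame) -> dict:
    chart = (
        alt.Chart(df)
        .mark_bar()
        .encode(x="category:N", y="mean(value):Q")
    )
    # With VegaFusion enabled, the aggregation runs in Python/Rust and the
    # returned Vega spec contains only the summarised rows.
    return chart.to_dict(format="vega")
```

The front-end then only needs vega-embed (or similar) to render the returned spec.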
Year 1 - Getting the first version out that worked for me and my colleagues.
Year 2 - Cleaning it up enough to promote wider use. Documentation.
Years 3-5 - Minor bug fixes only as I thought qStudio solved the problem.
Year 6 - I realised restricting qStudio to only 1-2 database technologies was foolish. Major change to support many more databases. Improved generic SQL highlighting. Added a partial dark mode.
Years 7-8 - Minor bug fixes.
Year 9 - Added a proper Dark Mode and support for many themes by using FlatLaf. Now looking properly modern.
Year 10 - Realised that I'm not fully solving the problem: for most data analysts I should support creating the analysis itself (pivot tables) and improve exporting (real Excel export, not just nicely escaped CSV).
There were more learnings, like that I should definitely have gone fully open source at the start. It's harder to do later.
Typically data warehouses are OLAP databases that have much better performance than OLTP databases for large queries.
There might also be several applications in a company, each with their own database, and a need to produce reports based on combinations of data from multiple applications.
I think that in many cases your question is based on an idea that is completely right: engineers are too eager to split applications out into multiple databases and tack on separate data warehouses. The cost of maintaining separate databases is often higher than initially thought, especially when some of the data in the warehouse needs to go back into the application database, for example for customer-facing analytics.

I think many companies would be better served by handling traditional data warehousing needs directly in their main application databases and abstaining from splitting out databases. Having one single ACID source of truth and paying a bit more for a single beefy database server makes a lot more sense than is commonly thought, especially now that many customer-facing products, like recommendation systems, are “data driven”. At least that’s my impression after working in the space for a while.
If the Postgres database is recording business transactions, you don't want your business to stop being able to take credit cards because you generated a report.
Furthermore, Postgres is an OLTP (transactional) database, designed to efficiently perform updates and deletes on individual rows. OLAP (analytical) databases/query engines like ClickHouse, Presto, Druid, etc. are designed for efficient processing across many rows of mostly unchanging data.
Analytical queries (like "find the average sales price across all orders over the past year, grouped by store and region") can be 100-1000x faster in an OLAP database compared to Postgres.
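To make that concrete, here's roughly what such a query looks like, using DuckDB purely as an easy local stand-in for an OLAP engine (the table and column names are invented):

```python
# Toy example of the aggregate query described above on a columnar engine.
import duckdb

con = duckdb.connect()
# tiny in-memory stand-in for an "orders" fact table
con.execute("""
    CREATE TABLE orders AS
    SELECT * FROM (VALUES
        ('store_a', 'west', 10.0, DATE '2024-06-01'),
        ('store_a', 'west', 14.0, DATE '2024-07-01'),
        ('store_b', 'east', 20.0, DATE '2023-01-01')
    ) AS t(store, region, sale_price, order_date)
""")
print(con.execute("""
    SELECT store, region, avg(sale_price) AS avg_price
    FROM orders
    WHERE order_date >= current_date - INTERVAL 1 YEAR
    GROUP BY store, region
""").fetchdf())
```

A columnar engine only scans the handful of columns the query touches and can vectorise the aggregation, which is where most of that speed-up comes from.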
That’s the use case for CDC: to make it (almost) as easy to use a DW. As always, the complexity is just air you move around in the balloon. The OLTP DB can spit out the events and forget them; how you load them efficiently is now a data engineer's problem to solve (if it were easy to write at event grain on an OLAP store, you wouldn't need an OLTP one). Kafka usually enters the room at this stage, and the simplification promise becomes tenuous.
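For a rough sense of what that hand-off can look like on the Postgres side, here's a sketch (not anyone's production setup): it assumes a replication slot already created with the wal2json output plugin, and the DSN, slot name, and load_batch() stub are all made up.

```python
# Sketch: stream row-change events out of Postgres via logical replication
# and batch-load them into the warehouse elsewhere.
import json
import psycopg2
from psycopg2.extras import LogicalReplicationConnection

conn = psycopg2.connect("dbname=app user=replicator",
                        connection_factory=LogicalReplicationConnection)
cur = conn.cursor()
cur.start_replication(slot_name="warehouse_feed", decode=True)

buffer = []

def load_batch(rows):
    # stand-in for the real warehouse load (COPY into ClickHouse, files on S3, ...)
    print(f"loading {len(rows)} change events")

def handle(msg):
    buffer.append(json.loads(msg.payload))        # one insert/update/delete event
    if len(buffer) >= 1000:
        load_batch(buffer)
        buffer.clear()
    msg.cursor.send_feedback(flush_lsn=msg.data_start)

cur.consume_stream(handle)                        # blocks; the OLTP side is done
```

The OLTP database's job ends at emitting the events; everything after that (batching, ordering, dedup, schema drift) is the data engineering work the comment is pointing at.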
Additionally, unless your data model is designed as append-only (which is unusual and requires logic downstream), you won't be able to track updates and deletions, which are valuable for reporting.
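A toy illustration of that point, with invented table and column names (DuckDB again just for convenience): when every change is an insert, reporting can still see updates and deletions, and the current state can be derived on demand.

```python
# Append-only "history" table instead of updating rows in place.
import duckdb

con = duckdb.connect()
con.execute("""
    CREATE TABLE order_events (
        order_id   INTEGER,
        status     VARCHAR,     -- 'created', 'updated', 'cancelled', ...
        changed_at TIMESTAMP
    )
""")
# every change is recorded as a new row, so nothing is lost to reporting
con.execute("""
    INSERT INTO order_events VALUES
        (1, 'created',   '2024-01-01 10:00:00'),
        (1, 'cancelled', '2024-01-02 09:30:00')
""")
# the latest state per order can still be reconstructed when needed
print(con.execute("""
    SELECT order_id, arg_max(status, changed_at) AS current_status
    FROM order_events
    GROUP BY order_id
""").fetchdf())
```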
Also, reporting/analytics workloads tend to be ad hoc queries that are hard to optimize, so you generally favor fast storage over indexes. Frequently for analytics and reporting it's more efficient to use a columnar DB than a row-oriented relational DB.
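A small illustration of the columnar point, using Parquet via pyarrow (file, table, and column names invented): an analytical query only has to read the columns it touches rather than whole rows.

```python
# Column pruning: read just the columns an aggregate needs.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "store":      ["a", "a", "b"],
    "region":     ["west", "west", "east"],
    "sale_price": [10.0, 14.0, 20.0],
    "notes":      ["...", "...", "..."],   # wide columns the query never needs
})
pq.write_table(table, "orders.parquet")

# only the two referenced columns are read back from disk
subset = pq.read_table("orders.parquet", columns=["region", "sale_price"])
print(subset.group_by("region").aggregate([("sale_price", "mean")]))
```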
When you need to do medium- to large-scale analytical queries. Postgres is fairly slow for the aggregate/group-by queries analytics needs. Think of building Google Analytics-type functionality.
Data warehouses are structured to handle large volumes of data and complex queries more efficiently than a typical transactional database like PostgreSQL.
This is great, especially for introducing new devs to data models. I use (and love) TablePlus (https://tableplus.com/), which has a diagram generator plugin that does the same!
I don't have a formal compsci background, but one of the most enlightening things I've undertaken recently was a course involving data modeling. This seems like an important exercise many overlook.
We are building a modern alternative to Jupyter, something like Cursor meets Jupyter.