Five years ago I was absolutely frustrated with the state of Graph databases and libraries and tried putting several non-Graph DBMSs behind a NetworkX-like Python interface <https://github.com/unum-cloud/NetworkXum>.
When benchmarked, Neo4J crashed on every graph I tried <https://www.unum.cloud/blog/2020-11-12-graphs>, making SQLite and Postgres much more viable options even for network-processing workloads. So I wouldn't be surprised to learn that people actually use pgRouting and Supabase in that setting.
With the rise of Postgres-compatible databases I'm wondering if it's worth refreshing the project. Similarly, there are now more graph DBs, like MemGraph, that are Cypher-compatible and should probably work much better than Neo4J.
You ran Neo4J with 512MB even though it has always recommended 2GB at a minimum.
And MemGraph is nice, but it's memory-only, whereas Neo4J is designed for very large graphs that live on the filesystem. Not really that comparable.
Check out https://github.com/Pometry/Raphtory. It's written in Rust, embedded (the binaries are about 20 MB), and you can use the Python APIs as a drop-in replacement for NetworkX. Disclaimer: I am one of the people behind it.
This looks super interesting.
Just starting to review it, but my front-of-mind questions:
1) How do I handle persistence? Looks like some code is missing.
2) Do you support multi-tenancy (B2B SaaS graph backend for handling relations scoped to a tenant)?
Thanks
Good questions.
1) You can persist a graph to disk. By default this uses protobuf (`save_to_file`); however, we're migrating to Parquet in the next release for better performance, because we noticed that loading a 100M-edge graph from scratch (CSV, Pandas, or raw Parquet) is actually faster (~1M rows/sec) than loading from persisted proto, which isn't ideal. There's also a private version that uses custom memory buffers for on-disk storage, handling updates and compaction automatically.
2) You can run a Raphtory instance either as a GraphQL server or as an embedded library. For the server, multiple users can query the persisted graphs, which are stored in a simple folder structure with namespaces (for different graphs). For now, access control needs to be managed externally, but it's on our roadmap!
A good 10 years or so ago I was running a solution that used RDF quad stores, and the best one at the time (after trialling 4Store, MarkLogic, and some others I can't remember) was OpenLink Virtuoso. How they managed to fit a performant distributed quad store into what started life as a SQL engine was impressive.
I've left that world now, but if you're in the market for a graph store again, it might be something to look at.
Your graph DB frustrations mirror what many experienced with Neo4j. If you refresh your project, consider including FalkorDB (formerly RedisGraph) - it uses sparse adjacency matrices and GraphBLAS for much better performance while supporting Cypher.
Would be interesting to see updated benchmarks comparing these newer options against PostgreSQL extensions.
I had almost exactly the opposite experience, although my dataset was pretty small.
We wanted to store a graph in Postgres and ended up writing some recursive queries to pull subgraphs, with NetworkX layered over that for the more complex graph operations. We did that for a short while but then switched to Neo4j because of how comparatively easy it was to write queries (although the Python support for Neo4j was severely lacking). We never really stressed it on dataset size, though.
I did manage to crash Redis' graph plugin pretty quickly when I was testing that.
Not sure what you consider "pretty small", and I don't know how NetworkX works, but PostgreSQL recursive queries have worked well for me for small graphs.
Could you share what the data structure and scale was?
We basically had a single table that we wanted to be able to nest on itself arbitrarily. Think categories and subcategories; maybe 100k nodes/rows.
Postgres worked fine, but Cypher is so much more expressive and handles stuff like loop detection for you; Neo4j was much easier to work with. Performance wasn't ever really an issue with either.
Note that more recent versions of Postgres have added support for the CYCLE keyword, for easier loop detection.
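A minimal sketch of what that looks like, assuming a hypothetical self-referencing categories(id, parent_id, name) table like the one described above:

    -- Pull the subtree under category 42; the CYCLE clause (Postgres 14+)
    -- marks rows that revisit an id so the traversal stops instead of looping.
    WITH RECURSIVE subtree AS (
        SELECT id, parent_id, name
        FROM categories
        WHERE id = 42
      UNION ALL
        SELECT c.id, c.parent_id, c.name
        FROM categories c
        JOIN subtree s ON c.parent_id = s.id
    ) CYCLE id SET is_cycle USING path
    SELECT id, name
    FROM subtree
    WHERE NOT is_cycle;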
We have something similar in Postgres, but IMO disconnectedness also plays a really big part in this whole calculation. We actually ended up just caching the transitive closure for fast operations (and simpler code).
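Concretely, the usual shape is a separate closure table kept in sync on insert; a rough sketch with made-up names:

    -- One row per (ancestor, descendant) pair, including self-pairs.
    CREATE TABLE category_closure (
      ancestor   int NOT NULL,
      descendant int NOT NULL,
      depth      int NOT NULL,
      PRIMARY KEY (ancestor, descendant)
    );

    -- Inserting node 7 under parent 3: copy the parent's ancestry,
    -- then add the self-row. Subtree reads become plain indexed lookups.
    INSERT INTO category_closure (ancestor, descendant, depth)
    SELECT ancestor, 7, depth + 1
    FROM category_closure
    WHERE descendant = 3
    UNION ALL
    SELECT 7, 7, 0;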
Have you tried NetworkDisk[1] to manipulate NetworkX graphs on disk?
[1] https://networkdisk.inria.fr/
Neo4j is pretty bad and very dated. No idea what they’re doing. MemGraph is a much better tool.
Really, graph is a feature and not a product.
Neo4J is mature, not dated, which is why it's so popular.
And I couldn't disagree more that graph is a feature. You really want something optimised for it (query language / storage approach), as the data structure is so different in every way from a relational or document store.
> ...as the data structure is so different in every way from a relational or document store.
No, it is not.
[1] https://en.wikipedia.org/wiki/Worst-case_optimal_join_algori...
Graph processing can create a substantial amount of intermediate data if it is done in typical join-implementation fashion (nested loops or hash joins). So it may appear that graph processing needs a tailored approach.
But what can help graph algorithms can help SQL query execution as well, and vice versa; see the link above.
For example, TPC-DS contains queries that (indirectly) join the same tables multiple times (query 4, for example). This is, basically, a kind of centrality-metric computation over a graph represented by the tables.
How is it different? Isn't a graph basically two sets of tuples, edges and nodes? I played with Cayley (Google) for a little while, and that was my impression.
I think it's less a matter of "can you represent graphs in a relational DB" (of course you can), and more about what kind of queries the DB is optimised for. Graph databases are intended for complex recursive queries on relatively unstructured data. You could certainly do that in SQL if you wanted to, but you'll pay for it performance-wise.
Graph query languages also make those kinds of queries much easier to express in the first place.
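To make that concrete, here's a bounded "reachable within 3 hops" query over a hypothetical edges(src, dst) table, with the roughly equivalent Cypher shown as a comment:

    -- Roughly equivalent Cypher, as a one-liner:
    --   MATCH (a {id: 1})-[*1..3]->(b) RETURN DISTINCT b.id
    WITH RECURSIVE reachable(id, depth) AS (
        SELECT dst, 1 FROM edges WHERE src = 1
      UNION
        SELECT e.dst, r.depth + 1
        FROM edges e
        JOIN reachable r ON e.src = r.id
        WHERE r.depth < 3
    )
    SELECT DISTINCT id FROM reachable;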
So the underlying storage is conventional, it's still tuples of some kind, and it's only a matter of how indexes are laid out? Otherwise, I'm struggling to see how it could "optimise" for certain access patterns. How would a typical graph database index be different from a btree access method in Postgres?
I don't know much about the internal details of postgres. But there is a ton of detail underlying "it's just tuples of some kind" and there are lots of ways to implement indices, no? Is it so difficult to imagine that different implementations have different performance properties?
There's also the query planner layer to think about too.
Thank you, I'm well aware that a class of software called "databases" exists
No need for the snark. If you want specific details of how Postgres differs from graph databases, I have nothing for you. I just find your position that btrees are optimised for every query structure... obviously false on general grounds? For example, one thing you can do to make recursive queries faster is to store relations as direct pointers of some kind, rather than doing index scans for every level of join.
Perhaps we're talking past each other about the word "optimised".
My original goal in this article was to figure out if pgrouting would be a good tool to build a memory-layer (for AI/agents) but the article got a bit long. Early results are promising - I’ll follow up with another article soon
there are some other interesting extensions in this space - onesparse[0] is early in development but pretty exciting, as it builds on SuiteSparse, which is very mature
[0] https://onesparse.com/docs.html
Thanks Paul! OneSparse author here. We're still in early dev stages (OneSparse requires some new features in Postgres that won't be available until pg18), but my plan is to do a benchmark shootout with the various graph tools for Postgres. Initial results look good; on my 4-core 11th-gen Intel laptop we're getting some really good numbers!
Orkut was as big as I could go due to limited RAM. One of my constraints is limited access to big-enough hardware to do the kinds of Graphs Of Unusual Size (billions of edges, trillions of triangles) where we can really flex the scale that CUDA support gives us. Stay tuned!
Maybe a stupid question, but looking just at the data model (and not e.g. the query language): is there a real difference between a "graph" database and a "normal SQL" database when the SQL database is able to point directly to rows (ROWID?) without a separate index?
Supabase consistently puts out such fantastic bite-sized gems, and for me some of the favorites have been related to PostGIS, whether it's serving tiles directly or this (ab)use of functionality typically used in a PG geospatial context. Nothing revolutionary or massive and complex, not like reading DDIA of course, but fun and mentally activating, making me want to jump into something new. I really applaud them for frequently posting genuinely engaging content that gets you excited to work with databases... it sounds silly to say it like that, but I do regularly get struck by a feeling of sadness when I realize how vanilla all of my daily interactions with databases are.
I’ve always wondered why there isn’t a “SQLite for graphs,” so to speak. Is there something about how they have to be stored that precludes an in-process solution with disk-based storage?
https://www.hillelwayne.com/post/graph-types/ gives an interesting take on why we don't see a graph type as a primitive in more programming languages. It essentially boils down to graphs being very vague: depending on the topology of your graphs, you are going to want different implementations for reasonable efficiency.
That said, there are graph databases.
Interesting! Thanks for the link. I suppose the graph databases just take an opinionated approach. NetworkX is great; I always wished it had a simple backend.
There is now: KuzuDB, which is embedded as well.
Anybody have any experience creating isochrones using pgRouting? I have a use case that involves generating isochrone maps for walking, biking, etc., but I'd like to just use Postgres if possible and avoid another piece of infrastructure like Valhalla, OpenTripPlanner, OpenRouteService, etc.
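From what I can tell, pgRouting's pgr_drivingDistance is the relevant building block: it returns every node reachable within a given cost of a start vertex. A sketch of what I mean, assuming a ways(id, source, target, cost) table in the shape pgRouting expects, with cost in seconds:

    -- All nodes reachable within 600 seconds of vertex 1234; the node
    -- geometries can then be polygonized with PostGIS ST_ConcaveHull
    -- (or pgr_alphaShape) to get the actual isochrone shape.
    SELECT node, agg_cost
    FROM pgr_drivingDistance(
      'SELECT id, source, target, cost FROM ways',
      1234,  -- start vertex id
      600    -- maximum aggregate cost
    );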
Interested in hearing some thoughts about using roaring bitmaps stored in a bytea Postgres column to represent adjacency matrices.
Given that RDS has support for plrust and PostgreSQL's SPI, I was thinking I could use the fact that croaring-rs is supported there as a crate and build upon that.
I figure I can use that to represent many graphs with, say, 100s to ~100M nodes and many relations between these things. But each graph would be scoped to a tenant (company / B2B SaaS use case).
The idea is that by using plrust, storing the roaring bitmap on the DB server in a bytea, and using SPI, I can mutate and query the bitmap with croaring with minimal network overhead. Running locally in the DB server via SPI eliminates the overhead of shipping the bitmap back to my application code.
PostgreSQL also gives me transaction safety for updates, etc., and a bunch of support for other columnar data, such as my tenant ID column and some JSONB for relationship metadata to query on.
Basically something like https://jazco.dev/2024/04/20/roaring-bitmaps/ but on Postgres. Given that I need to support many tenanted graphs and we're already using Citus, this seems like something that is feasible at larger scale too.
I was wondering, though, if I am going to need to create some operator classes to let me index relations a bit better (seems likely, I think).
I am aware of https://github.com/ChenHuajun/pg_roaringbitmap but would prefer to use int64s, and maybe start out on RDS instead of having to add another workload to our self-hosted Citus cluster(s).
Happy to be told I am a fool, and any insights would be nice. I am potentially going to try this out on some of our data sets, because our product team is basically laying out a vision where they want a graph powering a bunch of things.
I don't like the idea of Neo4j when we're already deep into PostgreSQL for a bunch of workloads (~20+ TB table workloads, etc., so we have some reasonable in-house PG experience).
Also, huge thanks to the author of the blog post. I had been looking at pgRouting and wondering, with a tilted head... hmm, seems like we can just use this as a graph DB. So that is also on my list to test out.
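For concreteness, the kind of shape I have in mind, sketched with pg_roaringbitmap (hypothetical table and column names; note pg_roaringbitmap is int32-only, which is part of why I'd prefer int64):

    -- One adjacency bitmap per (tenant, node).
    CREATE EXTENSION roaringbitmap;

    CREATE TABLE adjacency (
      tenant_id bigint NOT NULL,
      node_id   int    NOT NULL,
      neighbors roaringbitmap NOT NULL DEFAULT rb_build('{}'),
      PRIMARY KEY (tenant_id, node_id)
    );

    -- Add edges 1 -> 7 and 1 -> 9 for tenant 42.
    UPDATE adjacency
    SET neighbors = neighbors | rb_build('{7,9}')
    WHERE tenant_id = 42 AND node_id = 1;

    -- Common neighbors of nodes 1 and 2 (bitmap intersection).
    SELECT rb_to_array(a.neighbors & b.neighbors)
    FROM adjacency a
    JOIN adjacency b USING (tenant_id)
    WHERE tenant_id = 42 AND a.node_id = 1 AND b.node_id = 2;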
I'm working on a little Postgres graph DB project. The querying and table structure are much simpler for the same task:
https://memelang.net/03/ https://github.com/memelang-net/memesql3
Any comments on "Apache AGE"?
Apache AGE™ is a PostgreSQL extension that provides graph database functionality.
https://age.apache.org
I think Apache AGE is much more generic, as it can parse Cypher queries and comes with a bunch of utility functions.
The OP article is more of a hack, and a good one! It seems like you can achieve a lot of what you might expect from a graph database with pgRouting functions and good old SQL.
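For flavor, a minimal AGE session looks roughly like this (based on the AGE quickstart; the graph and label names are made up):

    CREATE EXTENSION age;
    LOAD 'age';
    SET search_path = ag_catalog, "$user", public;

    SELECT create_graph('social');

    -- Cypher runs inside SQL; results come back as agtype.
    SELECT * FROM cypher('social', $$
      CREATE (:Person {name: 'alice'})-[:FOLLOWS]->(:Person {name: 'bob'})
    $$) AS (v agtype);

    SELECT * FROM cypher('social', $$
      MATCH (a:Person)-[:FOLLOWS]->(b:Person)
      RETURN a.name, b.name
    $$) AS (follower agtype, followee agtype);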
Great article! Although it seems that the last section (about YouTube recommendations) is incomplete, as there's no query to actually calculate the recommendations.
You can always trust Postgres to have another extension that opens up great new data-modelling opportunities. This is great. Wonder how this stacks up against CedarDB's (Postgres-compatible) graph capabilities.
Resourceful, but is there a reason to use this approach over pgvector?
There is an unfortunate overloading of the terms "relational" and "relationship" in relational databases.
A relationship is an association between relations/tables: parent-child, node/edge, etc., depending on the model, extensions, and so on.
There are three basic models of databases:
A "relational" in RDBMS and Codd's rules is just a table data structure with some additional rules.Part of those rules are a named table, with named and typed attributes (columns) with data in the form of rows of tuples.
pgvector is nearest-neighbor search over tuple values, often within a single table/relation, while pgRouting is graph traversal over relational data.
There is a bit more to it: in the relational model the data is independent of the schema, and no RDBMS is pure.
It is possibly helpful to realize that pgvector is about finding neighbors among the tuples in a relation/table, and that this is very different from graph traversal.
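A concrete side-by-side (a sketch; the table names are made up, and it assumes the pgvector and pgRouting extensions are installed):

    -- pgvector: nearest neighbors among the tuples of one relation.
    SELECT id
    FROM items
    ORDER BY embedding <-> '[0.1, 0.2, 0.3]'
    LIMIT 5;

    -- pgRouting: traversal across rows linked as edges.
    SELECT *
    FROM pgr_dijkstra(
      'SELECT id, source, target, cost FROM edges',
      1, 99  -- from node 1 to node 99
    );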
Has anyone done something like this with a very large dataset? Hundreds of millions of rows.
For sub-billion-scale graph queries, we have been doing this in-memory via GFQL, which runs graph queries and analytics on top of pandas (columnar CPU dataframes) and cuDF (columnar GPU dataframes). Sub-billion is typically small enough that a single node is fine. We began development after years of being annoyed at having no good, simple OSS in-process solution for small graphs like these. Avoiding jumping through architectural hoops to deal with multiple systems-of-record infra is nice, as we can often just do it in the compute tier.
Once big graphs get involved, scalable systems, especially those that separate storage from compute and price accordingly, get much more interesting. We work with our partners like Databricks, Google Spanner, AWS Neptune, etc., who have different sweet spots that really depend on workload and context; they're all pretty different. OLTP vs OLAP, etc.
Supabase is such a gem.