cozodb/cozoPublic

NotificationsYou must be signed in to change notification settings
Fork118
Star3.7k

A transactional, relational-graph-vector database that uses Datalog for query. The hippocampus for AI!

License

MPL-2.0 license

3.7k stars 118 forks Branches Tags Activity

Star

Notifications

You must be signed in to change notification settings

Branches Tags

Folders and files

Name		Name	Last commit message	Last commit date
Latest commit History 1,813 Commits
.github/workflows		.github/workflows
cozo-bin		cozo-bin
cozo-core-examples		cozo-core-examples
cozo-core		cozo-core
cozo-lib-c		cozo-lib-c
cozo-lib-java		cozo-lib-java
cozo-lib-nodejs		cozo-lib-nodejs
cozo-lib-python		cozo-lib-python
cozo-lib-swift		cozo-lib-swift
cozo-lib-wasm		cozo-lib-wasm
cozorocks		cozorocks
scripts		scripts
signatures/version1		signatures/version1
static		static
.gitignore		.gitignore
.gitmodules		.gitmodules
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
LICENSE.txt		LICENSE.txt
README-zh.md		README-zh.md
README.md		README.md
VERSION		VERSION

Repository files navigation

`CozoDB`

🎉🎉🎉 New versions 🎉🎉🎉

Version v0.7: after HNSW vector search from 0.6, in 0.7 we bring to you MinHash-LSH for near-duplicate search, full-textsearch, Json value support and more! Seehere for more details.

Version v0.6 released! This version brings vector search with HNSW indices inside Datalog, which can be integratedseamlessly with powerful features like ad-hoc joins, recursive Datalog and classical whole-graph algorithms. Thissignificantly expanded the horizon of possibilities of CozoDB.

Highlights:

You can now create HNSW (hierarchical navigable small world) indices on relations containing vectors.
You can create multiple HNSW indices for the same relation by specifying filters dictating which rows should beindexed, or which vector(s) should be indexed for each row if the row contains multiple vectors.
The vector search functionality is integrated within Datalog, meaning that you can use vectors (either explicitlygiven or coming from another relation) as pivots to perform unification into the indexed relations (roughly equivalentto table joins in SQL).
Unification with vector search is semantically no different from regular unification, meaning that you can even usevector search in recursive Datalog, enabling extremely complex query logic.
The HNSW index is no more than a hierarchy of proximity graphs. As an open, competent graph database, CozoDB exposesthese graphs to the end user to be used as regular graphs in your query, so that all the usual techniques for dealingwith them can now be applied, especially: community detection and other classical whole-graph algorithms.
As with all mutations in CozoDB, the index is protected from corruption in the face of concurrent writes by usingMulti-Version Concurrency Control (MVCC), and you can use multi-statement transactions for complex workflows.
The index resides on disk as a regular relation (unless you use the purely in-memory storage option, of course).During querying, close to the absolute minimum amount of memory is used, and memory is freed as soon as the processingis done (thanks to Rust's RAII), so it can run on memory-constrained systems.
The HNSW functionality is available for CozoDB on all platforms: in the server as a standalone service, in yourPython, NodeJS, or Clojure programs om embedded or client mode, on your phone in embedded mode, even in the browserwith the WASM backend.
HNSW vector search in CozoDB is performant: we have optimized the index to the point where basic vector operationsthemselves have become a limiting factor (along with memcpy), and we are constantly finding ways to improve our newimplementation of the HNSW algorithm further.

Seehere for more details.

Introduction

CozoDB is a general-purpose, transactional, relational databasethat usesDatalog for query, isembeddable but can also handle huge amounts of data and concurrency,and focuses ongraph data and algorithms.It supportstime travel and it isperformant!

What doesembeddable mean here?

A database is almost surely embeddedif you can use it on a phone whichnever connects to any network(this situation is not as unusual as you might think). SQLite is embedded. MySQL/Postgres/Oracle are client-server.

A database isembedded if it runs in the same process as your main program.This is in contradistinction toclient-server databases, where your program connects toa database server (maybe running on a separate machine) via a client library. Embedded databasesgenerally require no setup and can be used in a much wider range of environments.
We say CozoDB isembeddable instead ofembedded since you can also use it in client-servermode, which can make better use of server resources and allow much more concurrency thanin embedded mode.

Whygraphs?

Because data are inherently interconnected. Most insights about data can only be obtained ifyou take this interconnectedness into account.

Most existinggraph databases start by requiring you to shoehorn your data into the labelled-property graph model.We don't go this route because we think the traditional relational model is much easier to work with forstoring data, much more versatile, and can deal with graph data just fine. Even more importantly,the most piercing insights about data usually come from graph structuresimplicit several levels deepin your data. The relational model, being analgebra, can deal with it just fine. The property graph model,not so much, since that model is not very composable.

What is so cool aboutDatalog?

Datalog can express allrelational queries.Recursion in Datalog is much easier to express,much more powerful, and usually runs faster than in SQL. Datalog is also extremely composable:you can build your queries piece by piece.

Recursion is especially important for graph queries. CozoDB's dialect of Datalogsupercharges it even further by allowing recursion through a safe subset of aggregations,and by providing extremely efficient canned algorithms (such as PageRank) for the kinds of recursionsfrequently required in graph analysis.
As you learn Datalog, you will discover that therules of Datalog are like functionsin a programming language. Rules are composable, and decomposing a query into rulescan make it clearer and more maintainable, with no loss in efficiency.This is unlike the monolithic approach taken by the SQLselect-from-where in nested forms,which can sometimes read likegolfing.

Time travel?

Time travel in the database setting meanstracking changes to data over timeand allowing queries to be logically executed at a point in timeto get a historical view of the data.

In a sense, this makes your databaseimmutable,since nothing is really deleted from the database ever.
In Cozo, instead of having all data automatically supporttime travel, we let you decide if you want the capabilityfor each of your relation. Every extra functionality comeswith its cost, and you don't want to pay the price if you don't use it.
For the reason why you might want time travel for your data,we have written ashort story.

How performant?

On a 2020 Mac Mini with the RocksDB persistent storage engine (CozoDB supports many storage engines):

Running OLTP queries for a relation with 1.6M rows, you can expect around 100K QPS (queries per second) for mixedread/write/update transactional queries, and more than 250K QPS for read-only queries, with database peak memory usagearound 50MB.
Speed for backup is around 1M rows per second, for restore is around 400K rows per second, and is insensitive torelation (table) size.
For OLAP queries, it takes around 1 second (within a factor of 2, depending on the exact operations) to scan a tablewith 1.6M rows. The time a query takes scales roughly with the number of rows the query touches, with memory usagedetermined mainly by the size of the return set.
Two-hop graph traversal completes in less than 1ms for a graph with 1.6M vertices and 31M edges.
The Pagerank algorithm completes in around 50ms for a graph with 10K vertices and 120K edges, around 1 second for agraph with 100K vertices and 1.7M edges, and around 30 seconds for a graph with 1.6M vertices and 32M edges.

For more numbers and further details, we have a writeupabout performancehere.

Getting started

Usually, to learn a database, you need to install it first.This is unnecessary for CozoDB as a testimony to its extreme embeddability, since you can runa complete CozoDB instance in your browser, at near-native speed for most operations!

So open up theCozoDB in WASM page, and then:

Follow thetutorial.

Or you can skip ahead for the information about installing CozoDB into your favourite environment first.

Teasers

If you are in a hurry and just want a taste of what querying with CozoDB is like, here it is.In the following*route is a relation with two columnsfr andto,representing a route between those airports,andFRA is the code for Frankfurt Airport.

How many airports are directly connected toFRA?

?[count_unique(to)] := *route{fr: 'FRA', to}

count_unique(to)
310

How many airports are reachable fromFRA by one stop?

?[count_unique(to)] := *route{fr: 'FRA', to: stop},                       *route{fr: stop, to}

count_unique(to)
2222

How many airports are reachable fromFRA by any number of stops?

reachable[to] := *route{fr: 'FRA', to}reachable[to] := reachable[stop], *route{fr: stop, to}?[count_unique(to)] := reachable[to]

count_unique(to)
3462

What are the two most difficult-to-reach airportsby the minimum number of hops required,starting fromFRA?

shortest_paths[to, shortest(path)] := *route{fr: 'FRA', to},                                      path = ['FRA', to]shortest_paths[to, shortest(path)] := shortest_paths[stop, prev_path],                                      *route{fr: stop, to},                                      path = append(prev_path, to)?[to, path, p_len] := shortest_paths[to, path], p_len = length(path):order -p_len:limit 2

to	path	p_len
YPO	`["FRA","YYZ","YTS","YMO","YFA","ZKE","YAT","YPO"]`	8
BVI	`["FRA","AUH","BNE","ISA","BQL","BEU","BVI"]`	7

What is the shortest path betweenFRA andYPO, by actual distance travelled?

start[] <- [['FRA']]end[] <- [['YPO]]?[src, dst, distance, path] <~ ShortestPathDijkstra(*route[], start[], end[])

src	dst	distance	path
FRA	YPO	4544.0	`["FRA","YUL","YVO","YKQ","YMO","YFA","ZKE","YAT","YPO"]`

CozoDB attempts to provide nice error messages when you make mistakes:

?[x, Y] := x = 1, y = x + 1

eval::unbound_symb_in_head× Symbol 'Y' in rule head is unbound   ╭────1 │ ?[x, Y] := x = 1, y = x + 1   ·     ─   ╰────  help:Note that symbols occurring only in negated positions are not considered bound

Install

We suggest that youtry out CozoDB before you install it in your environment.

How you install CozoDB depends on which environment you want to use it in.Follow the links in the table below:

Language/Environment	Official platform support	Storage
Python	Linux (x86_64), Mac (ARM64, x86_64), Windows (x86_64)	MQR
NodeJS	Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64)	MQR
Web browser	Modern browsers supportingweb assembly	M
Java (JVM)	Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64)	MQR
Clojure (JVM)	Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64)	MQR
Android	Android (ARM64, ARMv7, x86_64, x86)	MQ
iOS/MacOS (Swift)	iOS (ARM64, simulators), Mac (ARM64, x86_64)	MQ
Rust	Source only, usable on anyplatform with`std` support	MQRST
Golang	Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64)	MQR
C/C++/language with C FFI	Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64)	MQR
Standalone HTTP server	Linux (x86_64, ARM64), Mac (ARM64, x86_64), Windows (x86_64)	MQRST
Lisp	Linux (x86_64 so far)	MR
Smalltalk	Win10 & Linux (Ubuntu 23.04) x86_64 tested, MacOS should probably work	MQR

For the storage column:

M: in-memory, non-persistent backend
Q:SQLite storage backend
R:RocksDB storage backend
S:Sled storage backend
T:TiKV distributed storage backend

TheRust doc has some tips on choosing storage,which is helpful even if you are not using Rust.Even if a storage/platform is not officially supported,you can still try to compile your version to use, maybe with some tweaks in the code.

Tuning the RocksDB backend for CozoDB

RocksDB has a lot of options, and by tuning them you can achieve better performancefor your workload. This is probably unnecessary for 95% of users, but if you are theremaining 5%, CozoDB gives you the options to tune RocksDB directly if you are using theRocksDB storage engine.

When you create the CozoDB instance with the RocksDB backend option, you are asked toprovide a path to a directory to store the data (will be created if it does not exist).If you put a file namedoptions inside this directory, the engine will expect thisto be aRocksDB options fileand use it. If you are using the standalonecozo executable, you will get a log message ifthis feature is activated.

Note that improperly set options can make your database misbehave!In general, you should run your database once, copy the options file fromdata/OPTIONS-XXXXXXfrom within your database directory, and use that as a base for your customization.If you are not an expert on RocksDB, we suggest you limit your changes to adjusting those numericaloptions that you at least have a vague understanding.

Architecture

CozoDB consists of three layers stuck on top of each other,with each layer only calling into the layer below:

(User code)

Language/environment wrapper

Query engine

Storage engine

(Operating system)

Storage engine

The storage engine defines a storagetrait for the storage backend, which is an interfacewith required operations, mainly the provision of a key-value store for binary datawith range scan capabilities. There are various implementations:

In-memory, non-persistent backend
SQLite storage backend
RocksDB storage backend
Sled storage backend
TiKV distributed storage backend

Depending on the build configuration, not all backends may be availablein a binary release.The SQLite backend is special in that it is also used as the backup file format,which allows the exchange of data between databases with different backends.If you are using the database embedded in Rust, you can even provide your owncustom backend.

The storage engine also defines arow-oriented binary data format, which the storageengine implementation does not need to know anything about.This format contains an implementation of thememcomparable formatused for the keys, which enables the storage of rows of data as binary blobsthat, when sorted lexicographically, give the correct order.This also means that data files for the SQLite backend cannot be queried with SQLin the usual way, and access must be through the decoding process in CozoDB.

Query engine

The query engine part provides various functionalities:

function/aggregation/algorithm definitions
database schema
transaction
query compilation
query execution

This part is where most ofthe code of CozoDB is concerned. The CozoScript manualhas a chapterabout the execution process.

Users interact with the query engine with theRust API.

Language/environment wrapper

For all languages/environments except Rust, this part just translates the Rust APIinto something that can be easily consumed by the targets. For Rust, there is no wrapper.For example, in the case of the standalone server, the Rust API is translatedinto HTTP endpoints, whereas in the case of NodeJS, the (synchronous) Rust APIis translated into a series of asynchronous calls from the JavaScript runtime.

If you want to make CozoDB usable in other languages, this part is where your focusshould be. Any existing generic interop libraries between Rust and your target languagewould make the job much easier. Otherwise, you can consider wrapping the C API,as this is supported by most languages. For the languages officially supported,only Golang wraps the C API directly.