# PostgresML
PostgresML is a Proof of Concept to create the simplest end-to-end machine learning system. We're building on the shoulders of giants, namely Postgres, which is arguably the most robust storage and compute engine that exists, and we're coupling it with Python machine learning libraries (and their C implementations) to prototype different machine learning workflows.
Common architectures driven by standard organizational hierarchies make it hard to employ machine learning successfully, i.e. [Conway's Law](https://en.wikipedia.org/wiki/Conway%27s_law). A single model at a unicorn-scale startup may require work from Data Scientists, Data Engineers, Machine Learning Engineers, Infrastructure Engineers, Reliability Engineers, Front & Backend Product Engineers, multiple Engineering Managers, a Product Manager and, finally, the Business Partner(s) this "solution" is supposed to eventually address. It can take multiple quarters of effort to shepherd a first model into production. The typical level of complexity adds risk, makes maintenance a hot potato and makes iteration politically difficult. Worse, burnout and morale damage to expensive headcount have left teams and leadership throughout the industry wary of implementing ML solutions, even though FAANGs have proven the immense value when successful.
Our goal is that anyone with a basic understanding of SQL should be able to build and deploy machine learning models to production, while receiving the benefits of a high performance machine learning platform. Ultimately, PostgresML aims to be the easiest, safest and fastest way to gain value from machine learning.
### FAQ
*How far can this scale?*
Petabyte-sized Postgres deployments have been [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have enabled working beyond exabyte up to the yottabyte scale. Machine learning models can be horizontally scaled using industry-proven Postgres replication techniques.
*How reliable can this be?*
Postgres is widely considered mission-critical, and is some of the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technology in any modern stack. PostgresML allows an infrastructure organization to leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happen automatically alongside normal data backup procedures.
*How good are the models?*
Model quality is often a tradeoff between compute resources and incremental quality improvements. Sometimes a few thousand training examples and an off-the-shelf algorithm can deliver significant business value after a few seconds of training. PostgresML allows stakeholders to choose between several different algorithms to get the most bang for the buck, or to invest in more computationally intensive techniques as necessary. In addition, PostgresML automatically applies best practices for data cleaning, like imputing missing values by default and normalizing data, to prevent common problems in production.
PostgresML doesn't help with reformulating a business problem into a machine learning problem. Like most things in life, the ultimate in quality will be a concerted effort of experts working over time. PostgresML is intended to establish successful patterns for those experts to collaborate around while leveraging the expertise of open source and research communities.
*Is PostgresML fast?*
Colocating the compute with the data inside the database removes one of the most common latency bottlenecks in the ML stack: the (de)serialization of data between stores and services across the wire. Modern versions of Postgres also support automatic query parallelization across multiple workers to further minimize latency in large batch workloads. Finally, PostgresML will utilize GPU compute if both the algorithm and hardware support it, although it is currently rare in practice for production databases to have GPUs. We're working on [benchmarks](sql/benchmarks.sql).
### Current features
- Train models directly in Postgres with data from a table or view
- Make predictions in Postgres using SELECT statements
- Manage new versions and algorithms over time as your solution evolves
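
For example, making batch predictions with a plain `SELECT` statement could look like the sketch below. The `wine_quality` table, its columns, and a previously trained 'Wine Quality' project are illustrative assumptions, not part of a documented schema:

```sql
-- Score existing rows with a previously trained model.
-- Table, column, and project names are hypothetical.
SELECT quality,
       pgml.predict('Wine Quality', alcohol, ph, sulphates) AS predicted_quality
FROM wine_quality
LIMIT 10;
```

Because the prediction runs inside the same SQL statement that reads the features, no data ever leaves the database.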
### Planned features
- Model management dashboard
- Data explorer
- Scheduled training
- More algorithms and libraries including custom algorithm support
2. `pgml.predict(project_name TEXT, VARIADIC features DOUBLE PRECISION[])`
The first function trains a model, given a human-friendly project name, a `regression` or `classification` objective, a table or view name which contains the training and testing datasets, and the name of the `y` column containing the target values. The second function predicts novel data points, given the project name for an existing model trained with `pgml.train`, and a list of features used to train that model.
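
Putting the two calls together, a minimal sketch might look like this. The `wine_quality` table, the exact argument order of `pgml.train`, and the feature values are assumptions based on the description above, abbreviated for illustration:

```sql
-- Train a regression model: project name, objective,
-- training relation, and target column name (assumed order).
SELECT pgml.train('Wine Quality', 'regression', 'wine_quality', 'quality');

-- Predict a novel data point by passing feature values
-- in the same order the model was trained on (abbreviated here).
SELECT pgml.predict('Wine Quality', 7.4, 0.7, 0.0, 1.9);
```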
You can also browse complete [code examples in the repository](examples/).
### Walkthrough
We'll be using the [Red Wine Quality](https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009) dataset from Kaggle for this example. You can find it in the `data` folder in this repository, and import it into PostgresML running in Docker.
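
One minimal way to do the import from `psql` is sketched below. The table definition follows the Kaggle dataset's columns, but the table name and CSV path are assumptions; adjust them to the actual files in `data/`:

```sql
-- Assumed schema for the Red Wine Quality CSV.
CREATE TABLE wine_quality (
    fixed_acidity FLOAT4, volatile_acidity FLOAT4, citric_acid FLOAT4,
    residual_sugar FLOAT4, chlorides FLOAT4, free_sulfur_dioxide FLOAT4,
    total_sulfur_dioxide FLOAT4, density FLOAT4, ph FLOAT4,
    sulphates FLOAT4, alcohol FLOAT4, quality INT4
);

-- From psql, load the CSV (path is hypothetical; the UCI original
-- is semicolon-delimited, so add DELIMITER ';' if needed).
\copy wine_quality FROM 'data/winequality-red.csv' CSV HEADER
```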
Check out the [documentation](https://TODO) to view the full capabilities, including:
- [Creating Training Sets](https://TODO)
- [Classification](https://TODO)
- [Regression](https://TODO)
- [Supported Algorithms](https://TODO)
  - [Scikit Learn](https://TODO)
  - [XGBoost](https://TODO)
  - [Tensorflow](https://TODO)
  - [PyTorch](https://TODO)
### Contributing
Follow the installation instructions to create a local working Postgres environment, then install the PgML package from the git repository: