MVP goals #1
Merged
17 commits
92b80f2
MVP goals
f18f276
Use unittest as the test running harness
3c66272
remove validate because validation has a different meaning in ML, and…
958cfba
keep model in memory to avoid going to disk
14b1f61
use bytea directly for pl/python rather than hex/text conversion
829b62e
add a draft schema to support snapshots and multiple training runs fo…
9907aaa
sketch out the regression model training cycle
b50f000
break it down into model classes
89b467d
add categoricals
d9d6727
Update pgml/tests/test_train.py
montanalow dfb57c6
fix categorical test
56e033d
Merge branch 'montana/readme' of github.com:postgresml/postgresml int…
a1ef909
docs
c2de3d8
make test that "works"
ffedbc5
Update pgml/pgml/model.py
montanalow aa44f94
remove parens around ifs
4ca1a5f
Merge branch 'montana/readme' of github.com:postgresml/postgresml int…
README.md: 84 changes (79 additions & 5 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,6 +1,82 @@ | ||
## PostgresML | ||
PostgresML aims to be the easiest way to gain value from machine learning. Anyone with a basic understanding of SQL should be able to build and deploy models to production, while receiving the benefits of a high-performance machine learning platform. PostgresML leverages state-of-the-art algorithms with built-in best practices, without requiring you to set up additional infrastructure or learn additional programming languages. | ||
Getting started is as easy as creating a `table` or `view` that holds the training data, and then registering that with PostgresML. | ||
```sql | ||
SELECT pgml.model_regression('Red Wine Quality', training_data_table_or_view_name, label_column_name); | ||
``` | ||
And predict novel datapoints: | ||
```sql | ||
SELECT pgml.predict('Red Wine Quality', red_wines.*) | ||
FROM pgml.red_wines | ||
LIMIT 3; | ||
quality | ||
--------- | ||
0.896432 | ||
0.834822 | ||
0.954502 | ||
(3 rows) | ||
``` | ||
PostgresML similarly supports classification to predict discrete classes rather than numeric scores for novel data. | ||
```sql | ||
SELECT pgml.create_classification('Handwritten Digit Classifier', pgml.mnist_training_data, label_column_name); | ||
``` | ||
And predict novel datapoints: | ||
```sql | ||
SELECT pgml.predict('Handwritten Digit Classifier', pgml.mnist_test_data.*) | ||
FROM pgml.mnist_test_data | ||
LIMIT 1; | ||
digit | likelihood | ||
-------+------------ | ||
5 | 0.956432 | ||
(1 row) | ||
``` | ||
Check out the [documentation](https://TODO) to view the full capabilities, including: | ||
- [Creating Training Sets](https://TODO) | ||
- [Classification](https://TODO) | ||
- [Regression](https://TODO) | ||
- [Supported Algorithms](https://TODO) | ||
- [Scikit Learn](https://TODO) | ||
- [XGBoost](https://TODO) | ||
- [Tensorflow](https://TODO) | ||
- [PyTorch](https://TODO) | ||
### Planned features | ||
- Model management dashboard | ||
- Data explorer | ||
- More algorithms and libraries, including custom algorithm support | ||
### FAQ | ||
*How well does this scale?* | ||
Petabyte-sized Postgres deployments have been [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have enabled working beyond exabyte up to the yottabyte scale. Machine learning models can be horizontally scaled using well-tested Postgres replication techniques on top of a mature storage and compute platform. | ||
*How reliable is this system?* | ||
Postgres is widely considered mission critical, and some of the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technology in any modern stack. PostgresML allows an infrastructure organization to leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happens automatically alongside normal data backup procedures. | ||
*How good are the models?* | ||
Model quality is often a tradeoff between compute resources and incremental quality improvements. PostgresML allows stakeholders to choose algorithms from several libraries that will provide the most bang for the buck. In addition, PostgresML automatically applies best practices for data cleaning, like imputing missing values by default and normalizing data, to prevent common problems in production. After quickly enabling 0 to 1 value creation, PostgresML enables further expert iteration with custom data preparation and algorithm implementations. Like most things in life, the ultimate in quality will be a concerted effort of experts working over time, but that shouldn't get in the way of a quick start. | ||
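As a rough illustration of the two data-cleaning defaults named above (imputing missing values and normalizing data), here is a minimal Python sketch using only the standard library. It is not PostgresML's implementation: the function names `impute_mean` and `standardize`, the choice of mean imputation and z-score scaling, and the sample `alcohol` column are all assumptions made for this example.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def impute_mean(column: Sequence[Optional[float]]) -> List[float]:
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    if not observed:
        raise ValueError("column has no observed values to impute from")
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def standardize(column: Sequence[float]) -> List[float]:
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no signal; map it to zeros instead of dividing by 0.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


# Hypothetical feature column with one missing value.
alcohol = [9.4, None, 9.8, 10.0]
cleaned = standardize(impute_mean(alcohol))
```

Imputing before scaling keeps the filled-in value at the column mean, so it maps to zero after standardization and does not pull the other values in either direction.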
*Is PostgresML fast?* | ||
Colocating the compute with the data inside the database removes one of the most common latency bottlenecks in the ML stack, which is the (de)serialization of data between stores and services across the wire. Modern versions of Postgres also support automatic query parallelization across multiple workers to further minimize latency in large batch workloads. Finally, PostgresML will utilize GPU compute if both the algorithm and hardware support it, although it is currently rare in practice for production databases to have GPUs. Check out our [benchmarks](https://todo). | ||
### Installation in WSL or Ubuntu | ||
@@ -29,11 +105,9 @@ Install Scikit globally (I didn't bother setup Postgres with a virtualenv, but i | ||
sudo pip3 install scikit-learn | ||
``` | ||
### Run the example | ||
```bash | ||
psql -f scikit_train_and_predict.sql | ||
``` | ||
benchmarks.sql: 23 changes (23 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,23 @@ | ||
-- | ||
-- CREATE EXTENSION | ||
-- | ||
CREATE EXTENSION IF NOT EXISTS plpython3u; | ||
CREATE OR REPLACE FUNCTION pg_call() | ||
RETURNS INT | ||
AS $$ | ||
BEGIN | ||
RETURN 1; | ||
END; | ||
$$ LANGUAGE plpgsql; | ||
CREATE OR REPLACE FUNCTION py_call() | ||
RETURNS INT | ||
AS $$ | ||
return 1; | ||
$$ LANGUAGE plpython3u; | ||
\timing on | ||
SELECT generate_series(1, 50000), pg_call(); -- Time: 20.679 ms | ||
SELECT generate_series(1, 50000), py_call(); -- Time: 67.355 ms | ||
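
To put the two timings recorded in the comments above on a per-call basis, the following Python sketch divides each total by the 50,000 rows generated. The constants are copied from those comments; the helper name `per_call_us` is hypothetical. Each total also includes the cost of `generate_series` and of returning rows to the client, so the per-call figures are upper bounds on the function-call overhead rather than precise measurements.

```python
# Timings copied from the `-- Time:` comments in benchmarks.sql.
ROWS = 50_000
PLPGSQL_MS = 20.679   # SELECT generate_series(1, 50000), pg_call();
PLPYTHON_MS = 67.355  # SELECT generate_series(1, 50000), py_call();


def per_call_us(total_ms: float, calls: int) -> float:
    """Average wall-clock time per call in microseconds (hypothetical helper)."""
    return total_ms * 1000.0 / calls


plpgsql_us = per_call_us(PLPGSQL_MS, ROWS)
plpython_us = per_call_us(PLPYTHON_MS, ROWS)
extra_us = plpython_us - plpgsql_us
ratio = PLPYTHON_MS / PLPGSQL_MS

print(f"plpgsql:    {plpgsql_us:.2f} us per call")
print(f"plpython3u: {plpython_us:.2f} us per call")
print(f"difference: {extra_us:.2f} us per call ({ratio:.2f}x)")
```

On these numbers the trivial pl/python function costs roughly one extra microsecond per call compared with the plpgsql one, which suggests the language boundary itself is cheap next to any real model inference.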