
Commit b0afb0c

Merge pull request #15 from postgresml/montana/readme
tighten up verbiage and add a logo
2 parents da81ecf + 6c47d7d, commit b0afb0c

File tree

4 files changed

+168
-45
lines changed


‎MIT-LICENSE.txt

Lines changed: 20 additions & 0 deletions
@@ -0,0 +1,20 @@
1+
Copyright (c) 2022 Lev Kokotov and Montana Low
2+
3+
Permission is hereby granted, free of charge, to any person obtaining
4+
a copy of this software and associated documentation files (the
5+
"Software"), to deal in the Software without restriction, including
6+
without limitation the rights to use, copy, modify, merge, publish,
7+
distribute, sublicense, and/or sell copies of the Software, and to
8+
permit persons to whom the Software is furnished to do so, subject to
9+
the following conditions:
10+
11+
The above copyright notice and this permission notice shall be
12+
included in all copies or substantial portions of the Software.
13+
14+
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
15+
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
16+
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
17+
NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE
18+
LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
19+
OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
20+
WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

‎README.md

Lines changed: 148 additions & 45 deletions
@@ -1,6 +1,144 @@
11
# PostgresML
22

3-
PostgresML aims to be the easiest way to gain value from machine learning. Anyone with a basic understanding of SQL should be able to build and deploy models to production, while receiving the benefits of a high performance machine learning platform. PostgresML leverages state of the art algorithms with built in best practices, without having to setup additional infrastructure or learn additional programming languages.
3+
![PostgresML](./logo-small.png)
4+
5+
PostgresML is an end-to-end machine learning system. Using only SQL, it lets you train models and run online predictions, alongside normal queries, directly on the data in your databases.
6+
7+
## Why
8+
9+
Deploying machine learning models into existing applications is not straightforward. Unless you're already using Python in your day-to-day work, you need to learn a new language and toolchain, figure out how to EL(T) your data from your database(s) into a warehouse or object storage, learn how to train models (Scikit-Learn, PyTorch, TensorFlow, etc.), and finally serve predictions to your apps, forcing your organization into microservices and all the complexity that comes with them.
10+
11+
PostgresML makes ML simple: your data doesn't really go anywhere, you train using simple SQL commands, and you get the predictions to your apps using a mechanism you've been using already: a Postgres connection and a query.
12+
13+
Our goal is that anyone with a basic understanding of SQL should be able to build and deploy machine learning models to production, while receiving the benefits of a high performance machine learning platform. Ultimately, PostgresML aims to be the easiest, safest and fastest way to gain value from machine learning.
14+
15+
## Quick start
16+
17+
Using Docker, boot up PostgresML locally:
18+
19+
```bash
20+
$ docker-compose up
21+
```
22+
23+
The system listens on port 5433 by default, so it won't conflict with a Postgres instance you may already be running on the standard port:
24+
25+
```bash
26+
$ psql -U root -h 127.0.0.1 -p 5433
27+
```
28+
29+
We've included a couple of examples in the `examples/` folder. You can run them directly with:
30+
31+
```bash
32+
$ psql -U root -h 127.0.0.1 -p 5433 -f <filename>
33+
```
34+
35+
See the [installation instructions](#Installation) for more information, including how to install PostgresML in other supported environments.
36+
37+
## Features
38+
39+
### Training models
40+
41+
Given a Postgres table or a view, PostgresML can train a model using some commonly used algorithms. We currently support the following Scikit-Learn regression and classification models:
42+
43+
- `LinearRegression`,
44+
- `LogisticRegression`,
45+
- `SVR`,
46+
- `SVC`,
47+
- `RandomForestRegressor`,
48+
- `RandomForestClassifier`,
49+
- `GradientBoostingRegressor`,
50+
- `GradientBoostingClassifier`.
51+
52+
Training a model is then as simple as:
53+
54+
```sql
55+
SELECT * FROM pgml.train(
56+
'Human-friendly project name',
57+
'regression',
58+
'<name of the table or view containing the data>',
59+
'<name of the column containing the y target values>'
60+
);
61+
```
62+
63+
PostgresML will snapshot the data from the table, train multiple models from the above list given the objective (`regression` or `classification`), and automatically choose and deploy the model with the best predictions.
64+
65+
### Making predictions
66+
67+
Once the model is trained, making predictions is as simple as:
68+
69+
```sql
70+
SELECT pgml.predict('Human-friendly project name', ARRAY[...]) AS prediction_score;
71+
```
72+
73+
where `ARRAY[...]` is a list of features for which we want to run a prediction. This list has to be in the same order as the columns in the data table. The resulting score can then be used in normal queries, for example:
74+
75+
```sql
76+
SELECT *,
77+
    pgml.predict(
78+
        'Probability of buying our products',
79+
        ARRAY[users.location, NOW() - users.created_at, users.total_purchases_in_dollars]
80+
    ) AS likely_to_buy_score
81+
FROM users
82+
WHERE company_id = 5
83+
ORDER BY likely_to_buy_score
84+
LIMIT 25;
85+
```
86+
87+
Take a look [below](#Working-with-PostgresML) for an example with real data.
88+
89+
### Model and data versioning
90+
91+
As data in your database changes, you can retrain the model to get better predictions. With PostgresML, it's as simple as running the `pgml.train` command again. If the new model scores better, it will automatically be used for predictions; if not, the existing model will be kept and will continue to score your queries. We also snapshot the training data, so models can be retrained deterministically to validate and fix any issues.
92+
93+
## Roadmap
94+
95+
This project is currently a proof of concept. Some important features, which we are currently thinking about or working on, are listed below.
96+
97+
### Production deployment
98+
99+
Most companies that use PostgreSQL in production do so using managed services like AWS RDS, Digital Ocean, Azure, etc. Those services do not allow running custom extensions, so we have to run PostgresML
100+
directly on VMs, e.g. EC2, droplets, etc. The idea here is to replicate production data directly from Postgres and make it available in real-time to PostgresML. We're considering solutions like logical replication for small to mid-size databases, and Debezium for multi-TB deployments.
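For small to mid-size databases, the logical replication option mentioned above could look roughly like the sketch below. It uses only stock Postgres commands and is not something PostgresML ships today. The publication and subscription names, the table list, and the connection placeholders are all hypothetical.

```sql
-- On the production primary (requires wal_level = logical).
-- The publication name and table list are illustrative assumptions.
CREATE PUBLICATION pgml_training_data FOR TABLE users, orders;

-- On the PostgresML instance. The replicated tables must already exist
-- there with a matching schema before the subscription is created.
CREATE SUBSCRIPTION pgml_training_data_sub
    CONNECTION 'host=<production host> port=5432 dbname=<database> user=<replication user>'
    PUBLICATION pgml_training_data;
```

Once the initial copy finishes, changes to the published tables stream to the PostgresML instance, where `pgml.train` can read them like any other local table.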
101+
102+
### Model management dashboard
103+
104+
A good-looking and useful UI goes a long way. A dashboard similar to existing solutions like MLflow or AWS SageMaker will make the experience of working with PostgresML as pleasant as possible.
105+
106+
107+
### Data explorer
108+
109+
A data explorer allows anyone to browse the dataset in production and to find useful tables and features to build effective machine learning models.
110+
111+
112+
### More algorithms
113+
114+
Scikit-Learn is a good start, but we're also thinking about including TensorFlow, PyTorch, and many more useful models.
115+
116+
117+
### Scheduled training
118+
119+
In applications where data changes often, it's useful to retrain the models automatically on a schedule, e.g. every day, every week, etc.
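Until scheduling is built in, one possible stopgap is to drive `pgml.train` from an external scheduler. The sketch below assumes the separate `pg_cron` extension is installed and enabled on the server; it is not part of PostgresML, and the project, table, and column names are the same placeholders used earlier.

```sql
-- Assumption: pg_cron is available on this server (not part of PostgresML).
CREATE EXTENSION IF NOT EXISTS pg_cron;

-- Retrain once a day at 03:00, using standard cron syntax.
SELECT cron.schedule(
    '0 3 * * *',
    $$SELECT * FROM pgml.train(
        'Human-friendly project name',
        'regression',
        '<name of the table or view containing the data>',
        '<name of the column containing the y target values>'
    )$$
);
```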
120+
121+
122+
### FAQ
123+
124+
*How far can this scale?*
125+
126+
Petabyte-sized Postgres deployments have been [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have enabled working beyond exabyte and up to the yottabyte scale. Machine learning models can be horizontally scaled using standard Postgres replicas.
127+
128+
*How reliable can this be?*
129+
130+
Postgres is widely considered mission critical, and some of the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technology in any modern stack. PostgresML allows an infrastructure organization to leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happens automatically alongside normal Postgres data backup.
131+
132+
*How good are the models?*
133+
134+
Model quality is often a tradeoff between compute resources and incremental quality improvements. Sometimes a few thousand training examples and an off-the-shelf algorithm can deliver significant business value after a few seconds of training. PostgresML allows stakeholders to choose several different algorithms to get the most bang for the buck, or invest in more computationally intensive techniques as necessary. In addition, PostgresML automatically applies best practices for data cleaning like imputing missing values by default and normalizing data to prevent common problems in production.
135+
136+
PostgresML doesn't help with reformulating a business problem into a machine learning problem. Like most things in life, the ultimate in quality will be a concerted effort of experts working over time. PostgresML is intended to establish successful patterns for those experts to collaborate around while leveraging the expertise of open source and research communities.
137+
138+
*Is PostgresML fast?*
139+
140+
Colocating the compute with the data inside the database removes one of the most common latency bottlenecks in the ML stack, which is the (de)serialization of data between stores and services across the wire. Modern versions of Postgres also support automatic query parallelization across multiple workers to further minimize latency in large batch workloads. Finally, PostgresML will utilize GPU compute if both the algorithm and hardware support it, although it is currently rare in practice for production databases to have GPUs. We're working on [benchmarks](sql/benchmarks.sql).
141+
4142

5143
## Installation
6144

@@ -99,15 +237,16 @@ $ psql -c 'SELECT pgml.version()'
99237

100238
The two most important functions the framework provides are:
101239

102-
1. `pgml.train(project_name TEXT, objective TEXT, relation_name TEXT, y_column_name TEXT)`,
240+
1. `pgml.train(project_name TEXT, objective TEXT, relation_name TEXT, y_column_name TEXT, algorithm TEXT DEFAULT NULL)`,
103241
2. `pgml.predict(project_name TEXT, VARIADIC features DOUBLE PRECISION[])`.
104242

105-
The first function trains a model, given a human-friendly project name, a `regression` or `classification` objective, a table or view name which contains the training and testing datasets,
106-
and the name of the `y` column containing the target values. The second function predicts novel datapoints, given the project name for an existing model trained with `pgml.train`,
107-
and a list of features used to train that model.
243+
The first function trains a model, given a human-friendly project name, a `regression` or `classification` objective, a table or view name which contains the training and testing datasets, and the name of the `y` column containing the target values. The second function predicts novel datapoints, given the project name for an existing model trained with `pgml.train`, and a list of features used to train that model.
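As a rough sketch of how the two signatures above might be called together: the `house_prices` table, its columns, and the project name are hypothetical, and the `algorithm` value is left as a placeholder because the accepted names are not listed here. The example also assumes that the target column is left out of the feature list.

```sql
-- Hypothetical training table, for illustration only.
-- It needs to be populated with rows before training.
CREATE TABLE house_prices (
    square_feet DOUBLE PRECISION,
    bedrooms    DOUBLE PRECISION,
    age_years   DOUBLE PRECISION,
    price       DOUBLE PRECISION  -- target (y) column
);

-- Four-argument form: algorithm falls back to its default of NULL.
SELECT * FROM pgml.train('House price estimate', 'regression', 'house_prices', 'price');

-- Five-argument form: request one specific algorithm.
SELECT * FROM pgml.train('House price estimate', 'regression', 'house_prices', 'price', '<algorithm name>');

-- features is declared VARIADIC, so values can be passed as separate arguments,
-- in the same order as the feature columns (assumed here to exclude the y column).
SELECT pgml.predict('House price estimate', 1500.0, 3.0, 20.0) AS estimated_price;
```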
108244

109-
We'll be using the [Red Wine Quality](https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009) dataset from Kaggle for this example. You can find it in the `data` folder in this repository.
110-
You can import it into PostgresML running in Docker with this:
245+
You can also browse complete [code examples in the repository](examples/).
246+
247+
### Walkthrough
248+
249+
We'll be using the [Red Wine Quality](https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009) dataset from Kaggle for this example. You can find it in the `data` folder in this repository. You can import it into PostgresML running in Docker with this:
111250

112251
```bash
113252
$ psql -f data/winequality-red.sql -p 5433 -U root -h 127.0.0.1
@@ -159,43 +298,7 @@ LIMIT 1;
159298
(1 row)
160299
```
161300

162-
Check out the [documentation](https://TODO) to view the full capabilities, including:
163-
- [Creating Training Sets](https://TODO)
164-
- [Classification](https://TODO)
165-
- [Regression](https://TODO)
166-
- [Supported Algorithms](https://TODO)
167-
- [Scikit Learn](https://TODO)
168-
- [XGBoost](https://TODO)
169-
- [Tensorflow](https://TODO)
170-
- [PyTorch](https://TODO)
171-
172-
### Planned features
173-
- Model management dashboard
174-
- Data explorer
175-
- More algorithms and libraries including custom algorithm support
176-
177-
178-
### FAQ
179-
180-
*How well does this scale?*
181-
182-
Petabyte-sized Postgres deployments have been [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have enabled working beyond exabyte up to the yottabyte scale. Machine learning models can be horizontally scaled using well tested Postgres replication techniques on top of a mature storage and compute platform.
183-
184-
*How reliable is this system?*
185-
186-
Postgres is widely considered mission critical, and some of the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technology in any modern stack. PostgresML allows an infrastructure organization to leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happens automatically alongside normal data backup procedures.
187-
188-
*How good are the models?*
189-
190-
Model quality is often a tradeoff between compute resources and incremental quality improvements. PostgresML allows stakeholders to choose algorithms from several libraries that will provide the most bang for the buck. In addition, PostgresML automatically applies best practices for data cleaning like imputing missing values by default and normalizing data to prevent common problems in production. After quickly enabling 0 to 1 value creation, PostgresML enables further expert iteration with custom data preparation and algorithm implementations. Like most things in life, the ultimate in quality will be a concerted effort of experts working over time, but that shouldn't get in the way of a quick start.
191-
192-
*Is PostgresML fast?*
193-
194-
Colocating the compute with the data inside the database removes one of the most common latency bottlenecks in the ML stack, which is the (de)serialization of data between stores and services across the wire. Modern versions of Postgres also support automatic query parallelization across multiple workers to further minimize latency in large batch workloads. Finally, PostgresML will utilize GPU compute if both the algorithm and hardware support it, although it is currently rare in practice for production databases to have GPUs. Check out our [benchmarks](https://todo).
195-
196-
197-
198-
### Development
301+
### Contributing
199302

200303
Follow the installation instructions to create a local working Postgres environment, then install your PgML package from the git repository:
201304

@@ -205,7 +308,7 @@ sudo python3 setup.py install
205308
cd ../
206309
```
207310

208-
Run the test:
311+
Run the tests from the root of the repo:
209312

210313
```
211314
psql -f sql/test.sql

‎logo-small.png

20.2 KB

‎logo.png

63.4 KB
