Apr 18, 2022 · Apr 18, 2022 · Apr 18, 2022 · Apr 18, 2022 · Apr 18, 2022 · Apr 18, 2022
diff --git a/README.md b/README.md

 ![PostgresML](./logo-small.png)

 PostgresML isa Proof of Concept to create the simplestend-to-end machine learning system.We're building on the shoulders of giants, namely Postgres which is arguably the most robust storageandcompute engine that exists, and we're coupling that with Python machine learning libraries (and their c implementations) to prototype different machine learning workflows.
 PostgresML isanend-to-end machine learning system.Using only SQL, it allows to train modelsandrun online predictions, alongside normal queries, directly using the data in your databases.

 Common architectures driven by standard organizational hierarchies make it hard to employ machine learning successfully, i.e [Conway's Law](https://en.wikipedia.org/wiki/Conway%27s_law). A single model at a unicorn scale startup may require work from Data Scientists, Data Engineers, Machine Learning Engineers, Infrastructure Engineers, Reliability Engineers, Front & Backend Product Engineers, multiple Engineering Managers, a Product Manager and finally, the Business Partner(s) this "solution" is supposed to eventually address. It can take multiple quarters of effort to shepherd a first effort. The typical level of complexity adds risk, makes maintenance a hot potato and iteration politically difficult. Worse, burnout and morale damage to expensive headcount have left teams and leadership warry of implementing ML solutions throughout the industry, even though FAANGs have proven the immense value when successful.
 ## Why

 Deploying machine learning models into existing applications is not straight forward. Unless you're already using Python in your day to day work, you need to learn a new language and toolchain, figure out how to EL(T) your data from your database(s) into a warehouse or object storage, learn how to train models (Scikit-Learn, Pytorch, Tensorflow, etc.), and finally serve preditions to your apps, forcing your organization into microservices and all the complexity that comes with it.

 PostgresML makes ML simple: your data doesn't really go anywhere, you train using simple SQL commands, and you get the predictions to your apps using a mechanism you've been using already: a Postgres connection and a query.

 Our goal is that anyone with a basic understanding of SQL should be able to build and deploy machine learning models to production, while receiving the benefits of a high performance machine learning platform. Ultimately, PostgresML aims to be the easiest, safest and fastest way to gain value from machine learning.

 ## Quick start

 Using Docker, boot up PostresML locally:

 ```bash
 $ docker-compose up
 ```

 The system is available on port 5433 by default, just in case you happen to be running Postgres already:

 ```bash
 $ psql -U root -h 127.0.0.1 -p 5433
 ```

 We've included a couple examples in the `examples/` folder. You can run them directly with:

 ```bash
 $ psql -U root -h 127.0.0.1 -p 5433 -f <filename>
 ```

 See [installation instructions](#Installation) for installing PostgresML in different supported environments, and for more information.

 ## Features

 ### Training models

 Given a Postgres table or a view, PostgresML can train a model using some commonly used algorithms. We currently support the following Scikit-Learn regression and classification models:

 - `LinearRegression`,
 - `LogisticRegression`,
 - `SVR`,
 - `SVC`,
 - `RandomForestRegressor`,
 - `RandomForestClassifier`,
 - `GradientBoostingRegressor`,
 - `GradientBoostingClassifier`.

 Training a model is then as simple as:

 ```sql
 SELECT * FROM pgml.train(
     'Human-friendly project name',
     'regression',
     '<name of the table or view containing the data>',
     '<name of the column containing the y target values>'
 );
 ```

 PostgresML will snapshot the data from the table, train multiple models from the above list given the objective (`regression` or `classification`), and automatically choose and deploy the model with the best predictions.

 ### Making predictions

 Once the model is trained, making predictions is as simple as:

 ```sql
 SELECT pgml.predict('Human-friendly project name', ARRAY[...]) AS prediction_score;
 ```

 where `ARRAY[...]` is a list of features for which we want to run a prediction. This list has to be in the same order as the columns in the data table. This score then can be used in normal queries, for example:

 ```sql
 SELECT *,
       pgml.predict(
          'Probability of buying our products',
          ARRAY[user.location, NOW() - user.created_at, user.total_purchases_in_dollars]
       ) AS likely_to_buy_score
 FROM users
 WHERE comapany_id = 5
 ORDER BY likely_to_buy_score
 LIMIT 25;
 ```

 Take a look [below](#Working-with-PostgresML) for an example with real data.

 ### Model and data versioning

 As data in your database changes, it is possible to retrain the model again to get better predictions. With PostgresML, it's as simple as running the `pgml.train` command again. If the model scores better, it will be automatically used in predictions; if not, the existing model will be kept and continue to score in your queries. We also snapshot the training data, so models can be retrained deterministically to validate and fix any issues.

 ## Roadmap

 This project is currently a proof of concept. Some important features, which we are currently thinking about or working on, are listed below.

 ### Production deployment

 Most companies that use PostgreSQL in production do so using managed services like AWS RDS, Digital Ocean, Azure, etc. Those services do not allow running custom extensions, so we have to run PostgresML
 directly on VMs, e.g. EC2, droplets, etc. The idea here is to replicate production data directly from Postgres and make it available in real-time to PostgresML. We're considering solutions like logical replication for small to mid-size databases, and Debezium for multi-TB deployments.

 ### Model management dashboard

 A good looking and useful UI goes a long way. A dashboard similar to existing solutions like MLFlow or AWS SageMaker will make the experience of working with PostgresML as pleasant as possible.


 ### Data explorer

 A data explorer allows anyone to browse the dataset in production and to find useful tables and features to build effective machine learning models.


 ### More algorithms

 Scikit-Learn is a good start, but we're also thinking about including Tensorflow, Pytorch, and many more useful models.


 ### Scheduled training

 In applications where data changes often, it's useful to retrain the models automatically on a schedule, e.g. every day, every week, etc.


 ### FAQ

 *How far can this scale?*

 Petabyte sized Postgres deployements are [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have enabled working beyond exabyte up to the yotabyte scale. Machine learning models can be horizontally scaled usingindustry provenPostgresreplication techniques.
 Petabyte sized Postgres deployements are [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have enabled working beyond exabyteandup to the yotabyte scale. Machine learning models can be horizontally scaled usingstandardPostgresreplicas.

 *How reliable can this be?*

 Postgres is widely considered mission critical, and some of the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technology in any modern stack. PostgresML allows an infrastructure organization to leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happens automatically alongside normal data backup procedures.
 Postgres is widely considered mission critical, and some of the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technology in any modern stack. PostgresML allows an infrastructure organization to leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happens automatically alongside normalPostgresdata backup.

 *How good are the models?*

 Model quality is often a tradeoff between compute resources and incremental quality improvements. Sometimes a few thousands training examples and an off the shelf algorithm can deliver significant business value after a few seconds of training a model. PostgresML allows stakeholders to choose several different algorithms to get the most bang for the buck, or invest in more computationally intensive techniques as necessary. In addition, PostgresML automatically applies best practices for data cleaning like imputing missing values by default and normalizing data to prevent common problems in production.
 Model quality is often a tradeoff between compute resources and incremental quality improvements. Sometimes a few thousands training examples and an off the shelf algorithm can deliver significant business value after a few seconds of training. PostgresML allows stakeholders to choose several different algorithms to get the most bang for the buck, or invest in more computationally intensive techniques as necessary. In addition, PostgresML automatically applies best practices for data cleaning like imputing missing values by default and normalizing data to prevent common problems in production.

 PostgresML doesn't help with reformulating a business problem into a machine learning problem. Like most things in life, the ultimate in quality will be a concerted effort of experts working over time. PostgresML is intended to establish successful patterns for those experts to collaborate around while leveraging the expertise of open source and research communities.

 *Is PostgresML fast?*

 Colocating the compute with the data inside the database removes one of the most common latency bottlenecks in the ML stack, which is the (de)serialization of data between stores and services across the wire. Modern versions of Postgres also support automatic query parrellization across multiple workers to further minimize latency in large batch workloads. Finally, PostgresML will utilize GPU compute if both the algorithm and hardware support it, although it is currently rare in practice for production databases to have GPUs. We're working on [benchmarks](sql/benchmarks.sql).

 ### Current features
 - Train models directly in Postgres with data from a table or view
 - Make predictions in Postgres using SELECT statements
 - Manage new versions and algorithms over time as your solution evolves

 ### Planned features
 - Model management dashboard
 - Data explorer
 - Scheduled training
 - More algorithms and libraries including custom algorithm support


 ## Installation


 The two most important functions the framework provides are:

 1. `pgml.train(project_name TEXT, objective TEXT, relation_name TEXT, y_column_name TEXT, algorithm TEXT)`,
 1. `pgml.train(project_name TEXT, objective TEXT, relation_name TEXT, y_column_name TEXT, algorithm TEXT DEFAULT NULL)`,
 2. `pgml.predict(project_name TEXT, VARIADIC features DOUBLE PRECISION[])`.

 The first function trains a model, given a human-friendly project name, a `regression` or `classification` objective, a table or view name which contains the training and testing datasets, and the name of the `y` column containing the target values. The second function predicts novel datapoints, given the project name for an exiting model trained with `pgml.train`, and a list of features used to train that model.
diff --git a/logo-small.png b/logo-small.png
Original file line number	Diff line number	Diff line change
Expand Up		@@ -2,44 +2,143 @@

		![PostgresML](./logo-small.png)

		PostgresML isa Proof of Concept to create the simplestend-to-end machine learning system.We're building on the shoulders of giants, namely Postgres which is arguably the most robust storageandcompute engine that exists, and we're coupling that with Python machine learning libraries (and their c implementations) to prototype different machine learning workflows.
		PostgresML isanend-to-end machine learning system.Using only SQL, it allows to train modelsandrun online predictions, alongside normal queries, directly using the data in your databases.
Copy link Contributor montanalowApr 18, 2022 Choose a reason for hiding this comment The reason will be displayed to describe this comment to others.Learn more. I like proof of concept. It seems to have played well in your pgcat debut. I think name dropping Postgres/Python helps buy some credibility that we're not completely insanely trying to build this from scratch. Copy link ContributorAuthor levkkApr 18, 2022 Choose a reason for hiding this comment The reason will be displayed to describe this comment to others.Learn more. Yeah that's true, fair enough

		Common architectures driven by standard organizational hierarchies make it hard to employ machine learning successfully, i.e [Conway's Law](https://en.wikipedia.org/wiki/Conway%27s_law). A single model at a unicorn scale startup may require work from Data Scientists, Data Engineers, Machine Learning Engineers, Infrastructure Engineers, Reliability Engineers, Front & Backend Product Engineers, multiple Engineering Managers, a Product Manager and finally, the Business Partner(s) this "solution" is supposed to eventually address. It can take multiple quarters of effort to shepherd a first effort. The typical level of complexity adds risk, makes maintenance a hot potato and iteration politically difficult. Worse, burnout and morale damage to expensive headcount have left teams and leadership warry of implementing ML solutions throughout the industry, even though FAANGs have proven the immense value when successful.
		## Why

		Deploying machine learning models into existing applications is not straight forward. Unless you're already using Python in your day to day work, you need to learn a new language and toolchain, figure out how to EL(T) your data from your database(s) into a warehouse or object storage, learn how to train models (Scikit-Learn, Pytorch, Tensorflow, etc.), and finally serve preditions to your apps, forcing your organization into microservices and all the complexity that comes with it.

		PostgresML makes ML simple: your data doesn't really go anywhere, you train using simple SQL commands, and you get the predictions to your apps using a mechanism you've been using already: a Postgres connection and a query.

		Our goal is that anyone with a basic understanding of SQL should be able to build and deploy machine learning models to production, while receiving the benefits of a high performance machine learning platform. Ultimately, PostgresML aims to be the easiest, safest and fastest way to gain value from machine learning.

		## Quick start

		Using Docker, boot up PostresML locally:

		```bash
		$ docker-compose up
		```

		The system is available on port 5433 by default, just in case you happen to be running Postgres already:

		```bash
		$ psql -U root -h 127.0.0.1 -p 5433
		```

		We've included a couple examples in the `examples/` folder. You can run them directly with:

		```bash
		$ psql -U root -h 127.0.0.1 -p 5433 -f <filename>
		```

		See [installation instructions](#Installation) for installing PostgresML in different supported environments, and for more information.

		## Features

		### Training models

		Given a Postgres table or a view, PostgresML can train a model using some commonly used algorithms. We currently support the following Scikit-Learn regression and classification models:

		- `LinearRegression`,
		- `LogisticRegression`,
		- `SVR`,
		- `SVC`,
		- `RandomForestRegressor`,
		- `RandomForestClassifier`,
		- `GradientBoostingRegressor`,
		- `GradientBoostingClassifier`.

		Training a model is then as simple as:

		```sql
		SELECT * FROM pgml.train(
		'Human-friendly project name',
		'regression',
		'<name of the table or view containing the data>',
		'<name of the column containing the y target values>'
		);
		```

		PostgresML will snapshot the data from the table, train multiple models from the above list given the objective (`regression` or `classification`), and automatically choose and deploy the model with the best predictions.

		### Making predictions

		Once the model is trained, making predictions is as simple as:

		```sql
		SELECT pgml.predict('Human-friendly project name', ARRAY[...]) AS prediction_score;
		```

		where `ARRAY[...]` is a list of features for which we want to run a prediction. This list has to be in the same order as the columns in the data table. This score then can be used in normal queries, for example:

		```sql
		SELECT *,
		pgml.predict(
		'Probability of buying our products',
		ARRAY[user.location, NOW() - user.created_at, user.total_purchases_in_dollars]
		) AS likely_to_buy_score
		FROM users
		WHERE comapany_id = 5
		ORDER BY likely_to_buy_score
		LIMIT 25;
		```

		Take a look [below](#Working-with-PostgresML) for an example with real data.

		### Model and data versioning

		As data in your database changes, it is possible to retrain the model again to get better predictions. With PostgresML, it's as simple as running the `pgml.train` command again. If the model scores better, it will be automatically used in predictions; if not, the existing model will be kept and continue to score in your queries. We also snapshot the training data, so models can be retrained deterministically to validate and fix any issues.

		## Roadmap

		This project is currently a proof of concept. Some important features, which we are currently thinking about or working on, are listed below.

		### Production deployment

		Most companies that use PostgreSQL in production do so using managed services like AWS RDS, Digital Ocean, Azure, etc. Those services do not allow running custom extensions, so we have to run PostgresML
		directly on VMs, e.g. EC2, droplets, etc. The idea here is to replicate production data directly from Postgres and make it available in real-time to PostgresML. We're considering solutions like logical replication for small to mid-size databases, and Debezium for multi-TB deployments.

		### Model management dashboard

		A good looking and useful UI goes a long way. A dashboard similar to existing solutions like MLFlow or AWS SageMaker will make the experience of working with PostgresML as pleasant as possible.


		### Data explorer

		A data explorer allows anyone to browse the dataset in production and to find useful tables and features to build effective machine learning models.


		### More algorithms

		Scikit-Learn is a good start, but we're also thinking about including Tensorflow, Pytorch, and many more useful models.


		### Scheduled training

		In applications where data changes often, it's useful to retrain the models automatically on a schedule, e.g. every day, every week, etc.


		### FAQ

		How far can this scale?

		Petabyte sized Postgres deployements are [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have enabled working beyond exabyte up to the yotabyte scale. Machine learning models can be horizontally scaled usingindustry provenPostgresreplication techniques.
		Petabyte sized Postgres deployements are [documented](https://www.computerworld.com/article/2535825/size-matters--yahoo-claims-2-petabyte-database-is-world-s-biggest--busiest.html) in production since at least 2008, and [recent patches](https://www.2ndquadrant.com/en/blog/postgresql-maximum-table-size/) have enabled working beyond exabyteandup to the yotabyte scale. Machine learning models can be horizontally scaled usingstandardPostgresreplicas.

		How reliable can this be?

		Postgres is widely considered mission critical, and some of the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technology in any modern stack. PostgresML allows an infrastructure organization to leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happens automatically alongside normal data backup procedures.
		Postgres is widely considered mission critical, and some of the most [reliable](https://www.postgresql.org/docs/current/wal-reliability.html) technology in any modern stack. PostgresML allows an infrastructure organization to leverage pre-existing best practices to deploy machine learning into production with less risk and effort than other systems. For example, model backup and recovery happens automatically alongside normalPostgresdata backup.

		How good are the models?

		Model quality is often a tradeoff between compute resources and incremental quality improvements. Sometimes a few thousands training examples and an off the shelf algorithm can deliver significant business value after a few seconds of training a model. PostgresML allows stakeholders to choose several different algorithms to get the most bang for the buck, or invest in more computationally intensive techniques as necessary. In addition, PostgresML automatically applies best practices for data cleaning like imputing missing values by default and normalizing data to prevent common problems in production.
		Model quality is often a tradeoff between compute resources and incremental quality improvements. Sometimes a few thousands training examples and an off the shelf algorithm can deliver significant business value after a few seconds of training. PostgresML allows stakeholders to choose several different algorithms to get the most bang for the buck, or invest in more computationally intensive techniques as necessary. In addition, PostgresML automatically applies best practices for data cleaning like imputing missing values by default and normalizing data to prevent common problems in production.

		PostgresML doesn't help with reformulating a business problem into a machine learning problem. Like most things in life, the ultimate in quality will be a concerted effort of experts working over time. PostgresML is intended to establish successful patterns for those experts to collaborate around while leveraging the expertise of open source and research communities.

		Is PostgresML fast?

		Colocating the compute with the data inside the database removes one of the most common latency bottlenecks in the ML stack, which is the (de)serialization of data between stores and services across the wire. Modern versions of Postgres also support automatic query parrellization across multiple workers to further minimize latency in large batch workloads. Finally, PostgresML will utilize GPU compute if both the algorithm and hardware support it, although it is currently rare in practice for production databases to have GPUs. We're working on [benchmarks](sql/benchmarks.sql).

		### Current features
		- Train models directly in Postgres with data from a table or view
		- Make predictions in Postgres using SELECT statements
		- Manage new versions and algorithms over time as your solution evolves

		### Planned features
		- Model management dashboard
		- Data explorer
		- Scheduled training
		- More algorithms and libraries including custom algorithm support


		## Installation

Expand DownExpand Up		@@ -138,7 +237,7 @@ $ psql -c 'SELECT pgml.version()'

		The two most important functions the framework provides are:

		1. `pgml.train(project_name TEXT, objective TEXT, relation_name TEXT, y_column_name TEXT, algorithm TEXT)`,
		1. `pgml.train(project_name TEXT, objective TEXT, relation_name TEXT, y_column_name TEXT, algorithm TEXT DEFAULT NULL)`,
		2. `pgml.predict(project_name TEXT, VARIADIC features DOUBLE PRECISION[])`.

		The first function trains a model, given a human-friendly project name, a `regression` or `classification` objective, a table or view name which contains the training and testing datasets, and the name of the `y` column containing the target values. The second function predicts novel datapoints, given the project name for an exiting model trained with `pgml.train`, and a list of features used to train that model.
Expand Down