# Initial_audit_changes #1436
Changes from 2 commits
@@ -4,12 +4,14 @@ description: The key concepts that make up PostgresML.
# Overview
PostgresML is a complete [MLOps platform](## "A Machine Learning Operations platform is a set of practices that streamlines bringing machine learning models to production") built on PostgreSQL. Our operating principle is:
> _Move models to the database, rather than constantly moving data to the models._
Data for ML & AI systems is inherently larger and more dynamic than the models. It's more efficient, manageable, and reliable to move models to the database, rather than continuously moving data to the models.
We offer both [managed cloud](/docs/product/cloud-database/) and [local](/docs/resources/developer-docs/installation) installations to provide solutions for wherever you keep your data.
## AI engine
PostgresML allows you to take advantage of the fundamental relationship between data and models, by extending the database with the following capabilities:
@@ -48,8 +50,8 @@ Some of the use cases include:
## Our mission
PostgresML strives to provide access to open source AI for everyone. We are continuously developing PostgresML to keep up with the rapidly evolving use cases for ML & AI, but we remain committed to never breaking user-facing APIs. We welcome contributions to our [open source code and documentation](https://github.com/postgresml) from the community.
## Managed cloud
While our extension and pooler are open source, we also offer a managed cloud database service for production deployments of PostgresML. You can [sign up](https://postgresml.org/signup) for an account and get a free Serverless database in seconds.
craigmoore1 marked this conversation as resolved.
@@ -37,7 +37,7 @@ pgml.train(
| `task` | `'regression'` | The objective of the experiment: `regression`, `classification` or `cluster`. |
| `relation_name` | `'public.search_logs'` | The Postgres table or view where the training data is stored or defined. |
| `y_column_name` | `'clicked'` | The name of the label (aka "target" or "unknown") column in the training table. |
| `algorithm` | `'xgboost'` | <p>The algorithm to train on the dataset.</p> |
**Reviewer comment:** We don't need the `<p>` if we're not nesting multiple links, but why remove the links? Suggested change

**Reviewer comment:** The links 404'ed. Do the pages still exist?

**Reviewer comment:** Yeah, these are supposed to link to
| `hyperparams` | `{ "n_estimators": 25 }` | The hyperparameters to pass to the algorithm for training, JSON formatted. |
| `search` | `grid` | If set, PostgresML will perform a hyperparameter search to find the best hyperparameters for the algorithm. See [hyperparameter-search.md](hyperparameter-search.md "mention") for details. |
| `search_params` | `{ "n_estimators": [5, 10, 25, 100] }` | Search parameters used in the hyperparameter search, using the scikit-learn notation, JSON formatted. |
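As a sketch of how `search` and `search_params` fit together, the call below runs a grid search over `n_estimators`. The project name is a made-up placeholder, and the relation, label column, and search values are borrowed from the example column of the table above:

```sql
-- Hypothetical example: grid search over n_estimators for an XGBoost model.
-- 'Search Ranking' is an illustrative project name, not one defined in these docs.
SELECT * FROM pgml.train(
    'Search Ranking',
    task => 'classification',
    relation_name => 'public.search_logs',
    y_column_name => 'clicked',
    algorithm => 'xgboost',
    search => 'grid',
    search_params => '{ "n_estimators": [5, 10, 25, 100] }'
);
```

Each hyperparameter combination is trained and evaluated, and the best-performing model is kept for the project.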
@@ -63,7 +63,7 @@ This will create a "My Classification Project", copy the `pgml.digits` table int
When used for the first time in a project, the `pgml.train()` function requires the `task` parameter, which can be either `regression` or `classification`. The task determines the relevant metrics and analysis performed on the data. All models trained within the project will refer to those metrics and analysis for benchmarking and deployment.
The first time it is called, the function will also require a `relation_name` and `y_column_name`. The two arguments will be used to create the first snapshot of training and test data. By default, 25% of the data (specified by the `test_size` parameter) will be randomly sampled to measure the performance of the model after the `algorithm` has been trained on the remaining 75% of the data.
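Putting those first-call requirements together, a minimal invocation might look like the following. It assumes the "My Classification Project" and `pgml.digits` example referenced above, and the label column name `target` is an assumption about that dataset:

```sql
-- First call in a project: task, relation_name and y_column_name are required.
-- Subsequent calls can omit them and reuse the project's existing snapshot.
SELECT * FROM pgml.train(
    'My Classification Project',
    task => 'classification',
    relation_name => 'pgml.digits',
    y_column_name => 'target'  -- assumed label column for this example
);
```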
**Reviewer comment:** Why do we prefer not using contractions?

**Reviewer comment:** I suggested it as a way to make it simpler for non-native English speakers to read. Another reason is for translation, but I figured you probably have no plans for that at this point.
!!! tip
@@ -27,7 +27,7 @@ _Result_
### Model from hub
To use a specific model from the HuggingFace model hub, pass the model name along with the task name in `task`.
craigmoore1 marked this conversation as resolved.
```sql
SELECT pgml.transform(
@@ -109,7 +109,7 @@ _Result_
### Beam Search
Text generation typically utilizes a greedy search algorithm that selects the word with the highest probability as the next word in the sequence. However, an alternative method called beam search can be used, which aims to minimize the possibility of overlooking hidden high-probability word combinations. Beam search achieves this by retaining the `num_beams` most likely hypotheses at each step and ultimately selecting the hypothesis with the highest overall probability. We set `num_beams > 1` and `early_stopping=True` so that generation finishes when all beam hypotheses have reached the EOS token.
```sql
SELECT pgml.transform(
@@ -135,14 +135,16 @@ _Result_
]]
```
Sampling methods involve selecting the next word or sequence of words at random from the set of possible candidates, weighted by their probabilities according to the language model. This can result in more diverse and creative text, as well as avoiding repetitive patterns. In its most basic form, sampling means randomly picking the next word $w_t$ according to its conditional probability distribution: $$w_t \sim P(w_t|w_{1:t-1})$$.
However, the randomness of the sampling method can also result in less coherent or inconsistent text, depending on the quality of the model and the chosen sampling parameters such as `temperature`, `top-k`, or `top-p`. Therefore, choosing an appropriate sampling method and parameters is crucial for achieving the desired balance between creativity and coherence in generated text.
You can pass `do_sample = True` in the arguments to use sampling methods. It is recommended to alter `temperature` or `top_p`, but not both.
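As a sketch of enabling sampling in a `pgml.transform()` call (the model name, input text, and argument values here are illustrative assumptions, not values prescribed by this page):

```sql
-- Hypothetical example: enable sampling for text generation.
-- "gpt2" is an assumed model name; tune temperature or top_p here, but not both.
SELECT pgml.transform(
    task => '{
        "task": "text-generation",
        "model": "gpt2"
    }'::JSONB,
    inputs => ARRAY['Once upon a time'],
    args => '{
        "do_sample": true,
        "max_new_tokens": 30
    }'::JSONB
);
```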
### _Temperature_
The `temperature` parameter fine-tunes the level of confidence, diversity, and randomness of a model. It ranges from 0 (very conservative output) to infinity (very diverse output) and controls how the model selects a certain output based on the output's certainty. Higher temperatures should be used when the certainty of the output is low, and lower temperatures should be used when the certainty is very high. A `temperature` of 1 is considered a medium setting.
```sql
SELECT pgml.transform(
    task => '{
@@ -167,6 +169,8 @@ _Result_
### _Top p_
Top_p is a technique used to improve the performance of generative models. It samples from the tokens in the top portion of the probability distribution, allowing for more diverse responses. If you are experiencing repetitive responses, modifying this setting can improve the quality of the output. The value of `top_p` is a number between 0 and 1 that sets the cumulative probability threshold, so a setting of `.8` restricts sampling to the smallest set of most likely tokens whose probabilities add up to at least 80 percent.
```sql
SELECT pgml.transform(
    task => '{