
Commit 3dd619c

edits (#978)

1 parent 5ae5eb9 commit 3dd619c
File tree

1 file changed (+13 -13 lines)

pgml-dashboard/content/blog/how-to-improve-search-results-with-machine-learning.md

Lines changed: 13 additions & 13 deletions
@@ -97,7 +97,7 @@ Postgres provides the complete reference [documentation](https://www.postgresql.

### Indexing

-Postgres treats everything in the standard SQL `WHERE` clause as a filter, by default. It makes this keyword search work by scanning the entire table, converting each document body to a `tsvector`, and then comparing the `tsquery` to the `tsvector`. This is called a "sequential scan". It's fine for small tables, but for production use cases at scale, we'll need a more efficient solution.
+Postgres treats everything in the standard SQL `WHERE` clause as a filter. By default, it makes this keyword search work by scanning the entire table, converting each document body to a `tsvector`, and then comparing the `tsquery` to the `tsvector`. This is called a "sequential scan". It's fine for small tables, but for production use cases at scale, we'll need a more efficient solution.

The first step is to store the `tsvector` in the table, so we don't have to generate it during each search. We can do this by adding a new `GENERATED` column to our table, that will automatically stay up to date. We also want to search both the **title** and **body**, so we'll concatenate `||` the fields we want to include in our search, separated by a simple space `' '`.

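For reference, a minimal sketch of the generated column that paragraph describes, assuming a table named `documents` with `title` and `body` columns (the column name and expression follow the snippet shown in the next hunk's context):

```sql
-- Assumed table name; the post's actual schema may differ.
-- The generated column is kept up to date automatically and backfills existing rows.
ALTER TABLE documents
ADD COLUMN title_and_body_text tsvector
GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body)) STORED;
```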
@@ -115,7 +115,7 @@ GENERATED ALWAYS AS (to_tsvector('english', title || ' ' || body )) STORED;

!!!

-One nice function of generated columns is that they will backfill the data for existing columns. They can also be indexed, just like any other column. We can add a Generalized Inverted Index (GIN) on this new column that will pre-compute the lists of all documents that contain each keyword. This will allow us to skip the sequential scan, and instead use the index to find the exact list of documents that satisfy any given `tsquery`.
+One nice aspect of generated columns is that they will backfill the data for existing rows. They can also be indexed, just like any other column. We can add a Generalized Inverted Index (GIN) on this new column that will pre-compute the lists of all documents that contain each keyword. This will allow us to skip the sequential scan, and instead use the index to find the exact list of documents that satisfy any given `tsquery`.

!!! generic

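A minimal sketch of the GIN index described above, again assuming the table is named `documents` (the index name is illustrative):

```sql
-- Pre-computes, for each keyword, the list of documents containing it,
-- so matching queries can use the index instead of a sequential scan.
CREATE INDEX documents_title_and_body_text_gin
ON documents USING GIN (title_and_body_text);
```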
@@ -155,11 +155,11 @@ WHERE title_and_body_text @@ to_tsquery('english', 'another & second');

!!!

-We can see our new `tsvector` column in the results now as well, since we used `SELECT *`. You'll notice that the `tsvector` contains the stemmed words from both the **title** and **body**, along with their position. The position information allows Postgres to support **phrase** matches as well as single keywords. You'll also notice that _stopwords_, like "the", "is", and "of" have been removed. This is a common optimization for keyword search, since these words are so common, they don't add much value to the search results.
+We can see our new `tsvector` column in the results now as well, since we used `SELECT *`. You'll notice that the `tsvector` contains the stemmed words from both the **title** and **body**, along with their position. The position information allows Postgres to support _phrase_ matches as well as single keywords. You'll also notice that _stopwords_, like "the", "is", and "of" have been removed. This is a common optimization for keyword search, since these words are so common, they don't add much value to the search results.

### Ranking

-Ranking is a critical component of search, and it's also where Machine Learning becomes critical for great results. Our users will expect us to sort our results with the most relevant at the top. A simple arithmatic relevance score is provided `ts_rank`. It computes the Term Frequency (TF) of each keyword in the query that matches the document. For example, if the document has 2 keyword matches out of 5 words total, it's `ts_rank` will be `2 / 5 = 0.4`. The more matches and the fewer total words, the higher the score and the more relevant the document.
+Ranking is a critical component of search, and it's also where Machine Learning becomes critical for great results. Our users will expect us to sort our results with the most relevant at the top. A simple arithmetic relevance score is provided `ts_rank`. It computes the Term Frequency (TF) of each keyword in the query that matches the document. For example, if the document has 2 keyword matches out of 5 words total, it's `ts_rank` will be `2 / 5 = 0.4`. The more matches and the fewer total words, the higher the score and the more relevant the document.

With multiple query terms OR `|` together, the `ts_rank` will add the numerators and denominators to account for both. For example, if the document has 2 keyword matches out of 5 words total for the first query term, and 1 keyword match out of 5 words total for the second query term, it's ts_rank will be `(2 + 1) / (5 + 5) = 0.3`. The full `ts_rank` function has many additional options and configurations that you can read about in the [documentation](https://www.postgresql.org/docs/current/textsearch-controls.html#TEXTSEARCH-RANKING), but this should give you the basic idea.

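As a hedged illustration of the `ts_rank` arithmetic above (the table name and query terms are assumptions):

```sql
-- The post describes the default score roughly as matches / total words,
-- e.g. 2 / 5 = 0.4, with OR'd terms combining as (2 + 1) / (5 + 5) = 0.3.
SELECT title,
       ts_rank(title_and_body_text, to_tsquery('english', 'another | second')) AS rank
FROM documents
WHERE title_and_body_text @@ to_tsquery('english', 'another | second')
ORDER BY rank DESC;
```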
@@ -189,7 +189,7 @@ Our document that matches 2 of the keywords has twice the score of the documents

### Boosting

-A quick improvement we could make to our search query would be to differentiate relevance of the title and body. It's intuitive that a keyword match in the title is more relevant than a keyword match in the body. We can implement a simple boosting function by multiplying the title rank 2x, and adding it to the body rank. This will **boost** title matches up the rankings in our final results list. This can be done by creating a simple arithmetic formula in the `ORDER BY` clause.
+A quick improvement we could make to our search query would be to differentiate relevance of the title and body. It's intuitive that a keyword match in the title is more relevant than a keyword match in the body. We can implement a simple boosting function by multiplying the title rank 2x, and adding it to the body rank. This will _boost_ title matches up the rankings in our final results list. This can be done by creating a simple arithmetic formula in the `ORDER BY` clause.

!!! generic

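A sketch of that boosting formula in an `ORDER BY` clause, under the same assumed schema; the 2x title weight is the one named in the paragraph:

```sql
-- Recall with the indexed column, then boost: title rank counts 2x, body rank 1x.
-- Output aliases can't be used inside ORDER BY expressions, so the ts_rank calls repeat.
SELECT title,
       ts_rank(to_tsvector('english', title), to_tsquery('english', 'second | title')) AS title_rank,
       ts_rank(to_tsvector('english', body), to_tsquery('english', 'second | title')) AS body_rank
FROM documents
WHERE title_and_body_text @@ to_tsquery('english', 'second | title')
ORDER BY 2 * ts_rank(to_tsvector('english', title), to_tsquery('english', 'second | title'))
       + ts_rank(to_tsvector('english', body), to_tsquery('english', 'second | title')) DESC;
```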
@@ -212,11 +212,11 @@ Wait a second... is a title match 2x or 10x, or maybe log(π / tsrank<sup>2</sup

So far we've only considered simple statistical measures of relevance like `ts_rank`s TF/IDF, but people have a much more sophisticated idea of relevance. Luckily, they'll tell you exactly what they think is relevant by clicking on it. We can use this feedback to train a model that learns the optimal weights of **title_rank** vs **body_rank** for our boosting function. We'll redefine relevance as the probability that a user will click on a search result, given our inputs like **title_rank** and **body_rank**.

-This is considered a Supervised Learning problem, because we have a labeled dataset of user clicks that we can use to train our model. The inputs to our function are called **features** of the data for the machine learning model, and the output is often referred to as the **label**.
+This is considered a Supervised Learning problem, because we have a labeled dataset of user clicks that we can use to train our model. The inputs to our function are called _features_ of the data for the machine learning model, and the output is often referred to as the _label_.

### Training Data

-First things first, we need to record some user clicks on our search results. We'll create a new table to store our training data, which are the observed inputs and output of our new relevance function. In a real system, we'd probably have separate tables to record **sessions**, **searches**, **results**, **clicks** and other events, but for simplicity in this example, we'll just record the exact information we need to train our model in a single table. Everytime we perform a search, we'll record the `ts_rank` for the both the title and body, and whether the user clicked on the result.
+First things first, we need to record some user clicks on our search results. We'll create a new table to store our training data, which are the observed inputs and output of our new relevance function. In a real system, we'd probably have separate tables to record **sessions**, **searches**, **results**, **clicks** and other events, but for simplicity in this example, we'll just record the exact information we need to train our model in a single table. Everytime we perform a search, we'll record the `ts_rank` for the both the **title** and **body**, and whether the user **clicked** on the result.

!!! generic

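A sketch of the single training-data table the paragraph describes; the table name appears in the next hunk's context, while the column names and types here are assumptions based on the prose:

```sql
-- One row per displayed search result: the two rank features and the click label.
CREATE TABLE search_result_clicks (
    title_rank REAL,
    body_rank REAL,
    clicked BOOLEAN
);
```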
@@ -232,9 +232,9 @@ CREATE TABLE search_result_clicks (

!!!

-One of the hardest parts of machine learning is gathering the data from disparate sources and turning it into features like this. There are often teams of data engineers involved maintain endless pipelines from one feature store or data warehouse and then back again. We don't need that complexity in PostgresML and can just insert the ML features directly into the database.
+One of the hardest parts of machine learning is gathering the data from disparate sources and turning it into features like this. There are often teams of data engineers involved in maintaining endless pipelines from one feature store or data warehouse and then back again. We don't need that complexity in PostgresML and can just insert the ML features directly into the database.

-I've made up 4 example searches, across our 3 documents, and recorded the `ts_rank` for the title and body, and whether the user clicked on the result. I've cherry-picked some intuitive results, where the user always clicked on the top ranked document, that has the highest combined title and body ranks. We'll insert this data into our new table.
+I've made up 4 example searches, across our 3 documents, and recorded the `ts_rank` for the **title** and **body**, and whether the user **clicked** on the result. I've cherry-picked some intuitive results, where the user always clicked on the top ranked document, that has the highest combined title and body ranks. We'll insert this data into our new table.

!!! generic

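A hedged example of recording that kind of click data; the values below are purely illustrative, not the post's actual numbers:

```sql
-- Each row: ts_rank of the title, ts_rank of the body, and whether the user clicked.
INSERT INTO search_result_clicks (title_rank, body_rank, clicked)
VALUES
    (0.4, 0.4, true),
    (0.2, 0.2, false),
    (0.0, 0.1, false);
```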
@@ -267,11 +267,11 @@ VALUES

!!!

-In a real application, we'd record the results of millions of searches results with the ts_ranks and clicks from our users, but even this small amount of data is enough to train a model with PostgresML. Bootstrapping or back-filling data is also possible with several techniques. You could build the app, and have your admins or employees just use it to generate training data before a public release.
+In a real application, we'd record the results of millions of searches results with the ts_ranks and clicks from our users, but even this small amount of data is enough to train a model with PostgresML. Bootstrapping or back-filling data is also possible with several techniques. You could build the app, and have your admins or employees use it to generate training data before a public release.

### Training a Model to rank search results

-We'll train a model for our "Search Ranking" project using the `pgml.train` function, which takes several arguments. The `project_name` is just a handle we can use to refer to the model later when we're ranking results, and the `task` is the type of model we want to train. In this case, we want to train a model to predict the probability of a user clicking on a search result, given the `title_rank` and `body_rank` of the result. This is a regression problem, because we're predicting a continuous value between 0 and 1. We could also train a classification model to make a boolean prediction whether a user will click on a result, but we'll save that for another example.
+We'll train a model for our "Search Ranking" project using the `pgml.train` function, which takes several arguments. The `project_name` is a handle we can use to refer to the model later when we're ranking results, and the `task` is the type of model we want to train. In this case, we want to train a model to predict the probability of a user clicking on a search result, given the `title_rank` and `body_rank` of the result. This is a regression problem, because we're predicting a continuous value between 0 and 1. We could also train a classification model to make a boolean prediction whether a user will click on a result, but we'll save that for another example.

Here goes some machine learning:

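A sketch of the training call, assuming PostgresML's `pgml.train` arguments `project_name`, `task`, `relation_name`, and `y_column_name`; the project name comes from the paragraph, the rest is an assumption about how the post wires it up:

```sql
-- Train a regression model that predicts click probability from the rank features.
SELECT * FROM pgml.train(
    project_name => 'Search Ranking',
    task => 'regression',
    relation_name => 'search_result_clicks',
    y_column_name => 'clicked'
);
```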
@@ -364,13 +364,13 @@ If you're watching your database logs, you'll notice the first time a model is u
!!!

-The model is predicting values close to 1 where there was a click, and values closer to 0 where there wasn't a click. This is a good sign that the model is learning something useful. We can also use the `pgml.predict` function to make predictions on new data, and this is where things actually get interesting in online search results with PostgresML.
+The model is predicting values close to 1 when there was a click, and values closer to 0 when there wasn't a click. This is a good sign that the model is learning something useful. We can also use the `pgml.predict` function to make predictions on new data, and this is where things actually get interesting in online search results with PostgresML.

### Ranking Search Results with Machine Learning

Search results are often computed in multiple steps of recall and (re)ranking. Each step can apply more sophisticated (and expensive) models on more and more features, before pruning less relevant results for the next step. We're going to expand our original keyword search query to include a machine learning model that will re-rank the results. We'll use the `pgml.predict` function to make predictions on the title and body rank of each result, and then we'll use the predictions to re-rank the results.

-It's nice to organize the query into logical steps, and we can use **Common Table Expressions** (CTEs) to do this. CTEs are like temporary tables that only exist for the duration of the query. We can use CTEs to organize our query into logical steps. We'll start by defining a CTE that will rank all the documents in our table by the ts_rank for title and body text. We define a CTE with the `WITH` keyword, and then we can use the CTE as if it were a table in the rest of the query. We'll name our CTE **first_pass_ranked_documents**. Having the full power of SQL gives us a lot of power to flex in this step.
+It's nice to organize the query into logical steps, and we can use **Common Table Expressions** (CTEs) to do this. CTEs are like temporary tables that only exist for the duration of the query. We'll start by defining a CTE that will rank all the documents in our table by the ts_rank for title and body text. We define a CTE using the `WITH` keyword, and then we can use the CTE as if it were a table in the rest of the query. We'll name our CTE **first_pass_ranked_documents**. Having the full power of SQL gives us a lot of power to flex in this step.

1) We can efficiently recall matching documents using the keyword index `WHERE title_and_body_text @@ to_tsquery('english', 'second | title'))`
2) We can generate multiple ts_rank scores for each row the documents using the `ts_rank` function as if they were columns in the table

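Putting those two numbered points together, a hedged sketch of the re-ranking query with a `first_pass_ranked_documents` CTE feeding `pgml.predict`; the schema and query terms are assumptions, and `pgml.predict(project, features)` with an array of rank features is the call the paragraph refers to:

```sql
WITH first_pass_ranked_documents AS (
    -- 1) Recall candidates with the keyword index, 2) compute ts_rank features.
    SELECT *,
           ts_rank(to_tsvector('english', title), to_tsquery('english', 'second | title')) AS title_rank,
           ts_rank(to_tsvector('english', body), to_tsquery('english', 'second | title')) AS body_rank
    FROM documents
    WHERE title_and_body_text @@ to_tsquery('english', 'second | title')
)
-- Re-rank by the model's predicted click probability.
SELECT title,
       pgml.predict('Search Ranking', ARRAY[title_rank, body_rank]) AS click_probability
FROM first_pass_ranked_documents
ORDER BY click_probability DESC;
```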