This repository was archived by the owner on Sep 18, 2020. It is now read-only.
Code for my data science accelerator project

MatMoore/accelerator

This repository contains code and a blog for my data science accelerator project (https://matmoore.github.io/accelerator/).

Overview

This was a 3-month project to mine search logs for evaluating/improving the search function on GOV.UK. I worked on this project 1 day a week from April to June 2018.

More background:

Data

For every search session I store these things:

| Variable | Format | Purpose |
| --- | --- | --- |
| searchTerm | String | What the user typed into the search bar |
| finalItemClicked | UUID or URL | ID of the last thing clicked |
| finalRank | Integer | Rank of the last thing clicked |
| clickedResults | Array of UUIDs or URLs | IDs of everything clicked in the session |
| allResults | Array of UUIDs or URLs | IDs of everything displayed on a search result page |

I defined a search session to be a user viewing a distinct search query within a single visit to GOV.UK. So if they return to the same search multiple times, it's still considered the same session, no matter what pages they visited in the middle.
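The session definition above can be sketched as a grouping key. This is an illustrative sketch, not code from the repository; the field names `visit_id` and `search_term` are assumptions about the raw analytics rows:

```python
from collections import defaultdict


def group_into_sessions(hits):
    """Group raw analytics hits into search sessions.

    A session is one (visit, normalised query) pair: repeat views of the
    same query within a visit collapse into a single session, regardless
    of what pages were visited in between.
    """
    sessions = defaultdict(list)
    for hit in hits:
        # Normalise the query so trivial variations map to one session.
        key = (hit["visit_id"], hit["search_term"].strip().lower())
        sessions[key].append(hit)
    return sessions
```

Keying on the visit rather than on time windows is what makes "leave and come back to the same search" count as one session.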

Running the code in this repository

Dependencies

To run this code you need access to the GOV.UK Google Analytics BigQuery export, and a relational database to write data to.

You need to configure the following environment variables:

| Variable | Format | Purpose | Default |
| --- | --- | --- | --- |
| DATABASE_URL | String | Which local database to use | postgres://localhost/accelerator |
| BIGQUERY_PRIVATE_KEY_ID | String | Key ID from BigQuery credentials | |
| BIGQUERY_PRIVATE_KEY | SSH key | SSH key from BigQuery credentials | |
| BIGQUERY_CLIENT_EMAIL | Email address | Client email from BigQuery credentials | |
| BIGQUERY_CLIENT_ID | String | Client ID from BigQuery credentials | |
| DEBUG | String | If set to anything, debug the code using part of the dataset | |

These can be set in a `.env` file for local development when using pipenv.
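For local development, a minimal `.env` could look like the following. All values here are placeholders; substitute your own BigQuery service account credentials:

```
DATABASE_URL=postgres://localhost/accelerator
BIGQUERY_PRIVATE_KEY_ID=<your key id>
BIGQUERY_PRIVATE_KEY=<your private key>
BIGQUERY_CLIENT_EMAIL=<service account email>
BIGQUERY_CLIENT_ID=<client id>
```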

Running the ETL pipeline

The following scripts form a pipeline to extract, transform and load the data into a database:

  • `pipenv run python bigquery.py` exports session data from Google BigQuery
  • `pipenv run clean_data_from_bigquery.py [PATH_TO_RAW_DATA] [OUTPUT_PATH]` cleans up the output of `bigquery.py` and produces a single dataset where each row is a unique combination of (session, query, document)
  • `pipenv run load_sessions.py [INPUT_FILE]` groups the data by session and imports it into a local database

Some of these scripts use hardcoded dates and filenames, so check the code before running them.
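The core transformation of the clean-up step, reducing the raw export to unique (session, query, document) rows, can be sketched like this. This is an illustrative sketch rather than the actual `clean_data_from_bigquery.py` logic, and the dict keys are assumed column names:

```python
def unique_rows(rows):
    """Deduplicate raw rows so each (session, query, document)
    combination appears exactly once, keeping first occurrence."""
    seen = set()
    out = []
    for row in rows:
        key = (row["session"], row["query"], row["document"])
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```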

After running these you will have the following tables, arranged as a star schema:

  • searches - observations, where each row is a search session
  • queries - each row is a unique search query
  • datasets - each row records metadata about a single run of the `load_sessions.py` script. This is for debugging purposes only.
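The star layout can be sketched as DDL, with `searches` as the fact table referencing the two dimension tables. The column names below are illustrative assumptions (the real tables are created by `load_sessions.py` against Postgres); SQLite is used here only to keep the sketch self-contained:

```python
import sqlite3

schema = """
CREATE TABLE queries (
    query_id INTEGER PRIMARY KEY,
    search_term TEXT,
    high_volume BOOLEAN DEFAULT 0
);
CREATE TABLE datasets (
    dataset_id INTEGER PRIMARY KEY,
    loaded_at TEXT
);
-- Fact table: one row per search session, pointing at the dimensions.
CREATE TABLE searches (
    search_id INTEGER PRIMARY KEY,
    query_id INTEGER REFERENCES queries (query_id),
    dataset_id INTEGER REFERENCES datasets (dataset_id),
    final_rank INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```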

Once the data is loaded, I manually ran a SQL query to mark queries as high volume or low volume. This is a separate step because I ended up loading the data in batches, and I could consider more queries high volume if I collected more data.

```sql
with foo as (
    select query_id
    from queries
    join searches using (query_id)
    group by query_id
    having count(*) > 1000
)
update queries
set high_volume = true
from foo
where queries.query_id = foo.query_id;
```

Training a click model

To train the click model, first run `pipenv run split_data.py` to create training/test datasets. You need to have run all the previous steps first. This will output CSV files with the test and training datasets.

Then run `pipenv run python estimate_with_pyclick.py`. This uses a Simplified Dynamic Bayesian Network model, which should be very fast (a few minutes on my MacBook Pro). In contrast, the full Dynamic Bayesian Network model takes hours rather than minutes. If you want to speed it up you can try using PyPy as recommended by PyClick, but I didn't get this working.
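The Simplified DBN model reduces to simple counting, which is why it is so much faster than the full model. The sketch below is a toy re-implementation of those counting rules to show the idea; it is not the PyClick API, and the session dict format is an assumption:

```python
from collections import defaultdict


def sdbn_estimate(sessions):
    """Toy SDBN estimator. Each session is a dict with keys
    "query" (str), "results" (doc ids in rank order) and
    "clicks" (set of clicked doc ids). Returns a dict mapping
    (query, doc) -> (attractiveness, satisfaction)."""
    shown = defaultdict(int)         # shown at or above the last click
    clicked = defaultdict(int)       # sessions where the doc was clicked
    last_clicked = defaultdict(int)  # sessions where it was the last click
    for s in sessions:
        ranks = [i for i, d in enumerate(s["results"]) if d in s["clicks"]]
        if not ranks:
            continue  # SDBN only learns from sessions with a click
        last = max(ranks)
        for i, doc in enumerate(s["results"][: last + 1]):
            key = (s["query"], doc)
            shown[key] += 1
            if doc in s["clicks"]:
                clicked[key] += 1
                if i == last:
                    last_clicked[key] += 1
    return {
        k: (clicked[k] / shown[k],
            last_clicked[k] / clicked[k] if clicked[k] else 0.0)
        for k in shown
    }
```

Reranking by estimated relevance (attractiveness × satisfaction) is then a single sort per query.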

Evaluating the click model's inferred optimal ranking

The trained click model can be used to rerank a set of search results so that the most "relevant" results are at the top. I compared this to the ranking the user originally saw, by looking at whether their chosen result moved up or down.

The script I used to do this is `evaluate_model.py`.

Unfortunately this metric is biased towards results that were originally ranked higher up, but I didn't come up with a better one in the time I had.
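The movement metric described above can be sketched in a few lines. This is an illustrative sketch, not the actual `evaluate_model.py` code:

```python
def rank_movement(original, reranked, chosen):
    """How far the user's chosen result moved after reranking.

    `original` and `reranked` are lists of doc ids; `chosen` must
    appear in both. Positive means the chosen result moved up
    (towards rank 0), negative means it moved down.
    """
    return original.index(chosen) - reranked.index(chosen)
```

The bias mentioned above shows up here directly: a result already at rank 0 can only move down, so averaging this number over sessions favours the original top-ranked results.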

