This repository was archived by the owner on Sep 18, 2020. It is now read-only.
Code for my data science accelerator project

MatMoore/accelerator

This repository contains code and a blog for my data science accelerator project (https://matmoore.github.io/accelerator/).

Overview

This was a 3-month project to mine search logs for evaluating/improving the search function on GOV.UK. I worked on this project 1 day a week from April to June 2018.

More background:

Data

For every search session I store these things:

| Variable | Format | Purpose |
| --- | --- | --- |
| searchTerm | String | What the user typed into the search bar |
| finalItemClicked | UUID or URL | ID of the last thing clicked |
| finalRank | Integer | Rank of the last thing clicked |
| clickedResults | Array of UUIDs or URLs | IDs of everything clicked in the session |
| allResults | Array of UUIDs or URLs | IDs of everything displayed on a search result page |

I defined a search session to be a user viewing a distinct search query within a single visit to GOV.UK. So if they return to the same search multiple times, it's still considered the same session, no matter what pages they visited in the middle.
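The session definition above can be sketched as a grouping key. This is an illustrative sketch, not code from the repository; the field names `visit_id` and `search_term` are assumptions about the raw analytics rows:

```python
from collections import defaultdict


def group_into_sessions(hits):
    """Group raw analytics hits into search sessions.

    A session is one (visit, normalised query) pair: repeat views of the
    same query within a visit collapse into a single session, regardless
    of what pages were visited in between.
    """
    sessions = defaultdict(list)
    for hit in hits:
        # Normalise the query so trivial variations map to one session.
        key = (hit["visit_id"], hit["search_term"].strip().lower())
        sessions[key].append(hit)
    return sessions
```

Keying on the visit rather than on time windows is what makes "leave and come back to the same search" count as one session.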

Running the code in this repository

Dependencies

To run this code you need access to the GOV.UK Google Analytics BigQuery export, and a relational database to write data to.

You need to configure the following environment variables:

| Variable | Format | Purpose | Default |
| --- | --- | --- | --- |
| DATABASE_URL | String | Which local database to use | postgres://localhost/accelerator |
| BIGQUERY_PRIVATE_KEY_ID | String | Key ID from BigQuery credentials | |
| BIGQUERY_PRIVATE_KEY | SSH key | SSH key from BigQuery credentials | |
| BIGQUERY_CLIENT_EMAIL | Email address | Client email from BigQuery credentials | |
| BIGQUERY_CLIENT_ID | String | Client ID from BigQuery credentials | |
| DEBUG | String | If set to anything, debug the code using part of the dataset | |

These can be set in a `.env` file for local development when using pipenv.
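For local development, a minimal `.env` could look like the following. All values here are placeholders; substitute your own BigQuery service account credentials:

```
DATABASE_URL=postgres://localhost/accelerator
BIGQUERY_PRIVATE_KEY_ID=<your key id>
BIGQUERY_PRIVATE_KEY=<your private key>
BIGQUERY_CLIENT_EMAIL=<service account email>
BIGQUERY_CLIENT_ID=<client id>
```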

Running the ETL pipeline

The following scripts form a pipeline to extract, transform and load the data into a database:

  • `pipenv run python bigquery.py` exports session data from Google BigQuery
  • `pipenv run clean_data_from_bigquery.py [PATH_TO_RAW_DATA] [OUTPUT_PATH]` cleans up the output of `bigquery.py` and produces a single dataset where each row is a unique combination of (session, query, document)
  • `pipenv run load_sessions.py [INPUT_FILE]` groups the data by session and imports it into a local database

Some of these scripts use hardcoded dates and filenames, so check the code before running them.
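The core transformation of the clean-up step, reducing the raw export to unique (session, query, document) rows, can be sketched like this. This is an illustrative sketch rather than the actual `clean_data_from_bigquery.py` logic, and the dict keys are assumed column names:

```python
def unique_rows(rows):
    """Deduplicate raw rows so each (session, query, document)
    combination appears exactly once, keeping first occurrence."""
    seen = set()
    out = []
    for row in rows:
        key = (row["session"], row["query"], row["document"])
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```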

After running these you will have the following tables, arranged as a star schema:

  • searches - observations, where each row is a search session
  • queries - each row is a unique search query
  • datasets - each row records metadata about a single run of the `load_sessions.py` script. This is for debugging purposes only.
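The star layout can be sketched as DDL, with `searches` as the fact table referencing the two dimension tables. The column names below are illustrative assumptions (the real tables are created by `load_sessions.py` against Postgres); SQLite is used here only to keep the sketch self-contained:

```python
import sqlite3

schema = """
CREATE TABLE queries (
    query_id INTEGER PRIMARY KEY,
    search_term TEXT,
    high_volume BOOLEAN DEFAULT 0
);
CREATE TABLE datasets (
    dataset_id INTEGER PRIMARY KEY,
    loaded_at TEXT
);
-- Fact table: one row per search session, pointing at the dimensions.
CREATE TABLE searches (
    search_id INTEGER PRIMARY KEY,
    query_id INTEGER REFERENCES queries (query_id),
    dataset_id INTEGER REFERENCES datasets (dataset_id),
    final_rank INTEGER
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)
```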

Once the data is loaded, I manually ran a SQL query to mark queries as high volume or low volume. This is a separate step because I ended up loading the data in batches, and I could consider more queries high volume if I collected more data.

```sql
with foo as (
    select query_id
    from queries
    join searches using (query_id)
    group by query_id
    having count(*) > 1000
)
update queries
set high_volume = true
from foo
where queries.query_id = foo.query_id;
```

Training a click model

To train the click model, first run `pipenv run split_data.py` to create training/test datasets. You need to have run all the previous steps first. This will output CSV files with the test and training datasets.

Then run `pipenv run python estimate_with_pyclick.py`. This uses a Simplified Dynamic Bayesian Network model, which should be very fast (a few minutes on my MacBook Pro). In contrast, the full Dynamic Bayesian Network model takes hours rather than minutes. If you want to speed it up you can try using PyPy as recommended by PyClick, but I didn't get this working.
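The Simplified DBN model reduces to simple counting, which is why it is so much faster than the full model. The sketch below is a toy re-implementation of those counting rules to show the idea; it is not the PyClick API, and the session dict format is an assumption:

```python
from collections import defaultdict


def sdbn_estimate(sessions):
    """Toy SDBN estimator. Each session is a dict with keys
    "query" (str), "results" (doc ids in rank order) and
    "clicks" (set of clicked doc ids). Returns a dict mapping
    (query, doc) -> (attractiveness, satisfaction)."""
    shown = defaultdict(int)         # shown at or above the last click
    clicked = defaultdict(int)       # sessions where the doc was clicked
    last_clicked = defaultdict(int)  # sessions where it was the last click
    for s in sessions:
        ranks = [i for i, d in enumerate(s["results"]) if d in s["clicks"]]
        if not ranks:
            continue  # SDBN only learns from sessions with a click
        last = max(ranks)
        for i, doc in enumerate(s["results"][: last + 1]):
            key = (s["query"], doc)
            shown[key] += 1
            if doc in s["clicks"]:
                clicked[key] += 1
                if i == last:
                    last_clicked[key] += 1
    return {
        k: (clicked[k] / shown[k],
            last_clicked[k] / clicked[k] if clicked[k] else 0.0)
        for k in shown
    }
```

Reranking by estimated relevance (attractiveness × satisfaction) is then a single sort per query.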

Evaluating the click model's inferred optimal ranking

The trained click model can be used to rerank a set of search results so that the most "relevant" results are at the top. I compared this to the ranking the user originally saw, by looking at whether their chosen result moved up or down.

The script I used to do this is `evaluate_model.py`.

Unfortunately this metric is biased towards results that were originally ranked higher up, but I didn't come up with a better one in the time I had.
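The movement metric described above can be sketched in a few lines. This is an illustrative sketch, not the actual `evaluate_model.py` code:

```python
def rank_movement(original, reranked, chosen):
    """How far the user's chosen result moved after reranking.

    `original` and `reranked` are lists of doc ids; `chosen` must
    appear in both. Positive means the chosen result moved up
    (towards rank 0), negative means it moved down.
    """
    return original.index(chosen) - reranked.index(chosen)
```

The bias mentioned above shows up here directly: a result already at rank 0 can only move down, so averaging this number over sessions favours the original top-ranked results.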

