Movatterモバイル変換


[0]ホーム

URL:


Reviewing your history of public GitHub repositories using ClickHouse

There's a story going around at the moment that people have found code from their private GitHub repositories in the AI training data known as The Stack, using this search tool:https://huggingface.co/spaces/bigcode/in-the-stack

I'm very doubtful that private data has been included in that training set. I think it's far more likely that the repositories in question were public at some point in the time, and were gathered up by thehttps://www.softwareheritage.org/ project when they archived code from GitHub.

But how can we tell if a private repository was public at some point in the past?

GitHub havea security audit log for logged in users, but sadly it appears to only cover the past six months.

For a longer history, we can look things up in theGitHub Archive project, which has been recording public events from the GitHub API since 2011.

TLDR: I built a tool for this here:https://observablehq.com/@simonw/github-public-repo-history

TheClickHouse team provide a public tool for querying that data using SQL as a demo of their software. We can use that to try and find out if a repository was public at some point in the past.

Access the tool here - no login required:https://play.clickhouse.com/play

Now execute the following SQL, replacing my username with yours in both places where it occurs:

with public_eventsas (select    created_atastimestamp,'Private repo made public'as action,    repo_namefrom github_eventswherelower(actor_login)='simonw'and event_typein ('PublicEvent')),most_recent_public_pushas (selectmax(created_at)astimestamp,'Most recent public push'as action,    repo_namefrom github_eventswhere event_type='PushEvent'andlower(actor_login)='simonw'group by repo_name),combinedas (select*from public_eventsunion allselect*from most_recent_public_push)select*from combinedorder bytimestamp

The result is a combined timeline showing two things:

Here's an extract from the data I get back when I run the query for myself:

2017-09-10: Most recent public push, simonw/github-large-file-test - 2017-09-12: Most recent public push, simonw/Houston-Shelters - 2017-09-26: Private repo made public, simonw/squirrelspotter - 2017-10-01: Private repo made public, simonw/simonwillisonblog - 2017-10-12: Most recent public push, simonw/ratelimitcache - 2017-10-15: Most recent public push, simonw/irma-scraped-data - 2017-10-15: Most recent public push, simonw/fema-history - 2017-11-06: Most recent public push, simonw/factory_worker_python - 2017-11-13: Private repo made public, simonw/datasette

A UI for that query using Observable

I put together an Observable Notebook that provides a UI for executing this query:https://observablehq.com/@simonw/github-public-repo-history

It uses just three cells of JavaScript. The first provides a username input, with a submit button to avoid firing off SQL queries while the user is still typing their name:

viewofusername=Inputs.text({placeholder:"Your GitHub username",submit:true})

The second executes the query using the ClickHouse JSON API,described previously:

results=username.trim()&&(awaitfetch("https://play.clickhouse.com/?user=play",{method:"POST",body:`with public_events as (  select    created_at as timestamp,    'Private repo made public' as action,    repo_name  from github_events  where lower(actor_login) = '${username.trim().toLowerCase()}'  and event_type in ('PublicEvent')),most_recent_public_push as (  select    max(created_at) as timestamp,    'Most recent public push' as action,    repo_name  from github_events  where event_type = 'PushEvent'  and lower(actor_login) = '${username.trim()}.toLowerCase()'  group by repo_name),combined as (  select * from public_events  union all select * from most_recent_public_push)select * from combined order by timestamp FORMAT JSON`})).json()

The third conditionally shows a table of results if the data has been fetched:

table={if(results&&results.data){returnInputs.table(results.data);}else{    returnnull;}}

Here's what it looks likerunning on Observable:

A notebook executing the code shown here, displaying a table of results for username simonw.

Related

Created 2024-03-20T13:49:53-07:00, updated 2024-03-20T14:32:11-07:00 ·History ·Edit


[8]ページ先頭

©2009-2025 Movatter.jp