There's a story going around at the moment that people have found code from their private GitHub repositories in the AI training data known as The Stack, using this search tool: https://huggingface.co/spaces/bigcode/in-the-stack
I'm very doubtful that private data has been included in that training set. I think it's far more likely that the repositories in question were public at some point in time, and were gathered up by the Software Heritage project (https://www.softwareheritage.org/) when they archived code from GitHub.
But how can we tell if a private repository was public at some point in the past?
GitHub have a security audit log for logged-in users, but sadly it appears to only cover the past six months.
For a longer history, we can look things up in the GitHub Archive project, which has been recording public events from the GitHub API since 2011.
TLDR: I built a tool for this here: https://observablehq.com/@simonw/github-public-repo-history
The ClickHouse team provide a public tool for querying that data using SQL as a demo of their software. We can use that to try and find out if a repository was public at some point in the past.
Access the tool here - no login required: https://play.clickhouse.com/play
Now execute the following SQL, replacing my username with yours in both places where it occurs:
with public_events as (
  select
    created_at as timestamp,
    'Private repo made public' as action,
    repo_name
  from github_events
  where lower(actor_login) = 'simonw'
    and event_type in ('PublicEvent')
),
most_recent_public_push as (
  select
    max(created_at) as timestamp,
    'Most recent public push' as action,
    repo_name
  from github_events
  where event_type = 'PushEvent'
    and lower(actor_login) = 'simonw'
  group by repo_name
),
combined as (
  select * from public_events
  union all
  select * from most_recent_public_push
)
select * from combined order by timestamp
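To make the logic of that query easier to follow, here is a rough JavaScript sketch of the same three steps applied to an in-memory array. The function name `publicRepoTimeline` is made up for illustration, and the sketch assumes each event is an object with `created_at`, `event_type`, `actor_login` and `repo_name` fields (the column names used in the SQL), with timestamps as strings that sort correctly as text.

```javascript
// Hypothetical helper: mirrors the three CTEs of the SQL query on an in-memory array.
// Assumes created_at values are strings that sort lexicographically,
// e.g. "2021-05-01 00:00:00".
function publicRepoTimeline(events, username) {
  const login = username.trim().toLowerCase();
  const mine = events.filter((e) => e.actor_login.toLowerCase() === login);

  // public_events: one row per PublicEvent
  const publicEvents = mine
    .filter((e) => e.event_type === "PublicEvent")
    .map((e) => ({
      timestamp: e.created_at,
      action: "Private repo made public",
      repo_name: e.repo_name,
    }));

  // most_recent_public_push: max(created_at) per repo_name over PushEvents
  const latestPush = new Map();
  for (const e of mine) {
    if (e.event_type !== "PushEvent") continue;
    const seen = latestPush.get(e.repo_name);
    if (seen === undefined || e.created_at > seen) {
      latestPush.set(e.repo_name, e.created_at);
    }
  }
  const pushes = [...latestPush].map(([repo_name, timestamp]) => ({
    timestamp,
    action: "Most recent public push",
    repo_name,
  }));

  // combined: union all, then order by timestamp
  return [...publicEvents, ...pushes].sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0
  );
}
```

Each returned row has the same three fields as the SQL output: `timestamp`, `action` and `repo_name`. The SQL query computes the same rows on ClickHouse's side.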
The result is a combined timeline showing two things:
- PublicEvent events, which GitHub describes as "When a private repository is made public. Without a doubt: the best GitHub event."
- The most recent PushEvent for each repository. Repositories which started life public won't show up in the PublicEvent list, so this aims to capture them.

Here's an extract from the data I get back when I run the query for myself:
I put together an Observable Notebook that provides a UI for executing this query: https://observablehq.com/@simonw/github-public-repo-history
It uses just three cells of JavaScript. The first provides a username input, with a submit button to avoid firing off SQL queries while the user is still typing their name:
viewof username = Inputs.text({
  placeholder: "Your GitHub username",
  submit: true
})
The second executes the query using the ClickHouse JSON API, described previously:
results = username.trim() &&
  (
    await fetch("https://play.clickhouse.com/?user=play", {
      method: "POST",
      body: `with public_events as (
  select created_at as timestamp, 'Private repo made public' as action, repo_name
  from github_events
  where lower(actor_login) = '${username.trim().toLowerCase()}'
    and event_type in ('PublicEvent')
),
most_recent_public_push as (
  select max(created_at) as timestamp, 'Most recent public push' as action, repo_name
  from github_events
  where event_type = 'PushEvent'
    and lower(actor_login) = '${username.trim().toLowerCase()}'
  group by repo_name
),
combined as (
  select * from public_events
  union all
  select * from most_recent_public_push
)
select * from combined order by timestamp FORMAT JSON`
    })
  ).json()
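Outside of Observable, the same request could be sketched as plain JavaScript along these lines. The helper names (`sqlString`, `buildQuery`, `fetchPublicRepoHistory`) are invented for this example, and the quote escaping is an extra precaution that is not part of the notebook cell; the endpoint and the SQL are the ones used there. It assumes a runtime with a global `fetch`, such as a browser or a recent Node.js.

```javascript
// Endpoint taken from the notebook cell; everything else here is a hypothetical wrapper.
const CLICKHOUSE_URL = "https://play.clickhouse.com/?user=play";

// Escape backslashes and single quotes so the value stays inside its SQL string literal.
function sqlString(value) {
  return "'" + value.replace(/\\/g, "\\\\").replace(/'/g, "\\'") + "'";
}

// Build the SQL with the lowercased, quoted username substituted in both places.
function buildQuery(username) {
  const login = sqlString(username.trim().toLowerCase());
  return `with public_events as (
  select created_at as timestamp, 'Private repo made public' as action, repo_name
  from github_events
  where lower(actor_login) = ${login}
    and event_type in ('PublicEvent')
),
most_recent_public_push as (
  select max(created_at) as timestamp, 'Most recent public push' as action, repo_name
  from github_events
  where event_type = 'PushEvent'
    and lower(actor_login) = ${login}
  group by repo_name
),
combined as (
  select * from public_events
  union all
  select * from most_recent_public_push
)
select * from combined order by timestamp FORMAT JSON`;
}

// POST the query and return the rows from the "data" key of the JSON response.
async function fetchPublicRepoHistory(username) {
  const response = await fetch(CLICKHOUSE_URL, {
    method: "POST",
    body: buildQuery(username),
  });
  if (!response.ok) {
    throw new Error(`ClickHouse returned HTTP ${response.status}`);
  }
  const results = await response.json();
  return results.data;
}
```

Calling `fetchPublicRepoHistory("simonw").then(console.table)` should then print the timeline, provided the public ClickHouse playground is reachable.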
The third conditionally shows a table of results if the data has been fetched:
table = {
  if (results && results.data) {
    return Inputs.table(results.data);
  } else {
    return null;
  }
}
Here's what it looks like running on Observable:
Created 2024-03-20T13:49:53-07:00, updated 2024-03-20T14:32:11-07:00