Scrape and analyze FBREF data with kickR.
jeffreyohene/kickR
kickR 1.0.0
kickR is a comprehensive R package for scraping football metrics from FBRef. Whether you're an analyst, a data scientist, or a football enthusiast, kickR gives you the tools to access player and team statistics from football leagues around the world, making it easy to gather metrics for your analysis and data visualizations.
kickR is written in the R programming language. If you have never used R before, you will need to download and install R and RStudio for your operating system before getting started.
If you are more comfortable in another programming language such as Python, you can still scrape the data you need in R: kickR provides functions to save and export the scraped data so you can continue your analysis or build your visualizations elsewhere.
For some examples of what you can do with kickR data, kindly check this repo out.
Scraping Data: Easily retrieve football statistics, player data, and team metrics from websites like FBref.
League Coverage:
- Premier League
- Championship
- Serie A
- Ligue 1
- La Liga
- Segunda División
- Serie B
- Bundesliga
- Eredivisie
- Campeonato Brasileiro Série A
- Liga MX
- Major League Soccer
- Primeira Liga
- Bundesliga 2
- Belgian Pro League
- Ligue 2
Data Analysis: FBRef groups its metrics under 11 categories, listed below with the top metrics for each:
- Standard: General team metrics, xG, npxG, xG performance.
- Goalkeeping: Clean sheets, save percentage, goals conceded.
- Advanced Goalkeeping: Free kick and corner kick goals conceded, post-shot expected goals (PSxG), PSxG performance, average passing length, goal kicks breakdown, crosses faced and crosses stopped, and sweeping metrics.
- Shooting: Goals, shots and shots on target, average shot distance, xG and npxG performance.
- Passing: Total passes, total passing distance, total progressive passing distance, pass length breakdown (attempted, completed and completion rates of short, medium and long passes), expected assisted goals, expected assists, key passes, final third passes, and crosses and passes into the penalty area.
- Pass Types: Switches, throw-ins, through balls, crosses, live in-play and dead passes, in-swinging and out-swinging corner kicks, and passes offside.
- Goal & Shot Creation: Goal- and shot-creating actions from live and dead passes, take-ons, shots, fouls drawn and defensive actions.
- Defensive Actions: Tackles, challenges, blocks, interceptions, clearances and errors leading to an opponent shot.
- Possession: Touches, take-ons, carries and passes received.
- Playing Time: Matches played, minutes played, minutes per match played, starts, points per match, goals scored and goals conceded when playing, xG performance when playing.
- Miscellaneous: Disciplinary record, fouls made and drawn, ball recoveries and offsides.
Data Export: After scraping data, you can export it as a .rds file if you use R or as a .csv, .xlsx or .json file.
Additional Feature: kickR can also calculate player and team similarity with the find_similar_players and find_similar_teams functions. This can help you scout players and teams who play in a certain way. Note that similarity does not directly equate to style: two players compared on touches in possession, touches in the middle third and carries made can be considered similar even though their playing styles differ.
Open Source: kickR is open-source software distributed under the MIT License.
To install, run the code below:
```R
# Install latest development version of kickR.
if (!requireNamespace('devtools', quietly = TRUE)) {
  install.packages('devtools')
}
devtools::install_github('jeffreyohene/kickR')
```
Here is the function syntax for scraping team data from a league:
```R
fbref_team_stats <- function(league = NULL, season = NULL, type = NULL)
```
Below are the leagues that kickR supports; pass one of these values to the league argument of the function. Note that when a league is not supplied, the English Premier League is selected automatically.
- premier_league
- championship
- serie_a
- la_liga
- ligue_1
- segunda_division
- serie_b
- bundesliga
- mls
- eredivisie
- br_serie_a
- liga_mx
- primera_liga
- bundesliga_2
- belgian_pro_league
- ligue_2
Below are the metrics available to be scraped; pass one of these values to the type parameter. When the type argument is NULL, it defaults to standard.
- standard
- goalkeeping
- advanced_goalkeeping
- shooting
- passing
- pass_types
- goal_creation
- defensive_actions
- possession
- playing_time
- miscellaneous
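The defaulting rule described above can be mirrored in a small wrapper of your own. The sketch below uses a hypothetical helper, `resolve_type()`, which is not part of kickR; it falls back to `standard` when no type is given and stops early on a value that is not in the list, before any scraping is attempted.

```R
# The type values listed above
metric_types <- c("standard", "goalkeeping", "advanced_goalkeeping", "shooting",
                  "passing", "pass_types", "goal_creation", "defensive_actions",
                  "possession", "playing_time", "miscellaneous")

# Hypothetical helper (not a kickR function): apply the default, reject unknown values
resolve_type <- function(type = NULL) {
  if (is.null(type)) {
    return("standard")
  }
  if (!is.character(type) || length(type) != 1 || !(type %in% metric_types)) {
    stop("Unknown metric type: ", paste(type, collapse = ", "))
  }
  type
}

resolve_type()           # "standard"
resolve_type("passing")  # "passing"
```

Checking the value up front gives a clearer error than an empty result returned after a request has already been made.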
The season parameter is the remaining argument to pass to the function. When left blank, it defaults to the current season. It should be supplied in the format YYYY/YYYY, so if you want data for the 2022 to 2023 season, you can supply 2022/2023 to the season argument. FBREF started collecting metrics for most leagues in 2017 or 2018, so should the function return nothing for the league you selected, visit the website to check whether data is actually available for that season.
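If you plan to loop over several seasons, it can help to build and check the season strings first. This is a minimal sketch with two hypothetical helpers, `make_seasons()` and `is_season_format()`, neither of which is part of kickR; it only checks the YYYY/YYYY shape described above and does not know which seasons FBREF actually has data for.

```R
# Hypothetical helper: build "YYYY/YYYY" strings from starting years
make_seasons <- function(start_years) {
  start_years <- as.integer(start_years)
  sprintf("%d/%d", start_years, start_years + 1L)
}

# Hypothetical helper: TRUE when the string has the YYYY/YYYY shape
is_season_format <- function(season) {
  is.character(season) && length(season) == 1 &&
    grepl("^[0-9]{4}/[0-9]{4}$", season)
}

make_seasons(2021:2022)        # "2021/2022" "2022/2023"
is_season_format("2022/2023")  # TRUE
is_season_format("2022-2023")  # FALSE
```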
To scrape team statistics from the available football leagues using kickR, follow these steps:
- Load the kickR package in your R environment.
```R
library(kickR)
```
```R
# To scrape Bundesliga goalkeeping data for 2023/2024
bundesliga_goalkeeping <- fbref_team_stats(league = "bundesliga",
                                           season = "2023/2024",
                                           type = "goalkeeping")

# Expected output as at 12/10/2023
# A tibble: 18 × 21
   club           league matches_played squad total_minutes_played mins_per_90 goals_against goals_against_per90
   <chr>          <chr>  <chr>          <chr> <chr>                <chr>       <chr>         <chr>
 1 Augsburg       2      34             34    3,060                34.0        60            1.76
 2 Bayern Munich  3      34             34    3,060                34.0        45            1.32
 3 Bochum         2      34             34    3,060                34.0        74            2.18
 4 Darmstadt 98   2      34             34    3,060                34.0        86            2.53
 5 Dortmund       2      34             34    3,060                34.0        43            1.26
 6 Eint Frankfurt 2      34             34    3,060                34.0        50            1.47
 7 Freiburg       1      34             34    3,060                34.0        58            1.71
 8 Gladbach       2      34             34    3,060                34.0        67            1.97
 9 Heidenheim     1      34             34    3,060                34.0        55            1.62
10 Hoffenheim     1      34             34    3,060                34.0        66            1.94
11 Köln           1      34             34    3,060                34.0        60            1.76
12 Leverkusen     2      34             34    3,060                34.0        24            0.71
13 Mainz 05       2      34             34    3,060                34.0        51            1.50
14 RB Leipzig     2      34             34    3,060                34.0        39            1.15
15 Stuttgart      2      34             34    3,060                34.0        39            1.15
16 Union Berlin   2      34             34    3,060                34.0        58            1.71
17 Werder Bremen  2      34             34    3,060                34.0        54            1.59
18 Wolfsburg      2      34             34    3,060                34.0        56            1.65
# ℹ 13 more variables: shots_on_target_against <chr>, saves <chr>, save_percentage <chr>, wins <chr>, draws <chr>,
#   losses <chr>, clean_sheets <chr>, clean_sheet_percentage <chr>, penalties_attempted <chr>,
#   penalty_kicks_allowed <chr>, penalty_kicks_saved <chr>, penalty_kicks_missed <chr>,
#   penalty_kicks_save_percentage <chr>
```
```R
# Passing data for La Liga, 2023/2024 season
# (leave the season argument out to get the latest statistics for a league instead)
la_liga_passing <- fbref_team_stats(league = "la_liga",
                                    season = "2023/2024",
                                    type = "passing")
la_liga_passing
# A tibble: 20 × 26
   club            number_of_players_used mins_per_90 total_passes_completed total_passes_attempted pass_completion_perc…¹
   <chr>           <chr>                  <chr>       <chr>                  <chr>                  <chr>
 1 Alavés          30                     38.0        10091                  14116                  71.5
 2 Almería         35                     38.0        12774                  16540                  77.2
 3 Athletic Club   27                     38.0        14509                  18724                  77.5
 4 Atlético Madrid 27                     38.0        17064                  20709                  82.4
 5 Barcelona       29                     38.0        21506                  24761                  86.9
 6 Betis           35                     38.0        15310                  18979                  80.7
 7 Cádiz           34                     38.0        10763                  14990                  71.8
 8 Celta Vigo      31                     38.0        13890                  17776                  78.1
 9 Getafe          33                     38.0        10713                  15193                  70.5
10 Girona          25                     38.0        18793                  21914                  85.8
11 Granada         40                     38.0        12127                  15976                  75.9
12 Las Palmas      29                     38.0        19105                  22882                  83.5
13 Mallorca        25                     38.0        11476                  15645                  73.4
14 Osasuna         29                     38.0        12874                  17335                  74.3
15 Rayo Vallecano  26                     38.0        13282                  17459                  76.1
16 Real Madrid     27                     38.0        21794                  24691                  88.3
17 Real Sociedad   31                     38.0        15263                  19249                  79.3
18 Sevilla         35                     38.0        14628                  18599                  78.6
19 Valencia        29                     38.0        12233                  16255                  75.3
20 Villarreal      32                     38.0        15349                  18637                  82.4
# ℹ abbreviated name: ¹pass_completion_percentage
# ℹ 20 more variables: total_passing_distance <chr>, total_progressive_distance <chr>, short_passes_completed <chr>,
#   short_passes_attempted <chr>, short_pass_completion_percentage <chr>, medium_passes_completed <chr>,
#   medium_passes_attempted <chr>, medium_pass_completion_percentage <chr>, long_passes_completed <chr>,
#   long_passes_attempted <chr>, long_pass_completion_percentage <chr>, assists <chr>, xAG <chr>, xA <chr>,
#   xag_performance <chr>, key_passes <chr>, passes_into_final_third <chr>, passes_into_penalty_box <chr>,
#   crosses_into_penalty_box <chr>, progressive_passes <chr>
```
kickR also supports leagues outside of Europe, such as the Mexican Liga MX. Scraping them follows the same pattern as for other leagues. If you leave the season argument blank, kickR scrapes data for the current season, so if we wanted to see the latest shot and goal creation stats across clubs in the Mexican league, we can do it like this:
```R
# Scrape latest Liga MX shot and goal creation stats
liga_mx_sca_gca <- fbref_team_stats(league = "liga_mx", type = "goal_creation")
liga_mx_sca_gca
# A tibble: 18 × 19
   club    number_of_players_used mins_per_90 shot_creating_actions shot_creating_action…¹ sca_live_passes sca_dead_passes
   <chr>   <chr>                  <chr>       <chr>                 <chr>                  <chr>           <chr>
 1 América 24                     4.0         93                    23.25                  68              8
 2 Atlas   21                     4.0         79                    19.75                  50              13
 3 Atléti… 20                     4.0         80                    20.00                  58              6
 4 Cruz A… 20                     4.0         120                   30.00                  89              16
 5 FC Juá… 20                     4.0         68                    17.00                  51              4
 6 Guadal… 19                     4.0         95                    23.75                  73              5
 7 León    20                     4.0         87                    21.75                  60              8
 8 Mazatl… 20                     4.0         73                    18.25                  53              9
 9 Monter… 20                     4.0         84                    21.00                  74              2
10 Necaxa  21                     4.0         85                    21.25                  64              10
11 Pachuca 21                     4.0         85                    21.25                  54              12
12 Puebla  19                     4.0         114                   28.50                  81              14
13 Querét… 22                     4.0         61                    15.25                  43              8
14 Santos  22                     4.0         46                    11.50                  26              7
15 Tijuana 21                     4.0         87                    21.75                  68              6
16 Toluca  19                     4.0         77                    19.25                  63              7
17 UANL    18                     4.0         109                   27.25                  80              10
18 UNAM    22                     4.0         112                   28.00                  71              16
# ℹ abbreviated name: ¹shot_creating_actions_per90
# ℹ 12 more variables: sca_take_ons <chr>, sca_shots <chr>, sca_fouls <chr>, sca_defensive_actions <chr>,
#   goal_creating_actions <chr>, goal_creating_actions_per90 <chr>, gca_live_passes <chr>, gca_dead_passes <chr>,
#   gca_take_ons <chr>, gca_shots <chr>, gca_fouls <chr>, gca_defensive_actions <chr>
```
With this version you can access player data for every available league on FBREF. Note that player data scraping works a little differently from team data scraping: since the player data tables on the site are dynamically rendered, we will use JavaScript to scrape the data. To use this function, you will need to have Mozilla Firefox installed on your computer. If you encounter any problems during scraping, use Ctrl + Shift + F10 to restart your R session, then run the function again.
To scrape the EFL Championship passing data for players for the 2022/2023 season, for example, you can use this:
```R
# Scrape passing stats for all EFL players in the 2022/2023 season
efl_passing_players <- fbref_player_stats(season = "2022/2023",
                                          league = "championship",
                                          type = "passing")

# Expected output
efl_passing_players
# A tibble: 750 × 30
   player           nation  position club       age   birth_year mins_per_90 total_passes_completed total_passes_attempted
   <chr>            <chr>   <chr>    <chr>      <chr> <chr>      <chr>       <chr>                  <chr>
 1 Max Aarons       eng ENG DF       Norwich C… 22    2000       42.8        2008                   2536
 2 Thelo Aasgaard   no NOR  FW,MF    Wigan Ath… 20    2002       17.6        399                    507
 3 Nelson Abbey     eng ENG DF       Reading    18    2003       0.2         6                      9
 4 Kelvin Abrefa    eng ENG MF,DF    Reading    18    2003       1.4         20                     36
 5 Finlay Adair     eng ENG FW,MF    Preston    17    2005       0.7         3                      8
 6 Elijah Adebayo   eng ENG FW       Luton Town 24    1998       35.7        388                    625
 7 Toby Adeyemo     eng ENG MF,FW    Watford    17    2005       1.1         9                      15
 8 Albert Adomah    gh GHA  FW,MF    QPR        34    1987       14.1        250                    423
 9 Michael Adu-Poku eng ENG FW       Watford    16    2005       0.1         0                      1
10 Benik Afobe      cd COD  FW       Millwall   29    1993       10.3        127                    192
# ℹ 740 more rows
# ℹ 21 more variables: pass_completion_percentage <chr>, total_passing_distance <chr>, total_progressive_distance <chr>,
#   short_passes_completed <chr>, short_passes_attempted <chr>, short_pass_completion_percentage <chr>,
#   medium_passes_completed <chr>, medium_passes_attempted <chr>, medium_pass_completion_percentage <chr>,
#   long_passes_completed <chr>, long_passes_attempted <chr>, long_pass_completion_percentage <chr>, assists <chr>,
#   xAG <chr>, xA <chr>, xag_performance <chr>, key_passes <chr>, passes_into_final_third <chr>,
#   passes_into_penalty_box <chr>, crosses_into_penalty_box <chr>, progressive_passes <chr>
# ℹ Use `print(n = ...)` to see more rows
```
To use this function, you will need to have Firefox installed on your computer.
```R
# Extract player passing data
df <- fbref_player_stats(season = "2023/2024",
                         league = "premier_league",
                         type = "passing")

# Find players similar to Martin Ødegaard in the English Premier League
m_odegaard_sim_pl <- find_similar_players(df = df,
                                          player = "Martin Ødegaard",
                                          metrics = c("key_passes", "passes_into_final_third"),
                                          formula = "euclidean",
                                          top_n = 15)
m_odegaard_sim_pl
                    player distance
580        Martin Ødegaard  0.00000
184        Bruno Fernandes 29.54657
415            Cole Palmer 38.41875
552      James Ward-Prowse 44.41846
317         James Maddison 46.09772
220        Bruno Guimarães 46.64762
201     Morgan Gibbs-White 53.85165
313           Douglas Luiz 54.12947
316    Alexis Mac Allister 55.03635
196        Conor Gallagher 55.90170
18  Trent Alexander-Arnold 56.63921
434            Pedro Porro 56.79789
457       Andrew Robertson 58.00000
114             Lewis Cook 60.16644
418          Lucas Paquetá 61.40033
```
It is usually better to have a larger dataframe. You can use the fbref_player_stats() function to scrape player stats from as many leagues as you like and use the rbind() function to combine them into one larger dataframe. A deep pool of players helps you unearth hidden players who are really good but play in a lesser-known league. Conveniently, FBREF has all players in the top 5 leagues in a single table, which you can scrape with kickR using the fbref_big5_player_stats() function. If we wanted to see which players perform similarly to Martin Ødegaard in terms of key passes and passes into the final third, we could go about it like this:
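For leagues outside the top 5, the rbind() approach mentioned above might look like the following. This is a sketch on toy data frames standing in for scraped tables (the real ones would come from fbref_player_stats() and carry many more columns); the `league` column and the object names are illustrative additions, not something kickR creates.

```R
# Toy stand-ins for two scraped player tables (all columns character, as in kickR output)
premier_league <- data.frame(player = c("Player A", "Player B"),
                             key_passes = c("10", "7"),
                             stringsAsFactors = FALSE)
eredivisie <- data.frame(player = "Player C",
                         key_passes = "12",
                         stringsAsFactors = FALSE)

tables <- list(premier_league = premier_league, eredivisie = eredivisie)

# Tag every table with the league it came from so rows stay traceable after stacking
tagged <- Map(function(tbl, name) {
  tbl$league <- name
  tbl
}, tables, names(tables))

# rbind() needs identical column names in every table
combined <- do.call(rbind, unname(tagged))
combined
```

Scraping the same `type` for every league keeps the column names aligned, which is what rbind() relies on.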
```R
# Scrape passing data for all players in top 5 leagues: Premier League, La Liga, Serie A, Bundesliga, Ligue 1
df <- fbref_big5_player_stats(season = "2023/2024", type = "passing")

# Find players similar to Martin Ødegaard in Europe's Top 5 Leagues using cosine similarity
m_odegaard_sim_big5_cos <- find_similar_players(df = df,
                                                player = "Martin Ødegaard",
                                                metrics = c("key_passes", "passes_into_final_third"),
                                                formula = "cosine",
                                                top_n = 15)
m_odegaard_sim_big5_cos
              player similarity
2849 Martin Ødegaard  1.0000000
108  Felipe Anderson  0.9999997
2429  Bernardo Silva  0.9999984
2583   Jan Thielmann  0.9999984
2660 Kacper Urbanski  0.9999984
1762 Takumi Minamino  0.9999960
735       Ritsu Doan  0.9999940
2068   Adrià Pedrosa  0.9999928
2336  Alexis Sánchez  0.9999928
81    Miguel Almirón  0.9999911
547     Jordan Clark  0.9999911
1409    Grejohn Kyei  0.9999911
1036  Vincenzo Grifo  0.9999904
191       Ridle Baku  0.9999873
2093     Ayoze Pérez  0.9999873

# Find players similar to Martin Ødegaard in Europe's Top 5 Leagues using euclidean distance
m_odegaard_sim_big5_eucl <- find_similar_players(df = df,
                                                 player = "Martin Ødegaard",
                                                 metrics = c("key_passes", "passes_into_final_third"),
                                                 formula = "euclidean",
                                                 top_n = 15)
m_odegaard_sim_big5_eucl
                  player distance
2849     Martin Ødegaard  0.00000
2366       Téji Savanier 28.23119
1071      İlkay Gündoğan 28.44293
862      Bruno Fernandes 29.54657
1195                Isco 36.61967
2027         Cole Palmer 38.41875
63          Luis Alberto 39.84972
361  Benjamin Bourigeaud 40.80441
1339      Joshua Kimmich 41.03657
2512        Kevin Stöger 43.01163
2743   James Ward-Prowse 44.41846
1553      James Maddison 46.09772
1063     Bruno Guimarães 46.64762
1784         Luka Modrić 47.04253
2211   Tijjani Reijnders 52.61179
```
As you can see, cosine similarity and euclidean distance measure similarity in two different ways. An article will be added to this repo's description to discuss this further, and if you have any suggestions on how to adjust it, do reach out to me.
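To get a feel for why the two lists differ, here is a small worked sketch using the textbook definitions of both measures. It assumes kickR's `euclidean` and `cosine` options follow these standard formulas, which this README does not confirm, and the numbers are made up for illustration.

```R
# Two metrics per player: key_passes, passes_into_final_third (made-up values)
target   <- c(100, 200)
low_vol  <- c(50, 100)   # same ratio between the metrics, half the volume
diff_mix <- c(110, 180)  # similar volume, different ratio

# Textbook definitions (assumed, not taken from kickR's source)
euclidean_distance <- function(a, b) sqrt(sum((a - b)^2))
cosine_similarity  <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

euclidean_distance(target, low_vol)   # about 111.8: far apart in raw volume
euclidean_distance(target, diff_mix)  # about 22.4: close in raw volume

cosine_similarity(target, low_vol)    # 1: identical ratio, volume is ignored
cosine_similarity(target, diff_mix)   # roughly 0.996: slightly different ratio
```

Under these definitions, euclidean distance rewards players with similar totals, while cosine similarity rewards players with a similar balance between the metrics regardless of volume. With only two metrics, many players share almost the same balance, which would explain cosine scores bunching just below 1.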
One tip: in our example, Martin Ødegaard is a midfielder, so it makes sense to filter the scraped data to include only midfielders, or also defender/midfielders and forward/midfielders. This improves the formula's ability to find similar players because the context is clearer. If you wanted to filter your dataframe for only midfielders before calling the find_similar_players() function, you could use this in base R:
```R
# Filter dataframe for only midfielders
df <- df[df$position == "MF", ]

# Call similarity function again
m_odegaard_sim_big5_eucl <- find_similar_players(df = df,
                                                 player = "Martin Ødegaard",
                                                 metrics = c("key_passes", "passes_into_final_third"),
                                                 formula = "euclidean",
                                                 top_n = 15)
m_odegaard_sim_big5_eucl
               player distance
568   Martin Ødegaard  0.00000
480     Téji Savanier 28.23119
207    İlkay Gündoğan 28.44293
231              Isco 36.61967
13       Luis Alberto 39.84972
515      Kevin Stöger 43.01163
553 James Ward-Prowse 44.41846
311    James Maddison 46.09772
205   Bruno Guimarães 46.64762
362       Luka Modrić 47.04253
437 Tijjani Reijnders 52.61179
265  Teun Koopmeiners 54.03702
304      Douglas Luiz 54.12947
184       Angel Gomes 54.14795
32   Maximilian Arnold 54.74486
```
These are the available positions for players on FBREF:
```R
unique(df$position)
 [1] "DF"    "MF,FW" "MF"    "FW"    "FW,MF" "DF,FW" "GK"    "DF,MF" "MF,DF" "FW,DF"
```
So if you want to extend your midfielder search, you would filter for players who are primarily registered as midfielders: MF, MF,FW and MF,DF. To filter for multiple values in base R, you can use this snippet:
```R
df <- df[df$position %in% c("MF", "MF,FW", "MF,DF"), ]
```
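An alternative to listing every combination is to match on the first position in the string. The sketch below assumes that the first listed position is the player's primary one, which is how the position values above read but is not something this README states outright; the `toy` data frame is a stand-in for a scraped table.

```R
# Toy table using the position values listed above
positions <- c("DF", "MF,FW", "MF", "FW", "FW,MF", "DF,FW", "GK", "DF,MF", "MF,DF", "FW,DF")
toy <- data.frame(player = paste0("player_", seq_along(positions)),
                  position = positions,
                  stringsAsFactors = FALSE)

# Keep rows whose position string starts with "MF" (assumed to mean primarily midfield)
primary_mf <- toy[startsWith(toy$position, "MF"), ]
primary_mf$position  # "MF,FW" "MF" "MF,DF"
```

This selects the same rows as the `%in%` filter, and would also pick up any other midfielder-first combination should one appear in the data.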
This package was built on rvest, jsonlite and openxlsx. Since the first release is purely a scraping package, you will have to load dplyr into your R environment for helpful data manipulation functions, for example to rename columns or to change column data types from character to numeric.
It is also worth noting that the scraping package cleans the column names into more descriptive names for easier analysis. You can always rename the columns in your analysis workflow to what suits you best.
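Since the scraped tables come back with every column as character, a conversion step is usually the first thing to do. The sketch below uses base R rather than dplyr so that it runs without extra packages; the toy table and the `to_numeric()` helper are illustrative and not part of kickR.

```R
# Toy table mimicking kickR output, where every column is character
stats <- data.frame(
  club = c("Augsburg", "Leverkusen"),
  total_minutes_played = c("3,060", "3,060"),
  goals_against = c("60", "24"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: drop thousands separators, then convert to numeric
to_numeric <- function(x) as.numeric(gsub(",", "", x, fixed = TRUE))

# Convert every column except the club name
numeric_cols <- setdiff(names(stats), "club")
stats[numeric_cols] <- lapply(stats[numeric_cols], to_numeric)
str(stats)
```

Stripping the comma first matters for values such as "3,060", which `as.numeric()` would otherwise turn into `NA` with a warning.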
The save_table function saves a given data frame in one of several formats: JSON, CSV, XLSX or RDS, so you can choose the format that suits you.
It writes the data frame to your working directory, which is useful if you want to store it locally or work with it later. For example, to save our La Liga passing table from earlier and make a viz that shows why Real Madrid is the most potent passing team in La Liga, we can use the following code.
```R
# Initialize variables
df <- la_liga_passing
filename <- 'la_liga_passing_latest'
format <- 'csv'

save_table(df, filename, format)

# Expected output:
# Table saved as 'la_liga_passing_latest.csv'
```
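Once a table has been exported, it can be read back in a later session or in another tool. The following is a base R sketch of a CSV round trip on a toy table; it uses `write.csv()` and `read.csv()` directly and is not kickR's implementation of save_table.

```R
# Toy stand-in for a scraped table (all columns character, as in kickR output)
passing <- data.frame(
  club = c("Real Madrid", "Girona"),
  pass_completion_percentage = c("88.3", "85.8"),
  stringsAsFactors = FALSE
)

# Write to a temporary CSV file, then read it back
csv_path <- tempfile(fileext = ".csv")
write.csv(passing, csv_path, row.names = FALSE)

# colClasses = "character" keeps every column as text, matching the original table
passing_back <- read.csv(csv_path, colClasses = "character", stringsAsFactors = FALSE)
all.equal(passing, passing_back)
```

Without `colClasses`, `read.csv()` would guess the column types, which may or may not be what you want depending on whether you have already converted the columns to numeric.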
kickR relies on the following R packages:
- jsonlite
- rvest
- RSelenium
- openxlsx
- Author: Jeffrey Ohene
- Maintainer: jeffreyohene
This package is released under the MIT License. See the LICENSE file for details.
If you would like to contribute to this project, please check the contributions file for this package.
I regularly monitor the performance and functionality of the package's functions and release updates as needed to ensure reliability. From time to time, small updates will be released to fix bugs or comply with FBREF's scraping policy. If you encounter any issues or have suggestions for improvements, please don't hesitate to open an issue on the repo and provide as much detail as possible to help me understand and address it.
Project icon from icon8.com