How to Extract Script and CSS Files from Web Pages in Python

Building a tool to extract all JavaScript and CSS files from a web page in Python using requests and BeautifulSoup.
· 4 min read · Updated May 2022 · Web Scraping


Say you're tasked with analyzing a website's performance, and you need to know which files must be downloaded for the web page to load properly. In this tutorial, I will help you accomplish that by building a Python tool that extracts all script and CSS file links from a given web page.

We will be using requests and BeautifulSoup as the HTML parser. If you don't have them installed on your Python, please do:

pip3 install requests bs4

Let's start off by initializing the HTTP session and setting the User-Agent to that of a regular browser rather than a Python bot:

import requests
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin

# URL of the web page you want to extract
url = "http://books.toscrape.com"
# initialize a session
session = requests.Session()
# set the User-Agent as a regular browser
session.headers["User-Agent"] = "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"

Now, to download the HTML content of that web page, all we need to do is call the session.get() method, which returns a response object. We are interested only in the HTML code, not the entire response:

# get the HTML content
html = session.get(url).content
# parse HTML using beautiful soup
soup = bs(html, "html.parser")
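A small precaution not in the original code: session.get() will happily return an error page on a 4xx/5xx status. If you want to fail early instead, you could check the status before parsing, along these lines:

# optional: fail early on HTTP errors (4xx/5xx) before parsing
response = session.get(url)
response.raise_for_status()  # raises requests.HTTPError on a bad status code
soup = bs(response.content, "html.parser")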

Now that we have our soup, let's extract all script and CSS files. We use the soup.find_all() method, which returns all the HTML soup objects that match the tag name and attributes passed:

# get the JavaScript files
script_files = []
for script in soup.find_all("script"):
    if script.attrs.get("src"):
        # if the tag has the attribute 'src'
        script_url = urljoin(url, script.attrs.get("src"))
        script_files.append(script_url)

So, basically, we are searching for script tags that have the src attribute; these usually link to the JavaScript files required for the website.
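Note that soup.find_all("script") also matches inline script tags that have no src attribute; the if check above is what skips them. Here is a small illustrative sketch (not part of the tool itself) that counts both kinds, using the soup object from above:

# count external vs. inline <script> tags (illustrative only)
external_scripts = [s for s in soup.find_all("script") if s.attrs.get("src")]
inline_scripts = [s for s in soup.find_all("script") if not s.attrs.get("src")]
print(len(external_scripts), "external scripts,", len(inline_scripts), "inline scripts")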

Similarly, we can use it to extract CSS files:

# get the CSS files
css_files = []
for css in soup.find_all("link"):
    if css.attrs.get("href"):
        # if the link tag has the 'href' attribute
        css_url = urljoin(url, css.attrs.get("href"))
        css_files.append(css_url)

As you may know, CSS files are referenced in the href attribute of link tags. We are using the urljoin() function to make sure each link is absolute (i.e., a full URL, not a relative path such as /js/script.js).
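To see what urljoin() does, here is a quick self-contained illustration (the relative path and the CDN URL are made-up examples):

from urllib.parse import urljoin

# a relative path is resolved against the base URL
print(urljoin("http://books.toscrape.com", "/js/script.js"))
# -> http://books.toscrape.com/js/script.js

# an already-absolute URL is returned unchanged
print(urljoin("http://books.toscrape.com", "http://cdn.example.com/app.js"))
# -> http://cdn.example.com/app.js

Also note that link tags reference more than just stylesheets; the output below includes a favicon, for instance. If you want stylesheets only, you could keep just the tags whose rel attribute contains "stylesheet" (e.g., if "stylesheet" in (css.get("rel") or [])), since BeautifulSoup parses rel as a list of values.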

Finally, let's print the total number of script and CSS files and write the links into separate files:

print("Total script files in the page:", len(script_files))print("Total CSS files in the page:", len(css_files))# write file links into fileswith open("javascript_files.txt", "w") as f:    for js_file in script_files:        print(js_file, file=f)with open("css_files.txt", "w") as f:    for css_file in css_files:        print(css_file, file=f)

Once you execute the script, two files will appear: one for JavaScript links and the other for CSS links:

css_files.txt

http://books.toscrape.com/static/oscar/favicon.ico
http://books.toscrape.com/static/oscar/css/styles.css
http://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.css
http://books.toscrape.com/static/oscar/css/datetimepicker.css

javascript_files.txt

http://ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js
http://books.toscrape.com/static/oscar/js/bootstrap3/bootstrap.min.js
http://books.toscrape.com/static/oscar/js/oscar/ui.js
http://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/bootstrap-datetimepicker.js
http://books.toscrape.com/static/oscar/js/bootstrap-datetimepicker/locales/bootstrap-datetimepicker.all.js

Alright, to wrap up, I encourage you to extend this code into a more sophisticated audit tool that identifies the different files, reports their sizes, and maybe even makes suggestions to optimize the website! A starting point for the size report is sketched below.
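As a minimal sketch of that idea, reusing the session and the two lists from above, you could issue a HEAD request for each file and read the Content-Length header (an assumption: not all servers report it):

# report the size of each extracted file via a HEAD request
for file_url in script_files + css_files:
    response = session.head(file_url, allow_redirects=True)
    # Content-Length is optional; some servers omit it
    size = response.headers.get("Content-Length", "unknown")
    print(file_url, "->", size, "bytes")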

As a challenge, try to download all these files and store them on your local disk (this tutorial can help); a bare-bones sketch follows.
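A bare-bones version of that download step might look like this (the downloads folder name and the naming scheme are my own choices here, and name collisions between files are not handled):

import os
from urllib.parse import urlparse

os.makedirs("downloads", exist_ok=True)
for file_url in script_files + css_files:
    # derive a local filename from the last path segment of the URL
    filename = os.path.basename(urlparse(file_url).path) or "index"
    response = session.get(file_url)
    with open(os.path.join("downloads", filename), "wb") as f:
        f.write(response.content)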

I have another tutorial that shows you how to extract all website links; check it out here.

Furthermore, if the website you're analyzing ends up banning your IP address, you will need to use a proxy server.

Related: How to Automate Login using Selenium in Python.

Happy Scraping ♥
