
Commit ec0f854

committed: added download images from javascript-driven websites script
1 parent 2aabf0f · commit ec0f854

File tree

3 files changed: +101 −0 lines changed

‎web-scraping/download-images/README.md‎

Lines changed: 1 addition & 0 deletions
@@ -24,3 +24,4 @@ To run this:
24  24      python download_images https://www.thepythoncode.com/topic/web-scraping
25  25      ```
26  26  A new folder `www.thepythoncode.com` will be created automatically that contains all the images of that web page.
    27  + - If you want to download images from JavaScript-driven websites, consider using the `download_images_js.py` script instead (it accepts the same parameters).
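Many JavaScript-driven pages lazy-load images, leaving the real URL in a `data-src` attribute rather than `src`; the new script handles this with a `src`-then-`data-src` fallback. A minimal sketch of that extraction on a static HTML snippet (the snippet itself is an assumption; no network or Chromium is needed here):

```python
from bs4 import BeautifulSoup as bs

# hypothetical markup mixing eager, lazy-loaded, and sourceless images
html = """
<img src="https://example.com/a.png">
<img data-src="https://example.com/lazy.png">
<img alt="no source at all">
"""

soup = bs(html, "html.parser")
urls = []
for img in soup.find_all("img"):
    # prefer src, fall back to data-src, skip images with neither
    img_url = img.attrs.get("src") or img.attrs.get("data-src")
    if img_url:
        urls.append(img_url)

print(urls)
```

Note this fallback alone does not execute JavaScript; the script still calls `response.html.render()` so that images inserted by scripts appear in the parsed HTML at all.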
‎web-scraping/download-images/download_images_js.py‎

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@

```python
from requests_html import HTMLSession
import requests
from tqdm import tqdm
from bs4 import BeautifulSoup as bs
from urllib.parse import urljoin, urlparse

import os


def is_valid(url):
    """
    Checks whether `url` is a valid URL.
    """
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)


def get_all_images(url):
    """
    Returns all image URLs on a single `url`
    """
    # initialize the session
    session = HTMLSession()
    # make the HTTP request and retrieve the response
    response = session.get(url)
    # execute JavaScript
    response.html.render()
    # construct the soup parser
    soup = bs(response.html.html, "html.parser")
    urls = []
    for img in tqdm(soup.find_all("img"), "Extracting images"):
        img_url = img.attrs.get("src") or img.attrs.get("data-src")
        if not img_url:
            # if img contains neither a src nor a data-src attribute, just skip it
            continue
        # make the URL absolute by joining the domain with the URL that was just extracted
        img_url = urljoin(url, img_url)
        # remove URLs like '/hsts-pixel.gif?c=3.2.5'
        try:
            pos = img_url.index("?")
            img_url = img_url[:pos]
        except ValueError:
            pass
        # finally, if the URL is valid, keep it
        if is_valid(img_url):
            urls.append(img_url)
    return urls


def download(url, pathname):
    """
    Downloads a file given a URL and puts it in the folder `pathname`
    """
    # if the path doesn't exist, create that directory
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    # download the body of the response in chunks, not immediately
    response = requests.get(url, stream=True)

    # get the total file size
    file_size = int(response.headers.get("Content-Length", 0))

    # get the file name
    filename = os.path.join(pathname, url.split("/")[-1])

    # progress bar, changing the unit to bytes instead of iterations (the tqdm default)
    progress = tqdm(response.iter_content(1024), f"Downloading {filename}", total=file_size, unit="B", unit_scale=True, unit_divisor=1024)
    with open(filename, "wb") as f:
        for data in progress.iterable:
            # write the chunk read to the file
            f.write(data)
            # update the progress bar manually
            progress.update(len(data))


def main(url, path):
    # get all images
    imgs = get_all_images(url)
    for img in imgs:
        # download each image
        download(img, path)


if __name__ == "__main__":
    import argparse
    parser = argparse.ArgumentParser(description="This script downloads all images from a web page")
    parser.add_argument("url", help="The URL of the web page you want to download images from")
    parser.add_argument("-p", "--path", help="The directory you want to store your images in; defaults to the domain of the URL passed")

    args = parser.parse_args()
    url = args.url
    path = args.path

    if not path:
        # if a path isn't specified, use the domain name of the URL as the folder name
        path = urlparse(url).netloc

    main(url, path)
```
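The URL clean-up steps inside `get_all_images` (absolutize against the page URL, strip the query string, validate scheme and netloc) can be exercised in isolation. A minimal sketch, where the helper name `normalize` and the sample URLs are assumptions for illustration:

```python
from urllib.parse import urljoin, urlparse

def is_valid(url):
    """Checks whether `url` has both a scheme and a network location."""
    parsed = urlparse(url)
    return bool(parsed.netloc) and bool(parsed.scheme)

def normalize(page_url, img_src):
    """Make an extracted src absolute and drop any query string (hypothetical helper)."""
    img_url = urljoin(page_url, img_src)
    # remove query strings like '/hsts-pixel.gif?c=3.2.5'
    img_url = img_url.split("?")[0]
    return img_url if is_valid(img_url) else None

page = "https://www.thepythoncode.com/topic/web-scraping"
print(normalize(page, "/images/logo.png"))           # relative src becomes absolute
print(normalize(page, "/hsts-pixel.gif?c=3.2.5"))    # query string stripped
print(normalize(page, "data:image/png;base64,AAA"))  # rejected: no netloc
```

Because `is_valid` requires a netloc, inline `data:` URIs and other non-fetchable sources are filtered out before `download` is ever called on them.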
‎web-scraping/download-images/requirements.txt‎

Lines changed: 1 addition & 0 deletions

@@ -1,3 +1,4 @@
1   1   requests
    2   + requests_html
2   3   bs4
3   4   tqdm

0 commit comments
