I wrote a Python script for web scraping a website. Please review my code and suggest any changes, or make me aware of my blunders/mistakes. I wrote almost the same script for other websites too, so can you please suggest a way to combine all of those scripts into one, so that I get one single merged output file?
```python
import requests
import json
import pandas as pd

def geturl():
    urls = [
        # List of URLs
    ]
    main = []
    id = 0
    for url in urls:
        r = requests.get(url)
        data = json.loads(r.content)
        print(r.status_code)
        items = data['items']
        baseurl = # URL
        for item in items:
            data = {}
            data['id'] = id
            data['Title'] = item['name']
            data["Price"] = item['price']
            data['Detai Page'] = baseurl + item['slug']
            data['Image'] = item['thumb_image']
            id += 1
            main.append(data)
    sr = pd.Series(main)
    sr.to_json('data.json', orient='records')

geturl()
```

- Are you able to share any real URLs, or at least markup from the real pages? — Reinderien, Mar 21, 2022
- This has nothing to do with functional programming, web scraping or BeautifulSoup. — Reinderien, Mar 21, 2022
2 Answers
When using Requests to fetch JSON content, it's usually more convenient to use the `response.json()` method instead of manually passing `response.content` to `json.loads`.

As far as I can tell, you're using Pandas only to turn a list of dictionaries into a JSON string and write it to a file; the structure doesn't change. That seems to me like a strange reason to use Pandas. You could get the same result in a simpler way by doing something like:

```python
with open("data.json", "w") as output_file:
    json.dump(main, output_file)
```

That said, I somewhat dislike not having the ability to just fetch the data without also writing it to a file. Personally I'd have a separate function to fetch the data, which could in turn be called by the larger `geturl` function.

Hard-coding URLs and output paths makes your function less reusable than it could be. Consider having it take such things as parameters instead.
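As a quick sketch of that last point (the function and argument names here are illustrative, not from the original script), taking the output path as a parameter lets the caller reuse the same function for every site:

```python
import json

def save_items(items, output_path):
    """Write already-fetched items to a caller-chosen JSON file."""
    with open(output_path, "w") as output_file:
        json.dump(items, output_file)

# The caller, not the function, decides where the data goes:
save_items([{"id": 0, "Title": "Widget"}], "widgets.json")
```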
Right now, the last line of your script calls `geturl` not just whenever the script itself is run, but also whenever this file is imported as part of a larger program. That's awkward, since I'm sure this function might be useful elsewhere too. You can avoid having the code run when imported by putting it in an `if __name__ == '__main__'` block.

Creating `data` as an empty `dict` and adding elements one by one is a fine approach. However, for `dict`s like this one, which are small and have a consistent structure and simple contents, I often find it neater to make the whole thing at once with a single `dict` literal. But that's a matter of taste, and your current approach works just fine.

Re-using the name of a built-in function (such as `id`) for another variable will work, but is generally not considered great practice; I'd suggest renaming the `id` variable for that reason.
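To see why shadowing `id` matters (a contrived sketch, not code from the script under review): once the name is rebound in a scope, the built-in is unreachable there:

```python
def count_items(items):
    id = 0  # shadows the built-in id() within this function
    for _ in items:
        id += 1
    try:
        id(items)  # now calls the integer, not the built-in
    except TypeError as exc:
        print(exc)  # 'int' object is not callable
    return id

print(count_items(['a', 'b']))  # 2
```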
Putting that all together, you might end up with something like:
```python
import requests
import json

def get_item_data(urls, detail_base_url):
    main = []
    item_id = 0
    for url in urls:
        r = requests.get(url)
        data = r.json()
        for item in data['items']:
            item_data = {
                'id': item_id,
                'Title': item['name'],
                'Price': item['price'],
                'Detail Page': detail_base_url + item['slug'],
                'Image': item['thumb_image'],
            }
            item_id += 1
            main.append(item_data)
    return main

def save_item_data(urls, detail_base_url, output_filename):
    item_data = get_item_data(urls, detail_base_url)
    with open(output_filename, 'w') as output_file:
        json.dump(item_data, output_file)

if __name__ == '__main__':
    # TODO: We might want to get these from the command line.
    # The "argparse" module would be useful for that.
    urls = [ ... ]
    detail_base_url = ...
    output_file_name = 'data.json'
    save_item_data(urls, detail_base_url, output_file_name)
```

When you work with requests and JSON transformations, you should always wrap that code in a try/except block.
```python
r = requests.get(url)
data = json.loads(r.content)
print(r.status_code)
```

should be rewritten to:
```python
try:
    response = requests.get(url)
    if response.ok:
        data = response.json()
except (requests.exceptions.RequestException, json.decoder.JSONDecodeError) as exc:
    print(exc)
```

Those are the minimal changes you need to make when you work with APIs using the requests library and JSON transformations.
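As the comment below points out, catching everything and merely printing it is questionable. One alternative (a stdlib-only sketch with a hypothetical helper name, not code from the answer) is to isolate the decode step and return a sentinel, letting the caller decide what a failure means:

```python
import json

def parse_payload(raw_bytes):
    """Return the decoded JSON payload, or None if it isn't valid JSON."""
    try:
        return json.loads(raw_bytes)
    except json.JSONDecodeError:
        return None

print(parse_payload(b'{"items": []}'))  # {'items': []}
print(parse_payload(b'not json'))       # None
```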
Hope it helps.
- This is... not right. Universally catching and printing exceptions is the opposite of good practice. — Reinderien, Mar 22, 2022