
I wrote a Python script for web scraping a website. Please review my code, suggest any changes, and point out my blunders/mistakes. I wrote almost the same script for other websites too, so can you suggest a way to combine all the scripts into one, so that I get a single merged file?

import requests
import json
import pandas as pd

def geturl():
    urls = [
        # List of URLs
    ]
    main = []
    id = 0
    for url in urls:
        r = requests.get(url)
        data = json.loads(r.content)
        print(r.status_code)
        items = data['items']
        baseurl = # URL
        for item in items:
            data = {}
            data['id'] = id
            data['Title'] = item['name']
            data["Price"] = item['price']
            data['Detai Page'] = baseurl + item['slug']
            data['Image'] = item['thumb_image']
            id += 1
            main.append(data)
    sr = pd.Series(main)
    sr.to_json('data.json', orient='records')

geturl()
asked Mar 21, 2022 at 3:24 by Muhammad Fahim
Comments:

  • Are you able to share any real URLs, or at least markup from the real pages? — Commented Mar 21, 2022 at 18:15
  • This has nothing to do with functional programming, web scraping or BeautifulSoup. — Commented Mar 21, 2022 at 21:58

2 Answers

  • When using Requests to fetch JSON content, it's usually more convenient to use the response.json() method instead of manually passing response.content to json.loads.
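A quick sketch of the difference. To keep it runnable without a network, this builds a Response object by hand (setting the private _content attribute purely for illustration; in real code it would come from requests.get):

```python
import json
import requests

# Hand-built Response so the example needs no network access.
response = requests.models.Response()
response._content = b'{"items": [{"name": "Widget", "price": 9.99}]}'
response.status_code = 200

# Manual decoding, as in the original script:
data_manual = json.loads(response.content)

# The built-in shortcut does the same work:
data = response.json()

assert data == data_manual
print(data['items'][0]['name'])  # Widget
```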

  • As far as I can tell, you're using Pandas just to turn a list of dictionaries into a JSON string and write it to a file - the structure doesn't change along the way. That seems to me like a strange reason to use Pandas. You could get the same result in a simpler way by doing something like

    with open("data.json", "w") as output_file:
        json.dump(main, output_file)
  • That said, I somewhat dislike not having the ability to just fetch the data without also writing it to a file. Personally I'd have a separate function to fetch the data, which could in turn be called by the larger geturl function.

  • Hard-coding URLs and output paths makes your function less reusable than it could be. Consider having it take such things as parameters instead.

  • Right now, the last line of your script calls geturl not just whenever the script itself is run, but also whenever this file is imported as part of a larger program. That's awkward, since I'm sure this function might be useful elsewhere too. You can avoid having the code run when imported by putting it in an if __name__ == '__main__' block.
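A minimal sketch of that pattern, with a stub standing in for the real geturl:

```python
def geturl():
    # Stub standing in for the real scraping logic.
    print("fetching...")

if __name__ == '__main__':
    # Runs when this file is executed directly,
    # but not when it is imported from another module.
    geturl()
```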

  • Creating data as an empty dict and adding elements one by one is a fine approach. However, for dicts like this one, which are small, and have consistent structures and simple contents, I often find it neater to just make it all at once with a single dict literal. But that's a matter of taste, and your current approach works just fine.
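For example, using the field names from the question (the sample item and base URL here are made up):

```python
item = {'name': 'Widget', 'price': 9.99, 'slug': 'widget',
        'thumb_image': 'widget.jpg'}     # made-up sample item
baseurl = 'https://example.com/'         # made-up base URL
item_id = 0

# Incremental construction, as in the original script:
data = {}
data['id'] = item_id
data['Title'] = item['name']
data['Price'] = item['price']
data['Detai Page'] = baseurl + item['slug']
data['Image'] = item['thumb_image']

# The same record as a single dict literal:
data_literal = {
    'id': item_id,
    'Title': item['name'],
    'Price': item['price'],
    'Detai Page': baseurl + item['slug'],
    'Image': item['thumb_image'],
}

assert data == data_literal
```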

  • Re-using the name of a built-in function (such as id) for another variable will work, but is generally not considered great practice - I'd suggest renaming the id variable for that reason.
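A short demonstration of why that shadowing bites: once id is rebound, the built-in function is unreachable in that scope:

```python
id = 0                    # shadows the built-in id()
try:
    id("hello")           # now calls the integer 0, which fails
except TypeError as exc:
    print(exc)            # 'int' object is not callable
del id                    # removes the shadow; the built-in is back
print(callable(id))       # True
```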

Putting that all together, you might end up with something like:

import requests
import json

def get_item_data(urls, detail_base_url):
    main = []
    item_id = 0
    for url in urls:
        r = requests.get(url)
        data = r.json()
        for item in data['items']:
            item_data = {
                'id': item_id,
                'Title': item['name'],
                'Price': item['price'],
                'Detai Page': detail_base_url + item['slug'],
                'Image': item['thumb_image'],
            }
            item_id += 1
            main.append(item_data)
    return main

def save_item_data(urls, detail_base_url, output_filename):
    item_data = get_item_data(urls, detail_base_url)
    with open(output_filename, 'w') as output_file:
        json.dump(item_data, output_file)

if __name__ == '__main__':
    # TODO: We might want to get these from the command line.
    # The "argparse" module would be useful for that.
    urls = [ ... ]
    detail_base_url = ...
    output_file_name = 'data.json'
    save_item_data(urls, detail_base_url, output_file_name)
answered May 17, 2022 at 2:13 by Sara J

When you work with requests and JSON transformations, you should always wrap that code in a try/except block.

r = requests.get(url)
data = json.loads(r.content)
print(r.status_code)

This should be rewritten as:

try:
    response = requests.get(url)
    if response.ok:
        data = response.json()
except (requests.exceptions.RequestException, json.decoder.JSONDecodeError) as exc:
    print(exc)

Those are the minimal changes you need when working with APIs, the requests library and JSON transformations.

Hope it helps

answered Mar 22, 2022 at 10:02 by camp0
Comments:

  • This is... not right. Universally catching and printing exceptions is the opposite of good practice. — Commented Mar 22, 2022 at 13:05
