I wrote a Python script for web scraping a website. Please review my code and suggest any changes, or make me aware of my blunders/mistakes. I wrote almost the same script for other websites too, so can you please suggest a way to combine all of those scripts into one, so that I get one single merged output file?
```python
import requests
import json
import pandas as pd

def geturl():
    urls = [
        # List of URLs
    ]
    main = []
    id = 0
    for url in urls:
        r = requests.get(url)
        data = json.loads(r.content)
        print(r.status_code)
        items = data['items']
        baseurl = # URL
        for item in items:
            data = {}
            data['id'] = id
            data['Title'] = item['name']
            data["Price"] = item['price']
            data['Detai Page'] = baseurl + item['slug']
            data['Image'] = item['thumb_image']
            id += 1
            main.append(data)
    sr = pd.Series(main)
    sr.to_json('data.json', orient='records')

geturl()
```

- Are you able to share any real URLs, or at least markup from the real pages? — Reinderien, Mar 21, 2022
- This has nothing to do with functional programming, web scraping or BeautifulSoup. — Reinderien, Mar 21, 2022
2 Answers
When using Requests to fetch JSON content, it's usually more convenient to use the `response.json()` method instead of manually passing `response.content` to `json.loads`.

As far as I can tell, you're using Pandas only to turn a list of dictionaries into a JSON string and write it to a file; the structure doesn't change. That seems to me like a strange reason to use Pandas. You could get the same result in a simpler way by doing something like:

```python
with open("data.json", "w") as output_file:
    json.dump(main, output_file)
```

That said, I somewhat dislike not having the ability to just fetch the data without also writing it to a file. Personally I'd have a separate function to fetch the data, which could in turn be called by the larger `geturl` function.

Hard-coding URLs and output paths makes your function less reusable than it could be. Consider having it take such things as parameters instead.
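As a quick sketch of that last point (the function and argument names here are illustrative, not from the original script), taking the output path as a parameter lets the caller reuse the same function for every site:

```python
import json

def save_items(items, output_path):
    """Write already-fetched items to a caller-chosen JSON file."""
    with open(output_path, "w") as output_file:
        json.dump(items, output_file)

# The caller, not the function, decides where the data goes:
save_items([{"id": 0, "Title": "Widget"}], "widgets.json")
```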
Right now, the last line of your script calls `geturl` not just whenever the script itself is run, but also whenever this file is imported as part of a larger program. That's awkward, since I'm sure this function might be useful elsewhere too. You can avoid having the code run when imported by putting it in an `if __name__ == '__main__'` block.

Creating `data` as an empty `dict` and adding elements one by one is a fine approach. However, for `dict`s like this one, which are small and have a consistent structure and simple contents, I often find it neater to make the whole thing at once with a single `dict` literal. But that's a matter of taste, and your current approach works just fine.

Re-using the name of a built-in function (such as `id`) for another variable will work, but is generally not considered great practice; I'd suggest renaming the `id` variable for that reason.
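To see why shadowing `id` matters (a contrived sketch, not code from the script under review): once the name is rebound in a scope, the built-in is unreachable there:

```python
def count_items(items):
    id = 0  # shadows the built-in id() within this function
    for _ in items:
        id += 1
    try:
        id(items)  # now calls the integer, not the built-in
    except TypeError as exc:
        print(exc)  # 'int' object is not callable
    return id

print(count_items(['a', 'b']))  # 2
```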
Putting that all together, you might end up with something like:
```python
import requests
import json

def get_item_data(urls, detail_base_url):
    main = []
    item_id = 0
    for url in urls:
        r = requests.get(url)
        data = r.json()
        for item in data['items']:
            item_data = {
                'id': item_id,
                'Title': item['name'],
                'Price': item['price'],
                'Detail Page': detail_base_url + item['slug'],
                'Image': item['thumb_image'],
            }
            item_id += 1
            main.append(item_data)
    return main

def save_item_data(urls, detail_base_url, output_filename):
    item_data = get_item_data(urls, detail_base_url)
    with open(output_filename, 'w') as output_file:
        json.dump(item_data, output_file)

if __name__ == '__main__':
    # TODO: We might want to get these from the command line.
    # The "argparse" module would be useful for that.
    urls = [ ... ]
    detail_base_url = ...
    output_file_name = 'data.json'
    save_item_data(urls, detail_base_url, output_file_name)
```

When you work with requests and JSON transformations, you should always wrap that code in a try/except block.
```python
r = requests.get(url)
data = json.loads(r.content)
print(r.status_code)
```

should be rewritten to:
```python
try:
    response = requests.get(url)
    if response.ok:
        data = response.json()
except (requests.exceptions.RequestException, json.decoder.JSONDecodeError) as exc:
    print(exc)
```

Those are the minimal changes you need to make when you work with APIs using the requests library and JSON transformations.
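As the comment below points out, catching everything and merely printing it is questionable. One alternative (a stdlib-only sketch with a hypothetical helper name, not code from the answer) is to isolate the decode step and return a sentinel, letting the caller decide what a failure means:

```python
import json

def parse_payload(raw_bytes):
    """Return the decoded JSON payload, or None if it isn't valid JSON."""
    try:
        return json.loads(raw_bytes)
    except json.JSONDecodeError:
        return None

print(parse_payload(b'{"items": []}'))  # {'items': []}
print(parse_payload(b'not json'))       # None
```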
Hope it helps.
- This is... not right. Universally catching and printing exceptions is the opposite of good practice. — Reinderien, Mar 22, 2022