Scenario:
I work at a company where payslips can be downloaded from an ASPX portal, but only one at a time, not in bulk.
I needed all of them, both for bureaucratic reasons and to archive them.
How: Python and Selenium. I tried BeautifulSoup first, but it didn't work.
Explanation and Code
Web Driver
I used the Chrome WebDriver with a few options so that PDF files are saved to disk instead of being opened in the browser. See prefs in the code below.
Creating a class
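# Imports used by the snippets in this article
import glob
import os
import shutil
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class PaylipsScaper:
    # Init
    def __init__(self, username, password):
        self.username = username
        self.password = password
        # Options
        chrome_options = webdriver.ChromeOptions()
        prefs = {
            "plugins.always_open_pdf_externally": True,
            "download.default_directory": "C:\\tmp",  # folder save files
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True,
        }
        chrome_options.add_experimental_option("prefs", prefs)
        chrome_options.headless = True  # If True hide browser
        # Selenium 3 API: newer Selenium releases replace executable_path with a Service object
        self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)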
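The download folder is hard-coded in the prefs above. A minimal sketch of how the same prefs could be built for any folder follows; `build_chrome_prefs` and the folder name `payslips_tmp` are hypothetical names for illustration, not part of the original script. The preference keys are the ones used in the class.

```python
import os


def build_chrome_prefs(download_dir):
    """Return the Chrome "prefs" dict from the class above for a given folder."""
    return {
        # Download PDFs instead of opening them in Chrome's viewer
        "plugins.always_open_pdf_externally": True,
        # An absolute path avoids ambiguity about where the files land
        "download.default_directory": os.path.abspath(download_dir),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
    }


prefs = build_chrome_prefs("payslips_tmp")  # hypothetical folder name
# The dict would then be passed exactly as in the class:
# chrome_options.add_experimental_option("prefs", prefs)
```

Keeping the prefs in a small function like this makes it easier to turn the folder into an input parameter later.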
Login
The login phase is quite easy: select each field by id and type in the value. The only thing to watch out for is the iframe: the form lives inside one, so you have to call switch_to.frame first.
    # Manage login page
    def login(self, url):
        driver = self.driver
        driver.get(url)
        driver.switch_to.frame("FunArea")
        username = driver.find_element_by_id("login")
        password = driver.find_element_by_id("pwd")
        username.send_keys(self.username)
        time.sleep(1)
        password.send_keys(self.password)
        driver.find_element_by_id("CmdInvia").click()
Loop table using XPATH
I created a class that wraps the Selenium driver to keep everything tidy.
I simply reproduced the clicks I would do by hand.
At first I tried CSS selectors, but given the structure of the pages, XPath turned out to be the better solution.
(To get an element's XPath in Chrome, right-click it in the DevTools Elements panel and choose Copy > Copy XPath.)
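To see what a nested-table XPath of this kind matches without opening a browser, here is a small sketch using only the standard library. The markup is a hypothetical, simplified stand-in for the portal page, and `cell_xpath` is an illustrative helper. Note that `xml.etree.ElementTree` supports only a subset of XPath, while Selenium uses the browser's full engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup: an outer table with id "ContTab"
# whose single cell wraps the inner table that lists the payslips.
SAMPLE = """
<html><body><form><div>
  <table id="ContTab"><tbody><tr><td><div>
    <table><tbody>
      <tr><td>2020</td><td>01</td></tr>
      <tr><td>2020</td><td>02</td></tr>
      <tr><td>2020</td><td>03</td></tr>
    </tbody></table>
  </div></td></tr></tbody></table>
</div></form></body></html>
"""

root = ET.fromstring(SAMPLE)

# Same shape as the row XPath used in the scraper, written relative to the root
ROWS_XPATH = ".//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"
rows = root.findall(ROWS_XPATH)
print(len(rows))  # 3


def cell_xpath(row, col):
    """Build the XPath of one cell; XPath positions start at 1, not 0."""
    return f"{ROWS_XPATH}[{row}]/td[{col}]"


first_month = root.find(cell_xpath(1, 2)).text
print(first_month)  # 01
```

Because XPath positions are 1-based, a loop over the rows has to run from 1 to the number of rows inclusive.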
By the way, I don't like time.sleep, but it was useful to avoid navigation problems during the process.
    # Inside PaylipsScaper class
    def get_num_rows(self, num_rows=1):
        driver = self.driver
        self.click_to_payslips_area()
        num_rows = len(driver.find_elements_by_xpath("//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"))
        return num_rows

[... other stuff ...]

try:
    bot = PaylipsScaper(username, password)
    bot.login(url_website)
    wait = WebDriverWait(bot.driver, 10)
    num_rows = bot.get_num_rows()
    for row in range(1, num_rows + 1):
        paylip_year = bot.get_val_in_cedolino_row(row, 4)
        paylip_month = bot.get_val_in_cedolino_row(row, 5)
        paylip_type = bot.get_val_in_cedolino_row(row, 7)
        bot.driver.execute_script(
            "arguments[0].click();",
            WebDriverWait(bot.driver, 20).until(
                EC.element_to_be_clickable(
                    (By.XPATH, "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr[" + str(row) + "]/td[10]/img")
                )
            ),
        )
        time.sleep(2)
        filepdf = dirpath + "\\*.pdf"
        list_of_files = glob.glob(filepdf)
        file_name = max(list_of_files, key=os.path.getctime)
        current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name)
        bot.rename_and_move(current_paylip)
        print("Downloaded:")
        print(current_paylip)
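The fixed time.sleep(2) before picking the newest PDF can fail on a slow download. One possible alternative is to poll the folder until a new PDF shows up and Chrome's partial download files (*.crdownload) are gone. This is only a sketch: `wait_for_new_pdf` and its parameters are hypothetical and not part of the original script.

```python
import glob
import os
import time


def wait_for_new_pdf(folder, known_files, timeout=30.0, poll=0.5):
    """Return the path of a PDF in `folder` that is not in `known_files`.

    Polls until such a file exists and no Chrome partial download
    (*.crdownload) remains; raises TimeoutError after `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        partial = glob.glob(os.path.join(folder, "*.crdownload"))
        new_pdfs = set(glob.glob(os.path.join(folder, "*.pdf"))) - set(known_files)
        if new_pdfs and not partial:
            # If several new files appeared, take the most recent one
            return max(new_pdfs, key=os.path.getctime)
        time.sleep(poll)
    raise TimeoutError(f"no new PDF appeared in {folder} within {timeout} s")
```

In the loop, the idea would be to take a snapshot with glob.glob before the click and then call `wait_for_new_pdf(dirpath, snapshot)` in place of the sleep and the max(...) lookup.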
Save pdf file in folder and rename it
It's a rather brute-force solution: I take the most recently saved PDF in the folder and rename it using the information from the website.
Then I move the files into sub-folders by year.
    def rename_and_move(self, current_paylip):
        if current_paylip.paylip_month == "":
            new_file_name = 'Cud_' + str(current_paylip.paylip_year) + '_' + str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "") + '.pdf'
        elif "TREDICESIMA" in current_paylip.paylip_type:
            new_file_name = 'Cedolino_' + str(current_paylip.paylip_year) + '_' + str(current_paylip.paylip_month) + '_Tredicesima.pdf'
        else:
            new_file_name = 'Cedolino_' + str(current_paylip.paylip_year) + '_' + str(current_paylip.paylip_month) + '.pdf'
        print(new_file_name)
        new_file_name = os.path.join(dirpath, new_file_name)
        # Rename file and move it in the year-directory
        os.rename(current_paylip.file_name, new_file_name)
        current_paylip.file_name = new_file_name
        # Check if path with year directory exist otherwise create it
        dirin = os.path.split(new_file_name)
        newdir = dirin[0] + '\\' + current_paylip.paylip_year
        if not os.path.exists(newdir):
            # Create directory
            os.mkdir(newdir)
        # Move file in the year-directory
        if os.path.exists(newdir + "\\" + dirin[1]):
            # If file already exist, delete it
            os.remove(newdir + "\\" + dirin[1])
        shutil.move(current_paylip.file_name, newdir + "\\" + dirin[1])
        return
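The same rename-and-move logic could also be written with pathlib, which avoids the hand-built "\\" separators. The sketch below is an illustration, not the code I ran: `build_file_name` and `move_to_year_folder` are hypothetical names, and it assumes the first replace in the Cud branch is meant to turn spaces into underscores.

```python
import shutil
from pathlib import Path


def build_file_name(year, month, kind):
    """Mirror the naming rules of rename_and_move as a pure function."""
    if month == "":
        # No month: named with the "Cud_" prefix, as in the method above.
        # Assumption: spaces in the type become underscores.
        suffix = str(kind).replace(" ", "_").replace("Completo", "").replace("NORMALE", "")
        return f"Cud_{year}_{suffix}.pdf"
    if "TREDICESIMA" in kind:
        return f"Cedolino_{year}_{month}_Tredicesima.pdf"
    return f"Cedolino_{year}_{month}.pdf"


def move_to_year_folder(pdf_path, base_dir, year, month, kind):
    """Move pdf_path to base_dir/<year>/<standard name>, replacing any old copy."""
    target_dir = Path(base_dir) / str(year)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the year folder if missing
    target = target_dir / build_file_name(year, month, kind)
    if target.exists():
        target.unlink()  # same behaviour as above: an existing file is replaced
    shutil.move(str(pdf_path), str(target))
    return target


print(build_file_name("2020", "12", "TREDICESIMA"))  # Cedolino_2020_12_Tredicesima.pdf
```

Splitting the naming into a pure function makes it possible to check the file names without downloading anything.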
Final situation
Got it. I now have a folder with one subfolder per year, each containing all the payslips with a standard name format.
What I learned:
- Using Selenium in Python.
- Simple automation can save a lot of time and avoid boring manual tasks.
- How to write my first article here (a personal goal, so not so useful for you, but better than nothing after all).
Future improvements:
- input parameters
- (re)try CSS selectors instead of XPath selectors
- (re)try BeautifulSoup
- remember which payslips were already saved, so that the next run downloads only the new ones
- read the PDFs and report the data in a file (e.g. Google Sheets)
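For the "download only the new ones" idea, one simple approach would be to keep a small JSON file listing the (year, month, type) of every payslip already saved. This is a rough sketch of that idea; the function names and the state file are hypothetical and nothing like this exists in the script yet.

```python
import json
from pathlib import Path


def load_seen(state_file):
    """Return the set of (year, month, type) tuples saved by earlier runs."""
    path = Path(state_file)
    if not path.exists():
        return set()  # first run: nothing downloaded yet
    return {tuple(item) for item in json.loads(path.read_text(encoding="utf-8"))}


def save_seen(state_file, seen):
    """Write the set back to disk as a sorted JSON list."""
    Path(state_file).write_text(json.dumps(sorted(seen)), encoding="utf-8")


def rows_to_download(rows, seen):
    """Keep only the table rows that are not in the saved set."""
    return [row for row in rows if tuple(row) not in seen]


rows = [("2020", "01", "NORMALE"), ("2020", "02", "NORMALE")]
seen = {("2020", "01", "NORMALE")}
print(rows_to_download(rows, seen))  # [('2020', '02', 'NORMALE')]
```

The loop would then read year, month and type for each row first, skip the rows already in the set, and call `save_seen` after each successful download.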
Of course the code is useful only for me and my colleagues. Still, I hope the idea and the process can be useful to someone else.