Posted on Feb 4, 2021

Scraping payslips with Python | Selenium

#scraper #python #selenium #automate

Scenario:

I work at a company where my payslips can be downloaded from an ASPX portal, but only one at a time, not in bulk.
I needed all of them, for bureaucratic reasons and to archive them.

How: Python and Selenium. I tried BeautifulSoup first, but it didn't work.

Explanation and Code

Web Driver

I used the Chrome WebDriver with some options so that PDF files are saved to disk instead of being opened in the browser. See the prefs dictionary in the code below.

Creating a class

from selenium import webdriver


class PaylipsScaper:
    # Init
    def __init__(self, username, password):
        self.username = username
        self.password = password
        # Options
        chrome_options = webdriver.ChromeOptions()
        prefs = {
            "plugins.always_open_pdf_externally": True,
            "download.default_directory": "C:\\tmp",  # folder where files are saved
            "download.prompt_for_download": False,
            "download.directory_upgrade": True,
            "safebrowsing.enabled": True,
        }
        chrome_options.add_experimental_option("prefs", prefs)
        chrome_options.headless = True  # If True, hide the browser
        self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)
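The download folder is hard-coded in the prefs above. As a minimal sketch, the same dictionary could be built from a parameter, with the folder created up front and resolved to an absolute path. The helper name `build_download_prefs` is hypothetical; the preference keys are the ones already used in the class.

```python
from pathlib import Path


def build_download_prefs(download_dir):
    """Return Chrome prefs that save PDFs into download_dir (hypothetical helper)."""
    target = Path(download_dir).resolve()
    # Create the folder up front so the browser has somewhere to write.
    target.mkdir(parents=True, exist_ok=True)
    return {
        "plugins.always_open_pdf_externally": True,  # download PDFs instead of previewing them
        "download.default_directory": str(target),
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
    }
```

It would be used as `chrome_options.add_experimental_option("prefs", build_download_prefs("C:\\tmp"))`.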

Login

The login phase is quite easy: select each field by id and insert the value. The only thing to watch out for is the iframe, because in that case you have to call switch_to.frame first.

    # Manage login page
    def login(self, url):
        driver = self.driver
        driver.get(url)
        # The login form lives inside an iframe, so switch to it first
        driver.switch_to.frame("FunArea")
        username = driver.find_element_by_id("login")
        password = driver.find_element_by_id("pwd")
        username.send_keys(self.username)
        time.sleep(1)
        password.send_keys(self.password)
        driver.find_element_by_id("CmdInvia").click()

Loop table using XPATH

I created a class that wraps the Selenium driver to keep everything clean.
I just reproduced the clicks I would do by hand.
At first I tried CSS selectors, but given the structure of the pages, XPath was the better solution
(to get the XPath with Chrome, see here).
By the way, I don't like time.sleep, but it was useful for avoiding navigation problems during the process.
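XPath positions are 1-based, so a loop over table rows has to start at 1 rather than 0. A small sketch of how the per-row XPath could be built in one place. The helper name `download_icon_xpath` is made up for illustration, and the path itself is specific to this portal.

```python
ROWS_XPATH = "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr"


def download_icon_xpath(row):
    """Build the XPath of the download icon in a given table row (hypothetical helper)."""
    if row < 1:
        # XPath positions are 1-based: tr[1] is the first row.
        raise ValueError("row must be >= 1")
    return f"{ROWS_XPATH}[{row}]/td[10]/img"


print(download_icon_xpath(3))
# prints /html/body/form/div/table/tbody/tr/td/div/table/tbody/tr[3]/td[10]/img
```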

    # Inside PaylipsScaper class
    def get_num_rows(self, num_rows=1):
        driver = self.driver
        self.click_to_payslips_area()
        num_rows = len(driver.find_elements_by_xpath("//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"))
        return num_rows

[...other stuff...]

try:
    bot = PaylipsScaper(username, password)
    bot.login(url_website)
    wait = WebDriverWait(bot.driver, 10)
    num_rows = bot.get_num_rows()
    for row in range(1, num_rows + 1):
        # Read year, month and type from the current table row
        paylip_year = bot.get_val_in_cedolino_row(row, 4)
        paylip_month = bot.get_val_in_cedolino_row(row, 5)
        paylip_type = bot.get_val_in_cedolino_row(row, 7)
        # Click the download icon of the row via JavaScript once it is clickable
        bot.driver.execute_script(
            "arguments[0].click();",
            WebDriverWait(bot.driver, 20).until(
                EC.element_to_be_clickable((
                    By.XPATH,
                    "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr[" + str(row) + "]/td[10]/img",
                ))
            ),
        )
        time.sleep(2)
        # Pick the most recently created PDF in the download folder
        filepdf = dirpath + "\\*.pdf"
        list_of_files = glob.glob(filepdf)
        file_name = max(list_of_files, key=os.path.getctime)
        current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name)
        bot.rename_and_move(current_paylip)
        print("Downloaded:")
        print(current_paylip)
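The fixed time.sleep(2) after each click assumes the download always finishes within two seconds. One possible alternative, sketched here under the assumption that Chrome is the browser (it keeps unfinished downloads under a .crdownload suffix), is to poll the folder until a new PDF shows up. The function name `wait_for_new_pdf` and its parameters are hypothetical.

```python
import glob
import os
import time


def wait_for_new_pdf(folder, known_files, timeout=30.0, poll=0.5):
    """Return the path of a PDF that appeared in folder since known_files was taken.

    Hypothetical helper: polls the folder instead of sleeping for a fixed time.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        # Chrome writes in-progress downloads with a .crdownload suffix.
        in_progress = glob.glob(os.path.join(folder, "*.crdownload"))
        new_files = set(glob.glob(os.path.join(folder, "*.pdf"))) - set(known_files)
        if new_files and not in_progress:
            # If several files appeared, take the most recently modified one.
            return max(new_files, key=os.path.getmtime)
        time.sleep(poll)
    raise TimeoutError(f"no new PDF appeared in {folder} within {timeout} seconds")
```

In the loop, the list of existing PDFs would be taken with glob.glob just before the click, and `file_name = wait_for_new_pdf(dirpath, known)` would replace both the sleep and the max(...) lookup.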

Save pdf file in folder and rename it

It's quite a brute-force solution, but it works: I took the most recent PDF saved in the folder and renamed it with the information from the website.
Then I moved the files into sub-folders by year.

    def rename_and_move(self, current_paylip):
        # Build the new file name from the values read on the website
        if current_paylip.paylip_month == "":
            new_file_name = 'Cud_' + str(current_paylip.paylip_year) + '_' + str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "") + '.pdf'
        elif "TREDICESIMA" in current_paylip.paylip_type:
            new_file_name = 'Cedolino_' + str(current_paylip.paylip_year) + '_' + str(current_paylip.paylip_month) + '_Tredicesima.pdf'
        else:
            new_file_name = 'Cedolino_' + str(current_paylip.paylip_year) + '_' + str(current_paylip.paylip_month) + '.pdf'
        print(new_file_name)
        new_file_name = os.path.join(dirpath, new_file_name)
        # Rename the file before moving it into the year directory
        os.rename(current_paylip.file_name, new_file_name)
        current_paylip.file_name = new_file_name
        # Check whether the year directory exists, otherwise create it
        dirin = os.path.split(new_file_name)
        newdir = dirin[0] + '\\' + current_paylip.paylip_year
        if not os.path.exists(newdir):
            # Create directory
            os.mkdir(newdir)
        # Move the file into the year directory
        if os.path.exists(newdir + "\\" + dirin[1]):
            # If the file already exists, delete it
            os.remove(newdir + "\\" + dirin[1])
        shutil.move(current_paylip.file_name, newdir + "\\" + dirin[1])
        return
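The "create the year folder, delete any old copy, then move" part could also be written with pathlib, which avoids the hand-built "\\" separators. This is only a sketch of that step, not the code I ran; the function name `move_to_year_folder` is hypothetical, and it assumes the year folder sits next to the downloaded file, as it does above.

```python
from pathlib import Path


def move_to_year_folder(pdf_path, year):
    """Move pdf_path into a sub-folder named after year, overwriting any old copy."""
    source = Path(pdf_path)
    year_dir = source.parent / str(year)
    # exist_ok=True makes the separate "does it exist?" check unnecessary.
    year_dir.mkdir(exist_ok=True)
    target = year_dir / source.name
    # Path.replace overwrites an existing target, so no explicit delete is needed.
    source.replace(target)
    return target
```

Because the target folder is a sub-folder of the download folder, source and destination are on the same drive, which is what Path.replace needs.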

Final situation

Got it. I now have a folder with one sub-folder per year, each containing all of that year's payslips in a standard name format.

What I learned:

How to use Selenium in Python.
Simple automation can save a lot of time and avoid boring manual tasks.
How to write my first article here (it's a personal task, so not that useful for you, but better than nothing after all).

Future improvements:

accept input parameters
(re)try to use CSS selectors instead of XPath selectors
(re)try to use BeautifulSoup
keep track of the last payslips saved so that the next run downloads only the ones not saved yet
read the PDFs and report the data in a file (e.g. Google Sheets)
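The "download only what is new" item could be approached by remembering which payslips were already saved. A minimal sketch, assuming a small JSON file is good enough as the store; the function names and the key format are hypothetical, and the three values are the ones read from each table row.

```python
import json
from pathlib import Path


def load_downloaded(state_file):
    """Return the set of payslip keys recorded by earlier runs (empty on first run)."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_downloaded(state_file, keys):
    """Write the set of payslip keys back to disk as a sorted JSON list."""
    Path(state_file).write_text(json.dumps(sorted(keys), indent=2), encoding="utf-8")


def payslip_key(year, month, payslip_type):
    """Build a key from the three values read from each table row."""
    return f"{year}-{month}-{payslip_type}"
```

In the row loop, a row whose key is already in the loaded set would be skipped, and after each successful download the key would be added and the set saved again.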

Of course, the code itself is useful only to me and my colleagues. Still, I hope the idea and the process can be useful to someone else.