DEV Community

Alan Stocco


Scraping payslips with Python | Selenium

Scenario:

I work at a company whose payslips can be downloaded from an ASPX portal, but only one at a time, not in bulk.
I needed all of them, both for bureaucratic reasons and to archive them.

How: Python and Selenium. I tried BeautifulSoup first, but it didn't work.

Explanation and Code

Web Driver

I used the Chrome WebDriver with a few options so that PDF files are saved to disk instead of being opened in the browser. See prefs in the code below.

Creating a class

    from selenium import webdriver


    class PaylipsScaper:
        # Init
        def __init__(self, username, password):
            self.username = username
            self.password = password
            # Options
            chrome_options = webdriver.ChromeOptions()
            prefs = {
                "plugins.always_open_pdf_externally": True,
                "download.default_directory": "C:\\tmp",  # folder where files are saved
                "download.prompt_for_download": False,
                "download.directory_upgrade": True,
                "safebrowsing.enabled": True,
            }
            chrome_options.add_experimental_option("prefs", prefs)
            # executable_path and headless are Selenium 3 style arguments;
            # recent Selenium 4 releases no longer accept them.
            chrome_options.headless = True  # If True, hide the browser
            self.driver = webdriver.Chrome(executable_path='chromedriver.exe', options=chrome_options)
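The `executable_path` argument and the `headless` attribute used above are deprecated or removed in recent Selenium 4 releases, so the same setup might look roughly like the sketch below. The names `build_download_prefs` and `make_driver` are illustrative and not part of the original script, and the `--headless=new` flag assumes a reasonably recent Chrome.

```python
def build_download_prefs(download_dir):
    """Chrome preferences that save PDFs to disk instead of previewing them."""
    return {
        "plugins.always_open_pdf_externally": True,
        "download.default_directory": download_dir,
        "download.prompt_for_download": False,
        "download.directory_upgrade": True,
        "safebrowsing.enabled": True,
    }


def make_driver(download_dir, headless=True):
    """Create a Chrome driver (assumes Selenium 4.6+, which locates chromedriver itself)."""
    # Imported here so the prefs helper above can be used without Selenium installed.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", build_download_prefs(download_dir))
    if headless:
        # Assumption: a Chrome version that understands the new headless mode.
        options.add_argument("--headless=new")
    return webdriver.Chrome(options=options)
```

Keeping the preferences in their own function makes it easy to change the download folder without touching the driver creation.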

Login

The login phase is quite easy: select each field by id and type in the value. The only thing I had to pay attention to was the iframe, because in that case you have to call switch_to.frame first.

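    import time
    from selenium.webdriver.common.by import By

    # Inside PaylipsScaper class: manage login page
    def login(self, url):
        driver = self.driver
        driver.get(url)
        # The login form lives inside an iframe, so switch to it first
        driver.switch_to.frame("FunArea")
        username = driver.find_element(By.ID, "login")
        password = driver.find_element(By.ID, "pwd")
        username.send_keys(self.username)
        time.sleep(1)
        password.send_keys(self.password)
        driver.find_element(By.ID, "CmdInvia").click()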

Loop table using XPATH

I created a class that wraps the Selenium driver in order to keep everything clean.
I just reproduced the clicks I would do by hand.
At the beginning I tried CSS selectors, but given the structure of the pages, XPath was a better solution
(to get the XPath with Chrome, see here).
By the way, I don't like time.sleep, but it was useful to avoid navigation problems during the process.
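Since the table is walked row by row, the XPath of a given cell can be built from its row and column indexes. The sketch below reuses the `ContTab` row path that appears in this post; `cell_xpath` is a hypothetical helper, and the assumption that each value sits in a plain `td` under the row is mine.

```python
# Row locator taken from the post; the cell layout beneath it is an assumption.
ROWS_XPATH = "//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"


def cell_xpath(row, col, rows_xpath=ROWS_XPATH):
    """Return the XPath of the cell at 1-based position (row, col)."""
    if row < 1 or col < 1:
        raise ValueError("XPath positions are 1-based")
    return f"{rows_xpath}[{row}]/td[{col}]"


print(cell_xpath(3, 4))
```

With a driver at hand, something like `driver.find_element(By.XPATH, cell_xpath(row, 4)).text` would then read one value, provided the page is really laid out this way.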

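    import glob
    import os
    import time

    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    # Inside PaylipsScaper class
    def get_num_rows(self, num_rows=1):
        driver = self.driver
        self.click_to_payslips_area()
        # Count the rows of the payslip table
        num_rows = len(driver.find_elements(By.XPATH, "//table[@id='ContTab']/tbody/tr/td/div/table/tbody/tr"))
        return num_rows

    [...other stuff...]

    try:
        bot = PaylipsScaper(username, password)
        bot.login(url_website)
        wait = WebDriverWait(bot.driver, 10)
        num_rows = bot.get_num_rows()
        for row in range(1, num_rows + 1):
            # Read year, month and type from the current table row
            paylip_year = bot.get_val_in_cedolino_row(row, 4)
            paylip_month = bot.get_val_in_cedolino_row(row, 5)
            paylip_type = bot.get_val_in_cedolino_row(row, 7)
            # Wait for the download icon of this row, then click it via JavaScript
            icon_xpath = "/html/body/form/div/table/tbody/tr/td/div/table/tbody/tr[" + str(row) + "]/td[10]/img"
            icon = WebDriverWait(bot.driver, 20).until(EC.element_to_be_clickable((By.XPATH, icon_xpath)))
            bot.driver.execute_script("arguments[0].click();", icon)
            time.sleep(2)
            # Take the most recently created PDF in the download folder
            filepdf = dirpath + "\\*.pdf"
            list_of_files = glob.glob(filepdf)
            file_name = max(list_of_files, key=os.path.getctime)
            current_paylip = Paylip(paylip_year, paylip_month, paylip_type, file_name)
            bot.rename_and_move(current_paylip)
            print("Downloaded:")
            print(current_paylip)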
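The fixed `time.sleep(2)` before looking for the newest PDF could be replaced by polling the download folder until a new file shows up. This is a minimal sketch using only the standard library; `wait_for_new_pdf` is a hypothetical helper, and it relies on Chrome keeping unfinished downloads under a `.crdownload` name, so that a `.pdf` file only appears once the download is complete.

```python
import time
from pathlib import Path


def wait_for_new_pdf(download_dir, known_files, timeout=30.0, poll=0.5):
    """Wait for a PDF whose name is not in known_files to appear in download_dir.

    Returns the Path of the new file, or raises TimeoutError.
    """
    folder = Path(download_dir)
    deadline = time.monotonic() + timeout
    while True:
        # Files still being written by Chrome end in .crdownload, not .pdf.
        new_files = [p for p in folder.glob("*.pdf") if p.name not in known_files]
        if new_files:
            # If several new files exist, pick the most recently modified one.
            return max(new_files, key=lambda p: p.stat().st_mtime)
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no new PDF in {folder} after {timeout}s")
        time.sleep(poll)
```

The idea is to take a snapshot such as `known = {p.name for p in Path(dirpath).glob("*.pdf")}` before clicking the download icon and to call `wait_for_new_pdf(dirpath, known)` afterwards.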

Save pdf file in folder and rename it

It's quite a brute-force solution: I take the most recently saved PDF in the download folder and rename it using the information from the website.
Then I move the files into sub-folders by year.

    import os
    import shutil

    # Inside PaylipsScaper class
    def rename_and_move(self, current_paylip):
        # Build the new file name from the data read on the website
        if current_paylip.paylip_month == "":
            new_file_name = 'Cud_' + str(current_paylip.paylip_year) + '_' + str(current_paylip.paylip_type).replace(" ", "_").replace("Completo", "").replace("NORMALE", "") + '.pdf'
        elif "TREDICESIMA" in current_paylip.paylip_type:
            new_file_name = 'Cedolino_' + str(current_paylip.paylip_year) + '_' + str(current_paylip.paylip_month) + '_Tredicesima.pdf'
        else:
            new_file_name = 'Cedolino_' + str(current_paylip.paylip_year) + '_' + str(current_paylip.paylip_month) + '.pdf'
        print(new_file_name)
        new_file_name = os.path.join(dirpath, new_file_name)
        # Rename the file; it is moved into the year directory below
        os.rename(current_paylip.file_name, new_file_name)
        current_paylip.file_name = new_file_name
        # Check whether the year directory exists, otherwise create it
        dirin = os.path.split(new_file_name)
        newdir = dirin[0] + '\\' + current_paylip.paylip_year
        if not os.path.exists(newdir):
            # Create directory
            os.mkdir(newdir)
        # Move the file into the year directory
        if os.path.exists(newdir + "\\" + dirin[1]):
            # If the file already exists, delete it
            os.remove(newdir + "\\" + dirin[1])
        shutil.move(current_paylip.file_name, newdir + "\\" + dirin[1])
        return
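The naming rules and the move into a year folder can also be split into two small functions, which makes the naming testable without touching the disk. The following sketch uses `pathlib`; `build_file_name` and `move_to_year_folder` are hypothetical names, and the rules are copied from `rename_and_move` above.

```python
import shutil
from pathlib import Path


def build_file_name(year, month, paylip_type):
    """Build the archive name following the rules in rename_and_move."""
    if month == "":
        # Assumption: spaces become underscores; the post's flattened code
        # shows an empty string in this replace call.
        suffix = (
            str(paylip_type)
            .replace(" ", "_")
            .replace("Completo", "")
            .replace("NORMALE", "")
        )
        return f"Cud_{year}_{suffix}.pdf"
    if "TREDICESIMA" in paylip_type:
        return f"Cedolino_{year}_{month}_Tredicesima.pdf"
    return f"Cedolino_{year}_{month}.pdf"


def move_to_year_folder(pdf_path, base_dir, year, new_name):
    """Move pdf_path to base_dir/<year>/<new_name>, replacing an existing file."""
    target_dir = Path(base_dir) / str(year)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / new_name
    if target.exists():
        target.unlink()
    shutil.move(str(pdf_path), str(target))
    return target
```

Using `pathlib` here also avoids the hard-coded `"\\"` separators, so the same code would work outside Windows.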

Final situation

Got it. I now have a folder with one subfolder per year, each containing all of that year's payslips with a standard name format.

What I learned:

  • How to use Selenium in Python.
  • Simple automation can save a lot of time and spare you boring manual tasks.
  • How to write my first article here (it's a personal task, so not that useful for you, but better than nothing after all).

Future improvements:

  • input parameters
  • (re)try CSS selectors instead of XPath selectors
  • (re)try BeautifulSoup
  • keep track of the payslips already saved so that the next run only downloads the ones not yet saved
  • read the PDFs and report the data in a file (e.g. Google Sheets)
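
The "only download what is new" item could be handled with a small state file listing the payslips already archived. A minimal sketch, assuming each payslip is identified by its year, month and type; the function names are made up for illustration and are not part of the current script.

```python
import json
from pathlib import Path


def payslip_key(year, month, paylip_type):
    """Key identifying one payslip row (assumed to be unique per row)."""
    return f"{year}|{month}|{paylip_type}"


def load_seen(state_file):
    """Return the set of payslip keys recorded by previous runs."""
    path = Path(state_file)
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_seen(state_file, seen):
    """Persist the set of payslip keys as a sorted JSON list."""
    Path(state_file).write_text(json.dumps(sorted(seen)), encoding="utf-8")
```

Inside the download loop, a row whose key is already in the set returned by `load_seen` would be skipped, and `save_seen` would be called after each successful download.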

Of course the code is only directly useful for me and my colleagues. Still, I hope the idea and the process can be useful to someone else.


Software engineer. Always interested in learning new technologies and in applying them.
  • Location
    Italy
  • Work
    Software Engineering at Verona