Posted on Apr 23, 2024

Sometimes things simply don't work

#puppeteer #crawling #scraping #bug

As I have previously mentioned, I am rather fond of puppeteer. It's a useful library for all kinds of web automation... but like any open source project, it needs some TLC.

I am not in any way associated with the developers at puppeteer, but if you are looking for a way to contribute, they are open source.

The frustration

I was looking at a somewhat long page (think vertically) and tried to create a screenshot of it. The optimist in me assumed it would simply work, so I went on as usual and planned my approach on the assumption that it would function as intended.

I checked the screenshot and found that it was a tiled image of a fixed-size crop from the top of the page. My first reaction was frustration... but I think it was aimed more at myself, because I had not allowed any margin for error in the experiment.

The insight

There is no reason to point fingers when something is not working, especially in OSS. If you have the chops, fix it for yourself and share it; if it is good enough, it might get adopted upstream. In other words, perfect is the enemy of good.

The bug

Before hacking my way out of the jam I scoured the web, since problems are usually not as unique as one might think. I am ashamed to admit it, but I'm not fond of documentation, and digging into the docs of the different related projects is the last step in my debugging journey.

I found that this was related to an old, still open bug in the puppeteer repo.

The discussion was active until quite recently... but the bug is still open.

The consensus I could gather is to either use playwright or apply a workaround in the puppeteer layer. The root cause of the bug is a websocket size limitation in CDP, the Chrome DevTools Protocol that puppeteer uses to talk to Chromium.

I had intended to use playwright, but in some of my tests it failed to load some pages, so I decided to revisit the puppeteer idea and solve the issue where I could.

Hacking my way through it

I started with a height-based chunking method. A more generic approach was to create a chunker that returns a function, so that the chunk height is configurable via the parameter.

// return a chunker function with the height for each chunk
// number will be the full height of the element you want to
// grab a screenshot of
const chunkBy = (n) => (number) => {
  const chunks = new Array(Math.floor(number / n))
    .fill(n)
    .map((c, i) => ({ height: c, start: i * c }));
  // whatever is left after the full-size chunks; this also covers
  // elements shorter than n, where there are no full chunks at all
  const covered = chunks.length * n;
  const remainder = number - covered;
  if (remainder > 0) {
    chunks.push({ height: remainder, start: covered });
  }
  console.log('CHUNKS =', chunks);
  return chunks;
};

// the 4000px chunker used by the screenshot function
const chunkBy4k = chunkBy(4000);
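To make the arithmetic concrete, here is a minimal sketch of the same slicing idea as a standalone function, together with a check that the slices are contiguous and add up to the full height. The names `planChunks` and `coversExactly` are hypothetical and only used for this illustration; the heights are made-up numbers.

```javascript
// Hypothetical helper: split totalHeight into slices of at most chunkHeight.
function planChunks(totalHeight, chunkHeight) {
  const chunks = [];
  for (let start = 0; start < totalHeight; start += chunkHeight) {
    chunks.push({ start, height: Math.min(chunkHeight, totalHeight - start) });
  }
  return chunks;
}

// Sanity check: slices must be contiguous and add up to the full height.
function coversExactly(chunks, totalHeight) {
  let cursor = 0;
  for (const { start, height } of chunks) {
    if (start !== cursor || height <= 0) return false;
    cursor += height;
  }
  return cursor === totalHeight;
}

const plan = planChunks(9500, 4000);
console.log(plan);
// [ { start: 0, height: 4000 }, { start: 4000, height: 4000 }, { start: 8000, height: 1500 } ]
console.log(coversExactly(plan, 9500)); // true
```

An element shorter than one chunk simply yields a single slice, which is the edge case worth checking in any chunker of this kind.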

Afterwards I wrote the function for grabbing the screenshot. It works regardless of the element's height, which is what gets around the CDP limitation.

// imports assumed by this script; mergeImg is presumably the merge-img package
const crypto = require('crypto');
const path = require('path');
const fs = require('fs/promises');
const { readFile, readdir } = fs;
const puppeteer = require('puppeteer');
const mergeImg = require('merge-img');

// urls is a list of string urls
async function grabSelectorScreenshot(urls) {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3');
  const results = [];
  for (const url of urls) {
    const hashed = crypto.createHash('sha256').update(url).digest('hex');
    await page.goto(url, { waitUntil: 'networkidle0' });
    // this is where the element is selected
    const element = await page.$('div#document1 div.eli-container');
    // get height and width for later iterating
    const { width, height } = await element.boundingBox();
    const designatedPathPng = `./screenshots/${hashed}-merged-ss.png`;
    // chunk by 4000 height
    const heights = chunkBy4k(height);
    // keep track of starting point and height
    // to have continuous mapping of the image
    const chunks = heights.map((h, i) => {
      return element.screenshot({
        clip: { x: 0, y: h.start, height: h.height, width },
        path: `./screenshots/${hashed}-${i}-ss.png`,
      });
    });
    // wait for all the part files to be written
    const filesResolved = await Promise.all(chunks);
    // merge all the parts in a vertical layout
    const mergedImage = await mergeImg(filesResolved, { direction: true });
    // this is interesting, the merged image is a promise,
    // but the write only worked via a function callback,
    // so it is wrapped in a promise to let the loop wait for the file
    await new Promise((resolve, reject) => {
      mergedImage.write(designatedPathPng, (err) => (err ? reject(err) : resolve()));
    });
    const dataPng = await readFile(designatedPathPng);
    results.push(Buffer.from(dataPng).toString('base64'));
    // clean up the temporary files created
    await deleteFilesMatchingPattern('./screenshots', new RegExp(`^${hashed}-\\d+-ss\\.png$`));
  }
  // close the browser only after every url has been processed
  await browser.close();
  // one base64 encoded png per url, in the same order as urls
  return results;
}
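The pattern of turning a callback-only write into something that can be awaited is easier to see in isolation. This is a minimal sketch, assuming the callback is node-style (called with an error as its first argument); `writeAsync` and `fakeImage` are hypothetical names made up for this illustration and are not part of any library.

```javascript
// Hypothetical helper: wrap a callback-style write(path, cb) in a Promise.
// Assumes the callback is node-style, i.e. called as cb(err).
function writeAsync(image, filePath) {
  return new Promise((resolve, reject) => {
    image.write(filePath, (err) => (err ? reject(err) : resolve(filePath)));
  });
}

// Stand-in for the merged image, so the sketch runs without any dependency.
const fakeImage = {
  write(filePath, cb) {
    setTimeout(() => cb(null), 10); // pretend the file was written
  },
};

writeAsync(fakeImage, './screenshots/example-merged-ss.png')
  .then((written) => console.log('written:', written));
```

With a wrapper like this, a value computed after the write can be returned to the caller with a plain `await`, instead of being lost inside the callback.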

Cleaning up temporary files

You probably want to clean up the files. One way to do that:

// uses path, readdir and fs (the promise-based fs API), imported at the top of the script
async function deleteFilesMatchingPattern(dirPath, regex) {
  try {
    const files = await readdir(dirPath); // Read all files in the directory
    for (const file of files) {
      if (regex.test(file)) { // Check if the file matches the pattern
        const filePath = path.join(dirPath, file);
        await fs.unlink(filePath); // Delete the file
        console.log(`Deleted: ${filePath}`);
      }
    }
  } catch (error) {
    console.error('Error:', error);
  }
}
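It is worth checking which file names the cleanup pattern actually catches, since the merged image lives in the same folder as the parts. A small sketch, with `abc123` as a stand-in for the real sha256 hash and a made-up list of file names:

```javascript
// Build the same kind of pattern the cleanup call uses for one url hash.
const hashed = 'abc123'; // stand-in for the sha256 hex digest
const partFile = new RegExp(`^${hashed}-\\d+-ss\\.png$`);

const files = [
  `${hashed}-0-ss.png`,
  `${hashed}-12-ss.png`,
  `${hashed}-merged-ss.png`,
  'other-0-ss.png',
];

// Only the numbered parts for this hash match; the merged image is kept.
const toDelete = files.filter((name) => partFile.test(name));
console.log(toDelete); // [ 'abc123-0-ss.png', 'abc123-12-ss.png' ]
```

The `\d+` in the middle is what protects the `-merged-` file, and the `^` and `$` anchors keep parts belonging to other hashes out of the match.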

In hindsight, a better way to do this is probably to use actual tmp files and decouple the cleanup, but this was good enough for a barebones script.
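The tmp file idea could look roughly like this. It is only a sketch and not what the script above does: `withTempDir` is a hypothetical helper name, the `screenshots-` prefix is arbitrary, and it assumes a Node.js version that has `fs.rm` (14.14 or later).

```javascript
const fs = require('node:fs/promises');
const os = require('node:os');
const path = require('node:path');

// Hypothetical helper: run `work` with a fresh temp directory and always
// remove that directory afterwards, whether `work` succeeds or throws.
async function withTempDir(work) {
  const dir = await fs.mkdtemp(path.join(os.tmpdir(), 'screenshots-'));
  try {
    return await work(dir);
  } finally {
    await fs.rm(dir, { recursive: true, force: true });
  }
}

// Example: write a placeholder part file, then read it back before cleanup.
withTempDir(async (dir) => {
  const partPath = path.join(dir, 'part-0.png');
  await fs.writeFile(partPath, 'not a real png');
  return (await fs.readFile(partPath, 'utf8')).length;
}).then((length) => console.log('bytes read before cleanup:', length));
```

The part screenshots would be written into `dir`, and only the merged image would go to its final location, so no pattern matching is needed to tell temporary files from results.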

Conclusion

OSS needs some TLC
problems are rarely unique
it's better to hack at it and unblock yourself; switching libraries is more of a PITA, with no guarantee the replacement works any better