With some help I've put together a macro to scrape webpage elements by tag ID from URLs (indicated in column A of an Excel sheet) and insert the text into columns B, C and D. The code loops from 1st row until 1000th. So far so good.
The problem is that it seems to be taking SOOOO much time. Is there any way to speed up the process? I feel like my use of IE pages isn't optimal.
(Please note that I'm aware that, on error, the code copies previous data as new, but I have an easy manual workaround to clean the data, which works fine until I get around to fixing that issue.)
```vba
Sub getCountry()
'
' GetCountry Macro
'
' Keyboard Shortcut: Ctrl+k
'
Dim IE As Object: Set IE = CreateObject("InternetExplorer.Application")
Dim ws As Worksheet: Set ws = ThisWorkbook.Sheets("Sheet1")
Dim Country As String
Dim Category As String
Dim Network As String
Dim n As Integer
n = 729
For n = 729 To 1000
With IE
Visible = False
.navigate ws.Range("A" & n).Value
Do
DoEvents
Loop Until .readyState = 4
End With
On Error Resume Next
Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
On Error Resume Next
Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
On Error Resume Next
Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
ws.Range("B" & n).Value2 = Country
ws.Range("C" & n).Value2 = Category
ws.Range("D" & n).Value2 = Network
Next n
IE.Quit
End Sub
```

- Is there a reason you did this in VBA, instead of scraping it with a more specialized tool and exporting to CSV (which Excel can import perfectly fine)? (Commented Sep 30, 2017 at 20:03)
- Probably because of a lack of knowledge and a strong liking for Excel. I have no knowledge of any tool specialized in scraping a long list of varying URLs, but I'm open to suggestions. As for the edit, I don't get the edit of the title; I thought phrasing it as a question was recommended. Thanks for the feedback and time! - Michael McCarthy (Commented Sep 30, 2017 at 20:06)
2 Answers
The indentation is terribly misleading.
Compare yours:
```vba
For n = 729 To 1000
With IE
Visible = False
.navigate ws.Range("A" & n).Value
Do
DoEvents
Loop Until .readyState = 4
End With
```
To how Rubberduck's Smart Indenter default settings do it:
```vba
For n = 729 To 1000
    With IE
        Visible = False
        .navigate ws.Range("A" & n).Value
        Do
            DoEvents
        Loop Until .readyState = 4
    End With
```

If I glance at your code, I've no idea where the `With` block ends, and at second glance it appears to end in the middle of a `Do` loop body (which would be illegal) - it takes a third read to pick up the matching `Loop` keyword hidden in the code.
Before modifying anything, fix the indentation.
Running Rubberduck inspections reveals that variable `Visible` is never declared, which means `Option Explicit` isn't specified at the top of the module, which means VBA will happily compile typos, and that can easily lead to very hard-to-debug problems. ALWAYS specify `Option Explicit`, and declare *all* variables.
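To illustrate: in the posted code, `Visible = False` inside the `With IE` block is missing its leading dot, so it assigns to an undeclared variable instead of setting `IE.Visible`. A minimal sketch of how `Option Explicit` catches this (the surrounding `Sub` is trimmed down for the example):

```vba
Option Explicit

Sub GetCountry()
    Dim IE As Object: Set IE = CreateObject("InternetExplorer.Application")
    With IE
        Visible = False   'compile error under Option Explicit: "Variable not defined"
        .Visible = False  'what was intended: sets IE.Visible via the With block
    End With
    IE.Quit
End Sub
```

Without `Option Explicit`, the dotless line compiles fine and silently creates a new `Variant` named `Visible` - which is exactly the kind of bug that costs hours.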
The macro is *implicitly* `Public`, `n` should probably be declared `As Long`, and the procedure name should be `PascalCase`, i.e. `GetCountry` - although the name suggests it's *getting* some value, which usually implies a `Function` that *returns* that value. `Sub` procedures *do* something; they don't *get* stuff.
`On Error Resume Next` will make your macro keep running in an error state, which makes the assignments fail and, as you noted, the previous iteration's value gets repeated. Why not "reset" the 3 values at the end of each iteration? That way a failed `getElementByID` would write an empty string instead of duplicating data. That said...
```vba
On Error Resume Next
Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
On Error Resume Next
Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
On Error Resume Next
Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
```
Error handling doesn't work that way; only the first `On Error Resume Next` changes anything, the other two are perfectly redundant. If you want to *restore* error handling you need to do `On Error GoTo 0`:
```vba
On Error Resume Next 'suppresses run-time errors
Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
Err.Clear 'clears error state
On Error GoTo 0 'restores run-time errors
```

As for speeding things up, assuming lightning-speed scraping this would be the bottleneck:
```vba
ws.Range("B" & n).Value2 = Country
ws.Range("C" & n).Value2 = Category
ws.Range("D" & n).Value2 = Network
```
You know how many iterations you're making, therefore you know from the start how many rows you're going to be writing to the worksheet; the fastest way to write to a worksheet isn't one cell at a time.
Declare a 2D array, populate it in your loop, and dump the array onto the worksheet to perform all `Range` writes in one single, near-instantaneous operation. Otherwise you're having Excel raise `Worksheet.Change` events, repainting itself, and possibly re-calculating things *every time* you write to a cell.
You could "cheat" and do `Application.ScreenUpdating = False` at the top and set it back to `True` at the end of the procedure, but that wouldn't make the code itself any more efficient. Go with the 2D array.
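A minimal sketch of the 2D-array approach, with the scraping step reduced to placeholder assignments (the constant names and the `Resize` call are illustrative, not part of the original code):

```vba
Sub WriteResultsInOneGo()
    Const FIRST_ROW As Long = 729
    Const LAST_ROW As Long = 1000

    'one row per URL, three columns for Country / Category / Network
    Dim results(1 To LAST_ROW - FIRST_ROW + 1, 1 To 3) As String

    Dim n As Long
    For n = FIRST_ROW To LAST_ROW
        'scrape here, then store into the array instead of the worksheet:
        results(n - FIRST_ROW + 1, 1) = "country"    'placeholder values
        results(n - FIRST_ROW + 1, 2) = "category"
        results(n - FIRST_ROW + 1, 3) = "network"
    Next n

    'one single write covering B729:D1000 - no per-cell events or repaints
    ThisWorkbook.Sheets("Sheet1").Range("B" & FIRST_ROW) _
        .Resize(UBound(results, 1), UBound(results, 2)).Value2 = results
End Sub
```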
The main problem with scraping in a browser is this: along with the data (the only thing you need, i.e. the HTML), the browser downloads everything (HTML, CSS, JS) and renders it (even more time wasted).
The best solution for non-interactive scraping is to use an HTTP request and get only the HTML as the response. This way you are working with the HTML DOM without the rendering overhead.
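A sketch of that idea, using a late-bound `MSXML2.XMLHTTP` request parsed into an `htmlfile` DOM (the helper name is hypothetical; the element IDs are the ones from the question):

```vba
Function GetElementTextFromUrl(ByVal url As String, ByVal elementId As String) As String
    Dim request As Object
    Set request = CreateObject("MSXML2.XMLHTTP")
    request.Open "GET", url, False   'synchronous GET - no browser, no rendering
    request.send

    Dim html As Object
    Set html = CreateObject("htmlfile")   'late-bound HTML DOM parser
    html.body.innerHTML = request.responseText

    On Error Resume Next   'element may be missing; return an empty string then
    GetElementTextFromUrl = Trim$(html.getElementById(elementId).innerText)
    On Error GoTo 0
End Function

'usage, inside the loop:
'Country = GetElementTextFromUrl(ws.Range("A" & n).Value, "youtube-user-page-country")
```

Note this only sees the HTML the server sends; if the page builds those elements with JavaScript after load, an HTTP request alone won't find them.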
Storing and writing the data in Excel should be done Mat's way: store it all in a 2D array and write it to the sheet once. Writing one cell at a time is brutally inefficient.