
With some help I've put together a macro that scrapes webpage elements by tag ID from URLs (listed in column A of an Excel sheet) and inserts the text into columns B, C and D. The code loops from the 1st row to the 1000th. So far so good.

The problem is that it seems to be taking SOOOO much time. Is there any way to speed up the process? I feel like my use of IE pages isn't optimal.

(Please note that I'm aware that, because of the On Error Resume Next handling, the code copies previous data as new, but I have an easy manual workaround to clean the data, which works fine until I get around to fixing that issue.)

Sub getCountry()
'
' GetCountry Macro
'
' Keyboard Shortcut: Ctrl+k
'
  Dim IE      As Object: Set IE = CreateObject("InternetExplorer.Application")
  Dim ws      As Worksheet: Set ws = ThisWorkbook.Sheets("Sheet1")
  Dim Country As String
  Dim Category As String
  Dim Network As String
  Dim n As Integer
  n = 729
  For n = 729 To 1000
    With IE
    Visible = False
    .navigate ws.Range("A" & n).Value
    Do
      DoEvents
        Loop Until .readyState = 4
      End With
      On Error Resume Next
      Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
      On Error Resume Next
      Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
      On Error Resume Next
      Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
      ws.Range("B" & n).Value2 = Country
      ws.Range("C" & n).Value2 = Category
      ws.Range("D" & n).Value2 = Network
    Next n
  IE.Quit
End Sub
asked Sep 30, 2017 at 19:49 by Michael McCarthy
  • Is there a reason you did this in VBA, instead of scraping it with a more specialized tool and exporting to CSV (which Excel can import perfectly fine)? (Commented Sep 30, 2017 at 20:03)
  • Probably because of a lack of knowledge and a strong liking for Excel. I don't know of any tool specialized in scraping a long list of varying URLs, but I'm open to suggestions. As for the edit, I don't get the change to the title; I thought phrasing it as a question was recommended. Thanks for the feedback and time! (Commented Sep 30, 2017 at 20:06)
  • As for the edit, no, the current title is more appropriate. See this meta post for more information. (Commented Sep 30, 2017 at 20:13)

2 Answers


The indentation is terribly misleading.

Compare yours:

For n = 729 To 1000
    With IE
    Visible = False
    .navigate ws.Range("A" & n).Value
    Do
      DoEvents
        Loop Until .readyState = 4
      End With

To how Rubberduck's Smart Indenter default settings do it:

For n = 729 To 1000
    With IE
        Visible = False
        .navigate ws.Range("A" & n).Value
        Do
            DoEvents
        Loop Until .readyState = 4
    End With

If I glance at your code, I've no idea where the With block ends, and at second glance it appears to end in the middle of a Do loop body (which would be illegal) - it takes a third read to pick up the matching Loop keyword hidden in the code.

Before modifying anything, fix the indentation.

Running Rubberduck inspections reveals that variable Visible is never declared, which means Option Explicit isn't specified at the top of the module, which means VBA will happily compile typos, and that can easily lead to very hard-to-debug problems. ALWAYS specify Option Explicit, and declare all variables.

The macro is implicitly Public, n should probably be declared As Long, and the procedure name should be PascalCase, i.e. GetCountry, although the name suggests it's getting some value, which usually implies a Function that returns that value - Sub procedures do something, they don't get stuff.
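
Putting those points together, a minimal sketch of what the module top and the procedure header could look like (illustrative only, reusing the question's objects):

    Option Explicit

    Public Sub GetCountry()

        Dim ie As Object
        Set ie = CreateObject("InternetExplorer.Application")
        ie.Visible = False 'note the qualifier: a bare 'Visible = False' just assigns an undeclared variable

        Dim n As Long 'Long, not Integer: worksheet rows go well past 32,767

        ' ...navigation, scraping and writing go here...

        ie.Quit

    End Sub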

On Error Resume Next will make your macro keep running in error state, which makes the assignments fail and, as you noted, the previous iteration's value gets repeated. Why not "reset" the 3 values at the end of each iteration? That way a failed getElementByID would write an empty string instead of duplicating data (see the sketch after the corrected error handling below). That said...

    On Error Resume Next
    Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
    On Error Resume Next
    Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
    On Error Resume Next
    Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)

Error handling doesn't work that way; only the first OERN changes anything, the other two are perfectly redundant. If you want to restore error handling you need to do On Error GoTo 0:

    On Error Resume Next 'suppresses run-time errors
    Country = Trim$(IE.document.getElementByID("youtube-user-page-country").innerText)
    Category = Trim$(IE.document.getElementByID("youtube-user-page-channeltype").innerText)
    Network = Trim$(IE.document.getElementByID("youtube-user-page-network").innerText)
    Err.Clear 'clears error state
    On Error GoTo 0 'restores run-time errors
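
And the "reset" idea mentioned above, sketched with the question's variable names, could look like this at the end of each iteration:

    For n = 729 To 1000
        ' ...navigate and scrape...

        ws.Range("B" & n).Value2 = Country
        ws.Range("C" & n).Value2 = Category
        ws.Range("D" & n).Value2 = Network

        'reset so that a failed getElementByID on the next page leaves the cell empty
        'instead of silently repeating this row's values
        Country = vbNullString
        Category = vbNullString
        Network = vbNullString
    Next n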

As for speeding things up, assuming lightning-fast scraping, this would be the bottleneck:

    ws.Range("B" & n).Value2 = Country    ws.Range("C" & n).Value2 = Category    ws.Range("D" & n).Value2 = Network

You know how many iterations you're making, therefore you know from the start how many rows you're going to be writing to the worksheet; the fastest way to write to a worksheet isn't one cell at a time.

Declare a 2D array, populate it in your loop, and dump the array onto the worksheet to perform all Range writes in one single, instantaneous operation. Otherwise you're having Excel raise Worksheet.Change events, repainting itself, and possibly re-calculating things every time you write to a cell.

You could "cheat" and doApplication.ScreenUpdating = False at the top and set it back toTrue at the end of the procedure, but that wouldn't be any more efficient code. Go with the 2D array.

answered Sep 30, 2017 at 20:29 by Mathieu Guindon

The main problem with scraping in a browser is this: along with the data you actually need (the HTML), the browser downloads everything (HTML, CSS, JS) and renders it, which wastes even more time.

The best solution for non-interactive scraping is to use an HTTP request and get only the HTML as the response. That way you are working with the HTML DOM without the rendering overhead.
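
A minimal sketch of that idea in VBA, using late binding (the element ID is taken from the question; the procedure name is made up, and depending on the installed MSHTML version you may prefer a reference to the Microsoft HTML Object Library over the late-bound "htmlfile" object):

    'Sketch only: fetch the HTML over HTTP and parse it without rendering a browser
    Public Sub GetCountryViaHttp(ByVal url As String)

        Dim http As Object
        Set http = CreateObject("MSXML2.XMLHTTP")
        http.Open "GET", url, False 'synchronous GET: no browser window, no rendering
        http.send

        Dim doc As Object
        Set doc = CreateObject("htmlfile") 'late-bound MSHTML document
        doc.body.innerHTML = http.responseText

        'same element lookup as before, but against the in-memory DOM
        Debug.Print Trim$(doc.getElementById("youtube-user-page-country").innerText)

    End Sub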

Storing and writing the data in Excel should be done Mat's way: store it all in a 2D array and write it to the sheet once. Writing one cell at a time is brutally inefficient.

answered Oct 1, 2017 at 6:38 by kolcinx
