DEV Community
Leonardo Gasparini Romão

Speed up web scraping using C#

This article is part of a series on web scraping using C#:

How to web scraping using C#
Speed up web scraping using C#

Now we use parallelism to speed up our web scraping code. When we gather data from the web, we usually want more than one page. In the last article I used a single page to test web scraping, but if we need a large set of information, we need a better solution.

Looping through all the pages on a single thread takes a long time, so another option is to run the requests in parallel. This example uses Parallel.ForEach to fetch up to four pages at the same time:

```csharp
using System;
using System.Linq;
using System.Net;
using System.Threading.Tasks;
using HtmlAgilityPack;

var links = new string[]
{
    "https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_fellowship_of_the_ring",
    "https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_two_towers",
    "https://www.rottentomatoes.com/m/the_lord_of_the_rings_the_return_of_the_king",
    "https://www.rottentomatoes.com/m/the_hobbit_an_unexpected_journey",
    "https://www.rottentomatoes.com/m/the_hobbit_the_desolation_of_smaug",
    "https://www.rottentomatoes.com/m/the_hobbit_the_battle_of_the_five_armies"
};

Console.WriteLine("Getting page from movie...");

// Process the links in parallel, at most four at a time
Parallel.ForEach(links, new ParallelOptions { MaxDegreeOfParallelism = 4 }, link =>
{
    using var Client = new WebClient();

    //Download Html from a Url:
    var HtmlRequestResult = Client.DownloadString(link);

    //Load HtmlString to AgilityPack Document
    var Document = new HtmlDocument();
    Document.LoadHtml(HtmlRequestResult);

    //Get movie title, critic score and user score
    var MovieTitle = Document.DocumentNode.Descendants("h1").FirstOrDefault()?.InnerText.Trim();
    var CriticScore = Document.GetElementbyId("tomato_meter_link")?.InnerText.Trim();
    var UserScore = Document.DocumentNode.Descendants("a")
        .FirstOrDefault(x => x.GetAttributeValue("href", "") == "#audience_reviews")?.InnerText.Trim();

    Console.WriteLine(string.Format(" Title:{0} \r\n Critic Score:{1} \r\n User Score:{2}", MovieTitle, CriticScore, UserScore));
});

Console.WriteLine("Press any key to close the program...");
Console.ReadKey();
```

The console now prints all the movies, but in no particular order: the links are processed in parallel, and each request finishes at a different time.
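If the output order matters, one option is to give every link its own slot in a results array and print only after the loop finishes. This is a minimal sketch, not part of the original code: `OrderedScrape`, `Run` and the `fetch` delegate are hypothetical names, and `fetch` stands in for the download-and-parse step of the example above.

```csharp
using System;
using System.Threading.Tasks;

public static class OrderedScrape
{
    // 'fetch' is a hypothetical stand-in for "download the page and extract the text we want".
    public static string[] Run(string[] links, Func<string, string> fetch, int maxParallel = 4)
    {
        // One slot per link: each iteration writes only to its own index,
        // so no locking is needed and the input order is preserved.
        var results = new string[links.Length];
        var options = new ParallelOptions { MaxDegreeOfParallelism = maxParallel };

        Parallel.For(0, links.Length, options, i =>
        {
            results[i] = fetch(links[i]);
        });

        return results;
    }
}
```

The pages are still downloaded in parallel. Only the printing moves to a plain loop over the returned array, so the movies appear in the same order as the links.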


As with any parallel application, we now need to manage and control the degree of parallelism, because too much of it can use enough memory or CPU to slow down our whole infrastructure, so be careful. We also need to pay attention to the website we are visiting: many simultaneous requests can be mistaken for a DDoS attack and get our code blocked, so don't hit the site too hard.
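One way to keep the load on the site under control is to cap the number of requests in flight and pause briefly after each one. The sketch below is an assumption about how that could look, not the article's original code: `PoliteScrape`, `RunAsync`, `fetch`, `maxConcurrent` and `pauseAfterEach` are hypothetical names, and suitable values depend on the site being scraped.

```csharp
using System;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class PoliteScrape
{
    // 'fetch' is a hypothetical stand-in for the actual download of one page.
    public static async Task<string[]> RunAsync(
        string[] links,
        Func<string, Task<string>> fetch,
        int maxConcurrent,
        TimeSpan pauseAfterEach)
    {
        // The semaphore lets at most 'maxConcurrent' downloads run at once.
        using var gate = new SemaphoreSlim(maxConcurrent);

        var tasks = links.Select(async link =>
        {
            await gate.WaitAsync();
            try
            {
                var html = await fetch(link);
                // Keep the slot busy for a moment so the next request does not start immediately.
                await Task.Delay(pauseAfterEach);
                return html;
            }
            finally
            {
                gate.Release();
            }
        });

        // Task.WhenAll returns the results in the same order as the input links.
        return await Task.WhenAll(tasks);
    }
}
```

For real pages, `fetch` could be something like `link => client.GetStringAsync(link)` on a single shared `HttpClient` instance. Lowering `maxConcurrent` or raising `pauseAfterEach` trades speed for a lighter footprint on the target site.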

