dotnetcore/DotnetSpider

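Disclaimer: This framework is intended to help developers simplify their development workflow and improve efficiency. Do not use it for anything that violates national law. The authors of this framework bear no responsibility for anything its users do with it.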

Build Status | NuGet | Member project of .NET Core Community | GitHub license

DotnetSpider, a .NET Standard web crawling library. It is a lightweight, efficient, and fast high-level web crawling & scraping framework.

To get the latest beta packages, add the MyGet feed to your NuGet package sources:

    <add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
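
For context, an `<add>` element like the one above normally lives inside the `<packageSources>` section of a `nuget.config` file. The script below is a minimal sketch that writes such a file into the current directory. The file location and the inclusion of the default nuget.org source are assumptions made for illustration; this README does not specify them.

```shell
#!/bin/sh
# Sketch: create ./nuget.config registering the MyGet feed from this README
# next to the default nuget.org source.
# Assumption: overwriting ./nuget.config is acceptable. Back up any existing
# file first, or merge the <add> line into it by hand.
set -eu

cat > nuget.config <<'EOF'
<?xml version="1.0" encoding="utf-8"?>
<configuration>
  <packageSources>
    <add key="nuget.org" value="https://api.nuget.org/v3/index.json" protocolVersion="3" />
    <add key="myget.org" value="https://www.myget.org/F/zlzforever/api/v3/index.json" protocolVersion="3" />
  </packageSources>
</configuration>
EOF

echo "wrote $(pwd)/nuget.config"
```

On recent .NET SDKs the same feed can usually be registered without editing the file, using `dotnet nuget add source https://www.myget.org/F/zlzforever/api/v3/index.json --name myget.org`.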

DESIGN

DESIGN IMAGE

DEVELOPMENT ENVIRONMENT

  1. Visual Studio 2017 (15.3 or later) or JetBrains Rider

  2. .NET Core 2.2 or later

  3. Docker

  4. MySQL

     docker run --name mysql -d -p 3306:3306 --restart always -e MYSQL_ROOT_PASSWORD=1qazZAQ! mysql:5.7
  5. Redis (optional)

     docker run --name redis -d -p 6379:6379 --restart always redis
  6. SQL Server

     docker run --name sqlserver -d -p 1433:1433 --restart always -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=1qazZAQ!' mcr.microsoft.com/mssql/server:2017-latest
  7. PostgreSQL (optional)

     docker run --name postgres -d -p 5432:5432 --restart always -e POSTGRES_PASSWORD=1qazZAQ! postgres
  8. MongoDB (optional)

     docker run --name mongo -d -p 27017:27017 --restart always mongo
  9. RabbitMQ

     docker run -d --restart always --name rabbitmq -p 4369:4369 -p 5671-5672:5671-5672 -p 25672:25672 -p 15671-15672:15671-15672 -e RABBITMQ_DEFAULT_USER=user -e RABBITMQ_DEFAULT_PASS=password rabbitmq:3-management
  10. Docker remote API for Mac

     docker run -d --restart always --name socat -v /var/run/docker.sock:/var/run/docker.sock -p 2376:2375 bobrik/socat TCP4-LISTEN:2375,fork,reuseaddr UNIX-CONNECT:/var/run/docker.sock
  11. HBase

     docker run -d --restart always --name hbase -p 20550:8080 -p 8085:8085 -p 9090:9090 -p 9095:9095 -p 16010:16010 dajobe/hbase

MORE DOCUMENTS

https://github.com/dotnetcore/DotnetSpider/wiki

SAMPLES

See the DotnetSpider.Sample project in the solution.

BASIC USAGE

Basic usage code

ADDITIONAL USAGE: Configurable Entity Spider

View the complete code

    [DisplayName("博客园爬虫")] // display name: "Cnblogs spider"
    public class EntitySpider(
        IOptions<SpiderOptions> options,
        DependenceServices services,
        ILogger<Spider> logger) : Spider(options, services, logger)
    {
        public static async Task RunAsync()
        {
            var builder = Builder.CreateDefaultBuilder<EntitySpider>(options => { options.Speed = 1; });
            builder.UseSerilog();
            builder.IgnoreServerCertificateError();
            await builder.Build().RunAsync();
        }

        protected override async Task InitializeAsync(CancellationToken stoppingToken = default)
        {
            AddDataFlow<DataParser<CnblogsEntry>>();
            AddDataFlow(GetDefaultStorage);
            // The request carries a property "网站" ("website") with the value "博客园" ("Cnblogs");
            // CnblogsEntry.WebSite reads it back through an Environment selector.
            await AddRequestsAsync(
                new Request("https://news.cnblogs.com/n/page/1",
                    new Dictionary<string, object> { { "网站", "博客园" } }));
        }

        [Schema("cnblogs", "news")]
        [EntitySelector(Expression = ".//div[@class='news_block']", Type = SelectorType.XPath)]
        // "类别" means "category"; CnblogsEntry.Category reads this global value by name.
        [GlobalValueSelector(Expression = ".//a[@class='current']", Name = "类别", Type = SelectorType.XPath)]
        [GlobalValueSelector(Expression = "//title", Name = "Title", Type = SelectorType.XPath)]
        [FollowRequestSelector(Expressions = ["//div[@class='pager']"])]
        public class CnblogsEntry : EntityBase<CnblogsEntry>
        {
            protected override void Configure()
            {
                HasIndex(x => x.Title);
                HasIndex(x => new { x.WebSite, x.Guid }, true);
            }

            public int Id { get; set; }

            [Required]
            [StringLength(200)]
            [ValueSelector(Expression = "类别", Type = SelectorType.Environment)]
            public string Category { get; set; }

            [Required]
            [StringLength(200)]
            [ValueSelector(Expression = "网站", Type = SelectorType.Environment)]
            public string WebSite { get; set; }

            // Removes the " - 博客园" (" - Cnblogs") suffix from the page title.
            [StringLength(200)]
            [ValueSelector(Expression = "Title", Type = SelectorType.Environment)]
            [ReplaceFormatter(NewValue = "", OldValue = " - 博客园")]
            public string Title { get; set; }

            [StringLength(40)]
            [ValueSelector(Expression = "GUID", Type = SelectorType.Environment)]
            public string Guid { get; set; }

            [ValueSelector(Expression = ".//h2[@class='news_entry']/a")]
            public string News { get; set; }

            [ValueSelector(Expression = ".//h2[@class='news_entry']/a/@href")]
            public string Url { get; set; }

            [ValueSelector(Expression = ".//div[@class='entry_summary']")]
            [TrimFormatter]
            public string PlainText { get; set; }

            [ValueSelector(Expression = "DATETIME", Type = SelectorType.Environment)]
            public DateTime CreationTime { get; set; }
        }
    }

Distributed spider

Read this document

Puppeteer downloader

Coming soon

NOTICE

When you use the Redis scheduler, update your Redis configuration with these settings:

    timeout 0
    tcp-keepalive 60
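
One way to apply these two settings is to add them to the `redis.conf` file the server is started with. The script below is a minimal sketch under that assumption. The `REDIS_CONF` variable and its default `./redis.conf` are hypothetical names chosen for illustration, and the helper only adds a directive when no line for it exists yet; it does not change an existing value.

```shell
#!/bin/sh
# Sketch: make sure the two settings from this README are present in a Redis
# config file. REDIS_CONF is a hypothetical variable; it defaults to ./redis.conf.
set -eu

conf="${REDIS_CONF:-./redis.conf}"
touch "$conf"

# Append "name value" unless the file already has a line starting with that
# directive name. An existing line is left as it is, whatever its value.
ensure_setting() {
    name="$1"
    value="$2"
    if ! grep -q "^${name}[[:space:]]" "$conf"; then
        printf '%s %s\n' "$name" "$value" >> "$conf"
    fi
}

ensure_setting timeout 0          # 0 disables closing of idle client connections
ensure_setting tcp-keepalive 60   # TCP keepalive interval, in seconds
```

On a server that is already running, the same values can be set with `redis-cli CONFIG SET timeout 0` and `redis-cli CONFIG SET tcp-keepalive 60`. Values set this way are lost on restart unless the configuration file is updated as well.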

Dependencies

| Package | License |
| --- | --- |
| Bert.RateLimiters | Apache 2.0 |
| MessagePack | MIT |
| Newtonsoft.Json | MIT |
| Dapper | Apache 2.0 |
| HtmlAgilityPack | MIT |
| ZCJ.HashedWheelTimer | MIT |
| murmurhash | Apache 2.0 |
| Serilog.AspNetCore | Apache 2.0 |
| Serilog.Sinks.Console | Apache 2.0 |
| Serilog.Sinks.RollingFile | Apache 2.0 |
| Serilog.Sinks.PeriodicBatching | Apache 2.0 |
| MongoDB.Driver | Apache 2.0 |
| MySqlConnector | MIT |
| AutoMapper.Extensions.Microsoft.DependencyInjection | MIT |
| Docker.DotNet | MIT |
| BuildBundlerMinifier | Apache 2.0 |
| Pomelo.EntityFrameworkCore.MySql | MIT |
| Quartz.AspNetCore | Apache 2.0 |
| Quartz.AspNetCore.MySqlConnector | Apache 2.0 |
| Npgsql | PostgreSQL License |
| RabbitMQ.Client | Apache 2.0 |
| Polly | BSD 3-Clause |

AREAS FOR IMPROVEMENT

QQ Group: 477731655
Email: zlzforever@163.com
