
MySQL download

https://dev.mysql.com/downloads/installer/

python3 path

Preferably use 3.5: /Library/Frameworks/Python.framework/Versions/3.5/bin/python3.5

Python Douban mirror

pip install *** -i https://pypi.douban.com/simple/

virtualenv

Create an environment directory with python3:

virtualenv -p E:\soft\Python36\python.exe testProject

Missing vcruntime140.dll

Copy that file from the python3 installation into the Scripts folder of the directory created by virtualenv.

virtualenvwrapper

Installation

windows: pip install virtualenvwrapper-win

mac: pip install virtualenvwrapper --ignore-installed six

Set the working directory

WORKON_HOME

deactivate

workon **

mkvirtualenv --python=path\python.exe project

If vcruntime140.dll is still reported missing, create the environment with virtualenv -p E:\soft\Python36\python.exe testProject instead.

Package installation errors on Windows

www.lfd.uci.edu/~gohlke/pythonlibs/

Deduplication

md5-hash the URL

bitmap

For example, attendance for 8 employees can be recorded in a single byte, each bit marking present or absent (sketch below).
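A minimal Python sketch of the idea (sizes and names here are illustrative): md5 maps a URL to a number, and one bit per slot marks seen/unseen, exactly as one bit per employee marks attendance.

import hashlib

# 8 employees -> 1 byte; bit i is 1 if employee i was present
attendance = bytearray(1)
attendance[0] |= 1 << 3                   # employee 3 clocked in
present = bool(attendance[0] & (1 << 3))  # True

# the same trick for URL dedup: hash the URL to a bit slot
NUM_BITS = 1 << 20                        # ~128 KB of memory
seen = bytearray(NUM_BITS // 8)

def mark_seen(url):
    slot = int(hashlib.md5(url.encode("utf8")).hexdigest(), 16) % NUM_BITS
    seen[slot // 8] |= 1 << (slot % 8)

def maybe_seen(url):
    # False is definite; True may be a collision (a bloomfilter refines this)
    slot = int(hashlib.md5(url.encode("utf8")).hexdigest(), 16) % NUM_BITS
    return bool(seen[slot // 8] & (1 << (slot % 8)))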

Encoding

utf8 is generally used for storage; it is variable-width (8-bit Unicode Transformation Format)

unicode is generally used while programming, since it uses a fixed 2 bytes

python3 converts everything to unicode

u"abc" denotes a unicode string

On Windows (Python 2):

s = '我用python'  # gb2312 on Windows, utf8 on Linux
s.encode("utf8")  # fails: s must be unicode first
s.decode("gb2312").encode("utf8")  # now utf8, as on Linux

scrapy on Windows

  • Install pywin32: http://www.lfd.uci.edu/~gohlke/pythonlibs/#pywin32
  • Copy the files under Lib\site-packages\pywin32_system32 to the system's system32 directory

item

github scrapy-djangoitem

jobbole article

Zhihu

DROP TABLE IF EXISTS `zhihu_question`;
CREATE TABLE `zhihu_question` (
  `zhihu_id` bigint(20) NOT NULL,
  `topics` varchar(255) DEFAULT NULL,
  `url` varchar(300) NOT NULL,
  `title` varchar(200) NOT NULL,
  `content` longtext NOT NULL,
  `create_time` datetime DEFAULT NULL,
  `update_time` datetime DEFAULT NULL,
  `answer_num` int(11) NOT NULL DEFAULT '0',
  `comments_num` int(11) NOT NULL DEFAULT '0',
  `watch_user_num` int(11) NOT NULL DEFAULT '0',
  `click_num` int(11) NOT NULL DEFAULT '0',
  `crawl_time` datetime NOT NULL,
  `crawl_update_time` datetime DEFAULT NULL,
  PRIMARY KEY (`zhihu_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

DROP TABLE IF EXISTS `zhihu_answer`;
CREATE TABLE `zhihu_answer` (
  `zhihu_id` bigint(20) NOT NULL,
  `url` varchar(300) NOT NULL,
  `question_id` bigint(20) NOT NULL,
  `author_id` varchar(100) DEFAULT NULL,
  `content` longtext NOT NULL,
  `praise_num` int(11) NOT NULL DEFAULT '0',
  `comments_num` int(11) NOT NULL DEFAULT '0',
  `create_time` datetime NOT NULL DEFAULT '0000-00-00 00:00:00',
  `update_time` datetime NOT NULL,
  `crawl_time` datetime NOT NULL,
  `crawl_update_time` datetime DEFAULT NULL,
  PRIMARY KEY (`zhihu_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

jobbole

CREATE TABLE `jobbole` (
  `title` varchar(200) NOT NULL DEFAULT '',
  `create_date` date DEFAULT NULL,
  `url` varchar(300) NOT NULL DEFAULT '',
  `url_object_id` varchar(50) NOT NULL DEFAULT '',
  `front_image_url` varchar(300) DEFAULT '',
  `front_image_path` varchar(200) DEFAULT NULL,
  `comment_nums` int(11) DEFAULT '0',
  `fav_nums` int(11) DEFAULT '0',
  `praise_nums` int(11) DEFAULT '0',
  `tags` varchar(200) DEFAULT NULL,
  `content` longtext,
  PRIMARY KEY (`url_object_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

lagou

DROP TABLE IF EXISTS `lagou_job`;
CREATE TABLE `lagou_job` (
  `url` varchar(300) NOT NULL,
  `url_object_id` varchar(50) NOT NULL,
  `title` varchar(100) NOT NULL,
  `salary` varchar(20) DEFAULT NULL,
  `job_city` varchar(10) DEFAULT NULL,
  `work_years` varchar(100) DEFAULT NULL,
  `degree_need` varchar(30) DEFAULT NULL,
  `job_type` varchar(20) DEFAULT NULL,
  `publish_time` varchar(20) NOT NULL,
  `tags` varchar(100) DEFAULT NULL,
  `job_advantage` varchar(1000) DEFAULT NULL,
  `job_desc` longtext NOT NULL,
  `job_addr` varchar(50) DEFAULT NULL,
  `company_url` varchar(300) DEFAULT NULL,
  `company_name` varchar(100) DEFAULT NULL,
  `crawl_time` datetime NOT NULL,
  `crawl_update_time` datetime DEFAULT NULL,
  PRIMARY KEY (`url_object_id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Proxies

DROP TABLE IF EXISTS `proxy_ip`;
CREATE TABLE `proxy_ip` (
  `ip` varchar(20) NOT NULL,
  `port` varchar(255) NOT NULL,
  `speed` float DEFAULT NULL,
  `proxy_type` varchar(5) DEFAULT NULL,
  PRIMARY KEY (`ip`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

On SEO friendliness

Abnormal pages should not return 200; otherwise crawlers such as Baidu's will treat them as normal pages and index them. Once there are too many duplicate pages, the site's ranking weight drops, because Baidu and others will treat it as malicious competition.

On request cookies

scrapy/downloadermiddlewares/cookies

IP proxies

西刺 (Xici)

tor: http://pkmishra.github.io/blog/2013/03/18/how-to-run-scrapy-with-TOR-and-multiple-browser-agents-part-1-mac/
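A minimal downloader-middleware sketch for rotating proxies; the class name and hard-coded pool are illustrative (a real version would select rows from the proxy_ip table above). Register it in settings.py under DOWNLOADER_MIDDLEWARES.

import random

class RandomProxyMiddleware(object):
    # illustrative pool; in practice, query the proxy_ip table
    PROXIES = ["http://127.0.0.1:8888", "http://127.0.0.1:8889"]

    def process_request(self, request, spider):
        # scrapy routes the download through request.meta["proxy"]
        request.meta["proxy"] = random.choice(self.PROXIES)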

CAPTCHA recognition

  • tesseract-ocr: very low recognition rate
  • online captcha-solving services
  • manual captcha solving

云打码 (YunDaMa)

http://www.yundama.com

  • user account: youxingzhi 81051766

  • developer account

Dynamic page scraping with selenium

phantomjs, a headless browser

Its performance degrades badly when run with multiple processes.
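A minimal sketch with the selenium API of that era (the phantomjs path is illustrative; phantomjs is a separate download):

from selenium import webdriver

browser = webdriver.PhantomJS(executable_path="E:/soft/phantomjs/bin/phantomjs.exe")
browser.get("https://www.baidu.com")
print(browser.page_source)  # the HTML after JavaScript has executed
browser.quit()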

Other dynamic-page scraping technologies

  • pyvirtualdisplay
  • selenium grid
  • splinter

Pause and resume

scrapy crawl lagou -s JOBDIR=job_info/001

If you press ctrl+c now, the spider first drains the requests already sent, then stops and saves its state.

How scrapy deduplicates requests

scrapy->dupefilters
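The gist of the RFPDupeFilter found there: hash each request into a fingerprint and drop any request whose fingerprint was already seen. A condensed sketch (scrapy's real version also canonicalizes the URL; see scrapy.utils.request):

import hashlib

class SimpleDupeFilter(object):
    def __init__(self):
        self.fingerprints = set()

    def request_fingerprint(self, request):
        # sha1 over method + url + body, like scrapy's request_fingerprint
        h = hashlib.sha1()
        h.update(request.method.encode("utf8"))
        h.update(request.url.encode("utf8"))
        h.update(request.body or b"")
        return h.hexdigest()

    def request_seen(self, request):  # the method the scheduler calls
        fp = self.request_fingerprint(request)
        if fp in self.fingerprints:
            return True
        self.fingerprints.add(fp)
        return False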

telnet

telnet localhost 6023

Stats collector

Signals

Extensions

Signals can be used for a bit of aspect-oriented programming, as in the sketch below.
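A minimal extension using scrapy's signals API (modeled on the extension example in the scrapy docs):

from scrapy import signals

class SpiderOpenCloseLogging(object):
    @classmethod
    def from_crawler(cls, crawler):
        ext = cls()
        # hook into the crawl lifecycle without touching the spider's code
        crawler.signals.connect(ext.spider_opened, signal=signals.spider_opened)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_opened(self, spider):
        spider.logger.info("spider %s opened", spider.name)

    def spider_closed(self, spider):
        spider.logger.info("spider %s closed", spider.name)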

core/scheduler

Key function:

enqueue_request

pqclass = load_object(settings['SCHEDULER_PRIORITY_QUEUE'])
dqclass = load_object(settings['SCHEDULER_DISK_QUEUE'])
mqclass = load_object(settings['SCHEDULER_MEMORY_QUEUE'])

In settings/default_settings.py:

SCHEDULER_DISK_QUEUE = 'scrapy.squeues.PickleLifoDiskQueue'
SCHEDULER_MEMORY_QUEUE = 'scrapy.squeues.LifoMemoryQueue'
SCHEDULER_PRIORITY_QUEUE = 'queuelib.PriorityQueue'

# state is saved to disk under JOBDIR
dqok = self._dqpush(request)
...
self.dqs.push(reqd, -request.priority)

How a bloomfilter works

http://blog.csdn.net/lmh12506/article/details/7575651
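The gist as a stdlib-only sketch: k bit positions per key; if all k bits are set the key was probably seen, if any bit is clear it is definitely new. The sizes and the md5-based double hashing are illustrative:

import hashlib

class BloomFilter(object):
    def __init__(self, num_bits=1 << 24, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, key):
        # double hashing: derive k positions from the two halves of an md5
        digest = hashlib.md5(key.encode("utf8")).hexdigest()
        h1, h2 = int(digest[:16], 16), int(digest[16:], 16)
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))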

redis setbit

setbit mykey 7 1 # set the 8th bit from the left to 1

setbit mykey 32 1 # \x00\x00\x00\x00\x80
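The same bits from Python with redis-py (connection details are illustrative):

import redis

r = redis.Redis(host="localhost", port=6379)
r.setbit("mykey", 7, 1)       # 8th bit from the left
r.setbit("mykey", 32, 1)      # value is now b"\x01\x00\x00\x00\x80"
print(r.getbit("mykey", 32))  # 1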

elasticsearch

ELK, the log-analysis stack

Installation and usage

1. Install JDK (>= 8)

2. Install elasticsearch-rtf

github

3. The head plugin (similar to navicat) and the kibana plugin (for operating elasticsearch)

github

  • If head cannot connect, configure elasticsearch at startup:

http.cors.enabled: true
http.cors.allow-origin: "*"
http.cors.allow-methods: OPTIONS,HEAD,GET,POST,PUT,DELETE
http.cors.allow-headers: "X-Requested-With, Content-Type, Content-Length, X-User"

  • The kibana version must match the elasticsearch-rtf version

Concepts

elasticsearch      mysql
index (索引)       database
type (类型)        table
document (文档)    row
field              column

  • "index" can also be used as a verb

Inverted index

TF-IDF

  • case folding
  • stemming: "looking" and "look" should count as one term
  • tokenization (see the toy example after this list)
  • inverted-index files can get very large: use compression encoding
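A toy inverted index to make this concrete (the "analyzer" here is just lowercasing plus whitespace splitting):

from collections import defaultdict

docs = {1: "Python distributed crawler", 2: "Looking for python jobs"}

inverted = defaultdict(set)
for doc_id, text in docs.items():
    for token in text.lower().split():  # case folding + tokenization
        inverted[token].add(doc_id)     # real engines also stem: looking -> look

print(inverted["python"])  # {1, 2}: every document containing the term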

Kibana operations

PUT lagou
{
  "settings": {
    "index": {
      "number_of_shards": 5,
      "number_of_replicas": 1
    }
  }
}

GET lagou/_settings
GET _all/_settings
GET .kibana,lagou/_settings

PUT lagou/_settings
{
  "number_of_replicas": 1
}

GET _all

# index/type/document (the id is auto-generated if not specified)
PUT lagou/job/1
{
  "title": "python分布式爬虫开发",
  "salary_min": 15000,
  "city": "北京",
  "company": {
    "name": "百度",
    "company_addr": "北京市软件园"
  },
  "publish_date": "2017-04-16",
  "comments": 15
}

GET lagou/job/1
GET lagou/job/1?_source=title

POST lagou/job/1/_update
{
  "doc": {
    "comments": 20
  }
}

DELETE lagou/job/1
DELETE lagou/job

# mget
PUT testdb/job1/1
{
  "title": "job1_1"
}
PUT testdb/job1/2
{
  "title": "job1_2"
}
PUT testdb/job2/1
{
  "title": "job2_1"
}
PUT testdb/job2/2
{
  "title": "job2_2"
}

GET _mget
{
  "docs": [
    {"_index": "testdb", "_type": "job1", "_id": "1"},
    {"_index": "testdb", "_type": "job2", "_id": "2"}
  ]
}

GET testdb/_mget
{
  "docs": [
    {"_type": "job1", "_id": "1"},
    {"_type": "job2", "_id": "2"}
  ]
}

GET testdb/job1/_mget
{
  "ids": [1, 2]
}

# bulk: each action and each document must stay on a single line
# all operations go to one node, which forwards them to the other replicas
# according to the meta info, collects their responses, and returns
POST _bulk
{"index": {"_index": "lagou", "_type": "job", "_id": "2"}}
{"title": "python分布式爬虫开发2", "salary_min": 16000, "city": "北京", "company": {"name": "百度", "company_addr": "北京市软件园2"}, "publish_date": "2017-04-17", "comments": 15}
{"index": {"_index": "lagou", "_type": "job", "_id": "3"}}
{"title": "python分布式爬虫开发3", "salary_min": 17000, "city": "北京", "company": {"name": "百度", "company_addr": "北京市软件园3"}, "publish_date": "2017-04-18", "comments": 15}

POST _bulk
{"update": {"_id": "1", "_type": "type1", "_index": "index1"}}
{"doc": {"field2": "value2"}}

Mapping

elasticsearch automatically infers a field's type from the data we index.

Built-in types

  • text
  • keyword (not analyzed; indexed as a single exact term, so a query must match the whole value)
  • numbers: long, integer, short, byte, double, float
  • date
  • boolean
  • binary (not searchable)
  • object
  • nested: like an array of objects
  • geo
  • ip, completion

Built-in attributes

  • store: yes means the field value is stored
  • index: yes means the field is analyzed; default true; for string types
  • null_value: default value for the field
  • analyzer: the analyzer; default standard, can also be whitespace, simple, english
  • include_in_all: set to false if a field should not be searchable via _all
  • format: pattern for date strings; for date fields

example

Once a field's data type is set, it cannot be changed.

PUT lagou
{
  "mappings": {
    "job": {
      "properties": {
        "title": {"type": "text"},
        "salary_min": {"type": "integer"},
        "city": {"type": "keyword"},
        "company": {
          "properties": {
            "name": {"type": "text"},
            "company_addr": {"type": "text"},
            "employee_count": {"type": "integer"}
          }
        },
        "publish_date": {"type": "date", "format": "yyyy-MM-dd"},
        "comments": {"type": "integer"}
      }
    }
  }
}

GET lagou/_mapping

PUT lagou/job/1
{
  "title": "python分布式爬虫开发",
  "salary_min": 15000,
  "city": "北京",
  "company": {
    "name": "百度",
    "company_addr": "北京市软件园",
    "employee_count": 44
  },
  "publish_date": "2017-04-16",
  "comments": 15
}

Queries

  • basic queries
  • compound queries
  • filters

# match
# the query text is analyzed; a document matches if any token matches
GET lagou/job/_search
{
  "query": {
    "match": {
      "title": "python"
    }
  }
}

# term
# the query term is not analyzed
GET lagou/job/_search
{
  "query": {
    "term": {
      "title": "python"
    }
  }
}

# terms
# matches if any value in the array matches
GET lagou/job/_search
{
  "query": {
    "terms": {
      "title": ["python", "爬虫"]
    }
  }
}

# limit the number of results; usable for pagination
GET lagou/job/_search
{
  "query": {
    "match": {
      "title": "python"
    }
  },
  "from": 1,  # start index, 0-based
  "size": 2   # number of results
}

# match_all
GET lagou/job/_search
{
  "query": {
    "match_all": {}
  }
}

# match_phrase: phrase query
GET lagou/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "python系统",  # analyzed; all the tokens must be present
        "slop": 6              # max distance allowed between the tokens
      }
    }
  }
}

# multi_match
GET lagou/job/_search
{
  "query": {
    "multi_match": {
      "query": "python",
      "fields": ["title^3", "desc"]  # match in either title or desc; ^3 boosts title so those hits score higher
    }
  }
}

# return specific fields
GET lagou/job/_search
{
  "stored_fields": ["title", "company_name"],  # fields to return
  "query": {
    "match": {
      "title": "python"
    }
  }
}

# sort
GET lagou/job/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "comments": {
        "order": "desc"
      }
    }
  ]
}

# range
GET lagou/job/_search
{
  "query": {
    "range": {
      "comments": {
        "gte": 10,
        "lte": 20,
        "boost": 2.0  # weight
      }
    }
  }
}

GET lagou/job/_search
{
  "query": {
    "range": {
      "add_time": {
        "gte": "2017-04-01",
        "lte": "now"
      }
    }
  }
}

# wildcard: pattern query
GET lagou/job/_search
{
  "query": {
    "wildcard": {
      "title": {
        "value": "pyth*n",
        "boost": 2.0
      }
    }
  }
}

bool queries

bool: {
  "filter": [],   # filtering; does not contribute to scoring
  "must": [],     # every clause must match
  "should": [],   # at least one clause must match
  "must_not": []  # no clause may match
}

example

# select * from testjob where salary=20
GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "must": {"match_all": {}},
      "filter": {
        "term": {"salary": 20}
      }
    }
  }
}

# select * from testjob where title="Python"
GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "must": {"match_all": {}},
      "filter": {
        "term": {"title": "Python"}  # title is text and was lowercased at index time; change term to match (or lowercase the value)
      }
    }
  }
}

# select * from testjob where (salary=20 or title="python") and (price != 30)
GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"salary": 20}},
        {"term": {"title": "python"}}
      ],
      "must_not": {
        "term": {"price": 30}
      }
    }
  }
}

# nested query
# select * from testjob where title="python" or (title="elasticsearch" and salary=30)
GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "should": [
        {"term": {"title": "python"}},
        {
          "bool": {
            "must": [
              {"term": {"title": "elasticsearch"}},
              {"term": {"salary": 30}}
            ]
          }
        }
      ]
    }
  }
}

# filter on empty / non-empty
# select tags from testjob where tags is not null
GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "filter": {
        "exists": {"field": "tags"}
      }
    }
  }
}

# select tags from testjob where tags is null (documents missing the field also match)
GET lagou/testjob/_search
{
  "query": {
    "bool": {
      "must_not": {
        "exists": {"field": "tags"}
      }
    }
  }
}

# inspect the analyzer's output
GET _analyze
{
  "analyzer": "ik_max_word",
  "text": "Python网络开发工程师"
}

# completion suggester: autocompletion, see the official docs
# https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html

# fuzzy search
GET jobbole/article/_search
{
  "query": {
    "fuzzy": {
      "title": {
        "value": "linux",
        "fuzziness": 2,
        "prefix_length": 0
      }
    }
  }
}

# suggest
GET lagou/testjob/_search
{
  "suggest": {
    "my-suggest": {
      "text": "linux",  # "linu", "linxx", etc. will still find "linux"
      "completion": {
        "field": "suggest",  # a suggest field we generate ourselves
        "fuzzy": {
          "fuzziness": 2
        }
      }
    }
  }
}

# highlight the search keywords
GET jobbole/_search
{
  "query": {
    "multi_match": {
      "query": "王小波的水平",
      "fields": ["tags", "title", "content"]
    }
  },
  "from": 0,
  "size": 10,
  "highlight": {
    "pre_tags": ["<span class='keyWord'>"],
    "post_tags": ["</span>"],
    "fields": {
      "title": {},
      "content": {}
    }
  }
}

Writing scrapy data into ES

elasticsearch-dsl-py

https://elasticsearch-dsl.readthedocs.io/en/latest/persistence.html#doctype

pip install elasticsearch-dsl
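A minimal persistence sketch in the elasticsearch-dsl 5.x style of the linked docs (field choices mirror the lagou examples above; the ik_max_word analyzer assumes the ik plugin is installed):

from elasticsearch_dsl import DocType, Date, Integer, Keyword, Text
from elasticsearch_dsl.connections import connections

# register a default connection shared by all DocTypes
connections.create_connection(hosts=["localhost"])

class LagouJob(DocType):
    title = Text(analyzer="ik_max_word")
    salary_min = Integer()
    city = Keyword()
    publish_date = Date()

    class Meta:
        index = "lagou"

LagouJob.init()  # create the index and mapping in ES

# e.g. from a scrapy pipeline's process_item:
job = LagouJob(meta={"id": 1}, title="python分布式爬虫开发", salary_min=15000)
job.save()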

scrapyd deployment

scrapyd

http://scrapyd.readthedocs.io/en/stable/

workon article_spider
pip install scrapyd

scrapyd-client

pip install scrapyd-client

/Library/Python/2.7/site-packages/scrapyd-client/scrapyd-deploy

/usr/local/bin/scrapyd-deploy

  • Windows workaround: create scrapyd-deploy.bat

@echo off
"e:\**\python.exe" "e:\**\scrapyd-deploy" %1 %2 %3 %4 %5 %6 %7 %8 %9

scrapy.cfg

[deploy:bobby]
url = http://localhost:6800/
project = ArticleSpider

scrapyd-deploy bobby -p ArticleSpider

This packages the local project into an egg and uploads it to the scrapyd server.

Running the spider

Via the API:

http://scrapyd.readthedocs.io/en/stable/api.html#schedule-json
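e.g. scheduling a run with curl (project and spider names follow the deploy example above):

curl http://localhost:6800/schedule.json -d project=ArticleSpider -d spider=jobbole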

Errors

TypeError: __init__() got an unexpected keyword argument '_job'

spiders.jobbole

def __init__(self):
    self.fail_urls = []
    dispatcher.connect(self.handle_spider_closed, signals.spider_closed)

change to

def __init__(self, **kwargs):  # accept the extra _job kwarg scrapyd passes in
    self.fail_urls = []
    dispatcher.connect(self.handle_spider_closed, signals.spider_closed)
