Baidu AI Search Engine

Notes from the python123 course: https://python123.io/index/tutorials/web_crawler_intro

The robots protocol

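Unlike other crawlers, a full-site crawler aims to fetch every page of a website. A crawler requests pages hundreds of times faster than a person browsing by hand, which puts heavy load on the web server and can easily bring the site down. To avoid this lose-lose outcome, there is a convention: if a site owner does not want crawlers to visit certain pages, they list those pages in a robots.txt file in the agreed format, and crawlers should voluntarily stay away from them. Beyond that, crawler authors should also take it upon themselves to limit how fast their crawler sends requests.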
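Limiting the request rate can be sketched as a small helper that enforces a minimum gap between consecutive requests. This is only an illustration: the class name Throttle, its parameters and the one-second interval mentioned below are assumptions made for this example, not part of the course material.

```python
import time


class Throttle:
    """Enforce a minimum interval (in seconds) between consecutive requests.

    Hypothetical helper for illustration; clock and sleep are injectable
    so the behaviour can be checked without really waiting.
    """

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, None before the first

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would create one instance, for example Throttle(1.0), and call its wait() method right before each page request. The right interval depends on the site; one second is just a placeholder.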

A site's robots file is found at the site's domain followed by '/robots.txt'.

#  https://ai.baidu.com/robots.txt
User-agent: *
Disallow: /product/

Handling the robots protocol

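Python's standard library package urllib (through its urllib.robotparser module) can parse a robots.txt file for you and tell you whether a specific page may be crawled.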

import urllib.robotparser

url = 'https://ai.baidu.com'
rp = urllib.robotparser.RobotFileParser()
rp.set_url(url + '/robots.txt')
rp.read()
info = rp.can_fetch("*", 'https://ai.baidu.com/product/minos')
print(info)

Use code to determine whether each of the following two links may be crawled:

https://ai.baidu.com/product/pingo (crawling not allowed)

https://ai.baidu.com/tech/speech/asr (crawling allowed)
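One way to check these two answers without touching the network is to feed the rules quoted earlier straight into the parser with RobotFileParser.parse(). This sketch assumes the live robots.txt still contains exactly those two lines; the variable names are made up for the example.

```python
import urllib.robotparser

# The rules quoted earlier from https://ai.baidu.com/robots.txt,
# supplied as text so that nothing is downloaded.
rules = [
    "User-agent: *",
    "Disallow: /product/",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

candidates = [
    "https://ai.baidu.com/product/pingo",
    "https://ai.baidu.com/tech/speech/asr",
]

# Map each URL to True (may be crawled) or False (disallowed)
results = {url: rp.can_fetch("*", url) for url in candidates}
for url, allowed in results.items():
    print(url, "allowed" if allowed else "not allowed")
```

The first URL sits under /product/, which the rule disallows for every user agent, so only the second one comes back as allowed.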

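Basic architecture of a full-site crawler

Almost every full-site crawler can be reduced to the following logic:

The crawler starts from a single URL, usually the site's home page, extracts the links from each page it fetches, removes duplicates, and adds them to the list of pages to visit. It repeats this until every page on the site has been visited.

Note that a full-site crawler normally follows only the site's internal links.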
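The loop described above can be sketched independently of any HTTP library. In this illustration the function crawl, its get_links parameter and the toy site dictionary are all hypothetical; get_links stands in for "fetch the page and return its internal links".

```python
from collections import deque


def crawl(start, get_links):
    """Visit every page reachable from start exactly once.

    get_links(url) must return the internal links found on that page.
    Returns the pages in the order they were visited (breadth-first).
    """
    visited = set()
    queue = deque([start])
    order = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # already handled via another path
        visited.add(url)
        order.append(url)
        for link in get_links(url):
            if link not in visited:
                queue.append(link)
    return order


# A toy "website": each page maps to the links it contains
site = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/"],
    "/b": ["/c"],
    "/c": [],
}
pages = crawl("/", lambda url: site[url])
print(pages)
```

Using a set for visited pages keeps the duplicate check cheap, and taking pages from the front of the queue gives a breadth-first order. Taking them from the end instead would give a depth-first crawl that visits the same pages in a different order.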

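Extracting links from a page

In the link-extraction step described above, every link on the page has to be pulled out. Because we are using the requests_html library, this is very easy: r.html.links is all it takes.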

from requests_html import HTMLSession
session = HTMLSession()
origin = 'https://ai.baidu.com'
r = session.get(origin)
print(r.html.links)

Output:
{'/tech/bicc/rts', 'http://di.baidu.com/product/shangqing', '/solution/censoring', '/tech/ocr_cards/vehicle_license', '/tech/imageprocess/stretch_restore', '/customer/baobao', 'https://app.baidu.com', '/tech/imagesearch', '/tech/nlp_basic/dependency_parsing', 'http://ar.baidu.com/dumixar', '/tech/imagerecognition/currency', '/tech/imageprocess/image_quality_enhance', 'https://aim.baidu.com/', '/tech/ocr_cards/passport', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/antiporn/overview/index', '/customer/samsung', 'http://www.baidu.com/duty', '/customer', '/solution/faceprint', '/solution/roboticvision', '/broad/subordinate?dataset=sceneparsing', '/tech/ocr_cars/driving_license', '/tech/video/vcs', 'http://di.baidu.com/product/mtj', '/tech/imagerecognition', '/easydl/retail', '/broad/subordinate?dataset=gon', '/customer/liaoningshiyan', '/broad/subordinate?dataset=canine', '/customer/nfdw', '/tech/ocr_cars/vehicle_license', '/partner/course', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/body/overview/index', '/partner/cella', '/tech/speech/lsr', 'http://zhongbao.baidu.com', 'http://di.baidu.com/product/stj', '/tech/speech', '/solution/aifenzhen', '/tech/face', '/broad/subordinate?dataset=saoke', '/customer/panda', '/partner/apply', '/docs#/FAQ/30a9eac7', 'http://di.baidu.com/product/perswc', '/broad/subordinate?dataset=sked', '/tech/vehicle/detect', 'https://nlp.baidu.com/homepage/nlptools', '/tech/hardware/offlinesdk', '/tech/kg/bgraph', '/solution/kgaas', '/solution/cabinet', '/tech/intelligentwriting#couplet', '/edgecloud/web/capture/int#/devicelist', 'https://ai.baidu.com/forum/topic/show/943331', '/broad/subordinate?dataset=traffic', '/broad/subordinate?dataset=video', '/tech/imagerecognition/fine_grained', '/tech/bicc/atw', '/customer/wps', '/partner/hardware', 'http://di.baidu.com/product/habo', '/tech/vehicle/damage', '/tech/unit', '/tech/ocr_cars', '/broad/subordinate?dataset=amd', '/tech/ocr/business', '/tech/video/vca', '/tech/nlp_basic/simnet', 
'/solution/facesignIn', 'http://di.baidu.com/product/insurancerisk', '/easydl', '/industry/retail', '/solution/cashier', '/support/news', 'https://console.bce.baidu.com/billing/?fromai=1#/account/index', '/tech/video/vcr', '/tech/face/compare', '/accelerator', '/customer/18183', '/tech/imagerecognition/general', 'https://console.bce.baidu.com/?fromai=1#/aip/overview', '/tech/nlp_apply/news_summary', '/support/news?action=detail&id=996', 'https://cloud.baidu.com/calculator.html#/ocr/price', '/tech/body/pose', '/tech/imagesearch/similar', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#5', '/customer/dida', 'http://di.baidu.com/product/zhizhou', '/broad/subordinate?dataset=dureader', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#4', '/customer/fang', '/partner/case', '/sdk', '/paddlepaddle', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/nlp/overview/index', '/tech/ocr/plate', '/tech/imageprocess/dehaze', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#2', '/customer/hualife', '/industry/enterpriseservice', '/tech/ocr_cards/driving_license', '/customer/iqiyi', '/customer/zhiban', '/tech/ocr_cards', '/tech/nlp_basic/lexical', '/tech/nlp_apply/emotion_detection', '/tech/imageprocess/contrast_enhance', '/tech/body/driver', '/tech/nlp_apply/topictagger', '/easydl/sound', '/tech/speech/asr', '/tech/intelligentwriting#article', 'https://developer.baidu.com', '/tech/body', '/solution/faceidentify', '/customer/biguiyuan', '/', '/unit/home', 'http://di.baidu.com/product/zhizhu', '/tech/vr', '/tech/nlp_apply', '/tech/intelligentwriting#poem', '/tech/vehicle/flow', '/broad/subordinate?dataset=pose', '/tech/imagerecognition/ingredient', '/tech/kg/wenda', '/solution/luban', '/partner/seeyon', '/tech/ocr/general', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#1', '/support/news?action=detail&id=1060', 'https://cloud.baidu.com/product/abc-robot.html', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/imagerecognition/overview/index', '/customer/bbsp', 
'http://di.baidu.com/product/xuanke', '/tech/nlp_basic/word_embedding', '/tech/cognitive/hanyu', 'http://di.baidu.com/product/keqing', '/partner/data', '/industry/government', '/broad', 'https://ai.baidu.com/easydl/retail?hmsr=aibanner&hmpl=easydl-retail', 'https://bit.baidu.com', '/partner/solution', '/docs#/Begin/top', '/tech/ocr/table', 'http://fanyi-api.baidu.com/api/trans/product/index', '/tech/face/developmentkit', '/tech/imagerecognition/redwine', '/customer/zhongtong', '/tech/ocr_receipts/receipt', '/partner', '/tech/cognitive/entity_annotation', '/partner/roobo', '/solution/class', '/solution/factory', '/tech/bicc', '/tech/ocr_receipts', '/docs', '/tech/ocr_cards/idcard', 'http://di.baidu.com/ecology/dianshi', '/tech/imagecensoring', '/tech/speech/fsr', '/tech/vehicle', '/docs#/Begin', '/solution/private', '/tech/imageprocess/colourize', '/tech/nlp_basic/dnnlm_cn', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/overview/index', 'http://ar.baidu.com/content#/', 'https://ai.baidu.com/tech/vehicle/damage?hmsr=aibanner&hmpl=damage', '/facekit/home', '/tech/ocr/webimage', 'http://di.baidu.com/product/huike', 'http://di.baidu.com/product/calc', '/tech/vehicle/attr', '/customer/liuzhouyuanchuang', '/partner/kangxing', '/tech/speech/asrpro', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/ar/overview/index', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/kg/overview/index', '/customer/benchi', '/tech/ocr_cards/business_card', 'http://ai.baidu.com/tech/vr', '/forum', '/tech/face/faceliveness', '/customer/hzhuihe', '/customer/lagou', '/docs#/FAQ/61e783cc', '/tech/textcensoring', '/industry/factory', '/partner/yunjingwang', 'https://aim.baidu.com/product/b226a947-4660-4e27-83b4-877bf63b8627?hmsr=aibanner&hmpl=speechkit', '/support/news?action=detail&id=435', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/ocr/overview/index', '/easyedge/home', '/tech/hardware/certifiedpd', '/easydl/text', 
'https://ai.baidu.com/support/news?action=detail&id=1021&hmsr=aibanner&hmpl=KDD', '/industry/estate', '/tech/ocr_receipts/train_ticket', '/tech/ocr_others', '/tech/nlp_apply/comment_tag', '/customer/hangzhouyintai', '/tech/cognitive', '/tech/imageprocess', '/tech/body/num', '/tech/imagerecognition/dish', '/tech/ocr_receipts/taxi_receipt', '/tech/body/gesture', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/bicc/overview/index', '/customer/yy', '/tech/kg/schema', '/broad/subordinate?dataset=kg', 'https://ai.baidu.com/support/news?action=detail&id=1031&hmsr=aibanner&hmpl=customer201904', '/tech/nlp_basic/word_emb_sim', '/tech/intelligentwriting#hotspot', '/tech/video/vcc', '/easydl/image', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#3', 'http://www.paddlepaddle.org', 'http://di.baidu.com/#products', '/docs#/FAQ', 'https://github.com/Baidu-AIP', '/docs#/FAQ/54966861', 'https://console.bce.baidu.com/ai/?locale=zh-cn#/ai/easydl/overview/index', '/tech/smartasr', 'http://di.baidu.com/product/dayu', '/tech/face/detect', '/solution/ivs', '/tech/hardware/chip', 'http://di.baidu.com/product/tongji', '/tech/nlp_apply/text_corrector', 'http://di.baidu.com/product/palo', 'http://aistudio.baidu.com', '/tech/imagesearch/same', '/solution/facegate', '/tech/hardware/certification', '/solution/itma', '/support/video', '/tech/ocr', 'https://dueros.baidu.com/', '/tech/ocr/iocr', '/tech/face/search', 'http://di.baidu.com/product/jarvis', '/tech/face/merge', '/tech/body/seg', 'http://di.baidu.com/product/entwc', 'http://ticket.bce.baidu.com/#/ticket/create', '/industry/hardware', 'http://di.baidu.com/product/yuqing', '/tech/speech/tts', '/tech/ocr_receipts/vat_invoice', '/tech/vehicle/car', '/tech/imagesearch/product', 'http://di.baidu.com/product/opinion', '/customer/yongyou', '/tech/edgecloud/capture', 'https://www.huodongxing.com/event/9489422863311', '/tech/nlp_apply/doctagger', '/industry/information', '/tech/hardware/deepkit', 'http://lbsyun.baidu.com', 
'http://di.baidu.com/product/insurancefraud', '/docs#/FAQ/d1a6a706', 'http://di.baidu.com/product/mike', 'https://cloud.baidu.com/?t=cp:online-media%7Cci:%7Ccn:ai', 'http://di.baidu.com/product/pingo', '/solution/faceattendance', '/docs#/AI-service-agreement/top', '/tech/speech/wake', '/tech/hardware/speechkit', '/customer/xiecheng', 'http://apollo.auto', '/solution/bmvs', 'https://console.bce.baidu.com/iam/?fromai=1#/iam/baseinfo', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/face/overview/index', '/tech/intelligentwriting#relation', 'http://di.baidu.com/product/sinan', '/industry/education', 'http://ar.baidu.com/home', 'https://ai.baidu.com/support/news?action=detail&id=1060&hmsr=aibanner&hmpl=201904pandian', 'http://di.baidu.com/product/tuijian', '/partner/Baichebao', '/customer/xspatium', '/tech/body/attr', '/tech/face/collect', '/customer/mitu', '/tech/intelligentwriting', '/tech/nlp_basic', '/tech/nlp_apply/sentiment_classify', 'http://di.baidu.com/product/minos', '/tech/kg/zuowen', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/imagesearch/overview/index', 'https://ai.baidu.com/support/news?action=detail&id=1055&hmsr=aibanner&hmpl=paddlecamp', '/tech/face/offline-sdk', '/customer/huzhong', '/tech/ocr_cards/bankcard', 'http://di.baidu.com/product/bes', '/solution/mtp'}
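To make clear what link extraction amounts to, here is a rough standard-library equivalent built on html.parser. It is not how requests_html implements r.html.links; the class LinkCollector, the function extract_links and the sample HTML are invented for this sketch.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag into a set."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.add(value)


def extract_links(html_text):
    """Return the set of href values found in html_text."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


sample = (
    '<a href="/tech/speech">Speech</a> '
    '<a href="/tech/speech">Speech again</a> '
    '<a href="https://cloud.baidu.com/">Cloud</a> '
    '<a>no href</a>'
)
links = extract_links(sample)
print(links)
```

Because the result is a set, a link that appears several times on a page is reported once, which matches the curly-brace output shown above.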

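Filtering links

Once the links have been extracted, the next step is to filter out the ones that should not be crawled.

Use the urllib library to filter out all non-internal links: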

from urllib.parse import urlparse
from requests_html import HTMLSession

session = HTMLSession()
origin = 'https://ai.baidu.com'
r = session.get(origin)
print(r.html.links)

domain = 'ai.baidu.com'

def is_inner_link(link):
    # A link is internal if it is relative (no host part) or its host is our domain
    netloc = urlparse(link).netloc
    return (not netloc) or (netloc == domain)

for link in r.html.links:
    print(is_inner_link(link), link)

Output (the set of links printed first is identical to the one shown earlier and is omitted here):
True /tech/bicc/rts
False http://di.baidu.com/product/shangqing
True /solution/censoring
True /tech/ocr_cards/vehicle_license
True /tech/imageprocess/stretch_restore
True /customer/baobao
False https://app.baidu.com
True /tech/imagesearch
True /tech/nlp_basic/dependency_parsing
False http://ar.baidu.com/dumixar
True /tech/imagerecognition/currency
True /tech/imageprocess/image_quality_enhance
False https://aim.baidu.com/
True /tech/ocr_cards/passport
False https://console.bce.baidu.com/ai/?fromai=1#/ai/antiporn/overview/index
True /customer/samsung
False http://www.baidu.com/duty
True /customer
True /solution/faceprint
True /solution/roboticvision
True /broad/subordinate?dataset=sceneparsing
True /tech/ocr_cars/driving_license
True /tech/video/vcs
False http://di.baidu.com/product/mtj
True /tech/imagerecognition
True /easydl/retail
True /broad/subordinate?dataset=gon
True /customer/liaoningshiyan
True /broad/subordinate?dataset=canine
True /customer/nfdw
True /tech/ocr_cars/vehicle_license
True /partner/course
False https://console.bce.baidu.com/ai/?fromai=1#/ai/body/overview/index
True /partner/cella
True /tech/speech/lsr
False http://zhongbao.baidu.com
False http://di.baidu.com/product/stj
True /tech/speech
True /solution/aifenzhen
True /tech/face
True /broad/subordinate?dataset=saoke
True /customer/panda
True /partner/apply
True /docs#/FAQ/30a9eac7
False http://di.baidu.com/product/perswc
True /broad/subordinate?dataset=sked
True /tech/vehicle/detect
False https://nlp.baidu.com/homepage/nlptools
True /tech/hardware/offlinesdk
True /tech/kg/bgraph
True /solution/kgaas
True /solution/cabinet
True /tech/intelligentwriting#couplet
True /edgecloud/web/capture/int#/devicelist
True https://ai.baidu.com/forum/topic/show/943331
True /broad/subordinate?dataset=traffic
True /broad/subordinate?dataset=video
True /tech/imagerecognition/fine_grained
True /tech/bicc/atw
True /customer/wps
True /partner/hardware
False http://di.baidu.com/product/habo
True /tech/vehicle/damage
True /tech/unit
True /tech/ocr_cars
True /broad/subordinate?dataset=amd
True /tech/ocr/business
True /tech/video/vca
True /tech/nlp_basic/simnet
True /solution/facesignIn
False http://di.baidu.com/product/insurancerisk
True /easydl
True /industry/retail
True /solution/cashier
True /support/news
False https://console.bce.baidu.com/billing/?fromai=1#/account/index
True /tech/video/vcr
True /tech/face/compare
True /accelerator
True /customer/18183
True /tech/imagerecognition/general
False https://console.bce.baidu.com/?fromai=1#/aip/overview
True /tech/nlp_apply/news_summary
True /support/news?action=detail&id=996
False https://cloud.baidu.com/calculator.html#/ocr/price
True /tech/body/pose
True /tech/imagesearch/similar
False http://fanyi-api.baidu.com/api/trans/product/prodinfo#5
True /customer/dida
False http://di.baidu.com/product/zhizhou
True /broad/subordinate?dataset=dureader
False http://fanyi-api.baidu.com/api/trans/product/prodinfo#4
True /customer/fang
True /partner/case
True /sdk
True /paddlepaddle
False https://console.bce.baidu.com/ai/?fromai=1#/ai/nlp/overview/index
True /tech/ocr/plate
True /tech/imageprocess/dehaze
False http://fanyi-api.baidu.com/api/trans/product/prodinfo#2
True /customer/hualife
True /industry/enterpriseservice
True /tech/ocr_cards/driving_license
True /customer/iqiyi
True /customer/zhiban
True /tech/ocr_cards
True /tech/nlp_basic/lexical
True /tech/nlp_apply/emotion_detection
True /tech/imageprocess/contrast_enhance
True /tech/body/driver
True /tech/nlp_apply/topictagger
True /easydl/sound
True /tech/speech/asr
True /tech/intelligentwriting#article
False https://developer.baidu.com
True /tech/body
True /solution/faceidentify
True /customer/biguiyuan
True /
True /unit/home
False http://di.baidu.com/product/zhizhu
True /tech/vr
True /tech/nlp_apply
True /tech/intelligentwriting#poem
True /tech/vehicle/flow
True /broad/subordinate?dataset=pose
True /tech/imagerecognition/ingredient
True /tech/kg/wenda
True /solution/luban
True /partner/seeyon
True /tech/ocr/general
False http://fanyi-api.baidu.com/api/trans/product/prodinfo#1
True /support/news?action=detail&id=1060
False https://cloud.baidu.com/product/abc-robot.html
False https://console.bce.baidu.com/ai/?fromai=1#/ai/imagerecognition/overview/index
True /customer/bbsp
False http://di.baidu.com/product/xuanke
True /tech/nlp_basic/word_embedding
True /tech/cognitive/hanyu
False http://di.baidu.com/product/keqing
True /partner/data
True /industry/government
True /broad
True https://ai.baidu.com/easydl/retail?hmsr=aibanner&hmpl=easydl-retail
False https://bit.baidu.com
True /partner/solution
True /docs#/Begin/top
True /tech/ocr/table
False http://fanyi-api.baidu.com/api/trans/product/index
True /tech/face/developmentkit
True /tech/imagerecognition/redwine
True /customer/zhongtong
True /tech/ocr_receipts/receipt
True /partner
True /tech/cognitive/entity_annotation
True /partner/roobo
True /solution/class
True /solution/factory
True /tech/bicc
True /tech/ocr_receipts
True /docs
True /tech/ocr_cards/idcard
False http://di.baidu.com/ecology/dianshi
True /tech/imagecensoring
True /tech/speech/fsr
True /tech/vehicle
True /docs#/Begin
True /solution/private
True /tech/imageprocess/colourize
True /tech/nlp_basic/dnnlm_cn
False https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/overview/index
False http://ar.baidu.com/content#/
True https://ai.baidu.com/tech/vehicle/damage?hmsr=aibanner&hmpl=damage
True /facekit/home
True /tech/ocr/webimage
False http://di.baidu.com/product/huike
False http://di.baidu.com/product/calc
True /tech/vehicle/attr
True /customer/liuzhouyuanchuang
True /partner/kangxing
True /tech/speech/asrpro
False http://fanyi-api.baidu.com/api/trans/product/prodinfo
False https://console.bce.baidu.com/ai/?fromai=1#/ai/ar/overview/index
False https://console.bce.baidu.com/ai/?fromai=1#/ai/kg/overview/index
True /customer/benchi
True /tech/ocr_cards/business_card
True http://ai.baidu.com/tech/vr
True /forum
True /tech/face/faceliveness
True /customer/hzhuihe
True /customer/lagou
True /docs#/FAQ/61e783cc
True /tech/textcensoring
True /industry/factory
True /partner/yunjingwang
False https://aim.baidu.com/product/b226a947-4660-4e27-83b4-877bf63b8627?hmsr=aibanner&hmpl=speechkit
True /support/news?action=detail&id=435
False https://console.bce.baidu.com/ai/?fromai=1#/ai/ocr/overview/index
True /easyedge/home
True /tech/hardware/certifiedpd
True /easydl/text
True https://ai.baidu.com/support/news?action=detail&id=1021&hmsr=aibanner&hmpl=KDD
True /industry/estate
True /tech/ocr_receipts/train_ticket
True /tech/ocr_others
True /tech/nlp_apply/comment_tag
True /customer/hangzhouyintai
True /tech/cognitive
True /tech/imageprocess
True /tech/body/num
True /tech/imagerecognition/dish
True /tech/ocr_receipts/taxi_receipt
True /tech/body/gesture
False https://console.bce.baidu.com/ai/?fromai=1#/ai/bicc/overview/index
True /customer/yy
True /tech/kg/schema
True /broad/subordinate?dataset=kg
True https://ai.baidu.com/support/news?action=detail&id=1031&hmsr=aibanner&hmpl=customer201904
True /tech/nlp_basic/word_emb_sim
True /tech/intelligentwriting#hotspot
True /tech/video/vcc
True /easydl/image
False http://fanyi-api.baidu.com/api/trans/product/prodinfo#3
False http://www.paddlepaddle.org
False http://di.baidu.com/#products
True /docs#/FAQ
False https://github.com/Baidu-AIP
True /docs#/FAQ/54966861
False https://console.bce.baidu.com/ai/?locale=zh-cn#/ai/easydl/overview/index
True /tech/smartasr
False http://di.baidu.com/product/dayu
True /tech/face/detect
True /solution/ivs
True /tech/hardware/chip
False http://di.baidu.com/product/tongji
True /tech/nlp_apply/text_corrector
False http://di.baidu.com/product/palo
False http://aistudio.baidu.com
True /tech/imagesearch/same
True /solution/facegate
True /tech/hardware/certification
True /solution/itma
True /support/video
True /tech/ocr
False https://dueros.baidu.com/
True /tech/ocr/iocr
True /tech/face/search
False http://di.baidu.com/product/jarvis
True /tech/face/merge
True /tech/body/seg
False http://di.baidu.com/product/entwc
False http://ticket.bce.baidu.com/#/ticket/create
True /industry/hardware
False http://di.baidu.com/product/yuqing
True /tech/speech/tts
True /tech/ocr_receipts/vat_invoice
True /tech/vehicle/car
True /tech/imagesearch/product
False http://di.baidu.com/product/opinion
True /customer/yongyou
True /tech/edgecloud/capture
False https://www.huodongxing.com/event/9489422863311
True /tech/nlp_apply/doctagger
True /industry/information
True /tech/hardware/deepkit
False http://lbsyun.baidu.com
False http://di.baidu.com/product/insurancefraud
True /docs#/FAQ/d1a6a706
False http://di.baidu.com/product/mike
False https://cloud.baidu.com/?t=cp:online-media%7Cci:%7Ccn:ai
False http://di.baidu.com/product/pingo
True /solution/faceattendance
True /docs#/AI-service-agreement/top
True /tech/speech/wake
True /tech/hardware/speechkit
True /customer/xiecheng
False http://apollo.auto
True /solution/bmvs
False https://console.bce.baidu.com/iam/?fromai=1#/iam/baseinfo
False https://console.bce.baidu.com/ai/?fromai=1#/ai/face/overview/index
True /tech/intelligentwriting#relation
False http://di.baidu.com/product/sinan
True /industry/education
False http://ar.baidu.com/home
True https://ai.baidu.com/support/news?action=detail&id=1060&hmsr=aibanner&hmpl=201904pandian
False http://di.baidu.com/product/tuijian
True /partner/Baichebao
True /customer/xspatium
True /tech/body/attr
True /tech/face/collect
True /customer/mitu
True /tech/intelligentwriting
True /tech/nlp_basic
True /tech/nlp_apply/sentiment_classify
False http://di.baidu.com/product/minos
True /tech/kg/zuowen
False https://console.bce.baidu.com/ai/?fromai=1#/ai/imagesearch/overview/index
True https://ai.baidu.com/support/news?action=detail&id=1055&hmsr=aibanner&hmpl=paddlecamp
True /tech/face/offline-sdk
True /customer/huzhong
True /tech/ocr_cards/bankcard
False http://di.baidu.com/product/bes
True /solution/mtp
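The list above contains the same page under several spellings: as a relative path, as an absolute URL, and with different #fragment suffixes (for example /tech/intelligentwriting#poem and /tech/intelligentwriting#couplet). One possible way to reduce these to a single form before checking for duplicates is urljoin plus urldefrag from urllib.parse. The function name normalize and its default base are assumptions for this sketch.

```python
from urllib.parse import urljoin, urldefrag


def normalize(link, base="https://ai.baidu.com"):
    """Return link as an absolute URL with any #fragment removed."""
    absolute = urljoin(base, link)   # resolves relative paths against base
    return urldefrag(absolute).url   # drops the part after '#'


for raw in ("/tech/intelligentwriting#poem",
            "/tech/intelligentwriting#couplet",
            "https://ai.baidu.com/forum/topic/show/943331"):
    print(raw, "->", normalize(raw))
```

This does not unify http:// and https:// variants of the same address, and on pages that use the fragment for client-side routing (such as /docs#/FAQ) dropping it means those routes are treated as one page. Whether that is acceptable depends on what you want to index.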

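Besides non-internal links, you also need to filter out links that have already been visited, links the robots protocol disallows, and links you simply do not want to visit.

To keep the exercise simple and get results faster, assume we are not interested in (and will not index) pages under '/docs', '/support', '/forum', '/broad', '/paddlepaddle', '/market', '/download', '/facekit', '/sdk', '/customer' or '/easydl', and care only about the site's main product information. The crawler implementation in the next section shows how this is done.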
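The exclusion rule just described can be written as a small predicate. This is a sketch rather than the course's own code: the names EXCLUDED and is_excluded are made up, and the check is deliberately a little stricter than a plain startswith, so that a prefix only matches whole path segments.

```python
from urllib.parse import urlparse

# Sections we choose not to index (taken from the list above)
EXCLUDED = ('/docs', '/support', '/forum', '/broad', '/paddlepaddle',
            '/market', '/download', '/facekit', '/sdk', '/customer', '/easydl')


def is_excluded(link, prefixes=EXCLUDED):
    """Return True if the link's path is one of the excluded sections
    or lies below one of them."""
    path = urlparse(link).path
    return any(path == p or path.startswith(p + '/') for p in prefixes)


for link in ('/customer/baobao', '/docs#/FAQ', '/tech/ocr',
             'https://ai.baidu.com/support/news?action=detail&id=996'):
    print(is_excluded(link), link)
```

With a plain path.startswith(prefixes) test, a hypothetical path such as /sdkdemo would also be dropped because it begins with /sdk. The segment-aware comparison avoids that, at the cost of a slightly longer expression.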

Baidu AI crawler implementation

from requests_html import HTMLSession
import urllib.robotparser
from urllib.parse import urlparse

session = HTMLSession()
origin = 'https://ai.baidu.com'
domain = urlparse(origin).netloc

def is_inner_link(link):
    netloc = urlparse(link).netloc
    return (not netloc) or (netloc == domain)

visited = []  # links already visited
unvisited = [origin]  # links waiting to be visited

# Parse the robots protocol
rp = urllib.robotparser.RobotFileParser()
rp.set_url(origin + '/robots.txt')
rp.read()

def add_unvisited(link):
    # Filter 1: is the link allowed by the robots protocol?
    allow = rp.can_fetch('*', link)
    if not allow:
        return

    # Filter 2: is it an internal link?
    if not is_inner_link(link):
        return

    # Filter 3: drop malformed links
    path = urlparse(link).path
    if not path.startswith('/'):
        return

    # Filter 4: custom exclusions
    if path.startswith(('/file', '/docs', '/support', '/forum', '/broad', '/paddlepaddle', '/market', '/download', '/facekit', '/sdk', '/customer', '/easydl', '//')):
        return

    # Turn /tech/123 into https://ai.baidu.com/tech/123
    if link.startswith('/'):
        link = origin + link

    # Filter 5: skip links already visited or already queued
    if (link in visited) or (link in unvisited):
        return

    unvisited.append(link)

while unvisited:
    link = unvisited.pop()
    r = session.get(link)
    visited.append(link)
    if r.html and r.html.links:
        for url in r.html.links:
            add_unvisited(url)

    # first=True returns None instead of raising IndexError when a page has no <title>
    title = r.html.find('head title', first=True)
    if title is not None:
        print(title.text, link)

print('Crawled {} links'.format(len(visited)))

Output:
……
百度AI开放平台-全球领先的人工智能服务平台-百度AI开放平台 https://ai.baidu.com/tech/imageprocess
图像搜索_以图搜图技术-百度AI开放平台 https://ai.baidu.com/tech/imagesearch
长语音识别-语音识别-百度AI-百度AI开放平台 https://ai.baidu.com/tech/speech/lsr
内容审核解决方案,图+文+音视频审核0压力-百度AI开放平台 https://ai.baidu.com/solution/censoring
……
Crawled 145 links

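In earlier chapters you learned how to store structured data by writing results to a csv file. Now store the crawler's results in data.csv, one page per row, in the format "page title, internal URL", as shown below.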

File name: data.csv
-------------------------------
语音搜索解决方案-百度AI开放平台,https://ai.baidu.com/solution/bmvs
长语音识别-语音识别-百度AI-百度AI开放平台,https://ai.baidu.com/tech/speech/lsr
相似图片搜索_图像搜索技术-百度AI开放平台,https://ai.baidu.com/tech/imagesearch/similar
"人像分割技术,精准识别人体轮廓,一键抠像-百度AI开放平台",https://ai.baidu.com/tech/body/seg
AI+智能货柜解决方案-百度AI开放平台,https://ai.baidu.com/solution/cabinet
百度AI开放平台-全球领先的人工智能服务平台-百度AI开放平台,https://ai.baidu.com/tech/kg/wenda
短文本相似度-NLP-百度AI-百度AI开放平台,https://ai.baidu.com/tech/nlp/simnet
视频比对检索-百度AI开放平台,https://ai.baidu.com/tech/video/vcc
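One way to produce rows in this shape is the csv module's writer, which adds the surrounding quotes automatically when a title contains a comma (as in the fourth row above). The helper name write_rows and the two sample rows are chosen for illustration; an in-memory buffer stands in for the real file.

```python
import csv
import io


def write_rows(rows, file):
    """Write (title, url) pairs to an open text file as CSV rows."""
    writer = csv.writer(file)
    for title, url in rows:
        writer.writerow([title, url])


rows = [
    ("语音搜索解决方案-百度AI开放平台", "https://ai.baidu.com/solution/bmvs"),
    ("人像分割技术,精准识别人体轮廓,一键抠像-百度AI开放平台", "https://ai.baidu.com/tech/body/seg"),
]

buffer = io.StringIO()  # stands in for the real data.csv
write_rows(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write the actual file, the same function can be called with a handle from open('data.csv', 'w', newline='', encoding='utf-8'). The newline='' argument is what the csv documentation recommends, and the UTF-8 encoding is an assumption that should match whatever later reads the file.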

You can read data.csv with the code below and add a lookup feature.

import csv

page_url = []
page_title = []
# newline='' is recommended by the csv module; utf-8 assumes data.csv was saved in that encoding
with open('data.csv', 'r', newline='', encoding='utf-8') as file:
    for info in csv.reader(file):
        page_title.append(info[0])
        page_url.append(info[1])

while True:
    keyword = input('Enter a search keyword (quit to exit): ')
    if keyword == 'quit':
        break
    # Print every page whose title contains the keyword
    for title, url in zip(page_title, page_url):
        if keyword in title:
            print(url, title)

Output:
Enter a search keyword (quit to exit): 图像
https://ai.baidu.com/tech/imagesearch 图像搜索_以图搜图技术-百度AI开放平台
https://ai.baidu.com/tech/imagecensoring 图像审核-百度AI开放平台
https://ai.baidu.com/tech/imagesearch/same 相同图片搜索_图像搜索技术-百度AI开放平台
https://ai.baidu.com/tech/imagerecognition 图像识别-百度AI开放平台
https://ai.baidu.com/customize/common/welcome EasyDL定制化图像识别 - 百度AI开放平台
https://ai.baidu.com/tech/imagesearch/product 商品图片搜索_图像搜索技术-百度AI开放平台
https://ai.baidu.com/tech/hardware/offlinesdk 离线SDK下载_人脸,文字,人体,图像,语音识别-百度AI开放平台
https://ai.baidu.com/tech/imagesearch/similar 相似图片搜索_图像搜索技术-百度AI开放平台
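If the lookup is needed outside an interactive prompt, it can be factored into a plain function. The sketch below is one possible shape, with a hypothetical name (search) and a case-insensitive match added as an assumption so that a query like easydl also finds EasyDL.

```python
def search(rows, keyword):
    """Return (url, title) pairs whose title contains keyword, ignoring case.

    rows is an iterable of (title, url) pairs, in the same column order
    as data.csv.
    """
    needle = keyword.lower()
    return [(url, title) for title, url in rows if needle in title.lower()]


rows = [
    ("图像审核-百度AI开放平台", "https://ai.baidu.com/tech/imagecensoring"),
    ("EasyDL定制化图像识别 - 百度AI开放平台", "https://ai.baidu.com/customize/common/welcome"),
    ("视频比对检索-百度AI开放平台", "https://ai.baidu.com/tech/video/vcc"),
]

for url, title in search(rows, "图像"):
    print(url, title)
```

Returning a list instead of printing makes the function easy to reuse, for instance to count matches or to feed another filter.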