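Notes from the python123 course: https://python123.io/index/tutorials/web_crawler_intro
The robots protocol
Unlike other crawlers, a whole-site crawler aims to fetch every page of a website. Because a crawler fetches pages hundreds of times faster than a person browsing by hand, it puts heavy load on the web server and can easily bring the site down. To avoid this lose-lose outcome, there is a convention: if a site owner does not want crawlers to visit certain pages, they list those pages in a robots.txt file in the agreed format, and crawlers should voluntarily stay away from them. Beyond that, crawler authors should also actively limit how fast their crawler sends requests.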
To read a site's robots rules, request the site's domain followed by '/robots.txt'.
# https://ai.baidu.com/robots.txt
User-agent: *
Disallow: /product/
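As a small illustration of the "domain + /robots.txt" rule, the sketch below derives the robots.txt address from any page URL on a site. The helper name robots_url is hypothetical and not part of the course material.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt lives at the site root.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_url("https://ai.baidu.com/tech/speech/asr"))
# https://ai.baidu.com/robots.txt
```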
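Handling the robots protocol
Python's standard library includes urllib, whose urllib.robotparser module can parse robots rules and tell you whether a specific page may be crawled.
import urllib.robotparser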
Use code to determine whether the following two links may be crawled:
https://ai.baidu.com/product/pingo (not allowed)
https://ai.baidu.com/tech/speech/asr (allowed)
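One possible way to run this check is sketched below. To stay offline, it feeds the two rules quoted earlier straight into RobotFileParser.parse() instead of downloading the file; for a live site you would typically call set_url() and read() instead. The helper name is_allowed is an assumption made for this example.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt quoted in these notes.
ROBOTS_LINES = [
    "User-agent: *",
    "Disallow: /product/",
]

parser = RobotFileParser()
parser.parse(ROBOTS_LINES)  # parse in-memory lines; no network request is made


def is_allowed(url: str, user_agent: str = "*") -> bool:
    """Hypothetical helper: True if the parsed rules permit fetching url."""
    return parser.can_fetch(user_agent, url)


print(is_allowed("https://ai.baidu.com/product/pingo"))    # False
print(is_allowed("https://ai.baidu.com/tech/speech/asr"))  # True
```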
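Basic architecture of a whole-site crawler
Almost every whole-site crawler can be abstracted into the following logic:
The crawler starts from one URL, usually the site's domain, extracts the links from the fetched page, deduplicates them, and puts them into a to-visit list. It repeats this until every page on the site has been visited.
Note that a whole-site crawler usually crawls only the site's internal links.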
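The loop described above can be sketched without touching the network. In the example below, FAKE_SITE, fetch_links and crawl are all hypothetical stand-ins: a dictionary plays the role of the website, and a delay parameter marks where a real crawler would pause between requests.

```python
import time
from collections import deque

# Hypothetical in-memory "site": each path maps to the links found on that page.
FAKE_SITE = {
    "/": ["/tech", "/solution", "https://cloud.baidu.com/"],
    "/tech": ["/", "/tech/speech"],
    "/tech/speech": ["/tech"],
    "/solution": ["/tech/speech"],
}


def fetch_links(path):
    """Stand-in for downloading a page and extracting its links."""
    return FAKE_SITE.get(path, [])


def crawl(start, delay=0.0):
    """Visit every internal page reachable from start, each exactly once."""
    to_visit = deque([start])  # pages waiting to be fetched
    seen = {start}             # every page ever queued, for deduplication
    order = []
    while to_visit:
        path = to_visit.popleft()
        order.append(path)
        for link in fetch_links(path):
            # Internal links are the relative ones in this toy site.
            if link.startswith("/") and link not in seen:
                seen.add(link)
                to_visit.append(link)
        time.sleep(delay)  # be polite: pause between requests
    return order


print(crawl("/"))
# ['/', '/tech', '/solution', '/tech/speech']
```

Tracking seen at queueing time, rather than at visit time, keeps the same page from being queued twice when several pages link to it.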
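Extracting links from a page
Step 3 in the figure above requires extracting every link on the page. Because we use the requests_html library, this is very easy: r.html.links is all it takes.
from requests_html import HTMLSession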
1 | {'/tech/bicc/rts', 'http://di.baidu.com/product/shangqing', '/solution/censoring', '/tech/ocr_cards/vehicle_license', '/tech/imageprocess/stretch_restore', '/customer/baobao', 'https://app.baidu.com', '/tech/imagesearch', '/tech/nlp_basic/dependency_parsing', 'http://ar.baidu.com/dumixar', '/tech/imagerecognition/currency', '/tech/imageprocess/image_quality_enhance', 'https://aim.baidu.com/', '/tech/ocr_cards/passport', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/antiporn/overview/index', '/customer/samsung', 'http://www.baidu.com/duty', '/customer', '/solution/faceprint', '/solution/roboticvision', '/broad/subordinate?dataset=sceneparsing', '/tech/ocr_cars/driving_license', '/tech/video/vcs', 'http://di.baidu.com/product/mtj', '/tech/imagerecognition', '/easydl/retail', '/broad/subordinate?dataset=gon', '/customer/liaoningshiyan', '/broad/subordinate?dataset=canine', '/customer/nfdw', '/tech/ocr_cars/vehicle_license', '/partner/course', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/body/overview/index', '/partner/cella', '/tech/speech/lsr', 'http://zhongbao.baidu.com', 'http://di.baidu.com/product/stj', '/tech/speech', '/solution/aifenzhen', '/tech/face', '/broad/subordinate?dataset=saoke', '/customer/panda', '/partner/apply', '/docs#/FAQ/30a9eac7', 'http://di.baidu.com/product/perswc', '/broad/subordinate?dataset=sked', '/tech/vehicle/detect', 'https://nlp.baidu.com/homepage/nlptools', '/tech/hardware/offlinesdk', '/tech/kg/bgraph', '/solution/kgaas', '/solution/cabinet', '/tech/intelligentwriting#couplet', '/edgecloud/web/capture/int#/devicelist', 'https://ai.baidu.com/forum/topic/show/943331', '/broad/subordinate?dataset=traffic', '/broad/subordinate?dataset=video', '/tech/imagerecognition/fine_grained', '/tech/bicc/atw', '/customer/wps', '/partner/hardware', 'http://di.baidu.com/product/habo', '/tech/vehicle/damage', '/tech/unit', '/tech/ocr_cars', '/broad/subordinate?dataset=amd', '/tech/ocr/business', '/tech/video/vca', '/tech/nlp_basic/simnet', 
'/solution/facesignIn', 'http://di.baidu.com/product/insurancerisk', '/easydl', '/industry/retail', '/solution/cashier', '/support/news', 'https://console.bce.baidu.com/billing/?fromai=1#/account/index', '/tech/video/vcr', '/tech/face/compare', '/accelerator', '/customer/18183', '/tech/imagerecognition/general', 'https://console.bce.baidu.com/?fromai=1#/aip/overview', '/tech/nlp_apply/news_summary', '/support/news?action=detail&id=996', 'https://cloud.baidu.com/calculator.html#/ocr/price', '/tech/body/pose', '/tech/imagesearch/similar', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#5', '/customer/dida', 'http://di.baidu.com/product/zhizhou', '/broad/subordinate?dataset=dureader', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#4', '/customer/fang', '/partner/case', '/sdk', '/paddlepaddle', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/nlp/overview/index', '/tech/ocr/plate', '/tech/imageprocess/dehaze', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#2', '/customer/hualife', '/industry/enterpriseservice', '/tech/ocr_cards/driving_license', '/customer/iqiyi', '/customer/zhiban', '/tech/ocr_cards', '/tech/nlp_basic/lexical', '/tech/nlp_apply/emotion_detection', '/tech/imageprocess/contrast_enhance', '/tech/body/driver', '/tech/nlp_apply/topictagger', '/easydl/sound', '/tech/speech/asr', '/tech/intelligentwriting#article', 'https://developer.baidu.com', '/tech/body', '/solution/faceidentify', '/customer/biguiyuan', '/', '/unit/home', 'http://di.baidu.com/product/zhizhu', '/tech/vr', '/tech/nlp_apply', '/tech/intelligentwriting#poem', '/tech/vehicle/flow', '/broad/subordinate?dataset=pose', '/tech/imagerecognition/ingredient', '/tech/kg/wenda', '/solution/luban', '/partner/seeyon', '/tech/ocr/general', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#1', '/support/news?action=detail&id=1060', 'https://cloud.baidu.com/product/abc-robot.html', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/imagerecognition/overview/index', '/customer/bbsp', 
'http://di.baidu.com/product/xuanke', '/tech/nlp_basic/word_embedding', '/tech/cognitive/hanyu', 'http://di.baidu.com/product/keqing', '/partner/data', '/industry/government', '/broad', 'https://ai.baidu.com/easydl/retail?hmsr=aibanner&hmpl=easydl-retail', 'https://bit.baidu.com', '/partner/solution', '/docs#/Begin/top', '/tech/ocr/table', 'http://fanyi-api.baidu.com/api/trans/product/index', '/tech/face/developmentkit', '/tech/imagerecognition/redwine', '/customer/zhongtong', '/tech/ocr_receipts/receipt', '/partner', '/tech/cognitive/entity_annotation', '/partner/roobo', '/solution/class', '/solution/factory', '/tech/bicc', '/tech/ocr_receipts', '/docs', '/tech/ocr_cards/idcard', 'http://di.baidu.com/ecology/dianshi', '/tech/imagecensoring', '/tech/speech/fsr', '/tech/vehicle', '/docs#/Begin', '/solution/private', '/tech/imageprocess/colourize', '/tech/nlp_basic/dnnlm_cn', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/speech/overview/index', 'http://ar.baidu.com/content#/', 'https://ai.baidu.com/tech/vehicle/damage?hmsr=aibanner&hmpl=damage', '/facekit/home', '/tech/ocr/webimage', 'http://di.baidu.com/product/huike', 'http://di.baidu.com/product/calc', '/tech/vehicle/attr', '/customer/liuzhouyuanchuang', '/partner/kangxing', '/tech/speech/asrpro', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/ar/overview/index', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/kg/overview/index', '/customer/benchi', '/tech/ocr_cards/business_card', 'http://ai.baidu.com/tech/vr', '/forum', '/tech/face/faceliveness', '/customer/hzhuihe', '/customer/lagou', '/docs#/FAQ/61e783cc', '/tech/textcensoring', '/industry/factory', '/partner/yunjingwang', 'https://aim.baidu.com/product/b226a947-4660-4e27-83b4-877bf63b8627?hmsr=aibanner&hmpl=speechkit', '/support/news?action=detail&id=435', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/ocr/overview/index', '/easyedge/home', '/tech/hardware/certifiedpd', '/easydl/text', 
'https://ai.baidu.com/support/news?action=detail&id=1021&hmsr=aibanner&hmpl=KDD', '/industry/estate', '/tech/ocr_receipts/train_ticket', '/tech/ocr_others', '/tech/nlp_apply/comment_tag', '/customer/hangzhouyintai', '/tech/cognitive', '/tech/imageprocess', '/tech/body/num', '/tech/imagerecognition/dish', '/tech/ocr_receipts/taxi_receipt', '/tech/body/gesture', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/bicc/overview/index', '/customer/yy', '/tech/kg/schema', '/broad/subordinate?dataset=kg', 'https://ai.baidu.com/support/news?action=detail&id=1031&hmsr=aibanner&hmpl=customer201904', '/tech/nlp_basic/word_emb_sim', '/tech/intelligentwriting#hotspot', '/tech/video/vcc', '/easydl/image', 'http://fanyi-api.baidu.com/api/trans/product/prodinfo#3', 'http://www.paddlepaddle.org', 'http://di.baidu.com/#products', '/docs#/FAQ', 'https://github.com/Baidu-AIP', '/docs#/FAQ/54966861', 'https://console.bce.baidu.com/ai/?locale=zh-cn#/ai/easydl/overview/index', '/tech/smartasr', 'http://di.baidu.com/product/dayu', '/tech/face/detect', '/solution/ivs', '/tech/hardware/chip', 'http://di.baidu.com/product/tongji', '/tech/nlp_apply/text_corrector', 'http://di.baidu.com/product/palo', 'http://aistudio.baidu.com', '/tech/imagesearch/same', '/solution/facegate', '/tech/hardware/certification', '/solution/itma', '/support/video', '/tech/ocr', 'https://dueros.baidu.com/', '/tech/ocr/iocr', '/tech/face/search', 'http://di.baidu.com/product/jarvis', '/tech/face/merge', '/tech/body/seg', 'http://di.baidu.com/product/entwc', 'http://ticket.bce.baidu.com/#/ticket/create', '/industry/hardware', 'http://di.baidu.com/product/yuqing', '/tech/speech/tts', '/tech/ocr_receipts/vat_invoice', '/tech/vehicle/car', '/tech/imagesearch/product', 'http://di.baidu.com/product/opinion', '/customer/yongyou', '/tech/edgecloud/capture', 'https://www.huodongxing.com/event/9489422863311', '/tech/nlp_apply/doctagger', '/industry/information', '/tech/hardware/deepkit', 'http://lbsyun.baidu.com', 
'http://di.baidu.com/product/insurancefraud', '/docs#/FAQ/d1a6a706', 'http://di.baidu.com/product/mike', 'https://cloud.baidu.com/?t=cp:online-media%7Cci:%7Ccn:ai', 'http://di.baidu.com/product/pingo', '/solution/faceattendance', '/docs#/AI-service-agreement/top', '/tech/speech/wake', '/tech/hardware/speechkit', '/customer/xiecheng', 'http://apollo.auto', '/solution/bmvs', 'https://console.bce.baidu.com/iam/?fromai=1#/iam/baseinfo', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/face/overview/index', '/tech/intelligentwriting#relation', 'http://di.baidu.com/product/sinan', '/industry/education', 'http://ar.baidu.com/home', 'https://ai.baidu.com/support/news?action=detail&id=1060&hmsr=aibanner&hmpl=201904pandian', 'http://di.baidu.com/product/tuijian', '/partner/Baichebao', '/customer/xspatium', '/tech/body/attr', '/tech/face/collect', '/customer/mitu', '/tech/intelligentwriting', '/tech/nlp_basic', '/tech/nlp_apply/sentiment_classify', 'http://di.baidu.com/product/minos', '/tech/kg/zuowen', 'https://console.bce.baidu.com/ai/?fromai=1#/ai/imagesearch/overview/index', 'https://ai.baidu.com/support/news?action=detail&id=1055&hmsr=aibanner&hmpl=paddlecamp', '/tech/face/offline-sdk', '/customer/huzhong', '/tech/ocr_cards/bankcard', 'http://di.baidu.com/product/bes', '/solution/mtp'} |
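To show roughly what link extraction involves, here is a dependency-free sketch that uses the standard library's html.parser in place of requests_html. It is a swapped-in technique for illustration, not how r.html.links is implemented; LinkCollector, extract_links and the sample HTML are assumptions.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag into a set."""

    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(value)


def extract_links(html_text):
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Made-up sample page for the demonstration.
SAMPLE = (
    '<a href="/tech/speech">Speech</a> '
    '<a href="https://cloud.baidu.com/">Cloud</a> '
    "<a>no href</a>"
)
print(sorted(extract_links(SAMPLE)))
# ['/tech/speech', 'https://cloud.baidu.com/']
```

As in the output above, the result is a set, so duplicate links on the same page collapse automatically and the order is not meaningful.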
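Filtering links
After extracting the links, the next step is to filter out the ones that do not need to be crawled.
Use the urllib library to filter out all non-internal links:
from urllib.parse import urlparse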
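A minimal sketch of such a filter follows, assuming that "internal" means "same host as https://ai.baidu.com". The function name internal_links and the three sample links (taken from the extracted set) are choices made for this example. Relative links are first resolved with urljoin so that every result is an absolute URL.

```python
from urllib.parse import urljoin, urlparse

BASE_URL = "https://ai.baidu.com"


def internal_links(links, base_url=BASE_URL):
    """Return absolute URLs for the links that stay on base_url's host."""
    base_host = urlparse(base_url).netloc
    result = set()
    for link in links:
        absolute = urljoin(base_url, link)  # resolves '/tech/...' style links
        if urlparse(absolute).netloc == base_host:
            result.add(absolute)
    return result


sample = {
    "/tech/speech/asr",
    "http://di.baidu.com/product/pingo",
    "https://ai.baidu.com/forum/topic/show/943331",
}
print(sorted(internal_links(sample)))
# ['https://ai.baidu.com/forum/topic/show/943331', 'https://ai.baidu.com/tech/speech/asr']
```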
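Besides filtering out non-internal links, you also need to filter out links that have already been visited, links the robots rules disallow, and links you do not want to visit.
To keep things simple for learning and to get results faster, assume we are not interested in pages under '/docs', '/support', '/forum', '/broad', '/paddlepaddle', '/market', '/download', '/facekit', '/sdk', '/customer', '/easydl' (we do not index them) and care only about the site's main product information. The code below shows the concrete implementation.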
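One way the visited-link and unwanted-section filters might be combined is sketched here. The names EXCLUDED_PREFIXES, is_excluded and should_visit are hypothetical, and matching on whole path segments is an assumption of this sketch, chosen so that '/easydl' does not accidentally exclude '/easyedge/home'. A robots check could be added as one more condition in should_visit.

```python
from urllib.parse import urlparse

# Sections we choose not to index, as listed in the text above.
EXCLUDED_PREFIXES = (
    "/docs", "/support", "/forum", "/broad", "/paddlepaddle", "/market",
    "/download", "/facekit", "/sdk", "/customer", "/easydl",
)


def is_excluded(path):
    """True if path is one of the excluded sections or lies beneath one."""
    return any(
        path == prefix or path.startswith(prefix + "/")
        for prefix in EXCLUDED_PREFIXES
    )


def should_visit(url, visited):
    """Skip URLs already visited and URLs inside an excluded section."""
    if url in visited:
        return False
    return not is_excluded(urlparse(url).path)


visited = {"https://ai.baidu.com/tech/face"}
print(should_visit("https://ai.baidu.com/tech/speech/asr", visited))  # True
print(should_visit("https://ai.baidu.com/docs#/FAQ", visited))        # False
print(should_visit("https://ai.baidu.com/tech/face", visited))        # False
```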
Implementing the Baidu AI crawler
from requests_html import HTMLSession
……
In earlier chapters you learned how to write results to a csv file to store structured data. Now store the crawler's results in data.csv in the format page title, internal URL, as shown below.
File name: data.csv
-------------------------------
语音搜索解决方案-百度AI开放平台,https://ai.baidu.com/solution/bmvs
长语音识别-语音识别-百度AI-百度AI开放平台,https://ai.baidu.com/tech/speech/lsr
相似图片搜索_图像搜索技术-百度AI开放平台,https://ai.baidu.com/tech/imagesearch/similar
"人像分割技术,精准识别人体轮廓,一键抠像-百度AI开放平台",https://ai.baidu.com/tech/body/seg
AI+智能货柜解决方案-百度AI开放平台,https://ai.baidu.com/solution/cabinet
百度AI开放平台-全球领先的人工智能服务平台-百度AI开放平台,https://ai.baidu.com/tech/kg/wenda
短文本相似度-NLP-百度AI-百度AI开放平台,https://ai.baidu.com/tech/nlp/simnet
视频比对检索-百度AI开放平台,https://ai.baidu.com/tech/video/vcc
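A sketch of how rows in this format could be written with the csv module is shown below. It writes to an in-memory buffer so it can run anywhere; to produce the real file you would pass open("data.csv", "w", newline="", encoding="utf-8") instead. The helper write_rows is hypothetical, and the two rows are taken from the sample above.

```python
import csv
import io

rows = [
    ("语音搜索解决方案-百度AI开放平台", "https://ai.baidu.com/solution/bmvs"),
    ("人像分割技术,精准识别人体轮廓,一键抠像-百度AI开放平台", "https://ai.baidu.com/tech/body/seg"),
]


def write_rows(rows, fileobj):
    """Write (title, url) pairs as CSV lines to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerows(rows)


buffer = io.StringIO()
write_rows(rows, buffer)
print(buffer.getvalue())
```

The second title contains commas, so csv.writer wraps it in double quotes automatically; this is why one row in the sample file is quoted and the others are not.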
You can use the code below to read data.csv and add a search feature.
import csv
Enter a search keyword (type quit to exit): 图像
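The reading and searching steps might look roughly like the sketch below. It reads from an in-memory string rather than the real data.csv, and it leaves out the interactive prompt so that it runs unattended; load_rows, search and SAMPLE_CSV are names assumed for this example. An interactive version could wrap search in a loop that calls input() until the user types quit.

```python
import csv
import io

# Two rows copied from the sample data.csv shown earlier.
SAMPLE_CSV = (
    "相似图片搜索_图像搜索技术-百度AI开放平台,https://ai.baidu.com/tech/imagesearch/similar\n"
    "视频比对检索-百度AI开放平台,https://ai.baidu.com/tech/video/vcc\n"
)


def load_rows(fileobj):
    """Read (title, url) pairs, skipping malformed lines."""
    return [tuple(row) for row in csv.reader(fileobj) if len(row) == 2]


def search(rows, keyword):
    """Return the rows whose title contains keyword."""
    return [(title, url) for title, url in rows if keyword in title]


rows = load_rows(io.StringIO(SAMPLE_CSV))
for title, url in search(rows, "图像"):
    print(title, url)
```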