1. Scrape Douban movies and record each title and release year in a CSV file:
This uses HTMLSession from requests_html, a relatively new library.
Official documentation: https://cncert.github.io/requests-html-doc-cn/#/?id=%E4%BD%BF%E7%94%A8%E6%96%B9%E6%B3%95
# -*- coding: utf-8 -*-
import csv

from requests_html import HTMLSession

session = HTMLSession()
links = [
    'https://movie.douban.com/subject/1292052/',
    'https://movie.douban.com/subject/26752088/',
    'https://movie.douban.com/subject/1962665/',
]

# The with-block closes the file even if a request fails partway through.
with open('movies.csv', 'w', newline='', encoding='utf-8') as file:
    csvwriter = csv.writer(file)
    csvwriter.writerow(['名称', '年份'])  # header row: title, year
    for link in links:
        r = session.get(link)
        # first=True returns a single element instead of a list
        title = r.html.find('#content > h1 > span:nth-child(1)', first=True)
        year = r.html.find('#content > h1 > span.year', first=True)
        csvwriter.writerow([title.text, year.text])
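One thing the script above does not handle is a selector that matches nothing, in which case `title` or `year` has no `.text` to read. Below is a minimal, standard-library-only sketch of the CSV-writing step that tolerates missing values. The helper names `clean_year` and `rows_to_csv` and the sample rows are made up for illustration, and the assumption that the year text arrives wrapped in parentheses, like `(1994)`, should be checked against the real page.

```python
import csv
import io
import re
from typing import Iterable, Optional, Tuple


def clean_year(raw: Optional[str]) -> str:
    """Return the first 4-digit number found in raw, or '' if there is none."""
    if not raw:
        return ''
    match = re.search(r'\d{4}', raw)
    return match.group(0) if match else ''


def rows_to_csv(rows: Iterable[Tuple[Optional[str], Optional[str]]]) -> str:
    """Write (title, raw_year) pairs as CSV text with a header row."""
    buffer = io.StringIO(newline='')
    writer = csv.writer(buffer)
    writer.writerow(['title', 'year'])
    for title, raw_year in rows:
        # Missing values become empty cells instead of raising an error.
        writer.writerow([title or '', clean_year(raw_year)])
    return buffer.getvalue()


# Illustrative sample values, not real scraped data.
sample_rows = [('Example Movie', '(1994)'), ('No Year Movie', None)]
csv_text = rows_to_csv(sample_rows)
print(csv_text)
```

In the scraper, the same idea would mean passing `title.text if title else None` rather than `title.text` directly.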
- Scrape salary data for web-crawler engineers in the Beijing area
# -*- coding: utf-8 -*-
Error:
RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.
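See:
https://github.com/kennethreitz/requests-html/issues
Still unresolved, and even the library's creator has not replied. It seems to be related to async; I'll come back to it once I've had time to work through the library's documentation. (Better not to use this library for now. Answers are hard to find on Google, so search the issues of the corresponding open-source GitHub repo instead.)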
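Going only by the wording of the error message, the session was used in a place where an asyncio event loop was already running (Jupyter notebooks, for example, run one in the background). The sketch below uses just the standard library to check for that condition. `in_running_event_loop` and `check_inside_loop` are hypothetical helper names, not part of requests_html, and this is only a diagnostic, not a fix for the issue above.

```python
import asyncio


def in_running_event_loop() -> bool:
    """Return True if called while an asyncio event loop is running."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running.
        return False
    return True


async def check_inside_loop() -> bool:
    # Executed through asyncio.run(), so a loop is running at this point.
    return in_running_event_loop()


outside = in_running_event_loop()          # plain script: no loop running
inside = asyncio.run(check_inside_loop())  # inside a coroutine: loop running
print(f'outside a loop: {outside}, inside a loop: {inside}')
```

If the check returns True in the environment where the scraper runs, that matches the situation the error message describes, and it points toward AsyncHTMLSession as the message suggests.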