Scraping Douban movies and saving their titles and years to a CSV file

1. Scrape Douban movies and save each title and year to a CSV file:


This uses HTMLSession from requests_html, a relatively new library.

Official documentation: https://cncert.github.io/requests-html-doc-cn/#/?id=%E4%BD%BF%E7%94%A8%E6%96%B9%E6%B3%95

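# -*- coding: utf-8 -*-
import csv

from requests_html import HTMLSession

session = HTMLSession()

links = [
    'https://movie.douban.com/subject/1292052/',
    'https://movie.douban.com/subject/26752088/',
    'https://movie.douban.com/subject/1962665/',
]

# newline='' stops the csv module from writing blank rows on Windows;
# utf-8-sig writes a BOM so that Excel also recognises the file as UTF-8.
with open('movies.csv', 'w', newline='', encoding='utf-8-sig') as file:
    csvwriter = csv.writer(file)
    csvwriter.writerow(['Title', 'Year'])

    for link in links:
        r = session.get(link)
        # first=True returns a single element, or None when nothing matches
        title = r.html.find('#content > h1 > span:nth-child(1)', first=True)
        year = r.html.find('#content > h1 > span.year', first=True)
        if title is None or year is None:
            print('Could not find the title or year on ' + link)
            continue
        csvwriter.writerow([title.text, year.text])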
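
The year element on these pages appears to keep its surrounding parentheses, so `year.text` would come back as something like `(1994)`. That is an assumption about the page markup, not something the script checks. If it holds, a small helper can strip the parentheses before the row is written. The name `clean_year` is made up for this sketch:

```python
import re


def clean_year(text):
    """Return the first four-digit run in text, or the stripped text if there is none."""
    match = re.search(r'\d{4}', text)
    return match.group(0) if match else text.strip()


print(clean_year('(1994)'))
```

In the loop it would be used as `csvwriter.writerow([title.text, clean_year(year.text)])`.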

2. Scrape salary data for crawler engineers in the Beijing area:
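# -*- coding: utf-8 -*-
import re

from matplotlib import pyplot as plt
from requests_html import HTMLSession

# Raw strings keep \d from being read as a string escape; [^>]* keeps each
# match inside one <p> tag even if the HTML arrives as a single long line.
salary_element = r'<p[^>]*>(\d+)K-(\d+)K</p>'
salary = []
# "下一页" is the label of the "next page" button, which is disabled on the last page
disabled_button_element = r'<button[^>]* disabled="disabled">下一页</button>'
disabled_button = None
p = 1

while not disabled_button:
    print('Scraping page ' + str(p))
    url = 'https://sou.zhaopin.com/?p=' + str(p) + '&jl=530&kw=爬虫工程师&kt=3'
    session = HTMLSession()
    page = session.get(url)
    # Render the JavaScript-driven page, then sleep 20 seconds so it can finish loading
    page.html.render(sleep=20)
    # Extract the salaries as (low, high) string pairs,
    # e.g. [('10', '20'), ('15', '20'), ('12', '15')]
    salary += re.findall(salary_element, page.html.html)
    # Check whether the "next page" button can still be clicked
    disabled_button = re.findall(disabled_button_element, page.html.html)
    p = p + 1
    session.close()

# Average each company's salary range, e.g. ('12', '15') averages to 13.5
salary = [(int(s[0]) + int(s[1])) / 2 for s in salary]
# Group the salaries into bands for display; feel free to try other presentations
low_salary, middle_salary, high_salary = 0, 0, 0
for s in salary:
    if s <= 15:
        low_salary += 1
    elif s <= 30:
        middle_salary += 1
    else:
        high_salary += 1
# Set the figure size (width, height)
plt.figure(figsize=(6, 9))
# Labels for the pie chart, given as a list
labels = ['<=15K', '15K-30K', '>30K']
data = [low_salary, middle_salary, high_salary]
plt.pie(data, labels=labels)
# Use the same scale on the x and y axes so that the pie is drawn as a circle
plt.axis('equal')
plt.legend()
plt.show()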
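
The extraction and bucketing steps can be tried without touching the site. Below is a minimal sketch, assuming salaries appear in the HTML as `<p ...>10K-20K</p>`. The sample markup and the names `parse_salaries` and `bucket` are invented for illustration. The pattern uses `[^>]*` rather than `.*` so that several salaries on one line are each matched separately.

```python
import re

SALARY_RE = re.compile(r'<p[^>]*>(\d+)K-(\d+)K</p>')


def parse_salaries(html):
    """Return the midpoint of every 'lowK-highK' salary range found in html."""
    return [(int(low) + int(high)) / 2 for low, high in SALARY_RE.findall(html)]


def bucket(salaries):
    """Count salaries in the <=15K, 15K-30K and >30K bands, in that order."""
    counts = [0, 0, 0]
    for s in salaries:
        if s <= 15:
            counts[0] += 1
        elif s <= 30:
            counts[1] += 1
        else:
            counts[2] += 1
    return counts


# Made-up sample markup, all on one line like minified HTML
sample = '<p class="s">10K-20K</p><p class="s">15K-20K</p><p class="s">30K-50K</p>'
averages = parse_salaries(sample)
print(averages)
print(bucket(averages))
```

The list returned by `bucket` can be passed straight to `plt.pie` as the `data` argument.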


Error:
RuntimeError: Cannot use HTMLSession within an existing event loop. Use AsyncHTMLSession instead.

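See:
https://github.com/kennethreitz/requests-html/issues
Still unresolved, and even the library's creator has not replied. It seems to be related to async; I'll come back to it once I've had time to work through the library's documentation. (Better not to use this library. Answers are hard to find on Google; it is more useful to search the issues of the library's GitHub repository.)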
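
The message itself says an asyncio event loop was already running when `render()` was called. One common situation where that happens is running the script inside a notebook or an IDE console, though that is only a guess about the setup here. A quick stdlib-only check follows; the helper name `loop_is_running` is made up for this sketch:

```python
import asyncio


def loop_is_running():
    """Return True when called from inside a running asyncio event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # get_running_loop() raises RuntimeError when no loop is running
        return False
    return True


async def check_inside_loop():
    # Within a coroutine, the loop that drives it is running by definition
    return loop_is_running()


print(loop_is_running())
print(asyncio.run(check_inside_loop()))
```

If the first call prints True in the environment where the scraper runs, the error above is to be expected there. The options to try are then the message's own suggestion of `AsyncHTMLSession`, or running the script from a plain terminal.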
