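Aggregating News Headlines

Study the python123 course: https://python123.io/index/tutorials/web_crawler_intro

News sources

Visit the home pages of Baidu News and NetEase News and look at the hot news headlines.

Extracting the headlines

Use the browser's right-click "Inspect Element" feature to copy the selectors of any three headlines, then compare them.

Apart from the first headline, the selectors differ only in the number inside ul:nth-child(). So every headline except the first can be extracted with ul:nth-child(n), while the first one needs its own selector: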

from requests_html import HTMLSession

def get_news():
    ans_news_titles = []
    session = HTMLSession()
    r = session.get('https://news.baidu.com/')
    # The top headline has its own selector; find() returns None if nothing matches
    title1_baidu = r.html.find('#pane-news > div > ul > li.hdline0 > strong > a', first=True)
    if title1_baidu is not None:
        ans_news_titles.append(title1_baidu)
    # The remaining headlines share one pattern: only the ul:nth-child() index varies
    titles_baidu = r.html.find('#pane-news > ul:nth-child(n) > li.bold-item > a')
    ans_news_titles += titles_baidu
    for title in ans_news_titles:
        print(title.text)

if __name__ == "__main__":
    get_news()

习近平主持中共中央政治局会议
新华时评：认清美方倒打一耙的把戏
“新闻联播”热搜第一，网友：就是要硬气！
特朗普宣布额外拨款16亿美元加速NASA登月计划
如何让中国故事打动日本民众看这只"熊猫"怎么做的
广东考试院回应"富源学校去年9人考上名校":已调查
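
The selector comparison described above can also be done mechanically. The sketch below is only an illustration: the three selectors are hypothetical examples of what the inspector might copy (the concrete nth-child indices are assumptions, not values taken from the live page). It replaces each concrete index with n and shows which selectors collapse into one shared pattern.

```python
import re

# Hypothetical selectors, as they might be copied from the browser inspector.
# The concrete indices (2 and 4) are illustrative assumptions.
copied_selectors = [
    "#pane-news > div > ul > li.hdline0 > strong > a",
    "#pane-news > ul:nth-child(2) > li.bold-item > a",
    "#pane-news > ul:nth-child(4) > li.bold-item > a",
]

def generalize(selector):
    """Replace every concrete nth-child index with the generic n."""
    return re.sub(r":nth-child\(\d+\)", ":nth-child(n)", selector)

# Selectors that collapse to the same pattern can share a single find() call.
patterns = sorted({generalize(s) for s in copied_selectors})
for pattern in patterns:
    print(pattern)
```

Two patterns remain here: the special selector of the first headline and the generalized selector that covers the rest.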

Similarly, the code for extracting the NetEase News headlines is as follows.

from requests_html import HTMLSession

def get_news():
    ans_news_titles = []
    session = HTMLSession()
    r2 = session.get('https://news.163.com/')
    # The two top headlines are h2 elements; the rest are list items
    title1_163 = r2.html.find('#js_top_news > h2:nth-child(1) > a', first=True)
    title2_163 = r2.html.find('#js_top_news > h2.top_news_title > a', first=True)
    titles_163 = r2.html.find('#js_top_news > ul:nth-child(n) > li:nth-child(n)')
    # find(..., first=True) returns None if nothing matches, so skip missing headlines
    for title in (title1_163, title2_163):
        if title is not None:
            ans_news_titles.append(title)
    ans_news_titles += titles_163
    for title in ans_news_titles:
        print(title.text)

if __name__ == "__main__":
    get_news()

Scheduling

Install the Advanced Python Scheduler library from the console with pip install apscheduler.

The APScheduler library lets us run a function on a schedule, for example printing "Python123!" once every 5 seconds.
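
A minimal sketch of that 5-second job is shown here. It assumes APScheduler 3.x is installed; the function names my_print and run_every_five_seconds are chosen for illustration, and the import is placed inside the function only so that my_print can be reused without the scheduler.

```python
def my_print():
    """The job: print a fixed message."""
    print("Python123!")

def run_every_five_seconds():
    # Third-party dependency: pip install apscheduler (3.x assumed)
    from apscheduler.schedulers.blocking import BlockingScheduler

    sched = BlockingScheduler()
    # 'interval' trigger: run my_print once every 5 seconds
    sched.add_job(my_print, 'interval', seconds=5)
    # start() blocks the current thread until interrupted
    sched.start()
```

Calling run_every_five_seconds() starts the scheduler; it keeps running until you stop it with Ctrl+C.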

Here are some commonly used APScheduler schedules:

Run my_print every 30 seconds:
add_job(my_print, 'interval', seconds = 30)

Run my_print every 2 minutes:
add_job(my_print, 'interval', minutes = 2)

Run my_print every 2 hours between 2019-01-01 09:30:00 and 2019-02-01 11:00:00:
add_job(my_print, 'interval', hours=2, start_date='2019-01-01 09:30:00', end_date='2019-02-01 11:00:00')

Run my_print once at 2019-01-01 09:30:00:
add_job(my_print, 'date', run_date='2019-01-01 09:30:00')

Run my_print every hour, on the hour:
add_job(my_print, 'cron', hour='*')

Run my_print at 05:30 every Monday through Friday:
add_job(my_print, 'cron', day_of_week='mon-fri', hour=5, minute=30)

The code below fetches and prints the Baidu and NetEase hot news at 00:00, 01:00, 02:00 and 03:00 on the third Friday of June, July, August, November and December.

from apscheduler.schedulers.blocking import BlockingScheduler
from requests_html import HTMLSession

def get_news(): 
    …

sched = BlockingScheduler()
# Run the job at 00:00, 01:00, 02:00 and 03:00 on the third Friday of June, July, August, November and December
sched.add_job(get_news, 'cron', month = '6-8,11-12', day = '3rd fri', hour = '0-3')
sched.start()
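
To see which moments such a cron expression selects, the dates can be worked out by hand with the standard library. This is a plain-Python approximation of what month='6-8,11-12', day='3rd fri', hour='0-3' means, not APScheduler's own code; the helper names third_friday and fire_times are made up for this sketch.

```python
import calendar
from datetime import datetime

MONTHS = [6, 7, 8, 11, 12]  # month='6-8,11-12'
HOURS = [0, 1, 2, 3]        # hour='0-3'

def third_friday(year, month):
    """Return the date of the third Friday of the given month."""
    fridays = [
        d for d in calendar.Calendar().itermonthdates(year, month)
        if d.month == month and d.weekday() == calendar.FRIDAY
    ]
    return fridays[2]

def fire_times(year):
    """List every datetime in the year at which the job would run."""
    times = []
    for month in MONTHS:
        day = third_friday(year, month)
        for hour in HOURS:
            times.append(datetime(year, month, day.day, hour))
    return times

if __name__ == "__main__":
    for moment in fire_times(2019):
        print(moment.strftime("%Y-%m-%d %H:%M"))
```

For any year this gives 20 run times: five months, one day in each, four hours on that day.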

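Visualization

The hot news can now be printed on a schedule, but we would also like to present it visually. Let's try to design a better way to display it.

To make the hot topics more engaging, this section presents them as a word cloud.

First, the collected text has to be segmented, that is, the individual words in each sentence are extracted. The jieba library does this, so let's try it.

Install the jieba library from the console with pip install jieba.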

import jieba

seg_list = jieba.cut("Python123！Python123为你提供优秀的 Python 学习工具、教程、平台和更好的学习体验。", cut_all=True)
word_split = " ".join(seg_list)
print(word_split)

Python123 Python123 为你提供优秀的 Python 学习工具教程平台和更好的学习体验

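Once the text is segmented, we want the words that occur most often to stand out. The wordcloud library builds the word cloud for us, and we then save it as an image file.

Install the wordcloud library from the console with pip install wordcloud.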

from wordcloud import WordCloud
import jieba
import time

seg_list = jieba.cut("Python123！Python123为你提供优秀的 Python 学习工具、教程、平台和更好的学习体验。", cut_all=True)
word_split = " ".join(seg_list)
# A font with Chinese glyphs is required; this is the Windows setting
# (the raw string keeps the backslashes in the path literal)
# On macOS, font_path can be set to "/System/Library/fonts/PingFang.ttc"
my_wordclud = WordCloud(background_color='white', font_path=r'C:\Windows\Fonts\simhei.ttf', max_words=100, width=1600, height=800)
# Generate the word cloud
my_wordclud = my_wordclud.generate(word_split)
# Save the word cloud image, named after the current time
now = time.strftime('%Y-%m-%d-%H_%M_%S', time.localtime(time.time()))
my_wordclud.to_file(now + '.png')
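
Since the word cloud draws frequent words larger, it can help to look at the counts directly. The sketch below counts space-separated tokens with the standard library; the sample text and the helper name top_words are made up for illustration, and wordcloud's own tokenization and stopword filtering may give somewhat different counts.

```python
from collections import Counter

# Space-separated tokens, in the same shape as word_split above.
# This sample text is invented for the example.
word_split = "Python 学习 工具 Python 教程 学习 Python"

def top_words(text, limit=100):
    """Return the most frequent tokens as (word, count) pairs."""
    counts = Counter(text.split())
    return counts.most_common(limit)

for word, count in top_words(word_split, 2):
    print(word, count)
```

A frequency mapping like this could also be handed to WordCloud's generate_from_frequencies method instead of calling generate on the raw text, if you want full control over the counts.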

Combining this with the earlier scraping code gives the complete program:

from apscheduler.schedulers.blocking import BlockingScheduler
from requests_html import HTMLSession
import jieba
from wordcloud import WordCloud
import time

def get_news():
    print('Start scraping hot news')
    ans_news_titles = []
    session = HTMLSession()
    # Fetch Baidu News
    r = session.get('https://news.baidu.com/')
    title1_baidu = r.html.find('#pane-news > div > ul > li.hdline0 > strong > a', first=True)
    if title1_baidu is not None:  # find() returns None when nothing matches
        ans_news_titles.append(title1_baidu.text)
    titles_baidu = r.html.find('#pane-news > ul:nth-child(n) > li.bold-item > a')
    for title in titles_baidu:
        ans_news_titles.append(title.text)

    # Fetch NetEase News
    r = session.get('https://news.163.com/')
    title1_163 = r.html.find('#js_top_news > h2:nth-child(1) > a', first=True)
    title2_163 = r.html.find('#js_top_news > h2.top_news_title > a', first=True)
    titles_163 = r.html.find('#js_top_news > ul:nth-child(n) > li:nth-child(n)')
    for title in [title1_163, title2_163] + titles_163:
        if title is not None:
            ans_news_titles.append(title.text)
    # Segment all titles and join the words with spaces for WordCloud
    word_jieba = jieba.cut(' '.join(ans_news_titles), cut_all=True)
    word_split = " ".join(word_jieba)
    # A font with Chinese glyphs is required; this is the Windows path (raw string keeps the backslashes literal)
    my_wordclud = WordCloud(background_color='white', font_path=r'C:\Windows\Fonts\simhei.ttf', max_words=100, width=1600, height=800)
    # Generate the word cloud
    my_wordclud = my_wordclud.generate(word_split)
    # Save the word cloud image, named after the current time
    now = time.strftime('%Y-%m-%d-%H_%M_%S', time.localtime(time.time()))
    my_wordclud.to_file(now + '.png')

sched = BlockingScheduler()

# Run once immediately
get_news()
# Then run every 30 seconds
sched.add_job(get_news, 'interval', seconds=30)
sched.start()

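Of course, you can go on to do more interesting things, such as working out which hot words appeared or disappeared compared with the previous hour, or summarizing the hot words of the past 24 hours.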
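
One possible starting point for the hour-over-hour comparison is sketched below. It assumes each hourly run stores its word counts as a Counter; the helper names diff_hot_words and summarize and the sample snapshots are invented for illustration.

```python
from collections import Counter

def diff_hot_words(previous, current):
    """Return (added, removed): words new this hour and words that dropped out."""
    prev_words, curr_words = set(previous), set(current)
    return curr_words - prev_words, prev_words - curr_words

def summarize(snapshots):
    """Add up the word counts of several hourly snapshots."""
    total = Counter()
    for snapshot in snapshots:
        total.update(snapshot)
    return total

# Made-up hourly snapshots: word -> number of occurrences
hour_1 = Counter({"登月": 3, "会议": 2})
hour_2 = Counter({"登月": 1, "熊猫": 5})

added, removed = diff_hot_words(hour_1, hour_2)
print("added:", added)
print("removed:", removed)
print("top word overall:", summarize([hour_1, hour_2]).most_common(1))
```

Keeping the last 24 snapshots in a list and passing them to summarize would give the daily summary.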