Newspaper: a Python module for article scraping and curation


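Extracting content from web pages is useful in many fields, such as data mining and information retrieval. To extract information from newspaper and magazine websites, we will use the newspaper library.

The main purpose of this library is to extract and curate articles from newspapers and similar websites.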

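Installation:

  • To install the newspaper library, run in a terminal:

$ pip install newspaper3k

  • For the lxml dependency, run:

$ pip install lxml

  • To install PIL (provided by the Pillow package), run:

$ pip install Pillow

  • To download the NLP corpora, run:

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python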
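
Before moving on, it can help to confirm that the packages installed above are importable. The following is a minimal sketch using only the standard library. The helper name `missing_modules` is hypothetical, and the import names in the list are assumptions about what the pip packages expose (`newspaper3k` as `newspaper`, `Pillow` as `PIL`, and `nltk` as the package the corpora script is assumed to rely on).

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the packages installed above.
    required = ["newspaper", "lxml", "PIL", "nltk"]
    missing = missing_modules(required)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules are importable.")
```

If anything is reported as missing, rerun the matching `pip install` command in the same Python environment.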

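The Python newspaper library collects information associated with an article: the author names, the main image, the publication date, videos in the article, keywords describing the article, and a summary of the article.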

# Import the required library
>>> from newspaper import Article
# URL of the article you want to extract
>>> url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117"
# Download the article
>>> article = Article(url)
>>> article.download()
# Parse the article and get the author names
>>> article.parse()
>>> print(article.authors)

Output:

['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com']
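
The author list above mixes real names with entries that look like mangled e-mail addresses ('Kristina.Peterson Wsj.Com'). One possible way to tidy it is sketched below. The helper `clean_authors` is hypothetical and not part of newspaper; it is a crude heuristic that drops any entry containing a token with a dot in its interior, so it would also discard legitimate names written like "J.K. Rowling".

```python
def clean_authors(authors):
    """Drop entries that look like mangled e-mail addresses, keeping order."""
    cleaned = []
    for name in authors:
        tokens = name.split()
        # A dot inside a token (e.g. "Wsj.Com") suggests an address fragment;
        # a trailing dot (e.g. the initial "J.") is allowed.
        looks_like_address = any("." in tok.rstrip(".") for tok in tokens)
        if not looks_like_address and name not in cleaned:
            cleaned.append(name)
    return cleaned


authors = [
    "Kristina Peterson", "Andrew Duehren", "Natalie Andrews",
    "Kristina.Peterson Wsj.Com", "Andrew.Duehren Wsj.Com",
    "Natalie.Andrews Wsj.Com",
]
print(clean_authors(authors))
# prints ['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews']
```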

# Extract the publication date
>>> print("Article publish date:")
>>> print(article.publish_date)
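
The library cannot always detect a date, in which case `publish_date` is None. The sketch below shows one way to handle that case when printing. The helper `format_publish_date` and its `fallback` parameter are hypothetical, and the code assumes that a detected date is a `datetime.datetime` object.

```python
from datetime import datetime


def format_publish_date(publish_date, fallback="unknown"):
    """Return an ISO 8601 date string, or the fallback when no date is set."""
    if not isinstance(publish_date, datetime):
        # Covers None (no date detected) and any unexpected value.
        return fallback
    return publish_date.date().isoformat()


print(format_publish_date(None))                          # prints unknown
print(format_publish_date(datetime(2019, 2, 11, 21, 5)))  # prints 2019-02-11
```

With a parsed article, the call would be `format_publish_date(article.publish_date)`.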

# Extract the URL of the main image

>>> print(article.top_image)

Output:

https://images.wsj.net/im-51122/social

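# Extract keywords using NLP (call article.nlp() first; keywords is empty until then)

>>> article.nlp()
>>> print("Keywords in the article:", article.keywords)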

# Extract the article summary (also populated by article.nlp())

>>> print("Article summary:", article.summary)

Here is the complete program:

from newspaper import Article
url = "https://www.wsj.com/articles/lawmakers-to-resume-stalled-border-security-talks-11549901117"
article = Article(url)
article.download()
article.parse()
print(article.authors)
print("Article publish date:")
print(article.publish_date)
print("Main image in the article:")
print(article.top_image)
# nlp() must run after parse(); it populates keywords and summary
article.nlp()
print("Keywords in the article:")
print(article.keywords)
print("Article summary:")
print(article.summary)

Output:

['Kristina Peterson', 'Andrew Duehren', 'Natalie Andrews', 'Kristina.Peterson Wsj.Com', 'Andrew.Duehren Wsj.Com', 'Natalie.Andrews Wsj.Com']
Article publish date:
None
Main image in the article:
https://images.wsj.net/im-51122/social
Keywords in the article:
['state', 'spending', 'sweeping', 'southern', 'security', 'border', 'principle', 'lawmakers', 'avoid', 'shutdown', 'reach', 'weekendthe', 'fund', 'trump', 'union', 'agreement', 'wall']
Article summary:
President Trump made the case in his State of the Union address for the construction of a wall along the southern U.S. border, calling it a “moral issue."
Photo: GettyWASHINGTON—Senior lawmakers said Monday night they had reached an agreement in principle on a sweeping deal to end a monthslong fight over border security and avoid a partial government shutdown this weekend.
The top four lawmakers on the House and Senate Appropriations Committees emerged after three closed-door meetings Monday and announced that they had agreed to a framework for all seven spending bills whose funding expires at 12:01 a.m. Saturday.
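
When several articles are processed, it is often more convenient to gather the extracted fields into a dictionary than to print them one by one. The following is a minimal sketch under stated assumptions: `article_to_dict` and `FIELDS` are hypothetical names, and `movies` is assumed to be the attribute that holds the videos found in an article. A stand-in object is used so the sketch runs without network access.

```python
from types import SimpleNamespace

# Attribute names to copy; "movies" is an assumed name for the video list.
FIELDS = ("authors", "publish_date", "top_image", "movies", "keywords", "summary")


def article_to_dict(article, fields=FIELDS):
    """Copy selected attributes from an article-like object into a dict.

    Attributes the object does not have are recorded as None.
    """
    return {field: getattr(article, field, None) for field in fields}


# Stand-in for a parsed article; pass a real Article object in practice.
fake_article = SimpleNamespace(
    authors=["Kristina Peterson"],
    publish_date=None,
    top_image="https://images.wsj.net/im-51122/social",
    keywords=["border", "security"],
    summary="(summary text)",
)
record = article_to_dict(fake_article)
print(record["authors"])  # prints ['Kristina Peterson']
print(record["movies"])   # prints None (attribute missing on the stand-in)
```

A dictionary like this can be serialized or loaded into a table for further analysis.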
