Sina News cannot be crawled properly. Could Mr. Wang Dawei (王大伟) help?
import requests
from bs4 import BeautifulSoup

def getArticle(url):
    # Download one article page and extract its fields.
    res = requests.get(url)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    dic = {}
    #dic['title'] = soup.select('body > div.main-content.w1240 > h1')[0].text
    # Remove all whitespace from the article body.
    dic['content'] = ''.join(soup.select('#article')[0].text.split())
    dic['source'] = soup.select('.date-source')[0].text
    dic['keywords'] = soup.select('.keywords')[0].text
    return dic

# Fetch the China news index page and follow each headline link.
# getArticle is defined above so that it exists before this loop calls it.
res = requests.get('http://news.sina.com.cn/china/')
res.encoding = 'utf-8'
soup = BeautifulSoup(res.text, 'html.parser')
newsary = []
for link in soup.select('.news-item'):
    if len(link.select('h2 a')) > 0:
        newsary.append(getArticle(link.select('h2 a')[0]['href']))
3 replies
ID王大伟 - Life is short, I choose Python. Answered 2018-04-09
晨枫 Answered 2018-04-11
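1. I am crawling the article content of the "Latest News" (最新消息) section at http://news.sina.com.cn/china/ ; see the attached screenshot for details.
2. The code I posted only crawls a small part of the list. The paginated entries that appear after scrolling to the bottom are not crawled.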
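The thread does not show how the later pages of the list are served, so the following is only a sketch of the page-by-page loop, not a fix for this site. It assumes a hypothetical, caller-supplied `fetch_links(page)` that returns the article URLs for one page (for example by requesting whatever paged URL the browser's network panel shows), and a `get_article(url)` such as the `getArticle` from the question. The names `crawl_all_pages`, `fetch_links` and `max_pages` are made up for illustration.

```python
def crawl_all_pages(fetch_links, get_article, max_pages=50):
    """Collect articles page by page until a page yields no links.

    fetch_links(page) -> list of article URLs for that page (hypothetical)
    get_article(url)  -> dict describing one article
    """
    articles = []
    seen = set()
    for page in range(1, max_pages + 1):
        links = fetch_links(page)
        if not links:
            break  # an empty page is treated as the end of the list
        for url in links:
            if url in seen:
                continue  # the same article may show up on two pages
            seen.add(url)
            try:
                articles.append(get_article(url))
            except Exception as exc:
                # One broken article page should not stop the whole crawl.
                print('skipped', url, repr(exc))
    return articles
```

With this shape, the first page can still come from `soup.select('.news-item')` as in the question, while later pages come from whatever source actually holds them.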
晨枫 Answered 2018-04-11
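Thank you for the reply, Mr. Wang. I am crawling the "Latest News" content of Sina News at http://news.sina.com.cn/china/ . After scrolling to the bottom, the paginated content cannot be crawled and an exception is raised.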
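The traceback is not shown in the thread. If the exception is an `IndexError` from `[0]` on an empty `select()` result (an assumption, which would happen whenever an article page lacks `#article`, `.date-source` or `.keywords`), a guarded lookup avoids it. A minimal sketch, where `first_text` and `parse_article` are hypothetical helper names and `soup` is an already-parsed BeautifulSoup object:

```python
def first_text(node, selector, default=None):
    """Return the stripped text of the first match, or default if none."""
    matches = node.select(selector)
    if not matches:
        return default
    return matches[0].get_text(strip=True)


def parse_article(soup):
    """Build the article dict without raising when a field is missing."""
    content = first_text(soup, '#article')
    return {
        # Remove all whitespace from the body, as the original code does.
        'content': ''.join(content.split()) if content else None,
        'source': first_text(soup, '.date-source'),
        'keywords': first_text(soup, '.keywords'),
    }
```

Inside `getArticle` this would be called as `parse_article(BeautifulSoup(res.text, 'html.parser'))`. Fields that are missing on a page come back as `None` instead of stopping the crawl.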