Crawling the novel 圣墟 from the 6mao novel site with Python's Scrapy
2018-10-26 05:29:30 · Source: 博客园
With some free time on my hands I wanted to read a novel on my computer, but after searching for a while I couldn't find a site that offered downloads. So I decided to crawl the novel's text myself and save it locally.
The page to crawl is Chapter 1 of 圣墟 by 辰东 on the 6mao novel site:
圣墟 第一章 沙漠中的彼岸花 - 辰东 - http://www.6mao.com/html/40/40184/12601161.html

Examining the page structure: the chapter title sits in an h1 inside div id="content", the chapter text lives in div id="neirong", and the 下一章 ("next chapter") link inside div class="s_page" points to the following chapter.
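The selectors can be sanity-checked interactively before writing the spider. A quick session with Scrapy's shell (these are the same XPath expressions the spider below relies on; the actual output is omitted here):

scrapy shell http://www.6mao.com/html/40/40184/12601161.html
>>> response.xpath('//div[@id="content"]/h1/text()').extract()    # chapter title
>>> response.xpath('//div[@id="neirong"]/text()').extract()       # chapter text nodes
>>> response.xpath('//div[@class="s_page"]/a/@href').extract()    # prev/next/index links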
Then create the Scrapy project:
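Presumably the standard scaffolding commands, given the sixmao package name that appears in the import paths below:

scrapy startproject sixmao
cd sixmao
scrapy genspider sixmaospider www.6mao.com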
The spider, sixmaospider.py:
# -*- coding: utf-8 -*-
import scrapy
from ..items import SixmaoItem


class SixmaospiderSpider(scrapy.Spider):
    name = 'sixmaospider'
    #allowed_domains = ['www.6mao.com']
    start_urls = ['http://www.6mao.com/html/40/40184/12601161.html']  # 圣墟, chapter 1

    def parse(self, response):
        novel_biaoti = response.xpath('//div[@id="content"]/h1/text()').extract()  # chapter title
        novel_neirong = response.xpath('//div[@id="neirong"]/text()').extract()    # body text nodes
        print(novel_biaoti[0])

        # judging by the printout, every other extracted node is blank
        # (whitespace between <br> tags), so step through the list by two
        for i in range(0, len(novel_neirong), 2):
            novelitem = SixmaoItem()  # a fresh item per text fragment
            novelitem['novel_biaoti'] = novel_biaoti[0]
            novelitem['novel_neirong'] = novel_neirong[i]
            yield novelitem

        # the "next chapter" link is the third <a> inside div.s_page
        nextPageURL = response.xpath('//div[@class="s_page"]/a/@href').extract()
        if len(nextPageURL) > 2:
            nexturl = 'http://www.6mao.com' + nextPageURL[2]
            print('Next chapter:', nexturl)
            url = response.urljoin(nexturl)
            # request the next page and parse it with this same callback;
            # dont_filter=False keeps Scrapy's duplicate-URL filter active
            yield scrapy.Request(url, self.parse, dont_filter=False)
        else:
            print('No next-chapter link; stopping')
pipelinesio.py, which saves the content to a local file:
import os
print(os.getcwd())  # debug: show which directory ./data resolves against


class SixmaoPipeline(object):
    def process_item(self, item, spider):
        # append each chapter fragment to a local text file;
        # the ./data directory must already exist
        with open('./data/圣墟.txt', 'a', encoding='utf-8') as fp:
            fp.write(item['novel_neirong'])
        return item
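As a variation, the output file could be opened once per crawl instead of once per item. A minimal sketch using Scrapy's open_spider/close_spider pipeline hooks (the class name SixmaoFilePipeline is made up for illustration; the file path is the same as above):

class SixmaoFilePipeline(object):
    def open_spider(self, spider):
        # open the output file once when the crawl starts
        self.fp = open('./data/圣墟.txt', 'a', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['novel_neirong'])
        return item

    def close_spider(self, spider):
        # close the file when the crawl finishes
        self.fp.close()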
items.py
import scrapy


class SixmaoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    novel_biaoti = scrapy.Field()   # chapter title
    novel_neirong = scrapy.Field()  # chapter text
startsixmao.py — right-click this file and run it, and the project starts crawling:
from scrapy.cmdline import execute

execute(['scrapy', 'crawl', 'sixmaospider'])
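Equivalently, running scrapy crawl sixmaospider from the project root does the same thing; the script exists only so the crawl can be launched from an IDE with one click.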
settings.py
LOG_LEVEL = 'INFO'      # logging level
LOG_FILE = 'novel.log'  # write the log to a file

DOWNLOADER_MIDDLEWARES = {
    'sixmao.middlewares.SixmaoDownloaderMiddleware': 543,
    # disable the built-in user-agent middleware
    # (on current Scrapy the path is scrapy.downloadermiddlewares.useragent;
    # the old scrapy.contrib form no longer exists)
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # enable the user-agent rotation middleware defined below
    'sixmao.rotate_useragent.RotateUserAgentMiddleware': 400,
}

ITEM_PIPELINES = {
    #'sixmao.pipelines.SixmaoPipeline': 300,
    'sixmao.pipelinesio.SixmaoPipeline': 300,  # register the file-output pipeline
}

SPIDER_MIDDLEWARES = {
    'sixmao.middlewares.SixmaoSpiderMiddleware': 543,  # enable the spider middleware
}

Nothing else should need to change.
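Since the spider walks every chapter in sequence, it's worth throttling it. These two standard Scrapy settings are not in the original configuration, so treat them as an optional addition:

DOWNLOAD_DELAY = 1                    # pause between chapter requests
CONCURRENT_REQUESTS_PER_DOMAIN = 1    # fetch one page at a time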
rotate_useragent.py gives the project a rotating User-Agent header, so the server is less likely to block the crawler:
import random

# the UserAgentMiddleware class from Scrapy's user-agent downloader middleware
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware


# RotateUserAgentMiddleware subclasses UserAgentMiddleware.
# Purpose: keep a pool of user-agent strings and attach a randomly chosen one
# to every outgoing request, so the crawler looks like an ordinary browser.
# Many sites deploy anti-crawling measures that refuse requests arriving with
# an obvious bot user-agent, which is why the header is rotated.
class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        # pick a random user-agent for this request
        ua = random.choice(self.user_agent_list)
        if ua:
            print(ua)  # show which user-agent was chosen
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list covers Chrome, IE, Firefox, Mozilla, Opera and Netscape.
    # More user-agent strings: http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]
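One design note on the middleware: headers.setdefault only fills in the User-Agent when the request doesn't already carry one, which is why it pairs with disabling the built-in UserAgentMiddleware in settings.py; assigning request.headers['User-Agent'] = ua instead would overwrite the header unconditionally on every request.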
Final result: the spider walks through the chapters one by one and the text accumulates in ./data/圣墟.txt. And that's a small, complete Scrapy project.