首页 > > 网络编程 > 其它 >

Python使用Scrapy框架爬取数据存入CSV文件(Pytho…

2018-06-18 01:30:43来源：未知阅读 ()

1. Scrapy框架

　　Scrapy是python下实现爬虫功能的框架，能够将数据解析、数据处理、数据存储合为一体功能的爬虫框架。

2. Scrapy安装

1. 安装依赖包

yum install gcc libffi-devel python-devel openssl-devel -y
yum install libxslt-devel -y

2. 安装scrapy

pip install scrapy
pip install twisted==13.1.0

注意事项：scrapy和twisted存在兼容性问题，如果安装twisted版本过高，运行scrapy startproject project_name的时候会提示报错，安装twisted==13.1.0即可。

3. 基于Scrapy爬取数据并存入到CSV

3.1. 爬虫目标，获取简书中热门专题的数据信息，站点为https://www.jianshu.com/recommendations/collections，点击"热门"是我们需要爬取的站点，该站点使用了AJAX异步加载技术，通过F12键——Network——XHR，并翻页获取到页面URL地址为https://www.jianshu.com/recommendations/collections?page=2&order_by=hot，通过修改page=后面的数值即可访问多页的数据，如下图：

3.2. 爬取内容

　　需要爬取专题的内容包括：专题内容、专题描述、收录文章数、关注人数，Scrapy使用xpath来清洗所需的数据，编写爬虫过程中可以手动通过lxml中的xpath获取数据，确认无误后再将其写入到scrapy代码中，区别点在于，scrapy需要使用extract()函数才能将数据提取出来。

3.3 创建爬虫项目

[root@HappyLau jianshu_hot_topic]# scrapy startproject jianshu_hot_topic

#项目目录结构如下:
[root@HappyLau python]# tree jianshu_hot_topic
jianshu_hot_topic
├── jianshu_hot_topic
│   ├── __init__.py
│   ├── __init__.pyc
│   ├── items.py
│   ├── items.pyc
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── pipelines.pyc
│   ├── settings.py
│   ├── settings.pyc
│   └── spiders
│       ├── collection.py
│       ├── collection.pyc
│       ├── __init__.py
│       ├── __init__.pyc
│       ├── jianshu_hot_topic_spider.py    #手动创建文件，用于爬虫数据提取
│       └── jianshu_hot_topic_spider.pyc
└── scrapy.cfg

2 directories, 16 files
[root@HappyLau python]#

3.4 代码内容

1. items.py代码内容,定义需要爬取数据字段
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Item
from scrapy import Field

class JianshuHotTopicItem(scrapy.Item):
	'''
	@scrapy.item，继承父类scrapy.Item的属性和方法，该类用于定义需要爬取数据的子段
	'''
	collection_name = Field()
	collection_description = Field()
	collection_article_count = Field()
	collection_attention_count = Field()

2. piders/jianshu_hot_topic_spider.py代码内容，实现数据获取的代码逻辑，通过xpath实现
[root@HappyLau jianshu_hot_topic]# cat spiders/jianshu_hot_topic_spider.py
#_*_ coding:utf8 _*_

import random
from time import sleep
from scrapy.spiders import CrawlSpider
from scrapy.selector import Selector
from scrapy.http import Request
from jianshu_hot_topic.items import JianshuHotTopicItem 

class jianshu_hot_topic(CrawlSpider):
	'''
	简书专题数据爬取，获取url地址中特定的子段信息
	'''
	name = "jianshu_hot_topic"
	start_urls = ["https://www.jianshu.com/recommendations/collections?page=2&order_by=hot"]

	def parse(self,response):
		'''
		@params:response,提取response中特定字段信息
		'''
		item = JianshuHotTopicItem()
		selector = Selector(response)
		collections = selector.xpath('//div[@class="col-xs-8"]')	
		for collection in collections:
			collection_name = collection.xpath('div/a/h4/text()').extract()[0].strip()
                	collection_description = collection.xpath('div/a/p/text()').extract()[0].strip()
                	collection_article_count = collection.xpath('div/div/a/text()').extract()[0].strip().replace('篇文章','')
                	collection_attention_count = collection.xpath('div/div/text()').extract()[0].strip().replace("人关注",'').replace("· ",'')	
			item['collection_name'] = collection_name
			item['collection_description'] = collection_description
			item['collection_article_count'] = collection_article_count 
			item['collection_attention_count'] = collection_attention_count

			yield item
		
		
		urls = ['https://www.jianshu.com/recommendations/collections?page={}&order_by=hot'.format(str(i)) for i in range(3,11)]
		for url in urls:
			sleep(random.randint(2,7))
			yield Request(url,callback=self.parse)

3. pipelines文件内容，定义数据存储的方式，此处定义数据存储的逻辑，可以将数据存储载MySQL数据库，MongoDB数据库，文件，CSV，Excel等存储介质中，如下以存储载CSV为例：
[root@HappyLau jianshu_hot_topic]# cat pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


import csv 

class JianshuHotTopicPipeline(object):
    def process_item(self, item, spider):
    	f = file('/root/zhuanti.csv','a+')
	writer = csv.writer(f)
	writer.writerow((item['collection_name'],item['collection_description'],item['collection_article_count'],item['collection_attention_count']))
        return item

4. 修改settings文件，
ITEM_PIPELINES = { 
    'jianshu_hot_topic.pipelines.JianshuHotTopicPipeline': 300,
}

3.5 运行scrapy爬虫

　　返回到项目scrapy项目创建所在目录，运行scrapy crawl spider_name即可，如下：

[root@HappyLau jianshu_hot_topic]# pwd
/root/python/jianshu_hot_topic
[root@HappyLau jianshu_hot_topic]# scrapy crawl jianshu_hot_topic
2018-02-24 19:12:23 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: jianshu_hot_topic)
2018-02-24 19:12:23 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 13.1.0, Python 2.7.5 (default, Aug  4 2017, 00:39:18) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-16)], pyOpenSSL 0.13.1 (OpenSSL 1.0.1e-fips 11 Feb 2013), cryptography 1.7.2, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-02-24 19:12:23 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu_hot_topic.spiders', 'SPIDER_MODULES': ['jianshu_hot_topic.spiders'], 'ROBOTSTXT_OBEY': True, 'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0', 'BOT_NAME': 'jianshu_hot_topic'}

查看/root/zhuanti.csv中的数据，即可实现。

4. 遇到的问题总结

1. twisted版本不见容，安装过新的版本导致，安装Twisted (13.1.0)即可

2. 中文数据无法写入，提示'ascii'错误，通过设置python的encoding为utf即可，如下：

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> reload(sys)
<module 'sys' (built-in)>
>>> sys.setdefaultencoding('utf8')
>>> sys.getdefaultencoding()
'utf8'

3. 爬虫无法获取站点数据，由于headers导致，载settings.py文件中添加USER_AGENT变量，如：

USER_AGENT="Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"

Scrapy使用过程中可能会遇到结果执行失败或者结果执行不符合预期，其现实的logs非常详细，通过观察日志内容，并结合代码+网上搜索资料即可解决。

标签：

版权申明：本站文章部分自网络，如有侵权，请联系：west999com@outlook.com
特别注意：本站所有转载文章言论不代表本站观点，本站所提供的摄影照片，插画，设计作品，如需使用，请与原作者联系，版权归原作者所有

上一篇：小白眼中的git操作

下一篇：Python中的类（一）

python3基础之“术语表（2）” 2019-08-13
python3 之字符串编码小结（Unicode、utf-8、gbk、gb2312等 2019-08-13
Python3安装impala 2019-08-13
小白如何入门 Python 爬虫？ 2019-08-13
python_字符串方法 2019-08-13

IDC资讯：主机资讯注册资讯托管资讯 vps资讯网站建设

网站运营：建站经验策划盈利搜索优化网站推广免费资源

网站联盟：联盟新闻联盟介绍联盟点评网赚技巧

行业资讯：搜索引擎网络游戏电子商务广告传媒

网络编程： Asp.Net编程 Asp编程 Php编程 Xml编程 Access Mssql Mysql 其它

服务器技术： Web服务器 Ftp服务器 Mail服务器 Dns服务器安全防护

软件技巧：其它软件 Word Excel Powerpoint Ghost Vista QQ空间 QQ FlashGet 迅雷

网页制作： FrontPages Dreamweaver Javascript css photoshop fireworks Flash

程序设计： Java技术 C/C++ VB delphi

网络知识：网络协议网络安全网络管理组网方案 Cisco技术

操作系统： Win2000 WinXP Win2003 Mac OS Linux FreeBSD

热门词条

最新资讯

热门关注

热门标签