Discuz! Board

 找回密码
 立即注册
搜索
热搜: 活动 交友 discuz
查看: 15026|回复: 0
打印 上一主题 下一主题

Centos下安装Scrapy

[复制链接]

85

主题

89

帖子

600

积分

管理员

Rank: 9Rank: 9Rank: 9

积分
600
跳转到指定楼层
楼主
发表于 2016-8-2 17:57:21 | 只看该作者 回帖奖励 |倒序浏览 |阅读模式
Scrapy是一个开源的机遇twisted框架的python的单机爬虫,该爬虫实际上包含大多数网页抓取的工具包,用于爬虫下载端以及抽取端。
安装环境:

centos5.4python2.7.3

安装步骤:
按 Ctrl+C 复制代码
按 Ctrl+C 复制代码

 验证python2.7安装
[root@zxy-websgs Python-2.7.3]# python2.7Python 2.7.3 (default, Feb 28 2013, 03:08:43) [GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2Type "help", "copyright", "credits" or "license" for more information.>>> exit()
[root@zxy-websgs ~]# wget http://pypi.python.org/packages/ ... tools-0.6c11.tar.gz -P /opt/[root@zxy-websgs opt]# tar zxvf setuptools-0.6c11.tar.gz [root@zxy-websgs setuptools-0.6c11]# python2.7 setup.py  install
3.安装Twisted
[root@zxy-websgs setuptools-0.6c11]# easy_install Twisted......Installed /usr/local/lib/python2.7/site-packages/Twisted-12.3.0-py2.7-linux-x86_64.egg......Installed /usr/local/lib/python2.7/site-packages/zope.interface-4.0.4-py2.7-linux-x86_64.egg
Twisted要安装zope.interface,可以从下面地址下载
twisted:http://twistedmatrix.com/Releases/Twisted/12.1/Twisted-12.1.0.tar.bz2
5.安装w3lib
[url=][/url]
[root@zxy-websgs setuptools-0.6c11]# easy_install -U w3libSearching for w3libReading http://pypi.python.org/simple/w3lib/Reading http://github.com/scrapy/w3libBest match: w3lib 1.2Downloading http://pypi.python.org/packages/ ... 72f185a9eProcessing w3lib-1.2.tar.gzRunning w3lib-1.2/setup.py -q bdist_egg --dist-dir /tmp/easy_install-wm_1BB/w3lib-1.2/egg-dist-tmp-2DQHY_zip_safe flag not set; analyzing archive contents...Adding w3lib 1.2 to easy-install.pth fileInstalled /usr/local/lib/python2.7/site-packages/w3lib-1.2-py2.7.eggProcessing dependencies for w3libFinished processing dependencies for w3lib[url=][/url]

6.安装libxml2或者用easy_install安装lxml
[root@zxy-websgs lxml-3.1.0]# easy_install lxml
验证lxml安装
[root@zxy-websgs lxml-3.1.0]# python2.7Python 2.7.3 (default, Feb 28 2013, 03:08:43) [GCC 4.1.2 20080704 (Red Hat 4.1.2-50)] on linux2Type "help", "copyright", "credits" or "license" for more information.>>> import lxml>>> exit()
也可以安装libxml2,官网上推荐安装2.6.28或者以上的版本,但在官网上没找到,我先是安装的2.6.9的版本,运行scrapy时报以下错误
[url=][/url]
Traceback (most recent call last):  File "/usr/local/bin/scrapy", line 5, in <module>    pkg_resources.run_script('Scrapy==0.14.4', 'scrapy')  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 489, in run_script  File "build/bdist.linux-x86_64/egg/pkg_resources.py", line 1207, in run_script  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/EGG-INFO/scripts/scrapy", line 4, in <module>    execute()  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 112, in execute    cmds = _get_commands_dict(inproject)  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 37, in _get_commands_dict    cmds = _get_commands_from_module('scrapy.commands', inproject)  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 30, in _get_commands_from_module    for cmd in _iter_command_classes(module):  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/cmdline.py", line 21, in _iter_command_classes    for module in walk_modules(module_name):  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/utils/misc.py", line 65, in walk_modules    submod = __import__(fullpath, {}, {}, [''])  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/commands/shell.py", line 8, in <module>    from scrapy.shell import Shell  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/shell.py", line 14, in <module>    from scrapy.selector import XPathSelector, XmlXPathSelector, HtmlXPathSelector  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/__init__.py", line 30, in <module>    from scrapy.selector.libxml2sel import *  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/libxml2sel.py", line 12, in <module>    from .factories import xmlDoc_from_html, xmlDoc_from_xml  File "/usr/local/lib/python2.7/site-packages/Scrapy-0.14.4-py2.7.egg/scrapy/selector/factories.py", line 14, in <module>    libxml2.HTML_PARSE_NOERROR + \AttributeError: 'module' object has no attribute 'HTML_PARSE_RECOVER'[url=][/url]

升级到2.6.21版本以后解决了。
libxml2.6.1:ftp://xmlsoft.org/libxml2/python/libxml2-python-2.6.21.tar.gz
7.安装pyOpenSSL(这个是可选安装的,主要为了使scrapy能够支持https)
用easy_install pyOpenSSL安装的是pyOpenSSL-0.13版本,没安装成功,于是手动下载.011版本来进行安装。
[root@zxy-websgs opt]# wget http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz -P /opt[root@zxy-websgs opt]# tar zxvf pyOpenSSL-0.11.tar.gz [root@zxy-websgs pyOpenSSL-0.11]# python2.7 setup.py install
pyOpenSSL:http://launchpadlibrarian.net/58498441/pyOpenSSL-0.11.tar.gz
8.安装scrapy
[root@zxy-websgs pyOpenSSL-0.11]# easy_install -U Scrapy
验证安装
[url=][/url]
[root@zxy-websgs pyOpenSSL-0.11]# scrapyScrapy 0.16.4 - no active projectUsage:  scrapy <command> [options] [args]Available commands:  fetch         Fetch a URL using the Scrapy downloader  runspider     Run a self-contained spider (without creating a project)  settings      Get settings values  shell         Interactive scraping console  startproject  Create new project  version       Print Scrapy version  view          Open URL in browser, as seen by Scrapy  [ more ]      More commands available when run from project directoryUse "scrapy <command> -h" to see more info about a command[url=][/url]

scrapy:http://pypi.python.org/packages/source/S/Scrapy/Scrapy-0.14.4.tar.gz
总结:
pyOpenSSL单独安装的时候不成功,也可以先下载pyOpenSSL0.11进行安装,再使用easy_install -U Scrapy进行全程安装






回复

使用道具 举报

您需要登录后才可以回帖 登录 | 立即注册

本版积分规则

Archiver|手机版|小黑屋|Comsenz Inc.

GMT+8, 2024-12-16 01:30 , Processed in 0.034850 second(s), 20 queries .

Powered by Discuz! X3.2

© 2001-2013 Comsenz Inc.

快速回复 返回顶部 返回列表