Python Crawler Learning ==> Chapter 8: The Requests Library in Detail


Learning objective:

   The requests library is simpler and more convenient to use than urllib.

Steps

Step 1: What is requests?

  requests is an HTTP library written in Python, built on top of urllib and released under the Apache2 open-source license. It is more convenient than urllib, saves a great deal of work, fully meets the needs of HTTP testing, and is simple and easy to use.
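
To make the conciseness claim concrete, here is a rough side-by-side sketch (assuming Python 3) of the same GET request written with urllib and with requests:

# -*- coding:utf-8 -*-
from urllib import request as urllib_request
import requests

# urllib: open the URL, then read and decode the raw bytes yourself
with urllib_request.urlopen('http://httpbin.org/get') as resp:
    body = resp.read().decode('utf-8')
print(body)

# requests: one call; decoding to str is handled for you
print(requests.get('http://httpbin.org/get').text)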

 

Step 2: A first example

# -*- coding:utf-8 -*-

import requests

response = requests.get('http://www.baidu.com')
print(type(response))          # <class 'requests.models.Response'>
print(response.content)        # raw body as bytes
print(response.status_code)    # HTTP status code, e.g. 200
print(response.text)           # body decoded to str
print(type(response.text))     # <class 'str'>
print(response.cookies)        # RequestsCookieJar sent back by the server

 

Step 3: The various request methods

# -*- coding:utf-8 -*-
import requests

# Each HTTP method has a matching top-level function; all of them return a Response
requests.post('http://httpbin.org/post')
requests.put('http://httpbin.org/put')
requests.delete('http://httpbin.org/delete')
requests.head('http://httpbin.org/get')
requests.options('http://httpbin.org/get')
  1. GET requests
    ① Basic usage
    # -*-  coding:utf-8 -*-
    
    import requests
    
    response = requests.get('http://httpbin.org/get')
    print(response.text)

    Output:

    {
      "args": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Connection": "close", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.18.4"
      }, 
      "origin": "222.94.50.178", 
      "url": "http://httpbin.org/get"
    }


    ② GET requests with parameters

    import requests

    # params= appends the dict to the URL as a query string
    data = {
        'name': 'python',
        'age': 17
    }

    response = requests.get('http://httpbin.org/get', params=data)
    print(response.text)

    Output:

    {
      "args": {
        "age": "17", 
        "name": "python"
      }, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Connection": "close", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.18.4"
      }, 
      "origin": "222.94.50.178", 
      "url": "http://httpbin.org/get?name=python&age=17"
    }

     
    ③ Parsing JSON

    import requests
    import json

    response = requests.get('http://httpbin.org/get')
    print(response.json())
    print(type(response.json()))    # <class 'dict'>
    # response.json() is equivalent to parsing the text yourself:
    print(json.loads(response.text) == response.json())    # True
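
    One caveat worth knowing: response.json() raises a ValueError when the body is not valid JSON (an HTML page, for example). A minimal guard, using baidu.com as an arbitrary non-JSON endpoint:

    import requests

    response = requests.get('http://www.baidu.com')   # returns HTML, not JSON
    try:
        data = response.json()
    except ValueError:
        # the body could not be parsed as JSON
        print('Response is not JSON')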

     




    ④ Fetching binary data

    # -*- coding:utf-8 -*-
    '''
    Save the Baidu logo
    '''
    import requests

    response = requests.get('https://www.baidu.com/img/bd_logo1.png')
    # response.content is bytes, so open the file in binary mode
    with open('baidu.png', 'wb') as f:
        f.write(response.content)   # the with block closes the file automatically
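
    For large files, requests can also stream the body in chunks instead of loading it into memory all at once; a sketch using the same logo URL:

    import requests

    # stream=True defers downloading the body until it is iterated
    response = requests.get('https://www.baidu.com/img/bd_logo1.png', stream=True)
    with open('baidu_stream.png', 'wb') as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:
                f.write(chunk)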

     



    ⑤ Adding headers
    Requesting Zhihu without any headers fails, for example:

    import requests
    
    response = requests.get('https://www.zhihu.com/explore')
    print(response.text)

     

    Output:

    <html><body><h1>500 Server Error</h1>
    An internal server error occured.
    </body></html>

     

    Solution:

    import requests

    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }
    response = requests.get('https://www.zhihu.com/explore', headers=headers)
    print(response.text)

     

    Adding a headers dict with a User-Agent is enough to make the request succeed. I copied the header value out of Chrome's built-in developer tools.


  2. Basic POST requests
    import requests

    # data= sends the dict as a form-encoded request body
    data = {
        'name': 'python', 'age': 18
    }
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
    }

    response = requests.post('http://httpbin.org/post', data=data, headers=headers)
    print(response.json())
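
    Besides form data, requests can send a JSON body directly via the json= parameter, which serializes the dict and sets the Content-Type header for you; a small sketch:

    import requests

    data = {'name': 'python', 'age': 18}

    # json= sends the dict as an application/json body
    response = requests.post('http://httpbin.org/post', json=data)
    print(response.json()['json'])   # httpbin echoes back the parsed JSON body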

     

  3. The response object
    import requests
    '''
    Response attributes
    '''
    response = requests.get('http://www.baidu.com')
    print(response.status_code, type(response.status_code))
    print(response.history, type(response.history))
    print(response.cookies, type(response.cookies))
    print(response.url, type(response.url))
    print(response.headers, type(response.headers))

     Output:

    200 <class 'int'>
    [] <class 'list'>
    <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]> <class 'requests.cookies.RequestsCookieJar'>
    http://www.baidu.com/ <class 'str'>
    {'Server': 'bfe/1.0.8.18', 'Date': 'Thu, 05 Apr 2018 06:27:33 GMT', 'Content-Type': 'text/html', 'Last-Modified': 'Mon, 23 Jan 2017 13:28:24 GMT', 'Transfer-Encoding': 'chunked', 'Connection': 'Keep-Alive', 'Cache-Control': 'private, no-cache, no-store, proxy-revalidate, no-transform', 'Pragma': 'no-cache', 'Set-Cookie': 'BDORZ=27315; max-age=86400; domain=.baidu.com; path=/', 'Content-Encoding': 'gzip'} <class 'requests.structures.CaseInsensitiveDict'>

     



  4. Checking the status code
    Status code reference table: http://www.cnblogs.com/wuzhiming/p/8722422.html
    # -*- coding:utf-8 -*-

    import requests

    # requests.codes maps readable names to numeric status codes,
    # e.g. requests.codes.not_found == 404, requests.codes.ok == 200
    response = requests.get('http://www.cnblogs.com/hello.html')
    if response.status_code != requests.codes.not_found:
        exit()
    print('404 Not Found')

    response1 = requests.get('http://www.baidu.com')
    if response1.status_code != requests.codes.ok:
        exit()
    print('Request Successful')
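
    An alternative to comparing codes by hand is response.raise_for_status(), which raises an HTTPError for any 4xx/5xx response; a minimal sketch:

    import requests
    from requests.exceptions import HTTPError

    response = requests.get('http://www.cnblogs.com/hello.html')
    try:
        response.raise_for_status()   # no-op on 2xx, raises on 4xx/5xx
    except HTTPError as e:
        print('Bad status:', e)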

     

  5. Advanced usage
    ① File upload
    import requests

    # files= sends a multipart/form-data upload
    file = {'file': open('baidu.png', 'rb')}
    response = requests.post('http://httpbin.org/post', files=file)
    print(response.text)

     Output omitted.
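
    If you need to control the uploaded filename and content type, the dict value can be a (filename, fileobj, content_type) tuple instead of a bare file object; a sketch:

    import requests

    # the tuple sets the multipart field's filename and MIME type explicitly
    with open('baidu.png', 'rb') as f:
        files = {'file': ('baidu.png', f, 'image/png')}
        response = requests.post('http://httpbin.org/post', files=files)
    print(response.status_code)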


    ② Getting cookies
    import requests

    response = requests.get('http://www.baidu.com')
    cookies = response.cookies             # a RequestsCookieJar
    print(cookies)
    for key, value in cookies.items():     # iterates like a dict
        print(key + '=' + value)
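
    Cookies can also be sent with a request by passing a plain dict to the cookies= parameter; httpbin's /cookies endpoint echoes back what it received:

    import requests

    cookies = {'number': '123456789'}
    response = requests.get('http://httpbin.org/cookies', cookies=cookies)
    print(response.text)   # {"cookies": {"number": "123456789"}}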

     

    ③ Session persistence
    import requests

    s = requests.Session()
    # /cookies/set/... sets a cookie; the Session carries it to later requests
    s.get('http://httpbin.org/cookies/set/number/123456789')
    response = s.get('http://httpbin.org/cookies')
    print(response.text)   # {"cookies": {"number": "123456789"}}
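
    For contrast, the same two calls without a Session share no state, so the second request sees no cookie:

    import requests

    requests.get('http://httpbin.org/cookies/set/number/123456789')
    response = requests.get('http://httpbin.org/cookies')
    print(response.text)   # {"cookies": {}} because the two requests are independent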

     

    ④ Certificate verification
    import requests

    # verify=False skips SSL certificate verification
    response = requests.get('https://www.12306.cn', verify=False)
    print(response.status_code)

     Specifying a certificate manually:

    response1 = requests.get('https://www.12306.cn',cert=('/path/server.crt','/path/key'))
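
    With verify=False, urllib3 emits an InsecureRequestWarning on every request; it can be silenced explicitly. A sketch (urllib3 ships as a dependency of requests):

    import requests
    import urllib3

    # suppress the warning that verify=False triggers
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

    response = requests.get('https://www.12306.cn', verify=False)
    print(response.status_code)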

     


    ⑤ Proxy settings
    import requests

    # Usage sketch; substitute a real proxy (free proxies can be found online).
    # Note: a dict cannot hold duplicate keys, so use one entry per scheme.
    proxies = {
        'http': 'http://127.0.0.1:port',
        'https': 'https://ip:port',
        # for a proxy that requires authentication:
        # 'http': 'http://username:password@ip:port',
    }

    response = requests.get('http://www.baidu.com', proxies=proxies)
    print(response.status_code)
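
    Besides HTTP proxies, requests also supports SOCKS proxies once the optional dependency is installed (pip install requests[socks]); a sketch with placeholder credentials:

    import requests

    # requires: pip install requests[socks]
    proxies = {
        'http': 'socks5://username:password@ip:port',
        'https': 'socks5://username:password@ip:port'
    }
    response = requests.get('http://www.baidu.com', proxies=proxies)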

     

    ⑥ Timeout settings
    import requests

    # raises an exception if the request takes longer than 1 second
    response = requests.get('http://httpbin.org/get', timeout=1)
    print(response.status_code)
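
    The timeout can also be a (connect, read) tuple to limit the two phases separately:

    import requests

    # up to 3s to establish the connection, up to 10s to read the response
    response = requests.get('http://httpbin.org/get', timeout=(3, 10))
    print(response.status_code)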

     

    ⑦ Authentication
    import requests
    from requests.auth import HTTPBasicAuth

    # the two calls are equivalent: a (user, password) tuple implies basic auth
    response = requests.get('http://127.0.0.1:8888', auth=('user', 'password'))
    response1 = requests.get('http://127.0.0.1:8888', auth=HTTPBasicAuth('user', 'password'))
    print(response.status_code)

     PS: 127.0.0.1:8888 is just a placeholder.


    ⑧ Exception handling
    import requests
    from requests.exceptions import ReadTimeout, HTTPError, RequestException

    try:
        # 0.01s is deliberately too short, so this raises a timeout
        response = requests.get('http://httpbin.org/get', timeout=0.01)
        print(response.status_code)
    except ReadTimeout:
        print("TIME OUT")
    except HTTPError:
        print('HTTP ERROR')
    except RequestException:
        print("ERROR")
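
    All of these exceptions inherit from RequestException, so catch the specific ones first. ConnectionError (DNS failure, refused connection) is another common one; a sketch against a deliberately unresolvable host:

    import requests
    from requests.exceptions import ConnectionError, RequestException

    try:
        response = requests.get('http://no-such-host.invalid')
    except ConnectionError:
        print('CONNECTION ERROR')   # DNS failure or connection refused
    except RequestException:
        print('ERROR')              # catch-all base class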

     

 

Summary:

   Studying crawlers is a way to further consolidate the fundamentals of Python, which is exactly my goal here; the following chapters continue from this one.
