Python can scrape apps too, not just web pages! Batch-scraping Douyin…

2018-08-17 09:47:19  Source: 博客园 (cnblogs)

Introduction

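This time the target is Douyin, one of the hottest apps right now: we batch-download every video that a given user has published. As you probably know, Douyin is mobile-only; the official website shows nothing but a QR code for downloading the app. Compared with scraping a desktop website, this raised more problems and took more time, but it basically works now, apart from one issue discussed later.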

Google Chrome can emulate a mobile device. Opening the page in iPhone X mode does indeed show the user's published videos.

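Next, look at the background requests. There are not many of them, and the one carrying the video information is quick to find; the video address it contains can be opened and played directly. At this point it feels like most of the work is done.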
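
As a rough illustration of what such a response carries, here is a minimal Python 3 sketch that pulls the play address and description out of one payload. The field names (aweme_list, video, play_addr, url_list, share_info, share_desc) are the ones the article's own code reads; the sample payload, its URLs, and the helper name extract_videos are made up for illustration, and a real response contains many more fields.

```python
import json

# Made-up sample shaped like the fields the article's code reads.
# The URLs and descriptions are placeholders, not real data.
SAMPLE_RESPONSE = json.dumps({
    "status_code": 0,
    "has_more": 0,
    "max_cursor": 0,
    "aweme_list": [
        {
            "video": {"play_addr": {"url_list": ["https://example.com/v1.mp4"]}},
            "share_info": {"share_desc": "first clip"},
        },
        {
            "video": {"play_addr": {"url_list": ["https://example.com/v2.mp4"]}},
            "share_info": {"share_desc": "second clip"},
        },
    ],
})


def extract_videos(raw_json):
    """Return (url, title) pairs from one response body (hypothetical helper)."""
    data = json.loads(raw_json)
    # A non-zero status_code is treated as a failed request, as in the article
    if data.get("status_code") != 0:
        return []
    pairs = []
    for aweme in data.get("aweme_list", []):
        url = aweme["video"]["play_addr"]["url_list"][0]
        title = aweme["share_info"]["share_desc"]
        pairs.append((url, title))
    return pairs


if __name__ == "__main__":
    for url, title in extract_videos(SAMPLE_RESPONSE):
        print(title, url)
```

Returning pairs keeps each title next to its URL, so the two cannot drift out of step the way two parallel lists could.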

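The two request addresses are identical except for max_cursor. The max_cursor in the JSON returned by one request is the max_cursor used in the next link, and has_more equal to 1 means that not everything has been loaded yet. That makes the logic clear: check whether has_more equals 1; if it does, pass the returned max_cursor into the next link and keep requesting video addresses, until has_more equals 0.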
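
The paging rule described above can be sketched in Python 3 roughly as follows. This is only a sketch under assumptions: the endpoint and query parameter names are copied from the URL used in the article's code, while build_url, iter_pages, and the fetch_json callable (URL in, parsed JSON dict out) are hypothetical names introduced here so the loop can be exercised without a browser.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the article's code
BASE_URL = "https://www.amemv.com/aweme/v1/aweme/post/"


def build_url(user_id, max_cursor, signature, dytk):
    """Build one page URL; only max_cursor changes between pages."""
    query = urlencode([
        ("user_id", user_id),
        ("count", 32),
        ("max_cursor", max_cursor),
        ("aid", 1128),
        ("_signature", signature),
        ("dytk", dytk),
    ])
    return BASE_URL + "?" + query


def iter_pages(fetch_json, user_id, signature, dytk):
    """Yield each page, following max_cursor while has_more == 1.

    fetch_json is a hypothetical callable: it takes a URL and returns
    the parsed JSON response as a dict.
    """
    max_cursor = 0  # the first request always starts at 0
    while True:
        page = fetch_json(build_url(user_id, max_cursor, signature, dytk))
        if page.get("status_code") != 0:
            break  # request failed, stop paging
        yield page
        if page.get("has_more") != 1:
            break  # everything has been loaded
        # The cursor returned by this page selects the next page
        max_cursor = page["max_cursor"]
```

Because the fetching step is passed in, the same loop can be driven by a Selenium-based fetcher like the one in the article or by a stub that returns canned pages.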

With that, we have every video address, so let's start downloading!
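
For the download step, one possible shape in Python 3 is to write the body chunk by chunk and report progress as a percentage. In the sketch below, save_stream and its report callback are hypothetical names; the chunks argument is assumed to be any iterable of bytes, such as the one returned by iter_content on a streamed requests response, and the demo feeds it fake chunks so that it runs without network access.

```python
import io


def save_stream(chunks, out_file, total_size, report=None):
    """Write byte chunks to out_file and report percent progress.

    chunks:     any iterable of bytes objects
    out_file:   a binary file-like object opened for writing
    total_size: expected size in bytes (e.g. from Content-Length)
    report:     optional callable receiving the percentage as a float
    """
    written = 0
    for chunk in chunks:
        if not chunk:
            continue  # skip empty keep-alive chunks
        out_file.write(chunk)
        written += len(chunk)
        if report is not None and total_size > 0:
            # Multiply by 100.0 first so the result is a true float percentage
            report(written * 100.0 / total_size)
    return written


if __name__ == "__main__":
    # Fake chunks stand in for a streamed HTTP response body
    fake_chunks = [b"a" * 1024, b"b" * 1024, b"c" * 512]
    buffer = io.BytesIO()
    total = save_stream(fake_chunks, buffer, 2560,
                        report=lambda pct: print("[Progress]: %.2f%%" % pct))
    print(total, "bytes written")
```

Keeping the writer separate from the HTTP call makes the progress arithmetic easy to check on its own.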

Code

# -*- coding: utf-8 -*-
from selenium import webdriver
from bs4 import BeautifulSoup
import json
import requests
import sys
import time
import os
import uuid
from contextlib import closing
from requests.packages.urllib3.exceptions import InsecureRequestWarning

# Downloads are made with verify=False, so silence the resulting urllib3 warning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

class douyin_spider(object):
    """Download every video published by one Douyin user."""

    def __init__(self, user_id, _signature, dytk):
        print '*******DouYin_spider******'
        print 'Author : Awesome_Tang'
        print 'Date : 2018-07-29'
        print 'Version: Python2.7'
        print '**************************'
        print ''
        self.userid = user_id
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36'}
        # Have Chrome emulate an iPhone X when loading pages
        mobile_emulation = {'deviceName': 'iPhone X'}
        options = webdriver.ChromeOptions()
        options.add_experimental_option("mobileEmulation", mobile_emulation)
        self.browser = webdriver.Chrome(chrome_options=options)
        self._signature = _signature
        self.dytk = dytk
        # First page: max_cursor starts at 0
        self.url = 'https://www.amemv.com/aweme/v1/aweme/post/?user_id=%s&count=32&max_cursor=0&aid=1128&_signature=%s&dytk=%s' % (self.userid, self._signature, self.dytk)

    def handle_url(self):
        """Collect the URL of every result page for this user."""
        url_list = [self.url, ]
        self.browser.get(self.url)
        # Chrome renders the JSON response inside a <pre> element
        web_data = self.browser.page_source
        soup = BeautifulSoup(web_data, 'lxml')
        web_data = soup.pre.string
        web_data = json.loads(str(web_data))
        if web_data['status_code'] == 0:
            while web_data['has_more'] == 1:
                # Each request loads at most 32 videos; has_more == 1 means more remain
                max_cursor = web_data['max_cursor']
                # max_cursor (a timestamp) from this response selects the next page
                url = 'https://www.amemv.com/aweme/v1/aweme/post/?user_id=%s&count=32&max_cursor=%s&aid=1128&_signature=%s&dytk=%s' % (self.userid, max_cursor, self._signature, self.dytk)
                url_list.append(url)
                self.browser.get(url)
                web_data = self.browser.page_source
                soup = BeautifulSoup(web_data, 'lxml')
                web_data = soup.pre.string
                web_data = json.loads(str(web_data))
            else:
                # while-else: runs once has_more is no longer 1 and queues
                # one more URL built from the final max_cursor
                max_cursor = web_data['max_cursor']
                url = 'https://www.amemv.com/aweme/v1/aweme/post/?user_id=%s&count=32&max_cursor=%s&aid=1128&_signature=%s&dytk=%s' % (self.userid, max_cursor, self._signature, self.dytk)
                url_list.append(url)
        else:
            # The first request failed, so there is nothing to download
            url_list = []
        return url_list

    def get_download_url(self, url_list):
        """Visit every page URL and collect video addresses and titles."""
        download_url = []
        title_list = []
        if len(url_list) > 0:
            for url in url_list:
                self.browser.get(url)
                web_data = self.browser.page_source
                soup = BeautifulSoup(web_data, 'lxml')
                web_data = soup.pre.string
                web_data = json.loads(str(web_data))
                if web_data['status_code'] == 0:
                    for aweme in web_data['aweme_list']:
                        # First play address of the video
                        download_url.append(aweme['video']['play_addr']['url_list'][0])
                        # The share description is used as the video title
                        title_list.append(aweme['share_info']['share_desc'].encode('utf-8'))
        return download_url, title_list

    def videodownloader(self, url, title):
        """Stream one video to <title>.mp4 and print the progress."""
        size = 0
        path = title + '.mp4'
        with closing(requests.get(url, headers=self.headers, stream=True, verify=False)) as response:
            chunk_size = 1024
            content_size = int(response.headers['content-length'])
            if response.status_code == 200:
                print '%s is downloading...' % title
                # Divide by floats: Python 2 integer division would truncate the size
                sys.stdout.write('[File Size]: %0.2f MB\n' % (content_size / 1024.0 / 1024))
                with open(path, 'wb') as f:
                    for data in response.iter_content(chunk_size=chunk_size):
                        f.write(data)
                        size += len(data)
                        f.flush()
                        # Multiply by 100.0 first so the percentage is not truncated to 0;
                        # '\r' rewrites the same console line on each update
                        sys.stdout.write('[Progress]: %0.2f%%' % (size * 100.0 / content_size) + '\r')
                        sys.stdout.flush()
            else:
                print response.status_code

    def run(self):
        """Collect page URLs, extract video addresses, then download each video."""
        url_list = self.handle_url()
        download_url, title_list = self.get_download_url(url_list)
        for url, title in zip(download_url, title_list):
            self.videodownloader(url, title)

if __name__ == '__main__':
    # Create the spider, passing three arguments: user_id, _signature, dytk
    # (named spider so the instance does not shadow the class name)
    spider = douyin_spider('95870186531', 'RFE1JhAbHxD7J8TA9HCumURRNT', '539f2c59bb57577983b3818b7a7f32ef')
    spider.run()
    print '******DouYin_spider@Awesome_Tang、******'
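
One detail worth noting in the listing: the output path is built as title + '.mp4', so a description containing a character such as / can make open() fail. A possible guard, written here for Python 3 rather than the Python 2.7 the listing targets, is sketched below. The helper name safe_filename, the replacement character, the default name, and the length limit are all assumptions for illustration, not part of the original script.

```python
import re


def safe_filename(title, default="untitled", max_length=100):
    """Turn a free-text title into an .mp4 file name (hypothetical helper)."""
    # Replace path separators, characters Windows rejects, and line breaks
    cleaned = re.sub(r'[\\/:*?"<>|\r\n\t]', "_", title)
    # Drop leading/trailing spaces and dots, then cap the length
    cleaned = cleaned.strip(" .")[:max_length]
    # Fall back to a default name when nothing usable is left
    return (cleaned or default) + ".mp4"


if __name__ == "__main__":
    print(safe_filename("day 1/2: beach?"))
    print(safe_filename(""))
```

Two videos with the same description would still map to the same file name, so a real script would also need some way to tell them apart.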
