These notes are not comprehensive.
The requests module
requests: a third-party Python library for sending network requests; it is powerful, simple and convenient to use, and highly efficient.
Usage workflow
1. Specify the URL
2. Send the request
3. Get the response data
4. Persist the data
These are also the basic four-step pattern of a crawler.
Environment setup:
pip install requests
Code
```python
import requests

if __name__ == "__main__":
    # Step 1: specify the URL
    url = 'https://www.sogou.com/'
    # Step 2: send the request
    response = requests.get(url=url)
    # Step 3: get the response data
    page_text = response.text
    print(page_text)
    # Step 4: persist the data
    with open('./sogou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print('爬取结束!')
```
Tips
UA detection: a portal site's server checks the identity (User-Agent) of the carrier behind each request. If that identity is a known browser, the request is treated as a normal request; requests whose carrier is not a browser are mostly treated as crawlers.
UA spoofing: disguise the crawler's request-carrier identity as that of a real browser.
Code
```python
import requests

if __name__ == "__main__":
    # UA spoofing: pretend to be a Firefox browser
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
    }
    url = 'https://www.baidu.com/s'
    Dic = input('plz enter your word:')
    # Query parameters carried by the GET request
    param = {
        'wd': Dic
    }
    response = requests.get(url=url, params=param, headers=headers)
    page_text = response.text
    fileName = Dic + '.html'
    with open(fileName, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(fileName, '保存成功!')
```
Dynamically loaded data
On the home page, the company information is requested dynamically via Ajax, and the data on each detail page is also loaded dynamically.
After observing the requests: every POST request uses the same URL, and only the value of the id parameter differs. We can therefore batch-collect the ids of many companies and combine each id with that URL to form the complete Ajax request for the corresponding detail page's data (a sketch of this pattern follows the example below).
Code
```python
import requests
import json

if __name__ == "__main__":
    url = 'https://movie.douban.com/j/chart/top_list'
    param = {
        'type': '6',
        'interval_id': '100:90',
        'action': '',
        'start': '0',
        'limit': '20',
    }
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
    }
    response = requests.get(url=url, params=param, headers=headers)
    list_data = response.json()
    fp = open('./douban.json', 'w', encoding='utf-8')
    json.dump(list_data, fp=fp, ensure_ascii=False)
    print('生成!!')
```
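The example above uses a GET request against Douban's ranking endpoint. The company-detail case described before it (one shared POST URL where only the id parameter changes) would follow a pattern like the sketch below; the endpoint URL, the id list, and the parameter name are assumptions made purely for illustration, not a real interface.

```python
import requests
import json

# Hypothetical Ajax endpoint and ids -- placeholders for the pattern described above
post_url = 'https://example.com/detail/queryDetail'
company_ids = ['id_001', 'id_002', 'id_003']
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
}

all_detail = []
for cid in company_ids:
    data = {'id': cid}  # only the id differs between the POST requests
    response = requests.post(url=post_url, data=data, headers=headers)
    all_detail.append(response.json())  # each response is one detail record

with open('./all_detail.json', 'w', encoding='utf-8') as fp:
    json.dump(all_detail, fp=fp, ensure_ascii=False)
```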
Data parsing
Overview of the data-parsing principle
The local text content to be parsed is stored either between tags or in tag attributes, so parsing is done in two steps:
1. Locate the specified tags.
2. Extract (parse) the data stored in the tags or in their attributes.
Usage workflow
Specify the URL
Send the request
Get the response data
Parse the data
Persist the data
Code: download an image from a page
```python
import requests

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
    }
    url = 'https://blog.lxscloud.top/static/post/esp32_cam/360%E6%88%AA%E5%9B%BE16810304334270.jpg'
    # .content returns the binary response data (the image bytes)
    img_data = requests.get(url=url, headers=headers).content
    with open('./my.jpg', 'wb') as fp:
        fp.write(img_data)
```
Regular expressions
Refer to this article to learn regular expressions.
Regex parsing:
```html
<div class="thumb">
    <a href="/article/125091174" target="_blank">
        <img src="//pic.qiushibaike.com/system/pictures/12509/125091174/medium/Z9FEKWUBTD8D11QV.jpg" alt="糗事#125091174" class="illustration" width="100%" height="auto">
    </a>
</div>
```
ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
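A quick, self-contained check of this pattern against the snippet above (the HTML string is copied from the sample; only the script around it is new):

```python
import re

html = '''<div class="thumb">
<a href="/article/125091174" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12509/125091174/medium/Z9FEKWUBTD8D11QV.jpg" alt="糗事#125091174" class="illustration" width="100%" height="auto">
</a>
</div>'''

ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
# re.S lets '.' match newlines, so the pattern can span the whole <div> block
print(re.findall(ex, html, re.S))
# ['//pic.qiushibaike.com/system/pictures/12509/125091174/medium/Z9FEKWUBTD8D11QV.jpg']
```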
Code: use a regular expression to crawl all images on the specified pages
```python
import requests
import re
import os

if __name__ == "__main__":
    if not os.path.exists('./qiutuLibs'):
        os.mkdir('./qiutuLibs')
    url = 'https://www.qiushibaike.com/imgrank/page/%d/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
    }
    for pageNum in range(1, 3):
        new_url = format(url % pageNum)
        page_text = requests.get(url=new_url, headers=headers).text
        ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
        img_src_list = re.findall(ex, page_text, re.S)
        for src in img_src_list:
            src = 'https:' + src
            img_data = requests.get(url=src, headers=headers).content
            img_name = src.split('/')[-1]
            img_path = './qiutuLibs/' + img_name
            with open(img_path, 'wb') as fp:
                fp.write(img_data)
            print(img_name, '下载成功!!')
```
bs4
Parsing with bs4
Principle of data parsing:
1. Locate the tags.
2. Extract the data stored in the tags and their attributes.
Principle of bs4 data parsing:
1. Instantiate a BeautifulSoup object and load the page source data into it.
2. Call the BeautifulSoup object's attributes and methods to locate tags and extract data.
Environment setup:
pip install bs4
pip install lxml
How to instantiate a BeautifulSoup object:
from bs4 import BeautifulSoup
Instantiating the object:
1. Load the data of a local HTML document into the object: fp = open('./test.html', 'r', encoding='utf-8'); soup = BeautifulSoup(fp, 'lxml')
2. Load page source fetched from the internet into the object: page_text = response.text; soup = BeautifulSoup(page_text, 'lxml')
Methods and attributes provided for data parsing:
soup.tagName: returns the first tag named tagName in the document
soup.find():
find('tagName'): equivalent to soup.tagName (e.g. soup.div)
Attribute-based locating:
soup.find('div', class_/id/attr='song')
soup.find_all('tagName'): returns all matching tags (as a list)
select:
select('some selector (id, class, tag, ... selector)'): returns a list
Layered selectors:
soup.select('.tang > ul > li > a'): > indicates one level
soup.select('.tang > ul a'): a space indicates multiple levels
Getting the text between tags:
soup.a.text/string/get_text()
text/get_text(): gets all of the text content inside a tag
string: gets only the text directly inside the tag
Getting an attribute value from a tag: soup.a['href']
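A small, self-contained demonstration of these attributes and methods on an inline HTML string (the markup is invented purely for illustration):

```python
from bs4 import BeautifulSoup

html = '''
<div class="tang">
    <ul>
        <li><a href="http://example.com/1" title="one">first song</a></li>
        <li><a href="http://example.com/2" class="song">second song</a></li>
    </ul>
</div>
'''
soup = BeautifulSoup(html, 'lxml')

print(soup.a)                                       # first <a> in the document
print(soup.find('a', class_='song'))                # attribute-based locating
print(len(soup.find_all('a')))                      # 2 -- all matching tags, as a list
print(soup.select('.tang > ul > li > a')[0].text)   # 'first song' -- > is one level
print(soup.select('.tang a')[1]['href'])            # 'http://example.com/2' -- attribute value
print(soup.div.text)                                # all text under the div (both titles)
print(soup.div.string)                              # None -- the div has no single direct text child
```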
Code
```python
import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    page_text = requests.get(url=url, headers=headers).text
    soup = BeautifulSoup(page_text, 'lxml')
    # Locate every chapter entry in the table of contents
    li_list = soup.select('.book-mulu > ul > li')
    fp = open('./sanguo.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        d_url = 'https://www.shicimingju.com' + li.a['href']
        d_text = requests.get(url=d_url, headers=headers).text
        d_soup = BeautifulSoup(d_text, 'lxml')
        # The chapter body is stored in <div class="chapter_content">
        div_tag = d_soup.find('div', class_='chapter_content')
        content = div_tag.text
        fp.write(title + ':' + content + '\n')
        print(title, '爬取成功')
```
xpath
Principle of xpath parsing:
1. Instantiate an etree object and load the page source data to be parsed into it.
2. Call the etree object's xpath method with an xpath expression to locate tags and capture content.
Environment setup: pip install lxml
How to instantiate an etree object: from lxml import etree
1. Load the source of a local HTML document into an etree object: etree.parse(filePath)
2. Load source fetched from the internet into the object: etree.HTML(page_text)
xpath('xpath expression')
/: locates from the root node; each / represents one level. //: represents multiple levels
Attribute-based locating: //div[@class="song"]  ->  tag[@attrName="attrValue"]
Index-based locating: //div[@class="song"]/p[3]  (indexing starts at 1)
Getting text:
/text(): gets only the text directly inside the tag
//text(): gets all text inside the tag, including non-direct text
Getting an attribute: /@attrName, e.g. img/@src
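A minimal demonstration of these expressions on an inline HTML string (the markup is invented purely for illustration):

```python
from lxml import etree

html = '''
<html><body>
<div class="song">
    <p>one</p>
    <p>two</p>
    <p>three</p>
    <a href="http://example.com"><img src="/img/a.jpg" alt="pic"></a>
</div>
</body></html>
'''
tree = etree.HTML(html)

print(tree.xpath('/html/body/div'))                    # locate level by level from the root
print(tree.xpath('//div[@class="song"]'))              # attribute-based locating
print(tree.xpath('//div[@class="song"]/p[3]/text()'))  # ['three'] -- indexing starts at 1
print(tree.xpath('//div[@class="song"]//text()'))      # all text under the div, direct or not
print(tree.xpath('//div[@class="song"]/a/img/@src'))   # ['/img/a.jpg'] -- take an attribute
```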
Code
```python
import requests
from lxml import etree

if __name__ == "__main__":
    url = 'https://bj.58.com/ershoufang/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
    }
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//section[@class="list"]/div')
    fp = open('58.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.xpath('./a/div[2]/div[1]/div[1]/h3/text()')[0]
        print(title)
        fp.write(title + '\n')
```
```python
import requests
from lxml import etree
import os

if __name__ == "__main__":
    url = 'https://pic.netbian.com/4kdongman/'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
    }
    page_text = requests.get(url=url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//div[@class="slist"]/ul/li')
    if not os.path.exists('./picLibs'):
        os.mkdir('./picLibs')
    for li in li_list:
        img_src = 'https://pic.netbian.com' + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/img/@alt')[0] + '.jpg'
        # The page is GBK-encoded; re-decode the name to avoid garbled characters
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        img_data = requests.get(url=img_src, headers=headers).content
        img_path = 'picLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'Download succeed!')
```
My own code: crawling résumé templates
```python
import requests
from lxml import etree
import os

if __name__ == "__main__":
    url = 'https://sc.chinaz.com/jianli/free_%d.html'
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
    }
    if not os.path.exists('./pics'):
        os.mkdir('./pics')
    for page in range(1, 21):
        # The first listing page has a different URL from the later ones
        if page == 1:
            new_url = 'http://sc.chinaz.com/jianli/free.html'
        else:
            new_url = format(url % page)
        page_text = requests.get(url=new_url, headers=headers).text
        tree = etree.HTML(page_text)
        li_list = tree.xpath('//*[@id="container"]/div')
        for li in li_list:
            # Link and name of each template's detail page
            detail_url = 'http:' + li.xpath('./a/@href')[0]
            file_name = li.xpath('./a/img/@alt')[0] + '.rar'
            file_name = file_name.encode('iso-8859-1').decode('utf-8')
            detail_text = requests.get(url=detail_url, headers=headers).text
            detail_tree = etree.HTML(detail_text)
            # Pick one of the mirror download links on the detail page
            download_src = detail_tree.xpath('//div[@id="down"]/div[2]//li[6]/a/@href')[0]
            file_data = requests.get(url=download_src, headers=headers).content
            file_path = 'pics/' + file_name
            with open(file_path, 'wb') as fp:
                fp.write(file_data)
            print(file_name, 'Download succeed!')
```
CAPTCHA recognition
Anti-crawling mechanism: the CAPTCHA. Recognizing the data in a CAPTCHA image is required for simulated-login operations.
How to recognize a CAPTCHA:
- Manually, by eye
- Third-party automatic recognition, e.g. Yundama: http://www.yundama.com/demo.html
Yundama usage flow:
- Register: normal-user and developer accounts
- Log in:
  - Normal-user login: check whether the account still has credit left
  - Developer login:
    - Create a software entry: My Software -> Add New Software -> enter a software name -> submit (this returns a software id and secret key)
    - Download the example code: Developer Docs -> Download Here: Yundama API DLL -> PythonHTTP example download
Simulated login:
- Crawl user information that is tied to a particular user account
Requirement: perform a simulated login on Renren
- Clicking the login button sends a POST request
- That POST request carries the login information entered beforehand (username, password, CAPTCHA)
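A hedged sketch of that POST flow with requests is below. The login URL, the form field names, and the recognize_captcha helper are assumptions for illustration; the real values have to be taken from the browser's network capture and from the recognition platform's example code.

```python
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0'
}
session = requests.Session()  # a Session keeps the login cookies for later requests


def recognize_captcha(img_path):
    """Hypothetical placeholder for the third-party recognition call."""
    raise NotImplementedError


# 1. Download the CAPTCHA image shown on the login page (URL is a placeholder).
code_img_url = 'http://www.renren.com/captcha'
with open('./code.jpg', 'wb') as fp:
    fp.write(session.get(url=code_img_url, headers=headers).content)
code_text = recognize_captcha('./code.jpg')

# 2. Send the POST request that the login button triggers, carrying the
#    username, password and recognized CAPTCHA (field names are assumptions).
login_url = 'http://www.renren.com/PLogin.do'
data = {
    'email': 'your_account',
    'password': 'your_password',
    'icode': code_text,
}
response = session.post(url=login_url, data=data, headers=headers)
print(response.status_code)

# 3. Requests made through the same session now carry the login cookies,
#    so pages tied to this user can be crawled.
```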
Thanks for reading.