Python Project 1: Web Crawler Study Guide

Last updated: afternoon of August 3, 2024

Notes from learning Python project 1, a web crawler.

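Basic steps

Step 1: Install the requests and BeautifulSoup libraries

In PyCharm's settings, open the project interpreter page and search for the packages to install.

If you get an error saying the package version does not match your Python version, install the latest version of the package.

Import the two modules at the top of the crawler script:

import requests
from bs4 import BeautifulSoup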

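Step 2: Get the headers and cookies the crawler needs

Most crawlers need request headers and cookies. They often decide whether the site serves the crawler the same page a browser would get.

First open the page you want to crawl and press F12 to open the browser's developer tools.

Switch to the Network tab and press Ctrl+R to reload the page. If requests are already listed there is no need to reload, although reloading does no harm.

Then browse the Name column, find the request for the page you want to crawl, right-click it, choose Copy, and copy the URL.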
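To check what the copied headers and cookies turn into, a request can be prepared without being sent. The following is a minimal sketch; the header and cookie values are placeholders, not values from a real session.

```python
import requests

# Placeholder values; copy the real ones from the Network tab.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Referer": "https://www.bing.com/",
}
cookies = {"session_id": "abc123"}

# Build the request without sending it, so it can be inspected offline.
req = requests.Request(
    "GET", "https://tophub.today/n/KqndgxeLl9", headers=headers, cookies=cookies
)
prepared = req.prepare()

# The cookies dict is folded into a single Cookie header.
for name, value in prepared.headers.items():
    print(f"{name}: {value}")
```

Passing the same two dicts to requests.get() sends exactly these headers.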

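Step 3: Fetch the page

Send the request with requests:

response = requests.get('https://tophub.today/n/KqndgxeLl9', cookies=cookies, headers=headers)

The requests.get(url) function

The simplest code for fetching a page is r = requests.get(url), where url is the target address.

requests.get(url) builds a Request object that asks the server for a resource; the requests library creates this object internally.

The requests library has two important objects, Request and Response. The Request object represents the request sent to the target URL. The Response object holds what the server returned.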

Example:

import requests
# Fetch the page with get()
r = requests.get('https://www.baidu.com')
# Check the status code
print(r.status_code)
# Check the type of r
print(type(r))
# Print the response headers
print(r.headers)

Output:

200
<class 'requests.models.Response'>
{'Content-Encoding': 'gzip', 'Content-Length': '1145', 'Content-Type': 'text/html', 'Server': 'bfe', 'Date': 'Tue, 24 Mar 2020 07:31:58 GMT'}

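The Response object has the following attributes:

r.status_code: the HTTP status code of the response; 200 means success, 404 means the page was not found
r.text: the response body as a string, i.e. the content of the page at the URL
r.encoding: the encoding guessed from the HTTP headers, which you can also set yourself. The encoding detected from the content itself is in r.apparent_encoding. If the two differ, r.text may come out garbled and the encoding has to be replaced:
# Inspect both encodings
print(r.encoding)
print(r.apparent_encoding)
# Replace the encoding, assuming the response body is actually utf-8
r.encoding = 'utf-8'
r.apparent_encoding: the encoding inferred from the response content (the fallback encoding)
r.content: the response body as bytes

These are the attributes you will use most when fetching a page. If the status code is 200, use the other attributes to read the page content.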
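Why a wrong r.encoding garbles r.text can be shown without any network access. This sketch assumes a page that is really UTF-8 but gets decoded as ISO-8859-1, the fallback requests uses for text responses that declare no charset; the sample string is only an illustration.

```python
# The bytes a UTF-8 page would deliver (what r.content would hold).
raw = "微博热搜".encode("utf-8")

# Decoding with the wrong encoding gives mojibake instead of an error.
wrong = raw.decode("ISO-8859-1")
# Decoding with the right encoding, as after r.encoding = 'utf-8'.
right = raw.decode("utf-8")

print(repr(wrong))
print(right)
```

The bytes are identical in both cases; only the encoding used to interpret them changes, which is why setting r.encoding before reading r.text is enough.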

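Step 4: Parse the page and simplify the selector

Go back to the page, press F12, and open the Elements tab. Click the boxed-arrow icon in the top-left corner, then click the content on the page; the matching HTML is highlighted in the panel. Once you have found the HTML for the part you want to crawl, right-click it and choose Copy > Copy selector.

The copied selector works like the address of that one element on the page. A crawler usually wants a whole class of elements, so the copied selector has to be analyzed and generalized into a CSS selector that matches all of them. Used as-is, it only matches the single element you clicked.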
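The difference between the copied selector and a generalized one can be sketched on a small HTML fragment. The fragment below is hypothetical; it only mimics the table shape of the page used in this guide.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment with the same table shape as the example page.
html = """
<table><tbody>
  <tr><td class="al"><a href="/1">first topic</a></td></tr>
  <tr><td class="al"><a href="/2">second topic</a></td></tr>
  <tr><td class="al"><a href="/3">third topic</a></td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Selector as copied from the browser: pinned to one row.
copied = "table > tbody > tr:nth-child(2) > td.al > a"
# Generalized selector: :nth-child(2) removed, so every row matches.
general = "table > tbody > tr > td.al > a"

one = [a.text for a in soup.select(copied)]
all_rows = [a.text for a in soup.select(general)]
print(one)
print(all_rows)
```

Comparing the selectors copied for two neighboring elements is usually the quickest way to spot which part to drop.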

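Step 5: Extract the content and clean the data

Store the generalized selector from the previous step in a variable; it is what selects the content the crawler wants.

# Content to extract
content="#page > div.c-d.c-d-e > div.Zd-p-Sc > div:nth-child(1) > div.cc-dc-c > div > div.jc-c > table > tbody > tr > td.al > a"

Then use soup and .text to filter out what is not needed, such as HTML tags and JavaScript, so that it does not get in the reader's way.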

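The soup.select() function

select(self, selector, namespaces=None, limit=None, **kwargs)

Purpose: find the required content in the HTML.
The main parameter is selector, defined as "a string containing a CSS selector".

soup.select() can look elements up in the following ways:

By (HTML) tag name
By CSS class selector
By CSS id selector
By a combination of selectors
By child selector
By attribute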
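Each of the lookup styles listed above can be tried on a small document. This is an illustrative sketch; the HTML, class names, and URLs are made up for the example.

```python
from bs4 import BeautifulSoup

html = """
<div id="page">
  <ul class="hot">
    <li><a class="title" href="https://example.com/a">A</a></li>
    <li><a class="title" href="https://example.com/b" target="_blank">B</a></li>
  </ul>
  <p class="note">footer <a href="/about">about</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_tag = soup.select("a")                    # tag name
by_class = soup.select(".title")             # class selector
by_id = soup.select("#page")                 # id selector
combined = soup.select("ul.hot a.title")     # combination (descendant)
children = soup.select("p.note > a")         # direct child
by_attr = soup.select('a[target="_blank"]')  # attribute

print(len(by_tag), len(by_class), len(by_id))
print([a.text for a in combined])
print([a.text for a in children])
print([a.text for a in by_attr])
```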

Reference: https://blog.csdn.net/wei_lin/article/details/102334956

Example source 1: crawling the real-time Weibo hot search list

import requests
from bs4 import BeautifulSoup

# Cookies and headers sent with the request
cookies = {
    'Hm_lvt_3b1e939f6e789219d8629de8a519eab9': '1690790195',
    'Hm_lpvt_3b1e939f6e789219d8629de8a519eab9': '1690790469',
}

headers = {
    'authority': 'tophub.today',
    'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7',
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,en-GB;q=0.7,en-US;q=0.6',
    'cache-control': 'max-age=0',
    # 'cookie': 'Hm_lvt_3b1e939f6e789219d8629de8a519eab9=1690790195; Hm_lpvt_3b1e939f6e789219d8629de8a519eab9=1690790469',
    'referer': 'https://www.bing.com/',
    'sec-ch-ua': '"Not/A)Brand";v="99", "Microsoft Edge";v="115", "Chromium";v="115"',
    'sec-ch-ua-mobile': '?0',
    'sec-ch-ua-platform': '"Windows"',
    'sec-fetch-dest': 'document',
    'sec-fetch-mode': 'navigate',
    'sec-fetch-site': 'cross-site',
    'sec-fetch-user': '?1',
    'upgrade-insecure-requests': '1',
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36 Edg/115.0.1901.188',
}


# Selectors copied from the browser for rows 1 and 2; only tr:nth-child(n) differs:
#page > div.c-d.c-d-e > div.Zd-p-Sc > div:nth-child(1) > div.cc-dc-c > div > div.jc-c > table > tbody > tr:nth-child(1) > td.al > a
#page > div.c-d.c-d-e > div.Zd-p-Sc > div:nth-child(1) > div.cc-dc-c > div > div.jc-c > table > tbody > tr:nth-child(2) > td.al > a

# Fetch the page
response = requests.get('https://tophub.today/n/KqndgxeLl9', cookies=cookies, headers=headers)
# Parse the page
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# Generalized selector: :nth-child(n) dropped from tr so that every row matches
content = "#page > div.c-d.c-d-e > div.Zd-p-Sc > div:nth-child(1) > div.cc-dc-c > div > div.jc-c > table > tbody > tr > td.al > a"
# Clean the data: keep only the link text and append it to the output file
with open("./微博热搜.txt", 'a', encoding="utf-8") as fo:
    for link in soup.select(content):
        fo.write(link.text + '\n')

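Advanced

1. Getting around anti-crawling measures

Request rate limits

Sleep for a random interval between requests, or send the requests through proxy IPs.

Random delay

time.sleep(random.randint(3,5))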
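The one-liner above needs both the time and random modules imported. A small helper, sketched here with a hypothetical name polite_sleep, makes the bounds configurable and uses random.uniform so that the delay is not always a whole number of seconds.

```python
import random
import time


def polite_sleep(min_seconds=3.0, max_seconds=5.0):
    """Sleep for a random time between the two bounds and return it."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay


# Tiny bounds so the demo finishes quickly; use seconds in a real crawl.
for page in range(1, 4):
    waited = polite_sleep(0.01, 0.05)
    print(f"page {page}: waited {waited:.3f}s")
```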

Building a proxy IP pool

Code:

# Build your own pool of open proxy IPs
import requests
import random
import time
from lxml import etree
from fake_useragent import UserAgent


class IpPool:
    def __init__(self):
        # URL used to test whether a proxy works
        self.test_url = 'http://httpbin.org/get'
        # URL template of the pages that list free proxies
        self.url = 'https://www.89ip.cn/index_{}.html'

        self.headers = {'User-Agent': UserAgent().random}
        # File that stores the working proxies, one per line
        # (text mode: the proxies are written as str, not bytes)
        self.file = open('ip_pool.txt', 'w', encoding='utf-8')

    def get_html(self, url):
        '''Fetch a page and return its HTML.'''
        html = requests.get(url=url, headers=self.headers).text

        return html

    def get_proxy(self, url):
        '''Extract the IPs and ports from a listing page and test each proxy.'''
        html = self.get_html(url=url)
        # print(html)

        elemt = etree.HTML(html)

        ips_list = elemt.xpath('//table/tbody/tr/td[1]/text()')
        ports_list = elemt.xpath('//table/tbody/tr/td[2]/text()')

        for ip, port in zip(ips_list, ports_list):
            # Join IP and port, e.g. 175.44.109.195:9999
            proxy = ip.strip() + ":" + port.strip()
            # print(proxy)
            self.test_proxy(proxy)

    def test_proxy(self, proxy):
        '''Test whether a proxy works.'''
        # proxies has the form {'protocol': 'protocol://IP:port'}
        proxies = {
            'http': 'http://{}'.format(proxy),
            'https': 'https://{}'.format(proxy),
        }
        # timeout=3: an exception is raised if there is no response within 3 seconds
        try:
            resp = requests.get(url=self.test_url, proxies=proxies, headers=self.headers, timeout=3)
            # Status code 200 means the proxy works
            if resp.status_code == 200:
                print(proxy, '\033[31musable\033[0m')
                # Save working proxies for later use, one per line
                self.file.write(proxy + '\n')
            else:
                print(proxy, 'unusable')

        except Exception:
            print(proxy, 'unusable')

    def crawl(self):
        '''Run the crawl.'''
        # The listing pages differ only in the page index:
        # https://www.89ip.cn/index_1.html
        # https://www.89ip.cn/index_2.html
        # ...
        # The site lists a lot of free proxies,
        # so only the first 100 pages are crawled and tested here
        for i in range(1, 101):
            # Build the full page URL
            page_url = self.url.format(i)
            # Throttle the crawl
            time.sleep(random.randint(1, 4))
            self.get_proxy(url=page_url)

        # Close the file when done
        self.file.close()


if __name__ == '__main__':
    ip = IpPool()
    ip.crawl()
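Once a pool file exists, a crawler still has to read it back and pick a proxy. The sketch below assumes the file holds one ip:port entry per line; load_proxies and as_requests_proxies are hypothetical helper names, and the addresses are reserved documentation addresses rather than real proxies.

```python
import random
import tempfile
from pathlib import Path


def load_proxies(path):
    """Return the non-empty lines of the pool file as ip:port strings."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]


def as_requests_proxies(proxy):
    """Wrap an ip:port string in the dict form that requests expects."""
    return {"http": "http://{}".format(proxy)}


# Demo pool file written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    pool_file = Path(tmp) / "ip_pool.txt"
    pool_file.write_text("192.0.2.10:8080\n\n198.51.100.7:3128\n", encoding="utf-8")
    pool = load_proxies(pool_file)

chosen = random.choice(pool)
print(as_requests_proxies(chosen))
```

The returned dict can be passed as the proxies argument of requests.get(), as in test_proxy.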

Using the open-source project proxy_pool (it did not work well for me; I could not get any IPs out of it)

Project repository: https://github.com/jhao104/proxy_pool

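Common Redis commands:

Start the server
redis-server.exe redis.windows.conf
Run the command above to start the server manually, then open another terminal and run redis-cli.exe to start the client.
Shut down Redis
shutdown

On Windows, Redis does not write a log file by default. It logs to the console unless a log file path is set explicitly in the configuration file.
Start crawling proxy IPs
python proxyPool.py schedule
To use the proxy IPs, start the web API service
python proxyPool.py server
Once the web service is running, the default configuration exposes the API at 127.0.0.1:5010.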

api	method	Description	params
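/	GET	API overview	None
/get	GET	get a random proxy	optional: ?type=https keeps only proxies that support https
/pop	GET	get and delete a proxy	optional: ?type=https keeps only proxies that support https
/all	GET	get all proxies	optional: ?type=https keeps only proxies that support https
/count	GET	number of proxies	None
/delete	GET	delete a proxy	?proxy=host:ip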

To use it from crawler code, wrap the API in functions:

import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

# your spider code

def getHtml():
    # ....
    retry_count = 5
    proxy = get_proxy().get("proxy")
    while retry_count > 0:
        try:
            # Send the request through the proxy
            html = requests.get('http://www.example.com', proxies={"http": "http://{}".format(proxy)})
            return html
        except Exception:
            retry_count -= 1
    # All retries failed: delete the proxy from the pool
    delete_proxy(proxy)
    return None

I then switched to another open-source proxy IP pool:

https://github.com/Python3WebSpider/ProxyPool.git

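Disguising the crawler as a browser (User-Agent)

The User-Agent (UA) is part of the HTTP protocol and is sent as a request header. It is a special string that lets the server identify the client's operating system and version, CPU type, browser and version, rendering engine, browser language, browser plugins, and so on.

How to view the UA

Open a page, press F12, select Network, click a request, and look at its Headers tab to find the User-Agent.

If the same User-Agent hits a site at high frequency, the site can easily flag it as a crawler. A common countermeasure is to pick randomly from several User-Agent strings so that consecutive requests do not all carry the same header.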
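The rotation idea itself needs nothing beyond the standard library. This is a minimal sketch with a small hand-maintained list; the three UA strings and the helper name build_request are illustrative choices, not a recommended set.

```python
import random
from urllib import request

# A small hand-maintained pool; the strings are examples, not an exhaustive list.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0",
]


def build_request(url):
    """Return a urllib Request carrying a randomly chosen User-Agent."""
    return request.Request(url, headers={"User-Agent": random.choice(USER_AGENTS)})


req = build_request("http://www.example.com/")
# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

Nothing is sent until the request is passed to request.urlopen(), so the sketch runs offline.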

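Using the fake_useragent module

The third-party Python module fake_useragent returns a random, ready-made UA string that can be used directly.

Install it first: pip install fake-useragent

Basic usage:

import fake_useragent

# Create a UserAgent object
ua = fake_useragent.UserAgent()
print(ua.chrome)

Sending the same UA on every request when fetching a page repeatedly also looks suspicious, so change the UA randomly while crawling:

import fake_useragent

# Create a UserAgent object
ua = fake_useragent.UserAgent()
print(ua.random)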

Example:

from urllib import request
import fake_useragent
import re
import random

url = r'http://www.baidu.com/'
ua = fake_useragent.UserAgent()
# Build four candidate headers, each with a random UA
list2 = []
for j in range(4):
    headers = {"User-Agent": ua.random}
    list2.append(headers)
# Pick one of them at random
i = random.randint(0, 3)
req = request.Request(url, headers=list2[i])
pat = r'<title>(.*?)</title>'
response = request.urlopen(req)
print(response.status)
# Read the body from the response already received instead of requesting the page again
html = response.read().decode()
data = re.findall(pat, html)
print(data[0])

# Output
# 百度一下，你就知道

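Crawler projects

DouBanSpider: a Douban Books crawler

It crawls all books under a Douban Books tag and saves them to Excel in order of rating, which makes filtering easy, for example picking highly rated books with more than 1000 ratings. Different topics can be saved to different sheets of the workbook. The crawler sends a browser User-Agent and adds random delays to imitate a browser more closely and avoid getting banned.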
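The ranking and filtering described above comes down to a sort and a list comprehension. The rows below are made-up sample data in the order title, rating, number of ratings; they are not results from Douban.

```python
# Hypothetical rows: title, rating, number of ratings (all scraped as strings).
books = [
    ["Book A", "9.1", "25000"],
    ["Book B", "8.4", "800"],
    ["Book C", "9.4", "1200"],
]

# Scraped values are strings, so convert them before comparing.
popular = [b for b in books if int(b[2]) > 1000]
ranked = sorted(popular, key=lambda b: float(b[1]), reverse=True)

for rank, (title, rating, people_num) in enumerate(ranked, start=1):
    print(rank, title, rating, people_num)
```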

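Errors encountered:

AttributeError: module 'urllib' has no attribute 'quote'

The error means that no attribute or function named quote exists in the urllib module. urllib is Python's standard library package for working with URLs; its quote and unquote functions handle URL encoding and decoding.

In Python 3, quote lives in the urllib.parse module, so the correct call is:

import urllib.parse

url = "https://example.com/?query=hello world"
# Note: quote() also escapes ':', '?' and '=' when given a whole URL
encoded_url = urllib.parse.quote(url)
print(encoded_url)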
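In practice it is usually only the parameter value that should be quoted, not the whole URL. A short sketch of both common ways to do that, with example.com as a placeholder host:

```python
from urllib.parse import quote, unquote, urlencode

base = "https://example.com/"
value = "hello world"

# Quote only the value, then assemble the URL by hand.
url_a = base + "?query=" + quote(value)
# Or let urlencode build the whole query string (spaces become '+').
url_b = base + "?" + urlencode({"query": value})

print(url_a)
print(url_b)
# unquote reverses quote.
print(unquote("hello%20world"))
```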

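soup = BeautifulSoup(plain_text)

This line triggers a warning from BeautifulSoup, a Python library for parsing HTML and XML documents. The warning says that no parser was specified explicitly when the BeautifulSoup object was created. You can set the parser to "lxml", which is fast and powerful. For example:

from bs4 import BeautifulSoup

# Explicitly set the parser to "lxml"
soup = BeautifulSoup(plain_text, features="lxml")

If the parser is already set to "lxml" but the same warning still appears, BeautifulSoup objects are probably being created in more than one place and some of them do not pass features="lxml". Check the rest of the code and make sure every BeautifulSoup constructor call passes the parser argument.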
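If lxml is not installed, naming the parser that ships with Python also avoids the warning. A minimal sketch, with a made-up plain_text value:

```python
from bs4 import BeautifulSoup

plain_text = "<html><head><title>demo</title></head><body><p>hi</p></body></html>"

# "html.parser" is part of the standard library, so no extra install is needed.
soup = BeautifulSoup(plain_text, "html.parser")
print(soup.title.string)
```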

Decoding escaped text before printing

Use the escape_decode function from the codecs library.

book_list.append([title,rating,people_num,author_info,pub_info])

# Requires "import codecs". escape_decode returns (bytes, length);
# take the bytes and decode them to str before printing.
title_output = codecs.escape_decode(title)[0].decode()
rating_output = codecs.escape_decode(rating)[0].decode()
people_num_output = codecs.escape_decode(people_num)[0].decode()
author_info_output = codecs.escape_decode(author_info)[0].decode()
pub_info_output = codecs.escape_decode(pub_info)[0].decode()
print(title_output,rating_output,people_num_output,author_info_output,pub_info_output)
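What escape_decode does is easiest to see on a value that contains literal backslash escapes. This is a small sketch with a made-up input; note that codecs.escape_decode is an undocumented CPython helper, so it is a convenience rather than a stable API.

```python
import codecs

# Bytes containing literal backslash escapes, as they might appear in a raw dump.
escaped = b"\\xe4\\xb8\\xad\\xe6\\x96\\x87"

# escape_decode returns a (decoded_bytes, consumed_length) tuple.
raw_bytes = codecs.escape_decode(escaped)[0]
# The result is still bytes; decode it as UTF-8 to get readable text.
text = raw_bytes.decode("utf-8")
print(text)
```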

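Notes on writing to Excel

from openpyxl import Workbook

def print_book_lists_excel(book_lists, book_tag_lists):
    # write_only=True replaces the older optimized_write=True flag
    wb = Workbook(write_only=True)
    ws = []
    # One sheet per tag. The tags are assumed to be UTF-8 bytes;
    # drop .decode() if they are already str.
    for i in range(len(book_tag_lists)):
        ws.append(wb.create_sheet(title=book_tag_lists[i].decode()))
    for i in range(len(book_tag_lists)):
        # Header row, then one numbered row per book
        ws[i].append(['No.', 'Title', 'Rating', 'Number of ratings', 'Author', 'Publisher'])
        count = 1
        for bl in book_lists[i]:
            ws[i].append([count, bl[0], float(bl[1]), int(bl[2]), bl[3], bl[4]])
            count += 1
    # File name: book_list-<tag1>-<tag2>....xlsx
    save_path = 'book_list'
    for i in range(len(book_tag_lists)):
        save_path += ('-' + book_tag_lists[i].decode())
    save_path += '.xlsx'
    wb.save(save_path)

This code also shows how to convert UTF-8 bytes to Unicode text: the .decode() method.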
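In Python 3, .decode() exists only on bytes, so calling it on a value that is already a str raises AttributeError. A small guard, sketched here under the hypothetical name to_text, handles both cases:

```python
def to_text(value, encoding="utf-8"):
    """Return value as str, decoding it first if it is bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# One tag as UTF-8 bytes, one already as str.
tags = ["小说".encode("utf-8"), "历史"]
print([to_text(t) for t in tags])
```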

Bilibili user data crawler


http://zoechen04616.github.io/2023/07/31/python项目1-爬虫学习指南/

Author: Yunru Chen

Published: July 31, 2023

