Python: Web Crawling

TAGS: Python

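Overview and HTTP Request/Response Handling

Overview

A crawler, more precisely a web crawler, is also called a web spider, web robot, or web ant.
Search engines are the best-known users of web crawlers.
With the arrival of the big-data era, every company hopes to find value in massive amounts of data. That means crawling data from specific websites and specific categories, which search engines do not offer, so you have to write your own crawler.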

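Crawler categories

1. General-purpose crawlers

The most common example is a search engine: it collects data indiscriminately, stores it, extracts keywords, builds an index, and offers users a search interface.

General crawl workflow
1. Initialize a batch of URLs and put them into the to-crawl queue.
2. Take URLs from the queue, resolve each host to an IP via DNS, download the HTML page from the site at that IP, save it to a local server, and move the crawled URL to the crawled queue.
3. Analyze the page content, find other URLs of interest in the page, and repeat step 2 until the stop condition is met.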
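
The workflow above can be sketched in a few lines. This is a minimal illustration, not a real crawler: the "web" is a hypothetical in-memory dict (`FAKE_WEB`), and `crawl` and `fetch_links` are names invented for this sketch. A real crawler would download and parse pages instead.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> URLs linked from that page.
FAKE_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://b.example/x"],
    "http://b.example/x": [],
}

def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first crawl: to-crawl queue, crawled list, stop condition."""
    to_crawl = deque(seeds)      # step 1: seed the to-crawl queue
    seen = set(seeds)            # URLs already queued, to avoid repeats
    crawled = []                 # the "crawled" queue
    while to_crawl and len(crawled) < max_pages:
        url = to_crawl.popleft()         # step 2: take a URL and "download" it
        links = fetch_links(url)
        crawled.append(url)
        for link in links:               # step 3: queue newly discovered URLs
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled

order = crawl(["http://a.example/"], lambda u: FAKE_WEB.get(u, []))
print(order)
```

The `seen` set is what keeps the crawl from looping when two pages link to each other.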
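How a search engine obtains a website's URL
1. A new website submits itself to the search engine.
2. Through external links placed on pages of other websites.
3. The search engine cooperates with DNS providers to learn about newly registered sites.

2. Focused crawlers

A focused crawler is written to collect data from a specific domain or of certain categories; it is a topic-oriented crawler.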

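Robots protocol

A site publishes a robots.txt file to tell crawlers what they may crawl.

=/= denotes the site root, that is, every directory of the site.
=Allow= lists directories that may be crawled.
=Disallow= lists directories that must not be crawled.
Wildcards may be used.

robots.txt is a gentlemen's agreement: "even crawling has its code of conduct."
To help search engines crawl a site's content more efficiently, the protocol also provides for a Sitemap file. A Sitemap is usually an XML file that lists update information for the content the site wants crawled.
The paths a robots.txt file disallows are often exactly the ones we might be interested in, so the file ends up disclosing those addresses.

Example: Taobao's robots.txt, http://www.taobao.com/robots.txt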

User-agent:  Baiduspider
Allow:  /article
Allow:  /oshtml
Allow:  /ershou
Allow: /$
Disallow:  /product/
Disallow:  /

User-Agent:  Googlebot
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /oversea
Allow:  /list
Allow:  /ershou
Allow: /$
Disallow:  /

User-agent:  Bingbot
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /oversea
Allow:  /list
Allow:  /ershou
Allow: /$
Disallow:  /

User-Agent:  360Spider
Allow:  /article
Allow:  /oshtml
Allow:  /ershou
Disallow:  /

User-Agent:  Yisouspider
Allow:  /article
Allow:  /oshtml
Allow:  /ershou
Disallow:  /

User-Agent:  Sogouspider
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /ershou
Disallow:  /

User-Agent:  Yahoo!  Slurp
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /oversea
Allow:  /list
Allow:  /ershou
Allow: /$
Disallow:  /

User-Agent:  *
Disallow:  /

Example: Mafengwo's robots.txt, http://www.mafengwo.cn/robots.txt

User-agent: *
Disallow: /
Disallow: /poi/detail.php

Sitemap: http://www.mafengwo.cn/sitemapIndex.xml
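
Rules like these can be checked programmatically with the standard-library urllib.robotparser. The sketch below parses a shortened, hand-copied excerpt of the Taobao rules from a string, so it needs no network access. The crawler name "MyCrawler" is made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Shortened excerpt of the rules shown above, held in a string.
ROBOTS_TXT = """\
User-agent: Baiduspider
Allow: /article
Disallow: /product/
Disallow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse() takes an iterable of lines

# can_fetch(useragent, url) applies the first rule that matches the path.
baidu_article = parser.can_fetch("Baiduspider", "http://www.taobao.com/article")
baidu_product = parser.can_fetch("Baiduspider", "http://www.taobao.com/product/1")
other_article = parser.can_fetch("MyCrawler", "http://www.taobao.com/article")

print(baidu_article, baidu_product, other_article)
```

To check a live site, `set_url()` followed by `read()` fetches the file instead of `parse()`.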

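HTTP request and response handling

Crawling a web page simply means accessing it over HTTP. Access through a browser is usually a human action; a crawler turns that action into access by a program.

The urllib package

urllib is part of the standard library. It is a package that bundles the following modules for working with URLs:
* urllib.request opens and reads URLs
* urllib.error contains the exceptions raised by urllib.request
* urllib.parse parses URLs
* urllib.robotparser parses robots.txt files

Python 2 shipped both urllib and urllib2. urllib offered a lower-level interface, and urllib2 wrapped it further. Python 3 merged urllib and urllib2 into a single standard-library package named urllib.

The urllib.request module

It defines functions and classes for opening URLs (mainly HTTP), with support for basic and digest authentication, redirects, cookies, and so on.

The urlopen method
1. urlopen(url,data=None)
  - url is a URL string or an instance of the Request class.
  - data is the data to submit. If data is None a *GET* request is sent; otherwise a *POST* request is sent. See =urllib.request.Request#get_method=. The call returns a response object of class http.client.HTTPResponse, which is a file-like object.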
```
from urllib.request import urlopen

# Open a URL; the return value is a response object (file-like)
# The URL below redirects when visited
responses = urlopen("http://www.bing.com") # GET by default
print(responses.closed)
with responses:
    print(1, type(responses)) # http.client.HTTPResponse, a file-like object
    print(2,responses.status,responses.reason) # status code and reason
    print(3,responses.geturl()) # the URL actually retrieved
    print(4,responses.info()) # headers
    print(5,responses.read()[:50]) # read the response body

print(responses.closed)
```
Figure 1: robots_001
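In the example above, urllib.request.urlopen sends an HTTP GET request and the web server returns the page content. The response data is wrapped in a file-like object: read, readline, and readlines fetch the data, the status and reason attributes give the status code and reason phrase, info returns the headers, and so on.
User-Agent
The code above is very compact and already obtains the site's response. However, urlopen called with only a URL string and data gives no way to change the HTTP headers. To modify a header such as User-Agent, another mechanism is needed.
  - The source code builds the default User-Agent like this:
    # urllib.request.OpenerDirector
    class OpenerDirector:
        def __init__(self):
            client_version = "Python-urllib/%s" % __version__
            self.addheaders = [('User-agent', client_version)]
  - Currently this shows up as Python-urllib/3.7.
  - Some sites block crawlers, so the crawler has to disguise itself as a browser. Open any browser, copy its UA value, and use that as the disguise.
The Request class
Request(url,data=None,headers={})
The initializer builds a request object. A dict of headers can be passed in. The data argument decides whether the request is a GET or a POST.
=obj.add_header(key,val)= adds one key/value pair to the headers.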

from urllib.request import Request,urlopen
import random

# Build a Request object for a URL, then open it
# url = "https://movie.douban.com/" # note: the trailing slash is required
url = "http://www.bing.com/"

ua_list = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36", # Chrome
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36", # Safari
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0", # Firefox
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)" # IE
]

ua = random.choice(ua_list)
request = Request(url)
request.add_header("User-Agent",ua)
print(type(request))

response = urlopen(request,timeout=20) # accepts a Request object or a URL string
print(type(response))

with response:
    print(1,response.status,response.getcode(),response.reason) # status; getcode() simply returns status
    print(2,response.geturl()) # URL of the returned data; differs from the original URL after a redirect
    # e.g. the original URL is http://www.bing.com/ and the returned one is http://cn.bing.com/
    print(3,response.info()) # response headers
    print(4,response.read()[:50]) # read the response body

print(5,request.get_header("User-agent"))
print(6,request.headers)
print(7,"user-agent".capitalize())

The urllib.parse module

This module encodes and decodes URLs.

parse.urlencode({key:value}) # encode a query string

from urllib import parse

u = parse.urlencode({
    "url":"http://www.xdd.com/python",
    "p_url":"http://www.xdd.com/python?id=1&name=张三"
})
print(u)

# Output
url=http%3A%2F%2Fwww.xdd.com%2Fpython&p_url=http%3A%2F%2Fwww.xdd.com%2Fpython%3Fid%3D1%26name%3D%E5%BC%A0%E4%B8%89

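The output shows that the colon, slash, &, equals sign, question mark, and similar characters are all encoded. What follows each % is the hexadecimal value of a single byte.
The path part of a URL usually does not need Chinese characters. The parameter part is different: whether the method is GET or POST, the submitted data may contain slashes, equals signs, question marks, and so on, and there these characters are data, not metacharacters. Sent to the server unencoded, they would leave the receiver unable to tell metacharacters from data. To be safe, the data part is URL-encoded, which removes the ambiguity. Chinese text is handled the same way: the string is first converted to a byte sequence in the required encoding, and each byte is written as a percent sign followed by its two hexadecimal digits.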

from urllib import parse

u = parse.urlencode({"wd":"中"}) # encode the query string
url= "https://www.baidu.com/s?{}".format(u)
print(url)

print("中".encode("utf-8")) # b'\xe4\xb8\xad'
print(parse.unquote(u)) # decode
print(parse.unquote(url))
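
As a small companion to the examples above, the following sketch round-trips a query string using only urllib.parse: encode a dict, attach it to a URL, then split the URL and decode the query back into Python values. The host www.example.com and the parameter names are placeholders chosen for illustration.

```python
from urllib.parse import parse_qs, quote, unquote, urlencode, urlsplit

# Hypothetical parameters, chosen to include characters that need escaping.
params = {"wd": "中", "q": "a b&c"}

query = urlencode(params)                       # percent-encode keys and values
url = "https://www.example.com/s?" + query

# Split the URL apart again and decode the query into a dict of lists.
decoded = parse_qs(urlsplit(url).query)

print(query)
print(decoded)
print(quote("中"), unquote("%E4%B8%AD"))        # single-value encode / decode
```

Note that urlencode writes a space as `+`, while quote writes it as `%20`; parse_qs understands both.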

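Request methods

The HTTP methods most often used to exchange data are GET and POST.
1. GET: the data travels in the URL, so it sits in the request line at the head of the HTTP message rather than in the body.
2. POST: the data is submitted in the body of the HTTP message.
3. The data takes the form of key/value pairs, with multiple parameters joined by &, for example a=1&b=abc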
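
The difference can be seen without sending anything. A minimal sketch: build two urllib.request.Request objects from the same key/value pairs and ask each which method it would use. httpbin.org appears only as a placeholder host; no request is made.

```python
from urllib.parse import urlencode
from urllib.request import Request

data = urlencode({"a": 1, "b": "abc"})          # "a=1&b=abc"

# GET: the pairs are appended to the URL.
get_request = Request("http://httpbin.org/get?" + data)

# POST: the same pairs go into the body, which must be bytes.
post_request = Request("http://httpbin.org/post", data=data.encode())

print(get_request.get_method(), get_request.full_url)
print(post_request.get_method(), post_request.data)
```

get_method() returns "POST" whenever data is not None, which is the rule urlopen follows as well.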

GET method

Open the Bing search engine site and run a search to obtain a search URL: =http://cn.bing.com/search?q=%E7%A5%9E%E6%8E%A2%E7%8B%84%E4%BB%81%E6%9D%B0=

from urllib.request import urlopen,Request
from urllib.parse import urlencode

data = urlencode({"q":"神探狄仁杰"})
base_url = "http://cn.bing.com/search"
url = "{}?{}".format(base_url,data)
safari = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36" # Safari

request = Request(url,headers={"User-agent":safari})
response = urlopen(request)
with response:
    with open("d:/abc.html","wb") as f:
        f.write(response.read())
print("ok")

POST method

Test site: http://httpbin.org/

from urllib.request import Request,urlopen
from urllib.parse import urlencode
import simplejson

request = Request("http://httpbin.org/post")
request.add_header("User-agent","Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36")
data = urlencode({"name":"张三,@=/&","age":"6"})
print(data)

res = urlopen(request,data.encode()) # POST with form data: if data is not None the request is a POST, otherwise a GET
with res:
    j = res.read().decode() # JSON text
    print(j)
    print("===============================")
    print(simplejson.loads(j))

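Handling JSON data

Look at the hot movies on Douban Movies. Inspecting the page shows that this part of the content is JSON fetched from the backend via AJAX.
The URL is =https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&page_limit=50&page_start=0=
1. =%E7%83%AD%E9%97%A8= is the UTF-8 percent-encoding of the Chinese word "热门" (hot)
2. tag is the tag; "热门" selects hot movies
3. type is the data type; movie means films
4. page_limit is the number of items to return
5. page_start is the data offset
The server returns JSON data (for the carousel component, 50 items in total).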

from urllib.request import Request,urlopen
from urllib.parse import urlencode

base_url = "https://movie.douban.com/j/search_subjects"
data = urlencode({
    "tag":"热门",
    "type":"movie",
    "page_limit":10,
    "page_start":10
})
request = Request(base_url)

# POST method
response = urlopen(request,data=data.encode())
with response:
    print(response._method)
    print(response.read().decode()[:100])

# GET method
with urlopen("{}?{}".format(base_url,data)) as res:
    print(res._method)
    print(res.read().decode()[:100])
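
Once the JSON text has arrived, the standard-library json module turns it into Python objects. The sketch below works on a hand-written sample shaped like the Douban response, with a "subjects" list of movie dicts, so it runs offline. The sample is an abbreviation for illustration, not real response data.

```python
import json

# Abbreviated sample in the shape of the response; not fetched from the server.
text = (
    '{"subjects": ['
    '{"rate": "8.8", "title": "寄生虫", "id": "27010768"},'
    '{"rate": "7.7", "title": "恶人传", "id": "30211551"}'
    ']}'
)

payload = json.loads(text)                      # JSON text -> dict
titles = [item["title"] for item in payload["subjects"]]

# "rate" is a string in the JSON, so convert it before comparing.
best = max(payload["subjects"], key=lambda item: float(item["rate"]))

print(titles)
print(best["title"], best["rate"])
```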

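Ignoring HTTPS certificate errors

HTTPS uses the SSL/TLS protocol to encrypt network data on top of the transport layer. Using HTTPS requires a certificate, and the certificate must be certified by a CA.
A CA (Certificate Authority) is an organization that issues, manages, and revokes digital certificates.
A CA is a trusted third party, and certificates it issues are considered trustworthy. If a user suffers a loss from trusting a CA-issued certificate, the CA can be held legally responsible.
CAs form a hierarchy: a higher-level CA issues and certifies the certificates of lower-level CAs, so trust in a lower-level CA derives from the CA above it.
Some sites, Taobao for example, use HTTPS so that data is encrypted and more secure.
The old 12306 site used to require users to download its certificate.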

from urllib.request import Request,urlopen

# request = Request("http://www.12306.cn/mormhweb/") # works
# request = Request("https://www.baidu.com/") # works

request = Request("https://www.12306.cn/mormhweb/") # the old site raised an SSL certificate error
request.add_header(
    "User-agent",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36"
)

# ssl.CertificateError: hostname 'www.12306.cn' doesn't match either of ......
with urlopen(request) as res:
    print(res._method)
    print(res.read())

Note: the explanation below applies to the old 12306 site. It is hard to find another site today that issues its own certificate.
Accessing 12306 over HTTPS failed because its certificate was not certified by a CA; the site generated the certificate itself, so it was not trusted. Other sites such as =https://www.baidu.com/= raise no warning because the issuer of their certificate is trusted and has long been stored in the local system.
When this happens, one way out is to ignore the certificate warning.

from urllib.request import Request,urlopen
import ssl # import the ssl module


# request = Request("http://www.12306.cn/mormhweb/") # works
# request = Request("https://www.baidu.com/") # works

request = Request("https://www.12306.cn/mormhweb/") # the old site raised an SSL certificate error
request.add_header(
    "User-agent",
    "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36"
)

# Ignore untrusted certificates
context = ssl._create_unverified_context()
res = urlopen(request,context=context)

# Without the context: ssl.CertificateError: hostname 'www.12306.cn' doesn't match either of ......
with res:
    print(res._method)
    print(res.geturl())
    print(res.read().decode())
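
ssl._create_unverified_context() is a private helper, as its leading underscore indicates. As a sketch of an alternative that uses only public ssl attributes, a default context can have verification switched off explicitly. This is an illustration of the same idea, not the method used in the code above, and disabling verification is only appropriate for testing.

```python
import ssl

# Start from the default (verifying) context, then switch verification off.
context = ssl.create_default_context()
context.check_hostname = False        # must be disabled before relaxing verify_mode
context.verify_mode = ssl.CERT_NONE   # accept any certificate, including self-signed ones

print(context.check_hostname, context.verify_mode)
# The context would then be passed on as urlopen(request, context=context).
```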

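The urllib3 library

https://urllib3.readthedocs.io/en/latest/
The standard-library urllib lacks some key features, such as connection pool management, which the third-party library urllib3 provides.

Install with pip install urllib3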

import urllib3
from urllib3.response import HTTPResponse

url = "https://movie.douban.com"
ua = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36"

# connection pool management
with urllib3.PoolManager() as http:
    response:HTTPResponse = http.request("GET",url,headers={"User-Agent":ua})
    print(type(response))
    print(response.status,response.reason)
    print(response.headers)
    print(response.data[:50])

The requests library

requests is built on urllib3 but has a friendlier API, so it is the recommended choice.
Install with =pip install requests=

import requests

ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36"
url = "https://movie.douban.com/"

response = requests.request("GET",url,headers={"User-Agent":ua})

with response:
    print(type(response))
    print(response.url)
    print(response.status_code)
    print(response.request.headers) # request headers
    print(response.headers) # response headers
    response.encoding = "utf-8"
    print(response.text[:200]) # HTML content
    with open('d:/movie.html',"w",encoding='utf-8') as f:
        f.write(response.text)

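requests works through a Session object internally. A Session exists to keep session state, such as cookies, across multiple exchanges with the server.

# Use a Session directly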
import requests

ua = "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:50.0) Gecko/20100101 Firefox/50.0"
urls = ["https://www.baidu.com/s?wd=xdd","https://www.baidu.com/s?wd=xdd"]

session = requests.Session()
with session:
    for url in urls:
        response = session.get(url,headers = {"User-Agent":ua})
        # response = requests.request("GET",url,headers={"User-Agent":ua}) # compare the two approaches
        with response:
            print(response.request.headers) # request headers
            print(response.cookies) # cookies in the response
            print(response.text[:20]) # HTML content
            print("-"*30)

With a session, the second request carries the cookie.

HTML parsing with XPath

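When HTML content is returned to the browser, the browser parses and renders it.

HTML, the HyperText Markup Language, was designed to go beyond plain text and make text more expressive.
XML, the Extensible Markup Language, was not meant to replace HTML. HTML's design mixes in a lot of formatting and takes on tasks other than describing data, so XML was designed to describe data only.

Both HTML and XML are structured: markup forms a nested tree. The DOM (Document Object Model) is used to parse this nested tree structure, and browsers usually provide an API for manipulating the DOM in an object-oriented way.

XPath

Chinese tutorial: http://www.w3school.com.cn/xpath/index.asp
XPath is a language for finding information in XML documents. It can be used to traverse elements and attributes in an XML document.
Test tool: XMLQuire; on Windows 7 or later it requires .NET Framework 4.0-4.5.

Testing XML and XPath

Test document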

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book id="bk101">
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications 
    with XML.</description>
</book>
<book id="bk102" class="bookinfo even">
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies, 
    an evil sorceress, and her own childhood to become queen 
    of the world.</description>
</book>
<book id="bk103">
    <author>Corets, Eva</author>
    <title>Maeve Ascendant</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-11-17</publish_date>
    <description>After the collapse of a nanotechnology 
    society in England, the young survivors lay the 
    foundation for a new society.</description>
</book>
<book id="bk104">
    <author>Corets, Eva</author>
    <title>Oberon's Legacy</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2001-03-10</publish_date>
    <description>In post-apocalypse England, the mysterious 
    agent known only as Oberon helps to create a new life 
    for the inhabitants of London. Sequel to Maeve 
    Ascendant.</description>
</book>
</bookstore>

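Nodes

XPath has seven kinds of nodes: *element, attribute, text, namespace, processing-instruction, comment, and document (root) nodes.*
1. =/= is the root node
2. =<bookstore>= is an element node
3. =<author>Corets,Eva</author>= is an element node.
4. =id="bk104"= is an attribute node; id is an attribute of the element node book
Nesting of nodes forms *parent/child relationships*.
Different nodes that share the same parent are *siblings*.
Node selection

Operator or expression	Meaning
`/`	Start from the root node
`//`	Search at any depth, starting from the current node
`.`	The current node
`..`	The parent of the current node
`@`	Select an attribute
`nodename`	Select all nodes with this name
`*`	Match any element node
`@*`	Match any attribute node
`node()`	Match a node of any type
`text()`	Match text nodes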

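Predicates
A predicate finds a specific node or a node that contains a specific value.
*Predicates are written inside square brackets*.
A predicate is a query condition.
In other words, the condition goes in brackets while selecting a path.
XPath axes
An axis defines a node set relative to the current node.

Axis name	Result
ancestor	Selects all ancestors (parent, grandparent, etc.) of the current node
ancestor-or-self	Selects all ancestors (parent, grandparent, etc.) of the current node and the current node itself
attribute	Selects all attributes of the current node. @id is equivalent to attribute::id
child	Selects all child elements of the current node; title is equivalent to child::title
descendant	Selects all descendant elements (children, grandchildren, etc.) of the current node
descendant-or-self	Selects all descendant elements (children, grandchildren, etc.) of the current node and the current node itself
following	Selects every node in the document after the end tag of the current node
namespace	Selects all namespace nodes of the current node
parent	Selects the parent of the current node
preceding	Selects every node in the document before the start tag of the current node
preceding-sibling	Selects all siblings before the current node
self	Selects the current node. Equivalent to self::node()

Steps
Step syntax: =axisname::nodetest[predicate]=

Example	Result
`child::book`	Selects all book nodes that are child elements of the current node
`attribute::lang`	Selects the lang attribute of the current node
`child::*`	Selects all child elements of the current node
`attribute::*`	Selects all attributes of the current node
`child::text()`	Selects all text child nodes of the current node
`child::node()`	Selects all child nodes of the current node
`descendant::book`	Selects all book descendants of the current node
`ancestor::book`	Selects all book ancestors of the current node
`ancestor-or-self::book`	Selects all book ancestors of the current node, plus the current node if it is a book node
`child::*/child::price`	Selects all price grandchildren of the current node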

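XPath examples
A path that starts with a slash is an absolute path and starts from the root.
A path that does not start with a slash is a relative path and is normally evaluated against the current node. The current node depends on context and may well no longer be the root.
For convenience, when the XML is deeply nested, =//= is commonly used to find nodes.

Path expression	Meaning
`title`	All title child nodes of the current node
`/book`	A book child directly under the root; nothing is found here
`book/title`	title nodes under all book children of the current node
`//title`	title nodes at any depth below the root
`book//title`	title nodes at any depth under all book children of the current node
`//@id`	id attributes at any depth; the attributes themselves are returned
`//book[@id]`	book nodes at any depth that have an id attribute
`//*[@id]`	Nodes at any depth that have an id attribute
`//book[@id="bk102"]`	book nodes at any depth whose id attribute is bk102
`/bookstore/book[1]`	The first book node under the root bookstore; indexing starts at 1
`/bookstore/book[1]/@id`	The id attribute of the first book node under the root bookstore
`/bookstore/book[last()-1]`	The second-to-last book node under the root bookstore; last() returns the index of the last element
`/bookstore/*`	All child nodes of the root bookstore, not recursive
`//*`	All descendant nodes
`//*[@*]`	All nodes that have at least one attribute
`//book/title | //price`	title nodes under book at any depth, or price nodes at any depth
`//book[position()=2]`	book nodes, taking the second one
`//book[position()<last()-1]`	book nodes whose position is before the second-to-last
`//book[price>40]`	book nodes whose price value is greater than 40
`//book[2]/node()`	All nodes of any type under the book node at position 2
`//book[1]/text()`	All text child nodes of the first book node
`//book[1]//text()`	All text nodes at any depth under the first book node
`//*[local-name()="book"]`	All nodes whose unqualified name is book. local-name() returns the name without its namespace prefix, so this effectively selects by tag name
`//book[price<6]/price`, `//book/price[text()<6]`, `//book/child::node()[local-name()="price" and text()<6]`	These three expressions are equivalent: price nodes under book whose content is less than 6
`//book//*[self::title or self::price]`, `//book//title | //book//price`, `//book//*[local-name()="title" or local-name()="price"]`	These three expressions are equivalent: descendants of book nodes that are title or price
`//*[@class]`	All nodes that have a class attribute
`//*[@class="bookinfo even"]`	All nodes whose class attribute is "bookinfo even"
`//*[contains(@class,'even')]`	All nodes whose class attribute contains the string even
`//*[contains(local-name(),'book')]`	Nodes whose tag name contains book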

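Function summary

Function	Meaning
`local-name()`	The name without its namespace prefix; effectively selects by tag name
`text()`	The text content between tags
`node()`	All nodes
`contains(@class,str)`	Contains
`starts-with(local-name(),"book")`	Starts with book
`last()`	Index of the last element
`position()`	Element index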
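
A few of the expressions above can be tried with nothing but the standard library. The sketch below uses xml.etree.ElementTree, which supports only a limited XPath subset: paths are relative to an element, and there are no functions such as contains() or numeric comparisons. That is why the price filter is done in Python. The XML is a shortened copy of the test document.

```python
import xml.etree.ElementTree as ET

XML = """<bookstore>
<book id="bk101"><title>XML Developer's Guide</title><price>44.95</price></book>
<book id="bk102" class="bookinfo even"><title>Midnight Rain</title><price>5.95</price></book>
<book id="bk103"><title>Maeve Ascendant</title><price>5.95</price></book>
</bookstore>"""

root = ET.fromstring(XML)   # root is the <bookstore> element

first_id = root.find("book[1]").get("id")                 # like /bookstore/book[1]/@id
by_id = root.find(".//book[@id='bk102']/title").text      # like //book[@id="bk102"]/title
with_class = [e.get("id") for e in root.findall(".//*[@class]")]   # like //*[@class]
second_last = root.find("book[last()-1]").get("id")       # like /bookstore/book[last()-1]

# ElementTree has no numeric predicates, so //book[price<6] is filtered in Python.
cheap = [b.find("title").text for b in root.findall("book")
         if float(b.find("price").text) < 6]

print(first_id, by_id, with_class, second_last, cheap)
```

For the full XPath 1.0 syntax shown in the tables, a library such as lxml is needed.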

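lxml

lxml is a feature-rich XML and HTML parsing library for Python with very good performance; it wraps libxml2 and libxslt.
At the time of writing, the latest version supported Python 2.6+ and, for Python 3, up to 3.6.
Building it on CentOS requires

#yum install libxml2-devel libxslt-devel

Note that the requirements differ by platform; see https://lxml.de/installation.html
Install lxml with =$ pip install lxml=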

from lxml import etree

# Build HTML with etree
root = etree.Element("html")
print(type(root))
print(root.tag)

body = etree.Element("body")
root.append(body)
print(etree.tostring(root))

# add child nodes
sub = etree.SubElement(body,"child1")
print(type(sub))
sub = etree.SubElement(body,"child2")
sub.append(etree.Element("child21")) # append() returns None, so it is not chained into the assignment
html = etree.tostring(root,pretty_print=True).decode()
print(html)
print("- "*30)

r = etree.HTML(html) # returns the root node
print(r.tag)
print(r.xpath("//*[contains(local-name(),'child')]"))

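Two more things are particularly useful:
etree.HTML(text) parses an HTML document and returns the root node
anode.xpath('xpath expression') applies XPath syntax to a node
Exercise: crawl the "word-of-mouth chart"
1. Get the weekly word-of-mouth chart ("本周口碑榜") from Douban Movies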

from lxml import etree
import requests

url = "https://movie.douban.com/"
ua = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36"

with requests.get(url,headers={"User-agent":ua}) as response:
    if response.status_code==200:
        content = response.text # HTML content
        html = etree.HTML(content) # parse the HTML and return the DOM root node
        titles = html.xpath("//div[@class='billboard-bd']//tr/td/a/text()") # list of text strings
        for i in titles: # Douban Movies weekly chart
            print(i)
    else:
        print("request failed")

BeautifulSoup4 and JsonPath

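BeautifulSoup4

BeautifulSoup extracts data from HTML and XML; BS4 is the actively developed version.
Official Chinese documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/
Installation
1. pip install beautifulsoup4
Import: =from bs4 import BeautifulSoup=
Initialization:
1. BeautifulSoup(markup="",features=None)
  - markup: the object to parse, either a file object or an HTML string
  - features: selects the parser
  - return: a document object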
from bs4 import BeautifulSoup

# from a file object
soup = BeautifulSoup(open("test.html"))
# from a markup string
soup = BeautifulSoup("<html>data</html>")

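If no parser is specified, BeautifulSoup falls back on whichever parser libraries are installed on the system.

Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup,"html.parser")	Built into Python; decent speed; lenient with malformed documents	Poor leniency in versions before Python 2.7.3 and 3.2.2
lxml HTML parser	BeautifulSoup(markup,"lxml")	Fast; lenient with malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup,["lxml","xml"]) or BeautifulSoup(markup,"xml")	Fast; the only parser that supports XML	Requires a C library
html5lib	BeautifulSoup(markup,"html5lib")	Best leniency; parses documents the way a browser does; produces HTML5 documents	Slow; external Python dependency

BeautifulSoup(markup,"html.parser") uses the Python standard library; its leniency is comparatively weak and its performance average.
BeautifulSoup(markup,"lxml") is lenient and fast, but requires a system C library.
lxml is the recommended parser because it is efficient.
Specify the parser explicitly so that the same parser is used in every environment the code runs in.
Build test.html from the content below and parse it with bs4.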

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>首页</title>
</head>
<body>
<h1>xdd欢迎您</h1>
<div id="main">
    <h3 class="title highlight"><a href="http://www.python.org">python</a>高级班</h3>
    <div class="content">
        <p id="first">字典</p>
        <p id="second">列表</p>
        <input type="hidden" name="_csrf" value="absdoia23lkso234r23oslfn">
        <!-- comment -->
        <img id="bg1" src="http://www.xdd.com/">
        <img id="bg2" src="http://httpbin.org/">
    </div>
</div>
<p>bottom</p>
</body>
</html>

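Four kinds of objects
BeautifulSoup parses an HTML document into a complex tree in which every node is a Python object. There are four kinds:
- BeautifulSoup, Tag, NavigableString, Comment
- *BeautifulSoup object*: represents the whole document.
- *Tag object*: corresponds to a tag in the HTML. It has two commonly used attributes:
  1. name: the name of the Tag object, that is, the tag name
  2. attrs: a dict of the tag's attributes
    - Multi-valued attributes: the class attribute may look like =<h3 class="title highlight">python高级班</h3>=, in which case the attribute has multiple values ({"class":["title","highlight"]})
    - Attributes can be modified and deleted
BeautifulSoup.prettify() # pretty-printed (indented) output of the parsed document. Note: printing the BeautifulSoup object directly outputs the parsed document without formatting.
BeautifulSoup.div # the first matching div; the returned object is of type bs4.element.Tag
BeautifulSoup.h3.get("class") # the class attribute value of the first h3 tag in the document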

from bs4 import BeautifulSoup

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup.builder)
    # print(0,soup) # the whole parsed document, unformatted
    # print(1,soup.prettify()) # the document, pretty-printed
    print("- "*30)
    # print(2,soup.div,type(soup.div)) # type bs4.element.Tag, a Tag object
    # print(3,soup.div["class"]) # raises KeyError: the div has no class attribute
    print(3,soup.div.get("class")) # the div's class attribute; None if absent

    print(4,soup.div.h3["class"]) # multi-valued attribute
    print(4,soup.h3.get("class")) # multi-valued attribute: class of the first h3 in the document
    print(4,soup.h3.attrs.get("class")) # multi-valued attribute

    print(5,soup.img.get("src")) # src attribute of the img
    soup.img["src"] = "http://www.xddupdate.com" # modify the value
    print(5,soup.img["src"])

    print(6,soup.a) # the first a tag; None if there is none
    del soup.h3["class"] # delete the attribute
    print(4,soup.h3.get("class"))

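Note: HTML is not normally manipulated through attribute access like this; the code above is only meant to show the object types.
NavigableString
If you only want the text inside a tag and do not care about the markup, use NavigableString.

print(soup.div.p.string) # string of the first p under the first div
print(soup.p.string) # same as above

*Comment object*: an HTML comment; BeautifulSoup parses it into a Comment object.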

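Traversing the document tree

Finding the content you care about in the document tree is the everyday work, which comes down to how to traverse the nodes of the tree. The test.html above is used for testing.
Using Tag
- soup.div searches from the root and returns the first div node as a Tag object
- soup.div.p finds the first div from the root and returns a Tag object, then looks for the first p under that Tag and returns it as a Tag object
- soup.p returns the tag with the text "字典" rather than "bottom", which shows that traversal is *depth-first*; the result is again a Tag object
Traversing direct children
- Tag.contents # all direct children, of every type, as a list
- Tag.children # an iterator over the direct children
  - list(Tag.children) # equivalent to Tag.contents

Traversing all descendants

Tag.descendants # all descendants of the node, of every type; the iteration order is depth-first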

from bs4 import BeautifulSoup
from bs4.element import Tag

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup.p.string)
    print(soup.div.contents) # list of direct children
    print("- "*30)

    for i in soup.div.children: # iterable of direct children
        print(i.name)
    print("- "*30)
    print(list(map(
        lambda x:x.name if x.name else x,
        soup.div.descendants # all descendants
    )))

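Traversing strings

In the earlier example soup.div.string returns None, because string requires soup.div to have exactly one child of type NavigableString, as in =<div>only string</div>=.
Tag.string # the string object under the Tag; returns None if there is more than one child node
Tag.strings # an iterator over all string objects, including surplus whitespace
Tag.stripped_strings # an iterator over the strings with surplus whitespace removed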

from bs4 import BeautifulSoup
from bs4.element import Tag

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup.div.string) # None, because there is more than one child node
    print("- "*30)
    print("".join(soup.div.strings).strip()) # iterator that keeps surplus whitespace
    print("- "*30)
    print("".join(soup.div.stripped_strings)) # iterator that drops surplus whitespace

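Traversing ancestors

BeautifulSoup.parent # parent of the root node; always None, since the root has no parent
Tag.parent # parent of the first matching Tag
Tag.parent.parent.get("id") # id attribute of the grandparent of the first matching Tag
Tag.parents # all ancestors of the Tag, nearest first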

from bs4 import BeautifulSoup
from bs4.element import Tag

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(type(soup))
    print(soup.parent)
    print(soup.div.parent.name) # body, the parent of the first div
    print(soup.p.parent.parent.get("id")) # id attribute: main
    print("- "*30)
    print(list(map(lambda x:x.name,soup.p.parents))) # ancestor iterator, nearest first

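Traversing siblings

Tag.next_sibling # next sibling (below) of the first matching Tag; note that it may be a text node
Tag.previous_sibling # previous sibling (above) of the first matching Tag; note that it may be a text node
Tag.next_siblings # all following siblings of the Tag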

from bs4 import BeautifulSoup
from bs4.element import Tag

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(type(soup),type(soup.p))
    print("{} [{}]".format(1,soup.p.next_sibling.encode()))
    print("{} [{}]".format(2,soup.p.previous_sibling.encode()))
    print(soup.p.previous_sibling.next_sibling) # same as soup.p
    print(soup.p.next_sibling.previous_sibling)  # same as soup.p
    print(soup.p)
    print(list(soup.p.next_siblings))

Traversing other elements

Tag.next_element # the next object to be parsed (a string or a tag); not the same as the next sibling, next_sibling
Tag.next_elements # an iterable of all subsequently parsed objects

from bs4 import BeautifulSoup

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(type(soup),type(soup.p))
    print(soup.p.next_element) # the two characters "字典"
    print(soup.p.next_element.next_element.encode())
    print(soup.p.next_element.next_element.next_element)
    print(list(soup.p.next_elements))

    print("- "*30)
    # compare the difference
    print(list(soup.p.next_elements))
    print(list(soup.p.next_siblings))

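Searching the document tree

The find family has many methods; see the documentation at https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id25
=find_all(name=None,attrs={},recursive=True,text=None,limit=None,**kwargs)= # returns a list immediately

The name parameter: the official docs call it a *filter*. It can be any of the following.

*String*: a tag name; tags are matched against the full string. =print(soup.find_all('p'))= # returns every p tag in the document

*Regular expression object*: tag names are matched against the pattern of the compiled regular expression

import re
print(soup.find_all(re.compile(r"^h\d"))) # tag names that start with h followed by a digit

*List*: an OR over the strings in the list

print(soup.find_all(["p","h1","h3"])) # OR: every tag named in the list
print(soup.find_all(re.compile(r"^(p|h\d)$"))) # a regular expression that matches the same tags in this document

True or None: find_all returns every node that is not a string or comment node, that is, every Tag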

from bs4 import BeautifulSoup

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(list(map(lambda x: x.name, soup.find_all(True))))
    print(list(map(lambda x: x.name, soup.find_all(None))))
    print(list(map(lambda x: x.name, soup.find_all())))

from bs4 import BeautifulSoup
from bs4.element import Tag

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    values = [True,None,False]
    for value in values:
        all = soup.find_all(value)
        print(type(all[0]))
        print(len(all))

    print("- "*30)
    count = 0
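    for i,t in enumerate(soup.descendants): # iterate over descendants of every type
        print(i,type(t),t.name)
        if isinstance(t,Tag): # count Tag nodes only
            count += 1
    print(count)
# The counts agree, so the nodes returned are Tags; the source code confirms that Tag objects are returned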

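Function

If the filters above still cannot pick out the nodes you want, you can pass a function. The function must *accept exactly one argument*.
If it returns True the current node matches; if it returns False the node does not match.
Example: find every node whose class attribute has more than one value (in the test HTML only the h3 tag qualifies)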

from bs4 import BeautifulSoup
from bs4.element import Tag

def many_classes(tag:Tag):
    # print(type(tag))
    # print(type(tag.attrs))
    return len(tag.attrs.get("class",[])) > 1

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup.find_all(many_classes))

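Keyword arguments

When a keyword argument's name is not one of the parameters the find functions already define, it is collected by kwargs and *treated as a tag attribute* to search on.
An attribute value can be a string, a regular expression object, True, or a list.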

from bs4 import BeautifulSoup
import re

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup.find_all(id="first")) # list of all nodes whose id is first
    print(1,"- "*30)
    print(soup.find_all(id=re.compile(r"\w+"))) # effectively every node that has an id
    print(2,"- " * 30)
    print(soup.find_all(id=True)) # every node that has an id

    print(3,"- " * 30)
    print(list(map(lambda x:x["id"],soup.find_all(id=True))))
    print(4,"- " * 30)
    print(soup.find_all(id=["first",re.compile(r"^sec")])) # a list of acceptable id values
    print(5,"- " * 30)
    print(soup.find_all(id=True,src=True)) # an AND condition: nodes that have both id and src

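Special handling of the CSS class

class is a Python keyword, so =class_= is used instead. class is a multi-valued attribute: you can match any one of its values, or match the whole string exactly.

print(soup.find_all(class_="content"))
print(soup.find_all(class_="title")) # any single CSS class works
print(soup.find_all(class_="highlight")) # any single CSS class works
print(soup.find_all(class_="highlight title")) # wrong order, nothing found
print(soup.find_all(class_="title highlight")) # same order, found; this is an exact string match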

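The attrs parameter

attrs takes a dict whose keys are attribute names and whose values can be a string, a regular expression object, True, or a list. Several attributes can be given.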

print(soup.find_all(attrs={"class":"title"}))
print(soup.find_all(attrs={"class":"highlight"}))
print(soup.find_all(attrs={"class":"title highlight"}))
print(soup.find_all(attrs={"id":True}))
print(soup.find_all(attrs={"id":re.compile(r"\d$")}))
print(list(map(lambda x:x.name,soup.find_all(attrs={"id":True,"src":True}))))

The text parameter

The text parameter searches the string content of the document. It accepts a string, a regular expression object, True, or a list.

from bs4 import BeautifulSoup
import re

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(list(map(lambda x:(type(x),x),soup.find_all(text=re.compile(r"\w+"))))) # returns text nodes
    print("- "*30)
    print(list(map(lambda x:(type(x),x),soup.find_all(text=re.compile("[a-z]+")))))
    print("- "*30)
    print(soup.find_all(re.compile(r"^(h|p)"),text=re.compile("[a-z]+"))) # filters Tag objects and checks each one's string against the text argument; returns Tags

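*The limit parameter*: limits the number of results returned

print(soup.find_all(id=True,limit=3)) # the returned list has 3 results

The recursive parameter
- By default the search recurses through all descendants; set it to False if that is not wanted

Shorthand

find_all() is used so often that the method name can be omitted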

from bs4 import BeautifulSoup
import re

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup("img")) # list of all img tags; same as soup.find_all("img")
    print(soup.img) # first img, depth-first

    print(soup.a.find_all(text=True)) # returns text
    print(soup.a(text=True)) # returns text, same as the line above
    print(soup("a",text=True)) # returns a tag objects

    print(soup.find_all("img",attrs={"id":"bg1"}))
    print(soup("img",attrs={"id":"bg1"})) # find_all omitted
    print(soup("img",attrs={"id":re.compile("1")}))

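The find method

find(name,attrs,recursive,text,**kwargs)
- The parameters are almost the same as for find_all.
- On a match, find_all returns a list, whereas find returns a single element object.
- On no match, find_all returns an empty list, whereas find returns None.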

from bs4 import BeautifulSoup

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    print(soup.find("img",attrs={"id":"bg1"}).attrs.get("src","xdd"))
    print(soup.find("img",attrs={"id":"bg1"}).get("src")) # shorthand that skips attrs
    print(soup.find("img",attrs={"id":"bg1"})["src"])

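CSS selectors

As in jQuery, CSS selectors can be used to find nodes.
Use the soup.select() method; it supports most CSS selectors and returns a list.
In CSS, a tag name is used as-is, a class name is prefixed with a =.= dot, and an id is prefixed with a =#= hash.
BeautifulSoup.select("css selector")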

from bs4 import BeautifulSoup

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    # element selector
    print(1,soup.select("p")) # all p tags

    # class selector
    print(2,soup.select(".title"))

    # pseudo-class
    # among the p elements that are direct children, the second p
    # (same type means same tag name p); older bs4 versions implement only nth-of-type, and it must be a number
    print(3,soup.select("div.content >p:nth-of-type(2)"))

    # id selector
    print(4,soup.select("p#second"))
    print(5,soup.select("#bg1"))

    # descendant selector
    print(6,soup.select("div p")) # p at any level under div
    print(7,soup.select("div div p")) # p at any level under a div that is itself under a div

    # child selector, direct descendants only
    print(8,soup.select("div > p")) # p tags that are direct children of a div; there are 2

    # adjacent sibling selector
    print(9, soup.select("div p:nth-of-type(1) + [src]")) # returns []
    print(9, soup.select("div p:nth-of-type(1) + p"))  # returns the p tag
    print(9, soup.select("div > p:nth-of-type(2) + input"))  # returns the input Tag
    print(9, soup.select("div > p:nth-of-type(2) + [type]"))  # same as above

    # general sibling selector
    print(10, soup.select("div p:nth-of-type(1) ~ [src]")) # returns the 2 img tags

    # attribute selectors
    print(11,soup.select("[src]")) # has a src attribute
    print(12,soup.select("[src='/']")) # src equals /
    print(13,soup.select("[src='http://www.xdd.com/']")) # exact match
    print(14,soup.select("[src^='http://www']")) # starts with http://www
    print(15,soup.select("[src$='com/']")) # ends with com/
    print(16,soup.select("img[src*='xdd']")) # contains xdd
    print(17,soup.select("img[src*='.com']")) # contains .com
    print(18,soup.select("[class='title highlight']")) # class exactly equal to 'title highlight'
    print(19,soup.select("[class~=title]")) # multi-valued attribute that includes title

Getting text content
The point of finding a node is usually to extract its text; the HTML markup is generally not needed, only the words.

from bs4 import BeautifulSoup

with open("d://xdd.html",encoding="utf-8") as f:
    soup = BeautifulSoup(f,"lxml")
    # element selector
    ele = soup.select("div") # all div tags
    print(type(ele))
    print(ele[0].string) # works only when the content is a single text node; otherwise None
    print(list(ele[0].strings)) # iterates, keeping whitespace
    print(list(ele[0].stripped_strings)) # iterates, dropping whitespace

    print("- "*30)
    print(ele[0])
    print("- " * 30)

    print(list(ele[0].text)) # text is essentially get_text(): the strings joined, whitespace kept
    print(list(ele[0].get_text())) # iterates and joins, keeping whitespace; strip defaults to False
    print(list(ele[0].get_text(strip=True))) # iterates and joins, dropping whitespace

Source of bs4.element.Tag#string

class Tag(PageElement):
    @property
    def string(self):
        if len(self.contents) != 1:
            return None
        child = self.contents[0]
        if isinstance(child, NavigableString):
            return child
        return child.string

    @string.setter
    def string(self, string):
        self.clear()
        self.append(string.__class__(string))

    def _all_strings(self, strip=False, types=(NavigableString, CData)):
        for descendant in self.descendants:
            if (
                (types is None and not isinstance(descendant, NavigableString))
                or
                (types is not None and type(descendant) not in types)):
                continue
            if strip:
                descendant = descendant.strip()
                if len(descendant) == 0:
                    continue
            yield descendant

    strings = property(_all_strings)

    @property
    def stripped_strings(self):
        for string in self._all_strings(True):
            yield string

    def get_text(self, separator="", strip=False,
                 types=(NavigableString, CData)):
        return separator.join([s for s in self._all_strings(
                    strip, types=types)])
    getText = get_text
    text = property(get_text)
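
The rules in the source above can be tried out without an HTML file. The following is a minimal sketch, not bs4 itself: `Text` and `Node` are hypothetical stand-ins for `NavigableString` and `Tag`, and only the single-child rule of `string` and the strip handling of `_all_strings` are modelled.

```python
class Text(str):
    """Hypothetical stand-in for bs4's NavigableString."""


class Node:
    """Hypothetical stand-in for bs4's Tag: a name plus child Nodes/Texts."""

    def __init__(self, name, *contents):
        self.name = name
        self.contents = list(contents)

    @property
    def string(self):
        # Same rule as the source: exactly one child, otherwise None.
        if len(self.contents) != 1:
            return None
        child = self.contents[0]
        if isinstance(child, Text):
            return child
        return child.string  # recurse into a single nested tag

    def _all_strings(self, strip=False):
        # Walk all descendants in document order and yield their text.
        for child in self.contents:
            if isinstance(child, Text):
                value = child.strip() if strip else child
                if strip and not value:
                    continue  # drop strings that are empty after stripping
                yield value
            else:
                yield from child._all_strings(strip)

    def get_text(self, separator="", strip=False):
        return separator.join(self._all_strings(strip))


single = Node("div", Node("p", Text("hello")))
mixed = Node("div", Text("\n  "), Node("p", Text(" a ")), Node("p", Text("b")))

print(single.string)                        # one nested child, so the text is found
print(mixed.string)                         # three children, so None
print(list(mixed._all_strings()))           # whitespace kept
print(list(mixed._all_strings(strip=True)))  # whitespace-only strings dropped
```

This is why `string` so often returns None on real pages: any extra child, even a whitespace-only text node, breaks the single-child rule, while `strings` and `get_text()` still work.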

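JSON parsing

Given a JSON string, extracting part of its content normally means traversing the parsed structure and testing each node along the way.
An alternative is JSONPath, which plays a role similar to XPath.
Install: =pip install jsonpath=
Official site: https://goessner.net/articles/JsonPath/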

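XPath	JSONPath	Description
`/`	`$`	Root element
`.`	`@`	Current node
`/`	`.` or `[]`	Child node
`..`	not supported	Parent node
`//`	`..`	Any depth (recursive descent)
`*`	`*`	Wildcard, matches any node
`@`	not supported	Attribute access; JSON has no attributes
`[]`	`[]`	Subscript operator
`|`	`[,]`	Union in XPath; JSONPath allows alternate names or array indices as a set
not supported	`[start:stop:step]`	Slice
`[]`	`?()`	Filter
not supported	`()`	Expression evaluation
`()`	not supported	Grouping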
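
To make the `..` (any depth) row concrete, here is a minimal sketch of recursive descent written in plain Python. It is not how the jsonpath package is implemented; `find_all` and the sample `doc` are hypothetical names used only for illustration.

```python
def find_all(node, key):
    """Collect every value stored under `key` at any depth, like `$..key`."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            found.extend(find_all(v, key))  # keep descending into the value
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


doc = {"subjects": [{"title": "A", "rate": "8.8"}, {"title": "B", "rate": "7.7"}]}
titles = find_all(doc, "title")
print(titles)
```

The traversal visits dictionaries and lists alike, which is what lets a single `$..title` expression reach keys nested inside an array of objects.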

Complete example, using the JSON for Douban's popular movies: https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&page_limit=10&page_start=0

{
    "subjects":[
        {
            "rate":"8.8",
            "cover_x":1500,
            "title":"寄生虫",
            "url":"https://movie.douban.com/subject/27010768/",
            "playable":false,
            "cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2561439800.jpg",
            "id":"27010768",
            "cover_y":2138,
            "is_new":false
        },
        {
            "rate":"7.7",
            "cover_x":1500,
            "title":"恶人传",
            "url":"https://movie.douban.com/subject/30211551/",
            "playable":false,
            "cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555084871.jpg",
            "id":"30211551",
            "cover_y":2145,
            "is_new":false
        },
        {
            "rate":"6.6",
            "cover_x":1500,
            "title":"异地母子情",
            "url":"https://movie.douban.com/subject/26261189/",
            "playable":false,
            "cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2562107493.jpg",
            "id":"26261189",
            "cover_y":2222,
            "is_new":true
        },
        {
            "rate":"6.7",
            "cover_x":2025,
            "title":"我的生命之光",
            "url":"https://movie.douban.com/subject/26962841/",
            "playable":false,
            "cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2563625370.jpg",
            "id":"26962841",
            "cover_y":3000,
            "is_new":true
        },
        {
            "rate":"7.3",
            "cover_x":2025,
            "title":"皮肤",
            "url":"https://movie.douban.com/subject/27041467/",
            "playable":false,
            "cover":"https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2559479239.jpg",
            "id":"27041467",
            "cover_y":3000,
            "is_new":true
        },
        {
            "rate":"8.9",
            "cover_x":2000,
            "title":"绿皮书",
            "url":"https://movie.douban.com/subject/27060077/",
            "playable":true,
            "cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2549177902.jpg",
            "id":"27060077",
            "cover_y":3167,
            "is_new":false
        },
        {
            "rate":"8.0",
            "cover_x":3600,
            "title":"疾速备战",
            "url":"https://movie.douban.com/subject/26909790/",
            "playable":false,
            "cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2551393832.jpg",
            "id":"26909790",
            "cover_y":5550,
            "is_new":false
        },
        {
            "rate":"7.9",
            "cover_x":1786,
            "title":"流浪地球",
            "url":"https://movie.douban.com/subject/26266893/",
            "playable":true,
            "cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2545472803.jpg",
            "id":"26266893",
            "cover_y":2500,
            "is_new":false
        },
        {
            "rate":"8.2",
            "cover_x":684,
            "title":"沦落人",
            "url":"https://movie.douban.com/subject/30140231/",
            "playable":false,
            "cover":"https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2555952192.jpg",
            "id":"30140231",
            "cover_y":960,
            "is_new":false
        },
        {
            "rate":"6.4",
            "cover_x":960,
            "title":"疯狂的外星人",
            "url":"https://movie.douban.com/subject/25986662/",
            "playable":true,
            "cover":"https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2541901817.jpg",
            "id":"25986662",
            "cover_y":1359,
            "is_new":false
        }
    ]
}

from jsonpath import jsonpath
import requests
import json

ua = "Mozilla/5.0 (Windows; U; Windows NT 6.1; zh-CN) AppleWebKit/537.36 (KHTML, like Gecko) Version/5.0.1 Safari/537.36"
url = "https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&page_limit=10&page_start=0"

with requests.get(url,headers={"User-agent":ua}) as response:
    if response.status_code==200:
        text = response.text
        print(text[:100])
        js = json.loads(text)
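        print(str(js)[:100])  # the JSON text parsed into Python data structures

        # Find the titles of all movies
        rs1 = jsonpath(js, "$..title")  # starting from the root, the title key at any depth
        print(rs1)

        # Find every subjects node
        rs2 = jsonpath(js, "$..subjects")
        print(len(rs2), str(rs2[0])[:100])  # the output is long, so print only the first 100 characters

        print("- " * 30)
        # Find the movies rated above 8
        # subjects at any depth under the root, keeping children whose rate is greater than the string "8"
        # Note: rate is a string, so the comparison is lexicographic ("8.0" > "8" is True)
        rs3 = jsonpath(js, '$..subjects[?(@.rate > "8")]')  # ?() is a filter
        print(rs3)

        print("- " * 30)
        # Same filter, then take the title child of each matching node
        rs4 = jsonpath(js, '$..subjects[?(@.rate > "8")].title')
        print(rs4)
        print("- " * 30)

        # Slice the result list in Python
        rs5 = jsonpath(js, "$..subjects[?(@.rate > '6')].title")
        print(rs5[:2])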
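
Because `rate` is stored as a string, the filter `@.rate > "8"` compares text, not numbers. The sketch below shows the difference in plain Python, without jsonpath, on three records taken from the sample data above. Converting with `float` is one possible fix and is an assumption here, not something the original code does.

```python
subjects = [
    {"title": "寄生虫", "rate": "8.8"},
    {"title": "疾速备战", "rate": "8.0"},
    {"title": "流浪地球", "rate": "7.9"},
]

# Lexicographic comparison: "8.0" is longer than "8" with the same prefix, so it counts as greater.
above_8_str = [s["title"] for s in subjects if s["rate"] > "8"]

# Numeric comparison: 8.0 is not greater than 8.
above_8_num = [s["title"] for s in subjects if float(s["rate"]) > 8]

# A two-digit score would be missed entirely by the string comparison.
pitfall = "10.0" > "8"

print(above_8_str, above_8_num, pitfall)
```

For the one-digit scores in this data set the two approaches differ only at exactly 8.0, but the numeric version is the safer choice when the threshold matters.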

RabbitMQ

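RabbitMQ is an open-source implementation of the Advanced Message Queuing Protocol (AMQP) provided by LShift. It is written in Erlang, a language known for high performance, robustness and scalability, and it inherits those strengths.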

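Installation

Download the RPM package for your platform. This walkthrough installs on CentOS 7; other platforms are similar. https://www.rabbitmq.com/install-rpm.html
RabbitMQ is developed in Erlang, so the Erlang package is required as well. For the compatibility between Erlang and RabbitMQ versions, see https://www.rabbitmq.com/which-erlang.html#compatibility-matrix
Download rabbitmq-server-3.7.16-1.el7.noarch.rpm and erlang-21.3.8.6-1.el7.x86_64.rpm. The socat dependency is available from the CentOS repositories.
Note: a second kind of startup error can be fixed by editing the hosts file and correcting the host name.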

yum -y install erlang-21.3.8.6-1.el7.x86_64.rpm rabbitmq-server-3.7.16-1.el7.noarch.rpm

List the installed files

[root@xdd ~]# rpm -ql rabbitmq-server 
/etc/logrotate.d/rabbitmq-server
/etc/profile.d/rabbitmqctl-autocomplete.sh
/etc/rabbitmq
/usr/lib/ocf/resource.d/rabbitmq/rabbitmq-server
/usr/lib/ocf/resource.d/rabbitmq/rabbitmq-server-ha
/usr/lib/rabbitmq/autocomplete/bash_autocomplete.sh
/usr/lib/rabbitmq/autocomplete/zsh_autocomplete.sh
/usr/lib/rabbitmq/bin/cuttlefish
/usr/lib/rabbitmq/bin/rabbitmq-defaults

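Configuration

Reference: https://www.rabbitmq.com/configure.html#config-location

Environment configuration

System environment variables are used first. If a variable is not set there, the value defined in rabbitmq-env.conf is used; otherwise the default applies.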

RABBITMQ_NODE_IP_ADDRESS the empty string, meaning that it should bind to all network interfaces.  
RABBITMQ_NODE_PORT 5672  
RABBITMQ_DIST_PORT RABBITMQ_NODE_PORT + 20000  # used for communication between nodes and with the CLI tools  
RABBITMQ_CONFIG_FILE path of the configuration file, /etc/rabbitmq/rabbitmq by default
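
The relationship between the two port variables can be worked through in a few lines. This is only an illustration of the defaults listed above; `resolve_ports` is a hypothetical helper, not part of RabbitMQ or any client library.

```python
import os


def resolve_ports(env):
    """Return (node_port, dist_port) using the defaults listed above."""
    node_port = int(env.get("RABBITMQ_NODE_PORT", 5672))
    # Unless set explicitly, the distribution port is derived from the node port.
    dist_port = int(env.get("RABBITMQ_DIST_PORT", node_port + 20000))
    return node_port, dist_port


print(resolve_ports({}))           # defaults only
print(resolve_ports(os.environ))   # whatever the current environment defines
```

With no overrides this gives 5672 and 25672, which are the AMQP port and the inter-node port that a default installation listens on.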

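The environment variable file is optional.

Runtime configuration file

The rabbitmq.config configuration file
Version 3.7 supports both the old and the new configuration file formats.
The Erlang-term format is still accepted for backward compatibility.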
Figure 23: rabbitmq_001
The sysctl format is the one RabbitMQ encourages when backward compatibility is not needed. (This file is optional as well.)

Plugin management

List all available plugins

rabbitmq-plugins list

Enable the web management plugin. This also enables the plugins it depends on.

[root@xdd rabbitmq]$ rabbitmq-plugins enable rabbitmq_management

Start the service

systemctl start rabbitmq-server

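The following error may occur during startup
1. =Error when reading /var/lib/rabbitmq/.erlang.cookie: eacces= This is a permission problem on that file. Change its owner and group to rabbitmq: =chown rabbitmq:rabbitmq /var/lib/rabbitmq/.erlang.cookie=
The service started successfully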

[root@xdd ~]# ss -tanl | grep 5672
LISTEN     0      128          *:25672                    *:*
LISTEN     0      128          *:15672                    *:*
LISTEN     0      128         :::5672                    :::*
[root@xdd ~]#

User management

Log in to the web interface at =http://192.168.61.108:15672= (use the IP address of the host running RabbitMQ).
The guest/guest account can only log in locally; a remote login is rejected.
The rabbitmqctl command
- rabbitmqctl [-n <node>] [-l] [-q] <command> [<command options>]
- General options:
  1. -n node
  2. -q, --quiet
  3. -t, --timeout timeout
  4. -l longnames
- Commands:
  1. add_user <username> <password> add a user
  2. list_users list users
  3. delete_user <username> delete a user
  4. change_password <username> <password> change a user's password
  5. set_user_tags <username> <tag> [...] set a user's tags
  6. list_user_permissions <username> list a user's permissions
Add a user: =rabbitmqctl add_user username password=
Delete a user: =rabbitmqctl delete_user username=
Change a password: =rabbitmqctl change_password username newpassword=
Set tags, which in effect assigns the user to a group: =rabbitmqctl set_user_tags username tag=
Give the gdy user the administrator tag, then log in

# rabbitmqctl add_user gdy gdy  # add the gdy user
# rabbitmqctl list_users  # list all users
# rabbitmqctl set_user_tags gdy administrator  # tag gdy as an administrator

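Meaning of the tags:
1. administrator can manage users, permissions and virtual hosts.
Basic information (web management port 15672, protocol port 5672)
Virtual hosts
1. The default virtual host can, by default, only be used by the guest user connecting from the local machine. The newly created user gdy cannot access any virtual host by default.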

The Pika library

Pika is a pure-Python library that implements the AMQP protocol
1. pip install pika

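How RabbitMQ works and how to use it

Working modes

See the official tutorials: https://www.rabbitmq.com/getstarted.html

Terminology

Term	Description
Server	The process (service) that accepts client connections and implements message queuing and routing; also called the message broker. Note: a client may be a producer or a consumer, and both need to connect to the Server
Connection	The physical network connection
Channel	A logical connection inside a Connection; one Connection can carry multiple Channels
Exchange	Receives messages from producers and decides how to route them to queues on the server. Common types: direct (point-to-point), topic (publish-subscribe), fanout (multicast)
Message	A message
Message Queue	A message queue, the structure that stores the data
Bind	A binding establishes the relationship between a message queue and an exchange, that is, which of the messages arriving at the exchange are delivered to which queue
Virtual Host	A collection of exchanges, message queues and related objects. Virtual hosts group exchanges and queues so that multiple users do not interfere with each other
Topic	A topic
Broker	Equivalent to Server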

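1. Queue

This mode is the simplest producer-consumer model; the message queue is a FIFO queue.
The producer is send.py and the consumer is receive.py
Official example: https://www.rabbitmq.com/tutorials/tutorial-one-python.html
Note: running it may produce the following result

pika.exceptions.ProbableAuthenticationError: (403, 'ACCESS_REFUSED - Login was refused using authentication mechanism PLAIN. For details see the broker logfile.')

Access is refused. This is again a permission problem: the guest user can only access the default virtual host from localhost.
Solution
1. The default virtual host is only accessible locally by default. For security reasons, do not open it up to remote access.
2. Instead, create a new virtual host named test under Admin, Virtual hosts.
3. Note which users have access to the new test virtual host; here it is the gdy user.

ConnectionParameters has no username or password parameter. Credentials are passed through the credentials parameter, which takes a credentials object such as pika.PlainCredentials.
Following the official example, write a small program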

# send.py
import pika
from pika.adapters.blocking_connection import BlockingChannel

# Build the username/password credentials object
credential = pika.PlainCredentials("gdy", "gdy")
# Configure the connection parameters
params = pika.ConnectionParameters(
    "192.168.61.108",  # host IP address
    5672,  # port
    "test",  # virtual host
    credential  # username and password
)

# # Alternative way to build the connection parameters
# params = pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test")

# Open the connection
connection = pika.BlockingConnection(params)

with connection:
    # Open a channel
    channel: BlockingChannel = connection.channel()

    # Declare a queue named hello; a message routed to a queue that does not exist is dropped
    channel.queue_declare(queue="hello")

    channel.basic_publish(
        exchange="",  # use the default exchange
        routing_key="hello",  # required; with the default exchange it must equal the target queue name
        body="Hello world"  # message body
    )
    print("Sent Message OK")

The test passes. Check Exchanges and Queues in the management interface.
URLParameters: the connection parameters can also be created from a URL

# amqp://username:password@host:port/<virtual_host>[?query-string] 
parameters = pika.URLParameters('amqp://guest:guest@rabbit-server1:5672/%2F') 
# %2F is the URL-encoded form of /, i.e. the default virtual host
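
To see how such a URL breaks down into the individual connection settings, the standard library is enough. The sketch below does not use pika and does not claim to mirror how pika parses the URL internally; `describe_amqp_url` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, unquote


def describe_amqp_url(url):
    """Split amqp://username:password@host:port/<virtual_host> into its parts."""
    parts = urlsplit(url)
    return {
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # Drop the leading "/" of the path, then decode %2F back to "/".
        "virtual_host": unquote(parts.path[1:]),
    }


info = describe_amqp_url("amqp://guest:guest@rabbit-server1:5672/%2F")
print(info)
```

The virtual host has to be percent-encoded because a bare `/` at that position would be read as the path separator, leaving the virtual host name empty.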

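queue_declare declares a queue and creates it if necessary.
basic_publish: an empty exchange name selects the default exchange. If the named exchange cannot be found, an exception is raised.
With the default exchange, routing_key is mandatory, because it is used to find the queue.
Modify the producer above so that it sends messages continuously, and watch the Ready count under Queues in the web interface change.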

# send.py
import pika
from pika.adapters.blocking_connection import BlockingChannel
import time

# Build the connection parameters from a URL
params = pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test")
# Open the connection
connection = pika.BlockingConnection(params)

with connection:
    # Open a channel
    channel: BlockingChannel = connection.channel()

    # Declare a queue named hello; a message routed to a queue that does not exist is dropped
    channel.queue_declare(queue="hello")

    for i in range(40):
        channel.basic_publish(
            exchange="",  # use the default exchange
            routing_key="hello",  # required; with the default exchange it must equal the target queue name
            body="data{:02}".format(i)  # message body
        )
        time.sleep(0.5)
    print("Sent Message OK")

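Build the consumer, receive.py
Consuming a single message
- BlockingChannel.basic_get(queue, auto_ack) -> (method, props, body)
  - body is the returned message
  - the call does not wait for a message; if the queue is empty it returns (None, None, None)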

# receive.py
import pika
from pika.adapters.blocking_connection import BlockingChannel

# Open the connection
params = pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test")
connection = pika.BlockingConnection(params)

with connection:
    channel: BlockingChannel = connection.channel()
    msg = channel.basic_get("hello", True)  # fetch one message from the hello queue; True enables auto_ack
    method, props, body = msg  # (None, None, None) when the queue is empty
    if body:
        print("Get A message = {}".format(body))
    else:
        print("empty")

When a message is fetched, msg has the following structure:

(<Basic.GetOk(['delivery_tag=1', 'exchange=', 'message_count=38', 'redelivered=False', 'routing_key=hello'])>, <BasicProperties>, b'data01')  
Returned tuple: (method, properties, body)
When there is no data: (None, None, None)

Consuming messages in bulk, receive.py

# receive.py consumer code
import pika
from pika.adapters.blocking_connection import BlockingChannel

# Open the connection
params = pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test")
connection = pika.BlockingConnection(params)

def callback(channel, method, properties, body):
    print("Get a message = {}".format(body))

with connection:
    channel: BlockingChannel = connection.channel()
    channel.basic_consume(
        "hello",  # queue name
        callback,  # callback invoked for each message
        True,  # auto_ack: no explicit acknowledgement is sent
    )
    print("Waiting for messages. To exit press CTRL+C")
    channel.start_consuming()

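2. Work queues

Reuse the producer and consumer code from the *Queue* mode and start 2 consumers. The result shows the 2 consumers taking different messages in turn.
This mode is competitive: any given message can be taken by only one consumer.
1. The result shows that messages are handed out round-robin.
2. Note: although no exchange is declared explicitly, the *default exchange* is used.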
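3. Publish/Subscribe

In publish/subscribe, think of subscribers (consumers) subscribing to a newspaper (the messages): every subscriber should receive a copy with the same content.
Between the publisher and the consumers sits an exchange, which can be thought of as a post office: consumers subscribe to the newspaper at the post office, the publisher sends the newspaper to the post office, and the post office decides how to deliver it to the consumers.
The work queue mode in the previous example is like each person receiving a different newspaper, so it does not fit publish/subscribe.
In this mode the exchange type is fanout, which is one-to-many, i.e. broadcast.
Note that a message in a given queue can be consumed only once, so several queues are used here: to guarantee that different consumers receive the same data, every consumer should have its own queue.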

# Declare an exchange
channel.exchange_declare(
    exchange="logs",  # new exchange
    exchange_type="fanout"  # broadcast
)

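The producer uses *broadcast mode*. An exchange named logs is created in the test virtual host.
The queues can be created either by the producer or by the consumer.
Here the consumer creates them. The producer sends all data to the logs exchange, which, being of type fanout, forwards it to every queue already bound to it.

Binding: establish the relationship between an exchange and a queue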

# Consumer side
result = channel.queue_declare(queue="")  # declare a queue with a randomly generated name
result = channel.queue_declare(queue="", exclusive=True)  # same, and the queue is deleted when the connection closes

# Declare the queues
q1 = channel.queue_declare(queue="", exclusive=True)
q2 = channel.queue_declare(queue="", exclusive=True)
q1name = q1.method.queue  # the generated name is available as result.method.queue
q2name = q2.method.queue

print(q1name, q2name)

# Bind the queues to the exchange
channel.queue_bind(exchange="logs", queue=q1name)
channel.queue_bind(exchange="logs", queue=q2name)

Producer code
1. Watch the exchanges and queues

# send.py producer code
import pika
from pika.adapters.blocking_connection import BlockingChannel
import time

# Open the connection
params = pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test")
connection = pika.BlockingConnection(params)
channel: BlockingChannel = connection.channel()

with connection:
    # Declare the exchange and its type
    channel.exchange_declare(
        exchange="logs",  # new exchange
        exchange_type="fanout"  # fan out, broadcast
    )

    for i in range(40):
        channel.basic_publish(
            exchange="logs",  # use the named exchange
            routing_key="",  # broadcast mode, no routing_key needed
            body="data-{:02}".format(i)  # message body
        )
        time.sleep(0.01)
    print("Sent OK")

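Important: if the producer is started first, no queue exists yet; observe what happens to the data.
Consumer code
1. Create the queues and bind them to the logs exchange in the test virtual host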

# receive.py consumer code

import time
import pika
from pika.adapters.blocking_connection import BlockingConnection
from pika.adapters.blocking_connection import BlockingChannel

connection: BlockingConnection = pika.BlockingConnection(pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test"))
channel: BlockingChannel = connection.channel()
# Declare the exchange
channel.exchange_declare(exchange="logs", exchange_type="fanout")

q1 = channel.queue_declare(queue="", exclusive=True)
q2 = channel.queue_declare(queue="", exclusive=True)
name1 = q1.method.queue  # queue name
name2 = q2.method.queue

# Bind the queues to the exchange
channel.queue_bind(exchange="logs", queue=name1)
channel.queue_bind(exchange="logs", queue=name2)

def callback(channel, method, properties, body):
    print("{}\n{}".format(channel, method))
    print("Get a message = {}".format(body))

with connection:
    # Register the consumer callback on the first queue
    channel.basic_consume(
        name1,  # queue name
        callback,  # consumer callback
        True  # auto_ack: no explicit acknowledgement is sent
    )
    # Register the consumer callback on the second queue
    channel.basic_consume(name2, callback, True)

    print("Waiting for messages. To exit press CTRL+C")
    channel.start_consuming()

Start the consumer receive.py first; the exchange has then already been created.

With a fanout exchange, i.e. broadcast, the routing_key does not matter.

q1 = channel.queue_declare(queue="",exclusive=True)
q2 = channel.queue_declare(queue="",exclusive=True)

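Note: for the demonstration, start the consumer first and then the producer. If the producer is started first and the consumer afterwards, part of the data is lost, because the exchange receives messages while no queue is bound to take them and therefore discards them.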
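
The loss described above can be reproduced with an in-memory model. The class below is a simplified sketch, not RabbitMQ's implementation: `FanoutExchange` is a hypothetical name and plain lists stand in for queues.

```python
class FanoutExchange:
    """Toy fanout exchange: copies every message to all bound queues."""

    def __init__(self):
        self.queues = []

    def bind(self, queue):
        self.queues.append(queue)

    def publish(self, message):
        # Returns how many queues received a copy; 0 means the message was dropped.
        for queue in self.queues:
            queue.append(message)
        return len(self.queues)


logs = FanoutExchange()
delivered_early = logs.publish("data-00")  # producer started before any consumer

q1, q2 = [], []
logs.bind(q1)
logs.bind(q2)
delivered_late = logs.publish("data-01")   # both queues are bound now

print(delivered_early, delivered_late, q1, q2)
```

The exchange itself stores nothing, so only messages published after a queue has been bound can reach that queue.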

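4. Routing

Routing means that when the producer's data passes through the exchange, matching rules decide where it goes.
Producer code: the exchange type is direct and each message is published with a routing key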

# send.py producer
import time
import pika
import random
from pika.adapters.blocking_connection import BlockingConnection
from pika.adapters.blocking_connection import BlockingChannel

exchangename = "color"
colors = ("orange", "black", "green")

# Open the connection
connection: BlockingConnection = pika.BlockingConnection(pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test"))
channel: BlockingChannel = connection.channel()

with connection:
    channel.exchange_declare(
        exchange=exchangename,  # use the named exchange
        exchange_type="direct"  # routing mode
    )

    for i in range(40):
        rk = random.choice(colors)
        msg = "{}-data-{:02}".format(rk, i)
        channel.basic_publish(
            exchange=exchangename,
            routing_key=rk,  # routing key of this message
            body=msg  # message body
        )
        print(msg, "----")
        time.sleep(0.01)
    print("Sent ok")

Consumer code

# receive.py consumer
import time
import pika
import random
from pika.adapters.blocking_connection import BlockingConnection
from pika.adapters.blocking_connection import BlockingChannel

exchangename = "color"
colors = ("orange", "black", "green")

# Open the connection
connection: BlockingConnection = pika.BlockingConnection(pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test"))
channel: BlockingChannel = connection.channel()
channel.exchange_declare(exchange=exchangename, exchange_type="direct")

# Declare queues with random names; exclusive=True deletes the queue when the connection closes
q1 = channel.queue_declare(queue="", exclusive=True)
q2 = channel.queue_declare(queue="", exclusive=True)
name1 = q1.method.queue  # generated queue name
name2 = q2.method.queue
print(name1, name2)

# Bind to the exchange; a routing_key must be given
channel.queue_bind(exchange=exchangename, queue=name1, routing_key=colors[0])
channel.queue_bind(exchange=exchangename, queue=name2, routing_key=colors[1])
channel.queue_bind(exchange=exchangename, queue=name2, routing_key=colors[2])

def callback(channel, method, properties, body):
    print("{}\n{}".format(channel, method))
    print("Get a message = {}".format(body))
    print()

with connection:
    channel.basic_consume(
        name1,  # queue name
        callback,  # message callback
        True  # auto_ack: no explicit acknowledgement is sent
    )
    channel.basic_consume(name2, callback, True)
    print("Waiting for messages. To exit press CTRL+C")
    channel.start_consuming()

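Note: if several queues are bound with the same routing_key, for example two queues both bound with routing_key='black', the behaviour resembles fanout in being one-to-many, but it is not the same.
1. With fanout, the exchange does not filter the data: every bound queue receives a copy of each message.
2. With direct, data is distributed by routing_key. If 2 queues are bound with black, a message published with the routing key black is delivered to both of them.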
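
The difference from fanout can be seen in a small in-memory model. As before, this is a sketch under simplifying assumptions and not RabbitMQ code: `DirectExchange` is a hypothetical class, lists stand in for queues, and the bindings are chosen so that black is shared by two queues.

```python
from collections import defaultdict


class DirectExchange:
    """Toy direct exchange: delivers only to queues bound with the same key."""

    def __init__(self):
        self.bindings = defaultdict(list)  # routing_key -> list of queues

    def bind(self, queue, routing_key):
        self.bindings[routing_key].append(queue)

    def publish(self, routing_key, message):
        # Messages whose key has no binding are silently dropped.
        for queue in self.bindings.get(routing_key, []):
            queue.append(message)


color = DirectExchange()
q1, q2 = [], []
color.bind(q1, "orange")
color.bind(q1, "black")
color.bind(q2, "black")
color.bind(q2, "green")

for rk in ("orange", "black", "green", "red"):
    color.publish(rk, rk + "-data")

print(q1)
print(q2)
```

Only the black message is duplicated, because it is the only key bound to both queues; the red message has no binding and reaches nobody.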

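5. Topic

Topic is a more advanced form of routing that adds pattern matching.
A topic routing_key must consist of words separated by dots (=.=) and can be at most 255 bytes long.
Wildcards are supported:
1. =*= matches exactly one word
2. =#= matches zero or more words
If a queue is bound with a routing_key that is just =#=, the queue receives every message.
If no wildcard is used, the effect is similar to direct, because only exact string matching is possible.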
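
The wildcard rules above can be expressed as a short matching function. This is an illustrative sketch of the semantics only; the broker's own matching is implemented differently, and `topic_matches` is a hypothetical name.

```python
def topic_matches(pattern, routing_key):
    """Return True if a dot-separated routing key matches a binding pattern."""
    return _match(pattern.split("."), routing_key.split("."))


def _match(pattern_words, key_words):
    if not pattern_words:
        return not key_words  # both exhausted means a full match
    head, rest = pattern_words[0], pattern_words[1:]
    if head == "#":
        # '#' may absorb zero or more words: try every possible split point.
        return any(_match(rest, key_words[i:]) for i in range(len(key_words) + 1))
    if not key_words:
        return False
    if head == "*" or head == key_words[0]:
        return _match(rest, key_words[1:])  # '*' consumes exactly one word
    return False


for pattern, key in [("phone.*", "phone.red"), ("*.red", "pc.black"), ("#", "tv.green")]:
    print(pattern, key, topic_matches(pattern, key))
```

A key such as phone.red matches both =phone.*= and =*.red=, so a message published with it is delivered to every queue bound with either pattern.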
Producer code

# send.py producer code
import time
import pika
import random
from pika.adapters.blocking_connection import BlockingConnection
from pika.adapters.blocking_connection import BlockingChannel

exchangename = "products"
# Combinations of product and colour
colors = ("orange", "black", "green", "red")  # red is included so that the *.red binding can match
topics = ("phone.*", "*.red")  # the two binding patterns used by the consumer
product_type = ("phone", "pc", "tv")  # 3 kinds of product

# Open the connection
connection: BlockingConnection = pika.BlockingConnection(pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test"))
channel: BlockingChannel = connection.channel()
# Declare the exchange in topic mode
channel.exchange_declare(exchange=exchangename, exchange_type="topic")


with connection:
    for i in range(40):
        rk = "{}.{}".format(random.choice(product_type), random.choice(colors))
        msg = "{}-data-{:02}".format(rk, i)
        channel.basic_publish(
            exchange=exchangename,  # use the named exchange
            routing_key=rk,  # routing key of this message
            body=msg  # message body
        )
        print(msg, "-----")
        time.sleep(0.5)
    print("Sent ok")

Consumer code

# receive.py consumer code
import time
import pika
import random
from pika.adapters.blocking_connection import BlockingConnection
from pika.adapters.blocking_connection import BlockingChannel
# Exchange name
exchangename = "products"


# Open the connection
connection: BlockingConnection = pika.BlockingConnection(pika.URLParameters("amqp://gdy:gdy@192.168.61.108:5672/test"))
channel: BlockingChannel = connection.channel()
# Declare the exchange in topic mode
channel.exchange_declare(exchange=exchangename, exchange_type="topic")

# Declare queues with random names; exclusive=True deletes the queue when the connection closes
q1 = channel.queue_declare(queue="", exclusive=True)
q2 = channel.queue_declare(queue="", exclusive=True)
name1 = q1.method.queue  # generated queue name
name2 = q2.method.queue
print(name1, name2)

# Bind to the exchange; a routing_key must be given
# q1 collects only messages whose routing_key starts with phone, i.e. one product type
channel.queue_bind(exchange=exchangename, queue=name1, routing_key="phone.*")
# q2 collects only messages whose routing_key ends with red, i.e. one colour
channel.queue_bind(exchange=exchangename, queue=name2, routing_key="*.red")

def callback(channel, method, properties, body):
    print("{}\n{}".format(channel, method))
    print("Get a message = {}".format(body))
    print()

with connection:
    channel.basic_consume(
        name1,  # queue name
        callback,  # message callback
        True  # auto_ack: no explicit acknowledgement is sent
    )
    channel.basic_consume(name2, callback, True)

    print("Waiting for messages. To exit press CTRL+C")
    channel.start_consuming()

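Watch the data the consumer receives, and note how many times messages with the routing key phone.red appear.
This shows that *when the exchange routes a message, it delivers the message to every queue whose binding key matches the message's routing_key.*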

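RPC (remote procedure call)

RPC over RabbitMQ has relatively few use cases, because better RPC frameworks exist.

The role of message queues

Decoupling between systems
Matching the speed of producers and consumers
Any project of moderate size is developed in layers and modules. Modules and systems should avoid coupling to each other directly and instead expose public interfaces for others to call. Because such calls can cause concurrency problems, middleware is often placed between them to buffer and decouple.
RabbitMQ is just one message middleware application, and one of the more commonly used ones.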
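
The buffering and decoupling role can be illustrated inside a single process with the standard library. This is an analogy only: `queue.Queue` stands in for the broker, and `run_demo` with its sleep-based slow consumer is a hypothetical setup.

```python
import queue
import threading
import time


def run_demo(n_messages=5):
    """A fast producer and a slow consumer that only share a queue."""
    buffer = queue.Queue()
    received = []

    def producer():
        for i in range(n_messages):
            buffer.put("data{:02}".format(i))  # returns immediately
        buffer.put(None)  # sentinel: no more messages

    def consumer():
        while True:
            item = buffer.get()  # blocks until something is available
            if item is None:
                break
            time.sleep(0.01)  # simulate slow processing
            received.append(item)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received


received = run_demo()
print(received)
```

The producer never calls the consumer and never waits for it; the queue absorbs the speed difference, which is the same job a message broker does between separate systems.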