1. 数据请求

urllib模块

导入模块

# python3
import urllib.request

# python2
import urllib2

发送请求

urllib.request.urlopen(url)

args:url
return:resonse对象

获取数据

response.data()

return:网页源码字节串

response.data().decode('utf-8')

args:编码方式
return : 将网页源码字节串根据对应的编码方式转换为字符串

发送url带参数的请求

import urllib.request
import urllib.parse

url_head="https://www.baidu.com/s?"
params={
    "wd":"案子",
    "key":"zhangsan",
    "value":"san"
}

#访问网址有多个参数需要传入时使用以下代码，实现自动拼接参数和字符替换
str_params=urllib.parse.urlencode(params)

final_url=url+str_params
print(final_url)

response=urllib.request.urlopen(final_url)

发送带请求头的请求

1.根据url构建请求对象

2.在请求对象中添加请求头

方式一

import urllib.request
import string
import random

UA_list=[
    "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/14.0.835.163 Safari/535.1",
    "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0"
]

URL="https://www.baidu.com/s?wd=hello"

request=urllib.request.Request(URL)

random_UA=random.choice(UA_list)

#增加请求头信息
request.add_header("User-Agent",random_UA)

方式二

#自定义请求头
header = {
    # 浏览器请求头
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36 Edg/91.0.864.71",
    "authority": "ss1.bdstatic.com"
}

#自定义响应对象,即实例化对象
request=urllib.request.Request(url,headers=header)

3.发送请求

#返回请求对象
response=urllib.request.urlopen(request)

用代理IP发送请求

1.用代理IP创建handler处理器

创建普通handler

urllib.request.HTTPHandler()

创建代理handler

urllib.request.ProxyHandler(proxy)

args->dict :代理参数
return : 代理handler

免费代理IP的写法:

proxy={
    #写法一
    "http":"http://120.77.249.46:8080",
    #写法二
    "http":"120.77.249.46:8080",
}

付费IP的写法:

写法一：

#付费IP写法
proxy={
     "http":"<username>:<password>@<IP:port>"
}

写法二：

#第一步:自定义变量保存用户名、密码以及IP
#username=<username>
#password=<password>
#IP=<proxy_IP>

#第二步：创建密码管理器
password_manger=urllib.request.HTTPPasswordMgrWithDefaultRealm()

#第三步：为资源管理器添加密码和用户名
password_manger.add_password(None,IP,username,password)

#第四步：用密码管理器创建handler
proxy_handler = urllib.request.HTTPBasicAuthHandler(password_manger)

2.用handler处理器创建opener

urllib.request.build_opener(handler)

args: handler对象
return:opner对象

3.用opener发送请求

import urllib.request
import random

#ssl(安全套接层) 第三方的CA数字证书
# urllib.request.urlopen()和opener.open()一样，都是返回一个响应对象，实际上urllib.request.urlopen()内部调用了opener.open()
# python自带的urlopen()方法不能添加代理IP，如果要添加需要用自定义的opener请求数据

URL="http://www.cnblogs.com/zrmw/p/9332801.html"

IP_list=[
    {"http":"http://120.77.249.46:8080"},
    {"http":"http://106.75.226.36:808"},
    {"http":"http://61.135.217.7:80"}
]

random_IP=random.choice(IP_list)

proxy_handler=urllib.request.ProxyHandler(random_IP)

opener=urllib.request.build_opener(proxy_handler)

try:
    response=opener.open(URL,timeout=1.0)
    
except Exception as e:
        print(e)

向内网发送请求

1.创建密码管理器

password_manager=urllib.request.HTTPPasswordMgrWithDefaultRealm()

2.为密码管理器添加信息

username=<username>->str
password=<password>->str
intranet_IP=<proxy_IP>->str

# 添加的IP是本机IP
password_manger.add_password(None,IP,username,password)

3.用密码管理器创建handler

proxy_handler = urllib.request.HTTPBasicAuthHandler(password_manger)

4.创建opener

opener=urllib.request.build_opener(proxy_handler)

5.发送请求

response = opener.open(intranet_IP)

requests模块

安装Python库

pip install requests

导入Python库

import requests

Sending requests

# get请求
r = requests.get('https://api.github.com/events')

# post请求
form_data = {
    'key1': value1,
    'key2': value2
}
r = requests.post('https://httpbin.org/post', data = form_data)

Passing Parameters In URLs

payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.get('https://httpbin.org/get', params=payload)

print(r.url)
# https://httpbin.org/get?key2=value2&key1=value1
    
payload = {'key1': 'value1', 'key2': ['value2', 'value3']}
r = requests.get('https://httpbin.org/get', params=payload)

print(r.url)
# https://httpbin.org/get?key1=value1&key2=value2&key2=value3

Binary Response Content

You can also access the response body as bytes, for non-text requests:

print(r.content)
# b'[{"repository":{"open_issues":0,"url":"https://github.com/...

For example, to create an image from binary data returned by a request, you can use the following code:

from PIL import Image
from io import BytesIO

i = Image.open(BytesIO(r.content))

JSON Response Content

There’s also a builtin JSON decoder, in case you’re dealing with JSON data:

import requests

r = requests.get('https://api.github.com/events')
r.json()
# [{'repository': {'open_issues': 0, 'url': 'https://github.com/...

proxy

proxies=proxy
- {}

auth认证

Many web services that require authentication accept HTTP Basic Auth. This is the simplest kind, and Requests supports it straight out of the box.

from requests.auth import HTTPBasicAuth
user = ''
password = ''

requests.get('https://api.github.com/user', auth=HTTPBasicAuth('user', 'password'))
# <Response [200]>

In fact, HTTP Basic Auth is so common that Requests provides a handy shorthand for using it: Providing the credentials in a tuple like this is exactly the same as the HTTPBasicAuth example belwow

requests.get('https://api.github.com/user', auth=('user', 'password'))
# <Response [200]>

ssl认证

HTTPs CA证书
忽略https证书认证
- verify=False

# https是有第三方CA证书认证的
# 但是12306的证书不是CA，而是自己颁布的证书
# 解决方法：忽略第三方认证，加参数verify=False
url = "https://www.12306.cn"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
}
response = requests.get(url, headers=headers, verify=False)

cookies

cookie

Session Object

The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance.A Session object has all the methods of the main Requests API.

s = requests.Session()

s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get('https://httpbin.org/cookies')

print(r.text)
# '{"cookies": {"sessioncookie": "123456789"}}'

创建session对象

import requests

post_url = "https://github.com/session"
session = requests.session()

先对某个网址发送请求以获取cookie，以便于session记录并保持cookies

login_url = "https://github.com/login"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50"
}

# 表单数据很多，但是其中有很多项数据并不影响操作结果，具体需要自己一个一个尝试
# 有些数据需要自己构造
post_data = {
    'commit': 'Sign in',
    'authenticity_token': 'vm7cR/BQGTzJcTwzQDIexrkBKuiae7I6MSYfBCCMZ+d1le+HXBmZ0bSX7RjVCpOVWYUbmjhR8lhPi0NnvQ7xtw==',
    'login': '1262026927@qq.com',  # 账户
    'password': '18738174181ljr',  # 密码
	# 此处省略多项数据
    'timestamp': '1631372437015',
    'timestamp_secret': 'bc5b354b1ea052ba8006d389dcb010f2a66640bfd77594bb3e0c66eb96affe93'
}

session.post(url=login_url, headers=headers, data=post_data)

向目标页面发送POST请求

target_url = "https://github.com/settings/profile"
session.get(url=target_url, headers=headers)

获取响应

html = session.content

response响应对象的其它属性或方法

response.url响应的url；有时候响应的url和请求的url并不一致
response.status_code 响应状态码
response.request.headers 请求头
response.headers 响应头
response.request._cookies 响应对应请求的cookie；返回cookieJar类型
response.cookies 响应的cookie（经过了set-cookie动作；返回cookieJar类型
response.json()自动将json字符串类型的响应内容转换为python对象（dict or list）

python-requests官方文档

2. 数据解析

正则

Python之re模块
python3正则表达式的几个高级用法
re模块详解
https://blog.csdn.net/make164492212/article/details/51699545

匹配中文: [ \u4e00-\u9fa5]

xpath(lxml)

语法

//
- 跨级寻找节点，忽略位置
/text()
- 取标签内容
/@<属性名>
- 取属性
[]
- 填入数字
  - 作下标用，起始下标为1
- 填入表达式
  - 根据表达式进行元素筛选
模糊查询
- //div[contains(@id,'187381741')]
取平级关系的下一个节点
- /following-sibling::*[1]

使用

1.安装解析库

pip install lxml

2.导入模块

from lxml import etree

3.将响应的源码转换为xpath可以解析的对象

html = response.content
data = etree.HTML(html)

4.开始解析

data.xpath('')

bs4

安装Python模块

pip install BeautifulSoup4

导入Python库

from bs4 import BeautifulSoup

解析器

解析器	使用方法	优势	劣势
Python标准库	`BeautifulSoup(markup, "html.parser")`	Python的内置标准库执行速度适中文档容错能力强	Python 2.7.3 or 3.2.2)前的版本中文档容错能力差
lxml HTML 解析器	`BeautifulSoup(markup, "lxml")`	速度快文档容错能力强	需要安装C语言库
lxml XML 解析器	BeautifulSoup(markup, ["lxml-xml"])``BeautifulSoup(markup, "xml")	速度快唯一支持XML的解析器	需要安装C语言库
html5lib	`BeautifulSoup(markup, "html5lib")`	最好的容错性以浏览器的方式解析文档生成HTML5格式的文档	速度慢不依赖外部扩展

推荐使用lxml作为解析器,因为效率更高. 在Python2.7.3之前的版本和Python3中3.2.2之前的版本,必须安装lxml或html5lib, 因为那些Python版本的标准库中内置的HTML解析方法不够稳定.

使用方法

将一段文档传入BeautifulSoup 的构造方法,就能得到一个文档的对象, 可以传入一段字符串或一个文件句柄.
```
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("index.html"))

soup = BeautifulSoup("<html>data</html>")
```

示例文档

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link2">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>

对象的种类

Beautiful Soup将复杂HTML文档转换成一个复杂的树形结构,每个节点都是Python对象,所有对象可以归纳为4种: Tag , NavigableString , BeautifulSoup , Comment .

NavigableString
Tag
BeautifulSoup
Comment

NavigableString

tag.string的类型是NavigableString

tag.string
# u'Extremely bold'
type(tag.string)
# <class 'bs4.element.NavigableString'>

方法

tag.string.replace_with(str)
- tag中包含的字符串不能编辑,但是可以被替换成其它的字符串

Tag

tag = soup.<标签名>
- 返回标签名符合的第一个标签
属性(返回值仍然是Tag对象)

tag.name
- 返回该节点的标签名
tag.contents
- 以列表形式返回直接子节点
- 标签节点返回列表，字符串节点调用该属性会报错
```
head_tag = soup.head
head_tag
# <head><title>The Dormouse's story</title></head>

head_tag.contents
[<title>The Dormouse's story</title>]

title_tag = head_tag.contents[0]
title_tag
# <title>The Dormouse's story</title>
title_tag.contents
# [u'The Dormouse's story']
 
text = title_tag.contents[0]
text.contents
# AttributeError: 'NavigableString' object has no attribute 'contents'
```
tag.str'<属性名>
- 返回该节点的该属性值，等价于tag.get()
tag[str'<属性名>]
- 返回该节点的该属性值，等价于tag.get()
tag.string
- tag对象包含多个子节点时，返回None
- tag对象只包含一个标签子节点时，返回该标签子节点的字符串
- tag对象只包含一个字符串子节点，返回字符串
tag.children
- 循环生成直接子节点
tag.descendants

tag.strings
- 包含子节点的所有字符串
tag.stripped_strings
- 包含子节点的所有字符串
- 去除字符串中的多余的空格和空行
- 全部是空格的行会被忽略掉,段首和段末的空白会被删除
tag.parent
- 返回该节点的父节点
- 顶层节点的父节点是BeaufulSoup对象
- BS对象的父节点是None
tag.parents
- 递归返回所有父节点
tag.next_sibling
- 返回下一个同级节点
- 无则返回None
tag.previous_sibling
- 返回上一个同级节点
- 无则返回None
tag.next_siblings
- 递归返回下一级所有节点
tag.previous_siblings
- 递归返回上一级所有节点
- 实际文档中的tag的 .next_sibling 和 .previous_sibling 属性通常是字符串或空白
tag.previous_element
- 返回上一个解析对象
tag.next_element
- 返回下一个解析对象
tag.previous_elements
- 循环返回上一个解析对象
tag.next_elements
- 循环返回下一个解析对象
方法

tag.get_text()
- 获取所有字符串内容包括子孙tag中的内容
find_parents( name , attrs , recursive , string , **kwargs )

params : 同soup.find_all()参数

return : list

find_parent( name , attrs , recursive , string , **kwargs )

params : 同soup.find()参数

find_next_siblings( name , attrs , recursive , string , **kwargs )

params : 同soup.find_all()参数

return : list

find_next_sibling( name , attrs , recursive , string , **kwargs )

params : 同soup.find()参数

find_previous_siblings( name , attrs , recursive , string , **kwargs )

params : 同soup.find_all()参数

return : list

find_previous_sibling( name , attrs , recursive , string , **kwargs )

params : 同soup.find()参数

find_all_next( name , attrs , recursive , string , **kwargs )

params : 同soup.find_all()参数

return : list

find_next( name , attrs , recursive , string , **kwargs )

params : 同soup.find()参数

find_all_previous( name , attrs , recursive , string , **kwargs )

params : 同soup.find_all()参数

return : list

find_previous( name , attrs , recursive , string , **kwargs )

params : 同soup.find()参数

tag.get(str'<属性名>)
- 返回该节点的该属性值

多值属性

HTML 4定义了一系列可以包含多个值的属性.在HTML5中移除了一些,却增加更多.最常见的多值的属性是 class (一个tag可以有多个CSS的class). 还有一些属性 rel , rev , accept-charset , headers , accesskey . 在Beautiful Soup中多值属性的返回类型是list:
```
css_soup = BeautifulSoup('')
css_soup.p['class']
# ["body", "strikeout"]

css_soup = BeautifulSoup('')
css_soup.p['class']
# ["body"]
```
如果某个属性看起来好像有多个值,但在任何版本的HTML定义中都没有被定义为多值属性,那么Beautiful Soup会将这个属性作为字符串返回
```
id_soup = BeautifulSoup('')
id_soup.p['id']
# 'my id'
```
将tag转换成字符串时,多值属性会合并为一个值
```
rel_soup = BeautifulSoup('Back to the <a rel="index">homepage</a>')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# Back to the <a rel="index contents">homepage</a>
```

BeautifulSoup

BeautifulSoup 对象表示的是一个文档的全部内容.大部分时候,可以把它当作 Tag 对象

soup = BeautifulSoup(<字符串>，<解析库>)
过滤器

介绍 find_all() 方法前,先介绍一下过滤器的类型。这些过滤器贯穿整个搜索的API.过滤器可以被用在tag的name中,节点的属性中,字符串中或他们的混合中.

字符串

最简单的过滤器是字符串.在搜索方法中传入一个字符串参数,Beautiful Soup会查找与字符串完整匹配的内容,下面的例子用于查找文档中所有的标签:

soup.find_all('b') # [The Dormouse's story]

如果传入字节码参数,Beautiful Soup会当作UTF-8编码,可以传入一段Unicode 编码来避免Beautiful Soup解析编码出错

正则表达式

如果传入正则表达式作为参数,Beautiful Soup会通过正则表达式的 match() 来匹配内容.下面例子中找出所有以b开头的标签,这表示和标签都应该被找到:

import re for tag in soup.find_all(re.compile("^b")): print(tag.name) # body # b

列表

如果传入列表参数,Beautiful Soup会将与列表中任一元素匹配的内容返回.下面代码找到文档中所有标签和标签:

soup.find_all(["a", "b"]) # [The Dormouse's story, # <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, # <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

True

True 可以匹配任何值,下面代码查找到所有的tag,但是不会返回字符串节点

for tag in soup.find_all(True): print(tag.name) # html # head # title # body # p # b # p # a # a # a # p

方法

soup.find_all(name , attrs , recursive , string , **kwargs)

params :

name 参数可以搜索并返回所有名字为 name 的tag,字符串对象会被自动忽略掉。查找规则有以下几种:

str'<标签名>

返回该标签名的所有tag对象的列表

正则表达式：re.compile('')

字符串列表

返回标签匹配列表任意元素的tag对象

True

匹配所有Tag，但是不匹配字符串

自定义方法

attrs={'<属性名>：<属性值>'}

针对HTML5中的 data-* 属性等

recursive参数

调用tag的 find_all() 方法时,检索当前tag的所有子孙节点,如果只想搜索tag的直接子节点,可以使用参数 recursive=False

一段简单的文档:

<html> <head> <title> The Dormouse's story </title> </head> ...

是否使用 recursive 参数的搜索结果:

soup.html.find_all("title") # [<title>The Dormouse's story</title>] soup.html.find_all("title", recursive=False) # []

string参数

通过 string 参数可以搜索并返回文档中的字符串内容。查找规则有以下几种:

字符串

字符串列表

正则表达式

True

keyword参数

<属性名>= (返回包含该属性的tag对象)

str'<属性值>

re.compile('')

True

按CSS搜索

当属性名是class时，用<class_>来避免与Python保留字冲突

tag的 class 属性是多值属性 .按照CSS类名搜索tag时,可以分别搜索tag中的每个CSS类名

也可以通过CSS值完全匹配

可以多个参数同时使用

limit参数

限制匹配成功的tag的数量

return : 返回值为列表，元素为tag对象

soup.find()

params : 与soup.find_all()参数相同

相当于指定limit=1的find_all方法

return : 返回值是一个tag对象，无则返回None

soup.head.title 是 tag的名字方法的简写.这个简写的原理就是多次调用当前tag的 find() 方法

soup.head.title # <title>The Dormouse's story</title> soup.find("head").find("title") # <title>The Dormouse's story</title>

soup.select()

params :

str'<CSS选择器>

![CSS Selectors](Image/CSS Selectors.jpg)

soup.prettify()

返回html的格式化输出

soup.get_text()

获取该文档的所有文字内容

属性

soup.name

soup.name # u'[document]'

soup.content

soup对象本身包含子节点

Comment

特殊的NavaigableString，该对象的string值是注释

备注

BeautifulSoup 对象和 tag 对象可以被当作一个方法来使用

soup.find_all("a") soup("a")

soup.title.find_all(string=True) soup.title(string=True)

子节点

一个Tag可能包含多个字符串或其它的Tag,这些都是这个Tag的子节点

使用方法

基本使用

string

点语法标签名字

标签名

查看

tag.<标签名>

通过点只能获取该标签名的第一个标签

修改

tag.<标签名>=str'

属性值

查看

tag[<属性名>]

单指属性则返回字符串，多值属性返回字符串列表

tag.attrs

返回{<属性名>:<属性值>}字典

修改

tag[<属性名>]=<属性值>

传入字符串时，该字符串作为单值属性；传入字符串列表时，拼接字符串元素作为单属性值

删除

del tag[<属性名>]

字符串

查看

tag.string

tag.get_text()

Navigable

Comment

修改

tag.string.replace_with(<字符串>)

Navigable.replace_with(<字符串>)

多节点解析

find

find_all

select

CSS选择器

select_one

取字符串

get_text()

可以取出多层标签的所有字符串

取属性值

get('属性名称')

bs4，xpath，正则三种解析方法对比

难度

bs4最简单，xpath较简单，正则较难，

效率

正则速度最快，xpath其次，bs4较慢

中文官方文档

直接使用css中的选择器就可以解析HTML和XML

http://www.w3.org/TR/CSS2/selector.html

3. 数据存储

JSON

JSON (JavaScript Object Notation) http://json.org is a subset of JavaScript syntax (ECMA-262 3rd edition) used as a lightweight data interchange format.

XML

JSON是轻量级的数据交互格式，XML是重量级

数据交互格式

简单理解就是一个字典或者列表

书写格式

1.不能写注释

2.key:value

必须是双引号

3.末尾不能写逗号

4.整个文件有且仅有一个[ ]

读写操作

读取JSON

loads(s, *, encoding=None, cls=None, object_hook=None, parse_float=None,parse_int=None, parse_constant=None, object_pairs_hook=None, **kw):

Deserialize s (a str, bytes or bytearray instance containing a JSON document) to a Python object

with open('data.json', 'w') as file: data = json.loads(file.read())

load(fp, *, cls=None, object_hook=None, parse_float=None, parse_int=None, parse_constant=None, object_pairs_hook=None, **kw)

Deserialize fp (a .read()-supporting file-like object containing a JSON document) to a Python object.

输出JSON

dumps(obj, *, skipkeys=False, ensure_ascii=True, check_circular=True,allow_nan=True, cls=None, indent=None, separators=None,default=None, sort_keys=False, **kw)

Serialize obj to a JSON formatted str.

args :

obj :

ensure_ascii：

If ensure_ascii is false, then the return value can contain non-ASCII　characters if they appear in strings contained in obj. Otherwise, all　such characters are escaped in JSON strings.

indent :

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation.

seperators :

If specified, separators should be an (item_separator, key_separator) tuple. The default is (', ', ': ') if indent is None and (',', ': ') otherwise. To get the most compact JSON representation, you should specify (',', ':') to eliminate whitespace.

data = [ { 'name': '王伟'， 'gender': '男'. 'birthday': '1992-10-18' } ] with open('data.json', 'w') as file: file.write(json.dumps(data))

dump(obj, fp, *, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, cls=None, indent=None, separators=None, default=None, sort_keys=False, **kw)

Serialize obj as a JSON formatted stream to fp (a .write()-supporting file-like object)

CSV

csv.writer(csvfile, dialect='excel', **kwargs)

params :

csvfile : The csvfile argument can be any object that supports the file API.

kwargs :

delimiter : A one-character string used to separate fields. It defaults to ','.

lineterminator : The string used to terminate lines produced by the writer. It defaults to '\r\n'.

return : Return a reader object which will iterate over lines in the given csvfile.

class csv.DictWriter(csvfile, fieldnames, restval='', extrasaction='raise', dialect='excel', *args, **kwds)

Create an object that operates like a regular reader but maps the information in each row to a dict whose keys are given by the optional fieldnames parameter.

params :

csvfile : The csvfile argument can be any object that supports the file API.

fieldnames : a sequence of keys that identify the order in which values in the dictionary passed to the writerow() method are written to file f.

return : Return a reader object which will iterate over lines in the given csvfile.

class csv.DictReader(csvfile, fieldnames=None, restkey=None, restval=None, dialect='excel', *args, **kwds)

Create an object that operates like a regular reader but maps the information in each row to a dict whose keys are given by the optional fieldnames parameter.

import json import csv # list.json # [{"name": "\u5f20\u4e09", "age": 20}, {"name": "\u674e\u56db", "age": 18}] # 1.创建文件对象 json_fp = open('list.json', 'r') csv_fp = open('1.csv', 'w', encoding='utf-8') # 2.获取表 data_json = json.load(json_fp) sheet_title = data_json[0].keys() sheet_data = [] for data in data_json: sheet_data.append(data.values()) # 3. CSV写入器 writer = csv.writer(csv_fp) # 4. 写入表 writer.writerow(sheet_title) writer.writerows(sheet_data) # 5.关闭文件对象 json_fp.close() csv_fp.close()

读写操作

Writer Object

Writer objects ( class DictWriter instances and objects returned by the writer function) have the following public methods

writerow(row)

params :

row : A row must be an iterable of strings or numbers for Writer objects and a dictionary mapping fieldnames to strings or numbers for DictWriter objects.

writerows(rows)

params :

rows : a list of row

DictWriter.writeheader()

Write a row with the field names (as specified in the constructor) to the writer’s file object. Return the return value of the csvwriter.writerow() call used internally.

import csv with open('names.csv', 'w', newline='') as csvfile: # 字段名 fieldnames = ['first_name', 'last_name'] writer = csv.DictWriter(csvfile, fieldnames=fieldnames) writer.writeheader() writer.writerow({'first_name': 'Baked', 'last_name': 'Beans'}) writer.writerow({'first_name': 'Lovely', 'last_name': 'Spam'}) writer.writerow({'first_name': 'Wonderful', 'last_name': 'Spam'})

MongoDB

安装Python库

pip install pymongo

最先安装

直接参考官方网站

端口号

27017

非关系型数据库

启动数据库

mongodb的客户端

mango

用户登录

1.use dbname

2.db.auth('','password')

创建用户

db.createUser({'user':'','pwd':'',roles:[str']})

mongodb的服务端

sudo mongodb

sudo service mongodb start

关闭：sudo service mongodb stop

三个警告

root权限太大了

设置权限的开启

数据库的操作

集合操作

数据库能存储的类型

文档内容操作

增

删

改

基本查询

db.集合名字.find()

查询所有数据

db.集合名字.find({条件})

比较运算

逻辑运算符

范围运算符

正则表达式

定义函数

查询结果显示

复合查询

db.createUser({user:'A',pwd:'123456',roles:['root']})

索引查询

备份和恢复

mongo和Python的交互

redis

python模块redis使用

1.安装模块

pip install redis

2.导入模块

import redis

3.连接数据库

redis.StrictRedis(host='localhost', port=6379, db=0)

host:数据库服务端的IP地址

port:6379

db->int:选择使用的数据库，整型数0-15

r = redis.StrictRedis(host='localhost', port=6379, db=0)

4.建立连接池(可选步骤)

redis-py 使用 connection pool 来管理对一个 redis server 的所有连接，避免每次建立、释放连接的开销。

默认，每个Redis实例都会维护一个自己的连接池。可以直接建立一个连接池，然后作为参数 Redis，这样就可以实现多个 Redis 实例共享一个连接池。

import redis # 导入redis 模块 # decode_response参数用于输出中文 pool = redis.ConnectionPool(host='localhost', port=6379, decode_responses=True, db=15) r = redis.Redis(connection_pool=pool)

string操作

增加操作

set(keyname, value, ex=None, px=None, nx=False, xx=False)

ex - 过期时间（秒）

px - 过期时间（毫秒）

nx - 如果设置为True，则只有name不存在时，当前set操作才执行

xx - 如果设置为True，则只有name存在时，当前set操作才执行

# key是"food" value是"mutton" 将键值对存入redis缓存 r.set('food', 'mutton', ex=3) print(r.get('food')) # mutton 取出键food对应的值

mset(mapping)

args :

mapping : a dictionary of key/value pairs. Both keys and values should be strings or types that can be cast to a string via str()

# 写法一 r.mget({'k1': 'v1', 'k2': 'v2'}) # 写法二 r.mset(k1="v1", k2="v2")

查看操作

mget(keys, *args)

return : list of values at the specified keys.

# k1,k2都是键名 print(r.mget('k1', 'k2')) print(r.mget(['k1', 'k2']))

get(keyname)

args:

keyname->str : keyname

return : keyname对应的字符串，key不存在时返回nil

hash操作

增加

hset(name, key=None, value=None, mapping=None)

name->str：keyname，相当于Python中的字典变量名

key->str：fieldname，相当于Python中字典中的键名

value->str：value，相当于Python中字典中的键值

mapping->dict : 键值对字典

r.hset("hash1", "k1", "v1")

hmset(name, mapping)

name->str : keyname

mapping->dict : 字典，如：{'k1':'v1', 'k2': 'v2'}

r.hmset("hash2", {"k2": "v2", "k3": "v3"})

查看

hget(name, key)

name->str : keyname

key->str : fieldname

hmget(name, keys, *args)

name->str : keyname

fields->str : 多个键名，或者键名列表

# 批量取出"hash2"的key-k2 k3对应的value # 方式1 print(r.hmget("hash2", "k2", "k3")) # 方式2 print(r.hmget("hash2", ["k2", "k3"]))

hgetall(name)

args :

name : keyname

return : Return a Python dict of the hash's name/value pairs

hlen(name)

args :

name : keyname

return : Return the number of elements in hash name

hkeys(name)

args :

name : keyname

return :

all field names in the hash stored at name

hvals(name)

args :

name : keyname

return :

all values in the hash stored at name

删除

hdel(name,*fields)

args :

name : keyname

fields : one field name or field names for deleting values

return :

the number of fields that were removed from the hash, not including specified but non existing fields

list操作

增加

lpush(name,elements)

在name对应的list中添加元素，每个新的元素都添加到列表的最左边

args :

name : keyname

elements : multi values

return :

the length of the list after the push operations.

rpush(name,elements)

在name对应的list中添加元素，每个新的元素都添加到列表的最右边

args :

name : keyname

elements : multi values

return : the length of the list after the push operations.

查看

lrange(keyname, start, stop)

像Python中的可迭代对象一样可以进行正反向索引，正向起始索引是0，反向起始索引是-1。

Note that if you have a list of numbers from 0 to 100, LRANGE list 0 10 will return 11 elements, that is, the rightmost item is included

args :

keyname->str : keyname

start->int : 起始位置

stop->int : 末尾位置

return : list of elements in the specified range.

llen(name)

params :

name : listname

return : the length of the list name

删除

lrem(name, count, value)

Remove the first count occurrences of elements equal to value from the list stored at name

params :

name : listname

value : value

count->int :

count > 0: Remove elements equal to value moving from head to tail

count < 0: Remove elements equal to value moving from tail to head

count = 0: Remove all elements equal to value

set操作

增加

sadd(name, *values)

params :

name : setname

names : values

查看

smembers(name)

params :

name : setname

Return : all members of the set name

删除

srem(name, *values)

Remove values from set name

params :

name : setname

values : values

成员

sismember(name, value)

params :

name : setname

value : value

Return : a boolean indicating if value is a member of set name

scard(name)

params :

name : setname

return : the number of elements in set name

MySql

ER模型

E

实体

表

R

关系

表表关系

1对1

1对多

多对多

三范式

1.列不可拆分

2.唯一标志

3.引用主键

数据的完整性

字段类型

int

decimal

数字

char

varchar

text

字符串

datatime

日期

bit

布尔

主键

primary key

非空

not null

唯一

unique

默认

default

外键

foreign key

逻辑删除

物理删除

从磁盘删除

重要删除

删除标识

0

1

数据库的启动

mysql.server start

mysql.server stop

客户端

mysql -u root -p

输入密码

查看数据库版本

select version();

显示当前时间

select now();

事务

四大特性ACID

A

原子性

C

一致性

I

隔离性

D

持久性

mysql的表要求类型

innodb

ddb

开启事务

begin

提交事务

commit

事务回滚

rollback

查询效率问题

索引查询

数据类型越小越好

能用整型最好

不适用NULL用0，“”，等特殊符号代替

创建索引

create index index on table(())

查看所有索引

show index from table

删除索引

drop index index on table

验证时间

1.set profiling=1

2.内容查询内容

Python和MySQL的交互

1.安装模块

pip install PyMySql

2.导入模块

import PyMySql

3.连接数据库

# 1.链接数据库连接对象connection db = pymysql.connect( host='localhost', port=3306, db='company', user='root', password='123456', charset='utf8' )

4.创建cursor对象

cursor = db.cursor()

5.将MySql语句写成字符串

insert_sub = 'insert into emp values("0007","陈强","男",27,"销售",3000,"市场部")'

6.执行操作

# 不会真正执行操作，提交事务时才会真正执行操作 cursor.execute(insert_sub)

7.提交事务

# 使用异常检测保证数据的一致性 try: cursor.execute(insert_sub) # 真正实现MySql语句 db.commit() except: # 事务回滚功能 db.rollback()

8.取出执行结果

# 取出所有行数据，返回一个元组，元组中的每一个元素是一行数据(元组形式) result = cursor.fetchall() # 将一行数据以元组的形式返回 result = cursor.fetchone()

9.关闭cursor连接

cursor.close()

10.关闭数据库连接

db.close()

MySql操作

提交事务

sql = '' try: cursor.execute(sql) db.commit() except: db.rollback()

插入数据

data = { 'id': '20122001', 'name': 'Bob', 'age': 20 } tablename = 'Students' keys = ','.join(data.keys()) specifier = ','.join([%s] * len(data)) sql = f'Insert into {tablename}({keys}) values ({specifier})' try: if cursor.execute(sql, tuple(data.values())): db.commit() except: db.rollback()

更新数据

data = { 'id': '20122001', 'name': 'Bob', 'age': 20 } tablename = 'Students' keys = ','.join(data.keys()) specifier = ','.join([%s] * len(data)) sql = f'Insert into {tablename}({keys}) values ({specifier}) on duplicate key update' # update = ','.join(["{key} = %s".format(key=key) for key in data]) update = ','.join([f"{key} = %s" for key in data]) sql += update try: if cursor.execute(sql, tuple(data.values())): db.commit() except: db.rollback()

MySQL的备份和恢复

备份

mysqldump -u root -p db_name table_name > 备份文件的绝对路径

恢复

mysql -u root -p db_name < 备份文件的绝对路径

Files

爬虫入门.md

Latest commit

History

爬虫入门.md

File metadata and controls

1. 数据请求

urllib模块

导入模块

发送请求

获取数据

发送url带参数的请求

发送带请求头的请求

用代理IP发送请求

向内网发送请求

requests模块

安装Python库

导入Python库

Sending requests

Passing Parameters In URLs

Binary Response Content

JSON Response Content

proxy

auth认证

ssl认证

cookies

Session Object

response响应对象的其它属性或方法

python-requests官方文档

2. 数据解析

正则

xpath(lxml)

语法

使用

bs4

安装Python模块

导入Python库

解析器

使用方法

对象的种类

NavigableString

Tag

BeautifulSoup

字符串

正则表达式

列表

True

Comment

bs4，xpath，正则 三种解析方法对比

中文官方文档

3. 数据存储

JSON

XML

数据交互格式

读写操作

CSV

读写操作

MongoDB

安装Python库

redis

python模块redis使用

1.安装模块

2.导入模块

3.连接数据库

4.建立连接池(可选步骤)

string操作

增加操作

查看操作

hash操作

增加

查看

删除

list操作

增加

查看

删除

set操作

增加

查看

删除

bs4，xpath，正则三种解析方法对比