A Detailed Guide to Python's re Module

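## The re module
Many programming languages ship a standard library for regular expressions, and Python is no exception. In Python, regular-expression matching is done with the re module. This article walks through how to use it.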
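### Module-level functions or a compiled object
The re module exposes its regex operations as module-level functions, and the same operations are available as methods of the object returned by compile(). The only difference is the first argument: a module-level function takes the pattern first, while a method of the compiled object (regexobj below) omits it, because the pattern was already supplied to compile(). The behaviour is otherwise the same.
Operations that can be called both on re and on regexobj (shown here in their regexobj form):
- findall(string): returns a list of all matched strings
- finditer(string): returns an iterator that yields a match object for every match
- split(string): splits the string at each match of the pattern
- sub(repl, string, count): replaces matches with repl and returns the new string
- subn(repl, string, count): same as sub, but returns a tuple of the new string and the number of substitutions
- match(string): returns a match object only if the pattern matches at the beginning of the string
- search(string): like match, but the match may start anywhere in the string

The sections below describe each of these functions in turn.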
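To make the equivalence concrete, here is a minimal sketch that runs the same search both ways. The sample text and pattern are made up for illustration.

```python
import re

text = "cat bat rat"

# Module-level call: the pattern is the first argument.
direct = re.findall(r"[cb]at", text)

# Compiled object: the pattern was fixed at compile time, so it is omitted.
regex = re.compile(r"[cb]at")
compiled = regex.findall(text)

print(direct)    # ['cat', 'bat']
print(compiled)  # ['cat', 'bat']
```

Compiling is mainly worthwhile when the same pattern is reused many times, since the compiled object can be stored and passed around.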
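#### compile(pattern, flags=0)
Use compile() to compile a regex.
- It compiles the regular expression pattern (flags is an optional flag argument) and returns a regex object.

A regex object has two useful attributes, groups and groupindex. groups is the number of capturing groups in the pattern; groupindex maps the name of each named group to its group number.
```
prog = re.compile(pattern)
result = prog.findall(string)
```
Example:
```
import re
p = re.compile('ab*')
```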
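The two attributes of a compiled regex can be inspected directly. The sketch below uses a hypothetical pattern with one unnamed group and one named group (`year`); neither the pattern nor the sample text comes from the original article.

```python
import re

# Hypothetical pattern: group 1 is unnamed, group 2 is named "year".
regex = re.compile(r"(\w+)-(?P<year>\d{4})")

group_count = regex.groups            # number of capturing groups
name_map = dict(regex.groupindex)     # group name -> group number
pairs = regex.findall("report-2021 draft-2022")

print(group_count)  # 2
print(name_map)     # {'year': 2}
print(pairs)        # [('report', '2021'), ('draft', '2022')]
```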
* findall(pattern, string[, flags])
Finds all non-overlapping occurrences of pattern in string and returns them as a list of strings (or a list of tuples when the pattern contains more than one group).
- findall() ==always returns a list==. If nothing matches, the list is empty; otherwise it contains every successful match.
```
import re
re.findall('car', 'car')    # ['car']
re.findall('car', 'scary')  # ['car']
re.findall('car', 'carry the barcardi to the car') # ['car', 'car', 'car']
pattern = re.compile('car')
pattern.findall('carry the barcardi to the car', 13) # ['car', 'car']
pattern.findall('carry the barcardi to the car', 13, 16) # ['car']
s = 'this and that'
re.findall(r'th\w+ and th\w+', s, re.I) # ['this and that']
re.findall(r'(th\w+)', s, re.I) # ['this', 'that']
re.findall(r'(th\w+) and (th\w+)', s, re.I) # [('this', 'that')]
re.findall(r'(th\w+) (and) (th\w+)', s, re.I) # [('this', 'and', 'that')]
```
* finditer(pattern, string[, flags])
Same as findall(), but returns an iterator instead of a list; for each match the iterator yields a match object.
```
import re
s = 'this and that'
m = next(re.finditer(r'(th\w+) and (th\w+)', s, re.I))  # first match object
m.groups()  # ('this', 'that')
m.group()   # 'this and that'
m.group(1)  # 'this'
m.group(2)  # 'that'
[g.groups() for g in re.finditer(r'(th\w+) and (th\w+)', s, re.I)] # [('this', 'that')]
```
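Because finditer() yields match objects rather than strings, it is the natural choice when the position of each match matters. A small sketch, reusing the 'car' sample string from the findall() examples:

```python
import re

text = "carry the barcardi to the car"

# Each item produced by finditer() is a match object, so positions are available.
positions = [(m.group(), m.start(), m.end()) for m in re.finditer("car", text)]

print(positions)  # [('car', 0, 3), ('car', 13, 16), ('car', 26, 29)]
```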
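* split(pattern, string, maxsplit=0)
Splits string into a list wherever pattern matches and returns the resulting pieces. At most maxsplit splits are made (by default the string is split at every match).
- If the separator is a plain string that uses no special regex symbols, re.split() works the same way as str.split():
```
re.split(':', 'str1:str2:str3')
# ['str1', 'str2', 'str3']
```
```
import re
DATA = (
    'Mountain View, CA 94040',
    'Sunnyvale, CA',
    'Los Altos, 94023',
    'Cupertino 95014',
    'Palo Alto CA',
)
for datum in DATA:
    # The trailing space in the pattern consumes the separator space itself.
    print(re.split(r', |(?= (?:\d{5}|[A-Z]{2})) ', datum))
# first line printed: ['Mountain View', 'CA', '94040']
```
The regex above has one simple component: it splits on a comma followed by a space. The harder part is the second alternative, which relies on the extension notations (?=...) and (?:...): it splits on a space only when that space is followed by a five-digit ZIP code or a two-letter uppercase state abbreviation.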
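Two details of re.split() are easy to miss: the maxsplit limit, and the fact that a capturing group in the pattern makes the separators appear in the result. The sketch below illustrates both on a made-up input string.

```python
import re

line = "a1b22c333d"

all_parts = re.split(r"\d+", line)                # split at every run of digits
limited = re.split(r"\d+", line, maxsplit=2)      # stop after two splits
with_seps = re.split(r"(\d+)", line)              # capturing group keeps separators

print(all_parts)  # ['a', 'b', 'c', 'd']
print(limited)    # ['a', 'b', 'c333d']
print(with_seps)  # ['a', '1', 'b', '22', 'c', '333', 'd']
```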
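* sub(pattern, repl, string, count=0)
Replaces every match of pattern in string with repl. If count is not given, all matches are replaced. (See also subn(), which additionally returns the number of substitutions.)
- Both functions replace all parts of a string that match the regular expression.
- The replacement is usually a string, but it can also be a function that returns the replacement string.
- subn() also reports the total number of substitutions: the new string and the count are returned together as a two-element tuple.
```
import re
re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')
# 'attn: Mr. Smith\n\nDear Mr. Smith,\n'
re.subn('X', 'Mr. Smith', 'attn: X\n\nDear X,\n')
# ('attn: Mr. Smith\n\nDear Mr. Smith,\n', 2)
print(re.sub('X', 'Mr. Smith', 'attn: X\n\nDear X,\n'))
# attn: Mr. Smith
#
# Dear Mr. Smith,
```
- Besides retrieving a group with the match object's group() method, you can refer to a group inside the replacement string as \N, where N is the group number. The code below converts the US date format MM/DD/YY{,YY} into the DD/MM/YY{,YY} format common in other countries.
```
re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})', r'\2/\1/\3', '2/20/91')    # '20/2/91'
re.sub(r'(\d{1,2})/(\d{1,2})/(\d{2}|\d{4})', r'\2/\1/\3', '2/20/1991')  # '20/2/1991'
```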
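The text above mentions that the replacement may be a function. Here is a minimal sketch of that form, together with the count argument; the helper `double` and the sample sentence are invented for illustration.

```python
import re

def double(match):
    # Hypothetical helper: receives a match object, returns the replacement text.
    return str(int(match.group()) * 2)

sentence = "3 apples and 10 pears"

doubled = re.sub(r"\d+", double, sentence)           # function as repl
first_only = re.sub(r"\d+", "N", sentence, count=1)  # replace only the first match

print(doubled)     # 6 apples and 20 pears
print(first_only)  # N apples and 10 pears
```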
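* match(pattern, string, flags=0)
Tries to match pattern against string; flags is an optional flag argument. On success it returns a match object, otherwise None. Note that match() only matches at the beginning of the string: if the pattern does not match there, the call fails. It also returns a single match, even if the pattern occurs several times in the string.
- It tries to match the pattern starting at the beginning of the string.
- On success it returns a match object; on failure it returns None.
- The match object's group() method returns the text that was matched.
```
result = re.match(pattern, string)
```
is equivalent to:
```
prog = re.compile(pattern)
result = prog.match(string)
```
Example:
```
import re
m = re.match('foo', 'food on the table')
if m is not None:
    m.group()  # returns the matched text 'foo'
```
==Note: the example below omits the if statement. If the match failed, the call would raise AttributeError, because None has no group() method.==
```
re.match('foo', 'food on the table').group()
```
```
pattern = re.compile('o')
pattern.match('doodle')     # None: 'doodle' does not start with 'o'
pattern.match('doodle', 1)  # matches 'o' at index 1
pattern.match('doodle', 2)  # matches 'o' at index 2
```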
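One common way to avoid the AttributeError described above is to wrap the None check in a small function. This is only a sketch; the function name `first_word` and its behaviour are assumptions made for the example.

```python
import re

def first_word(text):
    """Return the leading run of word characters, or None if the text does not start with one."""
    m = re.match(r"\w+", text)
    return m.group() if m is not None else None

print(first_word("food on the table"))  # food
print(first_word("  food"))             # None (the string starts with spaces)
```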
* search(pattern, string, flags=0)
Looks for the first occurrence of pattern in string; flags is an optional flag argument. On success it returns a match object, otherwise None. This looks the same as match(); the only difference is that search() can match at any position in the string.
```
import re
m = re.match('foo', 'seafood')  # fails: 'seafood' does not start with 'foo'
if m is not None:
    m.group()
```
```
import re
m = re.search('foo', 'food on the table')
if m is not None:
    m.group()  # 'foo'
```
```
pattern = re.compile('d')
pattern.search('dog')           # match at index 0
pattern.search('dog', 1)        # None: no 'd' at or after index 1
pattern.search('doodle', 1, 4)  # match at index 3 (only 'ood' is searched)
bt = 'bat | bet | bit'          # note: the spaces are literal parts of each alternative
m = re.match(bt, 'bat man')     # matches 'bat '
# m = re.match(bt, 'he bit me!')   # no match
# m = re.search(bt, 'he bit me!')  # match
# m = re.match(bt, 'batman')       # no match
# m = re.search(bt, 'batman')      # no match
if m is not None:
    m.group()
    # m.group(0)  # group 0 is the entire match, same as m.group()
```
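#### About the flags
Each flag in the re module is available under two names: a long name such as IGNORECASE and a short, one-letter form such as I. Multiple flags can be combined by bitwise OR-ing them; for example, re.I | re.M sets both the I and M flags.

| Flag | Meaning |
| --- | --- |
| DOTALL, S | Makes . match any character, including a newline |
| IGNORECASE, I | Makes matching case-insensitive |
| MULTILINE, M | Multi-line matching; affects ^ and $ |
| VERBOSE, X | Enables verbose REs, which can be laid out more clearly and readably |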
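- I, IGNORECASE: makes matching case-insensitive; character classes and literal strings ignore case when matching letters. For example, [A-Z] also matches lowercase letters, and Spam matches Spam, spam, or spAM. This lowercasing does not take the current locale into account.
- M, MULTILINE: normally ^ matches only at the beginning of the string, and $ matches only at the end of the string and immediately before the newline (if any) at the end of the string. With this flag, ^ matches at the beginning of the string and at the beginning of every line within it. Likewise, $ matches at the end of the string and at the end of every line (immediately before each newline).
- S, DOTALL: makes the . special character match any character at all, including a newline; without this flag, . matches anything except a newline.
- X, VERBOSE: gives you a more flexible format so that regular expressions can be written more readably. With this flag, whitespace in the RE string is ignored unless it is inside a character class or preceded by a backslash, which lets you organise and indent the RE clearly. It also lets you put comments inside the RE, which the engine ignores; a comment starts with #, unless the # is inside a character class or preceded by a backslash.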
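The effect of each flag is easiest to see side by side. The following sketch applies them to a made-up three-line string; the variable names are illustrative only.

```python
import re

text = "Spam\nspam\nSPAM"

# Without flags, ^ and $ anchor only at the ends of the whole string.
plain = re.findall(r"^spam$", text)
# MULTILINE lets ^ and $ anchor at every line.
multi = re.findall(r"^spam$", text, re.M)
# Flags are combined with bitwise OR.
both = re.findall(r"^spam$", text, re.I | re.M)
# DOTALL lets . match the newline between the first two lines.
dotall = re.findall(r"Spam.spam", text, re.S)

# VERBOSE allows whitespace and comments inside the pattern.
number = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d+   # fractional part
""", re.X)
decimals = number.findall("pi is 3.14, e is 2.718")

print(plain)     # []
print(multi)     # ['spam']
print(both)      # ['Spam', 'spam', 'SPAM']
print(dotall)    # ['Spam\nspam']
print(decimals)  # ['3.14', '2.718']
```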
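### Using match objects
#### Attributes
pos: the index in the target string at which the search started
endpos: the index in the target string at which the search stopped
lastgroup: the name of the last matched group
lastindex: the number of the last matched group (counting from 1)
```
import re
regex = re.compile('(ab)cd(?P<name>ef)')
match_obj = regex.search('abcdefg')
match_obj.lastgroup  # 'name'
match_obj.lastindex  # 2
```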
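#### Methods
* start()
Returns the start index of the matched text.
* end()
Returns the end index of the matched text (the index just past its last character).
* span()
Returns the start and end indexes of the matched text as a tuple.
* group(num=0)
Returns the entire match (or the subgroup numbered num).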
```
import re
m = re.match(r'(\w\w\w)-(\d\d\d)', 'abc-123')
print(m.group())   # entire match: 'abc-123'
print(m.group(1))  # subgroup 1: 'abc'
print(m.group(2))  # subgroup 2: '123'
print(m.groups())  # all subgroups: ('abc', '123')
```
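* groups()
Returns a tuple containing all matched subgroups (an empty tuple if the pattern defines no groups).
```
import re
m = re.match('ab', 'ab')      # no groups
m.group()    # 'ab'
m.groups()   # ()
m = re.match('(ab)', 'ab')    # one group
m.group()    # 'ab'
m.group(1)   # 'ab'
m.groups()   # ('ab',)
m = re.match('(a)(b)', 'ab')  # two groups
m.group()    # 'ab'
m.group(1)   # 'a'
m.group(2)   # 'b'
m.groups()   # ('a', 'b')
```
```
patt1 = r'^(\w){3}'   # group 1 keeps only the last of the three characters
patt2 = r'^(\w\w\w)'  # group 1 holds all three characters
patt3 = r'^(\w{3})'   # same result as patt2
m = re.match(patt2, 'wed 123')
m.group(1)  # 'wed'
```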
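* groupdict()
Returns a dictionary of all named subgroups that matched (the groups must have names; each name is a key and the matched text is its value).
```
import re
regex = re.compile('(ab)cd(?P<name>ef)')
match_obj = regex.search('abcdefg')
match_obj.groupdict()
# {'name': 'ef'}
```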
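The match-object attributes and methods described in this section can be tried together on one match. The pattern, the group names (`word`, `num`), and the input string below are assumptions made for the sketch.

```python
import re

regex = re.compile(r"(?P<word>[a-z]+)-(?P<num>\d+)")
# Start searching at index 2 so that pos differs from 0.
m = regex.search("id: abc-123;", 2)

print(m.group())                  # abc-123
print(m.start(), m.end())         # 4 11
print(m.span())                   # (4, 11)
print(m.span("num"))              # (8, 11): span of a single group
print(m.pos, m.endpos)            # 2 12
print(m.lastgroup, m.lastindex)   # num 2
print(m.groupdict())              # {'word': 'abc', 'num': '123'}
```

Slicing the original string with m.span() gives back the same text as m.group(), which is a quick way to check the indexes.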
