Python re module

2018-06-18 00:30:12  Source: unknown

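1. Single-character matching

Character	Function
.	Matches any single character except \n
[ ]	Matches any one of the characters listed inside [ ]
\d	Matches a digit, i.e. 0-9 (str patterns also match other Unicode decimal digits unless re.A is set)
\D	Matches any character that is not a digit
\s	Matches whitespace: space, tab, \n, \r, \f, \v
\S	Matches any non-whitespace character
\w	Matches a word character, i.e. a-z, A-Z, 0-9, _ (str patterns also match Unicode letters and digits unless re.A is set)
\W	Matches any non-word character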
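
The character classes in the table can be tried out with a short sketch like the one below. The sample string and the variable names are made up for illustration.

```python
import re

text = "Room 42_b\tis open\n"

# \d matches digits
digits = re.findall(r"\d", text)        # ['4', '2']
# [ ] matches any single character listed inside the brackets
vowels = re.findall(r"[aeiou]", text)   # ['o', 'o', 'i', 'o', 'e']
# \s matches whitespace such as space, tab and newline
spaces = re.findall(r"\s", text)        # [' ', '\t', ' ', '\n']
# \w matches word characters (letters, digits, underscore)
words = re.findall(r"\w+", text)        # ['Room', '42_b', 'is', 'open']
# . matches any character except a newline
dots = re.findall(r".", "a\nb")         # ['a', 'b']

print(digits, vowels, spaces, words, dots)
```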

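2. Quantifiers

Character	Function
*	Matches the preceding element 0 or more times, i.e. it is optional and may repeat
+	Matches the preceding element 1 or more times, i.e. at least once
?	Matches the preceding element 0 or 1 times, i.e. it is optional
{m}	Matches the preceding element exactly m times
{m,}	Matches the preceding element at least m times
{m,n}	Matches the preceding element from m to n times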
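
As a rough illustration of the quantifiers, the sketch below checks which of a few sample strings each pattern matches in full. The helper `matching` and the sample list are hypothetical names introduced only for this example.

```python
import re

samples = ["ct", "cat", "caat", "caaat"]

def matching(pattern):
    """Return the samples that the pattern matches from start to end."""
    return [s for s in samples if re.fullmatch(pattern, s)]

star = matching(r"ca*t")         # ['ct', 'cat', 'caat', 'caaat']
plus = matching(r"ca+t")         # ['cat', 'caat', 'caaat']
optional = matching(r"ca?t")     # ['ct', 'cat']
exact = matching(r"ca{2}t")      # ['caat']
at_least = matching(r"ca{2,}t")  # ['caat', 'caaat']
between = matching(r"ca{1,2}t")  # ['cat', 'caat']

print(star, plus, optional, exact, at_least, between)
```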

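3. Boundaries

Character	Function
^	Matches the start of the string
$	Matches the end of the string (or just before a trailing \n)
\b	Matches at a word boundary
\B	Matches anywhere that is not a word boundary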
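
The boundary markers match positions rather than characters. A small sketch, using a made-up sample string, that records where each pattern matches:

```python
import re

text = "cat concat cats"

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

at_start = starts(r"^cat")        # [0]  only the very beginning of the string
at_end = starts(r"cats$")         # [11] only the very end of the string
whole_word = starts(r"\bcat\b")   # [0]  'cat' as a standalone word
word_tail = starts(r"\Bcat\b")    # [7]  'cat' at the end of 'concat'
word_head = starts(r"\bcat\B")    # [11] 'cat' at the start of 'cats'

print(at_start, at_end, whole_word, word_tail, word_head)
```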

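4. Grouping

Character	Function
|	Matches either the expression on its left or the one on its right
(ab)	Treats the characters inside the parentheses as one group
\num	Refers back to the string matched by group number num
(?P<name>...)	Gives the group the alias name
(?P=name)	Refers back to the string matched by the group named name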
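
The grouping syntax is easiest to see with paired tags, where the closing tag must repeat the opening one. This is only a sketch with invented sample strings, not a way to parse real HTML.

```python
import re

# | matches the expression on either side; ( ) limits its scope and captures the choice
alt = re.fullmatch(r"gr(a|e)y", "grey")
alt_choice = alt.group(1)                    # 'e'

# \1 refers back to whatever group 1 matched
numbered = r"<(\w+)>.*</\1>"
ok = re.fullmatch(numbered, "<b>bold</b>")
bad = re.fullmatch(numbered, "<b>bold</i>")  # None: closing tag differs

# (?P<name>...) names a group; (?P=name) refers back to it by name
named = re.fullmatch(r"<(?P<tag>\w+)>.*</(?P=tag)>", "<h1>title</h1>")
tag_name = named.group("tag")                # 'h1'

print(alt_choice, ok.group(1), bad, tag_name)
```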

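5. Flags

Flag	Description
re.I	Makes matching case-insensitive
re.L	Makes matching locale-aware (bytes patterns only in Python 3)
re.M	Multi-line matching; changes where ^ and $ match
re.S	Makes . match every character, including a newline
re.U	Interprets character classes according to Unicode. Affects \w, \W, \b, \B, \d, \D, \s, \S; this is already the default for str patterns in Python 3
re.X	Verbose mode: allows whitespace and comments inside the pattern so that it is easier to read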

A = <RegexFlag.ASCII: 256>
ASCII = <RegexFlag.ASCII: 256>
DOTALL = <RegexFlag.DOTALL: 16>
I = <RegexFlag.IGNORECASE: 2>
IGNORECASE = <RegexFlag.IGNORECASE: 2>
L = <RegexFlag.LOCALE: 4>
LOCALE = <RegexFlag.LOCALE: 4>
M = <RegexFlag.MULTILINE: 8>
MULTILINE = <RegexFlag.MULTILINE: 8>
S = <RegexFlag.DOTALL: 16>
U = <RegexFlag.UNICODE: 32>
UNICODE = <RegexFlag.UNICODE: 32>
VERBOSE = <RegexFlag.VERBOSE: 64>
X = <RegexFlag.VERBOSE: 64>
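
The listing shows that each flag has a short and a long name bound to the same value, and that the values are distinct bits. A minimal sketch of what that implies, with an invented sample string: the two names are interchangeable, and flags can be combined with `|`.

```python
import re

# The short and the long name refer to the same flag
same_flag = re.I is re.IGNORECASE     # True

# Flags are bit values, so they are combined with bitwise OR
combined = re.I | re.M
combined_value = int(combined)        # 2 + 8 == 10

# Case-insensitive and multi-line at the same time
lines = re.findall(r"^foo$", "FOO\nfoo\nbar", re.I | re.M)   # ['FOO', 'foo']

print(same_flag, combined_value, lines)
```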

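6. match method

match(pattern, string, flags=0)
        Try to apply the pattern at the start of the string, returning
        a match object, or None if no match was found.

Matching is attempted only at the start of the string.

Parameter	Description
pattern	The regular expression to match
string	The string to search
flags	Flags that control how the pattern is matched, e.g. case sensitivity or multi-line matching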

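import re

ret = re.match('com', 'www.baidu.com')
print(ret)                 # None: 'com' is not at the start of the string

ret = re.match('(w)(w)(w)', 'www.baidu.com')
print(type(ret))           # <class 're.Match'> (<class '_sre.SRE_Match'> before Python 3.7)

# The result is a Match object.
# Attributes:
#   string: the text that was matched against.
#   re: the Pattern object used for the match.
#   pos: index in the text where the search started; same as the pos argument of Pattern.match() and Pattern.search().
#   endpos: index in the text where the search stopped; same as the endpos argument of Pattern.match() and Pattern.search().
#   lastindex: number of the last capturing group that matched, or None if no group matched.
#   lastgroup: alias of the last capturing group that matched, or None if that group has no alias or no group matched.
# Methods:
#   group([group1, ...]):
#     Returns the string captured by one or more groups; with several arguments the result is a tuple. group1 may be a number or an alias; 0 stands for the whole match and is the default; a group that captured nothing gives None; a group that captured several times gives its last capture.
#   groups([default]):
#     Returns the strings captured by all groups as a tuple, like group(1, 2, ..., last). default replaces groups that captured nothing; it defaults to None.
#   groupdict([default]):
#     Returns a dict that maps each group alias to the string that group captured; groups without an alias are left out. default has the same meaning as above.
#   start([group]):
#     Returns the index in string where the group's capture starts (index of its first character). group defaults to 0.
#   end([group]):
#     Returns the index in string where the group's capture ends (index of its last character + 1). group defaults to 0.
#   span([group]):
#     Returns (start(group), end(group)).
#   expand(template):
#     Substitutes the captured groups into template and returns the result. template may refer to groups as \id, \g<id> or \g<name>. \0 is not a group reference; use \g<0> for the whole match. \id and \g<id> are equivalent, but \10 is read as group 10; to write group 1 followed by the character '0', use \g<1>0.

print(ret.lastgroup)       # None
print(ret.lastindex)       # 3
print(ret.pos)             # 0
print(ret.endpos)          # 13
print(ret.re)              # re.compile('(w)(w)(w)')
print(ret.regs)            # ((0, 3), (0, 1), (1, 2), (2, 3))
print(ret.string)          # www.baidu.com

print(ret.start())         # 0
print(ret.end())           # 3
print(ret.span())          # (0, 3)
print(ret.group())         # www
print(ret.groups())        # ('w', 'w', 'w')
print(ret.groupdict())     # {}
print(ret.expand(r'\1-\2-\3'))   # w-w-w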
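
With unnamed groups, lastgroup is None and groupdict() is empty. The sketch below is a variation on the same string with named groups, so that those members return something. The group names `sub`, `domain` and `tld` are made up for this example; it also passes pos to a compiled pattern's match().

```python
import re

pattern = re.compile(r"(?P<sub>\w+)\.(?P<domain>\w+)\.(?P<tld>\w+)")
m = pattern.match("www.baidu.com")

named = m.groupdict()            # {'sub': 'www', 'domain': 'baidu', 'tld': 'com'}
last_name = m.lastgroup          # 'tld'
last_number = m.lastindex        # 3
domain_span = m.span("domain")   # (4, 9)
expanded = m.expand(r"\g<tld>:\g<domain>")   # 'com:baidu'

# Pattern.match() accepts pos, which moves the point where matching is attempted
later = re.compile(r"baidu").match("www.baidu.com", 4)
later_info = (later.pos, later.endpos, later.span())   # (4, 13, (4, 9))

print(named, last_name, last_number, domain_span, expanded, later_info)
```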

7. search method

search(pattern, string, flags=0)
        Scan through string looking for a match to the pattern, returning
        a match object, or None if no match was found.

import re

ret = re.search('(b)(a)(i)', 'www.baidu.com')
print(type(ret))           # <class 're.Match'> (<class '_sre.SRE_Match'> before Python 3.7)
print(ret.group())         # bai
print(ret.groups())        # ('b', 'a', 'i')
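
The difference from match is where the pattern may begin. A short comparison on the same string (variable names are illustrative only):

```python
import re

text = "www.baidu.com"

at_start = re.match(r"baidu", text)    # None: 'baidu' is not at index 0
anywhere = re.search(r"baidu", text)   # found further into the string
found_span = anywhere.span()           # (4, 9)

# search stops at the first match it finds
first_word = re.search(r"\w+", text).group()   # 'www'

print(at_start, found_span, first_word)
```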

8. compile method

compile(pattern, flags=0)
        Compile a regular expression pattern, returning a pattern object.

import re

pattern = re.compile('(b)(a)(i)')
print(pattern, type(pattern))   # re.compile('(b)(a)(i)') <class 're.Pattern'>

ret = re.search(pattern, 'www.baidu.com')
print(type(ret))           # <class 're.Match'> (<class '_sre.SRE_Match'> before Python 3.7)
print(ret.group())         # bai
print(ret.groups())        # ('b', 'a', 'i')
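
A compiled pattern also has its own search, match, findall and similar methods, so it can be reused without passing it back to the module-level functions. A minimal sketch, assuming a made-up list of host names:

```python
import re

pattern = re.compile(r"b(a)i", re.I)
hosts = ["www.baidu.com", "WWW.BAIDU.COM", "www.sina.com"]

# Call the method on the Pattern object directly
hits = [h for h in hosts if pattern.search(h)]   # the first two hosts

source = pattern.pattern                      # 'b(a)i'
group_count = pattern.groups                  # 1
ignores_case = bool(pattern.flags & re.I)     # True

print(hits, source, group_count, ignores_case)
```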

9. findall method

findall(pattern, string, flags=0)
        Return a list of all non-overlapping matches in the string.
        
        If one or more capturing groups are present in the pattern, return
        a list of groups; this will be a list of tuples if the pattern
        has more than one group.
        Empty matches are included in the result.

In other words: with one group in the pattern the result is a list of that group's strings; with several groups it is a list of tuples.

import re

# 1. No groups
ret = re.findall('bai', 'www.baidu.com, www.baidu.com')
print(ret)           # ['bai', 'bai']

# 2. No match
ret = re.findall('bbidy', 'www.baidu.com, www.baidu.com')
print(ret)           # []

# 3. One group
ret = re.findall('b(a)i', 'www.baidu.com, www.baidu.com')
print(ret)           # ['a', 'a']

# 4. Several groups
ret = re.findall('(b)(a)(i)', 'www.baidu.com, www.baidu.com')
print(ret)           # [('b', 'a', 'i'), ('b', 'a', 'i')]
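
The docstring's remark that empty matches are included only shows up with patterns that can match nothing, such as `a*`. A small sketch with an invented sample string:

```python
import re

# a* can match the empty string, so every position without an 'a' yields ''
with_empty = re.findall(r"a*", "baac")    # ['', 'aa', '', '']

# a+ needs at least one 'a', so no empty strings appear
without_empty = re.findall(r"a+", "baac")   # ['aa']

print(with_empty, without_empty)
```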

10. split method

split(pattern, string, maxsplit=0, flags=0)
        Split the source string by the occurrences of the pattern,
        returning a list containing the resulting substrings.  If
        capturing parentheses are used in pattern, then the text of all
        groups in the pattern are also returned as part of the resulting
        list.  If maxsplit is nonzero, at most maxsplit splits occur,
        and the remainder of the string is returned as the final element
        of the list.

import re

# 1. No groups
ret = re.split('bai', 'www.baidu.com, www.baidu.com')
print(ret)    # ['www.', 'du.com, www.', 'du.com']

# 2. No match
ret = re.split('bbidy', 'www.baidu.com, www.baidu.com')
print(ret)    # ['www.baidu.com, www.baidu.com']

# 3. With a group: the captured text is included in the result
ret = re.split('b(a)i', 'www.baidu.com, www.baidu.com')
print(ret)    # ['www.', 'a', 'du.com, www.', 'a', 'du.com']

# 4. Limit the number of splits (maxsplit passed by keyword; positional use is deprecated since Python 3.13)
ret = re.split('bai', 'www.baidu.com, www.baidu.com', maxsplit=1)
print(ret)    # ['www.', 'du.com, www.baidu.com']

# 5. Limit the number of splits, with a group
ret = re.split('b(a)i', 'www.baidu.com, www.baidu.com', maxsplit=1)
print(ret)    # ['www.', 'a', 'du.com, www.baidu.com']

11. sub method

sub(pattern, repl, string, count=0, flags=0)
        Return the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in string by the
        replacement repl.  repl can be either a string or a callable;
        if a string, backslash escapes in it are processed.  If it is
        a callable, it's passed the match object and must return
        a replacement string to be used.

import re

# 1. Replace a string
ret = re.sub('baidu', 'sina', 'www.baidu.com')
print(ret)    # www.sina.com

# 2. No match, so nothing is replaced
ret = re.sub('bbidu', 'sina', 'www.baidu.com')
print(ret)    # www.baidu.com

# 3. Replace using groups
ret = re.sub('(bai)(du)', r'\2\1', 'www.baidu.com')
print(ret)    # www.dubai.com

# 4. Replace using a function
def handler(m):
    # m is the Match object of the occurrence being replaced
    return 'abc'

ret = re.sub('baidu', handler, 'www.baidu.com')
print(ret)    # www.abc.com
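
A callable repl becomes useful once it reads the match object it receives. The sketch below also shows the count parameter from the signature; the function name `double` and the sample sentence are made up for illustration.

```python
import re

def double(m):
    """Return the matched number multiplied by two, as a string."""
    return str(int(m.group()) * 2)

sentence = "3 apples and 12 pears"

# The callable is invoked once per match and decides the replacement
doubled = re.sub(r"\d+", double, sentence)        # '6 apples and 24 pears'

# count limits how many occurrences are replaced (0 means all)
first_only = re.sub(r"\d+", "N", sentence, count=1)   # 'N apples and 12 pears'

print(doubled, first_only)
```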

12. finditer method

finditer(pattern, string, flags=0)
        Return an iterator over all non-overlapping matches in the
        string.  For each match, the iterator returns a match object.
        
        Empty matches are included in the result.

import re

# 1. No groups
ret = re.finditer(r'ab', 'ababab')
print(ret)          # <callable_iterator object at 0x...>
print(type(ret))    # <class 'callable_iterator'>

for r in ret:
    print(r.group())    # ab (printed three times)


# 2. With a group
ret = re.finditer(r'(a)b', 'ababab')
print(ret)          # <callable_iterator object at 0x...>
print(type(ret))    # <class 'callable_iterator'>

for r in ret:
    print(r.group(), r.group(1))    # ab a (printed three times)
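
Because finditer yields match objects rather than plain strings, it gives access to positions, which findall does not. A brief sketch on the same sample string:

```python
import re

# Each item is a Match object, so span() is available
spans = [m.span() for m in re.finditer(r"ab", "ababab")]   # [(0, 2), (2, 4), (4, 6)]

# findall on the same input only returns the matched text
texts = re.findall(r"ab", "ababab")                        # ['ab', 'ab', 'ab']

print(spans, texts)
```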

13. subn method

subn(pattern, repl, string, count=0, flags=0)
        Return a 2-tuple containing (new_string, number).
        new_string is the string obtained by replacing the leftmost
        non-overlapping occurrences of the pattern in the source
        string by the replacement repl.  number is the number of
        substitutions that were made. repl can be either a string or a
        callable; if a string, backslash escapes in it are processed.
        If it is a callable, it's passed the match object and must
        return a replacement string to be used.

import re

# 1. Replace with a string
ret = re.subn(r'(\w+) (\w+)', r'\2 \1', 'i say, hello world!')
print(ret)     # ('say i, world hello!', 2)


# 2. Replace with a function
def func(m):
    return m.group(1).title() + ' ' + m.group(2).title()

ret = re.subn(r'(\w+) (\w+)', func, 'i say, hello world!')
print(ret)     # ('I Say, Hello World!', 2)

14. escape method

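escape(pattern)
        Escape special characters in a string.

A utility function that escapes every character in the string that could be interpreted as a regular-expression operator. Up to Python 3.6 it escaped everything except ASCII letters, digits and '_'; since Python 3.7 it escapes only characters that have a special meaning in a pattern.

import re

ret = re.escape('ab12_!@#$%^&*(){}[]')
print(ret)     # ab12_!@\#\$%\^\&\*\(\)\{\}\[\]   (Python 3.7+)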
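
A typical use of escape is searching for text that should be taken literally, for example user input. The following is a sketch with an invented search string, not a complete input-handling routine.

```python
import re

literal = "1+1=2 (really?)"
haystack = "proof: 1+1=2 (really?) yes"

# Used as a pattern directly, +, ( ) and ? act as operators and the search fails
as_regex = re.search(literal, haystack)               # None

# After escaping, every character is matched literally
as_literal = re.search(re.escape(literal), haystack)
found_text = as_literal.group()                       # '1+1=2 (really?)'

print(as_regex, found_text)
```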

15. fullmatch method

fullmatch(pattern, string, flags=0)
        Try to apply the pattern to all of the string, returning
        a match object, or None if no match was found.

import re

ret = re.fullmatch('ab121', 'ab121')
print(ret)          # <re.Match object; span=(0, 5), match='ab121'>
print(ret.group())  # ab121

ret = re.fullmatch('ab12', 'ab121')
print(ret)          # None: the pattern does not cover the whole string

16. purge method

purge()
        Clear the regular expression caches
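
The module-level functions keep recently compiled patterns in an internal cache, and purge() empties it. The cache itself is an implementation detail, so this sketch only shows that matching behaves the same before and after the call.

```python
import re

before = re.search(r"\d+", "abc123").group()   # pattern is compiled and cached

re.purge()                                     # clear the cache of compiled patterns

# The pattern is simply compiled again on the next use
after = re.search(r"\d+", "abc123").group()

print(before, after)   # 123 123
```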

17. template method

template(pattern, flags=0)
        Compile a template pattern, returning a pattern object

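import re

pattern = re.compile('(b)(a)(i)')
print(pattern, type(pattern))
# re.compile('(b)(a)(i)') <class 're.Pattern'>

# re.template() was never documented; it was deprecated in Python 3.11
# and removed in 3.13, so call it only where it still exists.
if hasattr(re, 'template'):
    ret = re.template('ab121')
    print(ret, type(ret))
    # re.compile('ab121', re.TEMPLATE) <class 're.Pattern'>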

18. Flag examples

import re

# 1. Case-insensitive matching
ret = re.search(r'ABC', 'www.abc.com')
print(ret)   # None

ret = re.search(r'ABC', 'www.abc.com', re.I)
print(ret.group())   # abc

ret = re.search(r'ABC', 'www.abc.com', re.IGNORECASE)
print(ret.group())   # abc


# 2. Make . match every character, including a newline
ret = re.search(r'a.b', 'www.a\nbc.com')
print(ret)   # None

ret = re.search(r'a.b', 'www.a\nbc.com', re.S)
print(repr(ret.group()))   # 'a\nb'

ret = re.search(r'a.b', 'www.a\nbc.com', re.DOTALL)
print(repr(ret.group()))   # 'a\nb'


# 3. Multi-line matching: ^ also matches at the start of each line, $ at the end of each line
ret = re.findall(r'foo.$', 'foo1\nfoo2\n')
print(ret)   # ['foo2']

ret = re.findall(r'foo.$', 'foo1\nfoo2\n', re.M)
print(ret)   # ['foo1', 'foo2']

ret = re.findall(r'foo.$', 'foo1\nfoo2\n', re.MULTILINE)
print(ret)   # ['foo1', 'foo2']


# 4. Make \w \W \b \B \s \S match ASCII characters only
ret = re.findall(r'\w\W\s\S', '你， 是')
print(ret)   # ['你， 是']  (Unicode matching is the default for str patterns)

ret = re.findall(r'\w\W\s\S', '你， 是', re.A)
print(ret)   # []  ('你' is not an ASCII word character)

ret = re.findall(r'\w\W\s\S', 'n, s', re.A)
print(ret)   # ['n, s']

ret = re.findall(r'\w\W\s\S', 'n, s', re.ASCII)
print(ret)   # ['n, s']


# 5. Unicode matching; this is already the default mode for str patterns
ret = re.findall(r'\w\W\s\S', '你， 是', re.U)
print(ret)   # ['你， 是']

ret = re.findall(r'\w\W\s\S', 'n, s', re.UNICODE)
print(ret)   # ['n, s']


# 6. Make the predefined classes \w \W \b \B \s \S depend on the current locale
#    (re.L / re.LOCALE; bytes patterns only in Python 3)


# 7. Verbose mode: the pattern may span several lines, whitespace is ignored
#    and comments are allowed, which makes the pattern easier to read
a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
print(a)
print(a.match('3.14').group())   # 3.14

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.VERBOSE)
print(a)
print(a.match('3.14').group())   # 3.14
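
The locale flag is the only one left without an example in the listing above. A minimal sketch of how it behaves in Python 3, where re.L is accepted for bytes patterns only; the sample input is plain ASCII, so the result should not depend on the active locale.

```python
import re

# With a bytes pattern, re.L makes \w follow the current locale
words = re.findall(rb"\w+", b"foo bar", re.L)   # [b'foo', b'bar']

# With a str pattern, the flag is rejected when the pattern is compiled
try:
    re.compile(r"\w+", re.L)
    str_allowed = True
except ValueError:
    str_allowed = False

print(words, str_allowed)
```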