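Pattern objects

Parsing and validating a regular expression is often the most time-consuming stage. When the same regular expression is used frequently, reusing the already parsed and validated expression helps performance.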

compile

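re.compile creates a regular expression object. Once the expression has been parsed and validated without errors, the returned object can be reused. For example: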

import re

regex = re.compile(r'.*foo')
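
As a rough sketch of the reuse described above, the pattern can be compiled once outside a loop and then applied to many strings. The sample strings are made up for illustration. Note that the re module also keeps an internal cache of recently compiled patterns, so the gain over calling the module-level functions directly may be modest in small programs.

```python
import re

# Compile once, outside the loop, then reuse the same pattern object.
regex = re.compile(r'.*foo')

# Hypothetical sample inputs, for illustration only.
lines = ['a foo', 'bar', 'xyzfoo and more']

# match() starts at the beginning of each string; '.*' then consumes
# whatever precedes 'foo'.
matched = [line for line in lines if regex.match(line)]
print(matched)  # ['a foo', 'xyzfoo and more']
```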

re.compile accepts a flags argument that further configures the behavior of the regular expression object. For example, to match the text dog regardless of case:

regex = re.compile(r'dog', re.IGNORECASE)

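Inline flags can also be used inside the regular expression itself. For example, the inline form equivalent to re.IGNORECASE is (?i), so the following has the same effect as the previous example: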

regex = re.compile('(?i)dog')

The characters available as inline flags are i, a, m, s, L, and x, each corresponding to a value of the flags argument of re.compile:

re.IGNORECASE or re.I: (?i)
re.ASCII or re.A: (?a)
re.MULTILINE or re.M: (?m)
re.DOTALL or re.S: (?s)
re.LOCALE or re.L: (?L)
re.VERBOSE or re.X: (?x)
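
The correspondence listed above can be checked with a short sketch: a pattern compiled with the flags argument and one compiled with the inline flag should find the same matches. The variable names here are illustrative only.

```python
import re

text = 'The Dog is mine and that dog is yours'

by_flag = re.compile(r'dog', re.IGNORECASE)   # flag passed as an argument
by_inline = re.compile(r'(?i)dog')            # flag embedded in the pattern

print(by_flag.findall(text))    # ['Dog', 'dog']
print(by_inline.findall(text))  # ['Dog', 'dog']

# The inline flag is also reflected in the pattern object's flags attribute.
print(bool(by_inline.flags & re.IGNORECASE))  # True
```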

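Python supports Unicode mode by default. To make character classes such as \w, \d, and \s fall back to traditional ASCII-only matching, use re.ASCII. re.MULTILINE enables multi-line mode, which changes the behavior of ^ and $: the position right after a newline counts as the start of a line, and the position right before a newline counts as the end of a line. By default . does not match a newline; set re.DOTALL to make it match newlines as well.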
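
A minimal sketch of the three flags just described. The sample strings (full-width digits, a three-line text) are chosen here only for illustration.

```python
import re

# re.ASCII: \d matches only 0-9 instead of any Unicode decimal digit.
fullwidth = '\uff11\uff12\uff13'  # full-width digits 1, 2, 3
unicode_digits = re.compile(r'\d+').findall(fullwidth)
ascii_digits = re.compile(r'\d+', re.ASCII).findall(fullwidth)
print(len(unicode_digits), len(ascii_digits))  # 1 0

# re.MULTILINE: ^ and $ also match at line boundaries inside the string.
text = 'first\nsecond\nthird'
single = re.compile(r'^\w+$').findall(text)
multi = re.compile(r'^\w+$', re.MULTILINE).findall(text)
print(single)  # []
print(multi)   # ['first', 'second', 'third']

# re.DOTALL: . also matches the newline character.
plain_dot = re.compile(r'a.b').search('a\nb')
dotall_dot = re.compile(r'a.b', re.DOTALL).search('a\nb')
print(plain_dot is None, dotall_dot is not None)  # True True
```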

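Setting re.LOCALE is not recommended. It makes \w, \W, \b, \B, and case-insensitive matching depend on the current locale, but the locale mechanism is unreliable and handles only one locale at a time. Since Python 3.6 it can be used only with bytes patterns.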

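re.VERBOSE lets you lay out a regular expression for readability: whitespace is ignored (except in cases such as inside a character class or after a backslash; see the re module documentation), and # can be used to add comments. For example, a and b below are equivalent regular expressions: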

a = re.compile(r"""\d +  # integer part
                   \.    # decimal point
                   \d *  # fractional digits""", re.X)

b = re.compile(r"\d+\.\d*")
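
One way to convince yourself that the two forms behave the same is to run both against a few inputs and compare the results. The helper found and the sample strings are hypothetical, introduced only for this sketch.

```python
import re

a = re.compile(r"""\d +  # integer part
                   \.    # decimal point
                   \d *  # fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

def found(pattern, s):
    """Return the first matched substring, or None when nothing matches."""
    m = pattern.search(s)
    return m.group() if m else None

samples = ['3.14', '10.', 'abc', '.5', 'v2.0beta']
results_a = [found(a, s) for s in samples]
results_b = [found(b, s) for s in samples]

print(results_a)               # ['3.14', '10.', None, None, '2.0']
print(results_a == results_b)  # True
```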

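split / findall

Once you have a regular expression object, its split method splits a string wherever the expression matches, with the same effect as the re.split function; its findall method returns all matching substrings, with the same effect as the re.findall function: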

>>> dog = re.compile('(?i)dog')
>>> dog.split('The Dog is mine and that dog is yours')
['The ', ' is mine and that ', ' is yours']
>>> dog.findall('The Dog is mine and that dog is yours')
['Dog', 'dog']
>>>
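
The equivalence with the module-level functions can be sketched as follows. The maxsplit argument shown at the end is not covered in the text above; it is included only as an illustration of an optional parameter that split accepts.

```python
import re

text = 'The Dog is mine and that dog is yours'
dog = re.compile('(?i)dog')

# The pattern methods and the module-level functions give the same results.
print(dog.split(text) == re.split('(?i)dog', text))      # True
print(dog.findall(text) == re.findall('(?i)dog', text))  # True

# maxsplit limits how many splits are performed.
first_only = dog.split(text, maxsplit=1)
print(first_only)  # ['The ', ' is mine and that dog is yours']
```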

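Getting a Match object

To get more detail about each match, use the finditer method. It returns an iterator that yields a Match object on each iteration. Call group on it to get the substring that matched the whole regular expression, start to get that substring's starting index, and end to get its ending index. For example: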

>>> dog = re.compile('(?i)dog')
>>> for m in dog.finditer('The Dog is mine and that dog is yours'):
...     print(m.group(), 'between', m.start(), 'and', m.end())
...
Dog between 4 and 7
dog between 25 and 28
>>>

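search and match must be carefully distinguished. search scans the whole string for the first matching substring, whereas match only checks whether the string matches starting at its beginning. Both return a Match object on success and None otherwise.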

>>> dog.search('The Dog is mine and that dog is yours')
<re.Match object; span=(4, 7), match='Dog'>
>>> dog.match('The Dog is mine and that dog is yours')
>>> dog.match('Dog is mine and that dog is yours')
<re.Match object; span=(0, 3), match='Dog'>
>>>
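
Because both methods return None when nothing matches, calling code usually checks the result before using it. The following is a minimal sketch; the helper first_dog is hypothetical, and fullmatch, which is not discussed above, is shown only for contrast since it requires the whole string to match.

```python
import re

dog = re.compile('(?i)dog')

def first_dog(text):
    """Return (matched text, span) for the first match, or None if absent."""
    m = dog.search(text)
    if m is None:  # search() and match() return None on failure
        return None
    return m.group(), m.span()

print(first_dog('The Dog is mine'))  # ('Dog', (4, 7))
print(first_dog('The cat is mine'))  # None

# match() only succeeds at the beginning of the string.
print(dog.match('The Dog is mine') is None)  # True

# fullmatch() requires the pattern to cover the entire string.
print(dog.fullmatch('DOG') is not None)      # True
print(dog.fullmatch('Dog is mine') is None)  # True
```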

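Working with groups

If the regular expression defines groups, findall returns a list of the text captured by the groups rather than the whole matches. For example, (\d\d)\1 matches four digits in which the first two digits and the last two digits are the same: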

>>> twins = re.compile(r'(\d\d)\1')
>>> twins.findall('12341212345453999928202')
['12', '45', '99']
>>>

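Only 1212, 4545, and 9999 match. Because the group is (\d\d), two digits, and findall returns the group contents, the result is 12, 45, and 99. To get 1212, 4545, and 9999 instead, one option is to add an outer group and take the outer group's result: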

>>> twins = re.compile(r'((\d\d)\2)')
>>> twins.findall('12341212345453999928202')
[('1212', '12'), ('4545', '45'), ('9999', '99')]
>>>

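Another option is the finditer method: iterate over the Match objects and call group, which returns the substring that matched the whole regular expression. For example: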

>>> for m in twins.finditer('12341212345453999928202'):
...     print(m.group())
...
1212
4545
9999
>>>
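
When the whole matches are needed as a list rather than printed one by one, the finditer loop can be written as a list comprehension. This is just one possible sketch; it uses the single-group form of the pattern.

```python
import re

twins = re.compile(r'(\d\d)\1')
text = '12341212345453999928202'

# group() with no argument returns the whole match, not just the group.
whole = [m.group() for m in twins.finditer(text)]
print(whole)  # ['1212', '4545', '9999']

# span() gives the (start, end) indices of each whole match.
spans = [m.span() for m in twins.finditer(text)]
print(spans)  # [(4, 8), (9, 13), (14, 18)]
```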

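As discussed earlier, (["'])[^"']*\1 matches text whose opening and closing quotes are the same. To find text enclosed in single or double quotes, using findall as follows does not work: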

>>> regex = re.compile(r'''(["'])[^"']*\1''')
>>> regex.findall(r'''your right brain has nothing 'left' and your left has nothing "right"''')
["'", '"']
>>>

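Because findall returns the group contents and the group is (["']), which matches a single or double quote, the list contains only ' and ". To find the quoted text, do the following instead: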

>>> import re
>>> regex = re.compile(r'''(["'])[^"']*\1''')
>>> for m in regex.finditer(r'''your right brain has nothing 'left' and your left has nothing "right"'''):
...     print(m.group())
...
'left'
"right"
>>>

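When groups are defined and search or match finds a match, you can pass a number to group to choose which group to retrieve, or call groups to get a tuple containing all captured groups. For example: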

>>> regex = re.compile(r'''(["'])([^"']*)\1''')
>>> m = regex.search(r"your right brain has nothing 'left'")
>>> m.group(1)
"'"
>>> m.group(2)
'left'
>>> m.groups()
("'", 'left')
>>> m.group(0)
"'left'"
>>>

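group(0) is the same as calling group() with no argument: it returns the whole string that matched the regular expression. If a group was named with (?P<name>…), the group name can also be passed to group(). For example: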

>>> twins = re.compile(r'(?P<tens>\d\d)(?P=tens)')
>>> m = twins.search('12341212345453999928202')
>>> m.group('tens')
'12'
>>>
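
A named group keeps its number as well, so both forms of lookup work. The sketch below also uses groupdict, which is not mentioned above; it returns the named groups as a dictionary.

```python
import re

twins = re.compile(r'(?P<tens>\d\d)(?P=tens)')
m = twins.search('12341212345453999928202')

print(m.group('tens'))  # '12'  lookup by name
print(m.group(1))       # '12'  the same group, looked up by number
print(m.group())        # '1212'  the whole match
print(m.groupdict())    # {'tens': '12'}
```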

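String replacement

To replace matching substrings, use the regular expression object's sub() method. For example, to change every single quote to a double quote: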

>>> regex = re.compile(r"'")
>>> regex.sub('"', "your right brain has nothing 'left' and your left brain has nothing 'right'")
'your right brain has nothing "left" and your left brain has nothing "right"'
>>>

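If the regular expression defines groups, the replacement string passed to sub can use \num to refer to the text captured by a group, where num is the group number. For example, the following changes a user's email address from .com to .cc: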

>>> regex = re.compile(r'(^[a-zA-Z]+\d*)@([a-z]+?\.)com')
>>> regex.findall('caterpillar@openhome.com')
[('caterpillar', 'openhome.')]
>>> regex.sub(r'\1@\2cc', 'caterpillar@openhome.com')
'caterpillar@openhome.cc'
>>>

The whole regular expression matches 'caterpillar@openhome.com'. The first group captures 'caterpillar' and the second captures 'openhome.'; \1 and \2 stand for those two parts.
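
One detail worth checking in a pattern like this is the dot before com. An unescaped . matches any character, while \. matches only a literal dot. The sketch below compares the two forms on a deliberately malformed address; the names loose and strict and the malformed input are made up for illustration.

```python
import re

loose = re.compile(r'(^[a-zA-Z]+\d*)@([a-z]+?.)com')    # . matches any character
strict = re.compile(r'(^[a-zA-Z]+\d*)@([a-z]+?\.)com')  # \. matches only a dot

good = 'caterpillar@openhome.com'
bad = 'caterpillar@openhomeXcom'  # hypothetical malformed address

print(strict.sub(r'\1@\2cc', good))  # caterpillar@openhome.cc
print(loose.sub(r'\1@\2cc', bad))    # caterpillar@openhomeXcc
print(strict.sub(r'\1@\2cc', bad))   # caterpillar@openhomeXcom (no match, unchanged)
```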

If groups were named with (?P<name>…), refer to them by name in the replacement string passed to sub with \g<name>. For example:

>>> regex = re.compile(r'(?P<user>^[a-zA-Z]+\d*)@(?P<preCom>[a-z]+?\.)com')
>>> regex.findall('caterpillar@openhome.com')
[('caterpillar', 'openhome.')]
>>> regex.sub(r'\g<user>@\g<preCom>cc', 'caterpillar@openhome.com')
'caterpillar@openhome.cc'
>>>
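
The \g<...> syntax also accepts a group number, which helps when a group reference is immediately followed by a digit: \10 would be read as a reference to group 10 rather than group 1 followed by a literal 0. A small sketch, with a pattern and input invented for illustration:

```python
import re

# Group 1 captures the letters, group 2 captures the single digit.
regex = re.compile(r'([a-z]+)(\d)')

# \g<1>0 means "group 1, then a literal 0"; \2 then appends the digit.
padded = regex.sub(r'\g<1>0\2', 'item7')
print(padded)  # item07
```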