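Regexp Instances in Go

In Go, tasks such as extracting the matched part of a string or performing replacements require the regular expression to be compiled into a Regexp first: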

func Compile(expr string) (*Regexp, error)
func CompilePOSIX(expr string) (*Regexp, error)
func MustCompile(str string) *Regexp
func MustCompilePOSIX(str string) *Regexp

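Functions ending in POSIX require the regular expression to follow POSIX ERE (egrep) syntax. Functions starting with Must panic if the expression cannot be parsed.

If parsing succeeds, a *Regexp is returned. From then on it is only a matter of matching, with no further errors to handle. For example: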
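Besides restricting the syntax, the regexp package documents the POSIX variants as switching the match semantics to leftmost-longest, whereas Compile prefers the first alternative that matches. The following is only a small sketch of that difference; the helper names perlMatch and posixMatch and the pattern a|ab are made up for this illustration.

```go
package main

import (
    "fmt"
    "regexp"
)

// perlMatch and posixMatch are hypothetical helpers for this illustration.
// Both use the same pattern; only the compile function differs.
func perlMatch(s string) string {
    // Default semantics: among leftmost matches, the first alternative wins.
    return regexp.MustCompile(`a|ab`).FindString(s)
}

func posixMatch(s string) string {
    // POSIX semantics: among leftmost matches, the longest one wins.
    return regexp.MustCompilePOSIX(`a|ab`).FindString(s)
}

func main() {
    fmt.Printf("%q\n", perlMatch("ab"))  // "a"
    fmt.Printf("%q\n", posixMatch("ab")) // "ab"
}
```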

package main

import (
    "fmt"
    "regexp"
)

func main() {
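    re, err := regexp.Compile(`\d{4}-\d{6}`)
    fmt.Println(re, err) // \d{4}-\d{6} <nil>

    matched := re.MatchString("0970-168168")
    fmt.Println(matched) // true
    // The pattern is not anchored, so a match anywhere in the string counts.
    matched = re.MatchString("Phone: 0970-168168")
    fmt.Println(matched) // true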
}
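The example above only prints err. When the pattern is not a constant, for instance when it comes from user input, the error would normally be checked before the Regexp is used. A minimal sketch, assuming a hypothetical wrapper named compilePattern:

```go
package main

import (
    "fmt"
    "regexp"
)

// compilePattern is a hypothetical wrapper showing the usual error check
// around regexp.Compile for patterns that are not known in advance.
func compilePattern(expr string) (*regexp.Regexp, bool) {
    re, err := regexp.Compile(expr)
    if err != nil {
        fmt.Println("invalid pattern:", err)
        return nil, false
    }
    return re, true
}

func main() {
    if re, ok := compilePattern(`\d{4}-\d{6}`); ok {
        fmt.Println(re.MatchString("0970-168168")) // true
    }
    // The unclosed group makes this pattern invalid, so Compile reports an error.
    if _, ok := compilePattern(`(\d{4}-\d{6}`); !ok {
        fmt.Println("second pattern rejected")
    }
}
```

For patterns written directly in the source code, MustCompile is usually the more convenient choice, since a mistake then shows up as a panic the first time the program runs.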

Finding Matches

To find the leftmost match, use the methods whose names begin with Find:

func (re *Regexp) Find(b []byte) []byte
func (re *Regexp) FindIndex(b []byte) (loc []int)
func (re *Regexp) FindReaderIndex(r io.RuneReader) (loc []int)
func (re *Regexp) FindReaderSubmatchIndex(r io.RuneReader) []int
func (re *Regexp) FindString(s string) string
func (re *Regexp) FindStringIndex(s string) (loc []int)
func (re *Regexp) FindStringSubmatch(s string) []string
func (re *Regexp) FindStringSubmatchIndex(s string) []int
func (re *Regexp) FindSubmatch(b []byte) [][]byte
func (re *Regexp) FindSubmatchIndex(b []byte) []int

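The variants with Index in their names return byte offsets instead of the matched text. For FindIndex and FindStringIndex, the returned []int has two elements: the byte index where the match starts and the byte index just past its end. The SubmatchIndex variants add one more such pair for every capture group. For example: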

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`foo.?`)
    fmt.Printf("%q\n", re.FindString("seafood fool"))      // "food"
    fmt.Printf("%v\n", re.FindStringIndex("seafood fool")) // [3 7]
}
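Because the offsets count bytes rather than characters, they can be used directly to slice the original string, and they shift when the text before the match contains multi-byte characters. The sketch below illustrates this with a hypothetical helper named locate; the input strings are invented for the example.

```go
package main

import (
    "fmt"
    "regexp"
)

var phoneRe = regexp.MustCompile(`\d{4}-\d{6}`)

// locate is a hypothetical helper: it returns the byte offsets of the first
// match plus the matched text recovered by slicing, or ok == false if there
// is no match.
func locate(s string) (start, end int, text string, ok bool) {
    loc := phoneRe.FindStringIndex(s)
    if loc == nil { // the Index methods return nil when nothing matches
        return 0, 0, "", false
    }
    return loc[0], loc[1], s[loc[0]:loc[1]], true
}

func main() {
    // "電話" is two characters but six bytes in UTF-8, so the match starts at byte 7.
    start, end, text, ok := locate("電話 0970-168168")
    fmt.Println(start, end, text, ok) // 7 18 0970-168168 true

    _, _, _, ok = locate("no number here")
    fmt.Println(ok) // false
}
```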

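Methods with Submatch in their names support capture groups: the first element of the result is the whole match, and the elements after it hold the text captured by each group. For example: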

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`(\d{4})-(\d{6})`)
    // ["0970-666888" "0970" "666888"]
    fmt.Printf("%q\n", re.FindStringSubmatch("0970-666888"))
}
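When the input may not match at all, the Submatch methods return nil, so it is safer to check the result before indexing into it. The following sketch assumes a hypothetical helper named splitPhone and also prints the offsets reported by FindStringSubmatchIndex for the same input.

```go
package main

import (
    "fmt"
    "regexp"
)

var phoneRe = regexp.MustCompile(`(\d{4})-(\d{6})`)

// splitPhone is a hypothetical helper that returns the two captured groups,
// or ok == false when the input contains no match.
func splitPhone(s string) (prefix, number string, ok bool) {
    m := phoneRe.FindStringSubmatch(s)
    if m == nil { // no match: do not index into m
        return "", "", false
    }
    return m[1], m[2], true
}

func main() {
    prefix, number, ok := splitPhone("0970-666888")
    fmt.Println(prefix, number, ok) // 0970 666888 true

    _, _, ok = splitPhone("no number")
    fmt.Println(ok) // false

    // Offsets come in pairs: whole match, group 1, group 2.
    fmt.Println(phoneRe.FindStringSubmatchIndex("0970-666888")) // [0 11 0 4 5 11]
}
```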

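What about finding every match? Before getting to that, look at how a regular expression can be used to split a string into substrings. This is done with the Split method of Regexp. Its second parameter sets the maximum number of substrings to return, with the last substring holding the unsplit remainder; a value less than 0 returns all substrings: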

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re, _ := regexp.Compile(`\d`)
    fmt.Println(re.Split("Justin1Monica2Irene", 1))   // [Justin1Monica2Irene]
    fmt.Println(re.Split("Justin1Monica2Irene", 2))   // [Justin Monica2Irene]
    fmt.Println(re.Split("Justin1Monica2Irene", 3))   // [Justin Monica Irene]
    fmt.Println(re.Split("Justin1Monica2Irene", -1))  // [Justin Monica Irene]
}
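A few edge cases of Split are easy to overlook: a separator at the very start or end of the input produces an empty string in the result, a count of 0 returns nil, and a count larger than the number of pieces simply returns all of them. A short sketch, where splitOn is a hypothetical wrapper and the inputs are made up:

```go
package main

import (
    "fmt"
    "regexp"
)

var digit = regexp.MustCompile(`\d`)

// splitOn is a hypothetical wrapper around Split, used to keep the calls short.
func splitOn(s string, n int) []string {
    return digit.Split(s, n)
}

func main() {
    // Separators at both ends yield empty strings at both ends.
    fmt.Printf("%q\n", splitOn("1Justin2", -1)) // ["" "Justin" ""]

    // A count of 0 asks for zero substrings, so the result is nil.
    fmt.Println(splitOn("Justin1Monica", 0) == nil) // true

    // A count larger than the number of pieces returns every piece.
    fmt.Println(len(splitOn("Justin1Monica2Irene", 5))) // 3
}
```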

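Quite a few of the Find methods on Regexp take a count in the same way. These are the FindAll variants, where n limits how many matches are returned and a negative n returns all of them: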

func (re *Regexp) FindAll(b []byte, n int) [][]byte
func (re *Regexp) FindAllIndex(b []byte, n int) [][]int
func (re *Regexp) FindAllString(s string, n int) []string
func (re *Regexp) FindAllStringIndex(s string, n int) [][]int
func (re *Regexp) FindAllStringSubmatch(s string, n int) [][]string
func (re *Regexp) FindAllStringSubmatchIndex(s string, n int) [][]int
func (re *Regexp) FindAllSubmatch(b []byte, n int) [][][]byte
func (re *Regexp) FindAllSubmatchIndex(b []byte, n int) [][]int

So, one way to find every match is the following:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`(\d{4})-(\d{6})`)

    // Prints "0970-666888" and "0970-168168" on separate lines
    for _, match := range re.FindAllString("0970-666888, 0970-168168", -1) {
        fmt.Printf("%q\n", match)
    }
}
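The count argument can also cap the number of results, which may be useful when only the first few matches matter. The sketch below uses a hypothetical helper firstN and an invented input with three phone numbers.

```go
package main

import (
    "fmt"
    "regexp"
)

var phoneRe = regexp.MustCompile(`\d{4}-\d{6}`)

// firstN is a hypothetical helper that returns at most n matches,
// or every match when n is negative.
func firstN(s string, n int) []string {
    return phoneRe.FindAllString(s, n)
}

func main() {
    s := "0970-666888, 0970-168168, 0911-123456"
    fmt.Printf("%q\n", firstN(s, 2))      // ["0970-666888" "0970-168168"]
    fmt.Println(len(firstN(s, -1)))       // 3
    fmt.Println(firstN("none", -1) == nil) // true: nil means no match
}
```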

Below is the version that also returns the capture groups:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`(\d{4})-(\d{6})`)

    // Prints ["0970-666888" "0970" "666888"] and
    // ["0970-168168" "0970" "168168"] on separate lines
    for _, submatch := range re.FindAllStringSubmatch("0970-666888, 0970-168168", -1) {
        fmt.Printf("%q\n", submatch)
    }
}
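In practice the captured groups are often copied into a more descriptive type rather than printed. One possible shape is sketched here; the phone struct and the parsePhones function are assumptions made for this example.

```go
package main

import (
    "fmt"
    "regexp"
)

var phoneRe = regexp.MustCompile(`(\d{4})-(\d{6})`)

// phone is a hypothetical type that holds the two captured groups.
type phone struct {
    Prefix string
    Number string
}

// parsePhones collects every match in s. Each submatch slice holds the whole
// match at index 0 followed by the two groups at indexes 1 and 2.
func parsePhones(s string) []phone {
    var phones []phone
    for _, m := range phoneRe.FindAllStringSubmatch(s, -1) {
        phones = append(phones, phone{Prefix: m[1], Number: m[2]})
    }
    return phones
}

func main() {
    // [{Prefix:0970 Number:666888} {Prefix:0970 Number:168168}]
    fmt.Printf("%+v\n", parsePhones("0970-666888, 0970-168168"))
}
```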

Replacing Matches

To perform replacements, use the methods whose names begin with Replace:

func (re *Regexp) ReplaceAll(src, repl []byte) []byte
func (re *Regexp) ReplaceAllFunc(src []byte, repl func([]byte) []byte) []byte
func (re *Regexp) ReplaceAllLiteral(src, repl []byte) []byte
func (re *Regexp) ReplaceAllLiteralString(src, repl string) string
func (re *Regexp) ReplaceAllString(src, repl string) string
func (re *Regexp) ReplaceAllStringFunc(src string, repl func(string) string) string

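Methods ending in Func accept a function: the function receives each match and decides what to replace it with. For the methods with neither Func nor Literal in their names, repl can refer to capture groups, and a numbered group is written as ${n}. For example: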

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // \. matches a literal dot; an unescaped . would match any character
    re := regexp.MustCompile(`(^[a-zA-Z]+\d*)@([a-z]+?\.)com`)
    // Prints caterpillar@openhome.cc
    fmt.Println(re.ReplaceAllString("caterpillar@openhome.com", "${1}@${2}cc"))
}
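The Func variants have no example above, so here is a minimal sketch of ReplaceAllStringFunc. The function receives the whole match and its return value is substituted directly, without any $ expansion. The helper name maskNumber and the masking rule are invented for this illustration.

```go
package main

import (
    "fmt"
    "regexp"
)

var sixDigits = regexp.MustCompile(`\d{6}`)

// maskNumber is a hypothetical helper that hides the first three digits of
// every six-digit run and keeps the last three.
func maskNumber(s string) string {
    return sixDigits.ReplaceAllStringFunc(s, func(m string) string {
        // m is the matched text, always six ASCII digits here.
        return "***" + m[3:]
    })
}

func main() {
    fmt.Println(maskNumber("0970-666888, 0970-168168")) // 0970-***888, 0970-***168
}
```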

If groups are named with (?P<name>…), they can be referenced as ${name}. For example:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    // \. matches a literal dot; an unescaped . would match any character
    re := regexp.MustCompile(`(?P<user>^[a-zA-Z]+\d*)@(?P<preCom>[a-z]+?\.)com`)
    // Prints caterpillar@openhome.cc
    fmt.Println(re.ReplaceAllString("caterpillar@openhome.com", "${user}@${preCom}cc"))
}
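Named groups can also be read back when searching, not only when replacing. One way to do it is to pair the result of FindStringSubmatch with SubexpNames, which lists the group names by index. The sketch below is only an illustration: the pattern is a simplified variant of the one above, and namedGroups and the group name host are assumptions.

```go
package main

import (
    "fmt"
    "regexp"
)

// A simplified pattern for this sketch, not a general e-mail validator.
var mailRe = regexp.MustCompile(`^(?P<user>[a-zA-Z]+\d*)@(?P<host>[a-z]+\.com)$`)

// namedGroups is a hypothetical helper that maps each group name to the text
// it captured, or returns nil when s does not match.
func namedGroups(s string) map[string]string {
    m := mailRe.FindStringSubmatch(s)
    if m == nil {
        return nil
    }
    groups := make(map[string]string)
    for i, name := range mailRe.SubexpNames() {
        if name != "" { // index 0 (the whole match) and unnamed groups have no name
            groups[name] = m[i]
        }
    }
    return groups
}

func main() {
    g := namedGroups("caterpillar@openhome.com")
    fmt.Println(g["user"], g["host"]) // caterpillar openhome.com
}
```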

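The earlier ${2} can also be written as $2, but once other text follows it directly, as in $2cc, the whole thing is read as a reference to a group named 2cc. Similarly, ${preCom} can be written as $preCom, but $preComcc is read as a reference to a group named preComcc. A reference to a group that does not exist is replaced with an empty string, so the mistake goes unreported. It is best to always include the {}.

The Replace methods with Literal in their names treat $ in repl as plain text: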
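The effect of leaving out the braces can be seen in a short sketch. The pattern here is a simplified variant of the earlier one, and rewrite is a hypothetical helper; the point is only that $2cc refers to a group named 2cc, which does not exist.

```go
package main

import (
    "fmt"
    "regexp"
)

// A simplified pattern for this sketch, not a general e-mail validator.
var mailRe = regexp.MustCompile(`^([a-zA-Z]+\d*)@([a-z]+\.)com$`)

// rewrite is a hypothetical helper that applies a replacement template.
func rewrite(addr, repl string) string {
    return mailRe.ReplaceAllString(addr, repl)
}

func main() {
    addr := "caterpillar@openhome.com"

    // With braces, group 2 is expanded and "cc" is appended.
    fmt.Println(rewrite(addr, "${1}@${2}cc")) // caterpillar@openhome.cc

    // Without braces, $2cc names a missing group and expands to nothing.
    fmt.Println(rewrite(addr, "$1@$2cc")) // caterpillar@
}
```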

package main

import (
    "fmt"
    "regexp"
)

func main() {
    re := regexp.MustCompile(`(?P<user>^[a-zA-Z]+\d*)@(?P<preCom>[a-z]+?\.)com`)
    // $user@${preCom}cc
    fmt.Println(re.ReplaceAllLiteralString("caterpillar@openhome.com", "$user@${preCom}cc"))
}
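The Literal variants are handy when the replacement text itself contains a $, such as a price. As a rough comparison, the sketch below replaces a made-up PRICE placeholder three ways; the helper names literal and expanded are assumptions. In the non-Literal methods a literal dollar sign can also be written as $$.

```go
package main

import (
    "fmt"
    "regexp"
)

var placeholder = regexp.MustCompile(`PRICE`)

// literal and expanded are hypothetical helpers for comparing the two styles.
func literal(s, repl string) string {
    return placeholder.ReplaceAllLiteralString(s, repl)
}

func expanded(s, repl string) string {
    return placeholder.ReplaceAllString(s, repl)
}

func main() {
    s := "cost: PRICE"
    fmt.Printf("%q\n", literal(s, "$5"))   // "cost: $5"
    fmt.Printf("%q\n", expanded(s, "$5"))  // "cost: " because group 5 does not exist
    fmt.Printf("%q\n", expanded(s, "$$5")) // "cost: $5"
}
```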