Regular expressions are an important tool for working with text. In Python, they are provided by the re module.
Most regular expression operations are available both as module-level functions and as RegexObject methods.
Import the library:
import re
re.compile(pattern, flags=0)
Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.
import re
text = 'I love China'
regexs = [re.compile(p) for p in ['love', 'll']]
for regex in regexs:
    print(regex.pattern)
    if regex.search(text):
        print("Match")
    else:
        print("Not Match")
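The flags argument changes how the pattern is interpreted. As a minimal sketch (the uppercase pattern and the variable names are illustrative, not part of the example above), re.IGNORECASE makes the compiled object match regardless of case:

```python
import re

text = 'I love China'

# Without flags the pattern is case-sensitive.
strict = re.compile('LOVE')
# re.IGNORECASE is passed through the flags argument of re.compile().
relaxed = re.compile('LOVE', flags=re.IGNORECASE)

strict_found = strict.search(text) is not None
relaxed_found = relaxed.search(text) is not None

print('case-sensitive: %s' % strict_found)
print('ignore case: %s' % relaxed_found)
```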
re.search(pattern, string)
Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.
#!/usr/bin/python
#encoding:utf-8
import re
text = 'I love China'
regexs = [re.compile(p) for p in ['love', 'll']]
for regex in regexs:
    print(regex.pattern)
    # re.search() also accepts an already compiled pattern
    match = re.search(regex, text)
    if match:
        print("Match")
    else:
        print("Not Match")
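The difference between None and a zero-length match can be seen with two small patterns. This is only a sketch; the patterns 'x*' and 'x+' are chosen for illustration:

```python
import re

text = 'I love China'

# 'x*' can match zero characters, so search() succeeds with an empty match.
empty_match = re.search('x*', text)
# 'x+' needs at least one 'x', which the text lacks, so search() returns None.
no_match = re.search('x+', text)

print(repr(empty_match.group(0)))  # the matched text is the empty string
print(empty_match.span())          # start and end are the same position
print(no_match)
```

A zero-length match is still a match object, so it is truthy in an if test, while None is not.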
RegexObject.search(string)
#!/usr/bin/python
#encoding:utf-8
import re
text = 'I love China'
# compile first, then call search() on the compiled object
regex = re.compile('love')
match = regex.search(text)
s = match.start()
e = match.end()
print('Found "%s"\nin "%s"\nfrom %d to %d ("%s")' %
      (match.re.pattern, match.string, s, e, text[s:e]))
re.match(pattern, string)
If zero or more characters at the beginning of string match this regular expression, return a corresponding MatchObject instance. Return None if the string does not match the pattern; note that this is different from a zero-length match.
#!/usr/bin/python
#encoding:utf-8
import re
text = 'His phone 12345 number is 67890'
regex = re.compile(r'.*?(\d+).*?(\d+)')
match = re.match(regex, text)
if match:
    print('%s %s' % (match.group(1), match.group(2)))
RegexObject.match(string)
#!/usr/bin/python
#encoding:utf-8
import re
text = 'His phone 12345 number is 67890'
regex = re.compile(r'.*?(\d+).*?(\d+)')
match = regex.match(text)
if match:
    print('%s %s' % (match.group(1), match.group(2)))
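Because match() only looks at the beginning of the string, it can fail where search() succeeds. A short sketch using the same text (the pattern and variable names are illustrative):

```python
import re

text = 'His phone 12345 number is 67890'
digits = re.compile(r'\d+')

# match() only tries position 0, where the text starts with 'H'.
at_start = digits.match(text)
# search() scans forward until it finds the first run of digits.
anywhere = digits.search(text)

print(at_start)
print(anywhere.group(0))
```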
A match object has many attributes and methods, including group(), start(), end(), re and string.
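A few of them can be tried out as follows. This is a sketch rather than a complete list; the pattern is made up for the example:

```python
import re

text = 'His phone 12345 number is 67890'
match = re.search(r'(\d+)\D+(\d+)', text)

whole = match.group(0)        # the entire matched text
both = match.groups()         # all captured groups as a tuple
first_span = match.span(1)    # (start, end) of the first group
pattern_used = match.re.pattern
searched = match.string

print(whole)
print(both)
print(first_span)
print('%d %d' % (match.start(), match.end()))
```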
Matching more than once
re.findall(pattern, string)
#!/usr/bin/python
#encoding:utf-8
import re
text = 'this is a text'
pattern = 'is'
matchs = re.findall(pattern, text)
for match in matchs:
    print(match)
lo@ubuntu:~/try/regex$ python searchall.py
is
is
re.finditer(pattern, string)
#!/usr/bin/python
#encoding:utf-8
import re
text = 'this is a text'
pattern = 'is'
matchs = re.finditer(pattern, text)
for match in matchs:
    s = match.start()
    e = match.end()
    print("Found %s at: %d %d" % (text[s:e], s, e))
lo@ubuntu:~/try/regex$ python searchall.py
Found is at: 2 4
Found is at: 5 7
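One difference between the two functions shows up when the pattern contains groups: re.findall() then returns the captured groups instead of the whole match, while re.finditer() still yields match objects. A minimal sketch, with a sample string and pattern invented for the illustration:

```python
import re

text = 'width=10 height=20'  # illustrative sample, not from the examples above
pattern = r'(\w+)=(\d+)'

# With two groups, findall() returns a list of 2-tuples.
pairs = re.findall(pattern, text)
# finditer() yields match objects; groups() gives the same tuples.
same_pairs = [m.groups() for m in re.finditer(pattern, text)]

print(pairs)
print(same_pairs == pairs)
```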
Pattern Syntax
Repetition
{m,n} | from m to n repetitions |
* | equivalent to {0,} |
+ | equivalent to {1,} |
? | equivalent to {0,1} |
#!/usr/bin/python
#encoding: utf-8
import re
text = '101000111'
patterns = [
    '10',
    '10?',
    '10*',
    '10+',
    '10{3}',
    '10{1,3}'
]
print('original string: %s' % text)
for pattern in patterns:
    matchs = re.finditer(pattern, text)
    for match in matchs:
        s = match.start()
        e = match.end()
        substr = text[s:e]
        print('pattern: %s Found: %s at %d %d' % (pattern, substr, s, e))
lo@ubuntu:~/try/regex$ python repetition.py
original string: 101000111
pattern: 10 Found: 10 at 0 2
pattern: 10 Found: 10 at 2 4
pattern: 10? Found: 10 at 0 2
pattern: 10? Found: 10 at 2 4
pattern: 10? Found: 1 at 6 7
pattern: 10? Found: 1 at 7 8
pattern: 10? Found: 1 at 8 9
pattern: 10* Found: 10 at 0 2
pattern: 10* Found: 1000 at 2 6
pattern: 10* Found: 1 at 6 7
pattern: 10* Found: 1 at 7 8
pattern: 10* Found: 1 at 8 9
pattern: 10+ Found: 10 at 0 2
pattern: 10+ Found: 1000 at 2 6
pattern: 10{3} Found: 1000 at 2 6
pattern: 10{1,3} Found: 10 at 0 2
pattern: 10{1,3} Found: 1000 at 2 6
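The equivalences listed in the table can be checked directly by comparing what each form finds in the same text. A small sketch (the variable names are illustrative):

```python
import re

text = '101000111'

# Each shorthand quantifier paired with its explicit {m,n} form.
pairs = [
    ('10*', '10{0,}'),
    ('10+', '10{1,}'),
    ('10?', '10{0,1}'),
]

results = []
for short, explicit in pairs:
    same = re.findall(short, text) == re.findall(explicit, text)
    results.append(same)
    print('%s vs %s: %s' % (short, explicit, same))
```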
Character set
[ab] [a-z] [0-9] [a-zA-Z]
#!/usr/bin/python
#encoding: utf-8
import re
text = 'string is not 12324, 234 IS NOT STRING'
patterns = [
    '[a-z]+',
    '[A-Z]+',
    '[0-9]+',
    '[a-zA-Z]+',
    '[a-zA-Z0-9]+'
]
print("original string: %s" % text)
for pattern in patterns:
    print("pattern is: %s" % pattern)
    matchs = re.findall(pattern, text)
    for match in matchs:
        print(match)
lo@ubuntu:~/try/regex$ python searchset.py
original string: string is not 12324, 234 IS NOT STRING
pattern is: [a-z]+
string
is
not
pattern is: [A-Z]+
IS
NOT
STRING
pattern is: [0-9]+
12324
234
pattern is: [a-zA-Z]+
string
is
not
IS
NOT
STRING
pattern is: [a-zA-Z0-9]+
string
is
not
12324
234
IS
NOT
STRING
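One detail worth checking: inside square brackets, | is an ordinary character and not alternation. A quick sketch with a made-up sample string:

```python
import re

text = 'a|b c'  # illustrative sample containing a literal '|'

# Inside a set, '|' is just one more character to match.
with_bar = re.findall('[a|b]', text)
# A set of 'a' and 'b' only.
without_bar = re.findall('[ab]', text)
# Outside a set, '|' means alternation: 'a' or 'b'.
alternation = re.findall('a|b', text)

print(with_bar)
print(without_bar)
print(alternation)
```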
Greedy or non-greedy (minimal) matching
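By default a quantifier is greedy and takes as many characters as it can; adding ? after it makes it take as few as possible. A minimal sketch, reusing the string from the repetition example (the variable names are illustrative):

```python
import re

text = '101000111'

# Greedy: '0+' grabs every consecutive '0' it can.
greedy = re.findall('10+', text)
# Non-greedy: '0+?' stops after the first '0'.
lazy = re.findall('10+?', text)

print(greedy)
print(lazy)
```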
Escape sequences #TODO
\d | any decimal digit |
\D | any character that is not a decimal digit |
\w | any 'word' character |
\W | any 'non-word' character |
\s | any whitespace character |
\S | any character that is not a whitespace character |
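These classes can be tried on the string from the character set example. The following is a sketch; raw strings (r'...') are used so that Python itself does not interpret the backslashes:

```python
import re

text = 'string is not 12324, 234 IS NOT STRING'

digits = re.findall(r'\d+', text)   # runs of decimal digits
words = re.findall(r'\w+', text)    # runs of 'word' characters
spaces = re.findall(r'\s', text)    # single whitespace characters

print(digits)
print(words)
print(len(spaces))
```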
Anchors
^ | start of the string (in multiline mode, also just after each newline) |
$ | end of the string or just before a trailing newline (in multiline mode, also before each newline) |
\b | word boundary |
\B | not a word boundary |
\A | start of subject (independent of multiline mode) |
\Z | end of the string only (independent of multiline mode) |
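A short sketch of how some of these anchors behave, using the text from the findall example plus a two-line string made up for the multiline case:

```python
import re

text = 'this is a text'

# \b on both sides: only the standalone word 'is' matches, not the one in 'this'.
whole_word = [m.start() for m in re.finditer(r'\bis\b', text)]
# \B before 'is': only the 'is' inside 'this' matches.
inside_word = [m.start() for m in re.finditer(r'\Bis', text)]

lines = 'one\ntwo'  # illustrative two-line string
# In multiline mode ^ also matches after each newline, while \A does not.
caret = re.findall(r'^\w+', lines, re.MULTILINE)
start_only = re.findall(r'\A\w+', lines, re.MULTILINE)

print(whole_word)
print(inside_word)
print(caret)
print(start_only)
```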