python regular expression

regular expression

a regular expression is a pattern used to match strings
- but not every string will match a given pattern

basic character in python

import re

ordinary characters
- like r'test', which will match the string test
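
As a minimal sketch of literal matching: a pattern made only of ordinary characters finds exactly those characters, and the match is case-sensitive. The sample sentence is an assumption made for illustration.

```python
import re

# A pattern of ordinary characters matches exactly the characters it contains.
pattern = r'test'
text = 'a test, another test, and a Test'  # hypothetical sample text

# Matching is case-sensitive by default, so 'Test' is skipped.
matches = re.findall(pattern, text)
print(matches)  # ['test', 'test']

# The r prefix keeps backslashes literal: r'\n' is two characters
# (a backslash and an n), while '\n' is a single newline character.
lengths = (len(r'\n'), len('\n'))
print(lengths)  # (2, 1)
```
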
metacharacters
- like . ^ $ * + ? {} [] \ | ()
- like []
  - used to specify a set (or range) of characters to match
  - without []
    - s = r'abc'
    - re.findall(s, "aaaaaaaabc")  # output ['abc']
  - with []
    - rt = "top tip tap twp tep"
    - r1 = r"t[io]p"
    - re.findall(r1, rt)  # output ['top', 'tip']
    - r2 = r"t[^io]p"
    - re.findall(r2, rt)  # output ['tap', 'twp', 'tep']
  - most metacharacters lose their special meaning inside [] (the exceptions are \, ], a leading ^ and a - between two characters)
  - a range such as r'[0-3]' can replace r'[0123]'
    - use r'[0-3a-cA-C]' to replace r'[0123abcABC]'
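- like ^
  - used to match the start of the string (or of each line when the multiline flag is set)
  - s = r'^t' st = 'tss' re.findall(s, st)  # output ['t']
- like $
  - used to match the end of the string (or of each line when the multiline flag is set)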
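- like \
  - used to turn a metacharacter into an ordinary character
  - use r'\^' to match ^ as an ordinary character
  - it also introduces predefined character sets (the equivalents below are for ASCII text; str patterns in Python 3 also match other Unicode digits, whitespace and letters)
    - \d matches [0-9]
    - \D matches [^0-9]
    - \s matches [ \t\n\r\f\v]
      - meaning any whitespace character
    - \S matches [^ \t\n\r\f\v]
      - meaning any non-whitespace character
    - \w matches [a-zA-Z0-9_]
    - \W matches [^a-zA-Z0-9_]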
- like *
  - matches a repeated character
  - means the character in front of the * is repeated zero or more times
  - r = r'ab*' rt = 'abbbbbb' re.findall(r, rt)  # output ['abbbbbb']
- like +
  - matches the preceding character when it appears one or more times
- like ?
  - matches the preceding character when it appears zero or one time
  - placed after another quantifier, it gives a minimal (non-greedy) match
    - r = r'ab+?'
    - rt = 'abbbbbb'
    - re.findall(r, rt) outputs ['ab']
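- like {}
  - specifies how many times the preceding character may repeat
  - r = r'a{1,3}' can match a, aa or aaa; with rt = 'aaaaa', re.findall(r, rt) outputs ['aaa', 'aa'] because matching is greedy and matches do not overlap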
  - {0,} == *
  - {1,} == +
  - {0,1} == ?
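
The metacharacters above can be tried side by side. This is a small sketch that reuses the strings from the notes where possible; the strings used for the escape examples are assumptions made for illustration.

```python
import re

# Character classes: [io] is a set, [^io] is its negation.
words = 'top tip tap twp tep'
in_set = re.findall(r't[io]p', words)        # ['top', 'tip']
not_in_set = re.findall(r't[^io]p', words)   # ['tap', 'twp', 'tep']

# Anchors: ^ matches at the start of the string, $ at the end.
starts = re.findall(r'^t', 'tss')            # ['t']
ends = re.findall(r's$', 'tss')              # ['s']

# Escapes: \^ is a literal caret, \d is a digit (sample strings are made up).
carets = re.findall(r'\^', 'a^b^c')                 # ['^', '^']
digits = re.findall(r'\d+', 'room 101, floor 3')    # ['101', '3']

# Quantifiers are greedy by default; a trailing ? makes them lazy.
greedy = re.findall(r'ab+', 'abbbbbb')       # ['abbbbbb']
lazy = re.findall(r'ab+?', 'abbbbbb')        # ['ab']

# {m,n} repeats between m and n times; findall returns
# non-overlapping matches, taking the longest run first.
bounded = re.findall(r'a{1,3}', 'aaaaa')     # ['aaa', 'aa']
```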

functions

compile the expression once to speed up repeated use
- r = r'\d{3,4}-?\d{8}' p_telephone = re.compile(r) p_telephone.findall('010-12345678')  # output ['010-12345678']
- you can also pass flags when compiling, for example re.compile(r, re.I)
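
A short sketch of compiling with and without a flag. The telephone pattern is the one from the notes; the second number and the `p_word` pattern are assumptions added for illustration.

```python
import re

# Compile once, then reuse the pattern object for every call.
p_telephone = re.compile(r'\d{3,4}-?\d{8}')
numbers = p_telephone.findall('call 010-12345678 or 02187654321')
print(numbers)  # ['010-12345678', '02187654321']

# A flag given at compile time applies to every later call on the object.
p_word = re.compile(r'python', re.IGNORECASE)  # hypothetical pattern
hits = p_word.findall('Python, PYTHON and python')
print(hits)  # ['Python', 'PYTHON', 'python']
```
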
the compiled pattern has several commonly used methods
- match
  - only matches at the beginning of the string
  - p_telephone.match('o010-12345678')  # returns None because 010-12345678 does not start at the beginning
- search
  - scans the whole string
  - returns the first match wherever the number is in the string, or None if there is none
- findall
  - returns all non-overlapping matches as a list
- finditer
  - the same as findall, but it returns an iterator of match objects, so you loop over it to get each value
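
To make the difference between these methods concrete, here is a minimal sketch using the telephone pattern from above. The second number in the `finditer` input is an assumption made for illustration.

```python
import re

p_telephone = re.compile(r'\d{3,4}-?\d{8}')
text = 'o010-12345678'

# match only tries at position 0, and the string starts with 'o'.
at_start = p_telephone.match(text)
print(at_start)  # None

# search scans forward and returns the first match object it finds.
anywhere = p_telephone.search(text)
if anywhere is not None:
    print(anywhere.group(), anywhere.span())  # 010-12345678 (1, 13)

# finditer yields match objects one at a time instead of building a list.
found = [(m.group(), m.start())
         for m in p_telephone.finditer('010-12345678 and 021-87654321')]
print(found)  # [('010-12345678', 0), ('021-87654321', 17)]
```
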
there are also sub(), subn() and split()
- sub()
  - replaces every match with the given string
  - r = r'c..t' re.sub(r, 'aaa', 'caat cast cccc')  # output 'aaa aaa cccc'
- subn()
  - the difference between sub and subn is that subn also returns a count of how many replacements were made
  - r = r'c..t' re.subn(r, 'aaa', 'caat cast cccc')  # output ('aaa aaa cccc', 2)
- split()
  - splits the string using a regular expression
  - re.split(r'[\+\*\-]', '1+2-3*5')  # output ['1', '2', '3', '5']
- use dir(re) to see which functions re provides
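
The three functions can be compared on the same inputs as in the notes. The `count=1` call and the capturing-group split are small extensions added here for illustration, not something the notes cover.

```python
import re

pattern = r'c..t'
text = 'caat cast cccc'

replaced = re.sub(pattern, 'aaa', text)             # 'aaa aaa cccc'
first_only = re.sub(pattern, 'aaa', text, count=1)  # 'aaa cast cccc'

# subn returns a (new_string, number_of_replacements) tuple.
result, n = re.subn(pattern, 'aaa', text)           # 'aaa aaa cccc', 2

# Inside [] the characters + and * need no escaping; - is literal when last.
parts = re.split(r'[+*-]', '1+2-3*5')               # ['1', '2', '3', '5']

# Wrapping the separator in () keeps the separators in the result.
tokens = re.split(r'([+*-])', '1+2-3*5')
print(tokens)  # ['1', '+', '2', '-', '3', '*', '5']
```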

flags in re module

flag	meaning
re.DOTALL, re.S	makes . match every character, including \n
re.IGNORECASE, re.I	makes matching case-insensitive
re.LOCALE, re.L	makes \w, \W, \b, \B and case-insensitive matching depend on the current locale (bytes patterns only in Python 3)
re.MULTILINE, re.M	makes ^ and $ match at the start and end of each line, not only of the whole string
re.VERBOSE, re.X	allows whitespace and comments inside the pattern so it can be laid out more clearly
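
A brief sketch of what each of the common flags changes. The two-line sample text and the comments inside the verbose pattern ("area code", "subscriber number") are assumptions made for illustration.

```python
import re

text = 'First line\nsecond LINE'  # hypothetical two-line sample

# DOTALL: without it . stops at a newline; with it . matches \n too.
no_dotall = re.findall(r'line.second', text)               # []
with_dotall = re.findall(r'line.second', text, re.DOTALL)  # ['line\nsecond']

# IGNORECASE: upper and lower case are treated alike.
lines = re.findall(r'line', text, re.IGNORECASE)           # ['line', 'LINE']

# MULTILINE: ^ matches at the start of every line, not only the string.
single = re.findall(r'^\w+', text)                         # ['First']
multi = re.findall(r'^\w+', text, re.MULTILINE)            # ['First', 'second']

# VERBOSE: whitespace and # comments inside the pattern are ignored.
p_telephone = re.compile(r"""
    \d{3,4}   # area code (assumed meaning)
    -?        # optional hyphen
    \d{8}     # subscriber number (assumed meaning)
""", re.VERBOSE)
phones = p_telephone.findall('010-12345678')               # ['010-12345678']
```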

grouping

()
email = r'\w{3}@\w+(\.com|\.cn)'
use () to group the alternatives .com and .cn
so that one regular expression can match email addresses ending in either suffix
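
One thing worth checking with this pattern is what `findall` returns once a group is present. The sketch below uses the `email` pattern from the notes; the sample addresses and the non-capturing variant `email_nc` are assumptions added for illustration.

```python
import re

email = r'\w{3}@\w+(\.com|\.cn)'
text = 'abc@example.com and xyz@example.cn'  # hypothetical addresses

# With one capturing group, findall returns only the group's text.
suffixes = re.findall(email, text)
print(suffixes)  # ['.com', '.cn']

# A non-capturing group (?:...) groups without capturing,
# so findall returns the whole match again.
email_nc = r'\w{3}@\w+(?:\.com|\.cn)'
addresses = re.findall(email_nc, text)
print(addresses)  # ['abc@example.com', 'xyz@example.cn']

# A match object gives access to both: group(0) is the whole match,
# group(1) is the first capturing group.
m = re.search(email, text)
whole, suffix = m.group(0), m.group(1)
print(whole, suffix)  # abc@example.com .com
```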
