String basics
split() strip() join()
In [3]:
val = 'a,b, guido'
val.split(',')
Out[3]:
['a', 'b', ' guido']
In [4]:
pieces=[x.strip() for x in val.split(',')]
pieces
Out[4]:
['a', 'b', 'guido']
In [5]:
'::'.join(pieces)
Out[5]:
'a::b::guido'
Locating substrings: in, index, find, count
In [6]:
'guido' in val
Out[6]:
True
In [14]:
val.index(',')  # returns the position of the first occurrence; raises ValueError if the substring is not found
Out[14]:
1
In [12]:
val.find(':')  # returns -1 if the substring is not found
Out[12]:
-1
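The difference between index and find only shows up when the substring is missing. A minimal sketch, reusing the val string from above; the variable names pos_find and pos_index are illustrative only:

```python
val = 'a,b, guido'

# find signals "not found" with a sentinel value of -1
pos_find = val.find(':')

# index signals "not found" by raising ValueError, so it needs a try/except
try:
    pos_index = val.index(':')
except ValueError:
    pos_index = None

print(pos_find, pos_index)
```

find is convenient when a missing substring is an expected case; index is the safer choice when a missing substring indicates a bug, because -1 is also a valid index into the string and can be silently misused.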
In [15]:
val.count(',')  # number of occurrences of the substring
Out[15]:
2
Replacing substrings
In [16]:
val.replace(',', '::')
Out[16]:
'a::b:: guido'
In [17]:
val.replace(',', '')
Out[17]:
'ab guido'
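replace also accepts an optional third argument that limits how many occurrences are replaced, and it always returns a new string because Python strings are immutable. A short sketch using the same val; the name first_only is illustrative:

```python
val = 'a,b, guido'

# Replace only the first comma; later commas are left alone
first_only = val.replace(',', '::', 1)

print(first_only)
print(val)  # the original string is unchanged
```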
Regular expressions
The re module
In [18]:
import re
text = "foo bar\t baz \tqux"
re.split(r'\s+', text)  # \s matches any whitespace character; + means one or more
Out[18]:
['foo', 'bar', 'baz', 'qux']
In [21]:
regex = re.compile(r'\s+')  # compile the pattern into a reusable regex object
regex.findall(text)  # returns every substring that matches the pattern
Out[21]:
[' ', '\t ', ' \t']
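If you intend to apply the same regular expression to many strings, compile it once with re.compile and reuse the resulting object; this saves CPU cycles.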
In [22]:
regex.split(text)
Out[22]:
['foo', 'bar', 'baz', 'qux']
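As a sketch of the reuse pattern described above, the following compiles the whitespace regex once and applies it to several strings. The list lines and the names whitespace and tokens are made up for illustration:

```python
import re

# Compile once, outside the loop
whitespace = re.compile(r'\s+')

lines = ["foo bar\t baz \tqux", "one  two", "alpha\tbeta\n gamma"]

# Reuse the same compiled object for every string
tokens = [whitespace.split(line) for line in lines]

for row in tokens:
    print(row)
```

Note that the module-level functions such as re.split keep an internal cache of recently compiled patterns, so the saving from compiling explicitly is usually modest; the clearer benefit is having one named, reusable pattern object.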
In [23]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'  # regular expression for email addresses
regex = re.compile(pattern, flags=re.IGNORECASE)  # re.IGNORECASE makes the match case-insensitive
In [24]:
regex.findall(text)
Out[24]:
['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']
In [25]:
regex.search(text)  # returns the first match in the text, including its start and end positions
Out[25]:
<_sre.SRE_Match object; span=(5, 20), match='dave@google.com'>
In [28]:
m=regex.search(text)
text[m.start():m.end()]
Out[28]:
'dave@google.com'
In [31]:
print(regex.match(text))  # match only succeeds if the pattern matches at the start of the string
None
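The three anchoring behaviours can be compared side by side. A minimal sketch using the same email pattern; the sample string s and the names at_start, anywhere and whole are illustrative:

```python
import re

email = re.compile(r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}', flags=re.IGNORECASE)
s = 'Dave dave@google.com'

# match: the pattern must match starting at position 0
at_start = email.match(s)

# search: the pattern may match anywhere in the string
anywhere = email.search(s)

# fullmatch: the pattern must cover the entire string
whole = email.fullmatch('dave@google.com')

print(at_start)
print(anywhere.group(), anywhere.span())
print(whole is not None)
```

match returns None here because the string begins with 'Dave ', which is not followed by '@'; search skips ahead and finds the address.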
In [36]:
print(regex.sub('REDACTED', text))  # replaces every match with the given string and returns a new string
Dave REDACTED
Steve REDACTED
Rob REDACTED
Ryan REDACTED
In [ ]:
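# In sub, \1, \2, ... refer to the groups of each match; usage is shown in the cells that follow
-- To split each address into three parts, wrap each part of the pattern in parentheses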
In [34]:
pattern = r'([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})'
regex = re.compile(pattern, flags=re.IGNORECASE)
regex.findall(text)  # returns a list of tuples
Out[34]:
[('dave', 'google', 'com'),
('steve', 'gmail', 'com'),
('rob', 'gmail', 'com'),
('ryan', 'yahoo', 'com')]
In [39]:
m = regex.match('wesm@bright.net')
m.groups()  # returns a tuple of the matched groups
Out[39]:
('wesm', 'bright', 'net')
In [ ]:
# To get a dict instead, give the groups names and call groupdict()
In [37]:
print(regex.sub(r'Username: \1, Domain: \2, Suffix: \3', text))
Dave Username: dave, Domain: google, Suffix: com
Steve Username: steve, Domain: gmail, Suffix: com
Rob Username: rob, Domain: gmail, Suffix: com
Ryan Username: ryan, Domain: yahoo, Suffix: com
In [40]:
regex = re.compile(r"""
(?P<username>[A-Z0-9._%+-]+)
@
(?P<domain>[A-Z0-9.-]+)
\.
(?P<suffix>[A-Z]{2,4})""", flags=re.IGNORECASE|re.VERBOSE)
In [41]:
m = regex.match('wesm@bright.net')
m.groupdict()
Out[41]:
{'domain': 'bright', 'suffix': 'net', 'username': 'wesm'}
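Named groups combine naturally with finditer to turn every address in a block of text into a dict. A sketch under the assumption that the input looks like the text sample used earlier (shortened here); the name records is illustrative:

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
"""

# re.VERBOSE lets the pattern span several lines; whitespace outside
# character classes is ignored
regex = re.compile(r"""
    (?P<username>[A-Z0-9._%+-]+)
    @
    (?P<domain>[A-Z0-9.-]+)
    \.
    (?P<suffix>[A-Z]{2,4})""", flags=re.IGNORECASE | re.VERBOSE)

# finditer yields one match object per address; groupdict maps names to text
records = [m.groupdict() for m in regex.finditer(text)]

for rec in records:
    print(rec)
```

A list of dicts like this can be passed directly to code that expects records, for example a DataFrame constructor.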