Chapter 5: Processing Data


minseo 2021-09-10 15:06:14


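Copyright belongs to the author: this is an original work by minseo, published on the 51CTO blog. Contact the author for permission to repost; unauthorized reposting may result in legal action.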

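The coach is a good friend of yours. He has recorded the running times of four athletes, nine times each, in four files:

james.txt, julie.txt, mikey.txt, sarah.txt

Their contents, in that order, are: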

2-34,3:21,2.34,2.45,3.01,2:01,2:01,3:10,2-22

2.59,2.11,2:11,2:23,3-10,2-23,3:10,3.21,3-21

2:22,3.01,3:01,3.02,3:02,3.02,3:22,2.49,2:38

2:58,2.58,2:39,2-25,2-55,2:54,2.18,2:55,2:55

Write a script that reads each of the four files into a list and prints the lists to the screen.

vim chapter5-1.py

#!/usr/bin/python
# -*- coding:utf-8 -*-
with open('james.txt') as jaf:          # open the file
        data = jaf.readline()           # read the first line
        james = data.strip().split(',') # strip whitespace, split on commas, assign the list to james
with open('julie.txt') as juf:
        data = juf.readline()
        julie = data.strip().split(',')
with open('mikey.txt') as mif:
        data = mif.readline()
        mikey = data.strip().split(',')
with open('sarah.txt') as saf:
        data = saf.readline()
        sarah = data.strip().split(',')
print (james)
print (julie)
print (mikey)
print (sarah)

Run it:

[root@VPN chapter5]# python chapter5-1.py

['2-34', '3:21', '2.34', '2.45', '3.01', '2:01', '2:01', '3:10', '2-22']

['2.59', '2.11', '2:11', '2:23', '3-10', '2-23', '3:10', '3.21', '3-21']

['2:22', '3.01', '3:01', '3.02', '3:02', '3.02', '3:22', '2.49', '2:38']

['2:58', '2.58', '2:39', '2-25', '2-55', '2:54', '2.18', '2:55', '2:55']

Next, we want to sort the times in ascending order.

Python offers two ways to sort a list.

1. In-place sorting

>>> data = [6,3,1,2,4,5]

>>> data.sort()

>>> data

[1, 2, 3, 4, 5, 6]

2. Copied sorting

>>> data

[1, 2, 3, 4, 5, 6]

>>> data = [6,3,1,2,4,5]

>>> data

[6, 3, 1, 2, 4, 5]

>>> data2 = sorted(data)

>>> data

[6, 3, 1, 2, 4, 5]

>>> data2

[1, 2, 3, 4, 5, 6]
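The difference between the two approaches can be made explicit in a short script. This is only an illustrative sketch; the variable names are made up for the example. The point to notice is that sort() changes the list itself and returns None, while sorted() returns a new list and leaves the original untouched.

```python
# sort() reorders the list in place and returns None.
times = [6, 3, 1, 2, 4, 5]
result = times.sort()
print(times)     # [1, 2, 3, 4, 5, 6]
print(result)    # None

# sorted() leaves the original alone and returns a new, ordered list.
original = [6, 3, 1, 2, 4, 5]
ordered = sorted(original)
print(original)  # [6, 3, 1, 2, 4, 5]
print(ordered)   # [1, 2, 3, 4, 5, 6]
```

A common slip is writing data = data.sort(), which replaces the list with None.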

Modify the source code above so that it prints the sorted lists:

vim chapter5-2.py

#!/usr/bin/python
# -*- coding:utf-8 -*-
with open('james.txt') as jaf:          # open the file
        data = jaf.readline()           # read the first line
        james = data.strip().split(',') # strip whitespace, split on commas, assign the list to james
with open('julie.txt') as juf:
        data = juf.readline()
        julie = data.strip().split(',')
with open('mikey.txt') as mif:
        data = mif.readline()
        mikey = data.strip().split(',')
with open('sarah.txt') as saf:
        data = saf.readline()
        sarah = data.strip().split(',')
'''
print (james)
print (julie)
print (mikey)
print (sarah)
'''
print (sorted(james))
print (sorted(julie))
print (sorted(mikey))
print (sorted(sarah))

Run it:

[root@VPN chapter5]# python chapter5-2.py

['2-22', '2-34', '2.34', '2.45', '2:01', '2:01', '3.01', '3:10', '3:21']

['2-23', '2.11', '2.59', '2:11', '2:23', '3-10', '3-21', '3.21', '3:10']

['2.49', '2:22', '2:38', '3.01', '3.02', '3.02', '3:01', '3:02', '3:22']

['2-25', '2-55', '2.18', '2.58', '2:39', '2:54', '2:55', '2:55', '2:58']

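Look at the fourth line of output: 2-25 is sorted ahead of 2.18, which is not what we want.

The cause is the inconsistent data format: the three separators need to be made uniform. After splitting, every time is stored as a string, and Python sorts strings character by character. A hyphen sorts before a period, and a period sorts before a colon, so the inconsistent separators in the coach's data cause the sort to fail.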
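The ordering of the three separators can be checked directly. The sketch below prints each separator's code point with ord() and then sorts three made-up sample strings that share the same minute digit, so the separator alone decides the order.

```python
# Code points of the three separators found in the coach's data.
for ch in ['-', '.', ':']:
    print('%s %d' % (ch, ord(ch)))   # - 45, then . 46, then : 58

# Same minute digit, different separators: the separator decides the order.
mixed = ['2:01', '2.18', '2-25']
print(sorted(mixed))                 # ['2-25', '2.18', '2:01']
```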

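Next, create a function named sanitize(). It takes one time string from a runner's list as input and converts a hyphen or colon separator to a period; a string that already uses a period is returned unchanged.

Define the function, then apply it to every string so that each hyphen (-) and colon (:) becomes a period (.).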

vim chapter5-3.py

#!/usr/bin/python
# -*- coding:utf-8 -*-
def sanitize(time_string):              # define the function
        if '-' in time_string:
                splitter = '-'
        elif ':' in time_string:
                splitter = ':'          # check whether the string contains - or :
        else:
                return(time_string)
        (mins,secs) = time_string.split(splitter)       # split into minutes and seconds
        return(mins + '.' + secs)
with open('james.txt') as jaf:          # open the file
        data = jaf.readline()           # read the first line
        james = data.strip().split(',') # strip whitespace, split on commas, assign the list to james
with open('julie.txt') as juf:
        data = juf.readline()
        julie = data.strip().split(',')
with open('mikey.txt') as mif:
        data = mif.readline()
        mikey = data.strip().split(',')
with open('sarah.txt') as saf:
        data = saf.readline()
        sarah = data.strip().split(',')
clean_james = []
clean_julie = []
clean_mikey = []
clean_sarah = []        # new lists to hold the sanitized times

for each_t in james:
        clean_james.append(sanitize(each_t))
for each_t in julie:
        clean_julie.append(sanitize(each_t))
for each_t in mikey:
        clean_mikey.append(sanitize(each_t))
for each_t in sarah:
        clean_sarah.append(sanitize(each_t))


print (sorted(clean_james))
print (sorted(clean_julie))
print (sorted(clean_mikey))
print (sorted(clean_sarah))

Run it:

[root@VPN chapter5]# python chapter5-3.py

['2.01', '2.01', '2.22', '2.34', '2.34', '2.45', '3.01', '3.10', '3.21']

['2.11', '2.11', '2.23', '2.23', '2.59', '3.10', '3.10', '3.21', '3.21']

['2.22', '2.38', '2.49', '3.01', '3.01', '3.02', '3.02', '3.02', '3.22']

['2.18', '2.25', '2.39', '2.54', '2.55', '2.55', '2.55', '2.58', '2.58']

The output is now sorted correctly.
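One caveat: the cleaned times are still strings, and the sort gives the right order here because every value has a one-digit minute part and a two-digit second part. The sketch below assumes a hypothetical helper named to_seconds (it is not part of the original scripts) and a made-up sample value of '10.05' to show what could happen with a longer time, and how a numeric sort key would avoid it.

```python
def to_seconds(time_string):
    """Hypothetical helper: convert a sanitized 'mins.secs' string to seconds."""
    mins, secs = time_string.split('.')
    return int(mins) * 60 + int(secs)

# '10.05' is made-up sample data; the coach's files contain no such value.
times = ['10.05', '2.34', '2.01']
print(sorted(times))                  # ['10.05', '2.01', '2.34'] (string order)
print(sorted(times, key=to_seconds))  # ['2.01', '2.34', '10.05'] (numeric order)
```

For the coach's data as given, plain string sorting is enough.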

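However, this approach needs too many lists and loops, and the code repeats itself. Python provides a tool for transforming one list into another: the list comprehension.

Here are some examples: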

>>> mins = [1,2,3]

>>> secs = [m*60 for m in mins]

>>> secs

[60, 120, 180]

This converts minutes to seconds.

>>> lower = ["I","am","liuyueming"]

>>> upper = [s.upper() for s in lower]

>>> upper

['I', 'AM', 'LIUYUEMING']

This converts every string to uppercase.
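To see what a comprehension replaces, it may help to put the loop-and-append form next to it. A minimal sketch using the same minutes-to-seconds example; the names secs_loop and secs_comp are made up for the comparison.

```python
mins = [1, 2, 3]

# Loop-and-append version: create an empty list, then fill it item by item.
secs_loop = []
for m in mins:
    secs_loop.append(m * 60)

# Equivalent list comprehension: the same transformation in one expression.
secs_comp = [m * 60 for m in mins]

print(secs_loop == secs_comp)   # True
```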

Rewrite the code to use list comprehensions:

vim chapter5-4.py

#!/usr/bin/python
# -*- coding:utf-8 -*-
def sanitize(time_string):              # define the function
        if '-' in time_string:
                splitter = '-'
        elif ':' in time_string:
                splitter = ':'          # check whether the string contains - or :
        else:
                return(time_string)
        (mins,secs) = time_string.split(splitter)       # split into minutes and seconds
        return(mins + '.' + secs)
with open('james.txt') as jaf:          # open the file
        data = jaf.readline()           # read the first line
        james = data.strip().split(',') # strip whitespace, split on commas, assign the list to james
with open('julie.txt') as juf:
        data = juf.readline()
        julie = data.strip().split(',')
with open('mikey.txt') as mif:
        data = mif.readline()
        mikey = data.strip().split(',')
with open('sarah.txt') as saf:
        data = saf.readline()
        sarah = data.strip().split(',')

clean_james = [sanitize(t) for t in james]
clean_julie = [sanitize(t) for t in julie]
clean_mikey = [sanitize(t) for t in mikey]
clean_sarah = [sanitize(t) for t in sarah]      # build the new lists with list comprehensions

print (sorted(clean_james))
print (sorted(clean_julie))
print (sorted(clean_mikey))
print (sorted(clean_sarah))

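The output is the same, but the code is considerably shorter.

However, what the coach actually wants is each runner's three fastest times, with duplicates removed.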

vim chapter5-5.py

#!/usr/bin/python
# -*- coding:utf-8 -*-
def sanitize(time_string):    # define the function
  if '-' in time_string:
    splitter = '-'
  elif ':' in time_string:
    splitter = ':'    # check whether the string contains - or :
  else:
    return(time_string)
  (mins,secs) = time_string.split(splitter) # split into minutes and seconds
  return(mins + '.' + secs)
with open('james.txt') as jaf:    # open the file
  data = jaf.readline()   # read the first line
  james = data.strip().split(',') # strip whitespace, split on commas, assign the list to james
with open('julie.txt') as juf:
  data = juf.readline()
  julie = data.strip().split(',')
with open('mikey.txt') as mif:
  data = mif.readline()
  mikey = data.strip().split(',')
with open('sarah.txt') as saf:
  data = saf.readline()
  sarah = data.strip().split(',')

clean_james = sorted([sanitize(t) for t in james])
clean_julie = sorted([sanitize(t) for t in julie])
clean_mikey = sorted([sanitize(t) for t in mikey])
clean_sarah = sorted([sanitize(t) for t in sarah])  # build the sorted lists with list comprehensions

unique_james = []       # new list to hold the times with duplicates removed
for each_t in clean_james:
  if each_t not in unique_james:
    unique_james.append(each_t) # append only if not already in the list
print (unique_james[0:3])     # print the top three

unique_julie = []
for each_t in clean_julie:
  if each_t not in unique_julie:
    unique_julie.append(each_t)
print (unique_julie[0:3])
  
unique_mikey = []
for each_t in clean_mikey:
  if each_t not in unique_mikey:
    unique_mikey.append(each_t)
print (unique_mikey[0:3])

unique_sarah = []
for each_t in clean_sarah:
  if each_t not in unique_sarah:
    unique_sarah.append(each_t)
print (unique_sarah[0:3])

[root@VPN chapter5]# python chapter5-5.py

['2.01', '2.22', '2.34']

['2.11', '2.23', '2.59']

['2.22', '2.38', '2.49']

['2.18', '2.25', '2.39']

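The duplicates have been removed from the sorted data and the top three times extracted.

This logic builds a new list to hold the de-duplicated data. Is there a way to remove duplicates directly?

Python provides a built-in data type, the set, which removes duplicates for us.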
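A minimal sketch of the idea, using a made-up list of already sanitized times. A set keeps only one copy of each value but does not promise any particular order, which is why the result is passed through sorted() before slicing.

```python
times = ['2.01', '2.01', '2.22', '2.34', '2.34', '2.45']

unique = set(times)          # duplicates are dropped; order is not guaranteed
print(len(unique))           # 4

# sorted() accepts any iterable, including a set, and returns an ordered list.
print(sorted(unique)[0:3])   # ['2.01', '2.22', '2.34']
```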

vim chapter5-6.py

#!/usr/bin/python
# -*- coding:utf-8 -*-
def sanitize(time_string):    # define the function
  if '-' in time_string:
    splitter = '-'
  elif ':' in time_string:
    splitter = ':'    # check whether the string contains - or :
  else:
    return(time_string)
  (mins,secs) = time_string.split(splitter) # split into minutes and seconds
  return(mins + '.' + secs)
with open('james.txt') as jaf:    # open the file
  data = jaf.readline()   # read the first line
  james = data.strip().split(',') # strip whitespace, split on commas, assign the list to james
with open('julie.txt') as juf:
  data = juf.readline()
  julie = data.strip().split(',')
with open('mikey.txt') as mif:
  data = mif.readline()
  mikey = data.strip().split(',')
with open('sarah.txt') as saf:
  data = saf.readline()
  sarah = data.strip().split(',')



print (sorted(set([sanitize(t) for t in james]))[0:3])    # print the top three
print (sorted(set([sanitize(t) for t in julie]))[0:3])    # print the top three
print (sorted(set([sanitize(t) for t in mikey]))[0:3])    # print the top three
print (sorted(set([sanitize(t) for t in sarah]))[0:3])    # print the top three

[root@VPN chapter5]# python chapter5-6.py

['2.01', '2.22', '2.34']

['2.11', '2.23', '2.59']

['2.22', '2.38', '2.49']

['2.18', '2.25', '2.39']

The output is the same, and the code is shorter again.

Finally, replace the repeated with blocks with a function:

vim chapter5-7.py

#!/usr/bin/python
# -*- coding:utf-8 -*-
def sanitize(time_string):              # define the function
        if '-' in time_string:
                splitter = '-'
        elif ':' in time_string:
                splitter = ':'          # check whether the string contains - or :
        else:
                return(time_string)
        (mins,secs) = time_string.split(splitter)       # split into minutes and seconds
        return(mins + '.' + secs)
def get_coach_data(filename):
        try:
                with open(filename) as f:
                        data = f.readline()
                return(data.strip().split(','))
        except IOError as ioerr:
                print('File error:' + str(ioerr))
                return(None)

james = get_coach_data('james.txt')
julie = get_coach_data('julie.txt')
mikey = get_coach_data('mikey.txt')
sarah = get_coach_data('sarah.txt')

print (sorted(set([sanitize(t) for t in james]))[0:3])          # print the top three
print (sorted(set([sanitize(t) for t in julie]))[0:3])          # print the top three
print (sorted(set([sanitize(t) for t in mikey]))[0:3])          # print the top three
print (sorted(set([sanitize(t) for t in sarah]))[0:3])          # print the top three
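One thing to watch in this last version: get_coach_data() returns None when a file cannot be opened, and iterating over None in the comprehension would then raise a TypeError. The sketch below is one possible way to guard against that. The helper top3 is a hypothetical name, not part of the original script, and the demo writes the james.txt data shown earlier into a temporary directory so that it can run anywhere.

```python
import os
import tempfile

def sanitize(time_string):
    """Same behaviour as sanitize() above: use '.' as the only separator."""
    for splitter in ('-', ':'):
        if splitter in time_string:
            mins, secs = time_string.split(splitter)
            return mins + '.' + secs
    return time_string

def get_coach_data(filename):
    """Same behaviour as get_coach_data() above: a list, or None on error."""
    try:
        with open(filename) as f:
            data = f.readline()
        return data.strip().split(',')
    except IOError as ioerr:
        print('File error: ' + str(ioerr))
        return None

def top3(filename):
    """Hypothetical helper: three fastest unique times, or [] if unreadable."""
    times = get_coach_data(filename)
    if times is None:            # the file could not be read
        return []
    return sorted(set(sanitize(t) for t in times))[0:3]

# Demo with a temporary copy of the james.txt data shown earlier.
demo_dir = tempfile.mkdtemp()
demo_file = os.path.join(demo_dir, 'james.txt')
with open(demo_file, 'w') as f:
    f.write('2-34,3:21,2.34,2.45,3.01,2:01,2:01,3:10,2-22\n')

print(top3(demo_file))                               # ['2.01', '2.22', '2.34']
print(top3(os.path.join(demo_dir, 'missing.txt')))   # file error message, then []
```

Returning an empty list is just one choice; letting the exception propagate would be equally reasonable if a missing file should stop the script.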