将json串解析为DataFrame的方式主要有三种:
- 利用pandas自带的read_json直接解析字符串
- 利用json的loads和pandas的json_normalize进行解析
- 利用json的loads和pandas的DataFrame直接构造(这个过程需要手动修改loads得到的字典格式)
由于read_json直接对字符串进行的解析,其效率是最高的,但是其对JSON串的要求也是最高的,需要满足其规定的格式才能够读取。其支持的格式可以在pandas的官网点击打开链接可以看到。然而json_normalize是解析json串构造的字典的,其灵活性比read_json要高很多。但是令人意外的是,其效率还不如我自己解析来得快(自己解析时使用列表解析的功能比普通的for循环快很多)。当然最灵活的还是自己解析,可以在构造DataFrame之前进行一些简单的数据处理。
# -*- coding: UTF-8 -*-
from pandas.io.json import json_normalize
import pandas as pd
import json
import time
# 读入数据
data_str = open('data.json').read()
print data_str
# 测试json_normalize
start_time = time.time()
for i in range(0, 300):
data_list = json.loads(data_str)
df = json_normalize(data_list)
end_time = time.time()
print end_time - start_time
# 测试自己构造
start_time = time.time()
for i in range(0, 300):
data_list = json.loads(data_str)
data = [[d['timestamp'], d['value']] for d in data_list]
df = pd.DataFrame(data, columns=['timestamp', 'value'])
end_time = time.time()
print end_time - start_time
# 测试read_json
start_time = time.time()
for i in range(0, 300):
df = pd.read_json(data_str, orient='records')
end_time = time.time()
print end_time - start_time
pandas里的read_json函数可以将json数据转化为dataframe。 pandas.read_json的语法如下:
pandas.read_json(path_or_buf=None, orient=None, typ='frame', dtype=True, convert_axes=True, convert_dates=True, keep_default_dates=True, numpy=False, precise_float=False, date_unit=None, encoding=None, lines=False, chunksize=None, compression='infer')
第一参数就是json文件路径或者json格式的字符串。
第二参数orient是表明预期的json字符串格式。orient的设置有以下几个值:
(1).'split' : dict like {index -> [index], columns -> [columns], data -> [values]}
这种就是有索引,有列字段,和数据矩阵构成的json格式。key名称只能是index,columns和data。
import pandas as pd
s='{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
print(pd.read_json(s,orient='split'))
运行结果:
a b
1 1 3
2 2 8
3 3 9
(2). 'records' : list like [{column -> value}, ... , {column -> value}]
这种就是成员为字典的列表。构成是列字段为键,值为键值,每一个字典成员就构成了dataframe的一行数据。
import pandas as pd
s='[{"name":"xiaomaimiao","age":20},{"name":"xxt","age":18},{"name":"xmm","age":1}]'
print(pd.read_json(s,orient='records'))
运行结果:
age name
0 20 xiaomaimiao
1 18 xxt
2 1 xmm
再例如:
# coding=utf-8
import pandas as pd
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
s=open('a.json', encoding='UTF-8').read()
df=pd.read_json(s,orient='records')
print(df.head(5))
# df.to_excel('pandas处理json1.xlsx', index=False, columns=["Company", "Job", "Location", "Name", "MajorTag","University"])
df.to_excel('pandas处理json1.xlsx', index=False)
运行结果:
数据: a.json.zip
或:
(3). 'index' : dict like {index -> {column -> value}}
以索引为key,以列字段构成的字典为键值。如:
import pandas as pd
s='{"0":{"a":1,"b":2},"1":{"a":9,"b":11}}'
print(pd.read_json(s,orient='index'))
运行结果:
a b
0 1 2
1 9 11
(4). 'columns' : dict like {column -> {index -> value}}
这种处理的就是以列为键,对应一个值字典的对象。这个字典对象以索引为键,以值为键值构成的json字符串。如下图所示:
import pandas as pd
s='{"a":{"0":1,"1":9},"b":{"0":2,"1":11}}'
print(pd.read_json(s,orient='columns'))
运行结果:
a b
0 1 2
1 9 11
(5). 'values' : just the values array
values这种我们就很常见了。就是一个嵌套的列表。里面的成员也是列表,2层的。
import pandas as pd
s='[["a",1],["b",2]]'
print(pd.read_json(s,orient='values'))
运行结果:
0 1
0 a 1
1 b 2
要处理的json字符串:
strtext='[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'
代码:
strtext = '[{"ttery":"min","issue":"20130801-3391","code":"8,4,5,2,9","code1":"297734529","code2":null,"time":1013395466000},\
{"ttery":"min","issue":"20130801-3390","code":"7,8,2,1,2","code1":"298058212","code2":null,"time":1013395406000},\
{"ttery":"min","issue":"20130801-3389","code":"5,9,1,2,9","code1":"298329129","code2":null,"time":1013395346000},\
{"ttery":"min","issue":"20130801-3388","code":"3,8,7,3,3","code1":"298588733","code2":null,"time":1013395286000},\
{"ttery":"min","issue":"20130801-3387","code":"0,8,5,2,7","code1":"298818527","code2":null,"time":1013395226000}]'
df = pd.read_json(strtext, orient='records')
df.to_excel('pandas处理json.xlsx', index=False, columns=["ttery", "issue", "code", "code1", "code2", "time"])
运行结果: 最终写入excel如下图:
再例如:
# coding=utf-8
import pandas as pd
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
s='[{"Location":{"L1":"北京","L2":"北京市","L3":""},"University":null,"StudentFlag":0,"_id":"5cd682d61c0acc002c7a9282","Teams":[{"TeamOid":"5cd685cbe32417002baf7fb8"}],"Name":"soyotec","Avatar":"//cdn.kesci.com/images/avatar/4.jpg","Job":"总经理","Company":"北京树优信息技术有限公司","MajorTag":null},{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5cf094ffa0e904002bae85a1","Teams":[{"TeamOid":"5cf08ec12e1a1d002b3912ed"}],"Name":"ne12212","Avatar":"//cdn.kesci.com/images/avatar/1.jpg","Job":"产品经理","Company":"中国电信股份有限公司上海分公司","MajorTag":null},{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5ce37539e60640002b771cfa","Teams":[{"TeamOid":"5ce3af2f0e87f8002caad07e"}],"Name":"图森未来","Avatar":"https://cdn.kesci.com/upload/image/pruh63s30o.jpg","Job":"AIWIN大赛","Company":"上海图森未来人工智能科技有限公司","MajorTag":null,"Signature":"L4级无人驾驶卡车"}]'
print(pd.read_json(s,orient='records'))
结果:
Avatar Company Job Location MajorTag Name Signature StudentFlag Teams University _id
0 //cdn.kesci.com/images/avatar/4.jpg 北京树优信息技术有限公司 总经理 {'L1': '北京', 'L2': '北京市', 'L3': ''} NaN soyotec NaN 0 [{'TeamOid': '5cd685cbe32417002baf7fb8'}] NaN 5cd682d61c0acc002c7a9282
1 //cdn.kesci.com/images/avatar/1.jpg 中国电信股份有限公司上海分公司 产品经理 {'L1': '上海', 'L2': '上海市', 'L3': ''} NaN ne12212 NaN 0 [{'TeamOid': '5cf08ec12e1a1d002b3912ed'}] NaN 5cf094ffa0e904002bae85a1
2 https://cdn.kesci.com/upload/image/pruh63s30o.jpg 上海图森未来人工智能科技有限公司 AIWIN大赛 {'L1': '上海', 'L2': '上海市', 'L3': ''} NaN 图森未来 L4级无人驾驶卡车 0 [{'TeamOid': '5ce3af2f0e87f8002caad07e'}] NaN 5ce37539e60640002b771cfa
示例:
from pandas.io.json import json_normalize
data = [{'id': 1, 'name': {'first': 'Coleen', 'last': 'Volk'}},
{'name': {'given': 'Mose', 'family': 'Regner'}},
{'id': 2, 'name': 'Faye Raker'}]
print(json_normalize(data))
运行结果:
id name name.family name.first name.given name.last
0 1.0 NaN NaN Coleen NaN Volk
1 NaN NaN Regner NaN Mose NaN
2 2.0 Faye Raker NaN NaN NaN NaN
示例:
from pandas.io.json import json_normalize
data = [{'state': 'Florida',
'shortname': 'FL',
'info': {
'governor': 'Rick Scott'
},
'counties': [{'name': 'Dade', 'population': 12345},
{'name': 'Broward', 'population': 40000},
{'name': 'Palm Beach', 'population': 60000}]},
{'state': 'Ohio',
'shortname': 'OH',
'info': {
'governor': 'John Kasich'
},
'counties': [{'name': 'Summit', 'population': 1234},
{'name': 'Cuyahoga', 'population': 1337}]}]
result = json_normalize(data, 'counties', ['state', 'shortname',['info', 'governor']])
print(result)
运行结果:
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
示例:
from pandas.io.json import json_normalize
data = {'A': [1, 2]}
result=json_normalize(data, 'A', record_prefix='Prefix.')
print(result)
运行结果:
Prefix.0
0 1
1 2
示例:
#coding=utf-8
from pandas.io.json import json_normalize
import json
# 读入数据
data_str = open('a.json',encoding='utf-8').read()
data_list = json.loads(data_str)
df = json_normalize(data_list)
df.to_excel('1.xlsx', index=False )
print(df)
运行结果:
a.json.zip
或:
示例:
# coding=utf-8
import json
import pandas as pd
pd.set_option('display.max_rows',500)
pd.set_option('display.max_columns',500)
pd.set_option('display.width',1000)
data_str=open('a.json', encoding='UTF-8').read()
data_list = json.loads(data_str)
data = [[d["Location"], d["Company"]] for d in data_list]
df = pd.DataFrame(data, columns=["Company", "Location"])
print(df.head(5))
# df.to_excel('pandas处理json1.xlsx', index=False, columns=["Company", "Job", "Location", "Name", "MajorTag","University"])
# df.to_excel('pandas处理json1.xlsx', index=False)
结果:
Company Location
0 {'L1': '北京', 'L2': '北京市', 'L3': ''} 北京树优信息技术有限公司
1 {'L1': '上海', 'L2': '上海市', 'L3': ''} 中国电信股份有限公司上海分公司
2 {'L1': '上海', 'L2': '上海市', 'L3': ''} 上海图森未来人工智能科技有限公司
3 {'L1': '上海', 'L2': '上海市', 'L3': ''} None
4 {'L1': '上海', 'L2': '上海市', 'L3': ''} AI
示例:
import json
import pandas as pd
data_str = '''[{"department": "xxt",
"query_result": {"code": "10", "description": "1000"},
"is_invoice": 1,
"imageName": "./imgs/8888888.jpeg",
"reco_result": {"total": "", "invoice_no": "01111111", "create_date": "", "check_code": "", "invoice_code": ""}},
{"department": "xxt2",
"query_result": {},
"is_invoice": 0,
"imageName": "./imgs/51111111.jpeg",
"reco_result": {}}]'''
# 从文件读取
# data_str = open('a.json',encoding='utf-8').read()
# data_list = json.loads(data_str)
# 从字符串直接获取
data_list = json.loads(data_str)
data1_all = [[d["department"], d["is_invoice"], d["imageName"]] for d in data_list]
data2_all = [d["query_result"] for d in data_list]
data5_all = [d["reco_result"] for d in data_list]
# for i in data1_all:
# print("data1_0:",i[0])
# print("data5",data5_all)
# print("data1",data1_all)
def get_data(data1_all, data2_all, data5_all):
col_value = []
for data1, data2, data5 in zip(data1_all, data2_all, data5_all):
department = data1[0]
is_invoice = data1[1]
imageName = data1[2]
if 'code' in data2:
code = str(data2).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
description = str(data2).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
else:
code = ""
description = ""
if 'total' in data5:
total = str(data5).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
invoice_no = str(data5).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
create_date = str(data5).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
check_code = str(data5).split(",")[3].split(":")[1].replace("'", "").replace("}", "")
invoice_code = str(data5).split(",")[4].split(":")[1].replace("'", "").replace("}", "")
else:
total = ""
invoice_no = ""
create_date = ""
check_code = ""
invoice_code = ""
col_value.append((department, is_invoice, imageName, code, description, total, invoice_no, create_date,
check_code, invoice_code))
return col_value
col_value = get_data(data1_all, data2_all, data5_all)
df = pd.DataFrame(col_value, index=None,
columns=["department", "is_invoice", "imageName", "code", "description", "total", "invoice_no",
"create_date", "check_code", "invoice_code"])
df.to_excel('excel_pd.xls')
print(df)
结果:
department is_invoice ... check_code invoice_code
0 xxt 1 ...
1 xxt2 0 ...
[2 rows x 10 columns]
-----自己处理数据
import json
import pandas as pd
# 从字符串直接获取
# data_str='''{"Location":{"L1":"上海","L2":"上海市","L3":""},"University":null,"StudentFlag":0,"_id":"5ce37539e60640002b771cfa","Teams":[{"TeamOid":"5ce3af2f0e87f8002caad07e"}],"Name":"图森未来","Avatar":"https://cdn.kesci.com/upload/image/pruh63s30o.jpg","Job":"AIWIN大赛","Company":"上海图森未来人工智能科技有限公司","MajorTag":null,"Avatar":"L4级无人驾驶卡车"}'''
# data_list = json.loads(data_str)
# 从文件读取
data_str = open('a.json',encoding='utf-8').read()
data_list = json.loads(data_str)
data_s_all = [[ d["StudentFlag"], d["Name"],d["Job"],d["Company"]] for d in data_list]
data_Location_all = [d["Location"] for d in data_list]
data_University_all = [d["University"] for d in data_list]
data_MajorTag_all = [d["MajorTag"] for d in data_list]
# for i in data1_all:
# print("data1_0:",i[0])
# print("data5",data5_all)
# print("data1",data1_all)
def get_data(data_s_all, data_Location_all, data_University_all,data_MajorTag_all):
col_value = []
for data_s, data_Location, data_University,data_MajorTag in zip(data_s_all, data_Location_all, data_University_all,data_MajorTag_all):
StudentFlag = data_s[0]
Name = data_s[1]
Job = data_s[2]
Company = data_s[3]
Location_result=''
University_result=''
Major=''
# Location
try:
if 'L1' in data_Location:
L1 = str(data_Location).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
L2 = str(data_Location).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
L3 = str(data_Location).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
Location_result=L1+L2+L3
else:
L1 = ""
L2 = ""
L3 = ""
except Exception as e:
None
# Country
try:
if 'Country' in data_University:
Country = str(data_University).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
Province = str(data_University).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
University = str(data_University).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
University_result=Country+Province+University
else:
Country = ""
Province = ""
University = ""
except Exception as e:
None
# Major
try:
if 'Major' in data_MajorTag:
Major=str(data_MajorTag).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
else:
Major = ""
except Exception as e:
None
col_value.append((Name, StudentFlag, Job, Company, Location_result, University_result, Major))
return col_value
col_value = get_data(data_s_all, data_Location_all, data_University_all,data_MajorTag_all)
df = pd.DataFrame(col_value, index=None,columns=["Name", "StudentFlag", "Job", "Company", "Location_result", "University_result", "Major"])
df.to_excel('excel_pd2.xls')
print(df.head(3))
运行结果:
Name StudentFlag Job ... Location_result University_result Major
0 soyotec 0 总经理 ... 北京 北京市
1 ne12212 0 产品经理 ... 上海 上海市
2 图森未来 0 AIWIN大赛 ... 上海 上海市
技巧 :将复杂的json串整理成以下格式再读取,再使用 data_list = json.loads(data_str)
读取即可
{"error_code":40007,"error_msg":"fail to recognize"}[{"department": "abcdef",
"query_result": {"code": "1000", "description": "1000"},
"is_invoice": 1,
"imageName": "./imgs/8888888.jpeg",
"reco_result": {"total": "", "invoice_no": "123", "create_date": "", "check_code": ""}}]1234567
批量读取 json 文件(中文 json)
./out_file
下两个json文件内容如下:
out_01.txt 内容为:"{"name_ID":"12343","name":"张三","身份编码":"未知"}"
out_02.txt 内容为:"{"name_ID":"12344","name":"李四","身份编码":"98983"}"12
import jsonimport osdef img_w_h(text_path):
data_str_list = []
img_name_list = []
for filename in os.listdir(text_path):
file_path = text_path+'/'+filename print("获取文件:",file_path)
data_str = open(file_path,"r",encoding='UTF-8').read()
data_str_list.append(data_str)
img_name_list.append(filename)
print("data_str_list",data_str_list)
return data_str_list,img_name_listdef json_to_excel(data_str_list):
data_all = []
for data_str in data_str_list:
if data_str.startswith(u'\ufeff'):
content = data_str.encode('utf8')[3:].decode('utf8')
text = json.loads(content[1:-1])
if text["身份编码"] =="未知":
data_all.append(text["身份编码"])
return data_allif __name__ == "__main__":
text_path = "./out_file"
data_str_list, img_name_list = img_w_h(text_path)
data_all = json_to_excel(data_str_list)
print("data_all:",data_all)输出:
获取文件: ./out_file/out_01.txt
获取文件: ./out_file/out_02.txt
data_str_list ['\ufeff"{"name_ID":"12343","name":"张三","身份编码":"98983"}"', '\ufeff"{"name_ID":"12343","name":"张三","身份编码":"未知"}"']data_all: ['未知']123456789101112131415161718192021222324252627282930313233343536
复杂json格式解析 保存 Excel
import jsonimport pandas as pd"""
数据格式一(为方便查看格式化如下):
[{"department": "abcdef",
"query_result": {"code": "1000", "description": "1000"},
"is_invoice": 1,
"imageName": "./imgs/8888888.jpeg",
"reco_result": {"total": "", "invoice_no": "01111111", "create_date": "", "check_code": "", "invoice_code": ""}},
{"department": "abcdef",
"query_result": {},
"is_invoice": 0,
"imageName": "./imgs/51111111.jpeg",
"reco_result": {}},
...]
"""data_str = open('json_img.json').read()data_list = json.loads(data_str)data1_all = [[d["department"], d["is_invoice"], d["imageName"]] for d in data_list]data2_all = [d["query_result"] for d in data_list]data5_all = [d["reco_result"] for d in data_list]for i in data1_all:
print("data1_0:",i[0])print("data5",data5_all)print("data1",data1_all)def get_data(data1_all,data2_all,data5_all):
col_value = []
for data1,data2,data5 in zip(data1_all,data2_all,data5_all):
department = data1[0]
is_invoice = data1[1]
imageName = data1[2]
if 'code' in data2:
code = str(data2).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
description = str(data2).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
else:
code = "NAN"
description = "NAN"
if 'total' in data5:
total = str(data5).split(",")[0].split(":")[1].replace("'", "").replace("}", "")
invoice_no = str(data5).split(",")[1].split(":")[1].replace("'", "").replace("}", "")
create_date = str(data5).split(",")[2].split(":")[1].replace("'", "").replace("}", "")
check_code = str(data5).split(",")[3].split(":")[1].replace("'", "").replace("}", "")
invoice_code = str(data5).split(",")[4].split(":")[1].replace("'", "").replace("}", "")
else:
total = "NAN"
invoice_no = "NAN"
create_date = "NAN"
check_code = "NAN"
invoice_code = "NAN"
col_value.append((department,is_invoice,imageName, code,description, total, invoice_no, create_date, check_code, invoice_code))
return col_value
col_value = get_data(data1_all,data2_all,data5_all)df = pd.DataFrame(col_value, index=None,columns=["department", "is_invoice", "imageName", "code", "description", "total", "invoice_no", "create_date", "check_code", "invoice_code"])df.to_excel('excel_pd.xls')1234567891011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556575859606162636465
读取excel中某列的json数据(每个单元格数据格式为上面:数据格式一)
import jsonimport pandas as pdimport xlrd
excel_path = "C:\\Users\\Desktop\\test_data.xlsx"def read_excel(excel_path):
workbook = xlrd.open_workbook(excel_path)
sheet = workbook.sheet_by_name("Sheet1")
nrows = sheet.nrows
list1 = []
for i in range(1,nrows):
list1.append(sheet.row_values(i)[0])
return list1def get_data(excel_path):
list1 = read_excel(excel_path)
All_data = []
for i in range(len(list1)): #遍历列表数据(相当于遍历该列所有单元格)
data_list = json.loads(list1[i])
# print("data_list:", type(data_list))
for i in range(len(data_list)): #遍历该单元格列表中所有json串
# print(type(data_list[i]))
data_dict = data_list[i]
try:
imageNo = data_dict["imageNo"]
businessType = data_dict["businessType"]
reco_result = data_dict["reco_result"]
try:
total = reco_result["total"]
invoice_no = reco_result["invoice_no"]
create_date = reco_result["create_date"]
check_code = reco_result["check_code"]
invoice_code = reco_result["invoice_code"]
except:
total = "NAN"
invoice_no = "NAN"
create_date = "NAN"
check_code = "NAN"
invoice_code = "NAN"
is_invoice = data_dict["is_invoice"]
billId = data_dict["billId"]
imageName = data_dict["imageName"]
applyNum = data_dict["applyNum"]
department = data_dict["department"]
query_result = data_dict["query_result"]
try:
code = query_result["code"]
description = query_result["description"]
except:
code = "NAN"
description = "NAN"
All_data.append((imageNo, businessType, total, invoice_no, create_date, check_code,
invoice_code, is_invoice, billId, imageName,
applyNum, department, code, description))
except:
print("数据格式出错!")
pass
return All_data
All_data = get_data(excel_path)df = pd.DataFrame(All_data, index=None,columns=["imageNo", "businessType", "total","invoice_no", "create_date", "check_code", \ "invoice_code","is_invoice","billId","imageName",\ "applyNum","department","code","description"])df.to_excel('C:\\Users\\Desktop/001.xls')print("done!")12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667
特殊json文件格式化
最原始数据:{"41196516":"{\"type\":\"身份证正面\",\"name\":\"徐XX\",\"sex\":\"男\",\"people\":\"汉\",...,"41196243":"{\"error_code\"处理成如下json文件:(非常不正规){"41196516":"{"type":"身份证正面","name":"徐XX","sex":"男","people":"汉","birthday":"19XX年7月XX日","address":"广州市花都区*****号","id_number":"4401***15","issue_authority":"广州市XXX局","validity":"20XX.XX.13-20XX.XX.13","time_cost":{"recognize":348,"preprocess":28},"complete":true,"border_covered":false,"head_covered":false,"head_blurred":false,"gray_image":true,"error_code":0,"error_msg":"OK"}","41196243":"{"error_code":40007,"error_msg":"fail to recognize"}","41196510":"{"type":"二代身份证","name":"魏XX","sex":"男","people":"汉","birthday":"19XX年9月XX日","address":"江苏省江阴市XXX号","id_number":"320XXX17","time_cost":{"recognize":398,"preprocess":29},"complete":true,"border_covered":false,"head_covered":false,"head_blurred":false,"gray_image":false,"error_code":0,"error_msg":"OK"}","41197139":"{"type":"身份证背面","issue_authority":"佛山市XXX分局","validity":"2005.XX.XX-2025.XX.XX","time_cost":{"recognize":464,"preprocess":48},"complete":true,"error_code":0,"error_msg":"OK"}"}格式化展示:{"41196516":"{"type":"身份证正面",
"name":"徐XX",
"sex":"男",
"people":"汉",
"birthday":"19XX年7月XX日",
"address":"广州市花都区*****号",
"id_number":"4401***15",
"issue_authority":"广州市XXX局",
"validity":"20XX.XX.13-20XX.XX.13",
"time_cost":{"recognize":348,"preprocess":28},
"complete":true,
"border_covered":false,
"head_covered":false,
"head_blurred":false,
"gray_image":true,
"error_code":0,
"error_msg":"OK"}","41196243":"{"error_code":40007,"error_msg":"fail to recognize"}","41196510":"{"type":"二代身份证",
"name":"魏XX",
"sex":"男",
"people":"汉",
"birthday":"19XX年9月XX日",
"address":"江苏省江阴市XXX号",
"id_number":"320XXX17",
"time_cost":{"recognize":398,"preprocess":29},
"complete":true,
"border_covered":false,
"head_covered":false,
"head_blurred":false,
"gray_image":false,
"error_code":0,
"error_msg":"OK"}","41197139":"{"type":"身份证背面",
"issue_authority":"佛山市XXX分局",
"validity":"2005.XX.XX-2025.XX.XX",
"time_cost":{"recognize":464,"preprocess":48},
"complete":true,
"error_code":0,
"error_msg":"OK"}"}123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051
解析代码如下:
import jsonimport pandas as pd
data_str = open('D:/XXX/XXX文档/reize_result20181227.txt',"r",encoding="utf-8").read()data_str0 = data_str.replace("\\","")print(data_str0)imgName_list = []def get_data(data_str0):
All_data = []
num = data_str0.count("error_code") #统计共有多少个json("error_code"每个json都有)
for i in range(num):
imgName = data_str0[1:-1].split("\":\"{")[i][-8:] #获取ImageID([1:-1]去除最外层括号)
print("imgName", imgName)
img_str1 = "{"+data_str0[1:-1].split("\":\"{")[i+1].split("}\",\"")[0]+"}" #获取整个json
img_str1 = img_str1.replace("\"}\"}}","\"}") if "\"}\"}" in img_str1 else img_str1 #去除末尾多余的符号
print("img_str1:", img_str1)
data_list = json.loads(img_str1)
#########################################################################
try:
type_ = data_list["type"]
except:
type_ = "NAN"
try:
name = data_list["name"]
except:
name = "NAN"
try:
sex = data_list["sex"]
except:
sex = "NAN"
try:
people = data_list["people"]
except:
people = "NAN"
try:
birthday = data_list["birthday"]
except:
birthday = "NAN"
try:
address = data_list["address"]
except:
address = "NAN"
try:
id_number = data_list["id_number"]
except:
id_number = "NAN"
try:
issue_authority = data_list["issue_authority"]
except:
issue_authority = "NAN"
try:
validity = data_list["validity"]
except:
validity = "NAN"
try:
time_cost = data_list["time_cost"]
recognize = time_cost["recognize"]
preprocess = time_cost["preprocess"]
except:
time_cost = "NAN"
recognize = "NAN"
preprocess = "NAN"
try:
complete = data_list["complete"]
except:
complete = "NAN"
try:
border_covered = data_list["border_covered"]
except:
border_covered = "NAN"
try:
head_covered = data_list["head_covered"]
except:
head_covered = "NAN"
try:
head_blurred = data_list["head_blurred"]
except:
head_blurred = "NAN"
try:
gray_image = data_list["gray_image"]
except:
gray_image = "NAN"
error_code = data_list["error_code"]
error_msg = data_list["error_msg"]
All_data.append((imgName,type_,name,sex,people,birthday,\
address,id_number,issue_authority,validity,\
recognize,preprocess,complete,border_covered,\
head_covered,head_blurred,gray_image,error_code,error_msg))
return All_data
All_data = get_data(data_str0)df = pd.DataFrame(All_data, index=None,columns=["imgName", "type_", "name","sex", "people", "birthday", \ "address","id_number","issue_authority","validity",\ "recognize","preprocess","complete","border_covered",\ "head_covered","head_blurred","gray_image","error_code","error_msg"])df.to_excel('D:/XXX/XXX文档/reize_result20181227.xls')123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107
复杂json解析
- 报错1:
json.decoder.JSONDecodeError: Expecting property name enclosed in double quotes: line 1 column 3 (char 2)
原因在于:字符串里用单引号来标识字符。
解决办法如下:将字符串里的 单引号 替换成 双引号
写json文件
import jsonimport osdef get_img(file_path):
img_path = []
for path,dirname,filenames in os.walk(file_path):
for filename in filenames:
img_path.append(path+"/"+filename)
return img_path
def json_str(file_path):
dict_str = []
img_path = get_img(file_path)
for i in img_path:
dict_str.append({"ImageName":"/image/bus/"+i,"id":"8abs63twy2001"})
return dict_str
file_path = "./image/ocr"dict_str = json_str(file_path)json_str = json.dumps(dict_str)with open("./dict_str_to_json.json","w") as json_file:
json_file.write(json_str)
print("done!")12345678910111213141516171819202122232425
About Me
........................................................................................................................ ● 本文作者:小麦苗,部分内容整理自网络,若有侵权请联系小麦苗删除 ..................... ● QQ群号: 230161599 (满) 、618766405 ● weixin群:可加我weixin,我拉大家进群,非诚勿扰 ● 联系我请加QQ好友 ( 646634621 ) ,注明添加缘由 ● 于 2019-06-01 06:00 ~ 2019-06-30 24:00 在西安完成 ● 最新修改时间:2019-06-01 06:00 ~ 2019-06-30 24:00 ● 文章内容来源于小麦苗的学习笔记,部分整理自网络,若有侵权或不当之处还请谅解 ● 版权所有,欢迎分享本文,转载请保留出处 ........................................................................................................................ ● ........................................................................................................................ 使用 weixin客户端 扫描下面的二维码来关注小麦苗的weixin公众号( xiaomaimiaolhr )及QQ群(DBA宝典)、添加小麦苗weixin, 学习最实用的数据库技术。 ........................................................................................................................ |
来自 “ ITPUB博客 ” ,链接:http://blog.itpub.net/26736162/viewspace-2647241/,如需转载,请注明出处,否则将追究法律责任。