GitHub: iTomxy/data/coco


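The 2017 release of the COCO dataset [1] contains 123,287 images, each paired with 5 descriptive sentences. There are actually 80 classes: the category IDs run up to 90, but some IDs in between are unused.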

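Here the data is preprocessed for retrieval tasks. Since the later data split does not follow the original train / val split, the two sets are merged. The processing relies on the COCO API [2]; see [3,4] for usage examples.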

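Following [5], the text is converted to 300-d Doc2Vec vectors using the pre-trained model provided by [6], which in turn depends on Python 2 and an old version of gensim forked by its author [7]. For convenience, I use the container from [8] (which ships with conda) and create a Python 2.7 virtual environment for text processing (the COCO API works under Python 2.7).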

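(2022.7.11) According to the code in [6] and the paper [17] referenced there, the text should be preprocessed by tokenizing with Stanford CoreNLP [18,19] and lowercasing, so Stanford CoreNLP and Java also need to be downloaded and installed in the Docker container.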

Preparation

COCO Files

The COCO download links are at [9]:

  • 2017 Train images [118K/18GB]
  • 2017 Val images [5K/1GB]
  • 2017 Train/Val annotations [241MB]

Extract them to train2017/, val2017/ and annotations/ respectively, all under the COCO/ directory.

Doc2Vec Files

The pre-trained Doc2Vec model provided by [6]: English Wikipedia DBOW (1.4GB), extracted to Doc2Vec/enwiki_dbow/.

Docker Environment

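The operations in this subsection are performed inside a container created from [8]. I also wrote a Dockerfile that builds a Docker image with this configuration; see [21].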

COCO API & gensim

Before installing the COCO API and the gensim fork [7] in the Python 2.7 virtual environment, it helps to install a few dependencies first: Cython, matplotlib, scipy and smart_open.

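Clone [2] and [7]. Installing the COCO API amounts to running the 4 commands in cocoapi/PythonAPI/Makefile (you may hit some minor issues, see [16]); installing gensim amounts to running python setup.py install inside the cloned gensim/ (see the README of [7]).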

Java

Download JDK 1.8 from [20], extract it to /usr/local/java/jdk1.8.0_40/, and append the following to the end of /etc/profile:

export JAVA_HOME=/usr/local/java/jdk1.8.0_40
export CLASSPATH=.:${JAVA_HOME}/lib/dt.jar:${JAVA_HOME}/lib/tools.jar
export PATH=$PATH:${JAVA_HOME}/bin

Stanford CoreNLP

Download it from [19] (the version I got on 2022.7.11 was 4.4.0), extract it to /usr/local/stanford-corenlp/stanford-corenlp-4.4.0/, then run:

for f in `find /usr/local/stanford-corenlp/stanford-corenlp-4.4.0/ -name "*.jar"`; do
    echo "CLASSPATH=\$CLASSPATH:`realpath $f`" >> /etc/profile;
done

This appends every CoreNLP jar to CLASSPATH.
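
The same classpath could also be assembled in Python, for example to pass it to java through -cp instead of editing /etc/profile. This is only a minimal sketch: `build_classpath` is a hypothetical helper of mine, not part of Stanford CoreNLP.

```python
import os


def build_classpath(root):
    """Collect every *.jar under `root` into one classpath string.

    Hypothetical helper: mirrors the `find ... -name "*.jar"` loop.
    """
    jars = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".jar"):
                jars.append(os.path.realpath(os.path.join(dirpath, name)))
    # sort for a reproducible order; os.pathsep is ':' on Linux
    return os.pathsep.join(sorted(jars))
```

Calling `build_classpath("/usr/local/stanford-corenlp/stanford-corenlp-4.4.0/")` would then give one string with all jar paths joined by the platform's path separator.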

Class Order

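  • The original COCO annotations already carry an ID for each class, but the IDs are not contiguous (they run from 1 to 90, yet only 80 are used). Here contiguous, 0-based class IDs are redefined in ascending order of the original category IDs.
  • Both train and val contain all classes, so only the val set is processed here.
  • The result is written to class-name.COCO.txt.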
import os
import os.path as osp
from pycocotools.coco import COCO
import pprint

"""process class order
Record the mapping between tightened/discretized 0-base class ID,
original class ID and class name in `class-name.COCO.txt`,
with format `<original ID> <class name>`.

The class order is consistent to the ascending order of the original IDs.
"""

COCO_P = "/home/dataset/COCO"
ANNO_P = osp.join(COCO_P, "annotations")
SPLIT = ["val", "train"]

for _split in SPLIT:
    print("---", _split, "---")
    anno_file = osp.join(ANNO_P, "instances_{}2017.json".format(_split))
    coco = COCO(anno_file)
    cats = coco.loadCats(coco.getCatIds())
    # print(cats[0])
    cat_list = sorted([(c["id"], c["name"]) for c in cats],
        key=lambda t: t[0])  # ensure ascending order of original IDs
    # pprint.pprint(cat_list)
    with open(osp.join(COCO_P, "class-name.COCO.txt"), "w") as f:
        for old_id, c in cat_list:
            cn = c.replace(" ", "_")
            # format: <original ID> <class name>
            f.write("{} {}\n".format(old_id, cn))

    break  # only the val set is needed
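
Since the new class ID is not stored in the file, it has to be recovered from the line order when reading. A minimal sketch of such a reader follows; `load_class_map` is a hypothetical helper name, and it assumes the `<original ID> <class name>` format written above.

```python
def load_class_map(path):
    """Return ({original ID: 0-based ID}, [class names]) from the file.

    Assumes one `<original ID> <class name>` pair per line, where the
    0-based class ID is the line index.
    """
    old2new, names = {}, []
    with open(path, "r") as f:
        for new_id, line in enumerate(f):
            old_id, name = line.strip().split()
            old2new[int(old_id)] = new_id
            names.append(name)
    return old2new, names
```

Because spaces in class names were replaced by underscores when writing, each line splits into exactly two fields.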

Data Order

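  • Merge the train and val sets, and assign 0-based data IDs in ascending order of the original IDs (the numbers in the image file names, which are also non-contiguous and do not overlap between train and val).
  • The result is written to id-map.COCO.txt.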
import os
import os.path as osp

"""discretization of the original file ID
Map the file ID to sequential {0, 1, ..., n - 1},
and record this mapping in `id-map.COCO.txt`,
with format `<original id> <image file name>`.

The new ids are 0-base and implicit: line i (counting from 0) holds new id i.
"""

COCO_P = "/home/dataset/COCO"
TRAIN_P = osp.join(COCO_P, "train2017")
VAL_P = osp.join(COCO_P, "val2017")

file_list = [f for f in os.listdir(TRAIN_P) if (".jpg" in f)]
file_list.extend([f for f in os.listdir(VAL_P) if (".jpg" in f)])
print("#data:", len(file_list))  # 123,287

id_key = lambda x: int(x.split(".jpg")[0])
file_list = sorted(file_list, key=id_key)  # ascending of image ID
# print(file_list[:15])

with open(osp.join(COCO_P, "id-map.COCO.txt"), "w") as f:
    # format: <original id> <image file name>
    for f_name in file_list:
        _original_id = id_key(f_name)
        f.write("{} {}\n".format(_original_id, f_name))
print("DONE")
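
The remapping itself can be illustrated on a few made-up file names in COCO's 12-digit style. This is a sketch only; `make_id_map` is a hypothetical helper and is not used by the scripts here.

```python
import os.path as osp


def make_id_map(file_names):
    """Return [(new ID, original ID, file name)] sorted by original ID."""
    def original_id(name):
        # "000000000139.jpg" -> 139
        return int(osp.splitext(name)[0])

    ordered = sorted(file_names, key=original_id)
    return [(new_id, original_id(name), name)
            for new_id, name in enumerate(ordered)]


if __name__ == "__main__":
    for row in make_id_map(["000000000285.jpg",
                            "000000000009.jpg",
                            "000000000139.jpg"]):
        print(row)
```

The three files get the new IDs 0, 1 and 2 in ascending order of their original IDs 9, 139 and 285, regardless of which split they come from.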

Labels

  • Multi-hot label vectors, built from instances_*2017.json.
  • The class order follows class-name.COCO.txt above, and the data order follows id-map.COCO.txt.
import os
import os.path as osp
import numpy as np
import scipy.io as sio
from pycocotools.coco import COCO


"""process labels
Data in both train & val set will be all put together,
with data order determined by `id-map.COCO.txt`
and category order by `class-name.COCO.txt`.
"""


COCO_P = "/home/dataset/COCO"
ANNO_P = osp.join(COCO_P, "annotations")
SPLIT = ["val", "train"]


id_map_cls = {}
with open(osp.join(COCO_P, "class-name.COCO.txt"), "r") as f:
    for _new_id, line in enumerate(f):
        _old_id, _ = line.strip().split()
        id_map_cls[int(_old_id)] = _new_id
N_CLASS = len(id_map_cls)
print("#class:", N_CLASS)  # 80

id_map_data = {}
with open(osp.join(COCO_P, "id-map.COCO.txt"), "r") as f:
    for _new_id, line in enumerate(f):
        line = line.strip()
        _old_id, _ = line.strip().split()
        id_map_data[int(_old_id)] = _new_id
N_DATA = len(id_map_data)
print("#data:", N_DATA)  # 123,287


labels = np.zeros([N_DATA, N_CLASS], dtype=np.uint8)
for _split in SPLIT:
    print("---", _split, "---")
    anno_file = osp.join(ANNO_P, "instances_{}2017.json".format(_split))
    coco = COCO(anno_file)
    id_list = coco.getImgIds()
    for _old_id in id_list:
        _new_id = id_map_data[_old_id]
        _annIds = coco.getAnnIds(imgIds=_old_id)
        _anns = coco.loadAnns(_annIds)
        for _a in _anns:
            _cid = id_map_cls[_a["category_id"]]
            labels[_new_id][_cid] = 1

print("labels:", labels.shape, labels.sum())  # (123287, 80) 357627
sio.savemat(osp.join(COCO_P, "labels.COCO.mat"), {"labels": labels}, do_compression=True)
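
The core of the label construction can be seen on a toy example without COCO or numpy. This is a sketch with made-up IDs; `multi_hot` is a hypothetical helper that works on plain lists.

```python
def multi_hot(annotations, data_map, class_map):
    """Build multi-hot label rows.

    annotations: iterable of (original image ID, original category ID)
    data_map:    {original image ID: 0-based data ID}
    class_map:   {original category ID: 0-based class ID}
    """
    labels = [[0] * len(class_map) for _ in range(len(data_map))]
    for img_id, cat_id in annotations:
        # repeated instances of one class in an image still give a single 1
        labels[data_map[img_id]][class_map[cat_id]] = 1
    return labels


if __name__ == "__main__":
    data_map = {9: 0, 139: 1, 285: 2}
    class_map = {1: 0, 2: 1, 13: 2}
    annotations = [(9, 1), (9, 1), (9, 13), (285, 2)]
    print(multi_hot(annotations, data_map, class_map))
    # [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
```

Image 139 has no annotation in this toy input, so its row stays all zero, which is the kind of sample with an empty label that the cleaning step removes.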

Texts

using Stanford CoreNLP

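(2022.7.11) Following the paper [17] referenced in [6], the text is preprocessed by tokenizing with Stanford CoreNLP and lowercasing. Since I do not know how to lowercase with Stanford CoreNLP, I use it only for tokenization and lowercase with Python's built-in lower().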

  • This is the multi-threaded version of the processing code. It can be cross-checked against the sequential version; see iTomxy/data/coco/make.text.py and iTomxy/data/coco/check.text.py.
from __future__ import print_function
import codecs
import multiprocessing
import os
import os.path as osp
import pprint
import time
import threading
from pycocotools.coco import COCO
import gensim
from gensim.models import Doc2Vec
import numpy as np
import scipy.io as sio


"""Multi-Threading version of make.text.py"""


# COCO
COCO_P = "/home/dataset/COCO"
ANNO_P = osp.join(COCO_P, "annotations")
SPLIT = ["val", "train"]
# doc2vec
MODEL = "/home/dataset/Doc2Vec/enwiki_dbow/doc2vec.bin"
D2V_SEED = 0  # keep consistency


id_map_data = {}
with open(osp.join(COCO_P, "id-map.COCO.txt"), "r") as f:
    for _new_id, line in enumerate(f):
        _old_id, _ = line.strip().split()
        id_map_data[int(_old_id)] = _new_id
N_DATA = len(id_map_data)
print("#data:", N_DATA)  # 123,287

# pre-trained Doc2Vec model
model = Doc2Vec.load(MODEL)


def prep_text(tid, sentences):
    """preprocess sentences into a single document
    Input:
        - sentences: list of str, one per sentence.
    Output:
        - doc: list of a single str (i.e. all sentences in one line), processed document.
    """
    # use gensim.utils.simple_preprocess
    # sentences = [gensim.utils.simple_preprocess(s) for s in sentences]
    # # pprint.pprint(sentences)
    # doc = []
    # for s in sentences:
    #     doc.extend(s)

    # use Stanford CoreNLP
    with codecs.open("input.{}.txt".format(tid), "w", "utf-8") as f:
        for s in sentences:
            s = s.strip()  # must <- maybe trailing space
            if '.' != s[-1]:
                s += '.'
            f.write(s + '\n')
    os.system("java edu.stanford.nlp.pipeline.StanfordCoreNLP " \
        "-annotators tokenize,ssplit -outputFormat conll -output.columns word " \
        "-file input.{}.txt > /dev/null 2>&1".format(tid))
    with codecs.open("input.{}.txt.conll".format(tid), "r", "utf-8") as f:
        doc = " ".join([ln.strip().lower() for ln in f.readlines() if ln.strip() != ""])
    doc = doc.split()

    return doc


# multi-threading vars
N_THREAD = max(4, multiprocessing.cpu_count() - 2)
results, mutex_res = [], threading.Lock()
meta_index, mutex_mid = 0, threading.Lock()
mutex_d2v = threading.Lock()


def run(tid, id_list):
    global results, meta_index, id_map_data, model
    n = len(id_list)
    while True:
        mutex_mid.acquire()
        meta_idx = meta_index
        meta_index += 1
        mutex_mid.release()
        if meta_idx >= n:
            break

        _old_id = id_list[meta_idx]
        _new_id = id_map_data[_old_id]
        _annIds = coco_caps.getAnnIds(imgIds=_old_id)
        _anns = coco_caps.loadAnns(_annIds)
        # print(len(anns))
        # pprint.pprint(anns)
        sentences = [_a["caption"] for _a in _anns]
        # pprint.pprint(sentences)
        doc = prep_text(tid, sentences)
        # pprint.pprint(doc)
        mutex_d2v.acquire()
        model.random.seed(D2V_SEED)  # to keep it consistent
        vec = model.infer_vector(doc)
        mutex_d2v.release()
        # print(vec.shape)
        mutex_res.acquire()
        results.append((_new_id, vec[np.newaxis, :]))
        mutex_res.release()
        if meta_idx % 1000 == 0:
            print(meta_idx, ',', time.strftime("%Y-%m-%d-%H-%M", time.localtime(time.time())))

    # remove the intermediate output files (when using Stanford CoreNLP)
    for f in ["input.{}.txt".format(tid), "input.{}.txt.conll".format(tid)]:
        if osp.exists(f):
            os.remove(f)


for _split in SPLIT:
    print("---", _split, "---")
    tic = time.time()
    anno_file = osp.join(ANNO_P, "instances_{}2017.json".format(_split))
    caps_file = osp.join(ANNO_P, "captions_{}2017.json".format(_split))
    coco = COCO(anno_file)
    coco_caps = COCO(caps_file)
    id_list = coco.getImgIds()

    meta_index = 0  # reset for each split
    t_list = []
    for tid in xrange(N_THREAD):
        t = threading.Thread(target=run, args=(tid, id_list))
        t_list.append(t)
        t.start()

    for t in t_list:
        t.join()

    del t_list


assert len(results) == N_DATA
texts = sorted(results, key=lambda t: t[0])  # ascending by new ID
for i in xrange(N_DATA):  # check every row, not only the first 100
    assert texts[i][0] == i, "* order error"
texts = [t[1] for t in texts]
texts = np.vstack(texts).astype(np.float32)
assert texts.shape[0] == N_DATA
print("texts:", texts.shape, texts.dtype)  # (123287, 300) dtype('<f4')
print(texts.mean(), texts.min(), texts.max())  # -0.0047004167, -0.7569326, 0.804541
sio.savemat(osp.join(COCO_P, "texts.COCO.d2v-{}d.mat".format(texts.shape[1])), {"texts": texts})
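
The post-processing inside `prep_text` can be checked in isolation, without Java. The sketch below assumes that the .conll output holds one token per line with a blank line between sentences, which is what the `-output.columns word` option is used for above; `conll_to_tokens` is a hypothetical helper name.

```python
def conll_to_tokens(lines):
    """Turn the lines of a one-column CoNLL file into lowercased tokens.

    Assumption: one token per non-empty line, blank lines between sentences.
    """
    doc = " ".join(ln.strip().lower() for ln in lines if ln.strip() != "")
    return doc.split()


if __name__ == "__main__":
    lines = ["A\n", "man\n", "riding\n", "a\n", "bike\n", ".\n",
             "\n",
             "Two\n", "dogs\n", ".\n"]
    print(conll_to_tokens(lines))
    # ['a', 'man', 'riding', 'a', 'bike', '.', 'two', 'dogs', '.']
```

All sentences of one image end up in a single flat token list, which is the document passed to `infer_vector`.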

(deprecated) using gensim.utils.simple_preprocess

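  • (2022.7.11) This is the old method, kept for reference only. Do not use it; use the previous subsection instead.
  • Python 2 is required here!
  • 300-d doc2vec vectors; see [11,12] for Doc2Vec usage.
  • The 5 sentences are preprocessed with gensim.utils.simple_preprocess and simply concatenated into one document that is passed to the doc2vec model.
  • Uses instances_*2017.json and captions_*2017.json.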
# make.texts.py
from __future__ import print_function
import os
import os.path as osp
from pycocotools.coco import COCO
import gensim
from gensim.models import Doc2Vec
import numpy as np
import scipy.io as sio

"""process texts
python 2 needed by `jhlau/doc2vec`, and COCO api CAN work with python 2.7.
So I choose to create a virtual env of python 2.7.

dependencies:
    matplotlib (COCO api)
    smart_open (gensim)
"""

# COCO
COCO_P = "/home/dataset/COCO"
ANNO_P = osp.join(COCO_P, "annotations")
SPLIT = ["val", "train"]
# doc2vec
MODEL = "/home/dataset/Doc2Vec/enwiki_dbow/doc2vec.bin"
start_alpha = 0.01
infer_epoch = 1000
DIM = 300  # dimension of the doc2vec feature


id_map_data = {}
with open("id-map.COCO.txt", "r") as f:
    # each line is `<original id> <image file name>`; the new id is the line index
    for _new_id, line in enumerate(f):
        _old_id, _ = line.strip().split()
        id_map_data[int(_old_id)] = _new_id
N_DATA = len(id_map_data)
print("#data:", N_DATA)

# pre-trained Doc2Vec model
model = Doc2Vec.load(MODEL)

texts = [None] * N_DATA  # filled by new ID so that rows align with the labels
for _split in SPLIT:
    print("---", _split, "---")
    anno_file = osp.join(ANNO_P, "instances_{}2017.json".format(_split))
    caps_file = osp.join(ANNO_P, "captions_{}2017.json".format(_split))
    coco = COCO(anno_file)
    coco_caps = COCO(caps_file)

    id_list = coco.getImgIds()
    for _old_id in id_list:
        _new_id = id_map_data[_old_id]
        _annIds = coco_caps.getAnnIds(imgIds=_old_id)
        _anns = coco_caps.loadAnns(_annIds)
        # print(len(anns))
        # pprint.pprint(anns)
        sentences = [_a["caption"] for _a in _anns]
        # pprint.pprint(sentences)
        sentences = [gensim.utils.simple_preprocess(s) for s in sentences]
        # pprint.pprint(sentences)
        doc = []
        for s in sentences:
            doc.extend(s)
        # print(doc)
        vec = model.infer_vector(doc)
        # print(vec.shape)
        texts[_new_id] = vec[np.newaxis, :]  # place the row at its new ID
        # break
    # break

assert all(t is not None for t in texts)
texts = np.vstack(texts).astype(np.float32)
print("texts:", texts.shape, texts.dtype)  # (123287, 300) dtype('<f4')
sio.savemat("texts.COCO.doc2vec.{}.mat".format(DIM), {"texts": texts})

Image

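  • Put (soft links [13] to) the images from train2017/ and val2017/ together in images/ for easier reading.
  • On Linux, use relative paths, so that images/ itself can later be soft-linked directly into any project instead of redoing this step for each project.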
import os
import os.path as osp
import platform


"""
soft-link all images into a single folder,
with the name modified to their (0-base) sample ID.
Hint:
Use RELATIVE source path in linux,
then you can simply soft-link that `images/` in any project
instead of creating a new project-specific `images/`.
But this trick does NOT work in Windows.
"""


# convert path separator
cvt_sep = lambda p: p.replace('\\/'.replace(os.sep, ''), os.sep)


P = cvt_sep("/usr/local/dataset/COCO")
SPLIT = ["val2017", "train2017"]
IMAGE_DEST = osp.join(P, "images")  # path you place `images/` in
if not osp.exists(IMAGE_DEST):
    os.makedirs(IMAGE_DEST)


id_map_data = {}
with open(osp.join(P, "id-map.COCO.txt"), "r") as f:
    for _new_id, line in enumerate(f):
        _old_id, _ = line.strip().split()
        id_map_data[int(_old_id)] = _new_id
N_DATA = len(id_map_data)
print("#data:", N_DATA)  # 123,287


# soft-linking command
if "Windows" == platform.system():
    cmd = "mklink {1} {0} > nul"
else:
    assert "Linux" == platform.system()
    cmd = "ln -s {0} {1}"

_cnt = 0
for split in SPLIT:
    IMAGE_SRC_ABS = osp.join(P, split)  # absolute source path
    IMAGE_SRC_REL = osp.join("..", split)  # relative source path
    if "Windows" == platform.system():
        img_src_path_pre = IMAGE_SRC_ABS
    else:
        assert "Linux" == platform.system()
        img_src_path_pre = IMAGE_SRC_REL

    print("soft-linking:", IMAGE_SRC_ABS, "->", IMAGE_DEST)
    for f in os.listdir(IMAGE_SRC_ABS):
        old_id = int(f.split(".jpg")[0])
        new_id = id_map_data[old_id]
        # img_p = osp.join(IMAGE_SRC_ABS, f)  # use this in Windows
        # img_p = osp.join(IMAGE_SRC_REL, f)  # use this in linux
        img_p = osp.join(img_src_path_pre, f)
        new_img_p = osp.join(IMAGE_DEST, "{}.jpg".format(new_id))
        os.system(cmd.format(img_p, new_img_p))
        _cnt += 1
        if _cnt % 1000 == 0:
            print(_cnt)
print("DONE")
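
On Linux the same relative link can also be created with `os.symlink` instead of spawning one `ln -s` process per image. A minimal sketch, assuming the layout above (split folders next to images/); `link_relative` is a hypothetical helper and not the script's method.

```python
import os
import os.path as osp


def link_relative(src_dir, dest_dir, file_name, new_name):
    """Create dest_dir/new_name as a relative symlink to src_dir/file_name."""
    if not osp.isdir(dest_dir):
        os.makedirs(dest_dir)
    # path of the source as seen from inside dest_dir, e.g. ../val2017/x.jpg
    target = osp.join(osp.relpath(src_dir, dest_dir), file_name)
    link = osp.join(dest_dir, new_name)
    if not osp.lexists(link):  # skip links that already exist
        os.symlink(target, link)
    return link
```

Because the stored target is relative to images/, the links keep resolving when the whole COCO/ directory is moved or when images/ is reached through another soft link.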

Clean Data & Comparison

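Two observations:

  • Section IV.A.(2) of [14] says COCO is kept unchanged, yet its data has only 123,274 samples, i.e. 13 fewer than the full 123,287.
  • Some samples in COCO have an empty label.

Since [14] also provides labels in its open-source repository [15], I cross-check against them after removing the samples with empty labels.

  • Idea: remove the samples with empty labels from both the labels provided by [15] and my own labels, then compare the sorted row sums and column sums of what remains.
  • Conclusion: they match, so I take my labels as correct.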
import os.path as osp
import scipy.io as sio
import numpy as np


ZQ_P = "F:/codes/Zero-Shot-Hashing/data/coco"
L_train = sio.loadmat(osp.join(ZQ_P, "train_label.mat"))["train"]
L_val = sio.loadmat(osp.join(ZQ_P, "val_label.mat"))["val"]
L_test = sio.loadmat(osp.join(ZQ_P, "test_label.mat"))["test"]
print(L_train.shape, L_val.shape, L_test.shape)  # (86291, 80) (31983, 80) (5000, 80)
print("train zero:", (0 == L_train.sum(1)).sum())  # 756
print("val zero:", (0 == L_val.sum(1)).sum())  # 258
print("test zero:", (0 == L_test.sum(1)).sum())  # 42


L_zq = np.vstack([L_train, L_val, L_test])
print("L_zq:", L_zq.shape)  # (123274, 80)
rs_zq = L_zq.sum(1)

L_my = sio.loadmat("labels.COCO.80.mat")["labels"].astype(np.int32)
rs_my = L_my.sum(1)
print("#zero-label:", (0 == rs_my).sum())  # 1069
print("#total:", L_my.shape[0])  # 123287


# both filter out the data with empty labels
non_zero_zq = (rs_zq > 0)
L_zq_nz = L_zq[non_zero_zq]
print("#ZQ's non-zero:", L_zq_nz.shape)  # (122218, 80)
non_zero_my = (rs_my > 0)
L_my_nz = L_my[non_zero_my]
print("#my non-zero:", L_my_nz.shape)  # (122218, 80)

# check the consistency of sorted row / column sum
print("- row sum -")
rs_sort_zq = np.sort(L_zq_nz.sum(1))
rs_sort_my = np.sort(L_my_nz.sum(1))
print("diff:", (rs_sort_zq != rs_sort_my).sum())  # 0

print("- column sum -")
cs_sort_zq = np.sort(L_zq_nz.sum(0))
cs_sort_my = np.sort(L_my_nz.sum(0))
print("diff:", (cs_sort_zq != cs_sort_my).sum())  # 0

print("- clean id -")
indices = np.arange(L_my.shape[0])
clean_id = indices[non_zero_my]
print("clean id:", clean_id.shape)  # (122218,)
L_clean = L_my[clean_id]
assert (L_clean.sum(1) > 0).all()
zero_id = np.setdiff1d(indices, clean_id)
assert (0 == L_my[zero_id].sum(1)).all()
sio.savemat("clean_id.COCO.mat", {"clean_id": clean_id})
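
The check can be reproduced on toy matrices with plain lists. This is a sketch only; `drop_empty` and `same_sorted_sums` are hypothetical helper names.

```python
def drop_empty(labels):
    """Keep only the rows that have at least one positive label."""
    return [row for row in labels if sum(row) > 0]


def same_sorted_sums(a, b):
    """Compare sorted row sums and sorted column sums of two label matrices."""
    rows_equal = sorted(map(sum, a)) == sorted(map(sum, b))
    cols_equal = sorted(map(sum, zip(*a))) == sorted(map(sum, zip(*b)))
    return rows_equal and cols_equal


if __name__ == "__main__":
    mine = [[1, 0, 1], [0, 0, 0], [0, 1, 0]]  # contains one empty row
    theirs = [[0, 1, 0], [1, 0, 1]]           # other row order, no empty row
    print(same_sorted_sums(drop_empty(mine), drop_empty(theirs)))  # True
```

Matching sorted sums is a necessary condition for two label matrices to agree up to a reordering of rows and columns, not a proof that they do, so the conclusion here is a sanity check rather than a guarantee.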

Cloud Drive

Baidu Cloud

Baidu Cloud: https://pan.baidu.com/s/1G8R0gHNI33vhx3TQukYDBg, extraction code: vgvr.

kaggle

www.kaggle.com/dataset/d5240c2030f1f0664e217a51fcd2b51e85925ff471ca3de36ea546e555ccebd6

References

  1. COCO
  2. cocodataset/cocoapi
  3. cocoapi/PythonAPI/pycocoDemo.ipynb
  4. COCO-stuff usage
  5. Separated variational hashing networks for cross-modal retrieval
  6. jhlau/doc2vec
  7. jhlau/gensim
  8. pytorch/pytorch:1.4-cuda10.1-cudnn7-runtime
  9. COCO download
  10. Installing the COCO-stuff Python API on Windows
  11. jhlau/doc2vec/infer_test.py
  12. Doc2Vec Model
  13. Creating and deleting soft links to folders on Linux
  14. Transductive Zero-Shot Hashing for Multilabel Image Retrieval
  15. qinnzou/Zero-Shot-Hashing
  16. Fixing the "pycocotools/_mask.c: No such file or directory" error when compiling the COCO API
  17. An Empirical Evaluation of doc2vec with Practical Insights into Document Embedding Generation
  18. ACL 2014 | The Stanford CoreNLP Natural Language Processing Toolkit
  19. Stanford CoreNLP
  20. jdk-8u40-linux-x64.gz, extraction code: g9jb
  21. iTomxy/ml-template/docker-files/pt1_4-d2v