Hash Tables: The Art and Science of Fast Lookup

wx65dfdaaec020c 2025-08-23 09:20:47 Blog category: Data Structures and Algorithms

© Copyright belongs to the author: an original work by 51CTO blog author wx65dfdaaec020c. Contact the author for permission before reprinting.

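Abstract

A hash table is a data structure that uses a hash function to map each key to the array slot holding its value, which gives lookup, insertion, and deletion an average time complexity of O(1). This article covers hash function design, collision resolution, performance tuning, and uses of hash tables in real systems.

1. How Hash Tables Work

1.1 Core Concept

A hash table runs each key through a hash function to obtain an array index, so an entry can be reached without scanning the other entries.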

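class HashTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [None] * size  # one slot per array index
        self.count = 0              # number of stored entries
    
    def _hash(self, key):
        """Simple example hash: reduce Python's built-in hash() to a valid index"""
        return hash(key) % self.size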

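1.2 Complexity of Hash Table Operations

Operation	Average case	Worst case	Cause of the worst case
Insert	O(1)	O(n)	Many keys collide in the same slot
Lookup	O(1)	O(n)	Many keys collide in the same slot
Delete	O(1)	O(n)	Many keys collide in the same slot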
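
To make the gap between the average and worst case concrete, the following minimal sketch counts key comparisons during lookups under two hash functions. The class `CountingTable` and the helper `total_comparisons` are hypothetical names introduced only for this illustration.

```python
class CountingTable:
    """Chained hash table that counts key comparisons during lookups."""

    def __init__(self, size, hash_fn):
        self.size = size
        self.hash_fn = hash_fn  # pluggable hash function (illustrative)
        self.buckets = [[] for _ in range(size)]
        self.comparisons = 0

    def insert(self, key, value):
        # Assumes distinct keys, so no duplicate check is needed here
        self.buckets[self.hash_fn(key) % self.size].append((key, value))

    def get(self, key):
        for k, v in self.buckets[self.hash_fn(key) % self.size]:
            self.comparisons += 1
            if k == key:
                return v
        return None


def total_comparisons(hash_fn, n=1000, size=1024):
    """Insert keys 0..n-1, look each one up once, and return the comparison count."""
    table = CountingTable(size, hash_fn)
    for key in range(n):
        table.insert(key, key)
    for key in range(n):
        table.get(key)
    return table.comparisons


spread = total_comparisons(lambda k: k)   # every key gets its own bucket
collide = total_comparisons(lambda k: 0)  # every key lands in bucket 0
print(spread, collide)  # → 1000 500500
```

With a well-spread hash each lookup costs one comparison, while the constant hash degrades the table into a single list and the total grows quadratically.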

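2. Hash Function Design

2.1 Properties of a Good Hash Function

Deterministic: the same key always produces the same hash value
Uniform: keys are spread evenly over the whole index space
Efficient: the hash is cheap to compute
Collision-resistant: two different keys rarely produce the same hash value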

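2.2 Common Hash Function Implementations

def simple_hash(key, size):
    """Modulo (division) hashing for integer keys"""
    return key % size

def multiplicative_hash(key, size):
    """Multiplicative hashing for integer keys"""
    A = 0.6180339887  # fractional part of the golden ratio, (sqrt(5) - 1) / 2
    return int(size * ((key * A) % 1))

def string_hash(s, size):
    """Polynomial rolling hash for strings"""
    hash_val = 0
    prime = 31  # prime base
    
    for char in s:
        hash_val = (hash_val * prime + ord(char)) % size
    
    return hash_val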
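
The floating-point version of multiplicative hashing loses precision once keys grow large. A common integer-only variant, often called Fibonacci hashing, multiplies by floor(2^32 / golden ratio) and keeps the top bits of the 32-bit product. The sketch below assumes non-negative integer keys and a table of 2**bits slots; `fib_hash` and `GOLDEN_32` are names chosen for this example.

```python
GOLDEN_32 = 2654435769  # floor(2**32 / golden ratio), i.e. 0x9E3779B9


def fib_hash(key, bits):
    """Map a non-negative integer key to one of 2**bits slots (1 <= bits <= 32)."""
    product = (key * GOLDEN_32) & 0xFFFFFFFF  # keep the low 32 bits of the product
    return product >> (32 - bits)             # the top `bits` bits select the slot


print([fib_hash(k, 4) for k in (1, 2, 3)])  # → [9, 3, 13]
```

Consecutive keys end up far apart in a 16-slot table, which is the spreading behaviour the multiplicative method is meant to provide.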

3. Collision Resolution

3.1 Separate Chaining

class ChainingHashTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [[] for _ in range(size)]  # one list (chain) per slot
    
    def _hash(self, key):
        return hash(key) % self.size
    
    def insert(self, key, value):
        index = self._hash(key)
        # Update the entry if the key is already in the chain
        for i, (k, v) in enumerate(self.table[index]):
            if k == key:
                self.table[index][i] = (key, value)
                return
        self.table[index].append((key, value))
    
    def get(self, key):
        index = self._hash(key)
        for k, v in self.table[index]:
            if k == key:
                return v
        return None  # key not found
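
Deletion is listed among the O(1) operations but is not shown for chaining. A minimal sketch follows, using a standalone class so that it runs on its own; the name `ChainedTable` and the `delete` method are assumptions made for this example.

```python
class ChainedTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [[] for _ in range(size)]

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        bucket = self.table[self._hash(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def get(self, key):
        for k, v in self.table[self._hash(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        """Remove key from its chain; return True if it was present."""
        bucket = self.table[self._hash(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return True
        return False


t = ChainedTable(size=4)
t.insert(1, "a")
t.insert(5, "b")  # in CPython, 1 and 5 share a bucket when size is 4
t.delete(1)
print(t.get(1), t.get(5))  # → None b
```

Because each chain is independent, removing one entry never disturbs the lookup of another key, which is the main convenience of chaining over open addressing.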

3.2 Open Addressing

Linear probing:

class LinearProbingHashTable:
    def __init__(self, size=10):
        self.size = size
        self.table = [None] * size  # values
        self.keys = [None] * size   # keys; None marks an empty slot
    
    def _hash(self, key):
        return hash(key) % self.size
    
    def insert(self, key, value):
        index = self._hash(key)
        
        for _ in range(self.size):  # probe each slot at most once
            if self.keys[index] is None or self.keys[index] == key:
                self.keys[index] = key  # new key, or update of an existing one
                self.table[index] = value
                return
            index = (index + 1) % self.size  # linear probing
        
        raise RuntimeError("hash table is full")
    
    def get(self, key):
        index = self._hash(key)
        start_index = index
        
        while self.keys[index] is not None:
            if self.keys[index] == key:
                return self.table[index]
            index = (index + 1) % self.size
            if index == start_index:  # back at the start: every slot was checked
                break
        return None
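
The linear-probing table has no delete operation. Simply clearing a slot would cut the probe run and hide keys stored after it, so a common remedy is to leave a tombstone marker in the slot. The sketch below is one way to do this; `ProbingTable` and `_DELETED` are hypothetical names, and like the table above it cannot store `None` as a key.

```python
_DELETED = object()  # tombstone: the slot once held a key


class ProbingTable:
    def __init__(self, size=8):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size

    def _hash(self, key):
        return hash(key) % self.size

    def insert(self, key, value):
        index = self._hash(key)
        first_free = None
        for _ in range(self.size):
            slot = self.keys[index]
            if slot is None:
                break                       # end of the probe run
            if slot is _DELETED:
                if first_free is None:
                    first_free = index      # remember the first reusable slot
            elif slot == key:
                self.values[index] = value  # update in place
                return
            index = (index + 1) % self.size
        else:
            index = None                    # every slot probed, none empty
        target = first_free if first_free is not None else index
        if target is None:
            raise RuntimeError("hash table is full")
        self.keys[target] = key
        self.values[target] = value

    def get(self, key):
        index = self._hash(key)
        for _ in range(self.size):
            slot = self.keys[index]
            if slot is None:                # only a truly empty slot ends the search
                return None
            if slot is not _DELETED and slot == key:
                return self.values[index]
            index = (index + 1) % self.size
        return None

    def delete(self, key):
        index = self._hash(key)
        for _ in range(self.size):
            slot = self.keys[index]
            if slot is None:
                return False
            if slot is not _DELETED and slot == key:
                self.keys[index] = _DELETED  # later probes continue past this slot
                self.values[index] = None
                return True
            index = (index + 1) % self.size
        return False


t = ProbingTable(size=8)
for key, value in ((1, "a"), (9, "b"), (17, "c")):  # all collide in CPython
    t.insert(key, value)
t.delete(9)
print(t.get(17))  # → c
```

Tombstones accumulate over time, so real implementations usually clear them when the table is rebuilt during a resize.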

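Quadratic probing and double hashing:

def quadratic_probing(index, i, size):
    """Quadratic probing: the i-th probe is i*i slots past the home index"""
    return (index + i*i) % size

def double_hashing(key, i, size):
    """Double hashing: a second hash sets the step size (requires size >= 2)"""
    hash1 = hash(key) % size
    hash2 = 1 + (hash(key) % (size - 1))  # in 1..size-1, so the step is never 0
    return (hash1 + i * hash2) % size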
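
The two probing schemes differ in which slots they can reach. The sketch below lists the probe sequences for small tables; it uses plain integer arithmetic instead of `hash()` so the output is deterministic, and the function names are chosen for this example only.

```python
def quadratic_sequence(start, size):
    """Slots visited by quadratic probing: (start + i*i) % size for i = 0..size-1."""
    return [(start + i * i) % size for i in range(size)]


def double_hash_sequence(key, size):
    """Slots visited by double hashing for a non-negative integer key (size >= 2)."""
    h1 = key % size
    h2 = 1 + (key % (size - 1))  # step in 1..size-1
    return [(h1 + i * h2) % size for i in range(size)]


print(sorted(set(quadratic_sequence(3, 8))))  # → [3, 4, 7]
print(sorted(set(quadratic_sequence(0, 7))))  # → [0, 1, 2, 4]
print(double_hash_sequence(10, 7))            # → [3, 1, 6, 4, 2, 0, 5]
```

Quadratic probing reaches only some of the slots, so an insert can fail even though free slots remain. Double hashing visits every slot when the step and the table size share no common factor, which a prime table size guarantees.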

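4. Dynamic Resizing and the Load Factor

4.1 Managing the Load Factor

Load factor λ = number of stored elements / number of slots

class DynamicHashTable:
    def __init__(self, initial_size=10, load_factor=0.75):
        self.size = initial_size
        self.load_factor = load_factor  # resize threshold, expected to be below 1
        self.table = [None] * initial_size  # each slot is None or a (key, value) tuple
        self.count = 0
    
    def insert(self, key, value):
        if self.count / self.size >= self.load_factor:
            self._resize()
        
        # Insertion logic: linear probing is assumed here as one possible strategy
        index = hash(key) % self.size
        while self.table[index] is not None and self.table[index][0] != key:
            index = (index + 1) % self.size
        if self.table[index] is None:
            self.count += 1  # count new keys only, not updates
        self.table[index] = (key, value)
    
    def _resize(self):
        """Double the table and rehash every element"""
        old_table = self.table
        self.size *= 2
        self.table = [None] * self.size
        self.count = 0
        
        for item in old_table:
            if item is not None:
                self.insert(item[0], item[1])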
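
Why keep the threshold around 0.75? For linear probing, Knuth's classical approximations give the expected number of probes as a function of the load factor. The sketch below simply evaluates those formulas; they assume uniformly distributed hashes and are an estimate, not a measurement of the class above.

```python
def expected_probes(load):
    """Approximate probe counts for linear probing at load factor 0 <= load < 1."""
    hit = 0.5 * (1 + 1 / (1 - load))        # successful search
    miss = 0.5 * (1 + 1 / (1 - load) ** 2)  # unsuccessful search or insertion
    return hit, miss


for load in (0.5, 0.75, 0.9):
    hit, miss = expected_probes(load)
    print(f"load={load:.2f}  hit={hit:.1f}  miss={miss:.1f}")
```

At a load factor of 0.75 a failed lookup needs about 8.5 probes, and at 0.9 about 50.5, which is why tables are resized well before they fill up.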

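5. Hash Tables in Practice

5.1 How Python Dictionaries Are Implemented

# A rough approximation of a Python dict: a sparse index array plus a dense entry list
class PyDict:
    def __init__(self):
        self.size = 8  # initial number of index slots
        self.used = 0
        self.indices = [None] * self.size  # slot -> position in self.entries
        self.entries = []                  # (key, value) pairs in insertion order
    
    def _probe(self, key):
        """Probe for the key's slot, or for the empty slot where it would go"""
        index = hash(key) % self.size
        while self.indices[index] is not None:
            entry_index = self.indices[index]
            if self.entries[entry_index][0] == key:
                return index, entry_index
            # Linear probing for simplicity; CPython uses a perturbation-based sequence
            index = (index + 1) % self.size
        return index, None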
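
The probe routine alone does not show how entries are stored or why insertion order is preserved. The following self-contained sketch fills in those pieces under simplifying assumptions: linear probing, no deletion, and growth once the table is two-thirds full. `CompactDict` is a hypothetical name, and this is a teaching model rather than CPython's actual code.

```python
class CompactDict:
    """Toy dict: a sparse index array pointing into a dense, ordered entry list."""

    def __init__(self):
        self.size = 8
        self.indices = [None] * self.size  # slot -> position in self.entries
        self.entries = []                  # [key, value] pairs in insertion order

    def _probe(self, key):
        """Return (slot, entry_index); entry_index is None if the key is absent."""
        slot = hash(key) % self.size
        while self.indices[slot] is not None:
            entry_index = self.indices[slot]
            if self.entries[entry_index][0] == key:
                return slot, entry_index
            slot = (slot + 1) % self.size
        return slot, None

    def __setitem__(self, key, value):
        slot, entry_index = self._probe(key)
        if entry_index is not None:
            self.entries[entry_index][1] = value  # update keeps the original position
            return
        self.indices[slot] = len(self.entries)
        self.entries.append([key, value])
        if len(self.entries) * 3 >= self.size * 2:  # keep the load below 2/3
            self._grow()

    def __getitem__(self, key):
        _, entry_index = self._probe(key)
        if entry_index is None:
            raise KeyError(key)
        return self.entries[entry_index][1]

    def _grow(self):
        """Double the index array and re-index; the entry list is untouched."""
        self.size *= 2
        self.indices = [None] * self.size
        for i, (key, _) in enumerate(self.entries):
            slot = hash(key) % self.size
            while self.indices[slot] is not None:
                slot = (slot + 1) % self.size
            self.indices[slot] = i

    def keys(self):
        return [key for key, _ in self.entries]


d = CompactDict()
for i, word in enumerate(["pear", "apple", "fig"]):
    d[word] = i
d["apple"] = 10
print(d.keys(), d["apple"])  # → ['pear', 'apple', 'fig'] 10
```

Because the entries live in a separate list, iteration follows insertion order and a resize only has to rebuild the small index array.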

5.2 Example Use Cases

Word frequency counting:

def word_frequency(text):
    """Count word frequencies with a hash table (dict)"""
    freq = {}
    words = text.lower().split()
    
    for word in words:
        # Strip surrounding punctuation
        word = word.strip('.,!?;:"')
        if word:
            freq[word] = freq.get(word, 0) + 1
    
    return freq
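
The same counting can be delegated to `collections.Counter` from the standard library, which is a dict subclass built for this purpose. The function name `word_frequency_counter` is chosen for this example.

```python
from collections import Counter


def word_frequency_counter(text):
    """Same counting logic as above, delegated to collections.Counter."""
    words = (word.strip('.,!?;:"') for word in text.lower().split())
    return Counter(word for word in words if word)


freq = word_frequency_counter("The cat saw the dog. The dog ran!")
print(freq.most_common(2))  # → [('the', 3), ('dog', 2)]
```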

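Cache implementation:

class DLinkedNode:
    """Doubly linked list node"""
    def __init__(self, key=0, value=0):
        self.key, self.value = key, value
        self.prev = self.next = None

class LRUCache:
    """LRU cache built from a hash table plus a doubly linked list"""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.cache = {}  # key -> node
        self.head = DLinkedNode()  # sentinel before the most recently used node
        self.tail = DLinkedNode()  # sentinel after the least recently used node
        self.head.next = self.tail
        self.tail.prev = self.head
    
    def _move_to_head(self, node):
        # Unlink the node, then reinsert it right after the head sentinel
        node.prev.next, node.next.prev = node.next, node.prev
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node
    
    def get(self, key: int) -> int:
        if key not in self.cache:
            return -1
        node = self.cache[key]
        self._move_to_head(node)
        return node.value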
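
The class above covers only the read path. For a complete and compact picture of the eviction policy, here is a sketch built on `collections.OrderedDict`, which already combines a hash table with an ordering. The class name `OrderedDictLRU` and its `put` method are assumptions made for this example, not part of the code above.

```python
from collections import OrderedDict


class OrderedDictLRU:
    """LRU cache sketch; OrderedDict supplies the hash table and the ordering."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return -1
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry


cache = OrderedDictLRU(2)
cache.put(1, "a")
cache.put(2, "b")
cache.get(1)       # key 1 becomes the most recently used
cache.put(3, "c")  # evicts key 2
print(cache.get(2), cache.get(1), cache.get(3))  # → -1 a c
```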

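6. Advanced Topics

6.1 Consistent Hashing

Used to shard data across nodes in distributed systems. Keys and nodes are hashed onto the same ring and each key belongs to the first node clockwise from it, so adding or removing a node remaps only the keys next to that node.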
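
A minimal sketch of a consistent-hash ring follows, assuming string keys, MD5 as the spreading function, and 100 virtual nodes per physical node. The names `HashRing`, `ring_hash`, and the node labels are illustrative, and the ring must contain at least one node before `lookup` is called.

```python
import bisect
import hashlib


def ring_hash(text):
    """Stable 64-bit ring position (MD5 is used for spreading, not for security)."""
    return int.from_bytes(hashlib.md5(text.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas  # virtual nodes per physical node
        self.ring = []            # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self.replicas):
            bisect.insort(self.ring, (ring_hash(f"{node}#{i}"), node))

    def remove(self, node):
        self.ring = [(pos, n) for pos, n in self.ring if n != node]

    def lookup(self, key):
        """Return the first node clockwise from the key's position."""
        index = bisect.bisect_left(self.ring, (ring_hash(key), ""))
        return self.ring[index % len(self.ring)][1]  # wrap around past the end


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("node-c")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
print(all(before[k] == "node-c" for k in moved))  # → True
```

Only keys that lived on the removed node change owner. With plain `hash(key) % node_count`, most keys would move whenever the node count changes.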

6.2 Bloom Filter

A probabilistic data structure for fast membership tests. It can report false positives but never false negatives.
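
The idea can be sketched in a few lines: each item sets several bits chosen by independent hashes, and a lookup checks whether all of those bits are set. The example below derives the hashes from salted SHA-256 digests; the class name and the default sizes are assumptions chosen for illustration, not tuned values.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        """Derive num_hashes bit positions from salted SHA-256 digests."""
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # True means "possibly present"; False means "definitely absent"
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


seen = BloomFilter()
seen.add("apple")
seen.add("banana")
print("apple" in seen)  # → True
```

An item that was added is always reported as present. An item that was never added is usually reported as absent, but may occasionally appear present when other items happen to have set all of its bits.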

6.3 Perfect Hashing

A hash function with no collisions, suited to static key sets that are known in advance.
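
For a small static key set, a collision-free function can be found by brute force: try seeds until one places every key in its own slot. The sketch below assumes distinct string keys and a table twice the size of the key set; `seeded_hash`, `find_perfect_seed`, and the keyword list are hypothetical. Production tools use more refined constructions.

```python
import hashlib


def seeded_hash(key, seed, size):
    """Hash a string key to a slot, parameterised by an integer seed."""
    digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


def find_perfect_seed(keys, size=None, max_tries=100_000):
    """Search for a seed under which no two (distinct) keys share a slot."""
    size = size or 2 * len(keys)
    for seed in range(max_tries):
        slots = {seeded_hash(key, seed, size) for key in keys}
        if len(slots) == len(keys):
            return seed, size
    raise ValueError("no collision-free seed found; try a larger table")


KEYWORDS = ["if", "else", "for", "while", "return"]
seed, size = find_perfect_seed(KEYWORDS)

# Build the table once; every later lookup is a single probe
table = [None] * size
for word in KEYWORDS:
    table[seeded_hash(word, seed, size)] = word


def is_keyword(word):
    return table[seeded_hash(word, seed, size)] == word
```

The search cost is paid once, when the table is built. After that, every lookup touches exactly one slot, with no chains or probe sequences.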

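7. Performance Tuning Tips

7.1 Choosing a Suitable Size

def next_prime(n):
    """Return the smallest prime >= n, for use as a hash table size"""
    if n <= 2:
        return 2
    if n % 2 == 0:
        n += 1
    while not is_prime(n):
        n += 2
    return n

def is_prime(n):
    """Check whether n is prime"""
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    # Every remaining prime has the form 6k - 1 or 6k + 1
    i = 5
    while i * i <= n:
        if n % i == 0 or n % (i + 2) == 0:
            return False
        i += 6
    return True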
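
Why prefer a prime size? With modulo hashing, keys that share a factor with the table size pile into a fraction of the slots. The small experiment below, using a hypothetical helper `buckets_used`, shows the effect for keys that are all multiples of 10.

```python
def buckets_used(keys, size):
    """Number of distinct slots hit by simple modulo hashing."""
    return len({key % size for key in keys})


keys = range(0, 1000, 10)      # 100 keys, all multiples of 10
print(buckets_used(keys, 10))  # → 1
print(buckets_used(keys, 12))  # → 6
print(buckets_used(keys, 11))  # → 11
```

Only the prime size 11 uses every slot, because 11 shares no factor with the stride of the keys.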

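7.2 Custom Hash Functions

class CustomObject:
    def __init__(self, x, y):
        self.x = x
        self.y = y
    
    def __hash__(self):
        # Hash exactly the fields that __eq__ compares, so equal objects hash equally
        return hash((self.x, self.y))
    
    def __eq__(self, other):
        if not isinstance(other, CustomObject):
            return NotImplemented
        return self.x == other.x and self.y == other.y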
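
When the hash should simply follow the fields, the standard library can generate both methods. The sketch below uses `dataclasses.dataclass` with `frozen=True`, which makes instances immutable and derives `__eq__` and `__hash__` from the fields. The class name `Point` is chosen for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """Immutable value object; __eq__ and __hash__ are generated from x and y."""
    x: int
    y: int


visits = {Point(1, 2): "first"}
visits[Point(1, 2)] = "second"  # equal key, so the entry is overwritten
print(len(visits), visits[Point(1, 2)])  # → 1 second
```

Immutability matters here: if a field changed while the object was stored as a key, its hash would change and the entry could no longer be found.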