Registering a callback function from one class to another in Python

Reposted

架构魔法师 2024-10-16 20:11:49


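Background:

1. The company's crawler framework contains a huge pile of legacy spaghetti code: if-else statements running to more than ten thousand lines.

2. There is no clear documentation of which websites are accessed through Splash.

3. Splash is not deployed separately, and excessive concurrency causes OOM.

Goal:

From the crawler code, pick out the classes that use Splash and match each one to its website ID.

Prerequisites: Python and basic AST syntax (similar to the Babel AST).

Pitfalls:

1. Matching class names with regular expressions is slow, and matching the website URLs and class names inside if-elif statements is laborious.

2. More than ten thousand lines of nested if-else overflow the recursion stack.

Process:

1. Walk the directory that holds the crawler code and find all the .py files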

import os
import time


def get_file_name(path: str):
    """
    Find all .py files under the given path.
    :param path: root directory to scan
    :return: list of .py file paths
    """
    print("Scanning files under {}".format(path))
    file_start_time = time.time()
    file_list = []
    for root, dirs, files in os.walk(path):
        for file in files:
            # Keep only .py sources; this also skips compiled .pyc/.pyo files
            if not file.endswith(".py"):
                continue
            file_list.append(os.path.join(root, file))
    file_end_time = time.time()
    print("File scan finished in {} seconds".format(file_end_time - file_start_time))
    return file_list

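2. Go through these .py files and check whether each one uses Splash

We decide whether a file uses Splash by checking whether its content issues a Splash request. Then, using the AST, we iterate over the module's body nodes and read name from each class definition node to get the class name. Details below.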

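import ast

# Names of the classes found in files that use Splash
class_names = []


def get_splash(path: str):
    """
    Check whether the file uses Splash and, if so, record its class names.
    :param path: path of the .py file to check
    :return: None; class names are appended to the global class_names
    """
    with open(path, 'r', encoding="utf-8") as f:
        data = f.read()
    if "SplashRequest(" in data:
        # Walk the module's top-level body and collect every class definition
        tree = ast.parse(data)
        for node in tree.body:
            if isinstance(node, ast.ClassDef):
                class_names.append(node.name)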
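To make the class-name extraction concrete, here is a minimal sketch that runs the same idea on an in-memory string instead of a file. The sample source, the spider names, and the helper `top_level_class_names` are made up for illustration and are not part of the original code.

```python
import ast

# Hypothetical spider source, used only for this illustration.
SAMPLE_SOURCE = '''
import scrapy
from scrapy_splash import SplashRequest

class NewsSpider(scrapy.Spider):
    def start_requests(self):
        yield SplashRequest("https://example.com", self.parse)

class Helper:
    pass

def not_a_class():
    pass
'''


def top_level_class_names(source: str) -> list:
    """Return the top-level class names if the source mentions SplashRequest(."""
    if "SplashRequest(" not in source:
        return []
    # ast.parse only parses the text; nothing in the source is imported or run
    tree = ast.parse(source)
    return [node.name for node in tree.body if isinstance(node, ast.ClassDef)]


names = top_level_class_names(SAMPLE_SOURCE)
print(names)  # ['NewsSpider', 'Helper']
```

Note that the substring check is per file, so every top-level class in a matching file is reported, including ones like `Helper` that never issue a Splash request themselves.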

3. Match each URL to its class inside the if-elif chain of more than ten thousand lines


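Because there are so many if-elif branches, ast.dump(node) overflows the stack, so the node contents cannot be inspected that way. Instead, we paste the code directly into the AST explorer website. This is for reference only: the node names shown there differ somewhat from the ones Python's parser produces.

The analysis shows that we only need to look for IfStatement nodes and their test, consequent, and alternate fields. test is the condition, consequent is the code that runs when test is True, and alternate is the code that runs when test is False. In Python's built-in ast module the corresponding fields are test, body, and orelse.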
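The test / body / orelse layout is easy to check on a tiny chain. The snippet below is a small sketch with made-up URLs; it assumes the string literal sits on the left of the comparison, which is the shape the extraction code in this post expects. It also shows why long chains get deep: each elif is an ast.If stored inside the previous node's orelse.

```python
import ast

# Hypothetical three-branch chain, for illustration only.
SOURCE = '''
if "a.com" in url:
    x = 1
elif "b.com" in url:
    x = 2
else:
    x = 3
'''

outer = ast.parse(SOURCE).body[0]   # the `if` node
inner = outer.orelse[0]             # the `elif` is an ast.If nested in orelse

print(type(outer).__name__, outer.test.left.value)  # If a.com
print(type(inner).__name__, inner.test.left.value)  # If b.com
# The final `else:` body is a plain statement list in the innermost orelse
print(type(inner.orelse[0]).__name__)               # Assign
```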


4. First find the If node. The code is as follows:

The relevant AST documentation: ast: Abstract Syntax Trees (Python 3.11.4 documentation)

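import ast
import time

# Final mapping, filled in once the whole if-elif chain has been processed
website_info = {}


class NodeVisitor(ast.NodeVisitor):
    def visit_If(self, node):
        parse_start_time = time.time()
        print("Start traversing the if node")
        result, node = get_url_class(node, {}, 0)
        # A non-None node means the depth cap was hit; resume from it on a fresh stack
        while node:
            result, node = get_url_class(node, result, 0)
        parse_end_time = time.time()
        print("Node processing finished in", parse_end_time - parse_start_time, "seconds")
        return


def get_url_class(node, dict_info, num):
    """
    Extract the mapping between test and return from an if node.
    :param node: if node
    :param dict_info: mapping between class and url
    :param num: depth counter that keeps the stack from overflowing
    :return: (dict_info, node to resume from, or None when the chain is finished)
    """
    global website_info
    num += 1
    # More than 1000 nested ifs overflow the recursion stack. Python's default
    # recursion limit is also 1000, so lower this cap or raise the limit with
    # sys.setrecursionlimit if a RecursionError still occurs.
    if num >= 1000:
        return dict_info, node
    # .value replaces the deprecated .s attribute of string constants (Python 3.8+)
    if isinstance(node.test, ast.BoolOp):
        # Handle `if "..." or "...": return ...`
        values = node.test.values
        return_value = node.body[0].value.func.id
        dict_info[return_value] = []
        for value in values:
            test = value.left.value
            dict_info[return_value].append(test)
    else:
        # Handle `if "...": return ...`
        test = node.test.left.value
        return_value = node.body[0].value.func.id
        dict_info[return_value] = test

    # Handle the else node; orelse is an empty list when there is no else branch
    else_node = node.orelse[0] if node.orelse else None
    if isinstance(else_node, ast.If):
        return get_url_class(else_node, dict_info, num)
    else:
        website_info = dict_info
        return dict_info, None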

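The loop is there to work around the recursion stack limit of 1000: once the first thousand if statements have been processed, the function returns, the stack unwinds, and processing continues on a fresh stack.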
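As an alternative to recursing in batches, the chain can also be walked with a plain while loop, which uses no recursion at all. This is a sketch of a different technique, not the code used above; the function name `collect_url_class_pairs`, the sample source, and the spider names are hypothetical, and it assumes each branch has the shape `if "<url>" in url: return SomeClass()`.

```python
import ast


def collect_url_class_pairs(if_node: ast.If) -> dict:
    """Walk an if/elif chain iteratively: returned callable name -> list of URL strings."""
    mapping = {}
    node = if_node
    while isinstance(node, ast.If):
        test = node.test
        # `"a" in url or "b" in url` is a BoolOp; a single comparison is a Compare
        compares = test.values if isinstance(test, ast.BoolOp) else [test]
        urls = [c.left.value for c in compares
                if isinstance(c, ast.Compare) and isinstance(c.left, ast.Constant)]
        first = node.body[0] if node.body else None
        # Only record branches whose first statement is `return Name(...)`
        if (isinstance(first, ast.Return) and isinstance(first.value, ast.Call)
                and isinstance(first.value.func, ast.Name)):
            mapping.setdefault(first.value.func.id, []).extend(urls)
        # An elif is a single ast.If inside orelse; anything else ends the chain
        node = node.orelse[0] if len(node.orelse) == 1 else None
    return mapping


# Hypothetical dispatch function, for illustration only.
SAMPLE = '''
def pick(url):
    if "a.com" in url:
        return SpiderA()
    elif "b.com" in url or "c.com" in url:
        return SpiderB()
    else:
        return None
'''

func = ast.parse(SAMPLE).body[0]
result = collect_url_class_pairs(func.body[0])
print(result)  # {'SpiderA': ['a.com'], 'SpiderB': ['b.com', 'c.com']}
```

Unlike the recursive version, this sketch always stores the URLs as a list, so single-URL and multi-URL branches have the same value type.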

Summary:

This is just a filler post: like the sea, it is all water.
