Today I’d like to introduce a little Python library, ripyr, that I’ve toyed around with here and there for the past year or so. Originally it was written just as an excuse to try out some newer features in modern Python: asyncio and type hinting. The whole package is type hinted, which turned out to take a pretty low level of effort, and the asyncio ended up being pretty speedy.

But first, the goal: we wanted to stream through large datasets stored on disk in a memory-efficient way and parse out some basic metrics from them. Things like cardinality, what a date field’s format might be, an inferred type, that sort of thing.
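The basic idea can be sketched with an async generator that yields rows one at a time and updates metrics as it goes. This is a minimal illustration of the streaming approach, not ripyr’s actual internals; the `count_and_max` helper is hypothetical:

```python
import asyncio
import csv
import io

async def stream_rows(f):
    """Yield CSV rows one at a time instead of loading the whole file."""
    reader = csv.DictReader(f)
    for row in reader:
        yield row
        await asyncio.sleep(0)  # give control back to the event loop

async def count_and_max(f, column):
    """Accumulate a count and a running max over a single streaming pass."""
    count, maximum = 0, None
    async for row in stream_rows(f):
        count += 1
        value = float(row[column])
        maximum = value if maximum is None else max(maximum, value)
    return count, maximum

data = io.StringIO("C\n1\n5\n3\n")
print(asyncio.run(count_and_max(data, "C")))  # (3, 5.0)
```

Only one row is resident at a time, so memory stays flat no matter how big the file is.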

It’s an interesting use-case because, in many cases, pandas is actually really performant here. The way pandas pulls data off of disk into a dataframe can balloon memory consumption for a short time, making analysis of very large files prohibitive, but short of that it’s pretty fast and easy. So if you’re dealing with smaller datasets, keep that in mind; YMMV.
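For what it’s worth, pandas itself offers a middle ground: `read_csv` accepts a `chunksize`, so only one chunk is resident at a time. A small sketch (the in-memory CSV stands in for a file on disk):

```python
import io

import pandas as pd

# Stand-in for a large file on disk: 1000 rows cycling through 100 values.
csv_data = io.StringIO("B\n" + "\n".join(str(i % 100) for i in range(1000)))

# Stream the file in chunks; only one chunk is in memory at a time.
seen = set()
total = 0
for chunk in pd.read_csv(csv_data, chunksize=250):
    total += len(chunk)
    seen.update(chunk["B"].unique())

print(total, len(seen))  # 1000 100
```

This avoids the transient memory spike, though you give up having the whole dataframe available at once.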

So using asyncio to lower memory overhead is a great benefit, but additionally, I wanted to make a nicer interface for the developer. Anyone who’s written a lot of pandas-based code can probably parse what this is doing (and it could probably be done in some much nicer ways), but it’s not super pretty:

import pandas as pd

df = pd.read_csv('sample.csv')
df['date'] = pd.to_datetime(df['date'], infer_datetime_format=True)
report = {
    "columns": df.columns.values.tolist(),
    "metrics": {
        "A": {
            "count": df.shape[0]
        },
        "B": {
            "count": df.shape[0],
            "approximate_cardinality": len(pd.unique(df['B']))
        },
        "C": {
            "count": df.shape[0],
            "approximate_cardinality": len(pd.unique(df['C'])),
            "max": df['C'].max()
        },
        "D": {
            "count": df.shape[0]
        },
        "date": {
            "count": df.shape[0],
            "estimated_schema": 'pandas-internal'
        }
    }
}

The equivalent with ripyr is:

import json  # ripyr imports (StreamingColCleaner, sources, metrics) omitted

cleaner = StreamingColCleaner(source=CSVDiskSource(filename='sample.csv'))
cleaner.add_metric_to_all(CountMetric())
cleaner.add_metric('B', [CountMetric(), CardinalityMetric()])
cleaner.add_metric('C', [CardinalityMetric(size=10e6), MaxMetric()])
cleaner.add_metric('date', DateFormat())
cleaner.process_source(limit=10000, prob_skip=0.5)
print(json.dumps(cleaner.report(), indent=4, sort_keys=True))

I think that is a lot more readable. In that second-to-last line you’ll also see a third and final cool thing we do in ripyr. In this example, we will end up with a sample of 10,000 records regardless of the total size of the dataset, and will skip rows as they come off of disk with a probability of 50%. So running this analysis on just the first N rows, or on a random sample of M rows, is super easy. In many cases that’s all you need, so we don’t have to pull huge datasets into memory in the first place.
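The sampling idea is simple enough to sketch in a few lines. This is an illustration of the `limit`/`prob_skip` behavior, not ripyr’s actual implementation; `sample_rows` is a hypothetical helper:

```python
import random

def sample_rows(rows, limit, prob_skip, seed=0):
    """Skip each row with probability `prob_skip`; stop after `limit` kept rows."""
    rng = random.Random(seed)  # seeded for reproducibility
    kept = []
    for row in rows:
        if rng.random() < prob_skip:
            continue  # skip this row as it comes off of disk
        kept.append(row)
        if len(kept) >= limit:
            break  # stop reading once the sample is full
    return kept

sample = sample_rows(range(1_000_000), limit=10_000, prob_skip=0.5)
print(len(sample))  # 10000
```

Because the loop breaks as soon as the limit is hit, the cost is proportional to the sample size, not the file size.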

Currently there are two supported source types:

  • CSVDisk: a CSV file on disk, backed by the standard Python CSV reader
  • JSONDisk: a file with one JSON blob per line, with no newlines anywhere inside the JSON itself

For a given source, you can apply any of a number of metrics:

  • categorical: approximate cardinality, based on a bloom filter
  • dates: date format inference
  • inference: type inference
  • numeric: count, min, max, histogram
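The bloom-filter cardinality metric is the most interesting of these: mark a few hash-derived bit positions per value, and count a value as new only if at least one of its positions was unset. A toy sketch of the idea (not ripyr’s implementation; collisions mean the count is approximate, which is the trade-off for fixed memory):

```python
import hashlib

class BloomCardinality:
    """Approximate distinct-count using a bloom filter's bit array."""

    def __init__(self, size=10**6, hashes=4):
        self.size = size
        self.hashes = hashes
        self.bits = bytearray(size)   # fixed memory, regardless of data size
        self.cardinality = 0

    def _positions(self, item):
        # Derive `hashes` positions from slices of a single sha256 digest.
        digest = hashlib.sha256(str(item).encode()).digest()
        for i in range(self.hashes):
            chunk = digest[i * 4:(i + 1) * 4]
            yield int.from_bytes(chunk, "big") % self.size

    def add(self, item):
        new = False
        for pos in self._positions(item):
            if not self.bits[pos]:
                self.bits[pos] = 1
                new = True  # at least one unset bit: definitely unseen
        if new:
            self.cardinality += 1

bloom = BloomCardinality()
for value in [1, 2, 2, 3, 1, 4]:
    bloom.add(value)
print(bloom.cardinality)  # 4
```

The `size=10e6` argument in the ripyr example above presumably controls this bit-array size, trading memory for accuracy.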

It’s not on PyPI or anything yet, and it isn’t super actively developed, since it was mostly just a learning exercise for asyncio and type hinting, but if you think it’s interesting, find me on GitHub or comment below. I’d be happy to collaborate with others who find this sort of thing useful.

https://github.com/predikto/ripyr
