How we built a serverless Spark platform: a video tour of Data Mechanics



Our mission at Data Mechanics is to give data engineers and data scientists the ability to build pipelines and models over large datasets with the simplicity of running a script on their laptop. Let them focus on their data, while we handle the mechanics of infrastructure management.

So, we built a serverless Spark platform, an easier-to-use and more performant alternative to services like Amazon EMR, Google Dataproc, Azure HDInsight, Databricks, Qubole, Cloudera, and Hortonworks.

In this video, we will give you a product tour of our platform and some of its core features:

  1. How to connect a Jupyter notebook to the platform and play with Spark interactively
  2. How to submit applications programmatically using our API or our Airflow integration (a sketch follows this list)
  3. How to monitor logs and metrics for your Spark app from our dashboard
  4. How to track the costs, stability, and performance of your jobs (recurring apps) over time
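
For illustration, here is a minimal Python sketch of what programmatic submission can look like. The endpoint path, payload fields, and authentication shown here are hypothetical placeholders, not the documented Data Mechanics API; see our docs for the real interface.

```python
import requests

# Hypothetical endpoint and payload shape, for illustration only;
# the URL and field names below are NOT the documented Data Mechanics API.
API_URL = "https://<your-cluster>.datamechanics.co/api/apps"  # placeholder URL
payload = {
    "appName": "daily-etl",                               # placeholder values
    "mainApplicationFile": "s3://my-bucket/jobs/etl.py",
    "sparkVersion": "3.0.0",
}

resp = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": "Bearer <API_TOKEN>"},  # placeholder token
)
resp.raise_for_status()
print("Submitted:", resp.json())
```

The same call can be wrapped in an Airflow task, so scheduled pipelines submit their Spark apps through the platform rather than managing clusters themselves.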

Demo video: Data Mechanics Intro to Spark & Product Tour

What makes Data Mechanics a Serverless Spark platform?

Our autopilot features

Our platform dynamically and continuously optimizes the infrastructure parameters and Spark configurations of each of your Spark applications to make them stable and performant. Here are some of the parameters we tune (a sketch of representative settings follows the list):

  • The container sizes (memory, CPU) — to keep your app stable (avoiding OutOfMemory errors), to optimize the binpacking of containers on your nodes, and to boost the performance of your app by acting on its bottleneck (memory-bound, CPU-bound, or I/O-bound)
  • The default number of partitions used by Spark, to increase its degree of parallelism
  • The disk sizes and the shuffle and I/O configurations, to make sure data transfer phases run at their optimal speed
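
To make this concrete, the sketch below shows the kinds of standard Spark properties this tuning acts on. The values are illustrative placeholders only; on Data Mechanics they are chosen automatically, per application, from its history.

```python
from pyspark.sql import SparkSession

# Standard Spark properties in the categories listed above.
# Values are illustrative placeholders, not recommendations;
# the platform picks them automatically for each application.
spark = (
    SparkSession.builder
    .appName("autotuning-illustration")
    # Container sizes (memory, CPU)
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.executor.memoryOverhead", "512m")
    # Default number of partitions / degree of parallelism
    .config("spark.sql.shuffle.partitions", "200")
    .config("spark.default.parallelism", "200")
    # Shuffle and I/O behavior
    .config("spark.reducer.maxSizeInFlight", "48m")
    .config("spark.shuffle.file.buffer", "32k")
    .getOrCreate()
)
```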

Our automated tuning feature is trained on the past runs of a recurring application. It automatically reacts to changes in your code or input data, so that your apps stay stable and performant over time without any manual action from you.

How to automate performance tuning for Apache Spark

In addition to autotuning, our second autopilot feature is autoscaling. We support two levels of autoscaling:

  • At the application level: each Spark app dynamically scales its number of executors based on load (dynamic allocation; see the sketch after this list)
  • At the cluster level: the Kubernetes cluster automatically adds and removes nodes from the cloud provider
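
A minimal sketch of the application-level piece, using standard Spark dynamic allocation settings. The values below are placeholders; shuffle tracking is what lets this work on Kubernetes without YARN's external shuffle service.

```python
from pyspark.sql import SparkSession

# Application-level autoscaling via standard Spark dynamic allocation.
# Values are illustrative placeholders.
spark = (
    SparkSession.builder
    .appName("autoscaling-illustration")
    .config("spark.dynamicAllocation.enabled", "true")
    # On Kubernetes, shuffle tracking allows executors to be removed
    # safely without the external shuffle service used on YARN.
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)
```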

This model lets each app work in complete isolation (with its own Spark version, dependencies, and resources) while keeping your infrastructure cost-efficient at all times.

Our cloud-native containerization

Data Mechanics is deployed on a Kubernetes cluster in our customers’ cloud accounts (while most other platforms still run Spark on YARN, Hadoop’s scheduler).

This deployment model has key benefits:

  • An airtight security model: our customers’ sensitive data stays in their cloud account and VPC.
  • Native Docker support: our customers can use our set of pre-built, optimized Spark Docker images, or build their own Docker images to package their dependencies in a reliable way. Learn more about using custom Docker images on Data Mechanics.
  • Integration with the rich tools from the Kubernetes ecosystem.
  • Cloud agnosticity: Data Mechanics is available on AWS, GCP, and Azure.


The Pros and Cons of running Apache Spark on Kubernetes (instead of YARN)

Our serverless pricing model

Competing data platforms’ pricing models are based on server uptime. For each instance type, they’ll charge you an hourly fee, whether or not the instance is actually used to run Spark apps. This puts the burden on Spark developers to manage their clusters efficiently and make sure they’re not wasting resources due to over-provisioning or parallelism issues.

Instead, the Data Mechanics fee is based on the sum of the durations of all the Spark tasks (the units of work distributed by Spark, reported with millisecond accuracy; see the sketch after the list below). This means our platform only makes money when our users do real work. We don’t make money:

  • When an application is completely idle (because you took a break from your notebook and forgot to scale down your cluster)
  • When most of your application’s resources are waiting on a straggler task to finish
  • When you run a Spark driver-only operation (pure Scala or Python code)
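
To make the billing unit concrete, here is a sketch of how total task time can be computed from Spark’s standard monitoring REST API. This is shown for illustration only; it is not how Data Mechanics meters usage internally, and the host and application ID below are placeholders.

```python
import requests

# Sum executor task time across all stages of one application,
# using Spark's standard monitoring REST API (live UI or history server).
BASE = "http://localhost:18080/api/v1"   # placeholder history-server address
APP_ID = "app-20210101000000-0001"       # placeholder application ID

stages = requests.get(f"{BASE}/applications/{APP_ID}/stages").json()

# Each stage object reports executorRunTime in milliseconds.
total_task_ms = sum(stage["executorRunTime"] for stage in stages)
print(f"Total Spark task time: {total_task_ms / 1000.0:.1f} s")
```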

As a result, Data Mechanics will aggressively scale down your apps when they’re idle, reducing your cloud costs (without impacting our revenue). In fact, the savings we generate on your cloud costs will typically cover or even exceed the fee we charge for our services.

I’d like to try this, how do I get started?

Great! The first step is to book a demo with our team so we can learn more about your use case. After this initial chat, we’ll invite you to a shared Slack channel — we use Slack for our support, and we’re very responsive there. We’ll send you instructions on how to give us permissions on the AWS, GCP, or Azure account of your choice, and once we have these permissions we’ll deploy Data Mechanics and you’ll be ready to get started using our docs.

There are other features we didn’t get to cover in this post — like our support for spot/preemptible nodes, our support for private clusters (cut off from the internet), our Spark UI replacement project, and our integrations with CI/CD tools and with machine learning model tracking and serving tools. So stay tuned, and reach out if you’re curious to learn more.

Translated from: https://towardsdatascience.com/how-we-built-a-serverless-spark-platform-video-tour-of-data-mechanics-583d1b9f6cb0
