What is observability?

More than just logs, metrics, and traces

By Jay Livens

Translated by 老夏 (member of the 优维科技 expert group)

As dynamic systems architectures increase in complexity and scale, IT teams face mounting pressure to track and respond to conditions and issues across their multi-cloud environments. As a result, IT operations, DevOps, and SRE teams are all looking for greater observability into these increasingly diverse and complex computing environments.

But what is observability? Why is it important, and what can it actually help organizations achieve?



What is observability?


In IT and cloud computing, observability is the ability to measure a system’s current state based on the data it generates, such as logs, metrics, and traces.

Observability relies on telemetry derived from instrumentation that comes from the endpoints and services in your multi-cloud computing environments. In these modern environments, every hardware, software, and cloud infrastructure component and every container, open-source tool, and microservice generates records of every activity. The goal of observability is to understand what’s happening across all these environments and among the technologies, so you can detect and resolve issues to keep your systems efficient and reliable and your customers happy.

Organizations usually implement observability using a combination of instrumentation methods including open-source instrumentation tools, such as OpenTelemetry.

Many organizations also adopt an observability solution to help them detect and analyze the significance of events to their operations, software development life cycles, application security, and end-user experiences.

Observability has become more critical in recent years, as cloud-native environments have gotten more complex and the potential root causes for a failure or anomaly have become more difficult to pinpoint. As teams begin collecting and working with observability data, they are also realizing its benefits to the business, not just IT.

Because cloud services rely on a uniquely distributed and dynamic architecture, observability may also sometimes refer to the specific software tools and practices businesses use to interpret cloud performance data. Although some people may think of observability as a buzzword for sophisticated application performance monitoring (APM), there are a few key distinctions to keep in mind when comparing observability and monitoring.



What is the difference between monitoring and observability?


Is observability really monitoring by another name? In short, no. While observability and monitoring are related — and can complement one another — they are actually different concepts.

In a monitoring scenario, you typically preconfigure dashboards that are meant to alert you to performance issues you expect to see later. However, these dashboards rely on the key assumption that you’re able to predict what kinds of problems you’ll encounter before they occur.

Cloud-native environments don’t lend themselves well to this type of monitoring because they are dynamic and complex, which means you have no way of knowing in advance what kinds of problems might arise.

In an observability scenario, where an environment has been fully instrumented to provide complete observability data, you can flexibly explore what’s going on and quickly figure out the root cause of issues you may not have been able to anticipate.
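
One way to picture the difference, as a minimal sketch: the first function below is a check configured in advance for one anticipated failure, while the second groups raw events by any attribute chosen at query time. The event fields (`service`, `region`, `latency_ms`, `status`), the sample data, and the 500 ms limit are hypothetical and serve only as illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical request events; field names and values are illustrative assumptions.
EVENTS = [
    {"service": "checkout", "region": "eu", "latency_ms": 120, "status": 200},
    {"service": "checkout", "region": "us", "latency_ms": 900, "status": 200},
    {"service": "search", "region": "us", "latency_ms": 850, "status": 500},
    {"service": "search", "region": "eu", "latency_ms": 110, "status": 200},
]


def threshold_alert(events, service, limit_ms=500):
    """Monitoring style: a check defined ahead of time for one known failure mode."""
    latencies = [e["latency_ms"] for e in events if e["service"] == service]
    return bool(latencies) and mean(latencies) > limit_ms


def explore(events, dimension):
    """Observability style: average latency grouped by any attribute, chosen ad hoc."""
    groups = defaultdict(list)
    for e in events:
        groups[e[dimension]].append(e["latency_ms"])
    return {key: mean(values) for key, values in groups.items()}
```

With this sample data the per-service check on `search` stays quiet because its average latency is 480 ms, yet `explore(EVENTS, "region")` shows that the `us` region averages 875 ms. That is the kind of question nobody thought to put on a dashboard beforehand.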



Why is observability important?


In enterprise environments, observability helps cross-functional teams understand and answer specific questions about what’s happening in highly distributed systems. Observability enables you to understand what is slow or broken and what needs to be done to improve performance. With an observability solution in place, teams can receive alerts about issues and proactively resolve them before they impact users.

Because modern cloud environments are dynamic and constantly changing in scale and complexity, most problems are neither known nor monitored. Observability addresses this common issue of “unknown unknowns,” enabling you to continuously and automatically understand new types of problems as they arise.

Observability is also a critical capability of artificial intelligence for IT operations (AIOps). As more organizations adopt cloud-native architectures, they are also looking for ways to implement AIOps, harnessing AI as a way to automate more processes throughout the DevSecOps life cycle. By bringing AI to everything — from gathering telemetry to analyzing what’s happening across the full technology stack — your organization can have the reliable answers essential for automating application monitoring, testing, continuous delivery, application security, and incident response.

The value of observability doesn’t stop at IT use cases. Once you begin collecting and analyzing observability data, you have an invaluable window into the business impact of your digital services. This visibility enables you to optimize conversions, validate that software releases meet business goals, measure the outcomes of your user experience SLOs, and prioritize business decisions based on what matters most.
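
As a rough sketch of what measuring a user-experience SLO could look like, the function below computes the share of requests that finished within a latency limit and how much of the error budget has been spent. The function name, its parameters, and the idea of defining the SLO purely on latency are assumptions made for illustration, not something the article prescribes.

```python
def slo_report(latencies_ms, threshold_ms, target):
    """Summarise a latency SLO over a window of observed requests.

    latencies_ms: observed user-facing latencies (hypothetical sample data)
    threshold_ms: a request counts as "good" when it finishes within this limit
    target: required fraction of good requests, e.g. 0.75 for 75%
    """
    if not latencies_ms:
        raise ValueError("no measurements to evaluate")

    total = len(latencies_ms)
    good = sum(1 for ms in latencies_ms if ms <= threshold_ms)
    bad = total - good
    allowed_bad = (1 - target) * total  # the error budget, expressed in requests

    if bad == 0:
        budget_used = 0.0
    elif allowed_bad == 0:
        budget_used = float("inf")  # a 100% target leaves no room for any failure
    else:
        budget_used = bad / allowed_bad

    compliance = good / total
    return {"compliance": compliance, "met": compliance >= target, "budget_used": budget_used}
```

For example, eight page loads with one slower than the limit and a 75% target yield a compliance of 0.875 with half of the error budget consumed.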

When an observability solution also analyzes user experience data using synthetic and real-user monitoring, you can discover problems before your users do and design better user experiences based on real, immediate feedback.



Benefits of observability


Observability delivers powerful benefits to IT teams, organizations, and end-users alike. Here are some of the use cases observability facilitates:

  1. Application performance monitoring: Full end-to-end observability enables organizations to get to the bottom of application performance issues much faster, including issues that arise from cloud-native and microservices environments. An advanced observability solution can also be used to automate more processes, increasing efficiency and innovation among Ops and Apps teams.
  2. DevSecOps and SRE: Observability is not just the result of implementing advanced tools, but a foundational property of an application and its supporting infrastructure. The architects and developers who create the software must design it to be observed. Then DevSecOps and SRE teams can leverage and interpret the observable data during the software delivery life cycle to build better, more secure, more resilient applications.
  3. Infrastructure, cloud, and Kubernetes monitoring: Infrastructure and operations (I&O) teams can leverage the enhanced context an observability solution offers to improve application uptime and performance, cut down the time required to pinpoint and resolve issues, detect cloud latency issues and optimize cloud resource utilization, and improve administration of their Kubernetes environments and modern cloud architectures.
  4. End-user experience: A good user experience can enhance a company’s reputation and increase revenue, delivering an enviable edge over the competition. By spotting and resolving issues well before the end-user notices and making an improvement before it’s even requested, an organization can boost customer satisfaction and retention. It’s also possible to optimize the user experience through real-time playback, gaining a window directly into the end-user’s experience exactly as they see it, so everyone can quickly agree on where to make improvements.
  5. Business analytics: Organizations can combine business context with full stack application analytics and performance to understand real-time business impact, improve conversion optimization, ensure that software releases meet expected business goals, and confirm that the organization is adhering to internal and external SLAs.

DevSecOps teams can tap observability to get more insights into the apps they develop, and automate testing and CI/CD processes so they can release better quality code faster. This means organizations waste less time on war rooms and finger-pointing. Not only is this a benefit from a productivity standpoint, but it also strengthens the positive working relationships that are essential for effective collaboration.

These organizational improvements open the door to further innovation and digital transformation. And, more importantly, the end-user ultimately benefits in the form of a high-quality user experience.



How do you make a system observable?


If you’ve read about observability, you likely know that logs, metrics, and distributed traces are the three key pillars of telemetry collection. However, observing raw telemetry from back-end applications alone does not provide the full picture of how your systems are behaving.

Neglecting the front-end perspective potentially skews or even misrepresents the full picture of how your applications and infrastructure are performing in the real world for real users. Extending the three-pillars approach, IT teams must augment telemetry collection with user-experience data to eliminate blind spots:

  1. Logs: These are structured or unstructured text records of discrete events that occurred at a specific time.
  2. Metrics: These are the values represented as counts or measures that are often calculated or aggregated over a period of time. Metrics can originate from a variety of sources, including infrastructure, hosts, services, cloud platforms, and external sources.
  3. Distributed tracing: This displays activity of a transaction or request as it flows through applications and shows how services connect, including code-level details.
  4. User experience: This extends traditional observability telemetry by adding the outside-in user perspective of a specific digital experience on an application, even in pre-production environments.
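
The four signal types above can be sketched as plain data structures. This is a minimal illustration using only the Python standard library. The class names, fields, and the `record_checkout` helper are hypothetical and do not follow any particular vendor's or OpenTelemetry's schema.

```python
import time
import uuid
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class LogRecord:
    """A discrete event that occurred at a specific time."""
    timestamp: float
    level: str
    message: str
    trace_id: str


@dataclass
class Span:
    """One hop of a request; spans sharing a trace_id form a distributed trace."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    duration_ms: float


@dataclass
class UserTiming:
    """An outside-in measurement taken from the user's side."""
    trace_id: str
    page: str
    load_ms: float


# Metric: a count aggregated over time, keyed by service name.
request_count = Counter()


def record_checkout(load_ms: float):
    """Emit one signal of each kind for a single hypothetical checkout request."""
    trace_id = uuid.uuid4().hex
    root = Span(trace_id, uuid.uuid4().hex[:16], None, "frontend", load_ms)
    child = Span(trace_id, uuid.uuid4().hex[:16], root.span_id, "payment", load_ms * 0.6)
    log = LogRecord(time.time(), "INFO", "checkout completed", trace_id)
    request_count["frontend"] += 1
    request_count["payment"] += 1
    timing = UserTiming(trace_id, "/checkout", load_ms)
    return log, [root, child], timing
```

The shared `trace_id` is the design choice worth noting: it is what lets a log line, a span, and a user-side timing be tied back to the same request later.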

Why the three pillars of observability aren’t enough


Obviously, data collection is only the start. Simply having access to the right logs, metrics, and traces isn’t enough to gain true observability into your environment. Once you’re able to use that telemetry data to achieve the end goals of improving end-user experience and business outcomes, only then can you really say you’ve achieved the purpose of observability.

There are other observability capabilities organizations can use to observe their environments. Open-source solutions, such as OpenTelemetry, provide a de facto standard for collecting telemetry data in cloud settings. These open-source solutions enhance observability for cloud-native applications and make it easier for developers and operations teams to achieve a consistent understanding of application health across multiple environments.

Organizations can also use real user monitoring to gain real-time visibility into the user experience, tracking the path of a single request and gaining insight into every interaction it has with every service along the way. This experience can be observed by synthetic monitoring or even a recording of the actual session. These capabilities extend telemetry by adding in data for APIs, third-party services, errors occurring in the browser, user demographics, and application performance from the user perspective. This gives IT, DevSecOps, and SRE teams the ability not only to see the complete end-to-end journey of a request but also to access real-time insight into system health. From there, they can proactively troubleshoot areas of degrading health before they impact application performance. They can also more easily recover from failures and gain a more granular understanding of the user experience.
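
To illustrate what tracking the path of a single request can mean in practice, here is a small sketch that rebuilds the call tree of one request from a flat list of spans. The span dictionaries and their keys (`trace_id`, `span_id`, `parent_id`, `service`, `duration_ms`) are assumed for the example and are not tied to a specific tracing product.

```python
# Hypothetical spans collected from the browser and several back-end services.
SPANS = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "browser", "duration_ms": 480},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "api-gateway", "duration_ms": 430},
    {"trace_id": "t1", "span_id": "c", "parent_id": "b", "service": "inventory", "duration_ms": 40},
    {"trace_id": "t1", "span_id": "d", "parent_id": "b", "service": "payment", "duration_ms": 370},
    {"trace_id": "t2", "span_id": "e", "parent_id": None, "service": "browser", "duration_ms": 90},
]


def request_path(spans, trace_id):
    """Return (depth, service, duration_ms) tuples for one request, in call order."""
    # Index the spans of this trace by their parent so the tree can be walked top-down.
    children = {}
    for span in spans:
        if span["trace_id"] == trace_id:
            children.setdefault(span["parent_id"], []).append(span)

    path = []

    def walk(parent_id, depth):
        for span in children.get(parent_id, []):
            path.append((depth, span["service"], span["duration_ms"]))
            walk(span["span_id"], depth + 1)

    walk(None, 0)  # the root span is the one without a parent
    return path
```

Calling `request_path(SPANS, "t1")` walks from the browser through the gateway to its two downstream services, which makes it visible that most of the 480 ms seen by the user was spent in `payment`.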

While IT organizations have the best of intentions and strategy, they often overestimate the ability of already overburdened teams to constantly observe, understand, and act upon an impossibly overwhelming amount of data and insights. Although there are many complex challenges associated with observability, the organizations that overcome these challenges will find it worth their while.


What are the challenges of observability?


Observability has always been a challenge, but cloud complexity and the rapid pace of change have made it an urgent issue for organizations to address. Cloud environments generate a far greater volume of telemetry data, particularly when microservices and containerized applications are involved. They also create a far greater variety of telemetry data than teams have ever had to interpret in the past. Lastly, the velocity with which all this data arrives makes it that much harder to keep up with the flow of information, let alone accurately interpret it in time to troubleshoot a performance issue.

Organizations also frequently run into the following challenges with observability:

  • Data silos: Multiple agents, disparate data sources, and siloed monitoring tools make it hard to understand interdependencies across applications, multiple clouds, and digital channels, such as web, mobile, and IoT.
  • Volume, velocity, variety, and complexity: It’s nearly impossible to get answers from the sheer amount of raw data collected from every component in ever-changing modern cloud environments, such as AWS, Azure, and Google Cloud Platform (GCP). This is also true for Kubernetes and containers that can spin up and down in seconds.
  • Manual instrumentation and configuration: When IT resources are forced to manually instrument and change code for every new type of component or agent, they spend most of their time trying to set up observability rather than innovating based on insights from observability data.
  • Lack of pre-production: Even with load testing in pre-production, developers still don’t have a way to observe or understand how real users will impact applications and infrastructure before they push code into production.
  • Wasting time troubleshooting: Application, operations, infrastructure, development, and digital experience teams are pulled in to troubleshoot and try to identify the root cause of problems, wasting valuable time guessing and trying to make sense of telemetry and come up with answers.

Then, there’s the issue of multiple tools and vendors. While a single tool may give an organization observability into one specific area of their application architecture, that one tool may not provide complete observability across all the applications and systems that can affect application performance.

Also, not all types of telemetry data are equally useful for determining the root cause of a problem or understanding its impact on the user experience. As a result, teams are still left with the time-consuming task of digging for answers across multiple solutions and painstakingly interpreting the telemetry data, when they could be applying their expertise toward fixing the problem right away. However, with a single source of truth, teams can get answers and troubleshoot issues much faster.
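
A very small sketch of the idea, assuming every tool already stamps its records with a shared `trace_id` and a timestamp `ts` (both field names are hypothetical): merging the separate feeds into one timeline per request removes the need to jump between tools.

```python
from collections import defaultdict


def correlate(**sources):
    """Merge records from separate tools into one time-ordered timeline per trace_id.

    Each keyword argument maps a source name (e.g. logs, traces, rum) to a list
    of dicts that carry at least "trace_id" and "ts".
    """
    timelines = defaultdict(list)
    for source, records in sources.items():
        for record in records:
            # Keep the original fields and remember which tool the record came from.
            timelines[record["trace_id"]].append({**record, "source": source})
    for events in timelines.values():
        events.sort(key=lambda event: event["ts"])
    return dict(timelines)
```

Usage might look like `correlate(logs=log_records, traces=span_records, rum=browser_records)`. The hard part in real systems is getting every tool to emit the same identifier in the first place, which this sketch simply takes for granted.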



The importance of a single source of truth


Organizations need a single source of truth to gain complete observability across their application infrastructure and accurately pinpoint the root causes of performance issues. When organizations have a single platform that can tame cloud complexity, capture all the relevant data, and analyze it with AI, teams can instantly identify the root cause of any problem, whether it lies in the application itself or the supporting architecture.

A single source of truth enables teams to:

  • Turn terabytes of telemetry data into real answers, rather than asking IT teams to cobble together an understanding of what has happened using snippets of data from disparate sources.
  • Gain crucial contextual insights into areas of the infrastructure they might not have otherwise been able to see.
  • Work collaboratively and accelerate the troubleshooting process further, empowering the organization to act faster than it could by using traditional monitoring tools thanks to enhanced visibility.


Bring observability to everything


You can’t waste months or years trying to build your own tools or test out multiple vendors that only enable you to solve one piece of the observability puzzle. Instead, you need a solution that can help make all your systems and applications observable, give you actionable answers, and provide technical and business value as fast as possible.


Advanced observability from HyperInsight provides all these capabilities in a single platform, empowering your organization to tame modern cloud complexity and transform faster.
