Table of Contents

  • 1. Overall
  • 1.1 main() function
  • 1.2 configuration
  • 1.3 Component initialization
  • 1.4 Starting the service components
  • 2. Other tools used in the project
  • 2.1 Goroutine orchestration framework: the oklog/run package
  • 2.2 Command-line and flag parser: kingpin.v2
  • 3 Service discovery (serviceDiscover)
  • 4 Metrics scraping (scrapeManager)
  • 5 Metrics caching (scrapeCache)
  • 6 Notification management (notifierManager)
  • 7 TSDB

1. Overall

Prometheus is both a de facto standard of the container era and a solution to application metrics monitoring. Any independent data source (target) is called an instance, and a collection of instances of the same type is called a job. In short, each job has a corresponding scrape pool and each target has a corresponding loop; inside each loop an HTTP GET request is executed to pull data, and a set of control parameters governs the scrape interval, termination, and related logic.

1. Time series database module (TSDB)
2. Configuration reloading module (Configuration Reloader)
3. Service discovery module (Service Discovery Manager)
4. Data scraping module (Scrape Manager)
5. REST API module (Web Handler)
6. Query engine module (Query Engine & PromQL)

Prometheus provides client libraries that help developers expose monitoring metrics in the Prometheus exposition format from their own applications. For scenarios where a client library cannot be embedded directly in the code, for example third-party applications that you do not maintain, or applications that do not speak HTTP, a dedicated Exporter program has to be written. The Exporter acts as a proxy that exposes the monitoring data, e.g. the MySQL Exporter or the Node Exporter.
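For example, here is a minimal sketch of instrumenting one's own Go service with the official client_golang library (the metric name myapp_requests_total and the port are illustrative, not from this article):

package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// requests is an illustrative counter exposed in the Prometheus text format.
var requests = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "myapp_requests_total",
	Help: "Total number of handled requests.",
})

func main() {
	prometheus.MustRegister(requests)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		requests.Inc()
		w.Write([]byte("ok"))
	})
	// Prometheus scrapes this endpoint to collect the metrics.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}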


(Figure: Prometheus pprof analysis tool)

1.1 main() function

First, the main function parses the command-line arguments and reads the configuration file (given by the --config.file flag). Prometheus deliberately distinguishes flag-based configuration from file-based configuration: the former is meant for simple settings and does not support hot reloading (a change requires restarting the Prometheus server), while the latter can be reloaded at runtime.

The main function then initializes and starts all the components:
1. Termination Handler, Service Discovery Manager, Web Handler, and so on.
2. Each component runs as an independent goroutine; they coordinate with each other through synchronization channels, object references, and context passing.
3. The cooperation of these goroutines is orchestrated by oklog/run, an actor-style goroutine orchestration framework that runs multiple goroutines as one unit and shuts them down in order.

  • As a first step, main() defines and parses the server’s command-line flags into a local configuration structure
  • Next, main() instantiates all the major run-time components of Prometheus
  • Finally, the server runs all components in an actor-like model, using github.com/oklog/oklog/pkg/group

1.2 configuration

  • Configuration reading and parsing

Each job_name corresponds to one TargetGroups; each TargetGroups can contain multiple providers, and each provider holds the corresponding Discoverer implementation, the job_name, and so on. The mapping is therefore: job_name -> TargetGroups -> multiple targets -> multiple providers -> multiple Discoverers.

1. For the service components remoteStorage, webHandler, notifierManager and scrapeManager, the ApplyConfig method
receives the entire prometheus.yml file through its cfg *config.Config parameter.
// prometheus/scrape/manager.go

func (m *Manager) ApplyConfig(cfg *config.Config) error {
   .......
}
2. For the service components discoveryManagerScrape and discoveryManagerNotify, the ApplyConfig method receives only one section of the configuration file through its parameter.
// prometheus/discovery/manager.go
 
func (m *Manager) ApplyConfig(cfg map[string]sd_config.ServiceDiscoveryConfig) error {
     ......
}
Therefore, anonymous functions are used to pre-process the configuration first and extract the corresponding section:
// prometheus/cmd/prometheus/main.go
 
// Extract the scrape_configs section from the configuration file
func(cfg *config.Config) error {
    c := make(map[string]sd_config.ServiceDiscoveryConfig)
    for _, v := range cfg.ScrapeConfigs {
        c[v.JobName] = v.ServiceDiscoveryConfig
    }
    return discoveryManagerScrape.ApplyConfig(c)
},
// Extract the alerting section from the configuration file
func(cfg *config.Config) error {
    c := make(map[string]sd_config.ServiceDiscoveryConfig)
    for _, v := range cfg.AlertingConfig.AlertmanagerConfigs {
        // AlertmanagerConfigs doesn't hold an unique identifier so we use the config hash as the identifier.
        b, err := json.Marshal(v)
        if err != nil {
            return err
        }
        c[fmt.Sprintf("%x", md5.Sum(b))] = v.ServiceDiscoveryConfig
    }
    return discoveryManagerNotify.ApplyConfig(c)
},

3. For the ruleManager service component, the rule_files section is extracted in an anonymous function:
prometheus/cmd/prometheus/main.go
 
// Extract the rule_files section from the configuration file
func(cfg *config.Config) error {
    // Get all rule files matching the configuration paths.
    var files []string
    for _, pat := range cfg.RuleFiles {
        fs, err := filepath.Glob(pat)
        if err != nil {
            // The only error can be a bad pattern.
            return fmt.Errorf("error retrieving rule files for %s: %s", pat, err)
        }
        files = append(files, fs...)
    }
    return ruleManager.Update(time.Duration(cfg.GlobalConfig.EvaluationInterval), files)
},

Configuration management is then completed by the component's built-in Update method:
prometheus/rules/manager.go
 
func (m *Manager) Update(interval time.Duration, files []string) error {
  .......
}
4. Finally, the reloadConfig method loads the configuration of each service component:

prometheus/cmd/prometheus/main.go
 
func reloadConfig(filename string, logger log.Logger, rls ...func(*config.Config) error) (err error) {
	level.Info(logger).Log("msg", "Loading configuration file", "filename", filename)
 
	defer func() {
		if err == nil {
			configSuccess.Set(1)
			configSuccessTime.SetToCurrentTime()
		} else {
			configSuccess.Set(0)
		}
	}()
 
	conf, err := config.LoadFile(filename)
	if err != nil {
		return fmt.Errorf("couldn't load configuration (--config.file=%q): %v", filename, err)
	}
 
	failed := false
	// load each service component's configuration in a for loop
	for _, rl := range rls {
		if err := rl(conf); err != nil {
			level.Error(logger).Log("msg", "Failed to apply configuration", "err", err)
			failed = true
		}
	}
	if failed {
		return fmt.Errorf("one or more errors occurred while applying the new configuration (--config.file=%q)", filename)
	}
	promql.SetDefaultEvaluationInterval(time.Duration(conf.GlobalConfig.EvaluationInterval))
	level.Info(logger).Log("msg", "Completed loading of configuration file", "filename", filename)
	return nil
}
  • Reload handler
// Location: cmd/prometheus/main.go
reloadables := []Reloadable{
   	remoteStorage,
   	targetManager,
   	ruleManager,
   	webHandler,
   	notifier,
   }
reloadConfig(cfg.configFile, logger, reloadables...)

// The core of reloadConfig: call ApplyConfig on every type that implements Reloadable to load the config:
conf, err := config.LoadFile(filename)
for _, rl := range rls {
   rl.ApplyConfig(conf)
}
// Location: cmd/prometheus/main.go
// A reload is triggered either by SIGHUP
// or by a call to the reload web API
go func() {
		<-hupReady
		for {
			select {
			case <-hup:
				reloadConfig(cfg.configFile, logger, reloadables...)
			case rc := <-webHandler.Reload():
				// report the result back to the web handler
				rc <- reloadConfig(cfg.configFile, logger, reloadables...)
			}
		}
	}()

1.3 Component initialization

//- 1. Storage component initialization
localStorage  = &tsdb.ReadyStorage{} // local storage
remoteStorage = remote.NewStorage(log.With(logger, "component", "remote"), // remote storage
    localStorage.StartTime, time.Duration(cfg.RemoteFlushDeadline))
fanoutStorage = storage.NewFanout(logger, localStorage, remoteStorage) // fanout proxy that reads from and writes to both local and remote storage

//- 2. notifierManager component initialization, used for sending alert notifications to Alertmanager
notifierManager = notifier.NewManager(&cfg.notifier, log.With(logger, "component", "notifier"))

//- 3. discoveryManagerScrape component initialization, used for discovering scrape targets
discoveryManagerScrape  = discovery.NewManager(ctxScrape, log.With(logger, "component", "discovery manager scrape"), discovery.Name("scrape"))

//- 4. discoveryManagerNotify component initialization, used for service discovery of alert notification (Alertmanager) endpoints
discoveryManagerNotify  = discovery.NewManager(ctxNotify, log.With(logger, "component", "discovery manager notify"), discovery.Name("notify"))
//- 5. scrapeManager component initialization
// The scrapeManager takes the targets discovered by discoveryManagerScrape, scrapes all metrics from those targets,
// and stores the scraped metrics into fanoutStorage; it is initialized via scrape.NewManager
scrapeManager = scrape.NewManager(log.With(logger, "component", "scrape manager"), fanoutStorage)
//- 6. queryEngine component initialization
opts = promql.EngineOpts{
    Logger:        log.With(logger, "component", "query engine"),
    Reg:           prometheus.DefaultRegisterer,
    MaxConcurrent: cfg.queryConcurrency,            // max number of concurrent queries
    MaxSamples:    cfg.queryMaxSamples,
    Timeout:       time.Duration(cfg.queryTimeout), // query timeout
}
queryEngine = promql.NewEngine(opts)
//- 7. ruleManager component initialization
// The ruleManager is initialized via rules.NewManager, whose options wire together several components:
// storage, queryEngine and the notifier; the overall flow covers rule evaluation and alert sending

ruleManager = rules.NewManager(&rules.ManagerOptions{
    Appendable:      fanoutStorage,                                             // storage appender
    TSDB:            localStorage,                                              // local time series database (TSDB)
    QueryFunc:       rules.EngineQueryFunc(queryEngine, fanoutStorage),         // rule evaluation
    NotifyFunc:      sendAlerts(notifierManager, cfg.web.ExternalURL.String()), // alert notification
    Context:         ctxRule,                                                   // context controlling the ruleManager goroutines
    ExternalURL:     cfg.web.ExternalURL,                                       // externally reachable URL exposed by the web handler
    Registerer:      prometheus.DefaultRegisterer,
    Logger:          log.With(logger, "component", "rule manager"),
    OutageTolerance: time.Duration(cfg.outageTolerance),                        // keep alert state across Prometheus restarts (https://ganeshvernekar.com/gsoc-2018/persist-for-state/)
    ForGracePeriod:  time.Duration(cfg.forGracePeriod),
    ResendDelay:     time.Duration(cfg.resendDelay),
})
//- 8. Web component initialization
// The web component exposes the Storage, queryEngine, scrapeManager, ruleManager and notifierManager components over HTTP
cfg.web.Context = ctxWeb
cfg.web.TSDB = localStorage.Get
cfg.web.Storage = fanoutStorage
cfg.web.QueryEngine = queryEngine
cfg.web.ScrapeManager = scrapeManager
cfg.web.RuleManager = ruleManager
cfg.web.Notifier = notifierManager
 
cfg.web.Version = &web.PrometheusVersion{
    Version:   version.Version,
    Revision:  version.Revision,
    Branch:    version.Branch,
    BuildUser: version.BuildUser,
    BuildDate: version.BuildDate,
    GoVersion: version.GoVersion,
}
 
cfg.web.Flags = map[string]string{}
 
// Depends on cfg.web.ScrapeManager so needs to be after cfg.web.ScrapeManager = scrapeManager
webHandler := web.New(log.With(logger, "component", "web"), &cfg.web)

1.4 Starting the service components

1. The github.com/oklog/oklog/pkg/group package is imported and a group.Group object g is instantiated:
prometheus/cmd/prometheus/main.go 
// "github.com/oklog/oklog/pkg/group"
var g group.Group
{
  ......
}
2. The object g holds the entry point of every service component; each entry point is added to g with the Add method. Taking the scrapeManager component as an example:
prometheus/cmd/prometheus/main.go
{
    // Scrape manager.
  // add the scrapeManager component to g via its Add method
    g.Add(
        func() error {
            // When the scrape manager receives a new targets list
            // it needs to read a valid config for each job.
            // It depends on the config being in sync with the discovery manager so
            // we wait until the config is fully loaded.
            <-reloadReady.C
       // entry function that starts the scrapeManager component
            err := scrapeManager.Run(discoveryManagerScrape.SyncCh())
            level.Info(logger).Log("msg", "Scrape manager stopped")
            return err
        },
        func(err error) {
            // Scrape manager needs to be stopped before closing the local TSDB
            // so that it doesn't try to write samples to a closed storage.
            level.Info(logger).Log("msg", "Stopping scrape manager...")
            scrapeManager.Stop()
        },
    )
}
3. Calling the Run method on g starts all service components:
prometheus/cmd/prometheus/main.go
if err := g.Run(); err != nil {
    level.Error(logger).Log("err", err)
    os.Exit(1)
}
level.Info(logger).Log("msg", "See you next time!")

2. Other tools used in the project

2.1 Goroutine orchestration framework: the oklog/run package

The oklog/run package

When one component terminates, the other components need to be notified so that they can shut down in order. This problem can be described with the Actor model: every goroutine is an actor, independent of the others, and actors communicate only through messages. The oklog/run package implements this Actor model and makes goroutine orchestration very concise.

Prometheus uses oklog/run for exactly this kind of coordination between components such as the scrape discovery manager, the scrape manager and the reload handler. Finally, the server runs all components in an actor-like model, using github.com/oklog/oklog/pkg/group to coordinate the startup and shutdown of all interconnected actors. Multiple channels are used to enforce ordering constraints, such as not enabling the web interface before the storage is ready and the initial configuration file load has happened.
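A minimal, self-contained sketch of this pattern using the standalone github.com/oklog/run package (the two actors here are illustrative, not Prometheus's): every actor is registered with an execute function and an interrupt function; when any actor's execute returns, all other actors are interrupted and Run returns.

package main

import (
	"context"
	"fmt"
	"os"
	"os/signal"

	"github.com/oklog/run"
)

func main() {
	var g run.Group

	// Actor 1: a worker that runs until its context is canceled.
	ctx, cancel := context.WithCancel(context.Background())
	g.Add(func() error {
		<-ctx.Done() // pretend to do useful work until asked to stop
		return ctx.Err()
	}, func(error) {
		cancel()
	})

	// Actor 2: returns on SIGINT, which interrupts every other actor.
	sig := make(chan os.Signal, 1)
	stop := make(chan struct{})
	signal.Notify(sig, os.Interrupt)
	g.Add(func() error {
		select {
		case <-sig:
		case <-stop:
		}
		return nil
	}, func(error) {
		close(stop)
	})

	// Run blocks until the first actor returns, interrupts the rest,
	// and returns the error of the first actor to exit.
	if err := g.Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}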

2.2 Command-line and flag parser: kingpin.v2


The parsing helpers in the os standard library can handle only very simple command-line arguments. The flag package can parse conventional -flag style arguments and generates help output, but neither os nor flag can handle complex, structured startup parameters; that is what kingpin.v2 is used for.
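A small, hypothetical example of kingpin.v2 usage (the application name and flag names are illustrative, not Prometheus's actual flag set), showing why it suits structured startup parameters: typed flags, defaults and help text are declared declaratively.

package main

import (
	"fmt"
	"os"

	"gopkg.in/alecthomas/kingpin.v2"
)

func main() {
	app := kingpin.New("demo", "A demo of kingpin.v2 flag parsing.")

	// Typed flags with defaults and help text (flag names are illustrative).
	configFile := app.Flag("config.file", "Configuration file path.").
		Default("demo.yml").String()
	listenAddr := app.Flag("web.listen-address", "Address to listen on.").
		Default(":9090").String()
	timeout := app.Flag("scrape.timeout", "Per-scrape timeout.").
		Default("10s").Duration()

	// Parse os.Args and exit with a usage message on error.
	kingpin.MustParse(app.Parse(os.Args[1:]))

	fmt.Printf("config=%s listen=%s timeout=%s\n", *configFile, *listenAddr, *timeout)
}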

3 Service discovery (serviceDiscover)

Prometheus supports many service discovery systems. These systems dynamically detect changes to the monitored services (targets), convert the changed targets into targetgroup.Group structures, and send them to the service discovery manager (serviceDiscover) through the up channel.

To manage all of these service discovery systems uniformly, the service discovery manager (serviceDiscover) defines a Discoverer interface that each system implements; discovered (online) targets are then sent back to the manager through the up channel.

The static service discovery system (StaticConfigs) implements this interface in prometheus/discovery/manager.go; every other (dynamic) service discovery system implements it in its own directory under prometheus/discovery/. A toy Discoverer sketch is given after the Group example below.
prometheus/discovery/manager.go
 
type Discoverer interface {
	// Run hands a channel to the discovery provider (Consul, DNS etc) through which it can send
	// updated target groups.
	// Must returns if the context gets canceled. It should not close the update
	// channel on returning.
	Run(ctx context.Context, up chan<- []*targetgroup.Group)
}
 
 
prometheus/discovery/targetgroup/targetgroup.go
 
// Group is a set of targets with a common label set(production , test, staging etc.).
type Group struct {
	// Targets is a list of targets identified by a label set. Each target is
	// uniquely identifiable in the group by its address label.
	Targets []model.LabelSet // main labels of a target, e.g. ip + port; example: "__address__": "localhost:9100"
	// Labels is a set of labels that is common across all targets in the group.
	Labels model.LabelSet // other labels shared by all targets in the group; may be empty
 
	// Source is an identifier that describes a group of targets.
	Source string // globally unique ID, e.g. Source: "0"
}
 
An example of a Group:
(dlv) p tg
*github.com/prometheus/prometheus/discovery/targetgroup.Group {
	Targets: []github.com/prometheus/common/model.LabelSet len: 1, cap: 1, [
		[
			"__address__": "localhost:9100", 
		],
	],
	Labels: github.com/prometheus/common/model.LabelSet nil,
	Source: "0",}
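To make the Discoverer contract above concrete, here is a minimal, hypothetical Discoverer (not part of Prometheus; the type name exampleDiscoverer and the 30-second interval are illustrative): it announces a single fixed target, sends []*targetgroup.Group on the up channel, returns when the context is canceled, and never closes the channel itself.

package example

import (
	"context"
	"time"

	"github.com/prometheus/common/model"
	"github.com/prometheus/prometheus/discovery/targetgroup"
)

// exampleDiscoverer is a hypothetical Discoverer that announces one fixed target.
type exampleDiscoverer struct {
	address string // e.g. "localhost:9100"
}

func (d *exampleDiscoverer) Run(ctx context.Context, up chan<- []*targetgroup.Group) {
	tgs := []*targetgroup.Group{{
		Targets: []model.LabelSet{{model.AddressLabel: model.LabelValue(d.address)}},
		Source:  "example/0",
	}}
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()
	for {
		select {
		case up <- tgs: // hand the current target list to the discovery manager
		case <-ctx.Done():
			return // must return on cancellation; do not close the up channel
		}
		select {
		case <-ticker.C: // periodically re-announce the same target group
		case <-ctx.Done():
			return
		}
	}
}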
  • Configuration file initialization
prometheus/cmd/prometheus/main.go
 
//discovery.Name("scrape") distinguishes this manager from the notify one
discoveryManagerScrape  = discovery.NewManager(ctxScrape, log.With(logger, "component", "discovery manager scrape"), discovery.Name("scrape"))
  • Calling the NewManager method instantiates the Manager struct
prometheus/discovery/manager.go
 
// NewManager is the Discovery Manager constructor.
func NewManager(ctx context.Context, logger log.Logger, options ...func(*Manager)) *Manager {
	if logger == nil {
		logger = log.NewNopLogger()
	}
	mgr := &Manager{
		logger:         logger,
		syncCh:         make(chan map[string][]*targetgroup.Group),
		targets:        make(map[poolKey]map[string]*targetgroup.Group),
		discoverCancel: []context.CancelFunc{},
		ctx:            ctx,
		updatert:       5 * time.Second,
		triggerSend:    make(chan struct{}, 1),
	}
	for _, option := range options {
		option(mgr)
	}
	return mgr
}
prometheus/discovery/manager.go
 
// Manager maintains a set of discovery providers and sends each update to a map channel.
// Targets are grouped by the target set name.
type Manager struct {
	logger         log.Logger           // logger
	name           string               // distinguishes scrape from notify, since both use the same discovery/manager.go
	mtx            sync.RWMutex         // read/write lock
	ctx            context.Context      // coordination, e.g. shutdown
	discoverCancel []context.CancelFunc // cancel functions, used when providers are torn down
 
	// Some Discoverers(eg. k8s) send only the updates for a given target group
	// so we use map[tg.Source]*targetgroup.Group to know which group to update.
	targets map[poolKey]map[string]*targetgroup.Group // discovered targets
	// providers keeps track of SD providers.
	providers []*provider // provider types include kubernetes, DNS, etc.
	// The sync channel sends the updates as a map where the key is the job value from the scrape config.
	syncCh chan map[string][]*targetgroup.Group // channel through which discovered targets are delivered to the scrapeManager
 
	// How long to wait before sending updates to the channel. The variable
	// should only be modified in unit tests.
	updatert time.Duration
 
	// The triggerSend channel signals to the manager that new updates have been received from providers.
	triggerSend chan struct{}
}
  • Load the configuration under scrape_configs in prometheus.yml via an anonymous function
prometheus/cmd/prometheus/main.go
 
func(cfg *config.Config) error {
    c := make(map[string]sd_config.ServiceDiscoveryConfig)
    for _, v := range cfg.ScrapeConfigs {
        c[v.JobName] = v.ServiceDiscoveryConfig
    }
    return discoveryManagerScrape.ApplyConfig(c)
},
  • The logic of ApplyConfig is clear: first register the Discoverer implementation for each job, then start that job's service discovery systems.
prometheus/discovery/manager.go
 
// ApplyConfig removes all running discovery providers and starts new ones using the provided config.
func (m *Manager) ApplyConfig(cfg map[string]sd_config.ServiceDiscoveryConfig) error {
	m.mtx.Lock()
	defer m.mtx.Unlock()
 
	for pk := range m.targets {
		if _, ok := cfg[pk.setName]; !ok {
			discoveredTargets.DeleteLabelValues(m.name, pk.setName)
		}
	}
	m.cancelDiscoverers()
    // name is the job_name; scfg holds that job_name's service discovery configs. A job_name may use several SD types, though this is rarely done
	for name, scfg := range cfg {
		m.registerProviders(scfg, name)
		discoveredTargets.WithLabelValues(m.name, name).Set(0)
	}
	for _, prov := range m.providers {
    // start the service discovery systems registered under each job
		m.startProvider(m.ctx, prov)
	}
 
	return nil
}

ApplyConfig implements this mainly by calling registerProviders() and startProvider().
1. registerProviders()
// prometheus/discovery/manager.go
func (m *Manager) registerProviders(cfg sd_config.ServiceDiscoveryConfig, setName string) {
	var added bool
	add := func(cfg interface{}, newDiscoverer func() (Discoverer, error)) {
		t := reflect.TypeOf(cfg).String()
		for _, p := range m.providers {
			if reflect.DeepEqual(cfg, p.config) {
				p.subs = append(p.subs, setName)
				added = true
				return
			}
		}
 
		d, err := newDiscoverer()
		if err != nil {
			level.Error(m.logger).Log("msg", "Cannot create service discovery", "err", err, "type", t)
			failedConfigs.WithLabelValues(m.name).Inc()
			return
		}
 
		provider := provider{
			name:   fmt.Sprintf("%s/%d", t, len(m.providers)),
			d:      d,
			config: cfg,
			subs:   []string{setName},
		}
		m.providers = append(m.providers, &provider)
		added = true
	}
  // for DNS service discovery, construct the DNS Discoverer
	for _, c := range cfg.DNSSDConfigs {
		add(c, func() (Discoverer, error) {
			return dns.NewDiscovery(*c, log.With(m.logger, "discovery", "dns")), nil
		})
	}
 ......
 ......
   
  // for static service discovery, construct the static Discoverer (StaticProvider) from the static_configs in the config file
	if len(cfg.StaticConfigs) > 0 {
		add(setName, func() (Discoverer, error) {
			return &StaticProvider{TargetGroups: cfg.StaticConfigs}, nil
		})
	}
	if !added {
		// Add an empty target group to force the refresh of the corresponding
		// scrape pool and to notify the receiver that this target set has no
		// current targets.
		// It can happen because the combined set of SD configurations is empty
		// or because we fail to instantiate all the SD configurations.
		add(setName, func() (Discoverer, error) {
			return &StaticProvider{TargetGroups: []*targetgroup.Group{{}}}, nil
		})
	}
}
Here cfg.StaticConfigs corresponds to the TargetGroups. Taking job_name: node as an example, the TargetGroups look like this:

(dlv) p setName
"node"
(dlv) p cfg.StaticConfigs
[]*github.com/prometheus/prometheus/discovery/targetgroup.Group len: 1, cap: 1, [
	*{
		Targets: []github.com/prometheus/common/model.LabelSet len: 1, cap: 1, [
	          [
		          "__address__": "localhost:9100", 
	          ],
		],
		Labels: github.com/prometheus/common/model.LabelSet nil,
		Source: "0",},
]

Each job_name corresponds to one TargetGroups, each TargetGroups can contain multiple providers, and each provider holds the corresponding Discoverer implementation, the job_name, and so on.
So the mapping is: job_name -> TargetGroups -> multiple targets -> multiple providers -> multiple Discoverers. A partial example:

(dlv) p m.providers
[]*github.com/prometheus/prometheus/discovery.provider len: 2, cap: 2, [
	*{
		name: "string/0",
		d: github.com/prometheus/prometheus/discovery.Discoverer(*github.com/prometheus/prometheus/discovery.StaticProvider) ...,
		subs: []string len: 1, cap: 1, [
			"prometheus",
		],
		config: interface {}(string) *(*interface {})(0xc000536268),},
	*{
		name: "string/1",
		d: github.com/prometheus/prometheus/discovery.Discoverer(*github.com/prometheus/prometheus/discovery.StaticProvider) ...,
		subs: []string len: 1, cap: 1, ["node"],
		config: interface {}(string) *(*interface {})(0xc000518b78),},
]
(dlv) p m.providers[0].d
github.com/prometheus/prometheus/discovery.Discoverer(*github.com/prometheus/prometheus/discovery.StaticProvider) *{
	TargetGroups: []*github.com/prometheus/prometheus/discovery/targetgroup.Group len: 1, cap: 1, [
		*(*"github.com/prometheus/prometheus/discovery/targetgroup.Group")(0xc000ce09f0),
	],}

2. startProvider() starts, one after another, all the service discovery systems registered for a job_name
prometheus/discovery/manager.go
 
func (m *Manager) startProvider(ctx context.Context, p *provider) {
	level.Debug(m.logger).Log("msg", "Starting provider", "provider", p.name, "subs", fmt.Sprintf("%v", p.subs))
	ctx, cancel := context.WithCancel(ctx)
	updates := make(chan []*targetgroup.Group)
 
	m.discoverCancel = append(m.discoverCancel, cancel)
    
    // the first goroutine runs the concrete discoverer and is the producer of []*targetgroup.Group
	go p.d.Run(ctx, updates)
    // the second goroutine is the consumer of []*targetgroup.Group
	go m.updater(ctx, p, updates)
}
 
Note: Run is implemented by whichever service discovery system implements Discoverer. For static service discovery, Run is implemented in prometheus/discovery/manager.go;
dynamic service discovery systems implement it in their own directories.

Run reads the values stored in the StaticProvider struct and sends them into the []*targetgroup.Group channel, acting as the producer side of service discovery.
prometheus/discovery/manager.go
 
// StaticProvider holds a list of target groups that never change.
type StaticProvider struct {
	TargetGroups []*targetgroup.Group
}
 
// Run implements the Worker interface.
func (sd *StaticProvider) Run(ctx context.Context, ch chan<- []*targetgroup.Group) {
	// We still have to consider that the consumer exits right away in which case
	// the context will be canceled.
	select {
	case ch <- sd.TargetGroups:
	case <-ctx.Done():
	}
	close(ch)
}
The updater method receives the TargetGroups from the []*targetgroup.Group channel and writes them into the Manager struct's targets field, which is a map.
prometheus/discovery/manager.go
 
func (m *Manager) updater(ctx context.Context, p *provider, updates chan []*targetgroup.Group) {
	for {
		select {
		case <-ctx.Done(): // exit
			return
		case tgs, ok := <-updates: // receive TargetGroups from the updates channel
			receivedUpdates.WithLabelValues(m.name).Inc()
			if !ok {
				level.Debug(m.logger).Log("msg", "discoverer channel closed", "provider", p.name)
				return
			}
            // subs holds the job_names; p.name is the SD type name plus an index, e.g. string/0
			for _, s := range p.subs {
				m.updateGroup(poolKey{setName: s, provider: p.name}, tgs)
			}
 
			select {
			case m.triggerSend <- struct{}{}:
			default:
			}
		}
	}
}

updateGroup updates the Manager's targets map: the key is a poolKey struct, and the value is the TargetGroups passed in, which contain the targets.
// prometheus/discovery/manager.go
 
func (m *Manager) updateGroup(poolKey poolKey, tgs []*targetgroup.Group) {
	m.mtx.Lock()
	defer m.mtx.Unlock()
 
	for _, tg := range tgs {
		if tg != nil { // Some Discoverers send nil target group so need to check for it to avoid panics.
			if _, ok := m.targets[poolKey]; !ok {
				m.targets[poolKey] = make(map[string]*targetgroup.Group)
			}
			m.targets[poolKey][tg.Source] = tg // in the targets map, the poolKey together with tg.Source uniquely identifies one target group
		}
	}
}

// prometheus/discovery/manager.go
 
// poolKey identifies the source of each discovered target group
type poolKey struct {
	setName  string // the job_name from the scrape config
	provider string // the provider name: SD type plus index, e.g. string/0 (static SD), DNS/1 (dynamic SD)
}
  • The service discovery manager (serviceDiscover) starts a goroutine to fetch data from the service discovery systems
In main.go, a goroutine is started to run the Run() method:
prometheus/cmd/prometheus/main.go
 
{
    // Scrape discovery manager.
    g.Add(
        func() error {
            err := discoveryManagerScrape.Run()
            level.Info(logger).Log("msg", "Scrape discovery manager stopped")
            return err
        },
        func(err error) {
            level.Info(logger).Log("msg", "Stopping scrape discovery manager...")
            cancelScrape()
        },
    )
}
Run in turn starts another goroutine that runs the sender() method.

// Run starts the background processing
func (m *Manager) Run() error {
	go m.sender()
	for range m.ctx.Done() {
		m.cancelDiscoverers()
		return m.ctx.Err()
	}
	return nil
}

The sender method's main job is to process the Manager's targets map and push the result into the Manager's syncCh channel (syncCh chan map[string][]*targetgroup.Group).
func (m *Manager) sender() {
	ticker := time.NewTicker(m.updatert)
	defer ticker.Stop()
 
	for {
		select {
		case <-m.ctx.Done():
			return
		case <-ticker.C: // Some discoverers send updates too often so we throttle these with the ticker.
			select {
			case <-m.triggerSend:
				sentUpdates.WithLabelValues(m.name).Inc()
				select {
         // allGroups converts the targets map into the sync format and sends it to syncCh
				case m.syncCh <- m.allGroups(): 
				default:
					delayedUpdates.WithLabelValues(m.name).Inc()
					level.Debug(m.logger).Log("msg", "discovery receiver's channel was full so will retry the next cycle")
					select {
					case m.triggerSend <- struct{}{}:
					default:
					}
				}
			default:
			}
		}
	}
}
The allGroups() method that performs the conversion:
func (m *Manager) allGroups() map[string][]*targetgroup.Group {
	m.mtx.Lock()
	defer m.mtx.Unlock()
 
	tSets := map[string][]*targetgroup.Group{}
	for pkey, tsets := range m.targets {
		var n int
		for _, tg := range tsets {
			// Even if the target group 'tg' is empty we still need to send it to the 'Scrape manager'
			// to signal that it needs to stop all scrape loops for this target set.
			tSets[pkey.setName] = append(tSets[pkey.setName], tg)
			n += len(tg.Targets)
		}
		discoveredTargets.WithLabelValues(m.name, pkey.setName).Set(float64(n))
	}
	return tSets
}
  • Communication between service discovery (serviceDiscover) and metrics scraping (scrapeManager)
The scrapeManager is started in a goroutine and listens on the Manager's syncCh, which is how the two communicate:
prometheus/cmd/prometheus/main.go
 
{
	// Scrape manager.
	g.Add(
		func() error {
			// When the scrape manager receives a new targets list
			// it needs to read a valid config for each job.
			// It depends on the config being in sync with the discovery manager so
			// we wait until the config is fully loaded.
			<-reloadReady.C

			err := scrapeManager.Run(discoveryManagerScrape.SyncCh())
			level.Info(logger).Log("msg", "Scrape manager stopped")
			return err
		},
		func(err error) {
			// Scrape manager needs to be stopped before closing the local TSDB
			// so that it doesn't try to write samples to a closed storage.
			level.Info(logger).Log("msg", "Stopping scrape manager...")
			scrapeManager.Stop()
		},
	)
}

prometheus/discovery/manager.go
 
// SyncCh returns a read only channel used by all the clients to receive target updates.
func (m *Manager) SyncCh() <-chan map[string][]*targetgroup.Group {
    // the syncCh channel of the Manager struct
	return m.syncCh
}

4 Metrics scraping (scrapeManager)


To obtain the monitored services (targets) from service discovery (serviceDiscover) in real time, the scrapeManager uses a goroutine to store the targets received from the channel into a map of type map[string][]*targetgroup.Group, where the key is the job_name and the value is the targetgroup.Group struct holding that job's Targets, Labels and Source.

When the set of targets changes, scrapeManager distinguishes several cases. Taking additions as an example: if a new job is added, scrapeManager reloads, creates a scrapePool for the new job, and creates a scrapeLoop for every target in that job; if the job itself is unchanged and only new targets are added under it, only scrapeLoops for those new targets are created.
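A simplified, hypothetical sketch of that consumption pattern (not the real scrape.Manager.Run implementation): read target-set updates keyed by job_name from the discovery manager's SyncCh and react to each update.

package example

import (
	"context"
	"log"

	"github.com/prometheus/prometheus/discovery/targetgroup"
)

// consumeTargets illustrates how a consumer such as scrapeManager drains the
// discovery manager's sync channel; a real consumer would sync its scrape
// pools and scrape loops here instead of logging.
func consumeTargets(ctx context.Context, tsets <-chan map[string][]*targetgroup.Group) {
	for {
		select {
		case <-ctx.Done():
			return
		case ts := <-tsets: // one update: job_name -> that job's target groups
			for job, groups := range ts {
				n := 0
				for _, g := range groups {
					n += len(g.Targets)
				}
				log.Printf("job %q: %d targets in %d groups", job, n, len(groups))
			}
		}
	}
}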

  • scrapeManager obtains the monitored targets in real time
  • scrapeManager configuration initialization and application

5 Metrics caching (scrapeCache)

The scrapeLoop struct contains a scrapeCache. The scrapeLoop.append method handles the storage of metrics: each metric is passed through the scrapeCache's methods for validity checking, filtering and caching.
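A toy sketch of the idea behind scrapeCache (the type and method names here are illustrative, not the real scrape package): cache the mapping from a scraped series' text to the storage reference returned by the appender, so later scrapes can append by reference instead of re-parsing and re-resolving labels.

package example

// toySeriesCache caches the storage reference for each scraped series line.
type toySeriesCache struct {
	entries map[string]uint64 // raw series text -> storage reference
}

func newToySeriesCache() *toySeriesCache {
	return &toySeriesCache{entries: make(map[string]uint64)}
}

// getRef returns the cached storage reference for a scraped series, if any.
func (c *toySeriesCache) getRef(series string) (uint64, bool) {
	ref, ok := c.entries[series]
	return ref, ok
}

// addRef records the reference handed back by the appender for a new series.
func (c *toySeriesCache) addRef(series string, ref uint64) {
	c.entries[series] = ref
}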

6 Notification management (notifierManager)

Prometheus defines alerting rule expressions in its configuration file. When the scraped metrics, after aggregation, satisfy an alerting expression, an alert is triggered and sent to the Alertmanager service. This section therefore focuses on the notifierManager that talks to Alertmanager, but first touches on part of the rule management (ruleManager), because the final evaluation of alerting rules is done by ruleManager.
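As a rough sketch of that hand-off (simplified from the sendAlerts NotifyFunc wired into rules.ManagerOptions in section 1.3; the field names follow the 2.x types and may differ across versions): firing rule alerts are converted into notifier.Alert values and handed to the notifier manager, which queues and sends them to Alertmanager.

package example

import (
	"github.com/prometheus/prometheus/notifier"
	"github.com/prometheus/prometheus/rules"
)

// forwardAlerts is a simplified sketch, not the exact main.go code: convert
// rule alerts into notifier.Alert values and queue them for Alertmanager.
func forwardAlerts(n *notifier.Manager, externalURL string, alerts ...*rules.Alert) {
	var out []*notifier.Alert
	for _, a := range alerts {
		na := &notifier.Alert{
			Labels:       a.Labels,
			Annotations:  a.Annotations,
			StartsAt:     a.FiredAt,
			GeneratorURL: externalURL,
		}
		if !a.ResolvedAt.IsZero() {
			na.EndsAt = a.ResolvedAt
		}
		out = append(out, na)
	}
	if len(out) > 0 {
		n.Send(out...)
	}
}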

  • ruleManager sends alert information to notifierManager
  • Starting the notifierManager

7 TSDB