Description
Summary
A potential resource leak has been identified in the metrics reporting pipeline: periodic reader goroutines continue running after their associated reporters are evicted from the LRU cache, which prevents garbage collection and causes the OTEL endpoint to receive empty metric exports.
Root Cause
- Per-Service Metric Provider Creation
When the `MetricsReporter` processes spans in the `onSpan` function, it creates a per-service `Metrics` instance through the `ReporterPool`. Each `Metrics` instance contains a `MeterProvider` with a `PeriodicReader` that exports metrics at the configured interval.
opentelemetry-ebpf-instrumentation/pkg/export/otel/metrics.go
Lines 1125 to 1151 in 1807454

    func (mr *MetricsReporter) onSpan(spans []request.Span) {
        for i := range spans {
            s := &spans[i]
            if s.InternalSignal() {
                continue
            }
            if !s.Service.ExportModes.CanExportMetrics() {
                continue
            }
            // If we are ignoring this span because of route patterns, don't do anything
            if request.IgnoreMetrics(s) {
                continue
            }
            reporter, err := mr.reporters.For(&s.Service)
            if err != nil {
                mlog().Error("unexpected error creating OTEL resource. Ignoring metric",
                    "error", err, "service", s.Service)
                continue
            }
            reporter.record(s, mr)
            if mr.commonCfg.Features.AppHost() {
                hostInfo, attrs := mr.hostInfo.ForRecord(s)
                hostInfo.Record(mr.ctx, 1, instrument.WithAttributeSet(attrs))
            }
        }
    }
Each service gets its own `MeterProvider` with a `PeriodicReader`:
opentelemetry-ebpf-instrumentation/pkg/export/otel/metrics.go
Lines 651 to 666 in 1807454

    opts := []metric.Option{
        metric.WithResource(resources),
        metric.WithReader(metric.NewPeriodicReader(mr.exporter, metric.WithInterval(mr.cfg.Interval))),
    }
    opts = append(opts, mr.otelMetricOptions(mlog)...)
    opts = append(opts, mr.spanMetricOptions(mlog)...)
    return Metrics{
        ctx:     mr.ctx,
        service: service,
        provider: metric.NewMeterProvider(
            opts...,
        ),
    }

- Background Goroutine in `PeriodicReader`
The `PeriodicReader` starts a background goroutine that continuously collects and exports metrics on a ticker interval:
opentelemetry-ebpf-instrumentation/pkg/export/otel/metric/periodic_reader.go
Lines 126 to 129 in 1807454

    go func() {
        defer func() { close(r.done) }()
        r.run(ctx, conf.interval)
    }()
opentelemetry-ebpf-instrumentation/pkg/export/otel/metric/periodic_reader.go
Lines 163 to 181 in 1807454

    func (r *PeriodicReader) run(ctx context.Context, interval time.Duration) {
        ticker := newTicker(interval)
        defer ticker.Stop()
        for {
            select {
            case <-ticker.C:
                err := r.collectAndExport(ctx)
                if err != nil {
                    otel.Handle(err)
                }
            case errCh := <-r.flushCh:
                errCh <- r.collectAndExport(ctx)
                ticker.Reset(interval)
            case <-ctx.Done():
                return
            }
        }
    }

- Reporter Eviction Without Goroutine Cleanup
After the TTL expires and `expireOldReporters` removes a reporter from the cache, the eviction callback only calls `ForceFlush` asynchronously, but does not call `Shutdown`:
opentelemetry-ebpf-instrumentation/pkg/export/otel/otelcfg/common.go
Lines 293 to 306 in 1807454

    func (rp *ReporterPool[K, T]) expireOldReporters() {
        now := rp.clock()
        if now.Sub(rp.lastExpiration) < rp.ttl {
            return
        }
        rp.lastExpiration = now
        for {
            _, v, ok := rp.pool.GetOldest()
            if !ok || now.Sub(v.lastAccess) < rp.ttl {
                return
            }
            rp.pool.RemoveOldest()
        }
    }
opentelemetry-ebpf-instrumentation/pkg/export/otel/metrics.go
Lines 284 to 299 in 1807454

    mr.reporters = otelcfg.NewReporterPool[*svc.Attrs, *Metrics](cfg.ReportersCacheLen, cfg.TTL, timeNow,
        func(id svc.UID, v *Metrics) {
            llog := log.With("service", id)
            llog.Debug("evicting metrics reporter from cache")
            v.cleanupAllMetricsInstances()
            if !mr.pidTracker.ServiceLive(id) {
                mr.deleteTargetMetrics(&id)
            }
            go func() {
                if err := v.provider.ForceFlush(ctx); err != nil {
                    llog.Warn("error flushing evicted metrics provider", "error", err)
                }
            }()
        }, mr.newMetricSet)
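For reference, the leaked goroutine can be reproduced in isolation. The sketch below is only an illustration: it uses the upstream `go.opentelemetry.io/otel/sdk/metric` package and a stdout exporter instead of OBI's in-repo copy under `pkg/export/otel/metric` and the OTLP exporter, and it simply shows that a `PeriodicReader` keeps running when its provider is flushed but never shut down:

    package main

    import (
        "context"
        "fmt"
        "runtime"
        "time"

        "go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
        sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    )

    func main() {
        ctx := context.Background()
        exporter, err := stdoutmetric.New()
        if err != nil {
            panic(err)
        }

        before := runtime.NumGoroutine()

        // Mimic one per-service provider: created with a periodic reader,
        // flushed on "eviction", then the only reference is dropped.
        mp := sdkmetric.NewMeterProvider(
            sdkmetric.WithReader(
                sdkmetric.NewPeriodicReader(exporter, sdkmetric.WithInterval(time.Second)),
            ),
        )
        _ = mp.ForceFlush(ctx) // what the eviction callback does today
        mp = nil               // what the LRU cache removal does
        _ = mp

        runtime.GC()
        time.Sleep(2 * time.Second)

        // The reader's run loop is still ticking, so the goroutine count stays
        // elevated. Calling Shutdown instead of ForceFlush would have stopped it.
        fmt.Printf("goroutines before=%d after=%d\n", before, runtime.NumGoroutine())
    }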
Impact
The periodic reader goroutine continues running indefinitely because:
- The goroutine only stops when `ctx.Done()` is signaled (line 177 in periodic_reader.go)
- The context is only canceled in the `Shutdown` method (line 320 in periodic_reader.go)
- The eviction callback only calls `ForceFlush`, not `Shutdown`
This causes:
- Memory leak: The evicted Metrics instance and its MeterProvider cannot be garbage collected
- Unnecessary network traffic: The OTEL endpoint continues receiving metric exports with no actual data
- Resource waste: Background goroutines accumulate over time for services that are no longer active
The Dilemma
Calling `MeterProvider.Shutdown()` in the eviction callback would solve the goroutine leak, but it creates another problem:
opentelemetry-ebpf-instrumentation/pkg/export/otel/metric/provider.go
Lines 129 to 143 in 1807454

    func (mp *MeterProvider) Shutdown(ctx context.Context) error {
        // Even though it may seem like there is a synchronization issue between the
        // call to `Store` and checking `shutdown`, the Go concurrency model ensures
        // that is not the case, as all the atomic operations executed in a program
        // behave as though executed in some sequentially consistent order. This
        // definition provides the same semantics as C++'s sequentially consistent
        // atomics and Java's volatile variables.
        // See https://go.dev/ref/mem#atomic and https://pkg.go.dev/sync/atomic.
        mp.stopped.Store(true)
        if mp.shutdown != nil {
            return mp.shutdown(ctx)
        }
        return nil
    }
`MeterProvider.Shutdown()` calls the unified shutdown function for all readers:
opentelemetry-ebpf-instrumentation/pkg/export/otel/metric/config.go
Lines 25 to 35 in 1807454

    func (c config) readerSignals() (forceFlush, shutdown func(context.Context) error) {
        var fFuncs, sFuncs []func(context.Context) error
        for _, r := range c.readers {
            sFuncs = append(sFuncs, r.Shutdown)
            if f, ok := r.(interface{ ForceFlush(context.Context) error }); ok {
                fFuncs = append(fFuncs, f.ForceFlush)
            }
        }
        return unify(fFuncs), unifyShutdown(sFuncs)
    }
This in turn calls `PeriodicReader.Shutdown()`.

The critical issue: line 338 in periodic_reader.go calls `r.exporter.Shutdown(ctx)`, which would shut down the shared exporter instance used by all services' metric providers, breaking metrics export for all other active services.

opentelemetry-ebpf-instrumentation/pkg/export/otel/metric/periodic_reader.go
Lines 309 to 350 in 1807454

    func (r *PeriodicReader) Shutdown(ctx context.Context) error {
        err := ErrReaderShutdown
        r.shutdownOnce.Do(func() {
            // Prioritize the ctx timeout if it is set.
            if _, ok := ctx.Deadline(); !ok {
                var cancel context.CancelFunc
                ctx, cancel = context.WithTimeout(ctx, r.timeout)
                defer cancel()
            }
            // Stop the run loop.
            r.cancel()
            <-r.done
            // Any future call to Collect will now return ErrReaderShutdown.
            ph := r.sdkProducer.Swap(produceHolder{
                produce: shutdownProducer{}.produce,
            })
            if ph != nil { // Reader was registered.
                // Flush pending telemetry.
                m := r.rmPool.Get().(*sdkmetricdata.ResourceMetrics)
                err = r.collect(ctx, ph, m)
                if err == nil {
                    err = r.export(ctx, m)
                }
                r.rmPool.Put(m)
            }
            sErr := r.exporter.Shutdown(ctx)
            if err == nil || errors.Is(err, ErrReaderShutdown) {
                err = sErr
            }
            r.mu.Lock()
            defer r.mu.Unlock()
            r.isShutdown = true
            // release references to Producer(s)
            r.externalProducers.Store([]Producer{})
        })
        return err
    }
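One conceivable direction, sketched purely for discussion (this is not a proposed patch, and `nonOwningExporter` is a hypothetical name): wrap the shared exporter so that a per-service provider can be fully shut down without tearing down the exporter that other providers still use. The sketch uses the upstream `go.opentelemetry.io/otel/sdk/metric` types; OBI's in-repo metric package would need the equivalent.

    package main

    import (
        "context"
        "time"

        "go.opentelemetry.io/otel/exporters/stdout/stdoutmetric"
        sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    )

    // nonOwningExporter forwards everything to the shared exporter but turns
    // Shutdown into a flush-only operation, leaving the shared exporter alive
    // for the other per-service MeterProviders.
    type nonOwningExporter struct {
        sdkmetric.Exporter
    }

    func (e nonOwningExporter) Shutdown(ctx context.Context) error {
        return e.Exporter.ForceFlush(ctx)
    }

    func main() {
        ctx := context.Background()
        shared, err := stdoutmetric.New()
        if err != nil {
            panic(err)
        }

        newProvider := func() *sdkmetric.MeterProvider {
            return sdkmetric.NewMeterProvider(
                sdkmetric.WithReader(
                    sdkmetric.NewPeriodicReader(
                        nonOwningExporter{shared},
                        sdkmetric.WithInterval(time.Second),
                    ),
                ),
            )
        }

        evicted, active := newProvider(), newProvider()

        // Evicting one service: a full Shutdown stops its reader goroutine
        // without shutting down the shared exporter...
        _ = evicted.Shutdown(ctx)

        // ...so the other service keeps recording and exporting.
        counter, _ := active.Meter("demo").Int64Counter("requests")
        counter.Add(ctx, 1)
        time.Sleep(2 * time.Second) // let its periodic reader export once

        // Only on process teardown are the remaining provider and the shared
        // exporter shut down for real.
        _ = active.Shutdown(ctx)
        _ = shared.Shutdown(ctx)
    }

With something along these lines, the eviction callback in metrics.go could call `v.provider.Shutdown(ctx)` instead of `ForceFlush`, stopping the leaked goroutine; whether that layering is acceptable is one of the questions for the maintainers below.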
Steps to Reproduce
Environment:
Run opentelemetry-ebpf-instrumentation as in the Docker Compose example from the OpenTelemetry docs.
- opentelemetry-ebpf-instrumentation version: v0.3.0
- Setup: Docker Compose with PostgreSQL database

    version: "3.7"
    services:
      postgres:
        image: postgres:16.10
        ports:
          - 5432:5432
        environment:
          - POSTGRES_HOST_AUTH_METHOD=trust
      obi:
        image: otel/ebpf-instrument:v0.3.0
        environment:
          - OTEL_EBPF_OPEN_PORT=5432
          - OTEL_EBPF_CONFIG_PATH=/etc/obi/config.yaml
        volumes:
          - ./obi-config.yaml:/etc/obi/config.yaml
        privileged: true
        pid: "service:postgres"
        depends_on:
          - postgres

    # obi-config.yaml
    log_level: DEBUG
    otel_metrics_export:
      ttl: 5m
      endpoint: http://otel-collector:4317
      features:
        - network
Steps:
- Deploy the above configuration in the Docker Compose environment
- Connect to the PostgreSQL database and issue a CREATE TABLE statement (a small Go helper for this is sketched after this list)
- Observe that a metrics reporter is instantiated
- After the TTL has passed, connect to the PostgreSQL database again and issue another CREATE TABLE statement to trigger expiration
- Observe that the eviction callback is called
- Note that there are now two metrics reporters sending exports; the old one did not disappear
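For the two database steps above, any PostgreSQL client works (plain `psql`, for example); the hypothetical helper below is just one way to generate the traffic from Go, assuming the compose setup above (trust auth, port 5432 published on localhost) and the `github.com/lib/pq` driver:

    package main

    import (
        "database/sql"
        "fmt"
        "log"
        "time"

        _ "github.com/lib/pq" // registers the "postgres" driver
    )

    func main() {
        // Connection settings match the docker-compose file above:
        // trust auth, default postgres user/database, port 5432 on localhost.
        db, err := sql.Open("postgres",
            "host=localhost port=5432 user=postgres dbname=postgres sslmode=disable")
        if err != nil {
            log.Fatal(err)
        }
        defer db.Close()

        // Use a unique table name so the same program can be re-run after the
        // TTL has expired without a "relation already exists" error.
        table := fmt.Sprintf("obi_repro_%d", time.Now().Unix())
        if _, err := db.Exec("CREATE TABLE " + table + " (id int)"); err != nil {
            log.Fatal(err)
        }
        log.Printf("created table %s", table)
    }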
Expected Behavior
When a metrics reporter is evicted from the cache, the following shutdown sequence should occur:
- Complete `MeterProvider` Shutdown: The `MeterProvider` associated with the evicted reporter must be fully shut down by calling its `Shutdown()` method, not just `ForceFlush()`. This ensures that the `PeriodicReader` properly terminates its background goroutine, preventing resource leaks.
- Exporter Independence: The `PeriodicReader.Shutdown()` implementation calls `r.exporter.Shutdown(ctx)`, but since multiple `MeterProvider` instances share the same exporter instance, the shutdown of one reader must not shut down the shared exporter or affect other active readers.
Questions for Maintainers
- Is the current behavior intentional? Is it expected that periodic reader goroutines continue running after reporter eviction, or is this an oversight?
- What is the intended lifecycle management strategy for per-service `MeterProvider` instances in the context of TTL-based cache eviction?
- Why does `PeriodicReader.Shutdown()` call `exporter.Shutdown()`? Given that multiple readers may share the same exporter instance, shouldn't the exporter lifecycle be managed separately from individual readers?
- What is the recommended approach to properly clean up evicted reporters without affecting other active services?