
Conversation

@jmacd
Contributor

@jmacd jmacd commented Dec 20, 2025

Related to #1663.

This is a proposal to support an internal route for telemetry that lets us process OTAP-Dataflow telemetry using our own pipeline support. This requires special protections against self-induced telemetry, and it requires the ability to route telemetry in several ways, both to remain compatible with OpenTelemetry and to keep our options open. The proposal accompanying this PR is in the ARCHITECTURE.md draft.

@github-actions github-actions bot added the rust label (Pull requests that update Rust code) Dec 20, 2025
@codecov

codecov bot commented Dec 20, 2025

Codecov Report

❌ Patch coverage is 56.05048% with 592 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.02%. Comparing base (3ce308b) to head (38bda49).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1672      +/-   ##
==========================================
- Coverage   84.27%   84.02%   -0.25%     
==========================================
  Files         464      471       +7     
  Lines      133941   135276    +1335     
==========================================
+ Hits       112884   113672     +788     
- Misses      20523    21070     +547     
  Partials      534      534              
Components Coverage Δ
otap-dataflow 85.42% <56.05%> (-0.43%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.25% <ø> (ø)
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 53.50% <ø> (ø)

internal telemetry pipeline is rigorously safeguarded against these
pitfalls through:

- OTAP-Dataflow components downstream of an ITR cannot be configured
Contributor

ITR - first time usage. Define.

with:

- No-op logging
- Raw logging
Contributor

If "Raw logging" means console output, It should be through the SDK. It is yet another exporter, and should not be considered by separate.

Member

For stdout/console, we cannot (and should not) use OTel. We need to use tracing::fmt or develop a custom one.

Contributor Author

Yes, the OTel exporter is not specified as a stable output format by OTel.

I see that it is a JSON-line-oriented exporter, and I would welcome such a thing (@gouslu and I discussed deriving such a thing from the AzMon exporter code, which is similar to JSON-line).

In this PR, as a proof of concept, I replaced tracing::fmt and developed an OTLP-bytes-to-console exporter; I think that's essentially what you're saying, @cijothomas.

Actually, I've seen logging SDKs emit directly to protobuf bytes before! One reason we can't use tracing::fmt over the OTLP bytes representation in this case (AFAICT) is that the tracing Event struct does not contain a timestamp, so there's no way to format a log statement recorded in the past. This is not the case for the OTel SDK, which is why in this proposal we're able to reconstruct an OTel SDK log event and process/export it as an alternative to the ITR.
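
For illustration, here is a minimal sketch of the JSON-line console output being discussed, assuming the log record has already been decoded and carries its original timestamp (the struct and field names are hypothetical, not the types in this PR):

use serde::Serialize;
use std::io::Write;

/// Hypothetical, already-decoded log record. In the PR this data would come
/// from the OTLP bytes representation, which preserves the original timestamp
/// (unlike a tracing Event, which carries none).
#[derive(Serialize)]
struct ConsoleLogLine<'a> {
    time_unix_nano: u64,
    severity_text: &'a str,
    scope_name: &'a str,
    body: &'a str,
}

/// Write one log record as a single JSON line to stdout.
fn write_json_line(record: &ConsoleLogLine<'_>) -> std::io::Result<()> {
    let mut out = std::io::stdout().lock();
    serde_json::to_writer(&mut out, record)?;
    out.write_all(b"\n")
}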

@@ -0,0 +1,72 @@
// Copyright The OpenTelemetry Authors
Contributor

avoid the "mod.rs" format. Use a file internal_telemetry_receiver.rs at the same level as lib.rs instead.
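
For reference, a minimal sketch of the suggested layout (only the module name comes from this comment; the rest of the crate structure is assumed):

// src/lib.rs
// Declare the module from a sibling file, src/internal_telemetry_receiver.rs,
// instead of from src/internal_telemetry_receiver/mod.rs.
pub mod internal_telemetry_receiver;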

@jmacd
Contributor Author

jmacd commented Dec 20, 2025

I apologize: the code in this PR is not acceptable for review; we may view it as a feasibility study to accompany the ARCHITECTURE.md document.

pub max_record_bytes: usize,

/// Maximum number of records in the bounded channel.
/// When full, new records fall back to raw console logger.
Member

Why fall back here, instead of keeping a count of the drops?
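
As one way to do what this comment suggests, here is a minimal sketch that counts drops when the bounded channel is full instead of falling back to a console logger (assuming a tokio bounded mpsc channel; the wrapper type is hypothetical):

use std::sync::atomic::{AtomicU64, Ordering};
use tokio::sync::mpsc;

/// Hypothetical sender wrapper: on overflow, drop the record and count it.
struct CountingSender<T> {
    tx: mpsc::Sender<T>,
    dropped: AtomicU64,
}

impl<T> CountingSender<T> {
    /// Try to enqueue a record; if the channel is full (or closed), record a drop.
    fn send_or_count(&self, record: T) {
        if self.tx.try_send(record).is_err() {
            // The counter could later be reported as an internal metric.
            self.dropped.fetch_add(1, Ordering::Relaxed);
        }
    }
}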

#[serde(default = "default_max_record_bytes")]
pub max_record_bytes: usize,

/// Maximum number of records in the bounded channel.
Member

"bounded" channel feels like internal implementation details; so we should avoid exposing them to public config..

pub(crate) resource_bytes: OtlpProtoBytes,
pub(crate) scope_name: String,
pub(crate) flush_threshold_bytes: usize,
pub(crate) overflow_sender: mpsc::UnboundedSender<Bytes>,
Member

Is it okay to use an unbounded sender for overflow purposes?

Contributor

This PR is a draft, so it is probably a temporary shortcut. The rule I personally follow is to systematically use bounded channels together with a policy that specifies what should happen in case of overflow, that is, drop the incoming message or block the sender. This policy can be configurable, or chosen directly in the code, depending on whether it is worth making it configurable.
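
A minimal sketch of the bounded-channel-plus-policy rule described above (the policy enum and function are hypothetical, assuming tokio's bounded mpsc channel):

use tokio::sync::mpsc;

/// What to do when the bounded channel is full.
enum OverflowPolicy {
    /// Drop the incoming message.
    DropIncoming,
    /// Block (await) until capacity is available.
    BlockSender,
}

async fn send_with_policy<T>(tx: &mpsc::Sender<T>, msg: T, policy: OverflowPolicy) {
    match policy {
        OverflowPolicy::DropIncoming => {
            // Ignore the error: either the channel is full or the receiver is gone.
            let _ = tx.try_send(msg);
        }
        OverflowPolicy::BlockSender => {
            // Waits for capacity; only fails if the receiver has been dropped.
            let _ = tx.send(msg).await;
        }
    }
}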

Comment on lines +23 to +27
/// Internal collection for component-level logs.
///
/// When enabled, component logs (otel_info!, otel_warn!, etc.) are routed through
/// an internal telemetry receiver in the OTAP pipeline, allowing use of built-in
/// batch processors, retry, and exporters (console, OTLP, etc.).
Contributor

I'm a big fan of this approach.

Contributor

@lquerel lquerel left a comment

I mainly focused on the architecture document for now, which is very detailed and informative. A few general remarks:

  • In the long term, we should be able to define TelemetrySettings at a global level, at the engine level, with the possibility to override them in a specific pipeline configuration.

  • In the case of a tight integration with the otap-engine, I think we need to find a way to reuse the AttributeSet concept already used for metrics. This offers several advantages:

    1. we get a common language and definition for attributes,
    2. they are populated by the engine and registered only once, which avoids redefining them on every call. We only need to provide an attribute set ID, which is essentially just a number. Additional "dynamic" attributes could be added when needed (see the sketch after this list).
  • We should be able to compile the engine in a way that eliminates all overhead for a given logging level, essentially the same behavior we have today with the current macros. I believe this is already the plan, but I wanted to make this rule explicit.
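
To make the attribute-set idea concrete, here is a hypothetical sketch in which the engine registers a node's attributes once and hands back a small ID that later telemetry calls reference; none of these types are actual otap-dataflow APIs:

/// Hypothetical ID handed out when an attribute set is registered.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct AttributeSetId(u32);

/// Hypothetical registry owned by the engine.
#[derive(Default)]
struct AttributeRegistry {
    sets: Vec<Vec<(String, String)>>,
}

impl AttributeRegistry {
    /// Register a node's attributes once, at pipeline construction time.
    fn register(&mut self, attrs: Vec<(String, String)>) -> AttributeSetId {
        self.sets.push(attrs);
        AttributeSetId((self.sets.len() - 1) as u32)
    }

    /// Resolve an ID back into the full attribute set when exporting.
    fn resolve(&self, id: AttributeSetId) -> Option<&[(String, String)]> {
        self.sets.get(id.0 as usize).map(|v| v.as_slice())
    }
}

A log or metric call would then carry only the AttributeSetId (plus any "dynamic" attributes), rather than redefining the full set on every call.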

/// When disabled (default), component logs are routed to the OpenTelemetry SDK,
/// using the same export path as 3rd party logs from tokio-tracing-rs.
#[serde(default)]
pub internal_collection: InternalCollectionConfig,
Contributor

Long term, I think this approach will apply not only to logs, but also to metrics and traces. At some point, we might promote this configuration to a more general level.
For now, I suggest that this field be:

  • an Option, so we can remove the enabled field from InternalCollectionConfig, which I think would simplify things (see the sketch after this list)
  • renamed to something more explicit. I'm not fully satisfied with my proposal otap_pipeline; there is probably a better name to be found
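
A small sketch of the Option-based field suggested here (otap_pipeline is the reviewer's tentative name; the surrounding struct and the remaining fields of InternalCollectionConfig are assumptions):

use serde::Deserialize;

/// Stand-in for the PR's InternalCollectionConfig, without an `enabled`
/// field: the presence of the section now means "enabled".
#[derive(Deserialize)]
struct InternalCollectionConfig {
    max_record_bytes: usize,
}

#[derive(Deserialize)]
struct TelemetryConfig {
    /// None routes component logs to the OpenTelemetry SDK as today;
    /// Some(..) routes them through the OTAP pipeline.
    #[serde(default)]
    otap_pipeline: Option<InternalCollectionConfig>,
}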

///
/// This method never fails - errors are silently dropped to prevent recursion.
/// If the telemetry buffer is not configured, this is a no-op.
pub fn log_event(&self, log_record: &impl otap_df_pdata::views::logs::LogRecordView) {
Contributor

We need to find a way to reuse the NodeAttributeSet that we already use for metrics. That will let every log emitted by a node share a common context with the metrics.
