
Feature: OpenTelemetry #111

Open · 1 task done
niklastreml opened this issue Feb 21, 2024 · 8 comments
Labels: request (Indicates a feature request)

@niklastreml (Contributor)

Is there an existing feature request for this?

  • I have searched the existing issues

Problem Description

The Prometheus format is not sufficient for exporting metrics from the traceroute check beyond the number of hops. To provide more complex data, it might be necessary to support a second data format like OpenTelemetry.

Solution Description

Integrate the OTel library and inject it into the checks so they can write their own traces. Traceroute can use this to collect more detailed data about how each invocation of the check behaves, essentially allowing a user to visualize how packets move from sparrow to their target.
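
For illustration, a minimal sketch of what this injection could look like with the opentelemetry-go API. The package layout, the Hop type, and the record function are hypothetical names, not the project's actual code; the span-per-hop layout is one possible design:

package traceroute

import (
	"context"
	"time"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/trace"
)

// Hop is a hypothetical result of a single traceroute hop.
type Hop struct {
	Addr string
	RTT  time.Duration
}

var tracer = otel.Tracer("sparrow/checks/traceroute")

// record emits one parent span per check run and one child span per hop,
// so a trace viewer can show the packet's path from sparrow to the target.
func record(ctx context.Context, target string, hops []Hop) {
	ctx, span := tracer.Start(ctx, "traceroute",
		trace.WithAttributes(attribute.String("target", target)))
	defer span.End()

	for i, hop := range hops {
		_, hopSpan := tracer.Start(ctx, "hop", trace.WithAttributes(
			attribute.Int("hop.n", i+1),
			attribute.String("hop.addr", hop.Addr),
			attribute.Int64("hop.rtt_ms", hop.RTT.Milliseconds()),
		))
		hopSpan.End()
	}
}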

Who can address the issue?

No response

Additional Context

https://github.com/open-telemetry/opentelemetry-go
https://opentelemetry.io/docs/languages/go/
https://www.jaegertracing.io/docs/1.54/client-libraries/#go
https://grafana.com/docs/grafana/latest/panels-visualizations/visualizations/traces/

@niklastreml added the request label Feb 21, 2024
@lvlcn-t changed the title from Feature: OpenTelemtry to Feature: OpenTelemetry Feb 21, 2024
@niklastreml (Contributor, Author) commented Feb 21, 2024

There might be a way to build traceroute metrics with Prometheus, but I'm not sure I like it.
We could create a sparrow_traceroute_hop metric that looks roughly like this:

sparrow_traceroute_hop{target="somedomain.com",n="2",hopAddr="10.0.0.1"} 152
  • target is what the user wants to check in the config
  • hopAddr is the specific address that responded at that hop (or where the packet failed)
  • n declares that hopAddr is the nth hop on the way to target
  • the actual metric value is the round-trip time in milliseconds to that hop

The following traceroute CLI output would map to Prometheus metrics like so:

$ traceroute 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  MyRouter (68.87.192.226)         0.345 ms
 2  68.87.192.227 (68.87.192.227)    9.291 ms
 3  dns.google (8.8.8.8)             9.231 ms
sparrow_traceroute_hop{target="8.8.8.8",n="1",hopAddr="68.87.192.226"} 0.345
sparrow_traceroute_hop{target="8.8.8.8",n="2",hopAddr="68.87.192.227"} 9.291
sparrow_traceroute_hop{target="8.8.8.8",n="3",hopAddr="8.8.8.8"} 9.231
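
For concreteness, the scheme above could be wired up with client_golang roughly like this (a sketch, not project code; the hopRTT variable name is hypothetical):

package traceroute

import "github.com/prometheus/client_golang/prometheus"

// hopRTT holds the round-trip time per hop; every (target, n, hopAddr)
// combination becomes its own time series.
var hopRTT = prometheus.NewGaugeVec(prometheus.GaugeOpts{
	Name: "sparrow_traceroute_hop",
	Help: "Round-trip time in milliseconds to the nth hop on the way to the target.",
}, []string{"target", "n", "hopAddr"})

func init() {
	prometheus.MustRegister(hopRTT)
}

// Example: the first hop from the CLI output above.
// hopRTT.WithLabelValues("8.8.8.8", "1", "68.87.192.226").Set(0.345)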

I'm not sure how easy it is to create a nice visualization from this though. Let me know what your thoughts on this are @lvlcn-t @puffitos @y-eight

@y-eight (Contributor) commented Feb 21, 2024

I don't think this is a good idea, especially when the route changes often. The metrics don't capture any relationship to each other over time.

Even if we cleared the last state, the issue would just be pushed to the next layer (Prometheus).

Update: I am not 100% sure; it might be fine for a start to clear the state and expose only the last traceroute as metrics. It is certainly easier to implement than OTEL. In the end we need feedback from the experts.

@niklastreml mentioned this issue Feb 21, 2024
@niklastreml (Contributor, Author)

That's what I thought too. We would be generating a lot of metrics with different labels this way. In the worst case we'd end up with maxHops * targets time series, which isn't exactly great.

@puffitos (Member)

Let's forget Prometheus for this one; the metrics will explode if we go that way. Let's make some simple API response available and then concentrate on offering OTEL traces, which can be visualised in the preferred monitoring tool, which is Grafana.

@lvlcn-t (Member) commented Feb 21, 2024

@niklastreml Imo we shouldn't use Prometheus metrics for something they're not intended for and hardly capable of. If the traceroute check were the only check implementing OTel traces, we could consider building this workaround for Prometheus metrics, but since the other checks will also write their own traces, we should focus on implementing OTel.

@puffitos I don't know about forgetting Prometheus completely for the traceroute check. I think we should still offer the number of hops as a Prometheus metric.

@lvlcn-t (Member) commented Feb 21, 2024

When implementing the traces for the dns check, we should keep this in mind: #81 (comment)

@puffitos (Member)

@lvlcn-t sorry, I communicated my intention incorrectly; we shouldn't use Prometheus metrics for each trace, as that would make the metrics' cardinality rise quickly. We can use metrics for the number of hops to the target, totalDuration, meanDuration and so on, if those make any sense in a traceroute scenario. I haven't used the check in a while, so I'm not exactly sure what's critical information and what's not. A sketch of such lower-cardinality aggregates follows below.
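
Such aggregates could look roughly like this with client_golang (a sketch; the metric and variable names are hypothetical, not project code):

package traceroute

import "github.com/prometheus/client_golang/prometheus"

// Per-target aggregates keep cardinality at one series per target,
// instead of maxHops * targets.
var (
	hops = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "sparrow_traceroute_hops",
		Help: "Number of hops to the target in the last run.",
	}, []string{"target"})

	totalDuration = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Name: "sparrow_traceroute_duration_ms",
		Help: "Total duration of the last traceroute in milliseconds.",
	}, []string{"target"})
)

func init() {
	prometheus.MustRegister(hops, totalDuration)
}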

@lvlcn-t (Member) commented Apr 13, 2024

I've started implementing the first step of this issue in feat/otel-provider: an OTel provider setup that depends on the user's (startup) configuration.
To keep the provider selection dynamic, I'm using the OpenTelemetry Protocol (OTLP) to communicate with the collector. This way, the user can choose both the provider and the collector to use.

The following configuration options will be available for the user (naming is not final):

# Configures the telemetry exporter.
telemetry:
  # The telemetry exporter to use.
  # Options:
  # grpc: Exports telemetry using OTLP via gRPC.
  # http: Exports telemetry using OTLP via HTTP.
  # stdout: Prints telemetry to stdout.
  # noop | "": Disables telemetry.
  exporter: grpc
  # The address to export telemetry to.
  url: localhost:4317
  # The token to use for authentication.
  # If the exporter does not require a token, this can be left empty.
  token: ""
  # The path to the tls certificate to use.
  # To disable tls, either set this to an empty string or set it to insecure.
  certPath: ""

Since the OpenTelemetry Protocol (OTLP) is a standard, the user can choose any collector that supports it. The stdout exporter prints the traces to the console (for debugging purposes), and the noop exporter disables telemetry. For an external collector, the user can provide a bearer token for authentication and a path to a TLS certificate for secure communication.
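
For illustration, a minimal sketch of how that exporter selection could look with the opentelemetry-go OTLP exporters. The Config struct, newExporter, and bearer are hypothetical names rather than the actual code in feat/otel-provider, and certPath/TLS handling is omitted for brevity:

package telemetry

import (
	"context"
	"fmt"

	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracehttp"
	"go.opentelemetry.io/otel/exporters/stdout/stdouttrace"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Config mirrors the proposed yaml options (naming not final).
type Config struct {
	Exporter string // grpc | http | stdout | noop | ""
	URL      string // collector endpoint, e.g. localhost:4317
	Token    string // optional bearer token
}

// newExporter builds the span exporter selected in the config.
// A nil exporter means telemetry is disabled (noop).
func newExporter(ctx context.Context, cfg Config) (sdktrace.SpanExporter, error) {
	switch cfg.Exporter {
	case "grpc":
		return otlptracegrpc.New(ctx,
			otlptracegrpc.WithEndpoint(cfg.URL),
			otlptracegrpc.WithHeaders(bearer(cfg.Token)))
	case "http":
		return otlptracehttp.New(ctx,
			otlptracehttp.WithEndpoint(cfg.URL),
			otlptracehttp.WithHeaders(bearer(cfg.Token)))
	case "stdout":
		return stdouttrace.New(stdouttrace.WithPrettyPrint())
	case "noop", "":
		return nil, nil
	default:
		return nil, fmt.Errorf("unknown exporter %q", cfg.Exporter)
	}
}

// bearer returns the auth header map, or nil if no token is configured.
func bearer(token string) map[string]string {
	if token == "" {
		return nil
	}
	return map[string]string{"Authorization": "Bearer " + token}
}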

Next, I'm wondering whether I should open the PR already, even though the actual traces aren't implemented yet. That way we could split the issue into smaller parts and everyone could instrument one part of the codebase. What do you think?

This was referenced Aug 5, 2024
@niklastreml self-assigned this Aug 7, 2024
@niklastreml modified the milestones: 0.4, 0.4.0 Aug 7, 2024
@lvlcn-t self-assigned this Sep 4, 2024