- Notifications
You must be signed in to change notification settings - Fork715
feat: service graph through streams#9418
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
Failed to generate code suggestions for PR |
Greptile OverviewGreptile SummaryReplaces in-memory service graph processing with SQL-based daemon architecture that runs in compactor every hour, completely decoupling service graph from trace ingestion. Major Changes:
Architecture Benefits:
Configuration:
Confidence Score: 4/5
Important Files ChangedFile Analysis
Sequence DiagramsequenceDiagram participant Client as Frontend Client participant API as Service Graph API participant Compactor as Compactor Job participant Processor as SQL Processor participant DataFusion as DataFusion Query Engine participant TraceStream as Trace Streams participant ServiceGraphStream as _o2_service_graph Stream Note over Compactor,Processor: Daemon runs every processing_interval_secs (default: 3600s) Compactor->>Processor: process_service_graph() Processor->>Processor: get_trace_streams() - List all orgs & trace streams loop For each trace stream Processor->>DataFusion: Execute SQL with CTE Note right of DataFusion: WITH edges AS (<br/> SELECT client, server, duration, span_status<br/> WHERE span_kind IN ('2','3')<br/> AND peer_service IS NOT NULL<br/>)<br/>SELECT COUNT(*), percentiles<br/>GROUP BY client, server DataFusion->>TraceStream: Query last 60min of traces TraceStream-->>DataFusion: Span records DataFusion-->>Processor: Aggregated edges (p50/p95/p99) Processor->>ServiceGraphStream: write_sql_aggregated_edges() Note right of ServiceGraphStream: Bulk ingest via logs API<br/>to _o2_service_graph end Note over Client,API: User queries topology (independent of daemon) Client->>API: GET /topology/current?stream_name=X API->>ServiceGraphStream: SQL query last 60min ServiceGraphStream-->>API: Edge records API->>API: build_topology() - Convert to nodes/edges API-->>Client: {nodes, edges, availableStreams} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
20 files reviewed, no comments
b0d49a6 toe563680Compare5964399 toe23e8c5CompareReplace in-memory EdgeStore with daemon-based architecture that periodicallyqueries trace streams using SQL aggregation in DataFusion.Architecture:- Daemon runs in compactor job every 1 hour (default, configurable)- Queries last 60 minutes of trace data (configurable window)- Uses DataFusion SQL with CTE for efficient aggregation- Calculates p50, p95, p99 percentiles using approx functions- Writes aggregated edges to internal `_o2_service_graph` stream- Zero impact on trace ingestion performanceKey Changes:- Add `processor.rs` with SQL-based daemon that queries trace streams- Add `aggregator.rs` to write SQL results to `_o2_service_graph` stream (uses json! macro)- Refactor `api.rs` to query pre-aggregated data from stream- Move business logic to enterprise repo (`build_topology()`, `span_to_graph_span()`)- Remove ALL inline processing during trace ingestion- Delete in-memory EdgeStore, worker threads, and buffering code- Feature-gated for enterprise builds with non-enterprise stubsSQL Query:- CTE extracts client/server from span_kind ('2'=SERVER, '3'=CLIENT)- Aggregates by (client, server) with COUNT, percentiles, error rate- Filters: span_kind IN ('2','3') AND peer_service IS NOT NULLConfiguration:- O2_SERVICE_GRAPH_ENABLED (default: false)- O2_SERVICE_GRAPH_PROCESSING_INTERVAL_SECS (default: 3600) - Daemon run frequency- O2_SERVICE_GRAPH_QUERY_TIME_RANGE_MINUTES (default: 60) - Query window sizeTechnical Details:- Stream name: `_o2_service_graph` (internal stream)- Edge timestamp: Uses query end time for API compatibility- Span kinds: '2' (SERVER), '3' (CLIENT) as VARCHAR in parquet- Percentiles: approx_median(), approx_percentile_cont(0.95), approx_percentile_cont(0.99)e23e8c5 to97c701aCompareada8085 intomainUh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Architecture
Replaces in-memory EdgeStore with daemon that periodically queries trace streams using DataFusion SQL.
Flow:
_o2_service_graphinternal stream_o2_service_graphfor last 60 minutesZero impact on trace ingestion - completely decoupled.
SQL Query
Configuration
O2_SERVICE_GRAPH_ENABLED=true- Enable featureO2_SERVICE_GRAPH_PROCESSING_INTERVAL_SECS=3600- How often daemon runs (default: 1 hour)O2_SERVICE_GRAPH_QUERY_TIME_RANGE_MINUTES=60- Query window size (default: 60 min)Changes
Added:
processor.rs(202 lines) - SQL daemonaggregator.rs(155 lines) - Stream writerModified:
api.rs- Query from_o2_service_graphstream, returnavailableStreamscompactor.rs- Integrate service_graph_processor jobtraces/mod.rs- Remove ALL inline processingDeleted:
job/service_graph.rs- Standalone daemon (now in compactor)Enterprise dependency:
Technical Details