Starlink Collector Services Optimizations
Starlink Telemetry Service Overview
1. Starlink Telemetry API Update
(ref. Starlink docs):
The telemetry API maintains a read index (per {client credential, account} pair) and a write index (advanced when new data is reported). If your client reads data slower than it's written, you'll keep receiving older data. To stay current, your client must poll faster than new data is generated.
The API is designed for continuous polling (hotloop). You can fine-tune response size using lingerMs and batchSize.
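A minimal sketch of a single hotloop iteration in Go, assuming a POST endpoint that accepts batchSize and lingerMs in the request body; the URL, field names, and values below are illustrative placeholders, not figures confirmed by the Starlink API reference:

```go
package collector

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// TelemetryRequest carries the two tuning knobs mentioned in the Starlink docs.
// Field names and values here are illustrative assumptions.
type TelemetryRequest struct {
	BatchSize int `json:"batchSize"` // upper bound on records per response
	LingerMs  int `json:"lingerMs"`  // how long the API may wait to fill a batch
}

// pollOnce performs one iteration of the hotloop against a placeholder endpoint.
func pollOnce(client *http.Client, url, token string) ([]byte, error) {
	body, err := json.Marshal(TelemetryRequest{BatchSize: 5000, LingerMs: 2000})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Authorization", "Bearer "+token)
	req.Header.Set("Content-Type", "application/json")

	resp, err := client.Do(req)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("telemetry poll: unexpected status %s", resp.Status)
	}
	return io.ReadAll(resp.Body)
}
```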
Problem:
If we ingest data slower than Starlink produces it, we fall behind and delays accumulate. To catch up, we can use a hotloop (frequent polling of the Starlink telemetry API, decoupled from immediate ingestion), but aggressive polling can lead to API rate-limit issues, especially as the number of service accounts grows.
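One way to realize the hotloop without letting slow ingestion stall reads is to decouple polling from ingestion with a bounded buffer, sketched below; pollTelemetry and ingestBatch are hypothetical stand-ins for the API call and the ClickHouse write path:

```go
package collector

import (
	"context"
	"log"
	"time"
)

// pollTelemetry and ingestBatch are placeholders for the real Starlink API
// call and the ClickHouse write path.
func pollTelemetry(ctx context.Context) ([]byte, error)   { return nil, nil }
func ingestBatch(ctx context.Context, batch []byte) error { return nil }

// Run keeps polling in a tight loop while ingestion drains a bounded buffer
// at its own pace, so slow ingestion no longer slows down reads.
func Run(ctx context.Context) {
	batches := make(chan []byte, 64) // bounded hand-off between poller and ingester

	// Poller goroutine: the hotloop.
	go func() {
		defer close(batches)
		for ctx.Err() == nil {
			b, err := pollTelemetry(ctx)
			if err != nil {
				log.Printf("poll failed: %v", err)
				time.Sleep(time.Second) // brief back-off on errors
				continue
			}
			select {
			case batches <- b:
			case <-ctx.Done():
				return
			}
		}
	}()

	// Ingester: processes batches independently of polling speed.
	for b := range batches {
		if err := ingestBatch(ctx, b); err != nil {
			log.Printf("ingest failed: %v", err)
		}
	}
}
```

The bounded channel also applies backpressure: when ingestion falls too far behind, polling pauses instead of accumulating unbounded memory.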
2. Heap Memory Usage Issue
Problem:
Telemetry data for all service accounts under a client is currently collected first, then processed in bulk. This causes excessive heap memory consumption, increasing the risk of out-of-memory errors and impacting service stability.
Proposed Solution:
Shift to a per-account processing model: fetch and process data for one account at a time (see the sketch below).
Minimizes memory usage by reducing the in-memory data footprint.
Ensures more consistent and scalable performance as the number of accounts grows.
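A minimal sketch of that per-account flow, with fetchAccountTelemetry and ingest as hypothetical stand-ins for the Starlink fetch and the ClickHouse write:

```go
package collector

import (
	"context"
	"log"
)

// Record is a placeholder for one telemetry row.
type Record struct{ /* telemetry fields */ }

// fetchAccountTelemetry and ingest are hypothetical helpers standing in for
// the Starlink API call and the ClickHouse write path.
func fetchAccountTelemetry(ctx context.Context, account string) ([]Record, error) { return nil, nil }
func ingest(ctx context.Context, recs []Record) error                             { return nil }

// ProcessClient walks the accounts one at a time instead of collecting data
// for every account before processing, so peak heap usage is bounded by the
// largest single account rather than the whole client.
func ProcessClient(ctx context.Context, accounts []string) {
	for _, acct := range accounts {
		recs, err := fetchAccountTelemetry(ctx, acct)
		if err != nil {
			log.Printf("account %s: fetch failed: %v", acct, err)
			continue
		}
		if err := ingest(ctx, recs); err != nil {
			log.Printf("account %s: ingest failed: %v", acct, err)
		}
		// recs becomes garbage after this iteration, keeping the footprint small.
	}
}
```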
3. Data Ingestion Performance (ClickHouse)
Problem:
A single ClickHouse connection is currently used for data ingestion. Running batch inserts in parallel over that one connection, or attempting to insert all data in a single batch, leads to bottlenecks, ingestion delays, and ClickHouse client errors such as "connection busy" during insert operations, especially when the data volume is large.
Proposed Solution:
Implement a connection pool of ClickHouse clients to allow multiple concurrent insertions (see the sketch below).
Use batch insertion strategies combined with concurrent ingestion threads.
This improves throughput, reduces ingestion lag, and balances system resource usage.
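A sketch of both ideas using the clickhouse-go v2 driver, whose connection options (MaxOpenConns, MaxIdleConns) provide pooling out of the box; each worker goroutine prepares and sends its own batch. The server address, table, and column names are placeholders:

```go
package ingestion

import (
	"context"
	"sync"
	"time"

	"github.com/ClickHouse/clickhouse-go/v2"
	"github.com/ClickHouse/clickhouse-go/v2/lib/driver"
)

// Row is a placeholder schema; the real table and columns will differ.
type Row struct {
	DeviceID string
	Ts       time.Time
	Value    float64
}

// OpenPool returns a driver.Conn backed by an internal connection pool,
// replacing the single shared connection.
func OpenPool() (driver.Conn, error) {
	return clickhouse.Open(&clickhouse.Options{
		Addr:         []string{"clickhouse:9000"},
		MaxOpenConns: 8, // several concurrent inserts instead of one busy connection
		MaxIdleConns: 4,
	})
}

// InsertConcurrently sends each chunk as its own batch on its own goroutine;
// the pool hands every goroutine a free connection.
func InsertConcurrently(ctx context.Context, conn driver.Conn, chunks [][]Row) error {
	var wg sync.WaitGroup
	errs := make(chan error, len(chunks))

	for _, chunk := range chunks {
		wg.Add(1)
		go func(rows []Row) {
			defer wg.Done()
			batch, err := conn.PrepareBatch(ctx, "INSERT INTO telemetry_raw (device_id, ts, value)")
			if err != nil {
				errs <- err
				return
			}
			for _, r := range rows {
				if err := batch.Append(r.DeviceID, r.Ts, r.Value); err != nil {
					errs <- err
					return
				}
			}
			errs <- batch.Send()
		}(chunk)
	}

	wg.Wait()
	close(errs)
	for err := range errs {
		if err != nil {
			return err
		}
	}
	return nil
}
```

Pool size and chunk size should be tuned together; ClickHouse generally favors fewer, larger inserts over many small ones.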
ETL Service: Overview and Rate-Limit Handling
1. Service Overview
The ETL service is responsible for fetching and processing the following data:
Service Line Details
Router Configuration and Address
Service Line Usage Metrics
Execution Periodicity: Every 30 minutes (see the scheduling sketch below)
Source: Data is fetched via API calls to the Starlink system.
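A minimal scheduling sketch for this cycle, with runETLCycle standing in for the actual fetch-and-process logic:

```go
package etl

import (
	"context"
	"log"
	"time"
)

// runETLCycle is a placeholder for fetching service line details, router
// configuration/address, and usage metrics from the Starlink API.
func runETLCycle(ctx context.Context) error { return nil }

// Schedule runs one cycle immediately and then every 30 minutes.
func Schedule(ctx context.Context) {
	ticker := time.NewTicker(30 * time.Minute)
	defer ticker.Stop()

	for {
		if err := runETLCycle(ctx); err != nil {
			log.Printf("ETL cycle failed: %v", err)
		}
		select {
		case <-ticker.C:
		case <-ctx.Done():
			return
		}
	}
}
```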
2. Current Limitation
❗ API Rate Limiting
The Starlink API enforces a rate limit of 250 requests per minute per source IP.
All services, including the ETL service, are deployed in the same cloud-based Kubernetes cluster and hence share a single source IP.
This causes a bottleneck, especially as the ETL service makes multiple API calls (for terminal configs, router details, service lines, and usage).
As the system evolves to include configuration updates (e.g., from the Kognitive Starlink dashboard), the risk grows that config operations will fail because the ETL service has already exhausted the shared rate limit.
3. Proposed Solutions
A. Network-Level Solution: Assign IP Pool to Kubernetes Cluster
To mitigate the shared IP rate limitation:
Assign IP pools in the Kubernetes cluster so that egress traffic is spread across more than one source IP.
Where possible, isolate services by IP so that one service exhausting its rate limit does not impact another.
B. Service-Level Improvements
1. Optimize API Usage Frequency
Review current API call patterns.
Identify data that changes less frequently, such as router configuration or address data.
Reduce the frequency of these API calls (e.g., refresh them every few hours or on configuration update instead of every 30 minutes); see the caching sketch below.
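A sketch of such a TTL guard around the router configuration/address fetch; the helper names and the example TTL are illustrative assumptions:

```go
package etl

import (
	"context"
	"sync"
	"time"
)

// RouterConfig is a placeholder for the router configuration/address payload.
type RouterConfig struct{ /* ... */ }

// fetchRouterConfig stands in for the actual Starlink API call.
func fetchRouterConfig(ctx context.Context, serviceLine string) (RouterConfig, error) {
	return RouterConfig{}, nil
}

// ConfigCache re-fetches a service line's router config only when the cached
// copy is older than the TTL or has been explicitly invalidated.
type ConfigCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	entries   map[string]RouterConfig
	fetchedAt map[string]time.Time
}

func NewConfigCache(ttl time.Duration) *ConfigCache {
	return &ConfigCache{
		ttl:       ttl,
		entries:   make(map[string]RouterConfig),
		fetchedAt: make(map[string]time.Time),
	}
}

// Get serves from cache when fresh and calls the API otherwise.
func (c *ConfigCache) Get(ctx context.Context, serviceLine string) (RouterConfig, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if cfg, ok := c.entries[serviceLine]; ok && time.Since(c.fetchedAt[serviceLine]) < c.ttl {
		return cfg, nil
	}
	cfg, err := fetchRouterConfig(ctx, serviceLine)
	if err != nil {
		return RouterConfig{}, err
	}
	c.entries[serviceLine] = cfg
	c.fetchedAt[serviceLine] = time.Now()
	return cfg, nil
}

// Invalidate forces a refresh on the next Get, e.g. after a dashboard-driven
// config update.
func (c *ConfigCache) Invalidate(serviceLine string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.fetchedAt, serviceLine)
}
```

The 30-minute ETL cycle would still call Get each run (e.g., with NewConfigCache(6 * time.Hour)), but the API is only hit when the TTL expires or Invalidate is called after a config change.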
2. Introduce ETL Service-Side Rate Limiting
Add internal rate limiting logic in the ETL service.
Ensure the service respects the global API rate cap and reserves capacity for other services (especially time-sensitive config operations); see the sketch below.
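A sketch of such client-side throttling with golang.org/x/time/rate; the 200 req/min cap, which leaves roughly 50 req/min of the shared 250 req/min budget for other services, is an assumed split to be tuned:

```go
package etl

import (
	"net/http"

	"golang.org/x/time/rate"
)

// ThrottledClient wraps the HTTP client used for Starlink calls so every
// request first waits for a token from the limiter.
type ThrottledClient struct {
	Inner   *http.Client
	Limiter *rate.Limiter
}

// NewThrottledClient caps ETL traffic at roughly 200 requests per minute,
// keeping headroom under the shared 250 req/min source-IP limit for
// time-sensitive config operations (the split is an assumption).
func NewThrottledClient() *ThrottledClient {
	return &ThrottledClient{
		Inner:   &http.Client{},
		Limiter: rate.NewLimiter(rate.Limit(200.0/60.0), 5), // tokens/sec, small burst
	}
}

// Do blocks until the limiter grants a token, then performs the request.
func (c *ThrottledClient) Do(req *http.Request) (*http.Response, error) {
	if err := c.Limiter.Wait(req.Context()); err != nil {
		return nil, err
	}
	return c.Inner.Do(req)
}
```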
3. Future Scalability: Distributed ETL Service
Once the IP pool and rate limits are in place:
Refactor the ETL service to run in a distributed mode, where work is partitioned and executed in parallel across pods or workers.
This allows horizontal scaling without breaching rate limits (see the partitioning sketch below).
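One possible partitioning scheme, assuming the ETL workers are deployed as a Kubernetes StatefulSet so each pod can derive a stable ordinal from its hostname:

```go
package etl

import (
	"hash/fnv"
	"os"
	"strconv"
	"strings"
)

// PodOrdinal derives a stable worker index from a StatefulSet pod name such
// as "etl-worker-2" (assumes the ETL workers run as a StatefulSet).
func PodOrdinal() (int, error) {
	host, err := os.Hostname()
	if err != nil {
		return 0, err
	}
	parts := strings.Split(host, "-")
	return strconv.Atoi(parts[len(parts)-1])
}

// OwnedByThisPod partitions the account space evenly across `replicas`
// workers; each worker only fetches and processes the accounts it owns.
func OwnedByThisPod(accountID string, ordinal, replicas int) bool {
	h := fnv.New32a()
	h.Write([]byte(accountID))
	return int(h.Sum32())%replicas == ordinal
}
```

Whichever scheme is chosen, the service-side rate budget from the previous subsection would need to be split across replicas so the aggregate stays under the source-IP cap.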