
# Clickhouse Scraper

Mission Control integrates with Clickhouse to scrape configuration data from databases and cloud storage.

Use cases:

  • Import configuration items from Clickhouse tables and queries
  • Query data from AWS S3, Google Cloud Storage, and Azure Blob Storage
  • Process CSV, JSON, Parquet, and Iceberg file formats
  • Track changes in analytical data and business configurations
  • Monitor data pipeline outputs and transformation results

## Data Sources

Clickhouse's SQL engine can query data from multiple sources using table functions:

| Source | Description | Use Cases |
| --- | --- | --- |
| Clickhouse Database | Direct connection to Clickhouse tables and materialized views | Application configs, CDC tables, analytics data |
| AWS S3 | Query objects from S3 buckets using the `s3()` function | Data lake exports, CloudTrail logs, cost reports |
| Google Cloud Storage | Query objects from GCS buckets using the `gcs()` function | BigQuery exports, Dataflow outputs, backup configs |
| Azure Blob Storage | Query blobs from Azure containers using `azureBlobStorage()` | Data Factory exports, Azure logs, Power BI datasets |
| HTTP/HTTPS URLs | Query files from any HTTP endpoint using the `url()` function | Public datasets, REST API exports, remote configs |
| HDFS | Query files from Hadoop clusters using the `hdfs()` function | Hadoop ecosystem data, Spark outputs |
| MySQL | Query remote MySQL databases using the `mysql()` function | Legacy system configs, CRM data |
| PostgreSQL | Query remote PostgreSQL databases using the `postgresql()` function | Application databases, metadata stores |
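Because these sources are ordinary Clickhouse table functions, a query can be tried interactively in `clickhouse-client` before embedding it in a scraper. A minimal sketch, assuming a public bucket (the bucket name and path are placeholders):

```sql
-- Ad-hoc check of an S3 source; NOSIGN skips authentication for public buckets.
SELECT *
FROM s3('https://example-bucket.s3.amazonaws.com/configs/*.json', NOSIGN, 'JSONEachRow')
LIMIT 10;
```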

## Scraper

### Direct Clickhouse Connection

Query configuration data directly from Clickhouse tables. Each row returned by the query becomes a config item; the `items` mapping selects the columns that provide each item's type, ID, and name.

```yaml title="clickhouse-scraper.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: clickhouse-configs
spec:
  schedule: "@every 1h"
  clickhouse:
    - clickhouseURL: clickhouse://user:password@clickhouse:9000/configs
      query: |
        SELECT
          id,
          name,
          config_data,
          updated_at
        FROM application_configs
        WHERE active = true
      items:
        - type: Application::Config
          id: .id
          name: .name
```
### AWS S3 with Parquet Files

Query Parquet files stored in S3 using Clickhouse's `s3()` table function.

```yaml title="clickhouse-s3-parquet.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: s3-parquet-configs
spec:
  schedule: "@every 6h"
  clickhouse:
    - awsS3:
        bucket: config-exports
        path: daily/
        region: us-east-1
        accessKey:
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: access-key
        secretKey:
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: secret-key
      query: |
        SELECT *
        FROM s3(
          'https://config-exports.s3.amazonaws.com/daily/*.parquet',
          'Parquet'
        )
      items:
        - type: Export::Config
          id: .record_id
          name: .name
```
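If the Parquet schema is unfamiliar, running `DESCRIBE` against the same table function shows the columns Clickhouse infers, which helps when writing the `items` mapping:

```sql
-- Inspect the inferred schema of the Parquet files before scraping them.
DESCRIBE TABLE s3(
  'https://config-exports.s3.amazonaws.com/daily/*.parquet',
  'Parquet'
);
```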
### Google Cloud Storage with JSON

Query JSON files from Google Cloud Storage.

```yaml title="clickhouse-gcs-json.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: gcs-json-configs
spec:
  schedule: "@every 4h"
  clickhouse:
    - query: |
        SELECT *
        FROM gcs(
          'https://storage.googleapis.com/my-bucket/configs/*.json',
          'JSONEachRow'
        )
      items:
        - type: GCS::Config
          id: .id
          name: .name
```
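This example assumes the Clickhouse server already has access to the bucket; `gcs()` also accepts HMAC credentials inline. A sketch with placeholder keys:

```sql
-- gcs() with an explicit HMAC key/secret instead of server-side credentials.
SELECT *
FROM gcs(
  'https://storage.googleapis.com/my-bucket/configs/*.json',
  '<hmac-key>',
  '<hmac-secret>',
  'JSONEachRow'
);
```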
### Azure Blob Storage with CSV

Query CSV files from Azure Blob Storage using named collections.

```yaml title="clickhouse-azure-csv.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: azure-csv-configs
spec:
  schedule: "@every 2h"
  clickhouse:
    - azureBlobStorage:
        account: mystorageaccount
        container: configs
        path: exports/
        collection: AzureConfigs
      query: |
        SELECT *
        FROM azureBlobStorage(AzureConfigs, 'exports/*.csv', 'CSV')
      items:
        - type: Azure::Export
          id: .config_id
          name: .config_name
```
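The `AzureConfigs` named collection must already be defined on the Clickhouse server. A sketch of such a definition, with placeholder values (the available keys follow Clickhouse's `azureBlobStorage` options):

```sql
-- Named collection holding the storage account details referenced above.
CREATE NAMED COLLECTION AzureConfigs AS
  storage_account_url = 'https://mystorageaccount.blob.core.windows.net',
  container = 'configs',
  account_name = 'mystorageaccount',
  account_key = '<account-key>';
```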
### Iceberg Tables

Query Iceberg tables for versioned configuration data.

```yaml title="clickhouse-iceberg.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: iceberg-configs
spec:
  schedule: "@every 12h"
  clickhouse:
    - awsS3:
        bucket: data-lake
        path: iceberg/configs/
        region: us-west-2
        connection: connection://aws/data-lake
      query: |
        SELECT *
        FROM iceberg('s3://data-lake/iceberg/configs/')
        WHERE snapshot_id = (
          SELECT max(snapshot_id)
          FROM iceberg_snapshots('s3://data-lake/iceberg/configs/')
        )
      items:
        - type: DataLake::Config
          id: .id
          name: .name
```
### Access Logs Analysis

Import access logs from cloud storage for compliance tracking. The `{{.Date.Format "2006/01/02"}}` expression uses Go's reference-time layout to render the current date, so each run scans only that day's prefix.

```yaml title="clickhouse-access-logs.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: access-logs
spec:
  schedule: "@every 1h"
  clickhouse:
    - awsS3:
        bucket: access-logs
        path: "{{.Date.Format \"2006/01/02\"}}/"
        connection: connection://aws/logs
      query: |
        SELECT
          request_id,
          user_id,
          action,
          resource,
          timestamp,
          ip_address
        FROM s3(
          'https://access-logs.s3.amazonaws.com/{{.Date.Format "2006/01/02"}}/*.json.gz',
          'JSONEachRow'
        )
        WHERE timestamp > now() - INTERVAL 1 HOUR
      items:
        - type: AccessLog::Entry
          id: .request_id
          name: "{{.action}} by {{.user_id}}"
```
### Change Data Capture

Track configuration changes by querying CDC tables.

```yaml title="clickhouse-cdc.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: config-changes
spec:
  schedule: "@every 15m"
  full: true
  clickhouse:
    - clickhouseURL: clickhouse://user:password@clickhouse:9000/cdc
      query: |
        SELECT
          id,
          table_name,
          operation,
          before_data,
          after_data,
          changed_at
        FROM config_changes
        WHERE changed_at > now() - INTERVAL 15 MINUTE
        ORDER BY changed_at DESC
      items:
        - type: CDC::Change
          id: .id
          name: "{{.table_name}}.{{.operation}}"
```
### Remote MySQL Database

Query configuration data from a remote MySQL database using Clickhouse's `mysql()` table function. Simple comparisons in the WHERE clause, such as `active = 1`, are executed on the MySQL server, so only matching rows are transferred.

```yaml title="clickhouse-mysql.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: mysql-configs
spec:
  schedule: "@every 2h"
  clickhouse:
    - clickhouseURL: clickhouse://user:password@clickhouse:9000/default
      query: |
        SELECT
          id,
          name,
          settings,
          updated_at
        FROM mysql(
          'mysql-host:3306',
          'app_database',
          'configurations',
          'readonly_user',
          'password'
        )
        WHERE active = 1
      items:
        - type: MySQL::Config
          id: .id
          name: .name
```
### Remote PostgreSQL Database

Query configuration data from a remote PostgreSQL database.

```yaml title="clickhouse-postgresql.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: postgres-configs
spec:
  schedule: "@every 2h"
  clickhouse:
    - clickhouseURL: clickhouse://user:password@clickhouse:9000/default
      query: |
        SELECT
          id,
          name,
          config_json,
          modified_at
        FROM postgresql(
          'postgres-host:5432',
          'app_db',
          'feature_flags',
          'readonly',
          'password'
        )
      items:
        - type: Postgres::FeatureFlag
          id: .id
          name: .name
```
### HTTP/URL Data Source

Query JSON or CSV files from any HTTP endpoint.

```yaml title="clickhouse-url.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: url-configs
spec:
  schedule: "@every 1h"
  clickhouse:
    - clickhouseURL: clickhouse://user:password@clickhouse:9000/default
      query: |
        SELECT *
        FROM url(
          'https://api.example.com/exports/configs.json',
          'JSONEachRow'
        )
      items:
        - type: API::Config
          id: .id
          name: .name
```
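To avoid relying on schema inference from the response body, `url()` accepts an explicit structure as a third argument; a sketch with assumed column types:

```sql
-- Pin the schema rather than inferring it on every scrape.
SELECT *
FROM url(
  'https://api.example.com/exports/configs.json',
  'JSONEachRow',
  'id String, name String, updated_at DateTime'
);
```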
### AWS Cost and Usage Reports

Import AWS Cost and Usage Reports from S3 for cost tracking.

```yaml title="clickhouse-aws-cur.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: aws-cost-reports
spec:
  schedule: "@daily"
  clickhouse:
    - awsS3:
        bucket: aws-billing-reports
        path: cur/
        region: us-east-1
        connection: connection://aws/billing
      query: |
        SELECT
          line_item_resource_id as id,
          line_item_product_code as service,
          line_item_usage_type as usage_type,
          sum(line_item_unblended_cost) as cost,
          billing_period_start_date
        FROM s3(
          'https://aws-billing-reports.s3.amazonaws.com/cur/**/*.parquet',
          'Parquet'
        )
        WHERE billing_period_start_date >= today() - 30
        GROUP BY id, service, usage_type, billing_period_start_date
      items:
        - type: AWS::CostReport
          id: .id
          name: "{{.service}}/{{.usage_type}}"
```
### CloudTrail Logs

Import CloudTrail logs for security and compliance auditing.

```yaml title="clickhouse-cloudtrail.yaml"
apiVersion: configs.flanksource.com/v1
kind: ScrapeConfig
metadata:
  name: cloudtrail-events
spec:
  schedule: "@every 30m"
  clickhouse:
    - awsS3:
        bucket: cloudtrail-logs
        path: "AWSLogs/{{.AccountID}}/CloudTrail/"
        region: us-east-1
        connection: connection://aws/security
      query: |
        SELECT
          eventID as id,
          eventName,
          eventSource,
          userIdentity,
          sourceIPAddress,
          eventTime
        FROM s3(
          'https://cloudtrail-logs.s3.amazonaws.com/AWSLogs/*/CloudTrail/**/*.json.gz',
          'JSONEachRow'
        )
        WHERE eventTime > now() - INTERVAL 1 HOUR
      items:
        - type: CloudTrail::Event
          id: .id
          name: "{{.eventSource}}/{{.eventName}}"
```

## Supported File Formats

| Format | Extension | Description | Best For |
| --- | --- | --- | --- |
| CSV | `.csv`, `.tsv` | Comma/tab-separated values | Simple exports, spreadsheet data |
| JSON | `.json`, `.ndjson` | JSON and newline-delimited JSON | API exports, nested structures |
| Parquet | `.parquet` | Columnar storage format | Large datasets, analytics |
| Iceberg | - | Apache Iceberg table format | Versioned data lakes, time travel |
| ORC | `.orc` | Optimized Row Columnar format | Hive/Hadoop ecosystem |
| Avro | `.avro` | Apache Avro format | Schema evolution, Kafka data |
| XML | `.xml` | Extensible Markup Language | Legacy systems, SOAP exports |
| Protobuf | `.pb` | Protocol Buffers | gRPC services, compact binary |
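Recent Clickhouse versions can usually infer both the format and the schema from the file extension and contents, so the format argument is often optional; the examples above pass it explicitly, which is more predictable for scheduled scrapes. A sketch:

```sql
-- 'Parquet' is detected automatically from the .parquet extension here.
SELECT count()
FROM s3('https://config-exports.s3.amazonaws.com/daily/*.parquet');
```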

## Deployment

Mission Control can deploy a Clickhouse instance for you:

```yaml title="values.yaml"
config-db:
  clickhouse:
    enabled: true
```

Alternatively, connect to an existing Clickhouse server by setting the `clickhouseURL` parameter in the scraper spec.
