Integrating dbt and ClickHouse
Index
The dbt-clickhouse Plugin
dbt (data build tool) enables analytics engineers to transform data in their warehouses by simply writing select statements. dbt handles materializing these select statements into objects in the database in the form of tables and views - performing the T of Extract Load and Transform (ELT). Users can create a model defined by a SELECT statement.
Within dbt, these models can be cross-referenced and layered to allow the construction of higher-level concepts. The boilerplate SQL required to connect models is automatically generated. Furthermore, dbt identifies dependencies between models and ensures they are created in the appropriate order using a directed acyclic graph (DAG).
Dbt is compatible with ClickHouse through a ClickHouse-supported plugin. We describe the process for connecting ClickHouse with a simple example based on a publicly available IMDB dataset. We additionally highlight some of the limitations of the current connector.
Supported features
- Table materialization
- View materialization
- Incremental materialization
- Microbatch incremental materialization
- Materialized View materializations (uses the
TO
form of MATERIALIZED VIEW, experimental) - Seeds
- Sources
- Docs generate
- Tests
- Snapshots
- Most dbt-utils macros (now included in dbt-core)
- Ephemeral materialization
- Distributed table materialization (experimental)
- Distributed incremental materialization (experimental)
- Contracts
Concepts
dbt introduces the concept of a model. This is defined as a SQL statement, potentially joining many tables. A model can be "materialized" in a number of ways. A materialization represents a build strategy for the model's select query. The code behind a materialization is boilerplate SQL that wraps your SELECT query in a statement in order to create a new or update an existing relation.
dbt provides 4 types of materialization:
- view (default): The model is built as a view in the database.
- table: The model is built as a table in the database.
- ephemeral: The model is not directly built in the database but is instead pulled into dependent models as common table expressions.
- incremental: The model is initially materialized as a table, and in subsequent runs, dbt inserts new rows and updates changed rows in the table.
Additional syntax and clauses define how these models should be updated if their underlying data changes. dbt generally recommends starting with the view materialization until performance becomes a concern. The table materialization provides a query time performance improvement by capturing the results of the model's query as a table at the expense of increased storage. The incremental approach builds on this further to allow subsequent updates to the underlying data to be captured in the target table.
The current plugin for ClickHouse supports also support materialized view, dictionary, distributed table and distributed incremental materializations. The plugin also supports dbt snapshots and seeds.
Details about supported materializations
Type | Supported? | Details |
---|---|---|
view materialization | YES | Creates a view. |
table materialization | YES | Creates a table. See below for the list of supported engines. |
incremental materialization | YES | Creates a table if it doesn't exist, and then writes only updates to it. |
ephemeral materialized | YES | Creates a ephemeral/CTE materialization. This does model is internal to dbt and does not create any database objects |
The following are experimental features in Clickhouse:
Type | Supported? | Details |
---|---|---|
Materialized View materialization | YES, Experimental | Creates a materialized view. |
Distributed table materialization | YES, Experimental | Creates a distributed table. |
Distributed incremental materialization | YES, Experimental | Incremental model based on the same idea as distributed table. Note that not all strategies are supported, visit this for more info. |
Dictionary materialization | YES, Experimental | Creates a dictionary. |
Setup of dbt and the ClickHouse plugin
Install dbt-core and dbt-clickhouse
Provide dbt with the connection details for our ClickHouse instance.
Configure clickhouse
profile in ~/.dbt/profiles.yml file and provide user, password, schema host properties. The full list of connection configuration options is available in the Features and configurations page:
Create a dbt project
Inside project_name
dir, update your dbt_project.yml
file to specify a profile name to connect to the ClickHouse server.
Test connection
Execute dbt debug
with the CLI tool to confirm whether dbt is able to connect to ClickHouse. Confirm the response includes Connection test: [OK connection ok]
indicating a successful connection.
We assume the use of the dbt CLI for the following examples. This plugin is still not available for usage inside dbt Cloud, but we expect to get it available soon. Please reach out to support to get more info on this.
dbt offers a number of options for CLI installation. Follow the instructions described here. At this stage install dbt-core only. We recommend the use of pip
to install both dbt and dbt-clickhouse.
Go to the guides page to learn more about how to use dbt with ClickHouse.
Troubleshooting Connections
If you encounter issues connecting to ClickHouse from dbt, make sure the following criteria are met:
- The engine must be one of the supported engines.
- You must have adequate permissions to access the database.
- If you're not using the default table engine for the database, you must specify a table engine in your model configuration.
Limitations
The current ClickHouse plugin for dbt has several limitations users should be aware of:
- The plugin currently materializes models as tables using an
INSERT TO SELECT
. This effectively means data duplication. Very large datasets (PB) can result in extremely long run times, making some models unviable. Aim to minimize the number of rows returned by any query, utilizing GROUP BY where possible. Prefer models which summarize data over those which simply perform a transform whilst maintaining row counts of the source. - To use Distributed tables to represent a model, users must create the underlying replicated tables on each node manually. The Distributed table can, in turn, be created on top of these. The plugin does not manage cluster creation.
- When dbt creates a relation (table/view) in a database, it usually creates it as:
{{ database }}.{{ schema }}.{{ table/view id }}
. ClickHouse has no notion of schemas. The plugin therefore uses{{schema}}.{{ table/view id }}
, whereschema
is the ClickHouse database. - Ephemeral models/CTEs don't work if placed before the
INSERT INTO
in a ClickHouse insert statement, see https://github.com/ClickHouse/ClickHouse/issues/30323. This should not affect most models, but care should be taken where an ephemeral model is placed in model definitions and other SQL statements.
Further Information
The previous guides only touch the surface of dbt functionality. Users are recommended to read the excellent dbt documentation.
Additional configuration for the plugin is described here.
Fivetran
The dbt-clickhouse
connector is also available for use in Fivetran transformations, allowing seamless integration and transformation capabilities directly within the Fivetran platform using dbt
.