βΆdbt vs traditional ETL (like Informatica), what's the difference?
dbt is an ELT framework (Extract Load Transform), data lands in your warehouse first, then dbt transforms it using SQL. Traditional ETL transforms before loading. dbt's advantage: leverage the warehouse's compute (Snowflake, BigQuery), version-control transformations via Git, test data quality, document lineage, and iterate quickly. Drawback: requires centralized data warehouse (not cheap for small datasets). Use dbt if you have Snowflake/BigQuery/Redshift + >1GB data. Use traditional ETL if moving raw files between on-prem systems.
βΆWhat's a 'model' in dbt and how does it differ from a view or table?
A dbt model = a SQL file that defines a transformation. When you run dbt, each model becomes either a view or table in your warehouse (you choose via materialization config). View = query runs every time (cheaper compute, slower), table = pre-computed and stored (expensive compute, fast queries). Incremental model = hybrid (update only new rows since last run). Use tables for frequently-queried fact tables, views for intermediate staging layers. `SELECT * FROM {{ ref('my_model') }}` pulls from upstream models.
βΆHow do I test data quality in dbt and prevent bad data from reaching downstream dashboards?
dbt includes built-in tests (unique, not_null, relationships, accepted_values) in schema.yml and custom SQL tests. Example: `- name: user_id\n tests:\n - unique\n - not_null`. Run `dbt test` to validate. Fail fast: block downstream models if tests fail (dbt's default). Advanced: write custom tests (e.g., 'conversion rate between 0 and 1'). Test at source level (raw data) and model level (transformed). Monitor test coverage: aim for >80% of columns tested.
βΆSnapshots and incremental models, when do I use each?
Incremental = transforms only NEW rows since last run (fast, cheap). Use for fact tables growing daily (events, transactions). Snapshot = captures a point-in-time view of a slowly-changing dimension (SCD). Example: snapshot user profile each day, track when 'plan' column changed from free to paid. Incremental fails if you need historical versions of a row; snapshot is designed for that. Rule of thumb: incremental for immutable events, snapshots for mutable dimensions.
βΆHow do I set up CI/CD for dbt in production (dbt Cloud vs self-hosted)?
dbt Cloud = managed solution, runs jobs on dbt's infrastructure, built-in scheduler, PR deployment previews (run jobs against staging branch). Cost ~$100-300/mo. Self-hosted (Airflow/Dagster) = you manage Python orchestrator, invoke dbt CLI in tasks, more control but more ops. Modern best practice: use dbt Cloud for small teams (<$10M ARR), self-hosted for large companies managing complex DAGs. dbt Cloud integration: commit to branch, dbt automatically runs `dbt test` + `dbt build` on preview branch before merging to main.
βΆWhat's the dbt semantic layer and how does it change analytics?
Semantic layer = centralized metric definitions (one source of truth for KPIs). Instead of each BI tool defining 'revenue' differently, dbt semantic layer defines it once (SELECT SUM(amount) WHERE status='paid'), and Looker/Tableau/Metabase pull from it. Benefit: consistency, fast iterations (change metric definition once, all dashboards update). Drawback: new workflow, requires dbt Cloud (not Core), learning curve. Use if: multiple BI tools or teams defining metrics differently. Skip if: small team, single BI tool.
βΆHow do I debug dbt when transformations produce wrong results?
Use `dbt debug` to check Warehouse connection. Use `dbt run --select my_model` to run one model. Check `target/compiled/` folder to see the actual SQL generated (Jinja templates expanded). Check `target/run_results.json` for execution times and test failures. Common bugs: missing WHERE clause (forgot to filter source data), wrong join logic (many-to-many join creating duplicates), stale dependencies (old ref() to wrong model). Always run `dbt test` after model changes. Use `dbt freshness` to check source data staleness.