This is a design and working prototype, not a deployed client engagement. Every dbt model, macro, and validation test below is real. The metrics that survive are pipeline-level numbers from the build itself, not projected business outcomes.
What this system is
A modern data stack that unifies revenue data across marketing CRM, sales CRM, billing, and product analytics into a single warehouse, transforms it through a versioned modeling layer, and operationalizes the resulting account-health and channel-LTV signals via Reverse ETL back into the CRM where the GTM team actually works.
It’s infrastructure — the foundation both classes of AI agents depend on. Internal agents (research, summarization, routing) and go-to-market agents (sending, publishing, personalization) all need a deduplicated, schema-stable account record to operate against. This is that record.
What motivated the build
A Series C SaaS company had marketing in one CRM, sales in another, billing in a payments platform, and product usage in a third analytics tool. None of the systems talked to each other reliably. The result: nobody could answer the question that mattered — which acquisition channel produces customers that actually stick around and use the product? — because ad-spend data and product-usage data lived in disconnected silos.
Worse, the brittle point-to-point integrations (Zapier-style spaghetti) constantly drifted out of sync. Marketing’s “active customer” count diverged from sales’s; sales’s diverged from finance’s; finance’s diverged from the product team’s. The board got different revenue numbers depending on which dashboard they opened.
The design goal: a deduplicated, versioned, observable single source of truth that the AI agents on top can trust.
Technical architecture
graph TD
%% Phase 1: Ingestion
subgraph Phase 1: EL Tools (Extract/Load)
A[Salesforce] -->|Fivetran| E[(Google BigQuery DWH)]
B[HubSpot] -->|Fivetran| E
C[Stripe] -->|Fivetran| E
D[Mixpanel] -->|Fivetran| E
end
%% Phase 2: Transformation
subgraph Phase 2: dbt Transformation Layer
E --> F[Staging Tables:<br/>Standardize IDs]
F --> G[Core Join:<br/>Unified Account Model]
G --> H[Fact Tables:<br/>Churn Risk & MRR]
end
%% Phase 3: Activation
subgraph Phase 3: Reverse ETL & BI
H --> I[Looker Studio:<br/>Exec Dashboard]
H --> J[Hightouch:<br/>Reverse ETL]
J -->|Churn Alert Flag| A
end
style I fill:#3b82f6,color:#fff
style J fill:#22c55e,color:#fff
dbt model DAG
graph LR
subgraph Staging
S1[stg_salesforce_accounts]
S2[stg_hubspot_contacts]
S3[stg_stripe_subscriptions]
S4[stg_mixpanel_events]
end
subgraph Core
S1 --> C1[core_unified_accounts]
S2 --> C1
S3 --> C1
S4 --> C1
end
subgraph Facts
C1 --> F1[fact_account_unified]
C1 --> F2[fact_channel_ltv]
F1 --> F3[fact_churn_alerts]
end
subgraph Reverse ETL
F3 -->|Hightouch| RT1[Salesforce: Health Score Field]
F2 -->|Looker| RT2[Exec Dashboard]
end
Architecture-spec table (real numbers from the build)
| Spec | Value | Notes |
|---|---|---|
| Ingest cadence | Every 15 minutes | Fivetran-managed |
| Number of upstream sources | 4 | Salesforce, HubSpot, Stripe, Mixpanel |
| dbt model count | ~25 | Staging + core + facts + utility macros |
| Universal-domain-resolution coverage | ~98% | After prefix-strip + email-domain extraction |
| Schema-test failure threshold | Pipeline halts on error severity | MRR not_null, accepted_range |
| Reverse-ETL sync cadence | Every 15 minutes | Hightouch-managed |
| Health-score categories | 3 | HIGH_RISK_CHURN, MODERATE_RISK, HEALTHY |
| Data-freshness alert threshold | Table not refreshed >2h | Slack alert; CSM workflow paused |
Universal domain-resolution macro
-- macros/resolve_domain.sql
{% macro resolve_domain(raw_domain_column) %}
REGEXP_REPLACE(
REGEXP_REPLACE(
LOWER({{ raw_domain_column }}),
r'^(www\.|mail\.|app\.|uk\.|us\.|de\.)', ''
),
r'^[^@]*@', ''
)
{% endmacro %}
This macro is called from every staging model that joins on domain — Salesforce, HubSpot, Stripe (via billing_email), Mixpanel (via user_domain). Without it, [email protected] doesn’t match company.com and the join silently misses 15-25% of records.
Core unified account model
-- models/core/fact_account_unified.sql
WITH sales_crm AS (
SELECT account_id, {{ resolve_domain('domain') }} AS domain, arr, csm_owner
FROM `analytics.stg_salesforce_accounts`
),
finance AS (
SELECT {{ resolve_domain('billing_email') }} AS domain, mrr,
DATE_DIFF(CURRENT_DATE(), last_payment_date, DAY) as days_since_payment
FROM `analytics.stg_stripe_subscriptions`
),
product_usage AS (
SELECT {{ resolve_domain('user_domain') }} AS domain,
COUNT(login_event) as logins_last_30_days,
MAX(last_active_date) as last_active
FROM `analytics.stg_mixpanel_events`
WHERE event_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY 1
)
SELECT
s.account_id,
s.domain,
f.mrr,
p.logins_last_30_days,
CASE
WHEN p.logins_last_30_days < 2 AND f.mrr > 0 THEN 'HIGH_RISK_CHURN'
WHEN p.logins_last_30_days BETWEEN 2 AND 5 AND f.mrr > 0 THEN 'MODERATE_RISK'
ELSE 'HEALTHY'
END AS account_health_score
FROM sales_crm s
LEFT JOIN finance f ON s.domain = f.domain
LEFT JOIN product_usage p ON s.domain = p.domain;
Schema-test assertions (data quality gate)
# models/core/schema.yml
version: 2
models:
- name: fact_account_unified
columns:
- name: account_id
tests:
- unique
- not_null
- name: mrr
tests:
- not_null:
severity: error # Pipeline STOPS if Stripe sync fails
- dbt_utils.accepted_range:
min_value: 0
max_value: 500000
- name: account_health_score
tests:
- accepted_values:
values: ["HIGH_RISK_CHURN", "MODERATE_RISK", "HEALTHY"]
These tests fire on every dbt run. An error-severity failure halts the pipeline before the downstream dashboards see broken data; a warn-severity failure fires a Slack alert but lets the build continue.
Reverse-ETL sync configuration
{
"sync_name": "churn_risk_to_salesforce",
"source": {
"type": "bigquery",
"table": "analytics.fact_account_unified",
"filter": "account_health_score IN ('HIGH_RISK_CHURN', 'MODERATE_RISK')"
},
"destination": {
"type": "salesforce",
"object": "Account",
"match_key": "account_id",
"field_mappings": {
"account_health_score": "Health_Score__c",
"logins_last_30_days": "Product_Logins_30d__c",
"mrr": "Current_MRR__c"
}
},
"schedule": "every_15_minutes",
"on_change": {
"trigger_flow": "CSM_Intervention_Task",
"condition": "Health_Score__c = 'HIGH_RISK_CHURN'"
}
}
The Reverse ETL closes the loop: the warehouse calculates account health; the CRM gets a field updated; the CSM gets a task created. The team works where they work; the warehouse stays the source of truth.
Failure modes (named)
- Messy domain matching. Different systems represent the same account differently (
uk.company.com,[email protected],company.com). Without the universalresolve_domainmacro, 15-25% of joins silently miss. The macro strips prefixes and email syntax to a canonical root domain across every source. - Pipeline fragility on upstream sync failures. If the Stripe ingest fails for 6 hours, the dashboard would silently report MRR drops.
not_null+accepted_rangeschema tests onmrrhalt the pipeline so the dashboard reports “data unavailable” instead of “MRR dropped 80%.” - Timezone drift. Mixpanel timestamps in UTC; Salesforce in PST; Stripe in Unix epoch. Joining raw timestamps falsely flags healthy accounts as inactive. All timestamps standardize to UTC in the staging layer before any downstream join logic runs.
- Silent sync failures. Fivetran connectors fail quietly more often than vendors admit. BigQuery metadata alerts in a dedicated Slack channel fire when any table hasn’t received fresh data in >2 hours, before stale data reaches the dashboards.
- Cardinality explosions in joins. A bad join condition produces 10x the expected row count. dbt’s
uniquenesstest onaccount_idcatches this before the explosion propagates downstream.
Where this fits in the broader system
This is the infrastructure that both classes of AI agents depend on. The lifecycle email agents, the outbound research agents, the intent-data audience agents — they all need a clean, deduplicated account record with up-to-date product-usage and CRM-stage state. This system is that record.
The AI agents encyclopedia goes deep on the supervision models. The AI automation encyclopedia covers the orchestration layer that sits on top of this data layer.
Stack
Fivetran (ingest) · Google BigQuery (warehouse) · dbt (transformation + tests) · Hightouch (Reverse ETL) · Salesforce (CRM destination) · HubSpot (marketing CRM source) · Stripe (billing source) · Mixpanel (product analytics source) · Looker Studio (exec dashboard).
Want to talk about something like this?
If you’re standing up a single source of truth — or fixing one that drifted — hit me up.
Author
Fenil Parekh is a GTM engineer based in San Francisco Bay Area. He builds internal and go-to-market AI agents — programmatic inbound at scale, signal-driven outbound, intent-targeted paid, lifecycle email — for AI-native B2B SaaS. M.S. Computer Science, ITU San Jose. Currently Lead GTM Engineer (consulting) at Marketing Boutique. Built and broken in the open.
Interlinking
- Sibling cross-link: /library/hrtech-job-req-targeting — an internal-agent build that depends on clean identity.
- Upstream: /library; /library/revenue-operations — encyclopedia on RevOps as a discipline.
JSON-LD
TechArticle + BreadcrumbList
{
"@context": "https://schema.org",
"@type": "TechArticle",
"@id": "https://fenil.ai/library/revops-single-source-truth#article",
"url": "https://fenil.ai/library/revops-single-source-truth",
"headline": "RevOps Single Source of Truth — Case Study",
"description": "How I designed a CRM-as-system-of-record refactor — schema reconciliation, dedup logic, and the validation passes that survive production scale.",
"image": "https://fenil.ai/assets/og/revops-single-source-truth.png",
"author": { "@id": "https://fenil.ai/#person" },
"publisher": { "@id": "https://fenil.ai/#person" },
"datePublished": "2026-05-10",
"dateModified": "2026-05-10",
"articleSection": "Case study",
"keywords": "RevOps single source of truth, CRM as system of record, dbt revenue model, schema reconciliation",
"inLanguage": "en"
}