This is a design and working prototype, not a deployed client engagement. Every dataset schema, batch script, sitemap generator, and validator below is real. The metrics that survive are runtime metrics from the build itself, not projected client outcomes.

What this system is

A programmatic inbound engine that treats content generation as a data pipeline: enrich a base dataset with structured API context, generate unique per-page introductions through a validated LLM extraction step, and deploy thousands of indexable pages through a hub-and-spoke architecture with a generated sitemap index.

It’s a go-to-market agent on the inbound surface. The agent runs the surface directly: it produces publicly indexed content at scale. Supervision is explicit: an output-QA validator gates LLM intros against length and brand-voice rules before any page ships, and a sitemap drip rate-limits indexation requests so the build doesn’t trip spam classifiers.

What motivated the build

A travel marketplace needed to capture “high-intent, low-volume” search queries like “Best boutique hotels in Austin Texas for families” across thousands of cities. The economics of writing 10,000 unique pages manually were unworkable — six-figure cost, two-year timeline, editorial bottleneck the whole way. The underlying data existed (hotels, cities, geography); the missing piece was the architecture that turned database rows into indexable, unique URL endpoints without triggering thin-content penalties.

The design goal: ship the corpus in three weeks, keep the per-page cost sub-cent, and survive Google’s quality thresholds at indexation time.

Technical architecture

graph TD
    %% Phase 1: Data Aggregation
    subgraph P1["Phase 1: Data Enrichment"]
        A["Core DB:<br/>Hotel Names & Locations"] --> B[(Airtable Master)]
        C[Weather API] -->|Zipcode Sync| B
        D[Google Places API] -->|Nearby Attractions| B
    end

    %% Phase 2: AI Content Generation
    subgraph P2["Phase 2: Content Generation (Node.js)"]
        B --> E["Node Script:<br/>Batch Fetch Rows"]
        E --> F["OpenAI API:<br/>Generate Unique Intro"]
        F -->|Write Back| B
    end

    %% Phase 3: Deployment
    subgraph P3["Phase 3: Headless Deployment"]
        B -->|Real-time Sync| G[Whalesync]
        G --> H[Webflow CMS / Next.js]
        H --> I((10,000 Live URLs))
    end

    style I fill:#22c55e,color:#fff

Hub-and-spoke site architecture

graph TD
    A["Homepage:<br/>/best-hotels"] --> B["State Hub (x50):<br/>/best-hotels/texas"]
    A --> C["State Hub:<br/>/best-hotels/california"]
    A --> D["State Hub:<br/>/best-hotels/new-york"]

    B --> E["City Spoke:<br/>/best-hotels/texas/austin"]
    B --> F["City Spoke:<br/>/best-hotels/texas/dallas"]
    B --> G["City Spoke:<br/>/best-hotels/texas/houston"]

    E --> H["Internal Links:<br/>Nearby City Spokes"]
    F --> H
    G --> H

    style A fill:#6366f1,color:#fff
    style B fill:#8b5cf6,color:#fff
    style C fill:#8b5cf6,color:#fff
    style D fill:#8b5cf6,color:#fff
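
The diagram links each city spoke to nearby city spokes but does not say how "nearby" is computed. One way to sketch it, assuming each city row carries `lat` and `lng` (as the batch runner's `city.lat` / `city.lng` suggest), is to rank the other cities by great-circle distance. The function names `haversineMiles` and `nearbySpokes`, the sample coordinates, and the choice not to restrict links to the same state are all assumptions for illustration.

```javascript
// Great-circle distance in miles between two { lat, lng } points (haversine formula).
function haversineMiles(a, b) {
  const R = 3958.8; // mean Earth radius in miles
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * R * Math.asin(Math.sqrt(h));
}

// Return the `count` closest other cities to `city`, nearest first.
function nearbySpokes(city, allCities, count = 3) {
  return allCities
    .filter((other) => other.slug !== city.slug)
    .map((other) => ({ slug: other.slug, miles: haversineMiles(city, other) }))
    .sort((x, y) => x.miles - y.miles)
    .slice(0, count);
}

// Illustrative rows only; coordinates are approximate city centers.
const cities = [
  { slug: "texas/austin", lat: 30.2672, lng: -97.7431 },
  { slug: "texas/san-antonio", lat: 29.4241, lng: -98.4936 },
  { slug: "texas/houston", lat: 29.7604, lng: -95.3698 },
  { slug: "texas/dallas", lat: 32.7767, lng: -96.797 }
];

const links = nearbySpokes(cities[0], cities, 2).map((c) => c.slug);
console.log(links);
```

Precomputing these links once per enrichment pass and storing them on the row would keep the page template free of distance logic.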

Architecture-spec table (real numbers from the build)

| Spec | Value | Notes |
| --- | --- | --- |
| Total page count | 10,000+ | 50 state hubs × ~200 city spokes |
| Per-page LLM cost | ~$0.015 per page | GPT-4-turbo, ~50-word intro |
| Total LLM API spend | ~$150 for the full corpus | one-time generation pass |
| Batch concurrency | 5 concurrent calls | p-limit queue throttle |
| Batch cadence | ~50 rows/minute | with 1.2s inter-batch delay |
| LLM output QA rejection rate | ~3% | re-queued with stricter prompt |
| Sitemap structure | sitemap index + 50 per-state sitemaps | ~200 URLs per state sitemap |
| Indexation drip schedule | week 1: 50/day · week 2: 100/day · week 3: 500/day | manual GSC monitoring |
| Programmatic data per page | ≥30% of rendered content | unique data points per city |

Master table schema (Airtable)

| Column | Type | Source | Description |
| --- | --- | --- | --- |
| city_name | string | Core DB | City name for URL slug + H1 |
| state | string | Core DB | State for hub grouping |
| zip_code | string | Core DB | Primary key for API lookups |
| hotel_count | integer | Core DB | Listed hotels in city |
| avg_temp_summer | float | Weather API | July avg temperature (°F) |
| avg_temp_winter | float | Weather API | January avg temperature (°F) |
| best_season | string | Weather API | Calculated optimal travel period |
| top_3_landmarks | array | Google Places API | Nearest attractions within 15 mi |
| landmark_distances | array | Google Places API | Miles from city center |
| ai_intro_paragraph | text | OpenAI | GPT-4 generated 50-word intro |
| slug | formula | calculated | URL-safe {state}/{city} path |
| last_enriched | datetime | system | Timestamp of last enrichment pass |
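
The `slug` column is an Airtable formula in the build, and the formula itself is not shown here. As a rough JavaScript equivalent of what a URL-safe `{state}/{city}` path could look like, here is a minimal sketch. The names `slugify` and `buildSlug` and the exact normalization rules (lowercase, strip diacritics, hyphenate everything else) are assumptions, not the build's formula.

```javascript
// Lowercase, strip diacritics, and collapse any run of non-alphanumerics into one hyphen.
function slugify(text) {
  return text
    .normalize("NFKD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

// Build the {state}/{city} path used under /best-hotels/.
function buildSlug(state, cityName) {
  return `${slugify(state)}/${slugify(cityName)}`;
}

console.log(buildSlug("Texas", "Austin"));
console.log(buildSlug("New York", "St. Louis"));
```

Whatever rules are used, two different cities in the same state must not collapse to the same slug, so a uniqueness check over the generated column is worth running before deployment.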

LLM extraction step (per-page intro)

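// Uses the v3 `openai` Node SDK (Configuration / OpenAIApi), which matches the
// createChatCompletion call below. OPENAI_API_KEY is an assumed env var name.
const { Configuration, OpenAIApi } = require("openai");

const openai = new OpenAIApi(
  new Configuration({ apiKey: process.env.OPENAI_API_KEY })
);

async function generateIntro(cityData) {
  const prompt = `
    You are a luxury travel guide. Write a strict 50-word introduction for a landing page about hotels in ${cityData.name}.
    You MUST mention the following data naturally:
    1. The proximity to ${cityData.top_landmarks.join(", ")}.
    2. Suggest visiting during ${cityData.best_season} because the average temperature is ${cityData.avg_temp_f}°F.
    Do not use generic filler words like 'bustling' or 'vibrant'.
  `;

  const response = await openai.createChatCompletion({
    model: "gpt-4-turbo",
    messages: [{ role: "user", content: prompt }]
  });

  // Return only the generated text, not the full API response object.
  return response.data.choices[0].message.content.trim();
}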
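
The output-QA gate that sits between this step and the write-back is described in this document only in prose: reject intros that are short (under 30 words), contain blocklist terms ("vibrant," "bustling," "nestled," "gem"), or are missing required data points. A minimal sketch of such a gate, assuming the same `cityData` shape passed to `generateIntro`, might look like the following. The name `validateIntro`, the reason codes, and the plain substring matching for landmarks and season are hypothetical. It also checks only a subset of the required data points (landmarks and season, not temperature).

```javascript
// Thresholds and terms come from the failure-mode notes; names are hypothetical.
const MIN_WORDS = 30;
const BLOCKLIST = ["vibrant", "bustling", "nestled", "gem"];

function validateIntro(intro, cityData) {
  const reasons = [];
  const text = intro.trim();
  const lower = text.toLowerCase();
  const words = text.split(/\s+/).filter(Boolean);

  // Length rule: anything under 30 words is rejected as too short.
  if (words.length < MIN_WORDS) reasons.push("too_short");

  // Brand-voice rule: whole-word match so "gem" does not flag "gemstone".
  for (const term of BLOCKLIST) {
    if (new RegExp(`\\b${term}\\b`, "i").test(text)) {
      reasons.push(`blocklist:${term}`);
    }
  }

  // Required data points: every landmark and the recommended season must appear.
  for (const landmark of cityData.top_landmarks) {
    if (!lower.includes(landmark.toLowerCase())) {
      reasons.push(`missing_landmark:${landmark}`);
    }
  }
  if (!lower.includes(String(cityData.best_season).toLowerCase())) {
    reasons.push("missing_season");
  }

  return { ok: reasons.length === 0, reasons };
}
```

Returning the list of reasons rather than a bare boolean makes it possible to tailor the stricter re-queue prompt to the specific failure. Exact substring matching is crude, since an intro that paraphrases a landmark name would be rejected, so a real gate would likely need looser matching.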

Batch runner with rate limiting

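const pLimit = require("p-limit"); // p-limit v3 or earlier; v4+ is ESM-only and needs `import`
const Airtable = require("airtable");

// Assumed configuration: the env var names and the "Cities" table name are placeholders.
const base = new Airtable({ apiKey: process.env.AIRTABLE_API_KEY }).base(
  process.env.AIRTABLE_BASE_ID
);
const airtable = {
  update: (recordId, fields) => base("Cities").update(recordId, fields)
};

const limit = pLimit(5); // at most 5 cities in flight at once
const BATCH_SIZE = 50;
const DELAY_BETWEEN_BATCHES_MS = 1200;
const MAX_ATTEMPTS = 3; // assumed retry cap for rate-limited rows

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Split an array into consecutive chunks of at most `size` items.
function chunkArray(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// fetchWeatherData and fetchNearbyLandmarks are the project's enrichment helpers
// for the Weather and Google Places APIs; they are defined elsewhere.
async function enrichCity(city) {
  const weather = await fetchWeatherData(city.zip_code);
  const places = await fetchNearbyLandmarks(city.lat, city.lng);
  const topPlaces = places.slice(0, 3); // the prompt and the table both use the top 3

  const intro = await generateIntro({
    ...city,
    ...weather,
    top_landmarks: topPlaces.map((p) => p.name)
  });

  await airtable.update(city.record_id, {
    avg_temp_summer: weather.avg_temp_july,
    top_3_landmarks: JSON.stringify(topPlaces),
    ai_intro_paragraph: intro,
    last_enriched: new Date().toISOString()
  });
}

async function processAllCities(cities) {
  const batches = chunkArray(cities, BATCH_SIZE);
  let processed = 0;
  const errors = [];

  for (const batch of batches) {
    const promises = batch.map((city) =>
      limit(async () => {
        for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
          try {
            await enrichCity(city);
            processed++;
            return;
          } catch (err) {
            const retryable = err.status === 429 && attempt < MAX_ATTEMPTS;
            if (!retryable) {
              errors.push({ city: city.name, error: err.message });
              return;
            }
            // Honor Retry-After when present; otherwise back off 60s, then 120s.
            await sleep(err.retryAfter ? err.retryAfter * 1000 : 60000 * 2 ** (attempt - 1));
          }
        }
      })
    );

    await Promise.all(promises);
    await sleep(DELAY_BETWEEN_BATCHES_MS);
    console.log(`Processed ${processed}/${cities.length} | Errors: ${errors.length}`);
  }

  return { processed, errors };
}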

Dynamic sitemap-index generation

const { writeFile } = require("fs/promises");

// `airtable` is the project's data-access wrapper; getUniqueStates and
// getCitiesByState are its helpers. Wrapped in a function because top-level
// await is not available in CommonJS modules.
async function generateSitemaps() {
  const states = await airtable.getUniqueStates();

  const sitemapIndex = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  ${states.map((state) => `
    <sitemap>
      <loc>https://example.com/sitemaps/sitemap-${state.slug}.xml</loc>
      <lastmod>${new Date().toISOString()}</lastmod>
    </sitemap>`).join("")}
</sitemapindex>`;

  // Write the index itself; this output path is an assumption.
  await writeFile("./public/sitemap.xml", sitemapIndex);

  for (const state of states) {
    const cities = await airtable.getCitiesByState(state.name);
    const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
  <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
    ${cities.map((city) => `
      <url>
        <loc>https://example.com/best-hotels/${state.slug}/${city.slug}</loc>
        <lastmod>${city.last_enriched}</lastmod>
        <changefreq>monthly</changefreq>
        <priority>0.7</priority>
      </url>`).join("")}
  </urlset>`;

    await writeFile(`./public/sitemaps/sitemap-${state.slug}.xml`, sitemap);
  }
}
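
The generator above writes every city URL in one pass. The drip schedule from the spec table (50/day in week 1, 100/day in week 2, 500/day in week 3) could be layered on by including only the URLs released so far each time the sitemaps are regenerated. The sketch below shows the quota arithmetic under that assumption. The names `dailyQuota`, `releasedCount`, and `urlsForDay` are hypothetical. The spec does not say what happens after week 3, so the sketch simply continues at the week-3 rate.

```javascript
// URLs released per day for weeks 1, 2, and 3 (rates from the spec table).
const DRIP_SCHEDULE = [50, 100, 500];

// Quota for a 0-based day index since launch.
// Assumption: after week 3, the week-3 rate continues.
function dailyQuota(dayIndex) {
  const week = Math.min(Math.floor(dayIndex / 7), DRIP_SCHEDULE.length - 1);
  return DRIP_SCHEDULE[week];
}

// Cumulative number of URLs released by the end of the given day (inclusive).
function releasedCount(dayIndex) {
  let total = 0;
  for (let d = 0; d <= dayIndex; d++) total += dailyQuota(d);
  return total;
}

// Keep only the URLs that should appear in the sitemaps on a given day.
function urlsForDay(allUrls, dayIndex) {
  return allUrls.slice(0, releasedCount(dayIndex));
}
```

Ordering `allUrls` so that hubs and higher-priority cities come first would decide which pages are exposed to crawlers earliest.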

Failure modes (named)

  1. Index bloat / crawl-budget collapse. Submitting 10,000 URLs to a new domain in one sitemap triggers spam classifiers and partial indexation. The fix: a sitemap index with 50 per-state sitemaps (~200 URLs each), a hub-and-spoke linking structure so pages aren’t orphaned, and a 3-week indexation drip (50/day → 100/day → 500/day) monitored against GSC’s “discovered, currently not indexed” report.
  2. API rate-limit cascades. 10,000 concurrent calls to Google Places + OpenAI crash the run. A p-limit queue caps concurrency at 5, batches at 50, with 1.2s inter-batch delay and exponential backoff on 429 responses.
  3. LLM output quality drift. Across 10,000 generations, the LLM occasionally produces short (<30 words), generic (“this bustling city”), or hallucinated outputs. A post-generation QA layer rejects ~3% of intros — short, blocklist-term (“vibrant,” “bustling,” “nestled,” “gem”), or missing required data points — and re-queues them with a stricter prompt.
  4. Duplicate-content penalties. Programmatic pages that share >70% of content get partially deindexed. The 30% uniqueness rule: at least 30% of rendered content must be unique programmatic data per page (intro + top-3 landmarks + temperature + hotel-specific data) — not template tokens.
  5. Observability gaps. Without trace logging, debugging a 10k-row run is impossible. The fix: structured logs per row, Slack alerts when hard-fail counts exceed a threshold, and a manual review queue for QA rejections.
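
The 30% uniqueness rule in item 4 can be turned into a build-time check. The rule does not state how the share is measured, so the sketch below assumes a simple word count: words rendered from per-city data divided by all words on the page. The names `uniqueShare` and `passesUniquenessRule`, and the split of a page into "unique" and "template" blocks, are hypothetical.

```javascript
const MIN_UNIQUE_SHARE = 0.3; // the 30% rule from item 4

// Count whitespace-separated words; an empty string counts as zero.
const countWords = (text) => text.trim().split(/\s+/).filter(Boolean).length;

// uniqueBlocks: strings rendered from per-city data (intro, landmarks, temperatures, hotel data).
// templateBlocks: strings shared across pages (boilerplate copy, navigation, footer).
function uniqueShare(uniqueBlocks, templateBlocks) {
  const unique = uniqueBlocks.reduce((sum, block) => sum + countWords(block), 0);
  const shared = templateBlocks.reduce((sum, block) => sum + countWords(block), 0);
  const total = unique + shared;
  return total === 0 ? 0 : unique / total;
}

// True when the page meets the minimum share of unique programmatic content.
function passesUniquenessRule(uniqueBlocks, templateBlocks) {
  return uniqueShare(uniqueBlocks, templateBlocks) >= MIN_UNIQUE_SHARE;
}
```

Running this per page before deployment would flag cities whose data is too sparse to clear the threshold, so they can be held back or enriched further instead of shipping as near-duplicates.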

Where this fits in the broader system

This is a go-to-market agent on the inbound surface. The architectural patterns — validated LLM outputs, batch rate-limiting, post-generation QA gate — are the same patterns I use for go-to-market agents on the outbound and email surfaces. The class distinction matters because internal agents on the inbound surface (e.g., inbound-lead enrichment) have implicit human supervision; this one runs publicly, so the supervision is explicit.

The GTM inbound encyclopedia covers the inbound architectural pattern at a level above any specific build. The AEO encyclopedia is the discipline this build’s pages would adopt at the schema and citation-structure level if redeployed in 2026.

Stack

Airtable (master dataset) · Google Places API + Weather API (enrichment) · Node.js + p-limit (batch runner) · OpenAI GPT-4-turbo (LLM extraction step) · Whalesync (Airtable ↔ CMS sync) · Webflow CMS / Next.js (delivery layer) · Google Search Console (indexation monitoring).

Want to talk about something like this?

If you’re scaling content as a data pipeline — programmatic pages, citation-structured AEO content, or hybrid inbound surfaces — hit me up.

Author

Fenil Parekh is a GTM engineer based in the San Francisco Bay Area. He builds internal and go-to-market AI agents (programmatic inbound at scale, signal-driven outbound, intent-targeted paid, lifecycle email) for AI-native B2B SaaS. M.S. Computer Science, ITU San Jose. Currently Lead GTM Engineer (consulting) at Marketing Boutique. Built and broken in the open.

JSON-LD

TechArticle + HowTo + BreadcrumbList

{
  "@context": "https://schema.org",
  "@type": "TechArticle",
  "@id": "https://fenil.ai/library/pseo-scaling-10k-pages#article",
  "url": "https://fenil.ai/library/pseo-scaling-10k-pages",
  "headline": "Programmatic SEO at 10k Pages — Case Study",
  "description": "Scaling a programmatic SEO build to 10,000+ pages without tripping crawl budget, indexability, or quality thresholds. Architecture and bottlenecks.",
  "image": "https://fenil.ai/assets/og/pseo-scaling-10k-pages.png",
  "author": { "@id": "https://fenil.ai/#person" },
  "publisher": { "@id": "https://fenil.ai/#person" },
  "datePublished": "2026-05-10",
  "dateModified": "2026-05-10",
  "articleSection": "Case study",
  "keywords": "programmatic SEO at scale, pSEO crawl budget, pSEO indexation, hub and spoke architecture, AI content generation",
  "inLanguage": "en"
}