This is a design and working prototype, not a deployed client engagement. Every architectural choice, function node, and validator below is real. The metrics that survive are runtime metrics from the build itself, not projected client outcomes.

What this system is

A signal-driven outbound pipeline that watches public job-requisition feeds for HRTech ICPs, extracts buying-signal context from each posting with an LLM, validates the extraction against a hallucination guard, and either routes high-value enterprise accounts to an SDR queue with a context one-pager or enrolls mid-market accounts into a sequence with personalized merge fields.

It’s an internal agent: the user is a sales-development rep or account executive on the HRTech GTM team. The agent does the research; the human does the conversation. Supervision is implicit — the SDR reviews the brief before reaching out; the AE evaluates whether the surfaced account warrants a personal touch.

What motivated the build

A Series B HRTech platform was burning paid budget targeting “HR Managers” broadly — including companies on hiring freezes that had no near-term need for an applicant-tracking product. The buying signal sat 24-48 hours upstream: the moment a company posts a hard-to-fill job requisition is the moment they are experiencing the recruiting pain the product solves.

The design goal: build a system that detects new job requisitions in near-real-time, extracts the implied operational pain from the description, validates the extraction, and routes the qualified records into the right outbound motion.

Technical architecture

Figure: HRTech outbound architecture, a four-phase pipeline.

Phase 1 · Autonomous Scraping: a cron job runs every 6 hours, triggering an Apify actor (LinkedIn Jobs scraper). Results are filtered to posts under 24 hours old that contain target keywords.
Phase 2 · GPT-4 Need Extraction: n8n is the workflow orchestrator; an HTTP node calls the OpenAI API (labeled as a GPT-4o-mini call in the diagram) to extract a core challenge and three hard skills from each job description.
Phase 3 · Routing: a decision gate checks whether company size exceeds 500. Yes routes to Enterprise; No routes to Mid-Market.
Phase 4 · Tiered Activation: enterprise leads create a Salesforce task for the SDR queue; mid-market leads are added to an auto-personalized Smartlead sequence.
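The Phase 3 gate is a single comparison, and it can be sketched as a small function. This is a minimal illustration, not the build's actual node: the `company_size` field name, the returned `target` labels, and the fallback to manual review when the size is missing are all assumptions.

```javascript
// Hypothetical routing helper for the "Company Size > 500?" gate.
// Field names, target labels, and the unknown-size fallback are assumptions.
const ENTERPRISE_MIN_EMPLOYEES = 500;

function routeBySize(record) {
  const size = record.company_size;

  // Without a usable headcount, neither branch is safe to pick automatically.
  if (typeof size !== "number" || !Number.isFinite(size)) {
    return { route: "manual_review", reason: "UNKNOWN_COMPANY_SIZE" };
  }

  // The gate is strict (> 500), so a company of exactly 500 stays mid-market.
  return size > ENTERPRISE_MIN_EMPLOYEES
    ? { route: "enterprise", target: "salesforce_sdr_task" }
    : { route: "mid_market", target: "smartlead_sequence" };
}
```

Sending unknown sizes to a review path rather than defaulting to one tier is a design choice in this sketch; the original build may handle that case differently.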

Hallucination guard (LLM validation)

Figure: Hallucination guard, a two-gate LLM validation.

The raw job description (input payload) enters GPT-4o for analysis, which returns structured JSON. Gate one checks the count of extracted skills: fewer than two skills flags the record as LOW_SIGNAL. Two or more skills passes to gate two, which checks whether an implied challenge exists. If yes, the record is APPROVED and routed to Smartlead sequence enrollment. If no, the record joins the LOW_SIGNAL path. LOW_SIGNAL records bypass Smartlead (no outbound activation) and are archived to a CRM manual review queue (human-in-the-loop).

Architecture-spec table (real numbers from the build)

| Spec | Value | Notes |
| --- | --- | --- |
| Scrape cadence | every 6 hours | Apify-rotated residential proxies |
| Records per scrape pass | 200-400 jobs | filtered by target role + posted < 24h |
| Average LLM extraction runtime | ~1.2s per job | OpenAI GPT-4o, batched |
| LLM token cost per record | ~$0.005 | input + output tokens |
| Hallucination-guard rejection rate | ~7% | classified LOW_SIGNAL, routed to manual review |
| Deduplication window | 30 days | MD5(domain + normalized_title) hash check |
| Enterprise vs mid-market split | ~25% / 75% | by company size > 500 |
| Pipeline uptime alert threshold | < 5 jobs per run | Slack alert fires; manual resume |
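The deduplication row can be made concrete with a short sketch. It assumes Node's built-in `crypto` module is available to the workflow (in n8n this may require allowing built-in modules), and the `normalizeTitle` rules and the in-memory `Map` store are illustrative stand-ins, not the build's actual normalization or storage.

```javascript
const crypto = require("crypto");

const DEDUP_WINDOW_MS = 30 * 24 * 60 * 60 * 1000; // 30-day window from the spec table

// Assumed normalization: lowercase, collapse punctuation and whitespace runs.
function normalizeTitle(title) {
  return title.toLowerCase().replace(/[^a-z0-9]+/g, " ").trim();
}

// MD5(domain + normalized_title), as named in the spec table.
function dedupKey(domain, title) {
  return crypto
    .createHash("md5")
    .update(domain.toLowerCase() + normalizeTitle(title))
    .digest("hex");
}

// `seen` is a Map of key -> first-seen timestamp (ms); a real build would persist this.
function isDuplicate(seen, domain, title, now = Date.now()) {
  const key = dedupKey(domain, title);
  const firstSeen = seen.get(key);
  if (firstSeen !== undefined && now - firstSeen < DEDUP_WINDOW_MS) {
    return true; // re-encounter inside the window: drop silently
  }
  seen.set(key, now); // new posting, or the window has expired
  return false;
}
```

In this sketch a re-encounter does not refresh the timestamp, so a listing reposted weekly becomes eligible again 30 days after it was first seen. Whether the window should slide instead is a design decision the spec table does not settle.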

Apify input configuration

{
  "queries": [
    "Software Engineer location:Remote",
    "Account Executive B2B SaaS"
  ],
  "max_posts_per_query": 200,
  "published_at": "past-24h",
  "proxy_configuration": { "useApifyProxy": true },
  "filters": {
    "company_size": "50-1000",
    "job_type": "full_time"
  }
}
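The architecture also shows a Phase 1 filter step (posted < 24h plus target keywords) after the scrape. A minimal sketch of such a check inside the workflow follows; the `posted_at`, `title`, and `description` field names are assumptions about the scraper output, not confirmed Apify fields.

```javascript
const MAX_AGE_MS = 24 * 60 * 60 * 1000; // "posted < 24h"

// Hypothetical post-scrape filter: keeps a job only if it is under 24 hours old
// and mentions at least one target keyword in its title or description.
function isFreshAndRelevant(job, keywords, now = Date.now()) {
  const postedAt = Date.parse(job.posted_at);
  if (Number.isNaN(postedAt)) return false; // unparseable date: treat as stale

  const ageMs = now - postedAt;
  if (ageMs < 0 || ageMs >= MAX_AGE_MS) return false;

  const haystack = `${job.title ?? ""} ${job.description ?? ""}`.toLowerCase();
  return keywords.some((k) => haystack.includes(k.toLowerCase()));
}
```

Re-checking freshness in the workflow, rather than trusting the scraper's own date filter alone, guards against a run that returns older postings than requested.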

LLM extraction system prompt

const gpt_prompt = `
You are a technical recruiter analyzer. Read the following job description.
Extract exactly three highly specific technical or domain skills required.
Identify the primary implied challenge of the role (e.g. 'scaling infrastructure' or 'managing a mid-market cycle').

OUTPUT FORMAT MUST BE VALID JSON:
{
  "extracted_skills": "Python, AWS, and Distributed Systems",
  "implied_challenge": "migrating from monolith to microservices",
  "ps_line_in_email": "P.S. Finding remote engineers strong in Python, AWS AND Distributed Systems is notoriously brutal—assuming you need them to lead the microservices migration?"
}
`;

LLM output validator (n8n Function node)

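function validateGptOutput(gptResponse) {
  let parsed;
  try {
    parsed = JSON.parse(gptResponse);
  } catch (e) {
    return { status: "REJECTED", reason: "INVALID_JSON", route: "manual_review" };
  }

  // Valid JSON can still have the wrong shape (null, an array, a missing field).
  // Without this check, the .split() call below would throw.
  if (!parsed || typeof parsed.extracted_skills !== "string") {
    return { status: "REJECTED", reason: "MISSING_SKILLS_FIELD", route: "manual_review" };
  }

  // The prompt's example format is "A, B, and C", so strip a leading "and " after splitting.
  const skills = parsed.extracted_skills
    .split(",")
    .map((s) => s.trim().replace(/^and\s+/i, ""))
    .filter(Boolean);
  const genericTerms = ["communication", "teamwork", "leadership", "detail-oriented"];
  const realSkills = skills.filter((s) => !genericTerms.includes(s.toLowerCase()));

  // Gate one: at least two non-generic skills.
  if (realSkills.length < 2) {
    return { status: "LOW_SIGNAL", reason: "INSUFFICIENT_SKILLS", route: "archive" };
  }

  // Gate two: a specific implied challenge (non-empty, not too short, not hedged with "various").
  const challenge =
    typeof parsed.implied_challenge === "string" ? parsed.implied_challenge.trim() : "";
  if (challenge.length < 10 || challenge.toLowerCase().includes("various")) {
    return { status: "LOW_SIGNAL", reason: "VAGUE_CHALLENGE", route: "archive" };
  }

  return { status: "APPROVED", data: parsed, route: "smartlead_sequence" };
}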

Activation push (Smartlead API)

const smartleadPayload = {
  method: "POST",
  url: "https://server.smartlead.ai/api/v1/leads",
  headers: { "Api-Key": "{{$credentials.smartlead_api_key}}" },
  body: {
    email: "{{$json.enriched_email}}",
    first_name: "{{$json.first_name}}",
    last_name: "{{$json.last_name}}",
    company_name: "{{$json.company_name}}",
    custom_fields: {
      ps_line: "{{$json.gpt_output.ps_line_in_email}}",
      extracted_skills: "{{$json.gpt_output.extracted_skills}}",
      job_title_scraped: "{{$json.job_title}}",
      scrape_timestamp: "{{$json.scraped_at}}"
    },
    campaign_id: "{{$json.route === 'enterprise' ? ENTERPRISE_CAMPAIGN_ID : MIDMARKET_CAMPAIGN_ID}}"
  }
};

Sample outbound payload

Subject: Your new Software Engineer req at {{companyName}}

Hi {{firstName}},

Noticed the engineering req that went live on LinkedIn a few hours ago.

Our ATS overlay handles technical vetting for complex roles automatically before they hit your desk, reducing engineering interview hours by 40%. Given the urgency, are you open to seeing how we could plug into your existing Greenhouse setup?

{{customInfo_ps_line}}

Failure modes (named)

  1. LinkedIn rate-limiting. Scraping LinkedIn directly from a server IP gets the IP banned within hours. The system uses Apify’s rotating residential proxies; if the proxy pool degrades, the scrape pass returns fewer results and the observability alert fires.
  2. Hallucination on vague job descriptions. Two-sentence postings cause the LLM to invent skills. The hallucination guard rejects ~7% of extractions as LOW_SIGNAL and routes them to manual review. Without the guard, invented skills would flow straight into the personalized merge fields and corrupt the sequence copy.
  3. Job-reposting duplication. Companies repost identical listings weekly to keep them fresh. Without dedup, the same VP HR gets the same email repeatedly. The system hashes MD5(domain + normalized_title) and silently drops re-encounters within 30 days.
  4. Stale signal windows. A job req posted >30 days ago is no longer a fresh buying signal. The scrape filter excludes anything older than 24 hours at ingest.
  5. Vendor outage on Apify or OpenAI. Both vendors have had multi-hour outages. The system has retry budgets per node, exponential backoff on rate-limit errors, and a Slack alert when a run returns fewer than 5 records — which is the canary that something upstream is wrong.
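The retry budget, exponential backoff, and low-volume canary from item 5 can be sketched as below. This is an illustration under assumptions: the default retry count and base delay are made up, the helper retries on any error rather than only rate-limit errors, and the actual Slack call is left out.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical retry wrapper: waits baseMs, 2*baseMs, 4*baseMs, ... between attempts
// and rethrows the last error once the retry budget is spent.
async function withBackoff(fn, { retries = 4, baseMs = 500, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err; // budget exhausted
      await wait(baseMs * 2 ** attempt);
    }
  }
}

// Canary from the spec table: a run returning fewer than 5 records is suspicious.
const MIN_RECORDS_PER_RUN = 5;

function shouldAlert(recordCount) {
  return recordCount < MIN_RECORDS_PER_RUN;
}
```

A caller would wrap each vendor request, for example `await withBackoff(() => callVendor())` with `callVendor` standing in for the Apify or OpenAI request, and post to Slack whenever `shouldAlert(records.length)` is true.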

Where this fits in the broader system

This is an internal agent on the outbound surface. The architectural patterns — schema-validated outputs, hallucination guards, dedup logic, observability alerts — are the same patterns I use for go-to-market outbound agents. The class distinction matters because go-to-market agents on the same surface need additional supervision (HITL gates, brand-voice validators) before any output ships to the inbox.

The /ai/agents encyclopedia goes deep on the two classes and how each one shows up across the four surfaces.

Stack

Apify (scraping with residential proxy rotation) · n8n (workflow orchestration + Function-node validators) · OpenAI GPT-4o (LLM extraction step) · Smartlead (deliverability-managed sender) · Salesforce (CRM routing for enterprise tier).

Want to talk about something like this?

If you’re building something in the same shape — signal-driven outbound, internal research agents, validator-gated LLM pipelines — hit me up.