maksugr

AI Writes Daily Without My Involvement

A naive dive into a multi-agent system with self-critique, diversity-aware curation, and weekly self-reflection to cover local cultural events


Motivation

For as long as I can remember, I’ve been orbiting media and wanting to build something in that space. I tried it many times, and even when the tech and the product turned out decent, everything would fizzle out at the part where you actually have to, you know, produce content and run the thing. But now each of us can spin up our own little newsroom of writers, editors, fact-checkers and whatnot. So why not give it a shot, right? That’s how SYNTSCH was born. The GitHub repository is here.

Two more paragraphs of motivation

I wanted to keep up with Berlin’s cultural life and read about it in a tone of voice that actually feels comfortable to me (I prefer Dazed). But I absolutely didn’t want to just rewrite news and launch yet another mindless AI slop machine. So the essay format emerged: once a day, not a news piece, but an essay on whatever event the editorial team found most interesting.

In a way, yes, it’s still slop, but slop not without its charm (at least for me).

I’m a developer, and obviously I was curious to map problems that have interested me for a while onto the AI plane. That’s how this project came about. I’m not sure it’ll last long; I’m not even sure it’s useful as a product, even to me (at the very least it’s definitely not an AI-native company). But it was definitely a rewarding (and thrilling!) journey, and I hope the code or this article turns out to be useful to you too.

Why Not a Prompt

Because the task is not just producing text. The task is to reproduce and automate the work of an entire editorial board:

  1. Find dozens of upcoming events across multiple sources
  2. Filter them down to culturally significant candidates
  3. Choose the single best one for today
  4. Research the artist, venue, and cultural context
  5. Write an essay with a specific editorial voice
  6. Critique the draft for factual, voice, and structural issues
  7. Revise based on the critique
  8. Come up with a headline and lede
  9. Publish and notify subscribers

Each step needs different capabilities, different models, and different prompts. If you put all of this into one prompt, every step suffers. The analogy is a real magazine editorial team:

| Role | Real Magazine | SYNTSCH Agent |
| --- | --- | --- |
| Stringer | Scans listings, tips from the scene | Scout — 9 parallel web searches |
| Editor | Picks today’s story from the pile | Curator — diversity-aware selection |
| Writer | Writes the feature | Author — draft with research context |
| Copy editor | Fact-checks, fixes voice, tightens structure | Critic — structured critique with severity levels |
| Editor-in-chief | Weekly planning meeting, reviews coverage | Reflector — self-analysis of own output |

Architecture

flowchart TD
    subgraph "Daily 17:00 UTC"
        SCOUT["SCOUT (Sonnet)\nsearch 9 Tavily queries, extract events"]
        SCOUT --> EVENTS[("data/events/*.json")]
    end

    subgraph "Daily 18:00 UTC"
        EVENTS --> CURATOR["CURATOR (Sonnet)\npick 1 event from pool"]
        CURATOR --> RESEARCH["RESEARCH (Tavily)\nfetch 4 queries, gather context"]
        RESEARCH --> AUTHOR["AUTHOR (Opus)\nwrite 800–1200 word essay"]
        AUTHOR --> CRITIC["CRITIC (Opus)\nfind issues, revise draft, craft title"]
        CRITIC --> ARTICLES[("data/articles/*.json")]
    end

    subgraph "Sunday 20:00 UTC"
        ARTICLES --> REFLECTOR["REFLECTOR (Opus)\nanalyze coverage for bias and gaps"]
        REFLECTOR --> REFLECTIONS[("data/reflections/*.json")]
    end

    ARTICLES --> PUBLISH["PUBLISH\ncommit, deploy, notify via email + telegram"]
    REFLECTIONS --> PUBLISH

The entire system is orchestrated by GitHub Actions on a cron schedule. No human intervention required. I was only needed when a GitHub Action crashed because it ran out of Anthropic credits. I’m becoming just a wallet!

A few words about the tech stack:

Agents

Agent 1: Scout — Parallel Search with Adaptive Strategy

The Scout’s job is to find upcoming cultural events in Berlin. This sounds simple until you realize that no single search query covers the full cultural spectrum: music, cinema, theater (this one gave me trouble, more on that later), exhibitions, lectures, festivals, performance art, club culture. The solution: 9 parallel Tavily queries with three different strategies.

3 core queries go wide:

core_queries = [
    f"{city} cultural events {date_range} concerts exhibitions theater",
    f"{city} what to do this week art music cinema",
    f"{city} upcoming events {month_year} gallery performance lecture",
]

4 adaptive queries are sampled from a pool of 19 templates using weighted random selection. The weights are inversely proportional to how many events of each category are already in the pool:

cat_counts = Counter(pool_categories or [])
weights = []

for eq in extra_queries:
    w = sum(1.0 / (cat_counts.get(c, 0) + 1) for c in eq["cats"])
    weights.append(w)

sampled = _weighted_sample(extra_queries, weights, 4)

If we already have 10 music events but zero cinema events, cinema-tagged queries get a much higher chance of being selected. This prevents the pool from becoming lopsided.

2 targeted queries go after the two most underrepresented categories with domain-specific sources:

def _build_targeted_queries(self, city, date_range, cat_counts):
    all_cats = list(config.CATEGORY_SOURCES.keys())
    underrepresented = sorted(all_cats, key=lambda c: cat_counts.get(c, 0))[:2]
    targeted = []

    for cat in underrepresented:
        domains = config.CATEGORY_SOURCES.get(cat, []) + config.GENERAL_SOURCES
        query = f"{city} {cat} events {date_range}"
        targeted.append((query, domains))

    return targeted

Each category has authoritative sources configured: Resident Advisor for music/club, nachtkritik.de for theater, Artforum for exhibitions, and so on. These targeted queries use Tavily’s include_domains parameter to search only on trusted sites.

In practice, the source list is far from ideal. It should be larger and more carefully curated: I mostly grabbed the first ones I could find, since I was more focused on the technical logic than the business logic. This is likely why zero essays were written in the theater category (I ended up hiding it on the site). Too few events were scouted, and the ones that were never got picked. It doesn’t help that theater and performance art look almost the same to the model. I tried to fix that but it wasn’t a top priority.

All 9 queries execute in parallel via asyncio.gather(), with retry logic (3 attempts, exponential backoff) on each. The raw results are then deduplicated and sent to a Claude Sonnet call that filters them down to the 5 best candidates.
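The retry wrapper isn’t shown, but the behavior described — 3 attempts with exponential backoff per query, all queries fanned out through `asyncio.gather()` — might be sketched like this (function names here are assumptions, not the project’s API):

```python
import asyncio

async def with_retry(coro_factory, attempts=3, base_delay=1.0):
    """Retry an async call with exponential backoff: 1s, 2s, 4s between tries."""
    for attempt in range(attempts):
        try:
            return await coro_factory()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)

async def run_all(queries, search):
    # Launch every query concurrently; each one retries independently,
    # so one flaky query doesn't stall or kill the others.
    return await asyncio.gather(
        *(with_retry(lambda q=q: search(q)) for q in queries)
    )
```

The `lambda q=q` binding matters: each retry needs a fresh coroutine, so we pass a factory rather than an already-awaited coroutine object.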

A key architectural decision: the Scout uses structured tool output via Anthropic’s tool_choice parameter:

response = client.messages.create(
    model=config.SCOUT_MODEL,
    max_tokens=4096,
    tools=[SCOUT_TOOL],
    tool_choice={"type": "tool", "name": "submit_events"},
    messages=[{"role": "user", "content": prompt}],
)

tool_input = extract_tool_input(response, "submit_events")
events = [EventCandidate(**e) for e in tool_input["events"]]

This way the model returns structured JSON every time, so no parsing, no regex. Every agent in the system uses this pattern. Works great.
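`extract_tool_input` isn’t shown above; since Anthropic responses carry forced tool calls as `tool_use` content blocks, a plausible sketch of the helper is:

```python
def extract_tool_input(response, tool_name):
    """Pull the structured input of a forced tool call out of an Anthropic response.

    With tool_choice={"type": "tool", ...} the model must emit a tool_use
    block, so a missing block means something went genuinely wrong.
    """
    for block in response.content:
        if getattr(block, "type", None) == "tool_use" and block.name == tool_name:
            return block.input  # already parsed JSON, no regex needed
    raise ValueError(f"No tool_use block named {tool_name!r} in response")
```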

Agent 2: Curator — Diversity-Aware Selection

The Curator picks one event from the available pool for today’s article. The naive approach would be: “pick the most interesting event.” But without constraints, the AI gravitates toward the same categories. Music and exhibitions dominated early output. Theater and lectures barely showed up.

The fix is a 7-day rolling window of recent categories:

recent_categories = storage.get_recent_categories(days=7)

This list is injected directly into the curator’s prompt:

The recent categories we've already covered: [music, exhibition, music, club].
Try to pick something different if possible, but never sacrifice quality for diversity.

The last sentence matters: diversity is a tiebreaker, not the primary criterion. A genuinely important music event should still win over a mediocre lecture, even if we covered music yesterday.

The Curator’s selection criteria are explicitly ranked:

  1. Cultural significance (depth potential, non-obviousness)
  2. Timeliness (sooner events preferred, all else equal)
  3. Category diversity (avoid clustering)
  4. Essay potential (enough material for 800-1200 words)

Agent 3: Author — Essay Writing with Research

The Author handles research and writing. Before writing, the system gathers context through 4 parallel Tavily searches:

queries = {
    "artist": f"{artist_name} artist biography background career",
    "venue": f"{event.venue} {event.city} venue history significance",
    "cultural": f"{artist_name} {event.category} cultural context scene significance",
    "related": f"{artist_name} recent work reviews press {event.city}",
}

responses = await asyncio.gather(
    *(_search_one(client, f, queries[f]) for f in fields)
)

This gives the Author model raw material: artist biography, venue history, cultural context, recent press. The system prompt explicitly instructs: “Use the research as raw material. Don’t dump it into the essay. Weave it in naturally.”

Then the Author writes an 800-1200 word essay using Claude Opus with a detailed system prompt that defines the publication’s voice:

Agent 4: Critic — Self-Critique, Revision, and Quality Gate

Here’s where it gets interesting. The draft goes to a separate Critic agent. It’s the same Opus model, but with a completely different system prompt:

You are the editor of SYNTSCH. You’ve just received an essay draft from the writing model. Your job is to give sharp, specific, actionable feedback and then produce a revised version that fixes every issue you identified.

You are not nice. You are fair. You care about the writing being genuinely good.

The Critic checks along 5 axes, each with a severity level:

CRITIC_TOOL = {
    "name": "submit_critique",
    "input_schema": {
        "properties": {
            "overall_assessment": {"type": "string"},
            "issues": {
                "type": "array",
                "items": {
                    "properties": {
                        "type": {
                            "enum": ["factual", "voice", "structure", "language", "depth"]
                        },
                        "severity": {
                            "enum": ["minor", "major", "critical"]
                        },
                        "location": {"type": "string"},
                        "fix": {"type": "string"},
                    }
                }
            },
            "title": {"type": "string"},
            "revised_text": {"type": "string"},
        }
    }
}

The Critic doesn’t just point out problems. It returns a complete revised text with all fixes applied, plus a headline. This is a shortcut. In real media, the Critic (editor) sends notes back to the Author, the Author revises, the Editor reviews again: the draft goes back and forth until it’s good enough. Here, the Critic just rewrites it in one pass. The loop should exist, but I skipped it for simplicity.
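For reference, the skipped back-and-forth could be sketched as a hypothetical `edit_loop` where the Critic only sends notes and the Author does every rewrite, bounded by a round limit so it always terminates:

```python
def edit_loop(draft, author_revise, critic_review, max_rounds=3):
    """Iterate draft → critique → revision until no blocking issues remain.

    The Critic returns issues; only the Author touches the text, which is
    what keeps the voice from drifting toward the Critic's prompt.
    """
    issues = []
    for _ in range(max_rounds):
        issues = critic_review(draft)
        blocking = [i for i in issues if i["severity"] in ("major", "critical")]
        if not blocking:
            break
        draft = author_revise(draft, blocking)
    return draft, issues
```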

Here’s a concrete example from a Russian-language article about film critic Andrey Plakhov and Berlinale 2026.

Draft. Essay with detailed narrative about Plakhov’s career. Critique found 10 issues, including:

Revision. All hallucinations removed. Confidence markers added where sourcing was thin. Structure tightened.

This is the core value of the self-critique loop: the system catches its own mistakes before publishing. Not always, not perfectly, but consistently enough that the output quality is meaningfully higher than single-pass generation.

This isn’t just my observation, there’s research behind it. The Self-Refine paper formalized exactly this pattern: generate → critique → revise → repeat. They tested it on 7 tasks across GPT-3.5 and GPT-4 and got ~20% average improvement. The biggest gains were in dialogue (+49% for GPT-4) and sentiment reversal (+32%). In blind A/B tests, humans consistently preferred the refined versions.

At the same time, the publication uses an inline annotation system for source transparency:

[~annotated phrase|tooltip explanation~]

For example:

2-5 markers per essay. They’re rendered as interactive tooltips on the website. This is the publication’s way of saying: “here’s how confident we are about what we’re telling you.” One of my favorite features of the project. The Reflector is cool but only nerds care. This one actually adds value for the end reader.
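The actual renderer lives in the Next.js frontend, but the marker format is simple enough that a Python sketch shows the idea — the emitted HTML markup here is an assumption, not the site’s actual output:

```python
import re

# Matches [~annotated phrase|tooltip explanation~], non-greedy on both parts.
ANNOTATION = re.compile(r"\[~(.+?)\|(.+?)~\]")

def render_annotations(text):
    """Turn inline confidence markers into tooltip spans."""
    return ANNOTATION.sub(
        r'<span class="annotated" data-tooltip="\2">\1</span>', text
    )
```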

If the revised essay is under 400 words, the system asks the model to expand it, feeding the draft back as context:

if word_count < 400:
    final_response = client.messages.create(
        model=config.AUTHOR_MODEL,
        messages=[
            {"role": "user", "content": user_message},
            {"role": "assistant", "content": draft},
            {"role": "user", "content": "The essay is too short. Please expand it to 800-1200 words..."},
        ],
    )

If the Critic fails entirely (API error, malformed response), the system falls back to the original draft with an auto-generated title.

Agent 5: Reflector — The AI That Analyzes Its Own Writing

This is the most unusual part of the system. The Reflector agent examines the week’s output as a dataset. It computes concrete statistics:

Every article produces a .trace.json alongside the main JSON. The trace captures the full pipeline state: the raw draft, word count, the complete list of critique issues (type, severity, location, fix), the revised text, whether the essay was expanded, and how many research sources were used.

data/articles/
├── some-article.json         # published article
└── some-article.trace.json   # full pipeline trace

The Reflector reads these traces every Sunday and computes how the pipeline actually performed: how many issues the Critic found on average, what percentage of essays needed expansion, how much the word count grew between draft and final version, how many research sources were used. The system watches its own revision process as data.
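A sketch of that aggregation over the trace files might look like this — the field names (`issues`, `expanded`, `draft_words`, `final_words`) are assumptions, not the project’s exact trace schema:

```python
import json
from pathlib import Path
from statistics import mean

def trace_stats(articles_dir):
    """Aggregate weekly pipeline stats from *.trace.json files."""
    traces = [
        json.loads(p.read_text())
        for p in Path(articles_dir).glob("*.trace.json")
    ]
    if not traces:
        return {}
    return {
        # How many issues the Critic found per essay, on average.
        "avg_issues": mean(len(t.get("issues", [])) for t in traces),
        # What share of essays fell under 400 words and needed expansion.
        "expanded_pct": 100 * sum(t.get("expanded", False) for t in traces) / len(traces),
        # How much the word count grew between draft and final version.
        "avg_growth": mean(
            t.get("final_words", 0) - t.get("draft_words", 0) for t in traces
        ),
    }
```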

Then it writes an essay about what it finds. Not a dry report, but an editorial self-analysis.

analysis = _compute_analysis(articles, storage, start_date, end_date, previous)

# Categories breakdown, venue concentration, word count stats,
# process stats from traces, comparison with previous week...

system_prompt = _load_reflector_prompt(language)
user_message = _build_user_message(articles, analysis, start_date, end_date, language, previous)

The Reflector also receives the previous week’s reflection, so it can track trends over time: “Last time I noted an overrepresentation of music events. This week that shifted, but now exhibitions dominate instead.”

This is the system watching itself. The AI examines its own biases, identifies blind spots, and names them. It can’t automatically fix them (the Reflector doesn’t change the Curator’s behavior), but making patterns visible is the first step.

Models and Cost

Opus for quality, Sonnet for speed. Not every task needs the most powerful model:

| Task | Model | Why |
| --- | --- | --- |
| Scouting (filter 5 events from raw data) | Sonnet | Speed + cost. Doesn’t need deep reasoning. |
| Curation (pick 1 from 5) | Sonnet | Selection judgment at this level doesn’t need Opus. |
| Draft writing | Opus | Voice, depth, cultural connections — this is where quality matters. |
| Critique & revision | Opus | Catching hallucinations and voice drift requires the strongest model. |
| Lede & title generation | Opus | Short but high-impact text. Worth the cost. |
| Weekly reflection | Opus | Self-analysis requires nuance. |

Sonnet runs at a fraction of Opus cost. For tasks that are more about classification and selection than generation, it performs well enough.

Real numbers: over two months (Feb–Mar 2026), the whole pipeline cost $58 in Claude API credits — about $1.36/day. February was pricier ($1.67/day) because of early experimentation; March settled at $1/day once things stabilized.

xychart-beta
    title "Daily API cost, March 2026"
    x-axis ["1","2","3","4","5","10","11","12","13","14","15","16","17","18","19","20","21","22","23"]
    y-axis "USD" 0 --> 2
    bar [1.17,0.88,1.72,0.98,0.91,1.24,0.90,1.09,0.92,0.99,1.18,0.98,0.93,0.95,0.93,1.00,0.92,1.19,0.91]

Opus eats 78% of the budget, Sonnet 22%.

pie title API cost breakdown (Feb–Mar 2026)
    "Opus (78%)" : 45.33
    "Sonnet (22%)" : 13.15

Everything else — Tavily, GitHub Actions, Vercel, Resend — runs on free tiers. A fully autonomous publication producing 3 articles a day in 3 languages for the price of a third of a coffee (in Munich!), ha!

Is that a lot? For a hobby project — yes, you feel it. For what you’d pay a single freelance writer for one article — it’s nothing. The obvious place to save: replace Opus with Sonnet for draft writing and keep Opus only for critique. The Critic needs the strongest model, the Author probably doesn’t. That alone would cut the bill in half. I haven’t tried it yet because the writing quality difference is noticeable (though I’d need proper quality metrics to be sure), but for most use cases it would be good enough.

Infrastructure

Deliberately simple.

No Database. All data lives as JSON files tracked in git:

data/
├── events/ # {uuid}.json — scouted events
├── articles/ # {slug}.json + {slug}.trace.json — published articles
└── reflections/ # {slug}.json — weekly self-analysis

Why? Three reasons:

  1. Version control for free. Every article, every event, every edit is tracked in git history.
  2. Zero infrastructure. No database to provision, back up, or pay for.
  3. Static site generation. The Next.js frontend reads these JSON files at build time. No runtime API needed.

The tradeoff is obvious: this doesn’t scale to millions of records. But for a daily publication producing ~1 article per day in 3 languages, it works perfectly. JSON files are small, git handles them fine, and the entire data layer is a single Python class.
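A minimal sketch of such a data layer — method names here are assumptions, not the project’s actual API:

```python
import json
from pathlib import Path

class Storage:
    """File-backed data layer: one directory per collection, one JSON per record."""

    def __init__(self, root="data"):
        self.root = Path(root)

    def save(self, collection, slug, record):
        path = self.root / collection / f"{slug}.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        # Committing these files is the whole persistence story:
        # git history doubles as the audit log.
        path.write_text(json.dumps(record, ensure_ascii=False, indent=2))
        return path

    def load_all(self, collection):
        return [
            json.loads(p.read_text())
            for p in sorted((self.root / collection).glob("*.json"))
        ]
```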

GitHub Actions as Orchestrator. The CI/CD pipeline is the “editor-in-chief” that keeps everything running:

# scout.yml — Daily 17:00 UTC
run: python cli.py scout --city Berlin --days 14
# → commits new events to data/events/

# author.yml — Daily 18:00 UTC
run: python cli.py author --from-curator
# → writes articles in en/de/ru, commits to data/articles/
# → sends Telegram + email notifications

# reflect.yml — Sunday 20:00 UTC
run: python cli.py reflect --days 7
# → writes weekly reflection, commits to data/reflections/

After committing new articles, the author workflow extracts the slugs of the newly created files from the git diff and passes them to the notification command. It also polls the production URL before sending notifications, to make sure the deploy is actually live:

- name: Send notifications
  if: steps.push.outputs.new_slugs != ''
  run: python cli.py notify --wait ${{ steps.push.outputs.new_slugs }}

Multi-Language Support. The system writes every article in three languages: English, German, and Russian. This isn’t machine translation: each version is written independently by the Author agent with language-specific voice instructions.

LANGUAGE_NOTES = {
    "en": "Write in English. Use British English spelling (colour, centre, programme). The tone should feel like a London-based publication writing about Berlin.",
    "de": "Schreibe auf Deutsch. Nicht steif oder bürokratisch — modernes, lebendiges Deutsch. Eher wie Spex oder Groove auf Steroiden.",
    "ru": "Пиши на русском. Живой, современный русский — не канцелярит. Тон как у лучших текстов Афиши или Сигмы.",
}

The research phase runs once (4 Tavily queries), then the same ResearchContext is reused for all three language versions. Notifications are also language-aware: Telegram sends Russian articles only, email goes to per-language subscriber segments via Resend API.
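A sketch of that reuse — research once, write three times (function and parameter names are assumptions):

```python
import asyncio

async def write_all_languages(event, research, language_notes, write_essay):
    """Reuse one ResearchContext for every language version.

    Each version is written from scratch with its own voice notes,
    not translated from the English draft.
    """
    articles = {}
    for lang, notes in language_notes.items():
        articles[lang] = await write_essay(event, research, lang, notes)
    return articles
```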

Prompt Engineering

The system prompts are the soul of this project. They’re not short instructions, they’re comprehensive editorial guidelines.

The Author prompt defines:

The Critic prompt opens with:

You are not nice. You are fair. You care about the writing being genuinely good.

It has an explicit checklist for something worth saying:

These prompts changed over many iterations. Right now prompt changes are just git commits. There’s no way to compare article quality across prompt versions: no A/B testing, no metrics connected to specific prompt edits. Prompt versioning with quality tracking would be a great next step.

Results and Lessons

After running the system for two months:

What Works Well

What Doesn’t Work Well

What’s Next

There are plenty of ideas; here is just a small selection.

Stricter Role Separation. Roles should be separated more clearly. For example, the Critic currently rewrites the text itself instead of sending notes back to the Author. That’s a shortcut and it causes style drift because the Critic doesn’t have the Author’s full voice prompt. Each agent should do only its job and nothing else.

Save Everything the Scout Finds. Right now the Scout filters events on its own, but it should save everything it finds (and search better) and let another agent (Curator?) decide what to keep. This makes the pipeline more expensive, but also better (the theater category would finally show up) and the full dataset of collected events can be reused for other things.

RAG for Institutional Memory. Instead of treating each article as independent, build a vector store of past articles. The Author could reference its own previous writing: “Three weeks ago, I wrote about this same venue in the context of…”

Voice Drift Detection. Track stylistic metrics over time (sentence length distribution, vocabulary diversity, AI self-reference frequency) to detect gradual voice drift. Alert when the style shifts beyond acceptable bounds.
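A crude version of such a stylistic fingerprint could start as simply as this (a sketch, not the project’s code — real drift detection would track these per article over time):

```python
import re

def style_metrics(text):
    """Two cheap style signals: average sentence length and vocabulary diversity."""
    sentences = [s for s in re.split(r"[.!?]+\s*", text) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    return {
        "avg_sentence_len": len(words) / max(len(sentences), 1),
        # Type-token ratio: drops when the model starts repeating itself.
        "vocab_diversity": len(set(words)) / max(len(words), 1),
    }
```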

A/B Testing for Prompts. Save prompt versions alongside articles, then compare quality metrics across versions. Which Author prompt produces longer read times? Which Critic prompt catches more hallucinations?

Active Feedback Loop. The Reflector currently writes about patterns but doesn’t influence future decisions. The next step: feed the Reflector’s analysis back into the Curator’s prompt. “Last week, the Reflector noted that lectures are underrepresented. Weight this category higher.”

Personalization. Right now every reader gets the same article once a day. But the right product (AI-product?) move would be to let people choose: what categories of culture they care about, what tone of voice they prefer, how often they want to receive content. This is expensive money-wise and significantly more complex, so it’s a future thing (if the product lives that long). Andreas Graefe was already writing about tone-of-voice personalization back in 2016 (!) in his Guide to Automated Journalism.

User Engagement Signals. Connect Vercel Analytics (time on page, scroll depth) as input to the curation process. Events that generate more engagement should inform future selection criteria.

Audio Edition. TTS (Text-to-Speech) from published articles → automatic daily podcast. The content is already there, only TTS generation needs to be added.

Takeaways

Building SYNTSCH (GH) showed me that with agent architecture you can actually build products you’ve been dreaming about. This is a pretty naive implementation (definitely not an AI-native company), but it finally let me build a project I’d wanted to build for years. The main things that are clear even from this project: you need to think in roles (even more so than I did here), the critique step is really important, and managing costs (tokens and model selection) matters a lot. tool_choice with JSON schemas turned out to be incredibly good, works like clockwork.

This pattern — find → select → research → generate → critique → publish → reflect — applies far beyond media. Any pipeline that discovers, filters, processes, validates, and delivers can be built this way. Product recommendations. Research summaries. Report generation. You swap the agents, but the architecture is the same.

Even a seemingly small and in places rough project like this takes a long time to build, even today. To make a real product and not just a throwaway thing, you need to deal with a lot of details: CI/CD, designing a logo, thinking about marketing, adding OG tags, generating correct sitemaps, setting up analytics and Google Search Console, sending emails properly, complying with legal requirements, choosing the tone of voice for your product, and so on and so on and so on. So keep things simple, especially at the start, and focus on what matters most: making your target audience happy.

The code is on GitHub. I’d love to hear what you think.

