2026-02-06-1536.md
Session: 2026-02-06 15:36:49 UTC
- Session Key: agent:main:main
- Session ID: d0bb288a-3ba0-44d2-951b-9981dbeea84a
- Source: telegram
Conversation Summary
assistant: Yep — that’s exactly the intended behavior, with one important nuance:
What the lock means (and what it doesn’t)
- The lock is per-run, not per-issue.
- It answers: “Is another cron worker instance currently running?”
- It does not represent “there’s an issue in progress” (that’s tracked separately in state.json + the Project Status field).
So:
If a run is already active
- New cron tick comes in
- It tries lock.mjs acquire
- If the lock exists and is still “fresh” (within its TTL), the new run bails immediately (doesn’t look at the project at all).
When the active run finishes
- It releases the lock at the end
- Next cron tick will acquire the lock and then:
- query Ready-for-work items
- pick up any new ones that appeared while it was running
How we know an individual issue is “already being handled”
That’s the job of the watcher state + GitHub Project status:
- As soon as the worker starts an issue, it runs:
mark-in-progress.mjs <issueNodeId> --project-item <projectItemId>
- updates state.json (inProgress)
- moves the Project item to In progress
Then get-unprocessed-ready.mjs will naturally skip it on subsequent runs (until TTL expires), which prevents duplicate work even if something crashes mid-issue.
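To make the skip rule concrete, here is a minimal sketch of the filtering idea. It is not the actual get-unprocessed-ready.mjs: the function name, the `nodeId` field, and the object-keyed state shape are assumptions for illustration, and it ignores any TTL on blocked entries.

```javascript
// Sketch only. Assumed (hypothetical) state shape:
// { processedIssueNodeIds: [id], inProgress: { [id]: { startedAt } }, blocked: { [id]: {} } }
function selectUnprocessedReady(readyIssues, state, ttlMinutes, nowMs) {
  const processed = new Set(state.processedIssueNodeIds ?? []);
  const inProgress = state.inProgress ?? {};
  const blocked = state.blocked ?? {};
  const ttlMs = ttlMinutes * 60 * 1000;

  return readyIssues.filter((issue) => {
    if (processed.has(issue.nodeId)) return false; // already handled
    if (issue.nodeId in blocked) return false; // waiting on clarification
    const entry = inProgress[issue.nodeId];
    if (entry && nowMs - Date.parse(entry.startedAt) < ttlMs) {
      return false; // another run claimed it and the claim is still fresh
    }
    return true; // never claimed, or the claim has gone stale
  });
}
```

Passing `nowMs` in explicitly (rather than calling `Date.now()` inside) keeps the rule easy to test against fixed timestamps.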
Telegram “needs clarification” messages
Agreed — without that, the agent would just stall forever. The updated rules are now:
- Message you when requirements are unclear (and mark Blocked).
- Message you only when a PR actually merges during that same run (unchanged).
- Otherwise stay quiet.
Status: changes are already applied
I’ve already:
- added automation/github-project-watcher/lock.mjs
- updated the “Automatic Coding Agent” cron payload to use the lock + mark in progress + mark blocked + mark processed with --project-item (so Project moves to Done on PR creation, per your preference)
Next question: do you want me to also do something about the current stuck state.inProgress entry (I_kwDORJcijc7ooW25)? Options:
- leave it and let TTL handle it
- resolve what it is (map to repo/issue) and decide what to do
- clear it from state.json so it can be picked up again if it’s still “Ready for work” on the Project
user: System: [2026-02-06 14:39:07 UTC] Cron: Run completed. Output shows no unprocessed "Ready for work" issues:
- readyIssues: []
- unprocessedReadyIssues: []
State file shows no in-progress or blocked issues.
No further action needed right now. Would you like me to re-run this on a schedule, or check a different project/board?
Yeah for the existing item you can clear it from state.json. I still have another question: when the job runs what happens if it sees multiple items in the GitHub project that are ready for work? Would it process those items sequentially in the same cron worker instance?
[message_id: 6c4f174c-2b86-4835-a640-bce65f96e660]
assistant: Cleared. I removed I_kwDORJcijc7ooW25 from automation/github-project-watcher/state.json (it had been sitting in processedIssueNodeIds; inProgress was already empty).
On your question: yes—if a given run sees multiple “Ready for work” items, it will process them sequentially in the same cron worker instance, in a deterministic order (sorted by repo then issue number, per get-unprocessed-ready.mjs).
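As an illustration of that ordering, a comparator along these lines would give the same "repo, then issue number" sequence. The `repo` and `number` field names are assumptions; the real script may structure its items differently.

```javascript
// Sketch: deterministic order, by repo name first and issue number second.
// Returns a new array so the caller's list is left untouched.
function sortReadyIssues(issues) {
  return [...issues].sort((a, b) => {
    if (a.repo !== b.repo) return a.repo < b.repo ? -1 : 1;
    return a.number - b.number;
  });
}

console.log(
  sortReadyIssues([
    { repo: "org/b", number: 2 },
    { repo: "org/a", number: 10 },
    { repo: "org/a", number: 3 },
  ])
);
```

Plain `<` comparison is used instead of `localeCompare` so the order does not depend on the machine's locale.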
A couple practical notes given the 10-minute cron timeout:
- It will start iterating through the whole list, but in practice it may only finish 1 (or a few) before it hits the timeout, especially if cloning/tests/CI polling are involved.
- With the lock in place, another cron tick won’t run concurrently, so any remaining ready items will wait until the next scheduled run after the current one finishes and releases the lock.
If you want, we can add an explicit “time budget” rule to the payload (e.g., “only start a new issue if ≥2 minutes remain”) to avoid half-starting an issue near timeout.
user: So what we're saying is, let's say I have five items that are ready for work and it only finishes one in the 10-minute window. It was in the process of working on the second item when it hit the timeout. What happens next for that second item and the other three items?
[message_id: 6352c31a-6c09-4360-b39e-6c078dcb6c60]
assistant: Assume 5 items are “Ready for work”, and the worker finishes #1, starts #2, then the cron run times out.
Item #2 (the one it was working on when it timed out)
With the updated logic, the first thing it does when it starts an item is:
mark-in-progress.mjs <issueNodeId> --project-item <projectItemId>
So when the job dies mid-#2:
- GitHub Project: that Project item will already have been moved to In progress.
- Local state.json: that issueNodeId will be present in state.inProgress with a startedAt.
On the next cron run:
- the job only queries items that are Status = “Ready for work”, so #2 (now “In progress”) won’t even show up in readyIssues.
- even if it somehow did, get-unprocessed-ready.mjs would still skip it while it’s “fresh” in state.inProgress (until TTL expires; your watch.json says 12 hours).
Net: #2 will be skipped for a while (until you move it back to Ready for work manually, or until the TTL expires and you have logic to re-queue/handle stale inProgress).
The other 3 items (#3–#5)
Those were never touched, so they remain:
- GitHub Project: still Ready for work
- Local state: not in processedIssueNodeIds, not in inProgress, not in blocked
What happens next depends on whether the timed-out run had enough time to mark #2 in progress (usually yes, because it’s step (a) at the top). Either way, on the next run:
- the worker will pick up the remaining Ready-for-work items (#3–#5), sequentially, until it runs out of time again.
One gotcha (important)
Because we’re using a run lock, if the job times out hard and doesn’t execute the “release lock” step, the lock file can be left behind. That’s why the lock has a TTL (default 20 minutes): the next run will treat it as stale, break it, and continue.
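For reference, the stale-lock decision boils down to something like the sketch below. This is not the contents of lock.mjs; the `acquiredAt` field and the function names are hypothetical, and the 20-minute value is just the default mentioned above.

```javascript
// Sketch: classify the run lock as free, held (fresh), or stale.
// `lock` is null when no lock file exists, else { acquiredAt: ISO timestamp }.
function classifyLock(lock, ttlMinutes, nowMs) {
  if (!lock) return "free";
  const ageMs = nowMs - Date.parse(lock.acquiredAt);
  return ageMs < ttlMinutes * 60 * 1000 ? "held" : "stale";
}

// A new cron tick proceeds unless a fresh lock is held by another run.
// A stale lock is treated as abandoned and may be broken.
function mayProceed(lock, ttlMinutes, nowMs) {
  return classifyLock(lock, ttlMinutes, nowMs) !== "held";
}
```

The TTL needs to be longer than the longest legitimate run, otherwise a slow but healthy run could have its lock broken underneath it.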
If you want the “timed out on #2” case to auto-recover cleanly, the usual improvement is:
- add a “stale inProgress requeue” step (e.g., if startedAt is older than 30–60 minutes, mark it blocked or move it back to Ready), or
- add a time-budget rule so it won’t start #2 unless there’s enough time left to at least create the PR.
Which behavior do you prefer for stale inProgress: auto-retry, auto-block, or leave for manual?
user: System: [2026-02-06 14:44:22 UTC] Cron: Run completed.
Summary:
- Acquired run lock successfully.
- Queried for unprocessed "Ready for work" issues on the project board.
- No unprocessed ready issues found (unprocessedReadyIssues: []).
- Released run lock.
No action required. If you want me to run again on a schedule or force a re-check, tell me and I'll start another run.
I don't want to have to do anything manually other than providing clarifications. That's kind of the whole point of setting up this automatic coding agent process. What do you think will be the most robust way to handle this scenario?
[message_id: f5531d42-e730-498b-a4f9-d0e27839dd7c]
assistant: If you want zero manual intervention except clarifications, the most robust pattern is to treat each issue like a leased job with automatic retries + an eventual “needs human” escape hatch.
Right now we have the beginnings of that (mark inProgress early), but we’re missing the “what if the worker dies mid-job?” recovery loop.
Here’s what I think is the most robust approach, in order of impact:
1) Make inProgress a short lease + auto-reclaim
Instead of inProgressTtlMinutes = 720 (12h), use something like 30–60 minutes for the automation lease.
- When a run starts issue #2, it marks it inProgress.
- If the run dies, the next run waits until the lease expires, then automatically reclaims the item and retries.
This eliminates long “stuck in progress” states without you doing anything.
Even better: “lease heartbeat”
If the agent is still alive and working, it should periodically “renew” the lease (update startedAt or heartbeatAt) every couple minutes. Then:
- alive worker keeps the lease fresh
- dead worker stops heartbeating → lease expires → next run reclaims
That’s how most production queue workers avoid duplication and avoid getting stuck.
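A rough sketch of the heartbeat idea, assuming each inProgress entry carries a `startedAt` and an optional `heartbeatAt` (the function names are made up for illustration):

```javascript
// Sketch: a live worker calls renewLease every couple of minutes.
function renewLease(entry, nowMs) {
  return { ...entry, heartbeatAt: new Date(nowMs).toISOString() };
}

// The lease counts from the last heartbeat, falling back to startedAt
// for entries that were never renewed.
function isLeaseExpired(entry, leaseMinutes, nowMs) {
  const lastSeenMs = Date.parse(entry.heartbeatAt ?? entry.startedAt);
  return nowMs - lastSeenMs >= leaseMinutes * 60 * 1000;
}
```

With a 30-minute lease, a worker that heartbeats at minute 20 is still considered alive at minute 40, while one that died at minute 1 is reclaimable from minute 30 onward.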
2) Teach the cron job to also look at stale inProgress and recover them
Today the cron only queries Ready for work items. For true hands-off automation, the job should do:
- Query Ready for work items (new work)
- Also check state.inProgress for stale leases (or query Project items in “In progress” too)
- If stale:
- either retry automatically (set Project back to Ready-for-work and re-run), or
- just continue work on it if it can detect an existing branch/PR
My preference: retry automatically unless you’ve built reliable “resume from partial state” logic.
3) Add attempt counting + escalation to “Blocked + clarification”
To avoid infinite retry loops (e.g., broken tests, missing secrets, flaky dependency), track attempts per issue, e.g.:
- attempts: 1, 2, 3…
- lastErrorSummary
- lastTriedAt
Policy:
- Retry automatically up to N times (usually 2–3)
- If still failing, mark Blocked and message you with specific questions or the failure summary
This keeps it autonomous but prevents it from burning cycles forever.
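In code, the attempt policy could look roughly like this. The three field names are the ones listed above; `recordFailure`, `nextAction`, and `maxAttempts` are hypothetical names for the sketch.

```javascript
// Sketch: bump the per-issue attempt record after a failed or timed-out run.
function recordFailure(record, errorSummary, nowMs) {
  return {
    attempts: (record?.attempts ?? 0) + 1,
    lastErrorSummary: errorSummary,
    lastTriedAt: new Date(nowMs).toISOString(),
  };
}

// Retry until the limit is reached, then escalate to Blocked + a message.
function nextAction(record, maxAttempts) {
  return (record?.attempts ?? 0) >= maxAttempts ? "block-and-ask" : "retry";
}
```

Keeping `lastErrorSummary` around means the eventual "I'm blocked" message can say what failed instead of just that it failed.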
4) Time-budget rule: never start a new issue near timeout
This prevents the “starts #2 and dies immediately” scenario.
Example policy:
- If there’s < ~2 minutes left, don’t start another issue
- Just release lock and exit cleanly
- Next run continues with remaining ready items
This dramatically reduces partial work without requiring complex resume logic.
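The check itself is tiny. A sketch, with hypothetical parameter names and the run's start time passed in explicitly:

```javascript
// Sketch: only start another issue if enough of the run's budget remains.
function canStartNewIssue(runStartedMs, timeoutSeconds, minRemainingSeconds, nowMs) {
  const deadlineMs = runStartedMs + timeoutSeconds * 1000;
  return deadlineMs - nowMs >= minRemainingSeconds * 1000;
}

// Example: a 600-second run with a 120-second reserve.
const runStartedMs = Date.now();
console.log(canStartNewIssue(runStartedMs, 600, 120, runStartedMs + 400 * 1000));
```

In the example, 200 seconds remain at the time of the check, which clears the 120-second reserve.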
5) Consider “1 issue per run” as the simplest reliable mode
With a 10-minute timeout, processing 5 issues in one run is optimistic.
A very robust strategy is:
- Each cron tick processes at most 1 issue (or 1 PR creation attempt)
- Marks it processed/done once PR is created
- Next tick picks up the next issue
This makes timeouts almost irrelevant and naturally smooths load.
What I recommend for your goals (hands-off)
Implement: (A) short lease + auto-reclaim + attempts + time budget + 1-issue-per-run.
That combo is boring, simple, and extremely reliable.
Concrete behavior in your “5 items” scenario
- Run processes item #1 fully → done
- Starts item #2
- if it times out mid-way:
- item #2 remains “In progress” briefly
- next run either:
- reclaims it after lease expiry and retries, or
- continues it if it can detect a PR/branch
- Items #3–#5 stay Ready and get handled on subsequent runs
No manual cleanup. Only messages you get are:
- clarification needed, or
- “I tried 3 times and keep failing; here’s why—what should I do?”
If you’re aligned, I’d implement next:
- Reduce inProgressTtlMinutes in watch.json to something like 45
- Extend the state format to track attempts per issue + timestamps
- Update the cron payload instructions to:
- process max 1 issue per run
- before starting: ensure enough time remains
- if stale inProgress exists: reclaim/retry it first
Want me to go ahead with that plan, or do you prefer “process as many as possible per run” but with a time budget?
user: I'm not sure... some of these feel way more complicated than they need to be... my end goal is simply that I'll create work in my GitHub project and I want the coding agent to pick the items up and work on them, and ask me for clarifications when needed
[message_id: fa58fcac-5ebc-4394-a25c-54f44c361e53]
assistant: You’re right to push back. You don’t need a “distributed job system.” You just need two simple guarantees:
1) No duplicate workers stepping on each other (lock)
2) No item can get stuck forever (automatic recovery from timeouts/crashes)
We already added (1). The simplest version of (2) that stays low-complexity is:
The simple, robust rule set (what I’d do)
A) Process one issue per cron run
This is the biggest simplifier.
- Each run: pick the first Ready-for-work item → do the work → stop.
- Next run (5 minutes later): pick the next one.
Why it helps:
- Almost eliminates the “timed out halfway through issue #2” scenario.
- Makes behavior predictable.
- Keeps the “queue” naturally flowing without needing fancy resuming.
B) When starting an issue, mark it In progress
We already instructed this:
mark-in-progress ... --project-item ...
So if the run dies, at least it’s clearly “owned” and won’t be picked up as Ready by a parallel run.
C) If an issue has been In progress “too long”, automatically re-queue it
No manual intervention.
Simplest interpretation:
- If something has been inProgress for > X minutes (pick something like 60–90), then on a future run we:
- set it back to Ready for work (or directly pick it up again), and
- try again.
- If it keeps failing repeatedly, then mark Blocked + ask you.
This can be implemented with minimal extra logic:
- We already compute staleInProgress in get-unprocessed-ready.mjs.
- We just need the cron instructions to say: if staleInProgress exists, “reclaim” it (unmark/mark-blocked/whatever policy) and retry.
D) Telegram messages only for:
- “I need clarification” (blocked)
- “I merged it” (optional, as you already had)
That matches your end goal.
Answering your earlier scenario with this “simple mode”
5 items ready, it finishes 1 and times out on 2
If we do one issue per run, this basically can’t happen (because it wouldn’t start #2 in that same run). The remaining 4 items just wait for subsequent runs.
If a single issue itself times out mid-work:
- It’s already marked In progress
- Next runs won’t ignore it forever, because after ~60–90 minutes it will be auto-requeued / retried
- After N retries it becomes Blocked + asks you
No manual cleanup.
One question so I implement the recovery behavior correctly
When an issue is stale In progress (timed out/crashed), do you want the agent to:
- Retry automatically (preferred): move it back to Ready-for-work (or just re-run it)
- Mark Blocked + message you immediately (“I got stuck, what now?”)
If you say “retry automatically,” I’ll implement a very simple “retry up to 2 times, then block + ask” policy. That keeps it hands-off without getting into complicated systems.
user: System: [2026-02-06 14:49:40 UTC] Cron: Run started, lock acquired.
Result:
- Found 0 unprocessed "Ready for work" issues (unprocessedReadyIssues is empty).
- No actions required.
Lock released. Nothing else to report.
retry automatically. but just to clarify, in this new setup, if something would take longer than 5 mins to complete, would it get stuck forever essentially? bc even if it gets requeued it will time out again next time?
[message_id: 72fd8890-00a4-4a84-b0b4-631faec8b1c5]
assistant: No, it wouldn’t get stuck forever, but it could get into a dumb retry loop if the per-run time budget is always shorter than the work.
There are three different “takes longer than 5 mins” cases:
1) It takes longer than 5 mins, but less than the job timeout (10 mins today)
Then it’s fine. The cron fires every 5 mins, but the run can last up to timeoutSeconds: 600 (10 mins). With the lock, overlapping ticks just bail.
So “>5 mins” by itself isn’t a problem.
2) It often takes longer than 10 mins to do real work (clone + changes + tests + PR)
Then yes, with timeoutSeconds=600, you can see:
- start work
- time out
- auto-retry
- time out again
…until it either gets lucky (fast run) or hits a retry limit and blocks.
It still won’t be “stuck forever” if we implement “retry N times then Blocked + ask you”, but it could churn and waste cycles.
3) The work is inherently slow because of CI / auto-merge waiting
This one is actually already handled: we don’t need to wait for merge to make progress. You can mark Done when PR is created, and stop polling merge entirely (or poll briefly). CI can run asynchronously.
The simplest fix (and what I recommend)
Increase the cron job timeout.
If you want this to behave like a real worker, give it room to breathe:
- set timeoutSeconds to 1800 (30 min) or 2400 (40 min)
Then combine with:
- process 1 issue per run
- time-budget rule (don’t start if < ~5 min left)
This keeps logic simple and makes “retry loop” much less likely.
If you keep the schedule at every 5 minutes, that’s fine: the lock will cause most ticks to bail while the long run is active.
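To see how the 5-minute schedule and a long run interact, here is a small simulation sketch. It assumes every run releases the lock normally when it finishes (no stale-lock handling), and the function name is invented for the example.

```javascript
// Sketch: which cron ticks actually start a run when runs outlast the interval?
function simulateTicks(tickIntervalMin, runDurationMin, totalMin) {
  const results = [];
  let busyUntil = 0; // minute at which the active run releases the lock
  for (let minute = 0; minute < totalMin; minute += tickIntervalMin) {
    if (minute < busyUntil) {
      results.push({ minute, outcome: "bail" }); // lock still held
    } else {
      results.push({ minute, outcome: "run" }); // lock acquired
      busyUntil = minute + runDurationMin;
    }
  }
  return results;
}

console.log(simulateTicks(5, 22, 60).filter((r) => r.outcome === "run"));
```

For a 22-minute run over one hour, only the ticks at minutes 0, 25, and 50 start work; the other nine bail on the lock.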
And add one safety valve
Even with longer timeouts, keep:
- “retry automatically up to 2–3 times”
- then mark Blocked + message you with what it needs (or what failed)
That guarantees “no manual work except clarifications,” but also “no infinite spinning.”
If you want, I can implement this as:
- update the cron job: timeoutSeconds: 1800 and instruct “max 1 issue per run”
- adjust watch.json inProgressTtlMinutes to something like 90 (so a dead run gets reclaimed)
- add a very small “attempt counter” in state so we block after repeated timeouts/failures
Do you want 30 minutes or 45 minutes for the job timeout?
user: sounds good, timeoutSeconds: 1800 is good, and we can set inProgressTtlMinutes to 30 too
[message_id: 3f6dfe74-b1bf-4f91-9cdf-682965357843]
assistant: Done.
Changes applied:
- Cron job timeout increased
- “Automatic Coding Agent” now has timeoutSeconds: 1800 (30 minutes).
- In-progress TTL reduced
- automation/github-project-watcher/watch.json now has: inProgressTtlMinutes: 30
- Automatic stale inProgress recovery added
- Added: automation/github-project-watcher/reclaim-stale-inprogress.mjs
- On each run (after acquiring the lock), it reclaims anything stale in state.json inProgress and re-queues it (and clears the local inProgress entry).
- State now remembers projectItemId for inProgress
- Updated mark-in-progress.mjs so when called with --project-item <id>, it stores that projectItemId alongside startedAt. This makes re-queueing possible.
- “Process at most one issue per run”
- Updated the cron payload instructions to take only the first unprocessedReadyIssues item each run.
One nuance to be aware of (important): for auto-retry to work cleanly, the reclaim script needs the projectItemId recorded in state.inProgress. From this point forward it will be (because we record it when we mark in progress). Older inProgress entries without a stored projectItemId would still be cleared locally, but might need the Project item moved back to “Ready for work” to be picked up again—this should now be handled going forward.
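To illustrate that nuance, here is a simplified sketch of the reclaim decision. It is not the actual reclaim-stale-inprogress.mjs; the object-keyed inProgress shape and the return fields are assumptions, and the real script also has to perform the Project update itself.

```javascript
// Sketch: split inProgress into entries to keep and entries to reclaim.
// Assumed shape: { [issueNodeId]: { startedAt, projectItemId? } }
function reclaimStale(inProgress, ttlMinutes, nowMs) {
  const remaining = {}; // still fresh, left alone
  const requeue = []; // projectItemIds to move back to "Ready for work"
  const clearedOnly = []; // stale, but no projectItemId to re-queue with
  const ttlMs = ttlMinutes * 60 * 1000;

  for (const [issueNodeId, entry] of Object.entries(inProgress)) {
    const isStale = nowMs - Date.parse(entry.startedAt) >= ttlMs;
    if (!isStale) {
      remaining[issueNodeId] = entry;
    } else if (entry.projectItemId) {
      requeue.push(entry.projectItemId);
    } else {
      clearedOnly.push(issueNodeId);
    }
  }
  return { remaining, requeue, clearedOnly };
}
```

Entries that land in `clearedOnly` are the "older entries" case: the local state is cleaned up, but nothing moves the Project item automatically.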
If you want, I can trigger a manual cron run now to sanity-check that:
- lock acquire/release works,
- reclaim script runs cleanly,
- and it exits normally when there are no ready items.
user: System: [2026-02-06 14:54:55 UTC] Cron: Run started, acquired lock, checked for unprocessed "Ready for work" issues, found none, and released the lock.
Details:
- Lock acquire: succeeded (lock at /tmp/openclaw-automatic-coding-agent.lock).
- get-unprocessed-ready.mjs output: { readyIssues: [], unprocessedReadyIssues: [], state: { inProgress: [], blocked: [], staleInProgress: [], staleBlocked: [] } }
- Lock release: succeeded.
No further action required this run. If you want me to re-run now or change the watcher behavior (e.g., relax criteria, run on a different project board), tell me what to do next.
sure go for it
[message_id: 1d3cf15d-9b23-4762-9c20-f1685cbc2615]
assistant: The scheduler wouldn’t trigger it immediately just now — cron run returned:
ran: false
reason: "not-due"
That means this cron implementation only allows manual runs when the job is due (some setups enforce that).
Two easy options:
- Wait until the next 5‑minute tick (it’ll run on schedule automatically), or
- I can temporarily tweak the schedule (e.g., set it to run “every 1 minute”), let it fire once, then set it back to 5 minutes.
Which do you prefer?