Debugging stuck Script Activities | Cleon

Something went sideways. A Script Activity in an overnight Automation was supposed to write 12,000 rows to a log DE and only wrote 3,400. Or it ran for 31 minutes and the Automation step shows Failed with no useful error message. Or — the worst kind — the step shows Completed but nothing downstream got the data it expected. The diagnostic flow is the same every time: read the log DE the script was supposed to write, find the last row it managed to emit, and read backward from there.

This page is the six-query playbook. If you instrumented the script per the Style Guide — RunId + Step + Ts at every meaningful step, errors caught into de_log_ssjs_errors — every query below resolves in seconds. If you didn't, this is also the page that convinces you to start.

The diagnostic flow

[ Activity scheduled ]
        ↓ runtime begins
[ Platform.Load + RUN_ID generated ]
        ↓ init row written to de_log_ssjs_runs
[ Step 1 ... Step N ]
        ↓ each step writes its own row
[ Completion / Timeout / Caught error ]

Each arrow is a place the script can die without leaving an error in the Activity log. The queries below check each level by reading the rows the script did write, then inferring what's missing.

Step 1 — Did the script even start?

If the script never wrote its init row, the failure is before any of your code ran. Platform.Load failed, a SpiderMonkey 1.7 SyntaxError took the whole block down, or the Activity didn't fire at all.

-- Replace the date range with the Activity's expected run window
SELECT
  COUNT(*)                AS InitRows,
  MIN(Ts)                 AS FirstWriteAt,
  COUNT(DISTINCT RunId)   AS DistinctRuns
FROM de_log_ssjs_runs
WHERE Step = 'init'
  AND Ts BETWEEN '2026-05-13 02:00' AND '2026-05-13 03:00';

Three numbers, one question:

InitRows = 0: the script body never executed. Check the Activity's Run History in Automation Studio for an error message (Platform is not defined, SyntaxError, etc.). If the Run History also shows nothing, the Automation may not have triggered — verify the schedule and the upstream step's status.
InitRows = 1 but no other rows for that RunId: the script reached line 1 but died before the first meaningful step. Usually a missing dependency (a Code Resource that didn't load, a DE name that doesn't exist).
DistinctRuns > 1 in a window that should have only one: the Activity retried. Read the Run History to see why.

Step 2 — Where did it die?

If the init row is there, find the last step the script managed to write. That's the boundary — the bug is in whatever runs after it.

-- Replace 'a1b2c3d4-...' with the RUN_ID of the failed run
SELECT
  Step,
  Ts,
  Message
FROM de_log_ssjs_runs
WHERE RunId = 'a1b2c3d4-e5f6-...';
-- Order alphabetical-by-Step or chronological-by-Ts depending
-- on how you named steps. The Cleon convention prefixes each step
-- with a 2-digit order (01-init, 02-lookup, 03-enrich, ...) so
-- alphabetical and chronological agree.

The last Step value in the result is where the script died. The next bit of code in the script — between that step and the next log() call — is where the bug lives.

A common shape: the script logs 03-enrich-start but never 03-enrich-done. The bug is inside the enrichment block — usually a LookupRows cap, a WSProxy auth error, or an HTTP timeout on an external API. Steps 3-5 below check each.

Step 3 — Did it catch an error?

If the script followed the Style Guide rule about logging inside catch blocks, the error is sitting in de_log_ssjs_errors with the same RunId.

SELECT
  Step,
  Ts,
  Msg
FROM de_log_ssjs_errors
WHERE RunId = 'a1b2c3d4-e5f6-...';

What the message tells you:

Token has expired / Unauthorized — WSProxy auth timeout. The fix is in WSProxy — re-instantiate the prox every 15 minutes inside long-running loops. See gotchas — #8.
The requested operation could not be performed — usually a column mismatch in UpsertData / InsertData. The destination DE schema changed, your column list didn't.
Object expected / 'Platform' is not defined — Platform.Load is missing or the script tried to use Platform.* before the load call. See gotchas — #2.
No rows at all for this RunId: either the script has bare catch {} blocks swallowing errors silently (see gotchas — #7), or the failure mode is not an exception — it's a silent truncation (step 5 below).

Step 4 — Did it hit the 30-minute timeout?

Compute the duration between the first and last log rows for the RunId. If the last row's Ts is more than 28 minutes after the init row's Ts, the script almost certainly hit the 30-min hard timeout (see gotchas — #3).

SELECT
  RunId,
  MIN(Ts)                                AS StartedAt,
  MAX(Ts)                                AS LastWriteAt,
  DATEDIFF(minute, MIN(Ts), MAX(Ts))     AS DurationMinutes
FROM de_log_ssjs_runs
WHERE RunId = 'a1b2c3d4-e5f6-...'
GROUP BY RunId;

A DurationMinutes of 29 or 30 with no completion row is the timeout signature. The fix is structural, not in the failing step: either split the work across parallel Script Activities (see gotchas — #10), pivot to a SQL Query Activity for the bulk read, or add pagination upper bounds inside the loop (see gotchas — #9).

Step 5 — Did `LookupRows` silently truncate?

The LookupRows cap at 2500 doesn't throw an error. It returns the first 2500 rows and your script processes only those. To detect the failure after the fact, look for rows in de_log_ssjs_runs that recorded a lookup step with a row count of exactly 2500 — that's the cap signature.

SELECT
  RunId,
  Step,
  Message,
  Ts
FROM de_log_ssjs_runs
WHERE Step LIKE '%lookup%'
  AND Message LIKE '%count=2500%'
  AND Ts BETWEEN '2026-05-13 02:00' AND '2026-05-13 03:00';

This assumes the script logged the row count with each LookupRows call (e.g. log("lookup-subs", "count=" + rows.length)). If it didn't, this query returns nothing — and the only way to confirm the silent truncation is to re-run the script with a corrected log line, or pivot the read to a SQL Query Activity and compare row counts.

See gotchas — #5 for the pattern that prevents this from happening twice.

Step 6 — Write the postmortem

Once the bug is found, write the diagnostic into a de_log_ssjs_postmortems so the next time something looks off, you have historical baselines and a paper trail.

INSERT INTO de_log_ssjs_postmortems
SELECT
  GETDATE()                                AS DiagnosedAt,
  'SA_NightlyEnrichment'                   AS ActivityName,
  'a1b2c3d4-e5f6-...'                      AS RunId,
  (SELECT MIN(Ts) FROM de_log_ssjs_runs WHERE RunId = 'a1b2c3d4-e5f6-...')                 AS StartedAt,
  (SELECT MAX(Ts) FROM de_log_ssjs_runs WHERE RunId = 'a1b2c3d4-e5f6-...')                 AS LastWriteAt,
  (SELECT TOP 1 Step FROM de_log_ssjs_runs WHERE RunId = 'a1b2c3d4-e5f6-...' ORDER BY Ts DESC) AS LastStep,
  (SELECT COUNT(*) FROM de_log_ssjs_errors WHERE RunId = 'a1b2c3d4-e5f6-...')              AS ErrorCount,
  'LookupRows cap hit at 03-enrich; pivoted to SQL Query Activity'                         AS RootCause;

Run this after each diagnostic. Six months from now when a similar Activity acts up, you have the history to recognize the pattern instead of starting from zero.

Common causes ranked by frequency

| Cause | How to spot | Fix in | |---|---|---| | LookupRows returned 2500 silently | Step 5 finds a lookup step with count=2500 | gotchas — #5; pivot to SQL Query Activity | | 30-min hard timeout | Step 4's DurationMinutes is 29-30 with no completion row | gotchas — #3; split across Script Activities | | WSProxy auth expired mid-loop | Step 3 shows Token has expired | WSProxy; re-instantiate every 15 min | | HTTP.Get / Post hit 60s timeout | Step 3 shows a request-timeout message | gotchas — #9; paginate + cap iterations | | Bare catch {} swallowed an error | Step 3 returns no rows but step 2 stopped abruptly | gotchas — #7; log inside every catch | | Platform.Load missing | Step 1 returns zero init rows; Activity Run History shows Platform is not defined | gotchas — #2; line 1 every block | | SpiderMonkey 1.7 SyntaxError | Step 1 returns zero init rows; Activity Run History shows the parse error | gotchas — #1; no modern JS | | UpsertData against wrong PK | Step 3 might show nothing; row counts in downstream DEs are wrong | gotchas — #4; verify destination PK | | Unbounded while loop | Step 4's DurationMinutes near timeout, last step is inside a paginated callout | gotchas — #9; add && i < N | | Automation didn't trigger at all | Step 1 returns zero rows for the entire window | Check upstream step status + Automation schedule |

Basics — two contexts (CloudPage vs Script Activity) and how output differs
Platform.Function — the UpsertData / LookupRows / GUID calls the diagnostic queries depend on
WSProxy — auth token expiry pattern (the #1 cause of long-loop failures)
MC SSJS gotchas — the failure shapes referenced throughout this playbook
Style Guide — the instrumentation discipline (RunId + Step + Ts + log-in-catch) that makes these queries possible in the first place