TL;DR:  Being able to see the stack trace and investigate the values in those stack frames from a failed node process makes it much easier to understand the error messages and what has gone wrong.  mdb on SmartOS makes this possible, arguably, even easy.

I run several small Triton Smart Data Center installations, both for fun and as part of my job.  In both cases, I'm running the open source product, in a minimal configuration which might have more in common with a lab or development environment than a large datacenter.  I'm not running any HA.  I have backups and I know how to recover from them.  I don't have fully independent power.  Bringing the system back up when all of the HA elements were down at the same time isn't a challenge I wanted to engage in.  The non HA nature of my configurations will become relevant shortly.

Upgrades to Triton Smart Data Center are smooth.  Amazingly smooth.  I've been doing essentially monthly upgrades of the components for nearly 2 years.  In those two years, I've had two upgrades not go perfectly smoothly.  This is the story of the second one.

During a standard monthly update, while doing an sdcadm update --all the process timed out in one of the processing stages.  This has happened before because my system was slow, but up until now, simply re-running the update command has cleaned up the environment.  Which is what I did, which is also where the fun began.

Triton has most of its state stored in a single service called manatee.  Thus, this is a somewhat sensitive service/zone.  The last zone to upgrade after the failed first run was manatee, and with great concern, it did not start correctly after the update.  That moment is when you know you are going to be having a longer night than you had planned.

More than one Triton service uses waferlock. In my manatee zone waferlock was not starting.  The logs were... correct, but not helpful.  Sadly, they have rotated out before I have written this blog, so I can not correctly reproduce them here.

The logs indicated that waferlock was crashing here in the process of attempting to start.  The logs did not, however contain a full listing of what was in inst.metadata, only that DATACENTER was undefined and that a core had been dumped.  I spent several hours chasing around red herrings, attempting to figure out what part of my configuration had gone so very wrong.  On the second attempt to figure this out, after I took a break for sleep, I decided it was time to attempt to figure this out from the core file.

I basically did a paint by numbers trace of the javascript stack, following along with the examples, because I did not know what I was doing when I started. I still have the core, so what follows is my reproduction of the process.

root@6a4702d7-cd1a-4648-876a-3f95b81c36e7 (us-elns-1:manatee0) /]# mdb core
Loading modules: [ libumem.so.1 libc.so.1 ld.so.1 ]
> ::stack
_ZN2v84base2OS5AbortEv+9(93a2008, ada186f5, 80430c0, 80432c8, 80432cc, 0)
_ZN2v88internal2IC9TypeErrorENS0_15MessageTemplate8TemplateENS0_6HandleINS0_6ObjectEEES6_+0x4d(804321c, 8043238, 2f, 80432cc, 80432c8, 80432cc)
_ZN2v88internal6LoadIC4LoadENS0_6HandleINS0_6ObjectEEENS2_INS0_4NameEEE+0x1a2(804321c, 8043238, 80432cc, 80432c8, 8043238, 80432cc)
_ZN2v88internal11LoadIC_MissEiPPNS0_6ObjectEPNS0_7IsolateE+0x17e(4, 80432cc, 93a2008, aa10a07c, aa10a001, 80432a0)
0xaa10a07c(81459af5, e, 9594c681, 9fa08099, 9fa08099, 9fa08099)
0x9f7d5337(81459911, 9fa08099, 6, ada185bd, 14, 8043338)
0xaa11a143(ada17b49, 0, 81459911, 9fa08099, 9fa08099, 81459911)
0xaa17f1ef(4, ada17b49, 9fa08099, ada185bd, 9fa08099, a3c424f1)
0x9f7324a1(9fa08099, ada185bd, ada17b49, 2, a3c42699, 14)
0xaa11a143(ada185bd, ada17b49, a3c42699, ada17bd5, ada17b49, ad9fdd1d)
0x9f7d4d13(ada17985, ada16195, ad9fe6fd, 9fa08089, 9fa6c7c9, 8)
0xaa11ef22(ada17995, 9fa6c7c9, ad9fdd1d, 9fa608bd, ada17995, ada17995)
0xaedbfc99(9fa6c7c9, 8, ad9fdd71, 14, 8043454, 9f7cc8db)
0xaa11a143(ada17985, ada16195, ad9fe6fd, 9fa08089, 9fa08099, ad9fdd71)
0x9f7cc8db(ada16b8d, ada16195, ad9fe6fd, 9fa08089, 9fa08099, ada0ba35)
0x9f7cc243(ada16195, 0, 0, aa11edc1, 10, 9fa24719)
0xaa11ef22(ada17655, ada16195, ada16789, 9fa608bd, ada17655, ada1689d)
0x9f735516(ada16195, 9fa6082d, 4, ada173a9, 6, 85889251)
0xaede2768(ada16195, 9fa08231, ada166c5, 9fa08099, 85889251, a3c4d821)
0xaa172733(9fa551ed, ada16195, a3c4d821, 8588b641, 8586db55, 8043580)
0x9f7cb300(ada16195, ada1620d, 9fa08099, 8588b641, 858890c5, 9fa08211)
0x9f71cf74(ada16195, ada1620d, 8588b641, 9fa08099, 8588f749, ada16fad)
0xaedc164f(9fa6c735, 858100f5, aa11a921, 10, 0, 80435f8)
0xaa11a9e1(0, 0, 2, 0, 93fcdf8, 93a2008)
0xaa11999f(aedc1320, 858100f5, 9fa6c735, 0, 0, 93a2008)
_ZN2v88internal9Execution4CallEPNS0_7IsolateENS0_6HandleINS0_6ObjectEEES6_iPS6_b+0xff(80436ac, 93a2008, 93fcdf8, 93fccd8, 0, 0)
_ZN2v88Function4CallENS_5LocalINS_7ContextEEENS1_INS_5ValueEEEiPS5_+0xd7(804372c, 93fcdf8, 93e402c, 93fccd8, 0, 0)
_ZN2v88Function4CallENS_5LocalINS_5ValueEEEiPS3_+0x3c(804378c, 93fcdf8, 93fccd8, 0, 0, 93a2008)
_ZN4node9AsyncWrap12MakeCallbackEN2v85LocalINS1_8FunctionEEEiPNS2_INS1_5ValueEEE+0x22e(8043818, 9410a48, 93e4018, 3, 8043824, 93fcfd8)
_ZN4node10StreamBase8EmitDataEiN2v85LocalINS1_6ObjectEEES4_+0xec(9410a70, 24d, 93e400c, 0, 0, 9460848)
_ZN4node10StreamWrap10OnReadImplEiPK8uv_buf_t14uv_handle_typePv+0x125(24d, 8043950, 0, 9410a48)
_ZN4node10StreamWrap6OnReadEP11uv_stream_siPK8uv_buf_t+0x76(9410aa8, 24d, 8043950, 8566f5c, 10000, 0)
uv__read+0x1cd(9410aa8, 0, 4, fe8c81b9)
uv__stream_io+0x2c1(9366e40, 9410aec, 1, 8538914, 94517d8, 9366e6c)
uv__io_poll+0x20d(9366e40, c8, 4, 9fa08255, 9366f3c, 9366e50)
uv_run+0x127(9366e40, 1, fe956720, 8335a3e)
_ZN4node5StartEiPPc+0x59d(2, 8047d44, 4, 400, 9355ff4, 8047cf8)
main+0x42(fe958737, fe9d2548, 8047d38, 8314543, 3, 8047d44)
_start+0x83(3, 8047e34, 8047e7d, 8047e7d, 0, 8047e9e)
>

Yup, that's a stack.  That's not very helpful.  Let's load up the v8 tooling.

> ::load v8
mdb_v8 version: 1.4.1 (release, from 0cd139c)
V8 version: 4.5.103.53
Autoconfigured V8 support from target
C++ symbol demangling enabled
> ::jsstack
native: v8::base::OS::Abort+9
native: v8::internal::IC::TypeError+0x4d
native: v8::internal::LoadIC::Load+0x1a2
native: v8::internal::LoadIC_Miss+0x17e
        (1 internal frame elided)
js:     <anonymous> (as <anon>)
        (1 internal frame elided)
js:     InnerArrayForEach
js:     forEach
        (1 internal frame elided)
js:     <anonymous> (as <anon>)
        (1 internal frame elided)
js:     <anonymous> (as <anon>)
        (1 internal frame elided)
js:     parseResponse
js:     done
        (1 internal frame elided)
js:     g
js:     emitNone
js:     emit
js:     endReadableNT
js:     nextTickCallbackWith2Args
js:     _tickCallback
        (1 internal frame elided)
        (1 internal frame elided)
native: v8::internal::Execution::Call+0xff
native: v8::Function::Call+0xd7
native: v8::Function::Call+0x3c
native: node::AsyncWrap::MakeCallback+0x22e
native: node::StreamBase::EmitData+0xec
native: node::StreamWrap::OnReadImpl+0x125
native: node::StreamWrap::OnRead+0x76
native: uv__read+0x1cd
native: uv__stream_io+0x2c1
native: uv__io_poll+0x20d
native: uv_run+0x127
native: node::Start+0x59d
native: main+0x42
native: _start+0x83
>

That's better?  Not really helpful yet.  Let's try the -vn0 modifier to see if that helps.

> ::jsstack -vn0
native: v8::base::OS::Abort+9
native: v8::internal::IC::TypeError+0x4d
native: v8::internal::LoadIC::Load+0x1a2
native: v8::internal::LoadIC_Miss+0x17e
        (1 internal frame elided)
js:     <anonymous> (as <anon>)
          file: /opt/smartdc/waferlock/lib/sapi.js
          posn: line 184
          this: 9fa08099 (Oddball: "undefined")
          arg1: 81459911 (<unknown subclass>: String)
        (1 internal frame elided)
js:     InnerArrayForEach
          file: native array.js
          posn: position 16657
          this: 9fa08099 (Oddball: "undefined")
          arg1: ada185bd (JSFunction)
          arg2: 9fa08099 (Oddball: "undefined")
          arg3: ada17b49 (JSArray)
          arg4: 4 (SMI: value = 2)
js:     forEach
          file: native array.js
          posn: position 17214
          this: ada185bd (JSFunction)
          arg1: 9fa08099 (Oddball: "undefined")
        (1 internal frame elided)
js:     <anonymous> (as <anon>)
          file: /opt/smartdc/waferlock/lib/sapi.js
          posn: line 151
          this: 9fa6c7c9 (JSGlobalProxy)
          arg1: 9fa08089 (Oddball: "null")
          arg2: ad9fe6fd (JSObject: ClientRequest)
          arg3: ada16195 (JSObject: IncomingMessage)
          arg4: ada17985 (JSArray)
        (1 internal frame elided)
js:     <anonymous> (as <anon>)
          file: /opt/smartdc/waferlock/node_modules/mooremachine/lib/fsm.js
          posn: line 199
          this: 9fa6c7c9 (JSGlobalProxy)
        (1 internal frame elided)
js:     parseResponse
          file: /opt/smartdc/waferlock/node_modules/restify-clients/lib/JsonClient.js
          posn: line 60
          this: 9fa08099 (Oddball: "undefined")
          arg1: 9fa08089 (Oddball: "null")
          arg2: ad9fe6fd (JSObject: ClientRequest)
          arg3: ada16195 (JSObject: IncomingMessage)
          arg4: ada16b8d (SeqOneByteString)
js:     done
          file: /opt/smartdc/waferlock/node_modules/restify-clients/lib/StringClient.js
          posn: line 194
          this: ada16195 (JSObject: IncomingMessage)
        (1 internal frame elided)
js:     g
>> More [<space>, <cr>, q, n, c, a] ?

Okay, I can now clearly orient myself in the stack frame.  Based on the source code, I'd like to see what was in the 'added' array at time of failure.

> ada17b49::jsprint
[
    "b4d65123-85b8-4406-b065-0d4a3719fd8f",
    "b9094c09-353c-4a49-b3cc-a16940697c00",
]

Okay, I had an immediate and strong reaction to that of 'what?'.  Remember how I said earlier that I was not running an HA environment?  Well that was an indication that there was more than one moray being listed, and in my environment, there's only one moray.

Additional investigation eventually led me to noting that the moray0tmp that gets created during the upgrade process hadn't been cleared out of SAPI, probably due to the crash during the first half of the update.  The moray0tmp that's created during the update process does not contain all of the metadata a full moray does, and does not have the DATACENTER element, which is where waferlock was crashing.  

I then manually cleaned up SAPI's record of reality, removing moray0tmp's record, and waferlock started immediately.  Problem resolved.  Board shows green again.

In the process of writing this entry, I stumbled over this wonderful block comment explaining that this does occasionally happen.  It appears there is some cleanup code for this case, which would have cleaned up the leftover records at the next update of moray.

So, next time a system helpfully drops you a core file, just open it up and start looking.