The first sign of a problem was the app crashing while I was testing a new feature I had added. The error in Visual Studio pointed to a XAML issue, which was odd, since I wasn’t working with XAML at that point. I restarted the app and no error occurred. “Meh, whatever” was my response. The second sign was when another developer on the team told me about the issue a week later. My response: I shrugged it off and told him to restart. The third sign was when the testers logged a bug about it; finally someone was treating it as a real issue. You may hear that a developer & a tester are better than two developers, and this is a great example of why: two different perspectives on what really counts as an issue do help.
The app we were building was a Windows Store App for Windows 8.1 using XAML & C#, built using the Universal App Framework. It was also big for an app, with over 40 screens! The XAML for our custom controls & styles is over 2 000 lines! The bug could be anywhere, but it had four traits:
So how did we fix this? This is a story of the dumb things I did followed by the smart thing I did.
Step one: we knew it was XAML & it happened on every page, so what was common to all pages? First, our styles.xaml, where we store all the common colours & styles for the app. Second, a custom control we built, called PageLayout, which gives us a common layout and pieces of functionality (it is basically similar to the Frame control). If you are interested, the XAML for that control is here. The key thing that pointed to PageLayout as the cause was that it has FOUR content presenters on it, and the error suggested a content presenter was at fault. I reviewed the XAML, and so did another dev: no problems found.
One thing did stand out: when we tweaked the XAML for PageLayout, the line numbers in the error changed. So we knew the problem was there, but what was it?!
One day we were running the app outside of Visual Studio and it crashed. You know what is awesome in Windows? It grabs a memory dump in scenarios like this, so we got a memory dump of the error! This is awesome because Visual Studio can open those, and there is a fantastic blog post on exactly how to work with them. You know what two hours of doing that told me? Nothing. Come on, why would there be all that text below if we had the answer by now? Keep up, reader. All I could get was the following error message:
Unhandled exception at 0x7582B152 (combase.dll) in triagedump.dmp: 0xC000027B: An application-internal exception has occurred (parameters: 0x055C31F8, 0x00000004).
I also got stuck because I didn’t have the Debugging Tools for Windows installed, and a download of over a gig wasn’t going to happen at work, but that may have been a blessing in disguise. Along the way, though, I did find this great Wintellect Now video on debugging at the OS level.
So after spending hours repeating the process of making a small change and then clicking through various screens over & over again (and all but guaranteeing myself carpal tunnel syndrome), I wondered: can’t I automate the clicking? Actually, I didn’t need clicks; I just needed navigations to a different page. So I created a new universal app with a single page in it. I then copied PageLayout and everything it needed over, totalling 300 lines of XAML! I added a single button which automatically navigates to the main page over and over again. You can see this code base here.
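The post doesn’t show the navigation loop itself, so here is a minimal sketch of how a one-button repro page like this could work, assuming a Windows 8.1 XAML app whose root Frame hosts MainPage. The namespace CrashRepro, the button text, the StartButton_Click handler and the 250 ms interval are all hypothetical choices for illustration, and the button is built in code only so the sketch doesn’t depend on a .xaml file; in the real repro the page would also host the copied PageLayout control.

```csharp
using System;
using Windows.UI.Xaml;
using Windows.UI.Xaml.Controls;

namespace CrashRepro // hypothetical namespace
{
    public sealed class MainPage : Page
    {
        // Static so the loop survives navigation: with the default
        // NavigationCacheMode (Disabled), every Frame.Navigate call
        // creates a brand new MainPage instance.
        private static DispatcherTimer navigationTimer;

        public MainPage()
        {
            // Single start button, created in code (hypothetical layout).
            // The real repro page would also contain the copied PageLayout.
            var startButton = new Button { Content = "Start crashing" };
            startButton.Click += StartButton_Click;
            Content = startButton;
        }

        private void StartButton_Click(object sender, RoutedEventArgs e)
        {
            if (navigationTimer != null)
            {
                return; // the loop is already running
            }

            // Capture the root Frame once; it outlives every MainPage
            // instance the loop creates and throws away.
            Frame rootFrame = this.Frame;

            // 250 ms is an arbitrary pause that lets each new page load and
            // lay out before it is replaced by the next navigation.
            navigationTimer = new DispatcherTimer { Interval = TimeSpan.FromMilliseconds(250) };
            navigationTimer.Tick += (s, args) => rootFrame.Navigate(typeof(MainPage));
            navigationTimer.Start();
        }
    }
}
```

A DispatcherTimer keeps every navigation on the UI thread without any extra marshalling, and one click is all it takes to start hammering the page until the intermittent crash shows up.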
This app is called Heather Speed Crash!, and boy did it deliver: it crashed on every run! The crash itself was still intermittent, but now I could make it happen in under 60 seconds with one click, rather than more than 20 clicks & 5 minutes. This helped a lot in getting work done. It also meant the code base was under 500 lines of code + XAML, compared to the thousands of lines in the actual app.
Next, I iterated over Heather, stripping it down while still getting the error, which led to this version. It cut the code + XAML by more than half, and it also showed the issue was in PageLayout itself, not in something PageLayout used. This is one of the biggest lessons from all this: intermittent issues needn’t be a pain to resolve; the trick is just to make the scenario happen faster.
Somewhere between starting to understand the problem & diving into Windows debugging, I realised I was struggling, so I posted the issue to StackOverflow and then, a few days later, to the MSDN forums. This is always a valuable exercise for me, as it forces me to lay the problem out systematically, and often just the act of collecting, ordering and writing down the information shows me what the problem is. This time, writing didn’t help, and after a few days I had made no progress. On the MSDN forums, someone suggested I try to identify which of the four content presenters was the problem. Now that I had Heather to help, I could do this. The process I went through was:
These two content presenters were not used on every page, so the content they were bound to was null by default. This was the only difference between these two & the other two. So I gave each a default piece of content, and the problem stopped. Oddly, giving just one of them a default piece of content and leaving the other null also resolved it.
The issue turned out to be a layout bug somewhere deep in the XAML framework, triggered only when two content presenters with null content both exist on a page. When XAML works out the sizes of the controls, having two null content presenters causes a crash.
The workaround was to set both content presenters’ Visibility to Collapsed by default & only make them visible when they have content.
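One way to express that workaround is a value converter that derives a presenter’s Visibility from its content. This is a sketch under assumptions, not the code from the actual app: NullToCollapsedConverter, the CrashRepro namespace, the FooterContent property and the NullToCollapsed resource key are all hypothetical names. It uses the Windows 8.1 IValueConverter signature, where the last parameter is a language string.

```csharp
using System;
using Windows.UI.Xaml;
using Windows.UI.Xaml.Data;

namespace CrashRepro // hypothetical namespace
{
    // Maps a ContentPresenter's content to its Visibility: null content
    // gives Collapsed, anything else gives Visible. A Collapsed element
    // takes up no space in layout.
    //
    // Hypothetical XAML usage, binding Visibility to the same property
    // that feeds Content:
    //   <local:NullToCollapsedConverter x:Key="NullToCollapsed" />
    //   <ContentPresenter Content="{Binding FooterContent}"
    //       Visibility="{Binding FooterContent,
    //                    Converter={StaticResource NullToCollapsed}}" />
    public sealed class NullToCollapsedConverter : IValueConverter
    {
        public object Convert(object value, Type targetType, object parameter, string language)
        {
            return value == null ? Visibility.Collapsed : Visibility.Visible;
        }

        public object ConvertBack(object value, Type targetType, object parameter, string language)
        {
            // One-way only: content drives visibility, never the reverse.
            throw new NotSupportedException();
        }
    }
}
```

One caveat worth checking with this approach: if the binding path can’t be resolved at all, the converter may never run and the presenter keeps its default Visible state, so the binding needs to actually reach the content property for the presenter to start out collapsed.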