It's our data, we need to keep it in-house.

Using HTTP/2 Session IDs for Analytics

How to replace Google Analytics with an in-house solution

by Joe Honton

In this episode Ivana builds an in-house website log analyzer from scratch using Node.js.

Ivana thought about the remarks made by Antoní during last week's Tech Tuesday, where Capt. Cayman spoke about the importance of logging, in The Critical Importance of Data Logging. She wanted to clarify what he meant.

Ivana walked over to Antoní's desk. "So, you were saying that the RWSERVE logging tools don't go far enough. What's far enough for you?"

"Well, to start with, I'll just say that the RWSERVE logging facility is very comprehensive," Antoní replied. "It has the capability to record everything we need, maybe even more than we need. So I'm not finding fault there.

"And it's method of categorizing log entries by type, rather than by severity is a novel approach. I like that too.

"But what I want to see each morning is more than just a health check. More than just a list of all the bad actors who have tried to break in over the past 24 hours. More than just a list of the glitches in our JavaScript that we have to fix today.

"I want to see the paths that people follow as they navigate from page to page. This isn't a difficult problem. And I know it can be done without injecting third party analytic scripts into each page. There's just no reason why we need to be giving control over our valuable data to an outside company just because the service is free. It's our data. And especially in this new age of security awareness, we need to keep it in-house."

"I see," Ivana said, "so what would your advice be. Is it something that someone like me could do?"

"Absolutely," Antoní answered assertively. "Go for it. And if you get stuck, come back to me and I'll help."

Ivana left it at that for now, and went back to the API extension she was building for Rocket Science. But the impulse was still there, and just wouldn't fade.

Later that week, Ivana approached Clarissa, "So I've been thinking about a 20% skunk-works project. Is it really something I can work on after the weekly deploy each Thursday night?"

"Of course," Clarissa answered, "I meant what I said. If you've got a project in mind that would help make you more productive, absolutely. If you're thinking about something that would benefit the whole company, even better."

Ivana relaxed a bit, "Cool! I've been thinking about logging, and what Antoní was suggesting. You know, about threads. It seems to me that with HTTP/2 there's a built-in way to watch a visitor's thread, as they navigate through the website, simply by using the session identifiers. With HTTP/2, sessions are persistent, so it's a snap to monitor a visitor's navigation through the site. All I'd need to do is ..."

"OK, OK," Clarissa interrupted with a hand-gesture for her to stop. "I don't need any details. It's your project. Have fun!"

Ivana was relieved that she got the go-ahead so easily. Then, a few moments later, she fell into a panic. "What have I done? I was just thinking it should be possible, and now I'm on the hook."

She immediately went back to her desk and started prototyping a quick and dirty solution. If that didn't pan out, then she'd still have time to walk back her commitment.

Six hours later, it was getting dark outside. "It's quiet in here," she thought. "Am I the only one left?"

Ivana poked her head into Antoní's office, "Oh, you're here. I thought everyone had gone for the day."

"Right," he replied, "I should get out of here. It's way past quitting time. Can I walk you to the door, so we can lock-up for the night?"

Ivana almost declined. She was so close to getting her first real results, and didn't want to let it go. But she wasn't sure if Antoní was asking a question or making a statement, so she gave in. "OK, I suppose the code will still be there tomorrow."

"It always is," he said with a smile.

When Ivana returned the next day, she was musing about the analytics problem and how to proceed. But fate drew her away from any further work on it that day. It was a week before she was able to resume.

Friday arrived and she was able to pick up where she left off.

When she built the initial prototype, all of her thinking had been focused on the idea that session identifiers were going to be the key to unlocking visitor threads. And without too much fuss she had a workable prototype that proved the idea. But now, a week later, she saw how things could be taken to the next level.

Since sessions were persistent only for a limited time, there were gaps in the threads whenever the visitor lingered too long on any page. So she fell back to using the visitor's IP address to reconnect the separate sessions. And that proved useful too, allowing her to highlight those lingering moments in the report. How long were the gaps? How many separate visits occurred over a given time period? Did the visitor return to where they left off?
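For readers who want to picture the fallback, here is a minimal Node.js sketch of reconnecting sessions by IP address and measuring the gaps between them. It is not Ivana's code: the `stitchByIp` function, the shape of the session objects, and the sample addresses and times are all hypothetical, and it assumes one visitor per IP address, which shared networks can violate.

```javascript
// Hypothetical session summaries; times are milliseconds from an arbitrary start.
const sessions = [
  { sid: '2344', ip: '203.0.113.7', start: 0, end: 60000 },
  { sid: '2390', ip: '203.0.113.7', start: 960000, end: 1020000 },
  { sid: '2351', ip: '198.51.100.4', start: 30000, end: 45000 },
];

// Group sessions by IP address in chronological order and record the idle
// time between the end of one session and the start of the next.
function stitchByIp(list) {
  const threads = new Map();
  const ordered = [...list].sort((a, b) => a.start - b.start);
  for (const session of ordered) {
    if (!threads.has(session.ip)) {
      threads.set(session.ip, { sessions: [], gaps: [] });
    }
    const thread = threads.get(session.ip);
    const previous = thread.sessions[thread.sessions.length - 1];
    if (previous) thread.gaps.push(session.start - previous.end);
    thread.sessions.push(session);
  }
  return threads;
}

const threads = stitchByIp(sessions);
console.log(threads.get('203.0.113.7').gaps); // [ 900000 ], a 15 minute gap
```

The length of each `sessions` array answers "how many separate visits", and the `gaps` array answers "how long were the gaps".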

With just an afternoon's work, Ivana had the codebase refactored to handle both session identifiers and IP threads. She was eager to show off what she had come up with, so she unplugged her laptop and walked over to Antoní's office.

She hesitated, "Uh, ... got a minute?"

"What's up?" Antoní seemed genuinely interested.

"So I've got a working prototype for that thing we were talking about. You know, HTTP/2 session identifier analytics."

"Right. How did it turn out?"

Ivana began a core dump of all the problems she had to overcome. "Well, first I had to aggregate the log messages for each request/response cycle, because RWSERVE logs one message for the incoming request, and a separate message for the outgoing response. And if we have staging and information configured for the server, then each cycle will have those two messages as well.

"Then I had to parse the log entries into key/value pairs, which wasn't too hard because RWSERVE uses a formal syntax for logging."

She walked over to the whiteboard and scratched out —

request  SID=2344; RR=0; ME=GET; PA=/index.html;
staging SID=2344; RR=0; RA=; UACN=Safari
response SID=2344; RR=0; ST=200; CT=text/html; CL=1467

"Consider these three log messages. They each have a session ID of 2344 and a request/response ID of 0. Clearly they belong together, so I aggregate them into a single data structure. Then I parse the logged items into a map of keys, which in this case would be ME, PA, RF, RA, UACN, ST, CT and CL.

"At first glance it seems a bit obscure, but actually it's easy to read. The codes are abbreviations that are spelled out in the server configuration file: method, path, referer, remote-address, user-agent-common-name, status, content-type and content-length.

"The mapping of HTTP header names to abbreviations is up to us. Also it's not limited to any predefined list of header names, so if we want to use a new response header, say x-tangled-web-services we could map that to TWS and log it too.

"There were other things to work out as well, like stripping off the ANSI color control characters that RWSERVE adds for foreground and background coloring.

"And prettying the text of paths and query-strings by decoding URIs back into Unicode.

"And recognizing internal and external referer values."

Marveling at her enthusiasm, Antoní interjected, "But, again ... how did it turn out?"

"I think I'm getting there." Ivana said. "I've got session ID threads, IP address tracing, stratification and tabulation, ingress and egress."

"Ingress and egress?" Antoní puzzled.

"Where the visitor was referred from, and what page the visitor navigated to," she explained.

"Oh, how about just saying 'coming from' and 'going to'."

"Sure. That's probably better." Her glee was apparent. It was obvious that Antoní approved of the work she had put into it. "So I've been thinking about marketing campaigns and other ways to glean information from the logs. I'd like to keep working on it before sharing it with the team."

"Hmm, be careful not to succumb to the urge to add more features," he advised. "Get it out there, and let others give you feedback on what they need. Adding new features can be endless. It sounds like you've got the kernel in good shape. Now get the bugs worked out before going any further."

"Bugs?" she thought. She had been careful when coding things to handle all of the obvious edge cases. But for now, she kept her thoughts to herself.

As Ivana left she was wavering on whether or not to share what she had so far. "Thanks for the feedback," she mustered with respect, "I'll take one more look at the code before committing it."

Ivana was ambivalent about her results. On the one hand, she had proven the usefulness of HTTP/2 session identifiers in capturing the navigation patterns of her website's visitors. But on the other hand, there was so much more potential to be tapped. She wanted to take her 20% project beyond its roots. She had a headful of ideas about what to do next.

Anyway, Antoní was probably right. It's hard to let things go when you've invested so much creative energy into them. But without actual usage, it's nothing more than lines of code.

See what Ivana has been up to with the Read Write Serve Analytics CLI.

No minifig characters were harmed in the production of this Tangled Web Services episode.
