Google Analytics runs on over 56% of all websites. It's the backbone of ad-tech across the web. Unfortunately, for site owners like me who just want to learn how people are using their website while respecting their privacy, there simply aren't any alternatives that meet all my requirements. So in two days, after a couple of dead ends, I built my own using React, AWS Lambda, and a spreadsheet. This is how.
I wanted to set this up quickly, so I decided on the most easily accessible tools for my current go-to stack. You could swap out any of these layers for your preferred tech. If you're expecting higher usage, you'll almost certainly want a more robust data store.
Here's how it works:
Working backwards, I start by defining what data I care about tracking. This is actually one of the biggest ancillary benefits of rolling your own analytics: I get to fit my analytics perfectly to my application. How much of GA do you really understand and use anyway?
My initial use case is a common one. I'm building featherbubble.com, an interactive children's story. The site has a landing page and a waitlist conversion page. I'll soon add a pilot episode people can read through to preview. So, a site with a few pages and a conversion.
Track sessions, not people, mostly events. All I want to know is referrer and some session metadata like language, timezone, and device. And then I want to log what happened using events and summarize that in the session table.
I create a new Google Sheet (sheet.new) with two sheets: Sessions and Events. Put the column headers in the first row. A logged session looks like this:
Again, you can customize these fields to your heart's content. Instead of shoehorning all your data into the complicated event mappings of Category/Action/Label/Value fields (or Segment's analytics.js properties field), you can simply create a new column with what you want to track.
I initially tried to use Google's Sheets API. This turned out to be a dead end: even with an API key, you can read data from a spreadsheet, but "anonymous edits aren't currently supported." Writing requires an OAuth token.
Fortunately, there's an even easier way (without needing to walk through the complexity of OAuth configuration): Google Apps Script. From your spreadsheet menu, Tools > Script Editor, you can publish a script that runs operations on your spreadsheet.
I got the idea from here, and modified the script to accept a payload shaped like this:
{
  // Session headers...
  sessionId: '1234',
  device: 'iPad',
  ...,
  Events: [{
    // Event headers...
    sessionId: '1234',
    event: '/',
    ...,
  }]
}
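The script's main job is turning each incoming record into a sheet row. The real Apps Script reads the header row via SpreadsheetApp and appends rows with appendRow; here's a hedged sketch of just the mapping step in plain JavaScript (recordToRow is my name for it, not something from the gist):

```javascript
// Map an incoming JSON record onto a row array in the sheet's header order,
// so each object key lands under its matching column.
// Missing fields become empty strings to keep columns aligned.
function recordToRow(headers, record) {
  return headers.map(h => (record[h] !== undefined ? record[h] : ''))
}

// e.g. the Sessions sheet header row and one incoming session
const headers = ['sessionId', 'device', 'language']
const row = recordToRow(headers, { sessionId: '1234', device: 'iPad' })
console.log(row) // → [ '1234', 'iPad', '' ]
```

Because the mapping is driven by the header row, adding a new column to the sheet is all it takes to start logging a new field.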
Here's the full gist. Copy that into the script editor, make your changes, and save it. To publish, go to Publish > Deploy as web app... in the menu.
A few important gotchas about publishing:
- Set the app's access to "Anyone, even anonymous", or the endpoint will reject unauthenticated requests.
- Each time you change the script, deploy it again as a "New" project version; otherwise the published URL keeps serving the old code.
- Run the Setup function once. From the menu, Run function > Setup. You only ever need to do this once.

Now you've got an API endpoint you can use to store analytics data in your spreadsheet.
Getting the data in is only half the battle. I want to make my analytics easily consumable, which for me means a daily summary in my inbox. Because my spreadsheet-fu is not very good, instead of figuring out pivot tables I decided to write a script that aggregates the session data each day, adds it to a third "Analytics" sheet, and emails me a report.
Instead of running a daily cron job, whenever a new session is saved, I check to see if the previous day's sessions have been aggregated. If not, I total them up and email myself a report.
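The aggregation itself is simple: group sessions by calendar day and total the columns you care about. Here's a minimal sketch in plain JavaScript (the gist's version runs inside Apps Script and also sends the email; the field names below are my assumptions):

```javascript
// Group session rows by day and compute a daily summary.
// Each session here is { startedAt (ms epoch), waitlist (bool), pages (number) }.
function dailyTotals(sessions) {
  const days = {}
  for (const s of sessions) {
    const day = new Date(s.startedAt).toISOString().slice(0, 10) // 'YYYY-MM-DD'
    const d = days[day] || (days[day] = { day, sessions: 0, waitlist: 0, pages: 0 })
    d.sessions += 1
    if (s.waitlist) d.waitlist += 1
    d.pages += s.pages
  }
  return Object.values(days)
}

const totals = dailyTotals([
  { startedAt: Date.UTC(2019, 2, 13, 9), waitlist: true, pages: 3 },
  { startedAt: Date.UTC(2019, 2, 13, 17), waitlist: false, pages: 1 },
])
console.log(totals)
// → [ { day: '2019-03-13', sessions: 2, waitlist: 1, pages: 4 } ]
```

Each summary object becomes one row in the Analytics sheet.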
The dailyTotals function is included in the gist above.
With the daily summaries in a spreadsheet, I can easily turn that into a chart and view by week, month, etc.
Serverless lambda functions are perfect for an analytics API. Logging doesn't require a response. This makes the cold start issue with lambda functions negligible. Of course, there's no reason you couldn't use a regular server endpoint, especially if you've already got an API. But for a static site, the free and easy setup with Netlify along with the potential to scale effortlessly makes it an obvious choice.
Using Netlify's free tier, AWS Lambda functions are easy to set up. I won't go into the details, since there are plenty of resources out there already. Here's the important part: Netlify serves each function from example.com/.netlify/functions/my-function, and voilà, your analytics API calls go to your own domain. Here's my lambda function code:
// necessary for async/await
import 'regenerator-runtime/runtime'
// fetch implementation for node
import fetch from 'node-fetch'
// custom helper functions
import genId from './helpers/genId'
import getDevice from './helpers/getDevice'
import isBot from './helpers/isBot'

// Store your Sheets url in your environment vars
const { GOOGLE_SHEETS_URL } = process.env

export const handler = async event => {
  const data = JSON.parse((event || {}).body)
  // exit if data is not formatted correctly
  if (!data || !data.events) throw new Error('Invalid data')
  // skip if it looks like a bot
  if (isBot(data.userAgent)) return
  let waitlist
  let pages = 0
  // sort the events chronologically
  const events = data.events.sort((a, b) => a.timestamp - b.timestamp)
  // summarize in a separate pass (a sort comparator can visit the same
  // element more than once, which would skew the counts)
  for (const e of events) {
    // check if the conversion happened
    if (~e.event.indexOf('joined waitlist')) waitlist = true
    // count the PAGES
    if (e.label === 'PAGE') pages += 1
  }
  // values to post
  const Session = {
    // session: get id based off the first event's timestamp
    sessionId: genId(events[0].timestamp),
    startedAt: events[0].timestamp,
    ref: data.ref,
    // parse userAgent for device
    device: getDevice(data.userAgent),
    userAgent: data.userAgent,
    language: data.language,
    timezone: data.timezone,
    latency: data.latency,
    pageLoad: data.pageLoad,
    // session length: difference between first and last event
    length: events[events.length - 1].timestamp - events[0].timestamp,
    waitlist: waitlist || null,
    pages: pages,
  }
  // events
  Session.Events = events.map((e, i) => ({
    sessionId: Session.sessionId,
    timestamp: e.timestamp,
    event: e.event,
    label: e.label || null,
    // time to the next event measures length; assumes the next event is a user action
    length: i < events.length - 1 ? events[i + 1].timestamp - e.timestamp : 0,
  }))
  // Post to Google Sheets endpoint
  try {
    const response = await fetch(GOOGLE_SHEETS_URL, {
      body: JSON.stringify(Session),
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
    })
    return {
      statusCode: 200,
      body: JSON.stringify({ response }),
    }
  } catch (error) {
    console.log(error) // output to netlify function log
    return {
      statusCode: 500,
      body: JSON.stringify({ msg: error.message }),
    }
  }
}
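The three imported helpers are small enough to sketch. These are my hypothetical implementations, not the ones from the post's gist:

```javascript
// Generate a session id from the first event's timestamp plus some randomness
const genId = timestamp =>
  timestamp.toString(36) + '-' + Math.random().toString(36).slice(2, 8)

// Very rough device detection from the user agent string
const getDevice = (userAgent = '') => {
  if (/iPad/.test(userAgent)) return 'iPad'
  if (/iPhone/.test(userAgent)) return 'iPhone'
  if (/Android/.test(userAgent)) return 'Android'
  return 'Desktop'
}

// Skip sessions from the most common crawlers
const isBot = (userAgent = '') => /bot|crawl|spider|slurp/i.test(userAgent)

console.log(getDevice('Mozilla/5.0 (iPad; CPU OS 12_1)')) // → iPad
console.log(isBot('Googlebot/2.1 (+http://www.google.com/bot.html)')) // → true
```

User-agent sniffing is inherently fuzzy; for a hobby site this level of precision is plenty, and you can always log the raw userAgent column as a fallback, as the handler above does.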
Now that I've got a serverless endpoint, I just need to log a visitor's session and events from the client. There are a couple major differences with how I'm doing things vs. a standard analytics implementation.
I make a single call at the end of a session, posting the entire session log as JSON to my lambda function. How do I know when a session ends? Unfortunately, there's no magic bullet for universal coverage across all browsers and all cases. Instead, I'm listening for several exit events and gracefully degrading based on what's happening in the client's browser. While I haven't done a side-by-side test with a standard implementation, I'd estimate I'm covering around 95% of visitor sessions.
I'm sending the data with navigator.sendBeacon if it's available, which posts the session data in the background without waiting for a response. As a fallback, I have several levels of degradation that get called based on the browser.
Here's the simplified version:
// listen for all the exit events
window.addEventListener('pagehide', endSession)
window.addEventListener('beforeunload', endSession)
window.addEventListener('unload', endSession)
// for iOS, also catch when focus leaves the tab
// (assumes an `iOS` boolean detected elsewhere from the user agent)
if (iOS) window.addEventListener('blur', endSession)

let skip
// call this function on exit (declared with `function` so it hoists
// above the listeners registered earlier)
function endSession() {
  // skip if the function has already been called
  if (skip) return
  skip = true
  // I also add an "end session" event to the data here
  const data = SESSION_DATA
  const url = FUNCTION_URL
  const { vendor } = window.navigator
  // https://bugs.webkit.org/show_bug.cgi?id=188329
  // Safari bug is fixed but not yet released. When that happens, will need to check the Safari version too
  if (window.navigator.sendBeacon && !~vendor.indexOf('Apple')) {
    // try to send the beacon
    const beacon = window.navigator.sendBeacon(url, data)
    if (beacon) return
    // if it failed to queue (some adblockers will block all beacons), fall through to the other way
  }
  // Instead, send an async request
  // Except for iOS :(
  const async = !iOS
  const request = new XMLHttpRequest()
  request.open('POST', url, async) // 'false' makes the request synchronous
  request.setRequestHeader('Content-Type', 'application/json')
  request.send(data)
  // Synchronous requests cause a slight delay in UX as the browser waits for the response
  // I've found it more performant to do an async call and use the following hack to keep the loop open while waiting
  // Chrome doesn't care about waiting
  if (!async || ~vendor.indexOf('Google')) return
  // Latency calculated from navigator.performance
  const latency = data.latency || 0
  const t = Date.now() + Math.max(300, latency + 200)
  while (Date.now() < t) {
    // postpone the JS loop for ~300ms so that the request can complete
    // a hack necessary for the Firefox and Safari refresh / back button
  }
}
The benefit to this strategy is a lightweight solution, both for the client and the API—which is subject to usage rates. You could certainly implement a more standard approach to log each interaction as it happens, but it will likely cost much more at any kind of scale.
With React, it's easy to make a few reusable components for all my analytics. I use react-router, and render my main Analytics component with every route: <Route component={Analytics} />.
This component has three main functions. On mount, I log the initial PAGE event (more about event logging in the next section), save some session and page performance metadata in the component's state, and register my endSession event listeners. Then, on each route change, I log a new PAGE event:
// inside componentDidUpdate(prevProps)
if (location.pathname !== prevProps.location.pathname) {
  event(location.pathname, { label: 'PAGE' })
}
The event logging component is a tad more complicated, as the event() function needs to be accessible anywhere in the code I want to log an event from. To accomplish this, I'm using React Context as a "global" store for both my array of events and the push function.
I create an Events component for the context provider that stores the events in local state, and passes into Context.Provider a function that adds a new event to the array.
// function to log an event
event(name, properties = {}) {
  const { events } = this.state
  const event = {
    event: name,
    timestamp: new Date().getTime(),
    ...properties,
  }
  // append without mutating the existing state array
  this.setState({ events: [...events, event] })
  // log to console in DEVELOPMENT
  if (dev) {
    const { label, ...rest } = properties
    console.log(label + ': ' + name, JSON.stringify(rest))
  }
}
I can then create a Context.Consumer anywhere in my code to access the context. I try to avoid render functions as a pattern, so I turn my context consumer into a helpful higher-order component:
// Turn Context.Consumer into an HOC
const Event = Component => {
  const setContext = props => (
    <SetContext.Consumer>
      {context => <Component {...props} event={context} />}
    </SetContext.Consumer>
  )
  return setContext
}

// Elsewhere... an easy way to access the event function as a prop
import Event from '..'

const MyComponent = ({ event }) => {...}

export default Event(MyComponent)
You could probably do this even more easily now with Hooks.
The final step is to grab the array of events from context inside the endSession() function and post it as part of the session data.
Because I only make one request per session, my lambda costs will never balloon. With this setup on AWS Lambda, reddit.com's monthly traffic (1.5 billion sessions) would cost about $300 in serverless fees!
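That figure follows straight from Lambda's request pricing, roughly $0.20 per million invocations at the time of writing (compute time adds a little more, but a fire-and-forget logger finishes in milliseconds):

```javascript
// Back-of-the-envelope request cost for one Lambda call per session
const PRICE_PER_MILLION_REQUESTS = 0.2 // USD, request pricing
const sessionsPerMonth = 1.5e9 // roughly reddit.com-scale traffic
const monthlyCost = Math.round((sessionsPerMonth / 1e6) * PRICE_PER_MILLION_REQUESTS)
console.log(monthlyCost) // → 300 (USD per month, requests only)
```

Logging every interaction as its own request would multiply that by the average number of events per session, which is exactly why batching the whole session into one call pays off at scale.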
This post hit the front page of Hacker News in 2019. Here's what people had to say: https://news.ycombinator.com/item?id=19388489
Thanks for reading! I'm a product engineer, writing about how to be a human in a computer world. This site is my digital garden. Explore, enjoy. My mailbox is always open.