Google Analytics runs on over 56% of all websites. It's the backbone of ad-tech across the web. Unfortunately, for site owners like me who just want to learn how people are using their website while respecting their privacy, there simply aren't any alternatives that meet all my requirements. So in two days, after a couple of dead ends, I built my own using React, AWS Lambda, and a spreadsheet. This is how.
I wanted to set this up quickly, so I decided on the most easily accessible tools for my current go-to stack. You could swap out any of these layers for your preferred tech. If you're expecting higher usage, you'll almost certainly want a more robust data store.
Here's how it works:
Working backwards, I start by defining what data I care about tracking. This is actually one of the biggest ancillary benefits of rolling your own analytics: I get to fit my analytics perfectly to my application. How much of GA do you really understand and use anyway?
My initial use case is a common one. I'm building featherbubble.com, an interactive children's story. The site has a landing page and a waitlist conversion page. I'll soon add a pilot episode people can read through to preview. So, a site with a few pages and a conversion.
Track sessions, not people, mostly events. All I want to know is referrer and some session metadata like language, timezone, and device. And then I want to log what happened using events and summarize that in the session table.
I create a new Google Sheet (sheet.new) with two sheets: Sessions and Events. Put the column headers in the first row. A logged session looks like this:
Again, you can customize these fields to your heart's content. Instead of shoehorning all your data into the complicated event mappings of Category/Action/Label/Value fields (or Segment's analytics.js properties field), you can simply create a new column with what you want to track.
I initially tried to use Google's Sheets API. This turned out to be a dead end: even with an API key, you can read data from a spreadsheet, but "anonymous edits aren't currently supported." Writing requires an OAuth token.
Fortunately, there's an even easier way (without needing to walk through the complexity of OAuth configuration): Google Apps Script. From your spreadsheet menu, Tools > Script Editor, you can publish a script that runs operations on your spreadsheet.
I got the idea from here, and modified the script to accept a payload shaped like this:
{
  // Session headers...
  sessionId: '1234',
  device: 'iPad',
  ...,
  Events: [{
    // Event headers...
    sessionId: '1234',
    event: '/',
    ...,
  }]
}
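The script's main job is turning each incoming record into a sheet row. The real Apps Script reads the header row via SpreadsheetApp and appends rows with appendRow; here's a hedged sketch of just the mapping step in plain JavaScript (recordToRow is my name for it, not something from the gist):

```javascript
// Map an incoming JSON record onto a row array in the sheet's header order,
// so each object key lands under its matching column.
// Missing fields become empty strings to keep columns aligned.
function recordToRow(headers, record) {
  return headers.map(h => (record[h] !== undefined ? record[h] : ''))
}

// e.g. the Sessions sheet header row and one incoming session
const headers = ['sessionId', 'device', 'language']
const row = recordToRow(headers, { sessionId: '1234', device: 'iPad' })
console.log(row) // → [ '1234', 'iPad', '' ]
```

Because the mapping is driven by the header row, adding a new column to the sheet is all it takes to start logging a new field.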
Here's the full gist. Copy that into the script editor, make your changes, and save it. To publish, go to Publish > Deploy as web app... in the menu.
A few important gotchas about publishing:
- Set the app's access to "Anyone, even anonymous", or the endpoint will reject unauthenticated requests.
- Each time you change the script, deploy it again as a "New" project version; otherwise the published URL keeps serving the old code.
- Run the Setup function once. From the menu, Run function > Setup. You only ever need to do this once.

Now you've got an API endpoint you can use to store analytics data in your spreadsheet.
Getting the data in is only half the battle. I want to make my analytics easily consumable, which for me means a daily summary in my inbox. Because my spreadsheet-fu is not very good, instead of figuring out pivot tables I decided to write a script that aggregates the session data each day, adds it to a third "Analytics" sheet, and emails me a report.
Instead of running a daily cron job, whenever a new session is saved, I check to see if the previous day's sessions have been aggregated. If not, I total them up and email myself a report.
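The aggregation itself is simple: group sessions by calendar day and total the columns you care about. Here's a minimal sketch in plain JavaScript (the gist's version runs inside Apps Script and also sends the email; the field names below are my assumptions):

```javascript
// Group session rows by day and compute a daily summary.
// Each session here is { startedAt (ms epoch), waitlist (bool), pages (number) }.
function dailyTotals(sessions) {
  const days = {}
  for (const s of sessions) {
    const day = new Date(s.startedAt).toISOString().slice(0, 10) // 'YYYY-MM-DD'
    const d = days[day] || (days[day] = { day, sessions: 0, waitlist: 0, pages: 0 })
    d.sessions += 1
    if (s.waitlist) d.waitlist += 1
    d.pages += s.pages
  }
  return Object.values(days)
}

const totals = dailyTotals([
  { startedAt: Date.UTC(2019, 2, 13, 9), waitlist: true, pages: 3 },
  { startedAt: Date.UTC(2019, 2, 13, 17), waitlist: false, pages: 1 },
])
console.log(totals)
// → [ { day: '2019-03-13', sessions: 2, waitlist: 1, pages: 4 } ]
```

Each summary object becomes one row in the Analytics sheet.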
The dailyTotals function is included in the gist above.
With the daily summaries in a spreadsheet, I can easily turn that into a chart and view by week, month, etc.
Serverless lambda functions are perfect for an analytics API. Logging doesn't require a response. This makes the cold start issue with lambda functions negligible. Of course, there's no reason you couldn't use a regular server endpoint, especially if you've already got an API. But for a static site, the free and easy setup with Netlify along with the potential to scale effortlessly makes it an obvious choice.
Using Netlify's free tier, AWS Lambda functions are easy to set up. I won't go into the details, since there are plenty of resources out there already. Here's the important part: Netlify serves each function from example.com/.netlify/functions/my-function, and voilà, your analytics API calls go to your own domain. Here's my lambda function code:
// necessary for async/await
import 'regenerator-runtime/runtime'
// fetch implementation for node
import fetch from 'node-fetch'
// custom helper functions
import genId from './helpers/genId'
import getDevice from './helpers/getDevice'
import isBot from './helpers/isBot'

// Store your Sheets url in your environment vars
const { GOOGLE_SHEETS_URL } = process.env

export const handler = async event => {
  const data = JSON.parse((event || {}).body)
  // exit if data is not formatted correctly
  if (!data || !data.events) throw new Error('Invalid data')
  // skip if it looks like a bot
  if (isBot(data.userAgent)) return
  let waitlist
  let pages = 0
  // sort the events chronologically
  const events = data.events.sort((a, b) => a.timestamp - b.timestamp)
  // summarize in a separate pass (a sort comparator can visit the same
  // element more than once, which would skew the counts)
  for (const e of events) {
    // check if the conversion happened
    if (~e.event.indexOf('joined waitlist')) waitlist = true
    // count the PAGES
    if (e.label === 'PAGE') pages += 1
  }
  // values to post
  const Session = {
    // session: get id based off the first event's timestamp
    sessionId: genId(events[0].timestamp),
    startedAt: events[0].timestamp,
    ref: data.ref,
    // parse userAgent for device
    device: getDevice(data.userAgent),
    userAgent: data.userAgent,
    language: data.language,
    timezone: data.timezone,
    latency: data.latency,
    pageLoad: data.pageLoad,
    // session length: difference between first and last event
    length: events[events.length - 1].timestamp - events[0].timestamp,
    waitlist: waitlist || null,
    pages: pages,
  }
  // events
  Session.Events = events.map((e, i) => ({
    sessionId: Session.sessionId,
    timestamp: e.timestamp,
    event: e.event,
    label: e.label || null,
    // time to the next event measures length; assumes the next event is a user action
    length: i < events.length - 1 ? events[i + 1].timestamp - e.timestamp : 0,
  }))
  // Post to Google Sheets endpoint
  try {
    const response = await fetch(GOOGLE_SHEETS_URL, {
      body: JSON.stringify(Session),
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
      },
    })
    return {
      statusCode: 200,
      body: JSON.stringify({ response }),
    }
  } catch (error) {
    console.log(error) // output to netlify function log
    return {
      statusCode: 500,
      body: JSON.stringify({ msg: error.message }),
    }
  }
}
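The three imported helpers are small enough to sketch. These are my hypothetical implementations, not the ones from the post's gist:

```javascript
// Generate a session id from the first event's timestamp plus some randomness
const genId = timestamp =>
  timestamp.toString(36) + '-' + Math.random().toString(36).slice(2, 8)

// Very rough device detection from the user agent string
const getDevice = (userAgent = '') => {
  if (/iPad/.test(userAgent)) return 'iPad'
  if (/iPhone/.test(userAgent)) return 'iPhone'
  if (/Android/.test(userAgent)) return 'Android'
  return 'Desktop'
}

// Skip sessions from the most common crawlers
const isBot = (userAgent = '') => /bot|crawl|spider|slurp/i.test(userAgent)

console.log(getDevice('Mozilla/5.0 (iPad; CPU OS 12_1)')) // → iPad
console.log(isBot('Googlebot/2.1 (+http://www.google.com/bot.html)')) // → true
```

User-agent sniffing is inherently fuzzy; for a hobby site this level of precision is plenty, and you can always log the raw userAgent column as a fallback, as the handler above does.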
Now that I've got a serverless endpoint, I just need to log a visitor's session and events from the client. There are a couple major differences with how I'm doing things vs. a standard analytics implementation.
I make a single call at the end of a session, posting the entire session log as JSON to my lambda function. How do I know when a session ends? Unfortunately, there's no magic bullet for universal coverage across all browsers and all cases. Instead, I'm listening for several exit events and gracefully degrading based on what's happening in the client's browser. While I haven't done a side-by-side test with a standard implementation, I'd estimate I'm covering around 95% of visitor sessions.
I'm sending the data with navigator.sendBeacon if it's available, which posts the session data in the background without waiting for a response. As a fallback, I have several levels of degradation that get called based on the browser.
Here's the simplified version:
// listen for all the exit events
window.addEventListener('pagehide', endSession)
window.addEventListener('beforeunload', endSession)
window.addEventListener('unload', endSession)
// for iOS, also catch when focus leaves the tab
// (assumes an `iOS` boolean detected elsewhere from the user agent)
if (iOS) window.addEventListener('blur', endSession)

let skip
// call this function on exit (declared with `function` so it hoists
// above the listeners registered earlier)
function endSession() {
  // skip if the function has already been called
  if (skip) return
  skip = true
  // I also add an "end session" event to the data here
  const data = SESSION_DATA
  const url = FUNCTION_URL
  const { vendor } = window.navigator
  // https://bugs.webkit.org/show_bug.cgi?id=188329
  // Safari bug is fixed but not yet released. When that happens, will need to check the Safari version too
  if (window.navigator.sendBeacon && !~vendor.indexOf('Apple')) {
    // try to send the beacon
    const beacon = window.navigator.sendBeacon(url, data)
    if (beacon) return
    // if it failed to queue (some adblockers will block all beacons), fall through to the other way
  }
  // Instead, send an async request
  // Except for iOS :(
  const async = !iOS
  const request = new XMLHttpRequest()
  request.open('POST', url, async) // 'false' makes the request synchronous
  request.setRequestHeader('Content-Type', 'application/json')
  request.send(data)
  // Synchronous requests cause a slight delay in UX as the browser waits for the response
  // I've found it more performant to do an async call and use the following hack to keep the loop open while waiting
  // Chrome doesn't care about waiting
  if (!async || ~vendor.indexOf('Google')) return
  // Latency calculated from navigator.performance
  const latency = data.latency || 0
  const t = Date.now() + Math.max(300, latency + 200)
  while (Date.now() < t) {
    // postpone the JS loop for ~300ms so that the request can complete
    // a hack necessary for the Firefox and Safari refresh / back button
  }
}
The benefit to this strategy is a lightweight solution, both for the client and the API—which is subject to usage rates. You could certainly implement a more standard approach to log each interaction as it happens, but it will likely cost much more at any kind of scale.
With React, it's easy to make a few reusable components for all my analytics. I use react-router, and render my main Analytics component with every route: <Route component={Analytics} />.
This component has three main functions. On mount, I log the initial PAGE event (more about event logging in the next section), save some session and page performance metadata in the component's state, and register my endSession event listeners. Then, on each route change, I log a new PAGE event:
// inside componentDidUpdate(prevProps)
if (location.pathname !== prevProps.location.pathname) {
  event(location.pathname, { label: 'PAGE' })
}
The event logging component is a tad more complicated, as the event() function needs to be accessible anywhere in the code I want to log an event from. To accomplish this, I'm using React Context as a "global" store for both my array of events and the push function.
I create an Events component for the context provider that stores the events in local state, and passes into Context.Provider a function that adds a new event to the array.
// function to log an event
event(name, properties = {}) {
  const { events } = this.state
  const event = {
    event: name,
    timestamp: new Date().getTime(),
    ...properties,
  }
  // append without mutating the existing state array
  this.setState({ events: [...events, event] })
  // log to console in DEVELOPMENT
  if (dev) {
    const { label, ...rest } = properties
    console.log(label + ': ' + name, JSON.stringify(rest))
  }
}
I can then create a Context.Consumer anywhere in my code to access the context. I try to avoid render functions as a pattern, so I turn my context consumer into a helpful higher-order component:
// Turn Context.Consumer into an HOC
const Event = Component => {
  const setContext = props => (
    <SetContext.Consumer>
      {context => <Component {...props} event={context} />}
    </SetContext.Consumer>
  )
  return setContext
}

// Elsewhere... an easy way to access the event function as a prop
import Event from '..'

const MyComponent = ({ event }) => {...}

export default Event(MyComponent)
You could probably do this even more easily now with Hooks.
The final step is to grab the array of events from context inside the endSession() function and post it as part of the session data.
Because I only make one request per session, my lambda costs will never balloon. With this setup on AWS Lambda, reddit.com's monthly traffic (1.5 billion sessions) would cost about $300 in serverless fees!
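That figure follows straight from Lambda's request pricing, roughly $0.20 per million invocations at the time of writing (compute time adds a little more, but a fire-and-forget logger finishes in milliseconds):

```javascript
// Back-of-the-envelope request cost for one Lambda call per session
const PRICE_PER_MILLION_REQUESTS = 0.2 // USD, request pricing
const sessionsPerMonth = 1.5e9 // roughly reddit.com-scale traffic
const monthlyCost = Math.round((sessionsPerMonth / 1e6) * PRICE_PER_MILLION_REQUESTS)
console.log(monthlyCost) // → 300 (USD per month, requests only)
```

Logging every interaction as its own request would multiply that by the average number of events per session, which is exactly why batching the whole session into one call pays off at scale.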
This post hit the front page of Hacker News in 2019. Here's what people had to say: https://news.ycombinator.com/item?id=19388489
Thanks for reading! I'm a product engineer, writing about how to be a human in a computer world. This site is my digital garden. Explore, enjoy. My mailbox is always open.