The Infovore's Dilemma

Observations and Musings

Accessing and Analyzing My Personal Data

As a final project for Yale Law Tech, I set out to discover how much of my digital data footprint I could access, and whether I could find anything interesting in the noise. I looked at the two primary web services I use - Twitter and Facebook - as well as all of my phone geolocation data.

Twitter

Late last year Twitter added the ability for users to download their archive. When you do, you get both a set of JSON files and a CSV. The CSV is really easy to read into R, and also very easy to use with Excel. In R, with a little timestamp conversion, I was able to plot a set of visuals that take my almost two thousand tweets and try to bring sense to the noise.
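The plots in this post were made in R, but the timestamp step can be sketched in a few lines of Python for anyone who wants to try something similar. This is only a minimal sketch: the column name `timestamp`, the timestamp format, and the three sample rows are assumptions made up for illustration, so check them against your own archive before relying on it.

```python
import csv
import io
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]

# Made-up sample standing in for the archive's CSV. The column names and
# the timestamp format are assumptions, not taken from a real archive.
SAMPLE_CSV = """tweet_id,timestamp,text
1,2013-04-01 16:05:00 +0000,Watching a TED talk
2,2013-04-01 16:40:00 +0000,Awesome lecture at Yale
3,2013-04-04 02:15:00 +0000,Still awake
"""

def tweets_per_weekday_hour(csv_file, column="timestamp",
                            fmt="%Y-%m-%d %H:%M:%S %z"):
    """Count tweets in each (weekday name, hour) bucket."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        moment = datetime.strptime(row[column], fmt)
        counts[(WEEKDAYS[moment.weekday()], moment.hour)] += 1
    return counts

counts = tweets_per_weekday_hour(io.StringIO(SAMPLE_CSV))
for (weekday, hour), n in sorted(counts.items()):
    print(f"{weekday} {hour:02d}:00 -> {n} tweet(s)")
```

The resulting weekday-by-hour counts are the kind of table a heatmap can be drawn from. Note that the sketch keeps whatever time zone the timestamps are recorded in; converting to local time first would be needed for the hours to reflect an actual daily routine.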

The fall after I started using Twitter, I tweeted way too much. Thankfully, that activity has diminished over time. What this doesn’t show, though, is how much I use Twitter as a source of news and information. I would really like Twitter to offer access to data on all my session times. This would be a fascinating set of information that would let me discover both my reading and procrastination habits, and potentially give me clear data backing up intuitions I have about my productivity (or lack thereof).

I clearly need to get a little more sleep! And 4pm Monday and Thursday are clearly not my most productive hours of the week.

You can probably guess that I go to Yale, have a fascination for TED talks, and generally like ‘awesome’ things.

Facebook

Facebook allows you to download an expanded archive of your data. However, while it is a good amount of data, it is tiny in comparison to the true size of my digital footprint on Facebook. When you download the archive, you get a set of HTML files that you can load locally, basically giving you a little static Facebook with your personal information. The most interesting piece of information they make accessible is a subset of your account activity that includes session timestamps for the past few months. They provided me with sessions back to February 10th, 2013. It takes a little work to strip out all the HTML and get just the timestamps. It can be done, though, and the code is below. Once that is done, as with the Twitter data, we can try to visualize some of my habits.
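My own code for this is in R, but as a rough illustration of the stripping step, here is a minimal Python sketch. The HTML snippet, its markup, and the timestamp format are all hypothetical stand-ins, since the real archive pages may look quite different; the regular expression would need adjusting to whatever format your archive actually uses.

```python
import re
from datetime import datetime
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect only the text between tags, discarding the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Hypothetical snippet: the real archive's markup and timestamp format
# are not known here and are assumed purely for illustration.
SAMPLE_HTML = """<html><body><ul>
<li><span class="meta">Session updated</span> 2013-02-10 22:05:31</li>
<li><span class="meta">Session updated</span> 2013-02-17 14:30:02</li>
</ul></body></html>"""

TIMESTAMP_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")

def extract_timestamps(html_text):
    """Strip the HTML, then parse every timestamp found in the text."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    text = " ".join(parser.chunks)
    return [datetime.strptime(match, "%Y-%m-%d %H:%M:%S")
            for match in TIMESTAMP_PATTERN.findall(text)]

sessions = extract_timestamps(SAMPLE_HTML)
for session in sessions:
    print(session.isoformat())
```

Once the timestamps are plain datetime values, they can be bucketed by weekday and hour in the same way as the tweets.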

The Facebook heatmap is made with less data than the Twitter one, but offers a useful and different perspective. The Twitter data reflects my public actions, whereas the Facebook session timestamps reflect my private viewing habits. Sunday is clearly a terribly unproductive day. Monday evening around 10pm I also start to lose concentration. Good habits to know! It would be much better if I could map this over a longer period, with much more data, so that I could get a more accurate representation of my habits.

Open Paths

I downloaded Open Paths after watching Jer Thorp’s TED Talk “Making Data More Human” over a year ago. The app has been running in the background, tracking my location, since March 8th, 2012. With over a year of data now, I thought I would finally turn the matrix of latitude/longitude coordinates into a map.

If you put the Open Paths data against some of the Facebook data, you could quickly figure out that the northeast corridor trips are to visit a special someone in New York. Add the tweets that mention Yale, and it’s not hard to figure out how I spend quite a bit of my time - commuting back and forth.

Final Notes

Twitter definitely wins my vote for most accessible personal data. However, I am glad that Facebook gives me session times, and I would really like those for Twitter. In the future, I want to figure out how to get timestamps for all my emails, since email is a private and constant method of communication that I think will highlight lots of my personal habits. I also think it would be fascinating to map out my banking transactions over time, since all of my credit transactions are recorded and easily exportable from Bank of America. On that note, I leave you with a link to Stephen Wolfram’s post on personal analytics that inspired a lot of this.

Here is all the R code I wrote to generate the visuals

The US Economy in 15 Graphs

… and a Dilbert cartoon:

Appendix

Obesity in the USA

Notes:

  • The motivations for this project were:
    1. To learn D3.
    2. To learn how to create a D3 visualization in an Octopress blog post.
    3. To improve upon the current CDC graphic for visualizing this data.
  • The metric being used is the % of people per state that are classified as obese (BMI 30.0+).
  • The CDC has a set of PowerPoint slides on their website that loop through each year and color code states by obesity level. I felt that D3 would offer a chance to improve on the graphic both by making it more interactive and by using a linear color gradient instead of color jumps at 5% thresholds.
  • All the data was collected using this CDC tool, which provides the percentage of obese individuals per state since 1995. I downloaded individual CSV files for each year and then combined them into one CSV for 1995-2010. I then used this tool to convert the CSV into JSON properties that I could access using D3.
  • If you are interested in how to create D3 graphics in an Octopress blog post, I have posted the raw code for this post in a gist here. It isn’t the cleanest, but it works, and if you have feedback please drop me a comment!
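
The data preparation described in the notes above (one CSV per year, combined and converted to JSON) was done by hand and with an online converter. As a minimal sketch of how the same steps might be scripted in Python: the column names `State` and `Percent` are assumptions, and the numbers are made-up placeholders, not CDC figures.

```python
import csv
import io
import json

# Hypothetical per-year exports. Column names are assumed, and the values
# are placeholder numbers for illustration only, not real CDC data.
YEARLY_CSVS = {
    1995: "State,Percent\nAlabama,10.0\nColorado,20.0\n",
    1996: "State,Percent\nAlabama,11.0\nColorado,21.0\n",
}

def combine_years(csv_by_year):
    """Merge per-year CSV text into {state: {year: percent}}."""
    combined = {}
    for year in sorted(csv_by_year):
        reader = csv.DictReader(io.StringIO(csv_by_year[year]))
        for row in reader:
            by_year = combined.setdefault(row["State"], {})
            # JSON object keys are strings, so store the year as text.
            by_year[str(year)] = float(row["Percent"])
    return combined

combined = combine_years(YEARLY_CSVS)
print(json.dumps(combined, indent=2, sort_keys=True))
```

Keying the JSON by state and then by year is just one possible layout; it makes it easy to look up a single state's value for the selected year when coloring a map.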