Discovering Cyclical Patterns of Terms on Twitter

October 31, 2011

[Figure: hourly word count for 'better' with a lowess smoothed line]

Our assignment this week for my R class asked us to take a look at word counts for Twitter terms. Presumably, if you were analyzing tweets to discover trending topics, you would want to look at terms that weren’t just high volume, but that were also showing an accelerating curve of growth. You could do this by looking at the slope of the curve between two points in time and checking whether that slope is increasing by at least some pre-determined rate that you accept as “trending.”
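As a rough illustration of the slope idea, here is a minimal sketch. The counts and the threshold (min.acceleration) are made-up values for the example, not part of the assignment data:

```r
# Hypothetical hourly counts for a single term (not from the real data set)
counts <- c(10, 12, 15, 21, 30, 44)

# Slope between consecutive hours: change in count per hour
slopes <- diff(counts)

# Change in slope from one hour to the next; positive means growth is speeding up
acceleration <- diff(slopes)

# Hypothetical cutoff for what we accept as "trending"
min.acceleration <- 1

# Call the term trending if the slope keeps increasing by at least the cutoff
is.trending <- all(acceleration >= min.acceleration)
is.trending
```

A real detector would probably smooth the counts first, since raw hourly counts are noisy.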

One problem with this process is that we run into terms that are cyclical: their counts rise and fall on a predictable schedule. For example, each morning it might be reasonable to assume that people wake up and tweet “good morning” – which gives a big kick to the number of tweets containing either or both of these terms. But because this happens every day, we’re not really interested in the meaning behind these – they’re just common greetings.

We’d like to eliminate terms like this from what we would consider “trending” and do so in a semi-scientific and reliable way (or at least better than purely guessing). That is what we are doing with this assignment, working with a time series of hourly counts for about 1569 terms over 464 hours.

The first thing we did was to pick a term from the list at random (or one we’re interested in) and see how cyclical it looks. I chose the term ‘better’, plotted its word count over time and laid a lowess smoothed line over the top to get a sense of how the data looks, i.e. does it look cyclical? Note that I’m also including some formatting code for the plot itself (the plot is shown above):

# Plot styling: dark background with muted foreground colors
par(bg = "gray25", fg = "gray60", col.lab = "gray60", col.main = "gray60")
plot_colors <- c("#FF000060","#d200ff60","#FFC60060","#FFFFFF10","#0000FF40","#FFFFFF40")

# Find the column for 'better' and smooth its counts with lowess
idx <- which(names(word.counts) == "better")
smoothed <- lowess(word.counts[,idx], f=0.05)

# Raw hourly counts as points, smoothed trend as a line on top
plot(word.counts[,idx], pch=16, col=plot_colors[6], col.axis=plot_colors[6], bty="n", xlab="hour", cex.axis=0.8, ylab="instances", main="Word Count for 'better'")
lines(smoothed, col=plot_colors[3])

Hmmm, interesting. It dips and rises - but it looks a little too erratic to be called cyclical, in my opinion. To really check all these terms out we could plot each and every one (all 1569) and decide by eye which ones look cyclical - but that's gonna take a ton of time. Instead, we can extract this kind of information from the data using the auto-correlation function (acf). It measures how strongly a series correlates with a copy of itself shifted by a given number of hours (the lag), so a term with a daily cycle should show a correlation peak at lags around 24 hours.

# plot = FALSE keeps acf from drawing a separate plot for every term
acfed.word.counts <- apply(word.counts, 2, acf, plot = FALSE)
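To get a feel for what acf returns, here is a small sketch on synthetic data (the series below are simulated for illustration and have nothing to do with the real word counts). A series with a 24-hour cycle should correlate strongly with itself at a lag of 24, while pure noise should not:

```r
set.seed(1)
hours <- 1:464

# Simulated term with a clean 24-hour cycle plus a little noise
cyclic <- 100 + 50 * sin(2 * pi * hours / 24) + rnorm(length(hours), sd = 5)

# Simulated term with no cycle at all
noise <- rnorm(length(hours), mean = 100, sd = 5)

cyclic.acf <- acf(cyclic, lag.max = 48, plot = FALSE)
noise.acf  <- acf(noise,  lag.max = 48, plot = FALSE)

# The acf component starts at lag 0, so lag k lives at index k + 1
cyclic.acf$acf[24 + 1]  # strongly positive: one full cycle later
cyclic.acf$acf[12 + 1]  # strongly negative: half a cycle later
noise.acf$acf[24 + 1]   # close to zero
```

Real terms are messier than a sine wave, which is why the raw acf readings still need to be boiled down into a single score.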

Then we use a function to take the acf readings for a term and return a numerical score for how cyclical it is (Jake wrote the following function so I'm not going to pretend to walk you through it):

cyclic.measure <- function(x, weight=1.0) {
  # Pull out the acf component and drop the first value (it's always 1)
  data <- x$acf[2:length(x$acf)]

  # Restrict ourselves to only spikes above 0.2 or below -0.1
  data <- data[data < -0.1 | data > 0.2]

  # Sign returns -1 or 1, so this next step just identifies positive and negative values
  signs <- sign(data)

  # Diffs takes the difference of signs, i.e. when there's a jump from positive to negative
  diffs <- diff(signs)

  # Find out where those breaks are
  idx <- which(diffs != 0)

  # There might not be any, in which case just return 0 (this isn't cyclic)
  if (length(idx) == 0) {
    return(0)
  }

  # Tell R there are implicit breaks at the beginning and end of the data
  idx <- c(0, idx, length(data))

  # Take the differences to get the lengths of sequences
  lengths <- diff(idx)

  # Create a label for each piece of data based on which section it's in
  seqs <- rep(1:length(lengths), lengths)

  # Find the maximum value within each positive / negative sequence
  maxes <- tapply(data, seqs, max)

  # Return a weighted sum of the average peak size and the number of times
  # we flipped across the positive / negative line (weight = 0.9 - 1.0 seems
  # to work well)
  weight*mean(abs(maxes)) + (1-weight)*(length(lengths) / length(data))
}
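To see what the middle of that function is doing, here is a sketch that runs the same steps on a tiny made-up vector (the five values are invented, standing in for acf readings that already passed the threshold filter):

```r
# Hypothetical filtered acf values: two positive, two negative, one positive
data <- c(0.5, 0.4, -0.3, -0.2, 0.6)

signs <- sign(data)              # 1  1 -1 -1  1
diffs <- diff(signs)             # 0 -2  0  2
idx <- which(diffs != 0)         # sign flips after positions 2 and 4

# Add the implicit breaks at the start and end, then get the run lengths
idx <- c(0, idx, length(data))   # 0 2 4 5
lengths <- diff(idx)             # 2 2 1

# Label each value with the run it belongs to
seqs <- rep(seq_along(lengths), lengths)  # 1 1 2 2 3

# Largest value within each run
maxes <- tapply(data, seqs, max)

# With the default weight of 1.0 the score is just the mean absolute peak
score <- mean(abs(maxes))
```

One thing this makes visible: for a negative run, max picks the value closest to zero (-0.2 here, not -0.3), so negative peaks contribute a little less to the score than positive ones.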

The code to set it in action using sapply (or lapply):

cyc.measure.word.counts <- sapply(acfed.word.counts,cyclic.measure)

When I take a look at the resultant list, these are the top ten most cyclical terms:

  1. twittascope
  2. daily
  3. today
  4. morning
  5. love
  6. eu
  7. know
  8. please
  9. tonight
  10. live
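For reference, pulling a ranking like this out of a named vector of scores can be sketched as follows. The scores here are invented placeholders standing in for cyc.measure.word.counts, not the real results:

```r
# Hypothetical scores (made-up numbers) keyed by term
scores <- c(attack = 0.05, morning = 0.61, better = 0.22, daily = 0.74)

# Sort from most to least cyclical and keep the term names
ranked <- sort(scores, decreasing = TRUE)
top.terms <- head(names(ranked), 10)
top.terms
```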

I plotted a couple of these just to check them out, and you can indeed see a nice cyclical graph - like a sine wave. The only thing that sticks out is that 'morning' appears, to my eye, to be more cyclical than 'today' - despite the rankings. But numbers don't lie, do they?

[Figures: word count plots for 'twittascope', 'daily', 'today' and 'morning']

Conversely, if we look at a term with a really poor cyclical score - in this case 'attack' - we can see that there isn't really anything resembling a pattern here:

[Figure: word count plot for 'attack']
