Saturday, June 29, 2013

Trawling Social Media Part 2: Flickr

My adventures in Instagram came to a sudden and tragic halt once I encountered the bugginess of Instagram's media/search function. Other people have been documenting a decline in the number of responses they get from this API call for a few months now (see, for example, here), and at this point I get nothing but Error Code 400 in response to every request. Instagram's locations/search function does work correctly, but only returns 33 Colorado Springs-geocoded images over an eight-day period encompassing the evacuation of more than 20,000 people from that city. Disappointing!
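
For reference, here's a rough sketch of the locations/search call described above, made with plain curl in PHP. The Colorado Springs coordinates, the 'distance' parameter name and units, and the placeholder access token are my own assumptions, so check them against Instagram's documentation before relying on this.

<?php
// Sketch of an Instagram locations/search request (parameter names are
// my reading of the v1 API; ACCESS_TOKEN is a placeholder).
$params = array(
    'lat'          => 38.8339,    // rough center of Colorado Springs
    'lng'          => -104.8214,
    'distance'     => 5000,       // meters (assumed name and units)
    'access_token' => 'ACCESS_TOKEN',
);
$url = 'https://api.instagram.com/v1/locations/search?' . http_build_query($params);

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$response = curl_exec($ch);
$httpCode = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);

$data = json_decode($response, true);
echo "HTTP " . $httpCode . ", " . count($data['data']) . " locations returned\n";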

Bereft of data, I've been investigating Flickr today. I found this tutorial helpful in getting started. However, for roughly the same period of time I'm seeing only 96 images shot within a 5 km radius of the center of Colorado Springs. I've posted the bulk of the code here, in case anyone else wants to give it a try.
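
In case it's useful, here is roughly what my flickr.photos.search query looks like. The API key, the date window, and the choice of extras are placeholders and assumptions on my part; the full script linked above has the real details.

<?php
// Sketch of a flickr.photos.search query for geotagged photos around
// Colorado Springs ('YOUR_API_KEY' and the dates are placeholders).
$params = array(
    'method'         => 'flickr.photos.search',
    'api_key'        => 'YOUR_API_KEY',
    'lat'            => 38.8339,        // center of Colorado Springs
    'lon'            => -104.8214,
    'radius'         => 5,              // kilometers
    'min_taken_date' => 'YYYY-MM-DD',   // start of the eight-day window
    'max_taken_date' => 'YYYY-MM-DD',   // end of the window
    'extras'         => 'geo,date_taken,url_m',
    'per_page'       => 250,
    'format'         => 'json',
    'nojsoncallback' => 1,
);
$url  = 'https://api.flickr.com/services/rest/?' . http_build_query($params);
$json = json_decode(file_get_contents($url), true);

echo $json['photos']['total'] . " photos found\n";
foreach ($json['photos']['photo'] as $photo) {
    echo $photo['id'] . "\t" . $photo['datetaken'] . "\n";
}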

A sample of the images drawn from my Flickr data pull
If anyone has had different experiences with this, I'd be thrilled to hear about it, but I'm willing to tentatively say that if you're going back more than a few months and trying to extract data, you should expect to encounter a lot of bad behavior from these APIs. That makes a lot of sense given these companies' priorities, but the takeaway stands: caveat emptor.

Wednesday, June 26, 2013

Trawling Social Media

Lately, I've been digging into using the APIs of various social media platforms as tools to help explore the spread of information, sentiment, and so forth, specifically focusing on Twitter and Instagram. It's been interesting and sometimes challenging, as I'm going over old tricks (PHP) and learning new ones (authorization and so forth). There have been a number of resources I've found helpful, and I thought I'd post something about my experiences here as a guide to others who are also just getting started with using social media APIs for research.

OAuth

To access these APIs, you first need to authenticate. Both Twitter and Instagram use the OAuth protocol to provide users with access to their data. This involves having an account, registering an application, and generating and providing access codes in the appropriate places. I found the following resources very helpful in understanding and interacting with OAuth:

  • 140 Dev Twitter OAuth Programming - a tutorial on using OAuth in the context of Twitter applications. You have to sign up as a member to get access to the text, but I highly recommend it.
  • tmhOAuth - An OAuth library used in the 140 Dev tutorial, which with minimal modification can be used to access Instagram data as well (specifically, by changing 'api.twitter.com' on line 40 to 'api.instagram.com'). A sketch of how the access codes get wired into it follows this list.
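
To give a sense of what "providing access codes in the appropriate places" means in practice, here's a minimal sketch of handing the four tokens from your registered application to tmhOAuth; the placeholder values stand in for your own credentials.

<?php
// Minimal tmhOAuth setup: the four values come from the application
// you registered (the names below are placeholders, not real keys).
require 'tmhOAuth.php';

$tmhOAuth = new tmhOAuth(array(
    'consumer_key'    => 'YOUR_CONSUMER_KEY',
    'consumer_secret' => 'YOUR_CONSUMER_SECRET',
    'user_token'      => 'YOUR_ACCESS_TOKEN',
    'user_secret'     => 'YOUR_ACCESS_TOKEN_SECRET',
));
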
The App Itself

I've been using PHP to do my data extraction, but other sites discuss using JavaScript, etc., to do something similar. If you're comfortable with the command line, using PHP is incredibly simple and I highly recommend it. I've included a simple script which, if you edit it to include your target username, will pull the last 100 tweets from that user. To use it, drop the tmhOAuth.php and cacert.pem files into a directory along with a copy of your application tokens, put a simple PHP script in that same directory, and type

commandline> php myScript.php > outputfile.txt

and voila! You're done - at least until you run into 429 errors, a.k.a. rate limiting (more on that below).
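
Before getting to that: for concreteness, here is roughly what such a script might look like. It follows the tmhOAuth usage from the 140 Dev tutorial as I understand it; the placeholder tokens and username need to be replaced with your own, and the sample files linked at the end of this post are the authoritative version.

<?php
// Rough sketch of myScript.php: pull the last 100 tweets from one user
// and dump them to standard output. Assumes tmhOAuth.php and cacert.pem
// sit alongside this file; all credentials below are placeholders.
require 'tmhOAuth.php';

$tmhOAuth = new tmhOAuth(array(
    'consumer_key'    => 'YOUR_CONSUMER_KEY',
    'consumer_secret' => 'YOUR_CONSUMER_SECRET',
    'user_token'      => 'YOUR_ACCESS_TOKEN',
    'user_secret'     => 'YOUR_ACCESS_TOKEN_SECRET',
));

// REST 1.1 user_timeline request; 'TARGET_USERNAME' is the account to pull.
$code = $tmhOAuth->request('GET',
    $tmhOAuth->url('1.1/statuses/user_timeline'),
    array('screen_name' => 'TARGET_USERNAME', 'count' => 100));

if ($code == 200) {
    $tweets = json_decode($tmhOAuth->response['response'], true);
    foreach ($tweets as $tweet) {
        echo $tweet['created_at'] . "\t" . $tweet['text'] . "\n";
    }
} else {
    echo "Request failed with HTTP code " . $code . "\n";
}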

Rate Limiting

Rate limiting will probably make you want to tear out your hair. Twitter and Instagram each place limits on how many times you can query the API within a given window, and Twitter even breaks rate limiting down by the kind of query you send - for example, under the current REST 1.1 API policies, a user can submit only 15 "GET lists" requests in a 15-minute period, compared with 180 "GET statuses/user_timeline" requests in the same window. So be aware of this, and factor it into your applications.
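
One crude but effective way to cope is to wrap your calls in a retry loop that backs off when it sees a 429. The helper below is purely illustrative (the function and its parameters are mine, not part of tmhOAuth), and a smarter version would read the rate-limit headers so it sleeps only as long as necessary.

<?php
// Illustrative rate-limit handling: if a request comes back 429, wait
// out the 15-minute window and try again (up to $maxRetries times).
function rateLimitedGet($tmhOAuth, $url, $params, $maxRetries = 3) {
    $code = $tmhOAuth->request('GET', $url, $params);
    for ($attempt = 0; $attempt < $maxRetries && $code == 429; $attempt++) {
        echo "Rate limited; sleeping for 15 minutes...\n";
        sleep(15 * 60);      // coarsest possible back-off
        $code = $tmhOAuth->request('GET', $url, $params);
    }
    return $code;            // 200 on success, otherwise an error code
}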

Anyway, hope this is helpful to someone!

Sample PHP files

Wednesday, April 17, 2013

AAG 2013

I had the opportunity to attend and present some of my work at AAG 2013 this past weekend. I had a great time and met a lot of wonderful researchers. Thanks, everyone!

An animated wordcloud of Colorado wildfire-related tweets, normalized in size relative to the search term "fire"

My presentation was about simulating the wildfires that swept through Colorado last summer. I thought I'd post my slides here to give a sense of what I've been up to these past few weeks. You can also check out the interactive mapping of tweets I show in the presentation here!

Wednesday, March 27, 2013

The Deluge: Twitter and Hurricane Sandy

Information is often valuable, but it's crucial in crisis situations. As a broad, basic first attempt at harvesting social media to gauge a population's mood, my group looked at the use of language on Twitter as Hurricane Sandy bore down on the east coast of America in late 2012. What I show here is a basic visualization of tweets by time, location, and valence (here calculated by a hilariously rough positivity/negativity measure).

The tweets were scraped by team member Jacek Radzikowski (who is also available on Twitter here), focusing on the terms "Sandy", "hurricane", and "frankenstorm". The scraped tweets consist of much more than a 140-character string and a timestamp: if the user has included a description of their location in their profile or enabled Twitter to use their phone's GPS, the tweet can contain some very specific locational information. For reasons that will become apparent below, we wanted to try to gain access to the user-provided location information rather than relying exclusively on GPS-derived geotagged tweets.

In this post, I distinguish between what I call "geotagged" tweets (tweets with associated coordinates) and "geocoded" tweets (tweets with locational information that was run through the Yahoo PlaceFinder to produce a set of coordinates).

We collected 1,060,915 tweets in all, of which 692,376 have some geographic designator and 10,237 have GPS-derived coordinates. For those of us who understand better in scientific notation, that's
  • 1.1x10^6 total
  • 6.9x10^5 with some info (~65%)
  • 1.0x10^4 with coordinates (~1%)
(...so our motivation for investigating the geocoder is pretty obvious, right?)

To derive some basic measure of how positively people were talking about their impending doom, we used a hand-coded set of valenced words (AFINN, available here). For each tweet, the text is stripped of punctuation, converted to lower case, and broken into individual words. The tweet's valence is set equal to the sum of the valence values of each component word found in the AFINN wordset, normalized by the number of valence-having words in the tweet. In the following images, the tweets are colored by this normalized quantity, with more positive tweets in green and more negative tweets in red.
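
A minimal sketch of that per-tweet calculation might look something like the following; the file path and function names are mine, and this isn't the exact code we ran, but the steps are the same.

<?php
// Sketch of the valence calculation described above, using the AFINN
// word list (one "word<TAB>score" pair per line). The filename and
// function names here are placeholders, not the code we actually used.
function loadAfinn($path = 'AFINN-111.txt') {
    $valences = array();
    foreach (file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        list($word, $score) = explode("\t", $line);
        $valences[$word] = (int) $score;
    }
    return $valences;
}

function tweetValence($text, $valences) {
    // Strip punctuation, convert to lower case, and split into words.
    $clean = strtolower(preg_replace('/[^a-z\s]/i', ' ', $text));
    $words = preg_split('/\s+/', trim($clean));

    // Sum the valences of the scored words, then normalize by the
    // number of valence-having words in the tweet.
    $total  = 0;
    $scored = 0;
    foreach ($words as $word) {
        if (isset($valences[$word])) {
            $total += $valences[$word];
            $scored++;
        }
    }
    return ($scored > 0) ? $total / $scored : 0.0;
}

// Example usage:
$valences = loadAfinn();
echo tweetValence("Frankenstorm is coming, this is a disaster", $valences) . "\n";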

A composite image of Sandy-related geotagged tweets in the New York area
A composite 50% sample of Sandy-related geocoded tweets in the New York area
 
A composite image of Sandy-related geotagged tweets in the USA
A composite 25% sample of Sandy-related geocoded tweets in the USA

Finally, the following video displays the valence and location of tweets as they were updated over the course of the storm's approach and impact.


 This work is still very much in its infancy, but we found it very interesting and hope to do more with it in the future. Stay tuned!

Thursday, March 7, 2013

Sustainable Agriculture through Modeling

An acequia-irrigated field in Santa Fe, NM
 
A map of land use around Taos, NM
Irrigating agricultural land has been a major concern throughout history. When Spanish explorers began to settle in what would become New Mexico, they brought with them a system of communal irrigation management that they had in turn learned from the Moors. This system, consisting of both the physical network of ditches and the social structure associated with their maintenance and use, persists to this day under the name "acequias". As water management emerges as an increasingly serious issue in the US Southwest, how sustainable are traditional, acequia-dependent forms of agriculture?

To investigate, Andrew Crooks and I developed a spatially explicit agent-based model of the area around the town of Taos, New Mexico. The model was coded in Java using the MASON simulation toolkit and its GIS add-on, GeoMASON. The focus of the model was the farmer agents, who made choices about whether to participate in traditional agriculture (maintaining the acequia ditches and harvesting crops) or to sell their land (resulting in it permanently transitioning to residential use). Farming agents were also influenced by both the physical environment and social factors, including the selling habits of their neighbors and their own personal valuation of the traditional lifestyle. We tracked the overall conversion of land from farmland to non-agricultural use.

The simulation's interface

A paper documenting our findings was published in Computers, Environment and Urban Systems and is available here, and a video with a bit more detail about the setup and execution of the simulation is included below. A very attractive locally-produced '90s video with more information about acequias is available here!


A special thanks to Michael Cox and John Paul Gonzales for making this project possible!

Full Reference:
Wise, S. and Crooks, A. T. (2012), Agent Based Modelling and GIS for Community Resource Management: Acequia-based Agriculture, Computers, Environment and Urban Systems, doi:10.1016/j.compenvurbsys.2012.08.004