hey ppl,

I have a Twitter sub-project I'm working on and I'm wondering if anyone is interested in collaborating. Here's a brief description:

1) collect data from twitter's public stream
2) extract basic meta-data from the JSON dump from 1)
3) categorize tweets according to language (not necessarily the same as location)
4) categorize tweets according to content (this is the hardest part, and most interesting to me)

I need help with 1) and 2). I have a Ruby script (got it online) that is collecting data from a bunch of people's tweets. The data I'm collecting right now isn't much, just over 1MB/minute. The script is based on the excellent Ruby Twitter library (tweetstream), but I'd like it to be made much more robust.
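To give an idea of what I mean by "more robust": the main thing is that the collector should reconnect by itself when the stream drops, waiting a bit longer after each failure. Here's a rough sketch in Python (not the actual Ruby script). The names run_with_backoff and connect are made up for illustration, and the retry counts and delays are just guesses, not anything Twitter specifies.

```python
import time


def run_with_backoff(connect, max_retries=5, base_delay=1.0,
                     max_delay=60.0, sleep=time.sleep):
    """Call connect() until it returns normally.

    connect is a hypothetical function that opens the stream and
    consumes it; it is assumed to raise ConnectionError or
    TimeoutError when the connection drops.
    """
    delay = base_delay
    for attempt in range(1, max_retries + 1):
        try:
            return connect()
        except (ConnectionError, TimeoutError) as err:
            if attempt == max_retries:
                raise  # give up after the last allowed attempt
            print(f"attempt {attempt} failed ({err}); retrying in {delay:.0f}s")
            sleep(delay)
            # Double the wait each time, up to a ceiling.
            delay = min(delay * 2, max_delay)
```

The sleep function is passed in only so the loop can be tested without actually waiting.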

The output of 1) is a JSON dump, and I also need help parsing it (this can be in any language that has good JSON parsers). The idea is to extract the date/time, the sender of the tweet, whether it's a retweet, which links are in the tweet, which other Twitter usernames are mentioned, which hashtags are referenced, and so on. Just very basic meta-data.
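Here's a minimal sketch of the extraction in Python, to make the idea concrete. I'm assuming one JSON object per line in the dump, and field names as I understand them from the Twitter API (created_at, user.screen_name, retweeted_status, and an entities block with urls, user_mentions and hashtags). If the dump looks different, the keys will need adjusting. The function name extract_metadata is just something I made up.

```python
import json
from datetime import datetime

# Assumed timestamp layout, e.g. "Wed Aug 27 13:08:45 +0000 2008".
CREATED_AT_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def extract_metadata(line):
    """Return basic meta-data for one line of the dump, or None.

    Blank lines and records without tweet text (for example
    delete notices) are skipped by returning None.
    """
    if not line.strip():
        return None
    tweet = json.loads(line)
    if "text" not in tweet:
        return None
    entities = tweet.get("entities", {})
    return {
        "created_at": datetime.strptime(tweet["created_at"], CREATED_AT_FORMAT),
        "sender": tweet["user"]["screen_name"],
        # Assumption: retweets carry the original under "retweeted_status".
        "is_retweet": "retweeted_status" in tweet,
        "links": [u.get("expanded_url") or u["url"]
                  for u in entities.get("urls", [])],
        "mentions": [m["screen_name"]
                     for m in entities.get("user_mentions", [])],
        "hashtags": [h["text"] for h in entities.get("hashtags", [])],
    }
```

Running this over the dump is then just a loop over the lines of the file, keeping whatever isn't None.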

Presently I have some code that can do 3). But the code gets confused sometimes since each tweet is really small and sometimes includes multiple languages in it. For instance, I'd like to be able to identify a Kenyan (regardless of where they're physically located) solely by their tweets. It's a lot harder than it appears, but I think it's doable.
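To illustrate the mixed-language problem (this is NOT my actual code, just a toy sketch in Python): instead of forcing one label per tweet, you can score each tweet as a mix of languages. The marker word lists below are tiny and made up for illustration; a real version would need much bigger lists or character n-gram models. The names MARKERS and language_mix are hypothetical.

```python
import re
from collections import Counter

# Toy marker words per language (illustrative only, far from complete).
MARKERS = {
    "en": {"the", "is", "and", "to", "of", "you", "this"},
    "sw": {"na", "ni", "ya", "kwa", "sana", "leo", "hii"},
}


def language_mix(text):
    """Return the share of marker-word hits per language in a tweet.

    An empty dict means no marker word was found at all.
    """
    # Drop links, @mentions and #hashtags; they say little about language.
    cleaned = re.sub(r"https?://\S+|[@#]\w+", " ", text.lower())
    words = re.findall(r"[a-z']+", cleaned)
    hits = Counter()
    for word in words:
        for lang, markers in MARKERS.items():
            if word in markers:
                hits[lang] += 1
    total = sum(hits.values())
    if total == 0:
        return {}
    return {lang: count / total for lang, count in hits.items()}
```

A tweet that mixes Swahili and English then comes out as a split between "sw" and "en" rather than a single (possibly wrong) label, which is closer to what's needed for the "identify a Kenyan by their tweets" idea.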

Ultimately I'd like to be able to automatically cluster tweets according to content, but that is a long-term project that I may not be able to achieve. But it's worth a try I think.

This will all be open-source, so there is no direct monetary compensation. I'm doing it to learn concepts that I think will be extremely useful. I'd love to answer any other questions you have about this.

Interested? Holla.

saidi

PS Don't sell yourself short and assume you can't do it. You're probably more skilled than you allow yourself to believe, or at the very least you can quickly learn where that is necessary.

PPS As you may have guessed, this isn't really about twitter. Twitter just happens to be a great source of huge amounts of freely available data. And it also happens to contain a wide spectrum of people/topics/places, making the info contained within extremely valuable.