
Hi Jacob,

I'm not sure I fully understood your problem, so just toss my comment in the trash bin if I misunderstood ;-)

As I understand it, you are trying to find patterns within a large dataset: you want to figure out what's "good" and what's "crap" in thousands of text messages. This is almost the definition of what machine-learning algorithms are for, so maybe you should look into those. From your description it sounds as if you need some "unsupervised learning" algorithm for classification, and since you most likely have a ton of unique words it might be relevant to run a PCA first to reduce the dimensionality. You will most likely have to play around with a few algorithms to find the one that works best for your dataset...

Also, you should maybe look into the Swiftriver part of Ushahidi (http://swift.ushahidi.com/), since that project is aiming to do almost exactly the same thing, i.e. finding "signals" in large Twitter streams (and other streams). In particular the SiLCC part of it (http://swift.ushahidi.com/extend/silcc/). Generally they let the Python Natural Language Toolkit (http://www.nltk.org/) do the actual "work". From looking at the code the other day I learned that they use a supervised learning algorithm (a naive Bayes classifier) instead of an unsupervised one, and I'm not sure why(?)

Hope this helps.

One final question: why did you choose to do it as a desktop application?

Regards,
webMike

Jacob Ayienda wrote:
I usually wouldn't post this without uploading it first, as most of you probably know, but I'd like to give everyone a heads up about the project, and if anyone has a similar project maybe you can tell me where I'm going wrong.
The application is a native desktop app primarily designed to determine 3 things:
- Trending topics in Nairobi...what Nairobians are tweeting
- Most active Nairobi users
- Most mentioned / retweeted users from Nairobi
Given the limitation of the Twitter API of only going back 7 days, and taking into account that some users are on a data plan, the user determines how far back to go, i.e. 24 hrs to 7 days only. The algorithm used is vertical, checking 1 tweet at a time as they are downloaded.
Initially I had designed the application to check trends of up to 24 hrs, but that became a big problem for these reasons:
- The number of tweets, minus the usual spammers, averages around 5000 per day. On weekends, esp. Sunday, it is less than 2000.
- Out of the 5000, unique tweets are less than 3000 if you remove retweets and crappy spam.
- Of the remaining 3000 tweets, the majority are crappy updates similar to Facebook updates...they seriously don't make any sense to a regex algorithm, so eliminating them leaves roughly 1500 unique tweets...also with links.
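The filtering steps in the list above could be sketched roughly as follows. This is only an illustration and not the application's actual code: the function names, the "RT" prefix test for retweets, and the choice to ignore links when comparing tweets are all assumptions.

```python
import re

# Assumption: a retweet is any tweet whose text starts with "RT".
RETWEET_RE = re.compile(r"^\s*RT\b", re.IGNORECASE)
URL_RE = re.compile(r"https?://\S+")

def normalize(text):
    """Lowercase, drop links and collapse whitespace so that
    near-identical tweets compare equal."""
    text = URL_RE.sub("", text.lower())
    return " ".join(text.split())

def unique_tweets(tweets):
    """Return the tweets that are neither retweets nor repeats,
    keeping the first occurrence of each and the original order."""
    seen = set()
    kept = []
    for text in tweets:
        if RETWEET_RE.match(text):
            continue  # skip retweets
        key = normalize(text)
        if not key or key in seen:
            continue  # skip empty or already-seen text
        seen.add(key)
        kept.append(text)
    return kept
```

Comparing on the normalized text means the same message posted twice with different shortened links counts as one tweet; whether that is the right call depends on how the spam in the dataset actually looks.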
Looking at the 24 hr trending topics based on my algorithm, the topics that have made the most sense out of the top 50 during testing are:
1. #nowplaying and nowplaying -- also trending globally almost indefinitely
2. Corruption -- referenced mainly news articles
3. Constitution -- referenced mainly news articles
4. Raila -- referenced mainly news articles
5. ManU, also Arsenal, esp. during matches
6. Tweeple -- Kenyans must really love this word
.
.
some of these include silly Twitter games like #FF, #FT ... blah blah blah
.
7. Arunga -- no longer trending though; referenced mainly news articles and retweets
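A ranking like the one above can be produced by counting terms across tweets. The sketch below is one hypothetical way to do it, not the algorithm the application uses: the stopword list, the minimum term length, and the folding of "#nowplaying" and "nowplaying" into one term are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: links and @mentions are not topics, so they are removed first.
NOISE_RE = re.compile(r"https?://\S+|@\w+")
TOKEN_RE = re.compile(r"#?\w+")
# A tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "with", "this", "that", "you", "are"}

def trending(tweets, top_n=50):
    """Return the top_n (term, tweet_count) pairs.
    Each term is counted at most once per tweet."""
    counts = Counter()
    for text in tweets:
        cleaned = NOISE_RE.sub(" ", text.lower())
        terms = set()
        for token in TOKEN_RE.findall(cleaned):
            term = token.lstrip("#")  # "#nowplaying" and "nowplaying" merge
            if len(term) > 2 and term not in STOPWORDS:
                terms.add(term)
        counts.update(terms)
    return counts.most_common(top_n)
```

Counting each term once per tweet keeps a single spammy tweet that repeats a word many times from pushing that word up the list.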
On the most active users, the top of the list is mainly spammers and it's very difficult to filter them out...still working on it.
The most mentioned / retweeted users list is the only thing that seems to work correctly, but I don't want to publish the users lest I be accused of trying to push traffic to them.
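For anyone wanting to try the two user rankings, here is a minimal sketch. It assumes each tweet is available as a (username, text) pair, which is a made-up shape for this example, and it treats a retweet ("RT @user: ...") as just another mention of that user.

```python
import re
from collections import Counter

MENTION_RE = re.compile(r"@(\w+)")

def user_stats(tweets):
    """tweets is an iterable of (username, text) pairs.
    Returns two Counters: tweets sent per user, and tweets
    mentioning (or retweeting) each user."""
    active = Counter()
    mentioned = Counter()
    for user, text in tweets:
        author = user.lower()
        active[author] += 1
        # Count each mentioned user once per tweet and ignore self-mentions.
        names = {name.lower() for name in MENTION_RE.findall(text)}
        for name in names:
            if name != author:
                mentioned[name] += 1
    return active, mentioned
```

Calling `active.most_common(10)` or `mentioned.most_common(10)` on the results gives the top of each list. This does nothing about the spammer problem on the most active list; that still needs a separate filter.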
I'd appreciate it if someone built a web frontend, esp. now that there is a general idea...my web programming sucks big time, so don't expect any help from me.
I'll upload the application as soon as it is stable and I have fixed some of the key issues. Before that happens, a word of advice to Twitter users in Nairobi:
- Let the crappy FB updates stay on Facebook, or just send an SMS; it makes more sense.
- To the spammers...you suck.
- To the rest...it is important to note that the more unique tweets are sent, the easier it is to come up with sensible trending topics...so tweet everything...what you're watching, listening to, where you hang out, websites and links...
Website Link http://sites.google.com/site/usbdiskintegritychecker/