
I usually wouldn't post this without uploading it first, as most of you probably know, but I'd like to give everyone a heads up about the project, and if anyone has a similar project maybe you can tell me where I'm going wrong.
The application is a native desktop app primarily designed to determine 3 things:
- Trending topics in Nairobi...what Nairobians are tweeting
- Most active Nairobi users
- Most mentioned / retweeted users from Nairobi
Given the limitation of the Twitter API of only going back 7 days, and taking into account that some users are on a data plan, the user determines how far back to go (24 hrs to 7 days only). The algorithm is vertical, checking 1 tweet at a time as the tweets are downloaded.
Initially I had designed the application to check trends of up to 24 hrs, but that became a big problem for these reasons:
- The number of tweets, minus the usual spammers, averages around 5000 per day. On weekends, especially Sunday, it is less than 2000.
- Out of the 5000, fewer than 3000 are unique once you remove retweets and crappy spam.
- Of the remaining 3000 tweets, the majority are crappy updates similar to Facebook updates...they seriously don't make any sense to a regex algorithm, so eliminating them leaves roughly 1500 unique tweets...also with links.
Looking at the 24 hr trending topics based on my algorithm, the ones that made the most sense out of the top 50 during testing have been:
1. #nowplaying and nowplaying -- also trending globally almost indefinitely
2. Corruption -- referenced mainly in news articles
3. Constitution -- referenced mainly in news articles
4. Raila -- referenced mainly in news articles
5. ManU, also Arsenal, especially during matches
6. Tweeple -- Kenyans must really love this word. Some of these include silly Twitter games like #FF, #FT ... blah blah blah.
7. Arunga -- no longer trending though. Referenced mainly in news articles and retweets
On the most active users, the top of the list consists mainly of spammers and it's very difficult to filter them out...still working on it.
The most mentioned / retweeted users list is the only thing that seems to work correctly, but I don't want to publish the users lest I be accused of trying to push traffic to them.
I'd appreciate it if someone could build a web frontend, especially now that you have a general idea...my web programming sucks big time so don't expect any help from me.
I'll upload the application as soon as it is stable and I have fixed some of the key issues. Before that happens, a word of advice to Twitter users in Nairobi:
- Let the crappy FB updates stay on Facebook, or just send an SMS, it makes more sense.
- To the spammers...you suck
- To the rest...it is important to note that the more unique tweets are sent, the easier it is to come up with sensible trending topics...so tweet everything...what you're watching, listening to, where you hang out, websites and links...
Website Link http://sites.google.com/site/usbdiskintegritychecker/
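
For anyone trying something similar, here is a minimal sketch of the one-tweet-at-a-time counting described above. It is not the application's actual code. The `TrendTracker` class, the (user, text) input shape and the three regular expressions are assumptions made for illustration, and only hashtags are counted as topics.

```python
import re
from collections import Counter

# Hypothetical patterns, not taken from the application.
HASHTAG_RE = re.compile(r"#\w+")
MENTION_RE = re.compile(r"@(\w+)")
RETWEET_RE = re.compile(r"^RT\b", re.IGNORECASE)


class TrendTracker:
    """Updates all three counts as each tweet arrives."""

    def __init__(self):
        self.topics = Counter()        # hashtag -> number of unique tweets
        self.active_users = Counter()  # user -> number of tweets sent
        self.mentions = Counter()      # user -> times mentioned / retweeted
        self.seen_texts = set()        # normalized texts already counted

    def add(self, user, text):
        """Process one tweet; return True if it counted toward topics."""
        self.active_users[user.lower()] += 1
        for name in MENTION_RE.findall(text):
            self.mentions[name.lower()] += 1

        # Retweets and exact repeats still count as activity and mentions,
        # but are skipped for topics so they do not inflate a trend.
        normalized = " ".join(text.lower().split())
        if RETWEET_RE.match(text) or normalized in self.seen_texts:
            return False
        self.seen_texts.add(normalized)

        # Count each hashtag once per tweet.
        for tag in set(HASHTAG_RE.findall(normalized)):
            self.topics[tag] += 1
        return True
```

Calling `tracker.add(user, text)` for every downloaded tweet and then reading `tracker.topics.most_common(50)` would give a top-50 list. Spam filtering is left out entirely.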

Hi Jacob,
I'm not sure I fully understood your problem, so just toss my comment in the trash bin if I misunderstood ;-)
As I understand it, you are trying to find patterns within a large dataset: you want to figure out what's "good" and what's "crap" in thousands of text messages. This is almost the definition of what machine-learning algorithms do, so maybe you should look into those. From your description it sounds as if you need an "unsupervised learning" algorithm for classification, and since you most likely have a ton of unique words it might be relevant to perform a PCA to reduce the dimensionality first. You will most likely have to play around with a few algorithms to find the one that works best for your dataset...
You should maybe also look into the Swiftriver part of Ushahidi (http://swift.ushahidi.com/), since that project is aiming to do almost exactly the same thing, i.e. finding "signals" in large Twitter streams (and other streams). In particular, look at the SiLCC part of it (http://swift.ushahidi.com/extend/silcc/). Generally they let the Python Natural Language Toolkit (http://www.nltk.org/) do the actual "work". From looking at the code the other day I learned that they use a supervised learning algorithm (a naive Bayesian classifier) instead of an unsupervised one, and I'm not sure why(?)
Hope this helps.
One final question: why did you choose to do it as a desktop application?
Regards
webMike
Jacob Ayienda wrote:
I usually wouldn't post this without uploading it first, as most of you probably know, but I'd like to give everyone a heads up about the project, and if anyone has a similar project maybe you can tell me where I'm going wrong.
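
To make the supervised option mentioned above a bit more concrete, here is a rough sketch of a naive Bayes text classifier written in plain Python. It is not SiLCC's or NLTK's code. The `NaiveBayes` class, the `tokenize` helper and the "signal" / "noise" labels are hypothetical, and add-one smoothing is simply one common choice.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[#@]?\w+")


def tokenize(text):
    """Lowercase the text and split it into words, hashtags and mentions."""
    return TOKEN_RE.findall(text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> training examples
        self.vocab = set()

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        """Return the label with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)  # prior
            total_words = sum(self.word_counts[label].values())
            denominator = total_words + len(self.vocab)
            for word in words:
                count = self.word_counts[label][word] + 1  # smoothing
                score += math.log(count / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

The catch with any supervised approach is that someone has to label a few hundred tweets by hand as "signal" or "noise" before `classify` becomes useful.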

@pedersen thanks for the info, particularly on SiLCC. It appears to be really advanced. It seems I have to scrap the entire application based on this new information. Using the SiLCC API seems like the more sensible option right now. I'll go through it and see what comes next.
You asked why a desktop application? Actually, I designed the application for an MSc student researching the impact of social networks in Kenya, and it only displayed the number and frequency of tweets sent mainly from Nairobi. Afterwards I decided to throw in a regular expression parser and continued adding more features.

Jacob Ayienda wrote:
It seems I have to scrap the entire application based on this new information.
Would be a shame if you did :-( There are many different ways to achieve what you are trying to do; machine learning is just one of them. The way I read your first mail, you already had something that sort of worked? Anyway, take it as an (excellent) opportunity to learn something new. Personally I'm working hard on improving my skills in machine learning, and in that regard I can highly recommend Stanford's course by Andrew Ng, available on YouTube: http://www.youtube.com/results?search_query=stanford+andrew+machine+learning
Actually, I designed the application for an MSc student researching the impact of social networks in Kenya
Interesting - has the research been completed? And do you know if there is a report available somewhere?
..
webmike

Interesting - has the research been completed? And do you know if there is a report available somewhere?
I doubt it. Most research work in Kenya never gets publicly published. I haven't talked to the guy for some time so I really don't know what happened.
As for my project, I've been checking out SiLCC and it's quite impressive, particularly the "Predictive Tagging" part, which is exactly what I'm trying to do, albeit crudely. It also makes more sense now to build the application as a web app to make it easier for many people to access it.
I'm checking out PHP at the moment to see if it will be possible to rewrite the whole thing for the web.

Hi Jacob,
"a web app to make it easier for many people to access it" - XML or JSON would be perfect for fetching the data.
I wanted to develop something similar, but to find out who is going on safari in Kenya (like that recent popular web project, I forgot the name, "robme"??, which tracked people who were not at home using foursquare and Twitter).
On Sun, Mar 21, 2010 at 11:53 AM, Jacob Ayienda <jacobayienda@gmail.com> wrote:
I'm checking out PHP at the moment to see if it will be possible to rewrite the whole thing for the web.
_______________________________________________
Skunkworks mailing list
Skunkworks@lists.my.co.ke
http://lists.my.co.ke/cgi-bin/mailman/listinfo/skunkworks
------------
Skunkworks Server donations spreadsheet
http://spreadsheets.google.com/ccc?key=0AopdHkqSqKL-dHlQVTMxU1VBdU1BSWJxdy1f...
------------
Skunkworks Rules
http://my.co.ke/phpbb/viewtopic.php?f=24&t=94
------------
Other services @ http://my.co.ke

I'm not really helping but lolest! Kenyans love the word tweeple? That's a shocker. And yes, spammers suck.
On Sun, Mar 21, 2010 at 4:19 PM, TheBigBoss <thebigboss@peperuka.com> wrote:
Hi Jacob,
"a web app to make it easier for many people to access it" - XML or JSON would be perfect for fetching the data.
I wanted to develop something similar, but to find out who is going on safari in Kenya (like that recent popular web project, I forgot the name, "robme"??, which tracked people who were not at home using foursquare and Twitter).
--
Shiro Njagi
The idea is to write it so that people hear it and it slides through the brain and goes straight to the heart. ~~Maya Angelou~~
participants (4)
- Jacob Ayienda
- Michael Pedersen
- Shiro Njagi
- TheBigBoss