While sed/awk/grep-ing the cablegate files, I stumbled upon a cable that mentioned Kofi Annan asking Robert Mugabe to step down in exchange for a handsome retirement plan during the Millennium summit. Being an ignorant bloke, I could hardly recall what the Millennium summit was about, had no clue if Mugabe was still in office, and if Kofi Annan has made a comment on this! Without the right background and context I could not appreciate the data to the full extent. Below is the #MozNewsLab final project idea pitch in the lights of the three speakets this week: Chris Heilmann, John Resig and Jesse James Garrett
“What is this thing for? What does it do? How is it supposed to fit into people’s lives?”, @Jesse James Garrett:
Journalists get amazing amount of digital data everyday which are in the form of numbers in tables. With some spreadsheet skills or help from newsroom programmers, they produce incredible revelations of the reality that hides behind those numbers. However, when the data comes in the form of unstructured text files written in natural language – there isn’t much algorithmic help available, other than full text searches with a list of guess words. Using cutting edge information retrieval technique, Reveal would aim to build a framework that automatically annotates names, places, locations, dates etc. in the unstructured text files.
“Adopting Open Source, Open standards“, @Chris Heilmann:
Being baptized by St. IGNUcius, the idea of Free as in Freedom runs through the core of the technology stack of Reveal. Standard LAMP stack for server side, UI powered by HTML5, CSS3 and jQuery plugins and a number of open source libraries for doing the information extraction – long post describing the information retrieval technology coming soon. (Mind map above).
Using the detected names, locations, dates etc., Reveal will try to aggregate additional information in the form of images, maps, news articles, videos, wikipedia pages, visualizations etc. via open API-s and use them as navigational elements to browse the data. Juxtaposed to the document under scrutiny, these will provide the right context to gauge the sensitivity of the information.
“User to Contributor”, @John Resig:
Additionally, by showing a relative score of “How much does the world know?”, calculated on the basis of the aggregated information published before the documents surfaced, we can excite the newsreaders to share the information across their own social network. Add some game mechanics by quantifying that “sharing”, and we bust the filter bubble of ignorant blokes and turn them into responsible citizens who’ll raise voices against wrong doings of totalitarian regimes, evil corporations or other bad asses. This will lead to creation of more content and will act as a feedback loop to the background and context aggregation step before.
Now, a similar project by the uber journalist-programmer Jonathan Stray of the AP has won this year’s Knight Mozilla news challenge. His approach, Overview, solely focuses on clustering documents based on cosine similarity of their tf-idf scores. Using sexy visualization, it pulls out key terms specific to the corpus under study. The night when the results of Knight Mozilla challenge was announced – in an euphoric outburst I sent him an embarrassingly long late night email ranting the above. Obviously, I never heard back but he will be releasing his code soon and I am super excited to fork it for visualizations in Reveal.