Data loss protection for source code

Scope for data loss in the SDLC
In a post-Wikileaks age, software engineering companies should probably start sniffing their own development artifacts to protect their customers' interests. From the requirements analysis document to the source code and beyond, different software artifacts contain information that clients will consider sensitive. The traditional development process has multiple points of potential data loss – external testing agencies, other software vendors, consulting agencies, etc. Most software companies have security experts and/or business analysts redact sensitive information from documents written in natural language. Source code is a bit different, though.

A lot of companies do have people looking into the source code for trademark infringements and copyright statements that do not adhere to established patterns, and checking that previous copyright/credit notices are maintained where applicable. Black Duck and Coverity are nice tools to help you with that.

Ambitious goal

I am trying to do a study on data loss protection in source code – sensitive information and quasi-identifiers that might have seeped into the code in the form of comments, variable names, etc. The ambitious goal is to detect such leaks and automatically sanitize the source code (a replace-all is probably enough) while retaining code comprehensibility.
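As a crude illustration of the replace-all idea, here is a minimal Python sketch; the term list, placeholder scheme and code snippet are all made up for illustration:

```python
import re

# Hypothetical sensitive terms; in practice these would come from a
# curated dictionary or pattern list for the project.
SENSITIVE = ["AcmeCorp", "ProjectTitan"]

def sanitize(source):
    """Replace every sensitive term with a stable placeholder, so
    repeated references stay consistent and the code stays readable."""
    mapping = {term: "TERM_%d" % (i + 1) for i, term in enumerate(SENSITIVE)}
    for term, placeholder in mapping.items():
        source = re.sub(re.escape(term), placeholder, source, flags=re.IGNORECASE)
    return source

code = "# Fix AcmeCorp billing bug\nacmecorp_rate = 0.2  # per AcmeCorp contract\n"
print(sanitize(code))
```

A real tool would of course have to distinguish identifiers from comments and check that the placeholders do not collide with existing names.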

To formulate a convincing case study with motivating examples, I need to mine a considerable code base along with its requirement specifications. But no software company would actually give you access to such artifacts. Moreover, the (academic) people who would evaluate the study are also expected to lack such facilities, hurting reproducibility. So we turn towards Free/Open Source software: GitHub, Bitbucket, Google Code – huge archives of robust software written by the sharpest minds all over the globe. However, there are two significant issues with using FOSS for such a study.

Sensitive information in FOSS code?

Firstly, what can be confidential in open source code? The majority of FOSS projects develop and thrive outside corporate firewalls, without the need to hide anything. So we might be looking for the needle in the wrong haystack. However, if we can define WHAT sensitive information is, we can probably work around this.

There are commercial products like Identity Finder that detect information like Social Security Numbers (SSNs), Credit/Debit Card Numbers (CCNs), bank account information, and any custom pattern or sensitive data in documents. Some more regex foo should be good enough for detecting all such stuff …

for i in `cat sensitive_terms_list.txt`; do
    for j in `ls "$SRC_DIR"`; do
        grep -EHn --color=always "$i" "$SRC_DIR/$j"
    done
done

Documentation in FOSS

Secondly, the ‘release early, release often’ ethos of FOSS makes a structured software development model somewhat superfluous. Who would want to write requirements docs and design docs when you just want to scratch an itch? The nearest thing, in terms of design or specification documentation, would be projects that have adopted an Agile model of development (Scrum, say) – in other words, a model that mandates extensive requirements documentation in the form of user stories and their ilk.

Still Looking
What are some famous Free/Open Source projects that have considerable documentation closely resembling a traditional development model (or models accepted in closed-source development)? I plan to build a catalog of such software projects so that it can serve as a reference for similar work involving traceability between source code and requirements.

Possible places to look into: (WIP)
* Repositories mentioned above

I would sincerely appreciate it if you leave your thoughts, comments, poison fangs in the comments section … 🙂

Hacking the newsroom

[This is part 2 of the final pitch, which talks about the newsroom and business perspective. Part 1, detailing the newsreader perspective is here.]

Before anything else, there must be a 90-second theatrical promo:

Stop laughing at my amateurish video editing! This is my first ever … even Bergman, Godard and Fellini started somewhere before becoming great! Jokes apart, here’s what REVEAL is actually all about:

Let’s consider a hypothetical newsroom that uses REVEAL. A journalist gets hold of a huge collection of classified documents containing potentially sensitive information. Instead of painstakingly reading each line and jumping back to Google to search for relevant information, she uploads them to REVEAL and hits the pantry for her coffee. REVEAL goes to work and automatically parses out the names of people, places, organizations, etc. Using the names it detected, REVEAL affixes thumbnail images, mapping the named entities to the documents. The journalist now sits back, sips her coffee, flips through the images looking for someone/something/some place that’s interesting, and jumps directly to the document when she finds her target.

But that’s not all. To make life much easier for the journalist, REVEAL uses the names and keywords from the document to aggregate semantically related content from the net – images, video, news, blog and wiki articles – using open APIs. Making the background context readily available allows the journalist to focus solely on her analysis of the story.

What follows is an over-the-top ambitious plan for making lots of money – I mean, the business plan.

Unearthing named entities involves tonnes of computationally intensive text analysis, and for any sizable dataset we need a cloud-based solution. While REVEAL will always be Free and Open Source Software, the business proposition is offering it as a service. Be it a startup or a news corp, whoever deploys REVEAL at their site can offer it as a service to other news agencies/organizations on a pay-per-use model. Different packages can be offered based on when they want to share the information dug out from their documents.

Nothing like REVEAL exists today. The cohesive bond of unknown information on well-known personalities and organizations, original content (the documents), expert opinion (the journalist’s view), user-generated content (comments) and aggregated content will make REVEAL a dream product for generating ad revenue. Features for lead generation are built into the system, and the karma-points-based reader appreciation, along with the 360-degree view of the world, will ensure persistent traffic.

Now get me to Berlin Hackathon!
(398 words)

Most common names detected in Wikileaks cablegate files

Link to an incomplete implementation

Reveal – How much does the world know?

Have you ever had a shit-tonne of documents dumped into your inbox with an impossible deadline, demanding you suck out the hidden juicy bits? Or maybe it has been the joyful experience of discovering a dump of a MILF’s emails, diplomatic cables, or code dumps of an evil corporation’s website? At moments like those, you might have uttered, “fcuk! … Omne Ignotum Pro Magnifico!”. Wouldn’t it be nice if the needles just magically popped out of the haystack? Meet Reveal (clickable prototype) – a software framework that aspires to achieve that, and maybe a bit more.

While sed/awk/grep-ing the cablegate files, I stumbled upon a cable that mentioned Kofi Annan asking Robert Mugabe to step down in exchange for a handsome retirement plan during the Millennium Summit. Being an ignorant bloke, I could hardly recall what the Millennium Summit was about, had no clue if Mugabe was still in office, or if Kofi Annan had made a comment on this! Without the right background and context I could not appreciate the data to the full extent. Below is the #MozNewsLab final project idea pitch in light of this week’s three speakers: Chris Heilmann, John Resig and Jesse James Garrett.

“What is this thing for? What does it do? How is it supposed to fit into people’s lives?”, @Jesse James Garrett:

Journalists get an amazing amount of digital data every day in the form of numbers in tables. With some spreadsheet skills, or help from newsroom programmers, they produce incredible revelations of the reality that hides behind those numbers. However, when the data comes as unstructured text files written in natural language, there isn’t much algorithmic help available other than full-text searches with a list of guess words. Using cutting-edge information retrieval techniques, Reveal aims to build a framework that automatically annotates names, places, locations, dates, etc. in unstructured text files.
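To give a flavor of what such annotation involves, here is a deliberately naive stand-in for real named-entity recognition – it just pulls out runs of capitalized words; a production system would use a trained NER model from one of the open source libraries:

```python
import re

def naive_entities(text):
    """Toy stand-in for named-entity recognition: grab runs of two or
    more capitalized words. A real system would use a trained NER model."""
    return re.findall(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+", text)

cable = "Kofi Annan asked Robert Mugabe to step down during the Millennium Summit."
print(naive_entities(cable))
```

Even this toy version hints at the payoff: the extracted names become handles for fetching context, thumbnails and related coverage.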

“Adopting Open Source, Open standards“, @Chris Heilmann:
Being baptized by St. IGNUcius, the idea of Free as in Freedom runs through the core of Reveal’s technology stack: a standard LAMP stack on the server side, a UI powered by HTML5, CSS3 and jQuery plugins, and a number of open source libraries for doing the information extraction – a long post describing the information retrieval technology is coming soon. (Mind map above.)

Using the detected names, locations, dates, etc., Reveal will aggregate additional information in the form of images, maps, news articles, videos, Wikipedia pages, visualizations, etc. via open APIs and use them as navigational elements to browse the data. Juxtaposed with the document under scrutiny, these will provide the right context to gauge the sensitivity of the information.

“User to Contributor”, @John Resig:
Additionally, by showing a relative score of “How much does the world know?”, calculated on the basis of the aggregated information published before the documents surfaced, we can excite newsreaders to share the information across their own social networks. Add some game mechanics by quantifying that “sharing”, and we bust the filter bubble of ignorant blokes and turn them into responsible citizens who’ll raise their voices against the wrongdoings of totalitarian regimes, evil corporations and other badasses. This will lead to the creation of more content and will act as a feedback loop into the background and context aggregation step described before.

Now, a similar project by the uber journalist-programmer Jonathan Stray of the AP has won this year’s Knight-Mozilla news challenge. His approach, Overview, solely focuses on clustering documents based on the cosine similarity of their tf-idf scores. Using sexy visualization, it pulls out key terms specific to the corpus under study. The night the results of the Knight-Mozilla challenge were announced, in a euphoric outburst I sent him an embarrassingly long late-night email ranting about the above. Obviously, I never heard back, but he will be releasing his code soon and I am super excited to fork it for visualizations in Reveal.
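For the curious, the tf-idf/cosine-similarity machinery that this kind of clustering rests on fits in a few lines. This is a from-scratch sketch over a made-up toy corpus, not Overview’s actual code:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn raw documents into sparse tf-idf vectors (term -> weight dicts)."""
    tokenized = [d.lower().split() for d in docs]
    df = Counter(term for doc in tokenized for term in set(doc))  # document frequency
    n = len(tokenized)
    return [{term: tf * math.log(n / df[term]) for term, tf in Counter(doc).items()}
            for doc in tokenized]

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = ["mugabe zimbabwe election",
        "zimbabwe mugabe retirement",
        "annan united nations summit"]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Documents sharing rare terms score high against each other and near zero against unrelated ones, which is exactly the signal a clustering step feeds on.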

That is my final software idea pitch inspired by Chris Heilmann, John Resig and Jesse James Garrett #MozNewsLab Week 2:

Tweetsabers of News Revolution across the globe #MozNewsLab

After Amanda Cox’s lecture I was all pepped up to do some quick and sexy data visualization. In my daily life I rely on R or gnuplot for all my plots, simply because of their scripting interfaces. I have played a bit with the Google Charts and Visualization APIs and they are absolutely brilliant. I’m planning to get my hands dirty with matplotlib (yep, yep … Python!).

So midway into re-listening to the lecture, I was overpowered by the urge to do some global data visualization. One of the best global data visualization tools that has left an impression on my mind is the WebGL Globe – an open platform from the Google Data Arts Team. I grabbed the example code from here, and with a simple Python script collected the Twitter activity during the first week of #MozNewsLab into a .json file. A few simple changes to the sample JavaScript code, and there you have colored light sabers shooting out of the globe.

The main problem was the Twitter API limit – a meager 125 queries per hour. So I had to rely on the geo-location data I had scraped earlier for mapping #MozNewsLab participants on the world map. That allowed me to narrow down the geo-location queries to those who were not participants of #MozNewsLab. In case the geo-location info was not available on their Twitter profile, those homeless tweet counts were assigned to our dear lab co-lead Phillip Smith.
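For reference, the .json-producing step of such a script can be sketched like this; the coordinates and counts below are made up, and the [name, [lat, lon, magnitude, …]] series layout is an assumption based on the Globe’s example data files:

```python
import json

# Made-up tweet counts per location: (latitude, longitude, count).
activity = [(52.52, 13.40, 12),   # Berlin
            (41.88, -87.63, 30),  # Chicago
            (12.97, 77.59, 7)]    # Bangalore

# Flatten into the assumed WebGL Globe series format:
# [name, [lat, lon, magnitude, lat, lon, magnitude, ...]]
# with magnitudes normalized to [0, 1].
max_count = max(count for _, _, count in activity)
flat = []
for lat, lon, count in activity:
    flat += [lat, lon, count / max_count]

with open("moznewslab.json", "w") as f:
    json.dump([["week1", flat]], f)
print(flat)
```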

Mozilla News Lab Schedule – Google Calendar

The Knight Foundation and Mozilla have joined forces to help the media adapt to the evolving technology landscape. After an open idea challenge, 60 hackers and journalists were selected for a month-long Learning Lab. I somehow managed to sneak into this elite club with my two cents here and here.

The Learning Lab is going to be a series of webinars from some of the most respected names in technology and journalism. Here is a list of Twitter handles of the organizers and speakers. Below is a Google Calendar showing the webinar timings (PST).

Wanna see how wacky it will get? Check out the video from the most colorful moderator ever – Jacob Caggiano.

Can’t wait till Monday …

Subclipse 1.6.x and Eclipse 3.5 on Ubuntu 9.10 – Karmic Koala

A quick post on getting Subclipse working with Eclipse 3.5 on Karmic Koala. We are doing a group project for our CS 480 Database Systems course with 7 team members. The obvious choice was Google Code, as Min and Ali had used it in previous semesters – familiarity is a strong motivation not to experiment with your CASE tools (read: GitHub for SCM) when your grade depends on meeting the deadline. 😀

Min mailed us a comprehensive document on getting Subclipse working – but I stumbled getting it working on Ubuntu: the well-known problem of gnome-keyring and JavaHL ate up the last 2 hours.

Once you’ve installed Subclipse 1.6.x in Eclipse and would like to jump off and check out the code for your favorite open source project, you’ll be hit hard by messages similar to these:

Failed to load JavaHL Library.
These are the errors that were encountered:
no libsvnjavahl-1 in java.library.path
no svnjavahl-1 in java.library.path
no svnjavahl in java.library.path
java.library.path = /usr/lib/jvm/java-6-sun-

Nice as people are in the FOSS world, it’ll also point you to the JavaHL wiki documentation to fix this. While it gives you the basic steps required, the distro-specific details were missing. So here are the quick steps.

1. Install libsvn-java using Synaptic or from the command line: sudo apt-get install libsvn-java

2. Edit the eclipse.ini file in your Eclipse directory to add

-Djava.library.path=/usr/lib/jni

(after the -vmargs line) to tell Eclipse where to look for the JavaHL SVN bindings – on Karmic, libsvn-java puts them in /usr/lib/jni.

3. You also need to tell gnome-keyring to shut the f* up and let Subclipse work. For this, keep your password-store blank: edit the svn config file located in the .subversion directory of your home directory by adding

### Set password stores used by Subversion. They should be
### delimited by spaces or commas. The order of values determines
### the order in which password stores are used.
### Valid password stores:
### gnome-keyring (Unix-like systems)
### kwallet (Unix-like systems)
### keychain (Mac OS X)
### windows-cryptoapi (Windows)
password-stores =

I also disabled gnome-keyring using gconf-editor (navigate to /apps/gnome-keyring/daemon-components and uncheck SSH and PKCS11) – but that’s not really required, I guess.

4. (Re)start Eclipse. You should now be able to check out your project.

Ummm, how much of the school project did I complete this morning? None, really :P.
Next step is getting Apache Derby working.

Symbol table for the C-Compiler

In this assignment we learnt how a symbol table is implemented in a compiler. Each new symbol is pushed into the hashtable (implemented as an array of pointers to elements) of the symbol table stack. We sum up the ASCII values of the characters of the symbol and take the modulo with MAXHASHSIZE to determine the array index for a particular entry. The problem with this implementation seems to be the constant size of the hashtable we allocate on entering each new scope. If a scope has far fewer variables than one with close to MAXHASHSIZE entries, we unnecessarily waste memory. Check the size of the memory using gdb with

print sizeof(symbolStackTop->symbolTablePtr->hashTable)

and it comes to 128. We defined MAXHASHSIZE as 32, and the size of each pointer is 4 bytes on my machine (AMD Athlon(tm) Processor L110). So whatever the number of elements in the block, we always allocate 128 bytes.
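The hashing scheme described above can be sketched in a few lines of Python (the assignment itself is in C; the symbols here are illustrative):

```python
MAXHASHSIZE = 32

def hash_symbol(name):
    """Same scheme as the assignment: sum the ASCII values of the
    symbol's characters and take the modulo with MAXHASHSIZE."""
    return sum(ord(c) for c in name) % MAXHASHSIZE

# One fixed-size bucket array per scope; collisions are chained in lists.
hash_table = [[] for _ in range(MAXHASHSIZE)]
for sym in ["i", "count", "tnuoc"]:  # "count" and "tnuoc" collide (same char sum)
    hash_table[hash_symbol(sym)].append(sym)
print(hash_table[hash_symbol("count")])
```

Since each bucket here is a growable list, only the C version exhibits the fixed-size waste: all MAXHASHSIZE pointer slots are allocated per scope regardless of how many symbols it holds.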

Now this brings us to the question: symbolStackTop->symbolTablePtr just assumes a different address depending on the order and number of times enterScope() and leaveScope() get called. So do we free() the stack top (to avoid a memory leak) each time we leaveScope()? Then we’ll lose the entries for that particular scope and would not be able to use them later (it appears we’ll be using them in some other module of the project). If we don’t free(), where do we store the reference to the entries?

Another issue: the data structure Element has a data member key in it. But why do we need it? Our hashing function selects a block in the array, and on collision we resort to chaining. So does it really need the key as a data member?

Let’s see what people have to say about this …

CS473 – Writing a compiler for a simple language

This semester, one of the neatest courses I am taking is CS 473: Compiler Design, taught by Professor V. N. Venkatakrishnan. We will be writing a compiler for a language called C–, so there will be lots of coding, and we really get to understand how the abstract concepts are implemented. It will be a series of six homework projects.

Last semester I planned to note down stuff while doing CS 450 Introduction to Networking – another programming-intensive course. Due to many other things, that did not materialize. But this time we have to submit a short essay on what we learnt in each project, which gets graded. Well, quite a reason to blog. Here is the first essay.

Implementation of a Scanner for C– with (f)lex

What I learnt:
The basic structure of a lex program.

Still not feeling confident:
Whether using the lex API gives optimized results, and how to make efficient use of it in rules and user-defined routines.
Apart from a few documents, there is hardly any documentation that is concise yet makes you feel confident.

References used:
While none were fully read, I peeped into the following:

Further reading planned from:
[1] lex & yacc, Second Edition – Doug Brown, John Levine, Tony Mason. Publisher: O’Reilly Media.

Just a few lines of bash scripting to run all the test cases at once (assuming the test directory is in pwd):

rm -f testresult
for i in test/*.c; do
    echo "$i" >> testresult
    cat -n "$i" >> testresult
    echo "-----------" >> testresult
    ./cmlexer "$i" >> testresult
done