Twitter Hospital Compare

While I was working on the course project for Coursera’s Introduction to Data Science, a few folks on the discussion forum started exploring the possibility of doing some Twitter data analysis for healthcare. There were a number of thought-provoking discussions about what healthcare insights can be mined from Twitter, and I was reminded of a data set I had seen earlier.

Last fall I was auditing another Coursera course, Computing for Data Analysis, by Roger Peng. One of its assignments required some statistical analysis of Medicare-aided hospitals. These hospitals have an alarming national readmission rate (19%): nearly 2 million Medicare beneficiaries are readmitted within 30 days of release each year, costing $17.5 billion. It is not completely understood how to reduce readmission rates, as even the country’s highly ranked hospitals have not been able to bring theirs down.

Research Questions:
I agree they are all very rudimentary, but my understanding of this country’s complicated medical system is very limited. I know how to code and how to take little steps at a time. Or so I thought.

Data description:
Composite Topics
The survey file contains data on 4606 hospitals across the country, but after cleaning out missing values and “insufficient data” and “errors in data collection” entries, the number of hospitals was down to 3573.
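The cleanup step can be sketched roughly as follows. This is only an illustration using the standard library, not the code I actually ran; the column names, the exact placeholder strings and the sample rows are all made up for the example.

```python
import csv
import io

# Hypothetical placeholder strings that mark an unusable survey answer.
BAD_VALUES = {"", "not available", "insufficient data", "errors in data collection"}


def clean_rows(rows):
    """Keep only the rows in which every field holds a usable value."""
    kept = []
    for row in rows:
        values = [(value or "").strip().lower() for value in row.values()]
        if not any(value in BAD_VALUES for value in values):
            kept.append(row)
    return kept


# Tiny made-up sample standing in for the real survey file.
SAMPLE = """hospital_name,nurse_communication,overall_rating
Hospital A,78,70
Hospital B,Insufficient data,65
Hospital C,,
Hospital D,81,Errors in data collection
Hospital E,80,72
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = clean_rows(rows)
print(len(rows), "rows before cleaning,", len(cleaned), "after")
```

Dropping a hospital as soon as any field is unusable is the bluntest possible rule; a gentler version might only drop rows where the fields needed for a given question are missing.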
That settles the structured data. Let’s talk about unstructured data. Consider a tweet from a user, @TonyHernandez, whose nephew recently had successful brain surgery at @Florida Hospital. Yes, this one.
Normally I’d use Python for this kind of matching, but since the course had evangelized sqlite3 for such joins, I went that route. A minor point to note: case-insensitive string matches on the text data type in sqlite3 need an additional “collate nocase” qualification when creating the table.
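Here is a minimal sketch of what such a case-insensitive join looks like, driven from Python’s built-in sqlite3 module. The table names, column names, hospital names and handles are all hypothetical stand-ins, not the real schema or data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# COLLATE NOCASE on a column makes "=" comparisons on it ignore ASCII case.
cur.execute("CREATE TABLE survey (hospital_name TEXT COLLATE NOCASE, overall_rating INTEGER)")
cur.execute("CREATE TABLE twitter_list (hospital_name TEXT COLLATE NOCASE, handle TEXT)")

# Made-up rows: the survey uses upper case, the curated list uses title case.
cur.executemany("INSERT INTO survey VALUES (?, ?)", [
    ("EXAMPLE GENERAL HOSPITAL", 75),
    ("MADE UP MEDICAL CENTER", 68),
    ("SAMPLE MEMORIAL HOSPITAL", 71),
])
cur.executemany("INSERT INTO twitter_list VALUES (?, ?)", [
    ("Example General Hospital", "@ExampleGeneral"),
    ("Sample Memorial Hospital", "@SampleMemorial"),
    ("Another Clinic", "@AnotherClinic"),
])

# Join the two lists on the hospital name.
matches = cur.execute("""
    SELECT s.hospital_name, t.handle, s.overall_rating
    FROM survey AS s
    JOIN twitter_list AS t ON s.hospital_name = t.hospital_name
    ORDER BY s.hospital_name
""").fetchall()

for name, handle, rating in matches:
    print(name, handle, rating)
print(len(matches), "matches")
conn.close()
```

Without the collation the upper-case survey names would not equal the title-case names from the list, and the join would come back empty. Note that this only handles differences in case; names that differ in spelling or punctuation still need manual matching.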
Next you want to see how many matches you actually get between the two datasets.
Moreover, apart from the Twitter handle, the rest of the data in the list was outdated. I needed updated counts of followers, friends, listed, retweets and favorites for these handles. A quick Twython script did the trick.
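A rough sketch of such a lookup is below. It is not the script I used. It assumes a client object that behaves like a Twython instance, whose lookup_user method wrapped Twitter’s users/lookup endpoint (up to 100 screen names per call) at the time; the helper names, the fake client and the handles are hypothetical, and the fake client is there only so the sketch runs without credentials or network access.

```python
def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_counts(client, handles, batch_size=100):
    """Return {handle: counts} using the client's lookup_user method.

    `client` is assumed to behave like a Twython instance.
    """
    counts = {}
    for batch in chunked(handles, batch_size):
        for user in client.lookup_user(screen_name=",".join(batch)):
            counts[user["screen_name"]] = {
                "followers": user["followers_count"],
                "friends": user["friends_count"],
                "listed": user["listed_count"],
            }
    return counts


class FakeClient:
    """Hypothetical stand-in that returns fixed counts for every handle."""

    def lookup_user(self, screen_name):
        return [
            {"screen_name": name, "followers_count": 10, "friends_count": 5, "listed_count": 1}
            for name in screen_name.split(",")
        ]


# With real credentials the client would be created along these lines:
#   from twython import Twython
#   client = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)
counts = fetch_counts(FakeClient(), ["HospitalA", "HospitalB", "HospitalC"], batch_size=2)
print(counts["HospitalA"])
```

Batching keeps the number of API calls low, which matters given the rate limits.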
Props to TwitterGoggles for such a nice tweet harvesting script written in Python 3.3. It allows you to run the script as jobs with a list of handles and offers a very nice schema for storing tweets, hashtags, and all relevant logs.
Before I managed to submit the assignment, two runs of TwitterGoggles had collected 21651 tweets from and to these hospitals, 10863 hashtags, 18447 mentions, and 8780 retweets from Medicare-aided hospitals on Twitter.
Analysis: All this while, I was running on the hope that everything would somehow come together to form a story at the last moment. What made things even more difficult was that the survey data was all on a Likert scale, and I could not think up any hardcore data science analysis for the merged data. However, my peers were extraordinarily generous and gave me 20 points along with the following insightful comments, the first of which nails it.
peer 1 → The idea is promising, but the submission is clearly incomplete. Your objective is not clear: “finding patterns” is too vague as an objective. One could try to infer your objectives from the results, but you just build the dataset an don’t show nor explain how you intended to use it, not to mention any result. Although you mentioned time constraints maybe you should have considered a smaller project.
peer 2 → Very promising work, but it requires further development. It’s a pity that no analysis was made.
While there is a lot left to be done, I thought a quick Tableau visualization of the data might be useful. Click here for an interactive version.

Among the various data sets available from HCAHPS, this one contains feedback about the hospitals obtained by surveying actual patients. I thought it would be interesting to study how patients and hospitals interact on Twitter.

Why do some hospitals have more followers, more favorited tweets, or more retweets? Is it because of the quality of care they provide? Is the number of Twitter followers a hospital has affected by how its nurses and doctors communicate with their patients? Do patients feel good (sentiment analysis) when hospitals provide a clean, quiet environment and offer immediate help on request? Would proper discharge information help hospitals get more Twitter love?

The Survey of PatientsHospital Experiences HCAHPS.csv (hereafter referred to as the “survey”) contains the following fields:

Nurse Communication
Doctor Communication
Responsiveness of Hospital Staff
Pain Management
Communication About Medicines
Discharge Information
Individual Items
Cleanliness of Hospital Environment
Quietness of Hospital Environment
Global Items
Overall Rating of Hospital
Willingness to Recommend Hospital

7hr brain surgery, huge success! Big props to Dr Eric Trumble at #FloridaHospital #disney pavilion, a true ROCK STAR! Thanks 4 prayers!
— Tony Tightropes (@TonyHernandez) June 7, 2013

This tells us how a particular patient feels about the care he (or rather his nephew) received at the hospital. The sentiment of the tweet text, the hashtags, the retweet count and the favorites count are simple yet powerful signals that we can aggregate to get an idea of how the hospital is performing.
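To make the idea of aggregating these signals concrete, here is a small sketch that rolls up retweets, favorites and hashtags per hospital. The tweet records, field names and hospital names are made up for illustration, and sentiment scoring is deliberately left out because it would need a lexicon or a trained model.

```python
from collections import Counter, defaultdict


def aggregate_signals(tweets):
    """Sum retweets and favorites and count hashtags per mentioned hospital."""
    totals = defaultdict(
        lambda: {"tweets": 0, "retweets": 0, "favorites": 0, "hashtags": Counter()}
    )
    for tweet in tweets:
        entry = totals[tweet["hospital"]]
        entry["tweets"] += 1
        entry["retweets"] += tweet["retweet_count"]
        entry["favorites"] += tweet["favorite_count"]
        # Lower-case the tags so that #Surgery and #surgery count as one.
        entry["hashtags"].update(tag.lower() for tag in tweet["hashtags"])
    return dict(totals)


# Made-up tweet records; the field names are hypothetical.
tweets = [
    {"hospital": "Hospital A", "retweet_count": 3, "favorite_count": 5, "hashtags": ["Surgery", "thanks"]},
    {"hospital": "Hospital A", "retweet_count": 1, "favorite_count": 0, "hashtags": ["surgery"]},
    {"hospital": "Hospital B", "retweet_count": 0, "favorite_count": 2, "hashtags": []},
]

signals = aggregate_signals(tweets)
for hospital, entry in sorted(signals.items()):
    print(hospital, entry["tweets"], entry["retweets"], entry["favorites"], dict(entry["hashtags"]))
```

Per-hospital totals like these could then be joined against the survey scores for the same hospital.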

Next I got a list of hospitals that were on Twitter … thanks to the lovely folks who hand-curated it. It was nicely html-ed, making it easy to scrape into a Google Doc with one line of ImportXML(“”, “//tr”). Unfortunately, the number of hospitals on Twitter according to this list (779) is far smaller than the total number of hospitals. But it is still a lot of human work to match 3573 x 779 hospital names.

So we lose about 92% of the survey data: fewer than 8% of the hospitals we have data for were on Twitter when this list was made. These 246 hospitals are definitely more proactive than the rest, so I already have a biased dataset. Shucks!

While the Twitter API gives direct counts of friends, followers and listed, for the other attributes I had to collect all the tweets made by these hospitals. Additionally, it is important to get the tweets that mention these hospitals on Twitter.

Collecting such historic data means using the Twitter Search API and not the live Streaming API. The Search API is not only more stringent as far as rate limits are concerned, it is also thrifty in how many tweets it returns. It’s meant to be relevant and current rather than exhaustive.

peer 4 → I thought the project was well put together and organized. I was impressed with the use of github, amazon AWS, and google docs to share everything amongst the group. The project seems helpful to gather data from multiple sources that then can hopefully be used later to help figure out why the readmission rates are so high.

peer 6 → As a strength, this solution is well-documented and interesting. As a weakness, I would like to have seen a couple of visualizations.

It appears that hospitals on the East Coast are far more active on Twitter than those on the West Coast. The data is here as a csv and as a Google Doc spreadsheet.
