Guido Romeo: “Check your data before starting to work”

Map: the seismic safety of Italian schools (Safe Schools, Wired Italia)

When asked about his biggest mistake, Wired Italy journalist Guido Romeo shares the story behind his first big project. It taught him about the importance of a roadmap and the time-consuming task of cleaning up data. Here is why you should check your data before you set a deadline.

“Our biggest mistake was on our first big data story: scuole sicure, an investigation into the seismic safety of more than 75,000 school buildings in Italy. We started off thinking we’d get all the data from the Ministry of Education and that they would also provide the geocoding, as they’d done it for their website. We were very naive and embarked on a much too large project without estimating the actual effort.”

Dirty data

“The data was very dirty. It took a full three weeks to clean it up into a workable version, and we had to geocode everything ourselves, as it turned out the Ministry was not at all in favor of what we were doing. Geocoding 75,000 points was a big issue, as Google’s free services allow you to geocode 2,500 points per day per machine… All of a sudden we realized it’d take us weeks to do it all, even using all the newsroom computers. As I said, we were very naive and embarked on a large project without due diligence and a clear roadmap and timetable.”
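The arithmetic behind that realization is simple enough to run before committing to a deadline. The sketch below uses only the two figures from the quote (75,000 points and 2,500 lookups per day per machine). The function name and the machine counts are assumptions made for illustration, and the estimate ignores failed lookups and reruns, which dirty addresses make likely.

```python
import math


def geocoding_days(total_points: int, daily_quota_per_machine: int, machines: int) -> int:
    """Return the whole days needed to geocode every point under a daily quota.

    Hypothetical helper: it assumes every lookup succeeds on the first try.
    """
    if total_points < 0 or daily_quota_per_machine <= 0 or machines <= 0:
        raise ValueError("points must be >= 0; quota and machines must be > 0")
    daily_capacity = daily_quota_per_machine * machines
    return math.ceil(total_points / daily_capacity)


if __name__ == "__main__":
    # Machine counts are illustrative, not taken from the interview.
    for machines in (1, 2, 4):
        days = geocoding_days(75_000, 2_500, machines)
        print(f"{machines} machine(s): about {days} day(s)")
```

A single machine needs 30 days at that quota, so any realistic timetable has to budget for the geocoding step separately from the cleaning.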

Cleaning up

“Data cleaning was done over the summer with the help of two interns (interns are gold!!!). For the geocoding we had a hacker friend write a little programme that tricked Google’s servers into doing all the work at once. In the end this part actually worked pretty well.”

Roadmap

“Check your data before starting to work and setting deadlines. If you have big datasets, try to run smaller samples to see if the process works. Set out a detailed roadmap. To put it simply: be paranoid when it comes to big datasets, but push on.”
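One way to act on the advice about smaller samples is a trial run: push a random slice of the records through the process and count what breaks. This is a minimal sketch, not the Wired team's method. The names `trial_run` and `check_address`, the record layout, and the made-up school records are all hypothetical.

```python
import random


def trial_run(records, process, sample_size=100, seed=0):
    """Run `process` on a random sample and collect the records that fail."""
    rng = random.Random(seed)  # fixed seed so the trial is repeatable
    sample = rng.sample(records, min(sample_size, len(records)))
    failures = []
    for record in sample:
        try:
            process(record)
        except Exception as exc:  # any error counts as a failed record
            failures.append((record, exc))
    return len(sample), failures


def check_address(record):
    """Hypothetical processing step: reject records with no usable address."""
    if not record.get("address", "").strip():
        raise ValueError("missing address")
    return record


if __name__ == "__main__":
    # Invented data: every fourth record has an empty address.
    records = [
        {"school": f"School {i}", "address": "" if i % 4 == 0 else f"Via Roma {i}"}
        for i in range(1000)
    ]
    tested, failures = trial_run(records, check_address, sample_size=50)
    print(f"{len(failures)} of {tested} sampled records failed")
```

The failure rate on the sample gives a rough idea of how much cleaning the full dataset needs before any deadline is set.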

Reading Tip

“I believe the next two books are a great combination of what you need to understand. The first one is a fundamental classic, Precision Journalism by Philip Meyer. The other one is The Functional Art by Alberto Cairo. Also, looking at GEN’s shortlist for the Data Journalism Awards is great inspiration. The quality of the work submitted there goes from great to simply excellent. (BTW: we were shortlisted last year for our hospital story :) ).”


About the Data Horror Stories
Failing fast seems to be a good option if you want to learn something quickly. But what about learning from mistakes others have made? In the Data Horror Story series data journalists share a mistake for us to learn from. Share your own data horror story here, or read some more horror stories from others.