Crowdsourcing scientific skills: Kaggle and data modeling

Kaggle is not exactly a newcomer, but it is an excellent example of how Web 2.0 can boost science and help solve scientific problems.

Kaggle harnesses the power of crowdsourcing to solve problems that call for data modeling. Predictive models are everywhere: they help predict all kinds of phenomena, from customer behavior to bird migration. However, there is no general rule for designing such models, and they often end up being optimized by trial and error. So the field seems well suited to the massive number of work-hours that crowdsourcing can provide.

Kaggle asks participants to develop predictive models for problems submitted by companies (GE, Allstate, Merck, Ford…) and other organisations (universities, governmental organisations…). Tens to hundreds of different models can then be compared, and the best one is chosen as the winner.
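To make the comparison step concrete, here is a minimal sketch of how a set of submissions could be ranked against held-out data. It is not Kaggle's actual scoring code: the team names, the numbers and the choice of root-mean-square error (RMSE) as the metric are all assumptions made for illustration, and the real metric varies from one competition to another.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length sequences."""
    if len(predicted) != len(actual):
        raise ValueError("predictions and ground truth must have the same length")
    squared_errors = ((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(sum(squared_errors) / len(actual))


def rank_submissions(submissions, actual):
    """Return (name, score) pairs sorted from lowest to highest error."""
    scores = {name: rmse(preds, actual) for name, preds in submissions.items()}
    return sorted(scores.items(), key=lambda item: item[1])


# Hypothetical held-out values that participants never see.
hidden_truth = [3.0, 5.0, 2.0, 8.0]

# Hypothetical predictions from three made-up teams.
submissions = {
    "team_a": [2.5, 5.5, 2.0, 7.0],
    "team_b": [3.0, 5.0, 2.5, 8.0],
    "team_c": [4.0, 4.0, 3.0, 9.0],
}

leaderboard = rank_submissions(submissions, hidden_truth)
winner = leaderboard[0][0]  # lowest error comes first

for name, score in leaderboard:
    print(f"{name}: RMSE = {score:.3f}")
```

Scoring against data the participants cannot see is what makes the ranking meaningful: a model tuned by trial and error can look excellent on the data it was built with and still predict poorly on new cases.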

Turning work into a game is a common strategy to motivate participation, but it is interesting to see that Kaggle pushes the sport analogy quite far. The terms “player”, “competition” and “winner” are often used. And a winner there is: the creator of the best-performing model is usually rewarded with a prize that can range from hundreds to millions of dollars.

Founded in 2010, the company has successfully raised millions of dollars, and major companies, convinced by a series of successful projects, are coming on board with their own data to be modeled. A great example of how to put brilliant minds (with some free time on their hands) to work collaboratively!

Publish the unpublishable with ResearchGate

ResearchGate recently announced that they now encourage researchers to share data through their platform. They hope to get more unpublished information out in the open to fuel scientific discussion. Such information includes:

  • Datasets and raw data
  • Negative results
  • Figures and media files
  • Unpublished articles


This new service comes in addition to a set of other services that already allow researchers to share data and unpublished information. ResearchGate, with their 2+ million users, will probably quickly become one of the main platforms for publishing such information. By steadily releasing new services, ResearchGate seems to be taking the lead as a social platform for scientific exchange. However, published data should be easily searchable, citable and not locked into proprietary formats. This is not the case as of today, and I would be curious to learn more about efforts currently underway to address these issues.