It is no secret that the entire IT industry is embracing open source. Nowhere is this shift more pronounced than in the emerging profession of data science. For the longest time, data scientists (even if many were not yet called that) relied on highly functional, well-integrated, proprietary products. Most of these products offer a proprietary language, a set of development tools for that language, and an execution environment to run the analytics developed with those tools, all wrapped together in a well-integrated, capable, and often expensive package. The cost of acquisition puts these tools out of reach for many people, and their proprietary nature dissuades others from learning them, since they are perceived as not being widely adopted. Hence the drive toward open source tools, with their ubiquity and low barrier to use.
Open source technologies come with their own challenges, however. As a budding data scientist, you may be looking at picking an open source language like Python, R, Scala, or Julia. You also have to choose the packages you need from catalogs of tens of thousands, all created by different authors with varying levels of quality, currency, and support. Then you need to pick your tools: will it be an IDE like RStudio or PyCharm, or a notebook technology such as IPython, Jupyter, Zeppelin, or Beaker? Next you need to decide whether to use big data technologies like Hadoop and Spark as your runtime engine. Finally, you need to integrate all of this and keep it running as software versions change and stomp on each other. It is not easy.
We created Data Scientist Workbench to make using open source tools for data science easy - as easy as integrated proprietary product suites - maybe even easier.
We took the liberty of putting together a starter kit: IPython notebooks with support for Python, R, and Scala, plus all sorts of popular libraries. We added Spark and connectivity to Hadoop clusters and relational databases. We also integrated the RStudio IDE, because it is very popular with R data scientists, and we added OpenRefine for those who prefer to eyeball, clean, and prepare their data through a spreadsheet-like interface instead of writing code. We then decided to host it all for you on the cloud and keep everything at the latest levels, so you can concentrate on data science instead of system administration. However, we realize that many of you can't put your data and code on the cloud, so we are architecting DSWB in such a way that you can deploy it inside your company's firewall. We also heard loud and clear that from time to time you want to use these tools on your laptop, with or without an internet connection. As a result, you can bring components of DSWB onto your laptop and go back to the cloud only when you want to.
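To give a flavor of the kind of notebook code this enables, here is a minimal, hypothetical sketch of pulling data from a relational database and summarizing it in Python. It uses Python's built-in sqlite3 module purely as a stand-in; drivers for the larger databases DSWB connects to follow the same DB-API pattern, and the table and column names here are invented for illustration.

```python
import sqlite3

# Hypothetical example: an in-memory SQLite database stands in for
# a real relational database connection.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120.0), ("west", 80.0), ("east", 40.0)],
)
conn.commit()

# Aggregate in SQL, then continue the analysis in Python.
cur.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
)
totals = dict(cur.fetchall())
print(totals)  # {'east': 160.0, 'west': 80.0}
conn.close()
```

In a notebook, each of these steps would typically live in its own cell, so you can inspect intermediate results as you go.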
This is just a start, and we are building the workbench to satisfy your needs. How do we know what your wants and needs are? We don't, unless you tell us. That is why we have integrated a feature-request mechanism right inside DSWB. Just click on the "Resources/Feedback Forum" menu to submit your ideas or to vote up ideas submitted by others. Yes, we are very democratic! We also try to be very transparent: we will share our progress and implementation status with you every step of the way, right there in the forum. For many ideas you won't have to wait very long; we don't have releases, and we push new features out almost every week.
What is the roadmap for the Data Scientist Workbench? We have set the direction: assembling an integrated, highly functional tool set for data scientists based on readily available open source technologies. If you like this direction, we would love for you to join us on the journey, and we are looking to you to help us choose the steps we take along the way. Point your browser to the feedback forum and let the ideas flow.