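Data science deals with identifying, representing, and extracting meaningful information from data sources so that it can be used to carry out business logic. Data scientists use machine learning, statistics, probability, linear and logistic regression, and similar techniques to make sense of data. The biggest part of the analysis is finding patterns and similar combinations that fit the business logic and working out the best path forward.
R, Python, SQL, SAS, Tableau, MATLAB, and others are the most useful data science tools, and R and Python are the most widely used of them. Still, for any newcomer, choosing the better or more suitable of the two can be confusing. Let's try to visualize the differences.
Overview: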
R | Python |
---|---|
R is a programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. It was designed by Ross Ihaka and Robert Gentleman and first released in August 1993. It is widely used among statisticians and data miners for developing statistical software and for data analysis. | Python is an interpreted, high-level programming language for general-purpose programming. It was created by Guido van Rossum and first released in 1991. Python has a clean and simple syntax. It emphasizes code readability, which also makes debugging simpler and easier. |
Specialties for data science:
R | Python |
---|---|
R packages cover advanced techniques that are very useful for statistical work. The CRAN Task Views list many useful R packages, covering everything from psychometrics to genetics to finance. Python, on the other hand, with the help of libraries like SciPy and packages like statsmodels, covers only the most common techniques. | R and Python are equally good for finding outliers in a data set, but for developing a web service that lets other people upload data sets and find outliers, Python is better. People have built Python modules to create websites, interact with a variety of databases, and manage users. In general, Python is the better choice for creating a tool or service that uses data analysis. |
Functionalities:
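To make the outlier example concrete, here is a minimal sketch of outlier detection in plain Python. It assumes the common Tukey (interquartile range) rule, which the comparison above does not specify. The function name `find_outliers`, the multiplier `k`, and the sample data are all hypothetical and chosen only for illustration.

```python
import statistics


def find_outliers(values, k=1.5):
    """Return the values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    # quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]


# Hypothetical sample data with one obvious outlier.
readings = [10, 12, 11, 13, 12, 11, 95, 12, 10, 13]
print(find_outliers(readings))  # prints [95]
```

A function like this is the kind of logic that could sit behind an upload endpoint in a Python web service. The roughly equivalent check in R would start from `quantile()` and `IQR()`.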
R | Python |
---|---|
R has built-in functionality for data analysis. R was designed by statisticians with statistics and data analysis in mind, so many tools that Python gets only through external packages are available in R by default. | Python is a general-purpose programming language, so most data analysis functionality is not built in. It is provided by packages like NumPy and pandas, which are available from PyPI (the Python Package Index). |
Key domains of application:
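The difference shows up even for basic descriptive statistics. In a fresh R session, `mean(x)`, `median(x)`, and `sd(x)` work without loading anything, while Python needs an import first. The sketch below uses the standard-library `statistics` module so that it runs without installing anything. In practice NumPy or pandas would be the more typical choice, and the sample values are made up.

```python
import statistics

# Hypothetical sample; in R the equivalent calls mean(x), median(x), sd(x)
# work in a fresh session without loading any package.
x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

summary = {
    "mean": statistics.mean(x),
    "median": statistics.median(x),
    # Sample standard deviation (divides by n - 1), matching R's sd().
    "stdev": statistics.stdev(x),
}
print(summary)
```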
R | Python |
---|---|
Data visualization is a key aspect of analysis, since data is often easiest to understand visually. R packages like ggplot2, ggvis, and lattice make data visualization easier in R. Python is catching up with packages like Bokeh and Matplotlib, but is still behind in this regard. | Python is better for deep learning. Packages like Lasagne, Caffe, Keras, MXNet, OpenNN, and TensorFlow make developing deep neural networks far simpler in Python. Although deep learning is also possible in R (through packages like deepnet and H2O, and through R interfaces to TensorFlow), the tooling is still better in Python. |
Availability of packages:
R | Python |
---|---|
R has hundreds of packages and many ways to accomplish a given data science task. Although this allows a task to be completed exactly as desired, it makes it harder for inexperienced developers to reach their goals. | Python relies on a few main packages: scikit-learn for machine learning and pandas for data analysis. This makes it easier to accomplish common tasks, but harder to achieve specialization. |
Ultimately, choosing the most suitable language for the task at hand is the data scientist's own job. For those with a statistics background, R may be the better choice. For those with a CS background, and even for beginners, Python is the most suitable choice. Still, it is best to know both well, since both can be useful at times in a data science career.