大数据:是大型组织和企业获得的巨大的、庞大的或海量的数据、信息或相关统计数据。由于难以手动计算大数据,因此创建和准备了许多软件和数据存储。它用于发现模式和趋势,并做出与人类行为和交互技术相关的决策。
大数据的应用和使用:
- facebook 和 twitter 等社交网站。
- 航空和铁路等交通运输。
- 医疗保健和教育系统。
- 农业方面。
Apache Hadoop:它是一个建立在机器集群上的开源软件框架。它用于非常大的数据集(即大数据)的分布式存储和分布式处理。它是使用 MapReduce 编程模型完成的。用Java实现的开发友好型工具支持大数据应用程序。它可以轻松处理商品服务器集群上的海量数据。它可以挖掘任何形式的数据,即结构化、非结构化或半结构化数据。它是高度可扩展的。
它由 3 个组件组成:
- HDFS :可靠的存储系统,其中存储了世界上一半的数据。
- MapReduce :层由分布式处理器组成。
- Yarn :层由资源管理器组成。
下表列出了大数据和 Apache Hadoop 之间的差异:
No. | Big Data | Apache Hadoop |
---|---|---|
1 | Big Data is group of technologies. It is a collection of huge data which is multiplying continuously. | Apache Hadoop is a open source java based framework which involves some of the big data principles. |
2 | It is a collection of assets which is quite complex, complicated and ambiguous. | It achieves a set of goals and objectives for dealing with the collection of assets. |
3 | It is a complicated problem i.e. huge amount of raw data. | It is a solution being processing machine of those data. |
4 | Big Data is harder to access. | It allows the data to be accessed and process faster. |
5 | It is hard to store the huge amount of data as it consists all form of data. i.e. structured, unstructured and semi-structured. | It implements Hadoop Distributed File System (HDFS) which allows the storage of different variety of data. |
6 | It defines the data set size. | It is where the data set stored and processed. |