Cloud Dataproc is a managed Spark and Hadoop service that lets you use open-source data tools for batch processing, querying, streaming, and machine learning. Cloud Dataproc automation lets you create clusters quickly, spin up the Compute Engine instances behind each cluster, manage them easily, and save money by turning off clusters you no longer need. You can focus on your jobs and your data, with less time and money spent on administration.
What is Apache Spark
Apache Spark is an open-source software project that provides a high-performance analytics engine for batch processing and data streaming. Because it performs computation in memory, Spark can be up to 100 times faster than comparable Hadoop MapReduce jobs. Spark also provides abstractions for working with data, such as Resilient Distributed Datasets (RDDs) and DataFrames.
What is Hadoop and HDFS
In 2006, Hadoop made distributed processing of big data practical. The idea behind Hadoop was to build a cluster of servers and leverage distributed computing. The Hadoop Distributed File System (HDFS) stored the data on the servers in the cluster, and MapReduce provided distributed processing of that data. A whole ecosystem of related software has grown up around Hadoop, including Hive, Pig, and Spark.
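To make the MapReduce idea more concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API. The function names (map_phase, shuffle_phase, reduce_phase, word_count) and the sample input are assumptions made up for illustration. Each string in the input stands in for a block of data that HDFS would keep on a different server.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(block: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one block of text."""
    for word in block.lower().split():
        yield word, 1


def shuffle_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step that sits between map and reduce."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Combine all values collected for one key into a single total."""
    return key, sum(values)


def word_count(blocks: List[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence over all blocks."""
    # In a real cluster each block would be mapped on the server that stores it;
    # here everything runs in one process for illustration.
    mapped = [pair for block in blocks for pair in map_phase(block)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input: three small "blocks" of text.
blocks = ["spark hive pig", "hive spark", "spark"]
counts = word_count(blocks)
print(counts)  # {'spark': 3, 'hive': 2, 'pig': 1}
```

The map and reduce functions only ever see one block or one key at a time, and that independence is what allows a framework to spread the work across many servers.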