You need to develop a pipeline for processing data. The pipeline must meet the following requirements:
Scale up and down resources for cost reduction
Use an in-memory data processing engine to speed up ETL and machine learning operations
Use streaming capabilities
Provide the ability to code in SQL, Python, Scala, and R
Integrate workspace collaboration with Git
What should you use?
A. HDInsight Spark Cluster
B. Azure Stream Analytics
C. HDInsight Hadoop Cluster
D. Azure SQL Data Warehouse
E. HDInsight Kafka Cluster
F. HDInsight Storm Cluster
Correct Answer: A
Explanation/Reference:
Apache Spark is an open-source, parallel-processing framework that supports in-memory processing to boost the performance of big-data analysis applications.
HDInsight is a managed Hadoop service. Use it to deploy and manage Hadoop clusters in Azure. For batch processing, you can use Spark, Hive, Hive LLAP, or MapReduce.
Languages: R, Python, Java, Scala, SQL
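To illustrate the in-memory ETL requirement, here is a minimal PySpark sketch. The input path, column names, and aggregation are hypothetical placeholders, not part of the question; the point is that `cache()` keeps the working DataFrame in memory across repeated passes, and the same data can be queried in both Python and SQL.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On an HDInsight Spark cluster a session is typically preconfigured;
# building one explicitly keeps the sketch self-contained.
spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

# Hypothetical input: a CSV of sales events in cluster-attached storage.
raw = spark.read.csv("wasbs:///data/sales.csv", header=True, inferSchema=True)

# cache() pins the cleaned DataFrame in memory, so the two passes
# below (DataFrame aggregation and SQL query) avoid re-reading the source.
cleaned = raw.dropna(subset=["amount"]).cache()

# ETL-style aggregation via the DataFrame API (Python).
daily = cleaned.groupBy("date").agg(F.sum("amount").alias("total"))

# The same data is queryable in SQL, matching the multi-language
# requirement (SQL, Python, Scala, R).
cleaned.createOrReplaceTempView("sales")
top = spark.sql(
    "SELECT date, SUM(amount) AS total FROM sales "
    "GROUP BY date ORDER BY total DESC LIMIT 10"
)

daily.show()
top.show()
```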
You can create an HDInsight Spark cluster using an Azure Resource Manager template. The template is available on GitHub.
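Spark's Structured Streaming covers the streaming requirement as well. The sketch below uses Spark's built-in `rate` source, which generates synthetic rows, so it runs without external infrastructure; a real pipeline would read from a source such as Kafka or Event Hubs instead.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Built-in synthetic source: emits (timestamp, value) rows continuously.
stream = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Windowed count over event time: a typical streaming aggregation.
counts = stream.groupBy(F.window("timestamp", "30 seconds")).count()

# Write to the console for demonstration; a real job would target
# a table or an external sink.
query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())

query.awaitTermination()
```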
References: https://docs.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/batch-processing