Interview Questions For Spark
• Spark provides an interactive language shell, since it ships with its own interpreter for Scala (the language Spark is written in).
• Spark is built around RDDs (Resilient Distributed Datasets), which can be cached across the computing nodes of a cluster (see the caching sketch after this list).
• Spark supports multiple analytic tools for interactive query analysis, real-time analysis, and graph processing.
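As a minimal sketch of RDD caching in the spark-shell (the dataset here is illustrative):

val nums = sc.parallelize(1 to 100000)
nums.cache()        // mark the RDD for in-memory caching; persist() allows other storage levels
nums.count()        // the first action computes the RDD and caches it
nums.count()        // later actions reuse the cached partitions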
• Parallelized collections: created by parallelizing an existing collection in the driver program so it is distributed across the cluster
• Hadoop datasets: created from files in HDFS or another Hadoop-supported storage system, with a function applied to each file record
• Actions
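A few common RDD actions in the spark-shell, as an illustrative sketch (the comments show what each call returns for this data):

val nums = sc.parallelize(Array(1, 2, 3, 4, 5))
nums.count()          // action: returns 5
nums.reduce(_ + _)    // action: returns 15
nums.collect()        // action: returns Array(1, 2, 3, 4, 5)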
The driver also delivers the RDD graph to the Master, where the standalone cluster manager runs.
hive> set spark.home=/location/to/sparkHome;
hive> set hive.execution.engine=spark;
Hive on Spark supports Spark on YARN mode by default.
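With the engine set, ordinary HiveQL statements execute as Spark jobs; for example (the table and columns here are hypothetical):

hive> SELECT dept, COUNT(*) FROM employees GROUP BY dept;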
• Spark Streaming for processing live data streams (a minimal sketch follows this list)
• GraphX for building and computing over graphs
• MLlib (machine learning algorithms)
• SparkR, which brings R programming to the Spark engine
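As promised above, a minimal Spark Streaming word-count sketch in Scala, assuming a text source on localhost:9999 (host, port, and batch interval are placeholders):

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(10))      // 10-second micro-batches
val lines = ssc.socketTextStream("localhost", 9999)  // read lines from a socket
lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).print()
ssc.start()
ssc.awaitTermination()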
• Local file system
• S3
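Both are read with the same textFile API; only the URI scheme changes (the paths are illustrative, and s3a access assumes the hadoop-aws package and credentials are configured):

val localData = sc.textFile("file:///tmp/input.txt")
val s3Data = sc.textFile("s3a://my-bucket/input.txt")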
• Loading data from a variety of structured sources
• Querying data using SQL statements, both inside a Spark program and from external tools that connect to Spark SQL through standard database connectors (JDBC/ODBC). For instance, using business intelligence tools like Tableau
• Providing rich integration between SQL and regular Python/Java/Scala code, including the ability to join RDDs and SQL tables, expose custom functions in SQL, and more
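A short Scala sketch of these capabilities; the data is made up and the SparkSession-based API assumes Spark 2.x or later:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-demo").getOrCreate()
import spark.implicits._

val people = Seq(("alice", 34), ("bob", 29)).toDF("name", "age")  // hypothetical data
people.createOrReplaceTempView("people")

spark.sql("SELECT name FROM people WHERE age > 30").show()        // plain SQL inside a program

spark.udf.register("shout", (s: String) => s.toUpperCase)         // expose a custom function in SQL
spark.sql("SELECT shout(name) FROM people").show()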
• Unlike Hadoop, Spark provides built-in libraries that perform multiple tasks from the same core, including batch processing, streaming, machine learning, and interactive SQL queries, whereas Hadoop only supports batch processing.
• Hadoop is highly disk-dependent, whereas Spark promotes caching and in-memory data storage.
• Spark can perform computations multiple times on the same dataset; this is called iterative computation, and Hadoop implements no such iterative computing.
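A sketch of iterative computation over a cached dataset (the data and number of passes are illustrative):

val data = sc.parallelize(1 to 10000).cache()  // kept in memory after the first action
for (i <- 1 to 5) {
  val total = data.map(_ * i).reduce(_ + _)    // each pass reuses the cached dataset
  println(s"pass $i: total = $total")
}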
• Standalone: a basic manager to set up a cluster
• Apache Mesos: generalized/commonly-used cluster manager, also runs Hadoop MapReduce and other applications
• YARN: responsible for resource management in Hadoop
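The cluster manager is typically chosen with spark-submit's --master flag; the hosts, ports, class, and jar below are placeholders:

spark-submit --class com.example.App --master spark://master-host:7077 app.jar   # Standalone
spark-submit --class com.example.App --master mesos://master-host:5050 app.jar   # Mesos
spark-submit --class com.example.App --master yarn app.jar                       # YARN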
• By parallelizing a collection in your driver program, using SparkContext's parallelize method:
val data = Array(2, 4, 6, 8, 10)
val distData = sc.parallelize(data)
• By loading an external dataset from external storage such as HDFS, HBase, or a shared file system, as sketched below
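For example, loading a text file from external storage (the URI is a placeholder; any Hadoop-supported scheme works):

val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")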