Top Hadoop Interview Questions
• Managing traffic on streets
• Stream processing
• Content Management and Archiving Emails
• Processing Rat Brain Neuronal Signals using a Hadoop Computing Cluster
• Fraud detection and prevention
• Advertisement targeting platforms are using Hadoop to capture and analyze clickstream, transaction, video and social media data
• Managing content, posts, images and videos on social media platforms
• Analyzing customer data in real-time for improving business performance
• Public sector fields such as intelligence, defense, cyber security and scientific research
• Financial agencies are using Big Data Hadoop to reduce risk, analyze fraud patterns, identify rogue traders, more precisely target their marketing campaigns based on customer segmentation, and improve customer satisfaction
• Getting access to unstructured data like output from medical devices, doctor’s notes, lab results, imaging reports, medical correspondence, clinical data, and financial data
By contrast, in a relational database computing system you can query data in real time, but it is not efficient to store data in tables, records and columns when the data is huge.
Hadoop also provides a way to build a column-oriented database with Hadoop HBase, for real-time queries on rows.
1. Standalone Mode: The default mode of Hadoop, it uses the local file system for input and output operations. This mode is mainly used for debugging purposes, and it does not support the use of HDFS. Further, in this mode, no custom configuration is required for the mapred-site.xml, core-site.xml and hdfs-site.xml files. It is much faster when compared to the other modes.
2. Pseudo-Distributed Mode (Single Node Cluster): In this case, you need configuration for all three files mentioned above. All daemons run on one node, and thus both the Master and the Slave node are the same.
3. Fully Distributed Mode (Multiple Node Cluster): This is the production phase of Hadoop (what Hadoop is known for), where data is used and distributed across several nodes on a Hadoop cluster. Separate nodes are allotted as Master and Slaves.
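One quick, purely illustrative way to see which mode a client picks up is to print the configured default file system; it is file:/// in Standalone mode and an hdfs:// URI once core-site.xml points at HDFS (the class name below is made up for this sketch):

import org.apache.hadoop.conf.Configuration;

public class PrintDefaultFs {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // "file:///" in Standalone mode; something like "hdfs://localhost:9000" in Pseudo- or Fully Distributed mode
        System.out.println(conf.get("fs.defaultFS"));
    }
}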
Block 1: Datana
Block 2: lytix
Now consider the map task: it will read the first block from 'Da' till 'na', but it does not know how to process the second block at the same time, because the rest of the record sits in Block 2. This is where InputSplit comes into play: it forms a logical grouping of Block 1 and Block 2 as a single split. The framework then forms key-value pairs using the InputFormat and RecordReader and sends them to the map for further processing. With InputSplit, if you have limited resources, you can increase the split size to limit the number of maps. For instance, if a 640 MB file is stored as 10 blocks of 64 MB each and resources are limited, you can set the split size to 128 MB. This forms logical groups of 128 MB, with only 5 maps executing at a time.
However, if splitting is disabled for the file, the whole file forms one InputSplit and is processed by a single map, which consumes more time when the file is bigger.
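As a rough sketch of how the split size can be influenced from the driver (the 128 MB figure mirrors the example above; a Job object named "job" is assumed to already exist):

import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Ask for logical splits of roughly 128 MB instead of the 64 MB block size,
// so a 640 MB input produces about 5 splits and therefore about 5 map tasks.
FileInputFormat.setMinInputSplitSize(job, 128L * 1024 * 1024);
FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);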
Benefits of using distributed cache are:
• It distributes simple, read-only text/data files and/or complex types like jars, archives and others. These archives are then un-archived at the slave node.
• Distributed cache tracks the modification timestamps of the cache files, which indicates that the files should not be modified while a job is executing.
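A minimal sketch of how this is wired up with the newer MapReduce API (the file path is only a placeholder, and a Job object named "job" is assumed on the driver side):

import java.net.URI;
import org.apache.hadoop.fs.Path;

// Driver side: register a file that should be copied to every task node.
job.addCacheFile(new Path("/user/hadoop/cache/lookup.txt").toUri());

// Task side, inside setup(Context context) of a Mapper or Reducer:
URI[] cacheFiles = context.getCacheFiles();   // URIs of the files distributed to this slave node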
fsimage file - It keeps track of the latest checkpoint of the namespace.
edits file - It is a log of the changes that have been made to the namespace since the last checkpoint.
The Checkpoint NameNode has the same directory structure as the NameNode, and creates checkpoints for the namespace at regular intervals by downloading the fsimage and edits files and merging them in its local directory. The new image after merging is then uploaded back to the NameNode.
There is a node similar to the Checkpoint NameNode, commonly known as the Secondary NameNode, but it does not support the 'upload to NameNode' functionality.
The Backup Node provides similar functionality to the Checkpoint NameNode, but stays synchronized with the NameNode. It maintains an up-to-date in-memory copy of the file system namespace and does not need to fetch the changes at regular intervals. To create a new checkpoint, the Backup Node only needs to save its current in-memory state to an image file.
• Text Input Format: Default input format in Hadoop.
• Key Value Input Format: used for plain text files that are broken into lines, where each line is split into a key and a value by a separator (tab by default)
• Sequence File Input Format: used for reading SequenceFiles, Hadoop's binary key/value file format
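For instance, a driver might select one of these explicitly (a sketch assuming "job" and its Configuration "conf" already exist; the separator property shown is the one used by the newer mapreduce API):

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Split each input line into key and value at the tab character.
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");
job.setInputFormatClass(KeyValueTextInputFormat.class);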
The NameNode manages the replication of data blocks from one DataNode to another. In this process, the replicated data is transferred directly between DataNodes, so that the data never passes through the NameNode.
1. setup(): this method is used for configuring various parameters like the input data size and the distributed cache.
protected void setup(Context context)
2. reduce(): the heart of the reducer, called once per key with the associated list of values.
protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
3. cleanup(): this method is called only once, at the end of the task, to clean up temporary files.
protected void cleanup(Context context)
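Purely as an illustration (the class name and types are made up for this sketch), a word-count-style reducer that overrides all three methods could look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void setup(Context context) {
        // Read job-level parameters (Configuration values, distributed cache files, etc.) here.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();         // aggregate every value seen for this key
        }
        result.set(sum);
        context.write(key, result);     // emit one (key, total) pair per key
    }

    @Override
    protected void cleanup(Context context) {
        // Delete temporary files or release resources; runs once at the end of the task.
    }
}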
1. Uncompressed key/value records.
2. Record compressed key/value records – only ‘values’ are compressed here.
3. Block compressed key/value records – both keys and values are collected in ‘blocks’ separately and compressed. The size of the ‘block’ is configurable.
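For example, a block-compressed SequenceFile can be written as sketched below (the path and record are made up, and the code must live in a method that declares throws IOException):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.CompressionType;
import org.apache.hadoop.io.Text;

Configuration conf = new Configuration();
SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(new Path("/tmp/example.seq")),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(IntWritable.class),
        SequenceFile.Writer.compression(CompressionType.BLOCK));   // RECORD or NONE for the other two variants
writer.append(new Text("key1"), new IntWritable(1));
writer.close();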
• It is a process that runs on a separate node, often not on a DataNode
• Job Tracker communicates with the NameNode to identify data location
• Finds the best Task Tracker Nodes to execute tasks on given nodes
• Monitors individual Task Trackers and submits the overall job status back to the client.
• It tracks the execution of MapReduce workloads local to the slave node.
Row1: Welcome to
Row2: Datanalytix Blog
It will be read as “Welcome to Datanalytix Blog” using RecordReader.
It creates a duplicate task on another node, so the same input can be processed multiple times in parallel. When most tasks in a job come close to completion, the speculative execution mechanism schedules duplicate copies of the remaining (slower) tasks across the nodes that are currently free. When these tasks finish, the JobTracker is notified. If other copies are still executing speculatively, Hadoop tells the TaskTrackers to kill those tasks and discard their output.
Speculative execution is enabled by default in Hadoop. To disable it, set the mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution JobConf options to false.
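For example (a sketch using the property names quoted above; newer Hadoop releases expose the same switches as mapreduce.map.speculative and mapreduce.reduce.speculative):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf();
conf.setBoolean("mapred.map.tasks.speculative.execution", false);      // no speculative map tasks
conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);   // no speculative reduce tasks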
To delete the directory before running the job, you can use shell:
hadoop fs -rmr /path/to/your/output/
Or via the Java API: FileSystem.getLocal(conf).delete(outputDir, true);
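A slightly fuller sketch of the driver-side cleanup, checking for the directory before deleting it (the path is the same placeholder as above, and the code must run in a method that declares throws IOException):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outputDir = new Path("/path/to/your/output");
if (fs.exists(outputDir)) {
    fs.delete(outputDir, true);   // "true" deletes the directory recursively
}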
1. Run: “ps -ef | grep -i ResourceManager”
and look for log directory in the displayed result. Find out the job-id from the displayed list and check if there is any error message associated with that job.
2. On the basis of RM logs, identify the worker node that was involved in execution of the task.
3. Now, login to that node and run – “ps -ef | grep -i NodeManager”
4. Examine the Node Manager log. The majority of errors come from user level logs for each map-reduce job.
You can also modify the replication factor on a per-file basis using the Hadoop FS Shell:
[training@localhost ~]$ hadoop fs -setrep -w 3 /my/file
Alternatively, you can change the replication factor of all the files under a directory:
[training@localhost ~]$ hadoop fs -setrep -R -w 3 /my/dir
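The same change can also be made programmatically (a sketch; the path mirrors the shell example, and the code must run in a method that declares throws IOException):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

FileSystem fs = FileSystem.get(new Configuration());
fs.setReplication(new Path("/my/file"), (short) 3);   // per-file change; use the shell with -R for whole directories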
conf.setBoolean("mapreduce.map.output.compress", true);   // compress the intermediate map output
conf.setBoolean("mapreduce.output.fileoutputformat.compress", false);   // keep the final job output uncompressed
You can write your query for the data you want to import from Hive to HDFS. The output you receive will be stored in part files in the specified HDFS path.