Friday, August 2, 2013

Interview questions for Freshers and Experienced

What is a SequenceFile in Hadoop?

A. A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.
B. A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
C. A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D. A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.

Is there a map input format in Hadoop?

A.  Yes, but only in Hadoop 0.22+.
B.  Yes, there is a special format for map files.
C.  No, but sequence file input format can read map files.
D.  Both B and C are correct answers.

What happens if mapper output does not match reducer input in Hadoop?

A.  Hadoop API will convert the data to the type that is needed by the reducer.
B.  Data input/output inconsistency cannot occur. A preliminary validation check is executed prior to the full execution of the job to ensure there is consistency.
C.  The java compiler will report an error during compilation but the job will complete with exceptions.
D.  A real-time exception will be thrown and map-reduce job will fail.

Can you provide multiple input paths to a map-reduce job in Hadoop?

A.  Yes, but only in Hadoop 0.22+.
B.  No, Hadoop always operates on one input directory.
C.  Yes, developers can add any number of input paths.
D.  Yes, but the limit is currently capped at 10 input paths.

Can a custom type for data Map-Reduce processing be implemented in Hadoop?

A.  No, Hadoop does not provide techniques for custom datatypes.
B.  Yes, but only for mappers.
C.  Yes, custom data types can be implemented as long as they implement the Writable interface.
D.  Yes, but only for reducers.
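A minimal sketch of answer C in plain Java. In real Hadoop code the class would declare implements org.apache.hadoop.io.Writable; to keep the sketch dependency-free, this hypothetical PageView type just provides the two methods that interface requires, write(DataOutput) and readFields(DataInput), and round-trips itself through a byte buffer:

```java
import java.io.*;

// Sketch of a custom Hadoop key/value type. In real Hadoop code this class
// would declare "implements org.apache.hadoop.io.Writable"; here we keep the
// sketch dependency-free by providing only the two methods that interface
// requires: write(DataOutput) and readFields(DataInput).
public class PageView {
    private String url = "";
    private int hits;

    public PageView() {}  // Writables need a no-arg constructor for deserialization
    public PageView(String url, int hits) { this.url = url; this.hits = hits; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);      // serialize the fields in a fixed order
        out.writeInt(hits);
    }

    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();     // deserialize in exactly the same order
        hits = in.readInt();
    }

    public String getUrl() { return url; }
    public int getHits()   { return hits; }

    public static void main(String[] args) throws IOException {
        // Round-trip: write to a byte buffer, read back into a fresh object.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new PageView("/index.html", 42).write(new DataOutputStream(buf));
        PageView copy = new PageView();
        copy.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(copy.getUrl() + " " + copy.getHits()); // /index.html 42
    }
}
```

The fixed write/read order matters: Writables carry no field names on the wire, so readFields must mirror write exactly.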

The Hadoop API uses writable types such as LongWritable, Text, and IntWritable. They have almost the same features as the corresponding default Java classes. What are these writable data types optimized for?

A.  Writable data types are specifically optimized for network transmissions
B.  Writable data types are specifically optimized for file system storage
C.  Writable data types are specifically optimized for map-reduce processing
D.  Writable data types are specifically optimized for data retrieval

What is Writable in Hadoop?

A.  Writable is a java interface that needs to be implemented for streaming data to remote servers.
B.  Writable is a java interface that needs to be implemented for HDFS writes.
C.  Writable is a java interface that needs to be implemented for MapReduce processing.
D.  None of these answers are correct.

What is the best performance one can expect from a Hadoop cluster?

A.  The best performance expectation one can have is measured in seconds. This is because Hadoop can only be used for batch processing
B.  The best performance expectation one can have is measured in milliseconds. This is because Hadoop executes in parallel across so many machines
C.  The best performance expectation one can have is measured in minutes. This is because Hadoop can only be used for batch processing
D.  It depends on the design of the map-reduce program, how many machines are in the cluster, and the amount of data being retrieved.

What is distributed cache in Hadoop?

A.  The distributed cache is a special component on the namenode that will cache frequently used data for faster client response. It is used during the reduce step.
B.  The distributed cache is a special component on the datanode that will cache frequently used data for faster client response. It is used during the map step.
C.  The distributed cache is a component that caches java objects.
D.  The distributed cache is a component that allows developers to deploy jars for Map-Reduce processing.

Can you run Map-Reduce jobs directly on Avro data in Hadoop?

A.  Yes, Avro was specifically designed for data processing via Map-Reduce
B.  Yes, but additional extensive coding is required
C.  No, Avro was specifically designed for data storage only
D.  Avro specifies metadata that allows easier data access. This data cannot be used as part of map-reduce execution; rather, it is for input specification only.


Wednesday, July 31, 2013

Java Interview questions for Hadoop developer

Java interview questions for Hadoop developer

Q1. Explain the difference between a class variable and an instance variable, and how they are declared in Java

Ans: A class variable is a variable declared with the static modifier. An instance variable is a variable in a class without the static modifier.
The main difference between a class variable and an instance variable is that memory for class variables is allocated only once, when the class is first loaded into memory. Class variables therefore do not depend on objects of the class: whatever the number of objects, only one copy is created, at class-loading time.
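A minimal illustration of that difference, using a hypothetical Counter class: the static field is shared by all instances, while each object gets its own copy of the instance field.

```java
// One copy of a static (class) variable is shared by all instances,
// while each instance gets its own copy of an instance variable.
public class Counter {
    static int created = 0;   // class variable: one copy for the whole class
    int id;                   // instance variable: one copy per object

    Counter() {
        created++;            // every constructor call updates the shared copy
        id = created;         // but each object keeps its own id
    }

    public static void main(String[] args) {
        Counter a = new Counter();
        Counter b = new Counter();
        System.out.println(Counter.created);    // 2 — shared across instances
        System.out.println(a.id + " " + b.id);  // 1 2 — per-object state
    }
}
```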

Q2. Explain Encapsulation, Inheritance and Polymorphism

Ans: Encapsulation is the process of binding or wrapping the data and the code that operates on that data into a single entity. This keeps the data safe from outside interference and misuse. One way to think about encapsulation is as a protective wrapper that prevents code and data from being arbitrarily accessed by other code defined outside the wrapper.
Inheritance is the process by which one object acquires the properties of another object.
Polymorphism means something like "one name, many forms". Polymorphism enables one entity to be used as a general category for different types of actions; the specific action is determined by the exact nature of the situation. The concept of polymorphism can be summed up as "one interface, multiple methods".
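"One interface, multiple methods" can be sketched with a hypothetical Shape hierarchy: the caller codes against one interface, and dynamic dispatch picks the right method at runtime.

```java
// The same call site invokes different behavior depending on
// the runtime type of the object.
interface Shape {
    double area();
}

class Square implements Shape {
    double side;
    Square(double side) { this.side = side; }
    public double area() { return side * side; }
}

class Circle implements Shape {
    double r;
    Circle(double r) { this.r = r; }
    public double area() { return Math.PI * r * r; }
}

public class PolyDemo {
    // This method knows only the Shape interface, not the concrete classes.
    static double totalArea(Shape[] shapes) {
        double sum = 0;
        for (Shape s : shapes) sum += s.area(); // dynamic dispatch per element
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(totalArea(new Shape[] { new Square(2), new Circle(1) }));
    }
}
```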

Q3. Explain garbage collection?

Ans: Garbage collection is one of the most important features of Java.
Garbage collection is also called automatic memory management, since the JVM automatically removes unused objects from memory. A user program cannot directly free an object from memory; instead, it is the job of the garbage collector to automatically free objects that are no longer referenced by the program. Every class inherits the finalize() method from java.lang.Object; the finalize() method is called by the garbage collector when it determines that no more references to the object exist. In Java, it is a good idea to explicitly assign null to a variable that is no longer in use.

Q4. What are the similarities/differences between an abstract class and an interface?

Ans: Differences:
- Interfaces provide a form of multiple inheritance; a class can extend only one other class.
- Interfaces are limited to public methods and constants with no implementation. Abstract classes can have a partial implementation, protected parts, static methods, etc.
- A class may implement several interfaces, but it may extend only one abstract class.
- Interface method calls can be slower, as they require extra indirection to find the corresponding method in the actual class; abstract-class method calls are faster.
- Similarity: neither abstract classes nor interfaces can be instantiated.

Q5. What are the different ways to make your class multithreaded in Java?

Ans: There are two ways to create new kinds of threads:
- Define a new class that extends the Thread class
- Define a new class that implements the Runnable interface, and pass an object of that class to a Thread’s constructor.
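Both ways side by side, as a small sketch (the ThreadDemo class and its messages are illustrative). Implementing Runnable is usually preferred, since it keeps the task separate from the thread that runs it and leaves the class free to extend something else.

```java
// The two standard ways to start a thread.
public class ThreadDemo {
    static final StringBuffer log = new StringBuffer(); // StringBuffer appends are thread-safe

    // Way 1: subclass Thread and override run()
    static class Greeter extends Thread {
        public void run() { log.append("hello "); }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Greeter();

        // Way 2: implement Runnable and pass it to a Thread's constructor
        Thread t2 = new Thread(new Runnable() {
            public void run() { log.append("world "); }
        });

        t1.start();
        t1.join();   // wait so the two messages appear in order
        t2.start();
        t2.join();
        System.out.println(log); // prints "hello world "
    }
}
```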

Q6. What do you understand by synchronization? How do you synchronize a method call in Java? How do you synchronize a block of code in Java?

Ans: Synchronization is the process of controlling access to shared resources by multiple threads in such a manner that only one thread can access a given resource at a time. In a non-synchronized multithreaded application, it is possible for one thread to modify a shared object while another thread is in the process of using or updating the object’s value. Synchronization prevents this kind of data corruption.
- Synchronizing a method: put the keyword synchronized as part of the method declaration
- Synchronizing a block of code inside a method: put the block of code in synchronized (this) { some code }
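Both forms in a minimal sketch (the SyncDemo class and its counter are illustrative): without synchronized, two threads incrementing a shared counter can interleave their read-modify-write steps and lose updates.

```java
// Only one thread at a time may hold the object's lock, so the
// increments below never interleave mid-update.
public class SyncDemo {
    private int count = 0;

    // Synchronizing a method: the lock is acquired on entry, released on exit.
    public synchronized void increment() { count++; }

    // Synchronizing a block: same effect, but the locked region is narrower.
    public void incrementBlock() {
        synchronized (this) { count++; }
    }

    public int get() { return count; }

    public static void main(String[] args) throws InterruptedException {
        final SyncDemo demo = new SyncDemo();
        Runnable task = new Runnable() {
            public void run() {
                for (int i = 0; i < 10000; i++) demo.increment();
            }
        };
        Thread a = new Thread(task), b = new Thread(task);
        a.start(); b.start();
        a.join();  b.join();
        System.out.println(demo.get()); // 20000 — no lost updates
    }
}
```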

Q7. What is a transient variable?

Ans: A transient variable cannot be serialized. For example, if a variable is declared transient in a Serializable class and the class is written to an ObjectStream, the value of the variable cannot be written to the stream; instead, when the class is retrieved from the ObjectStream, the value of the variable becomes null.
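This can be demonstrated with a small round trip through Java serialization (the Session class and its fields are illustrative): the transient field is skipped on write and comes back with its default value, null for object references.

```java
import java.io.*;

// A transient field is skipped during serialization; after deserialization
// it holds its default value (null for references, 0 for primitives).
public class Session implements Serializable {
    String user;
    transient String password;   // never written to the stream

    Session(String user, String password) { this.user = user; this.password = password; }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(buf);
        oos.writeObject(new Session("alice", "secret"));
        oos.close();

        Session restored = (Session) new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray())).readObject();
        System.out.println(restored.user + " " + restored.password); // alice null
    }
}
```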

Q8. What is the Properties class in Java? Which class does it extend?

Ans: The Properties class represents a persistent set of properties. The Properties can be saved to a stream or loaded from a stream. Each key and its corresponding value in the property list is a string. The Properties class extends Hashtable.
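The save/load round trip looks like this (the key and value are illustrative; store and load are the standard java.util.Properties methods):

```java
import java.io.*;
import java.util.Properties;

// Properties is a persistent string-to-string map (it extends Hashtable).
// Here we save a property list to a character stream and load it back.
public class PropsDemo {
    public static void main(String[] args) throws IOException {
        Properties p = new Properties();
        p.setProperty("fs.defaultFS", "hdfs://localhost:9000");

        StringWriter out = new StringWriter();
        p.store(out, "sample config");        // persist to a stream

        Properties loaded = new Properties();
        loaded.load(new StringReader(out.toString()));
        System.out.println(loaded.getProperty("fs.defaultFS"));
    }
}
```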

Q9. Explain the concept of shallow copy vs deep copy in Java

Ans: In a shallow copy, the cloned object refers to the same objects as the original, since only the object references are copied and not the referred objects themselves.
In a deep copy, a clone of the class and of all objects referred to by that class is made.

Q10. How can you make a shallow copy of an object in Java?

Ans: Use the clone() method inherited from the Object class.

Q11. How would you make a copy of an entire Java object (deep copy) with its state?

Ans: Have the class implement the Cloneable interface and call its clone() method.

Thursday, July 25, 2013

More questions

1. Does Hadoop require SSH?

2. I am seeing connection refused in the logs. How do I troubleshoot this?

3. Why do I see broken images in the jobdetails.jsp page?

4. How do I give the final output file a desired name rather than partition names like part-00000, part-00001?

5. When writing a new InputFormat, what is the format of the array of strings returned by InputSplit#getLocations()?

6. Does the name-node stay in safe mode till all under-replicated files are fully replicated?

Wednesday, July 24, 2013

A discussion on Hadoop

1. Consider you are uploading a file of 300 MB into HDFS, and 200 MB has been successfully uploaded, when another client simultaneously wants to read the uploaded data (uploading is still continuing). What happens in this situation?
                 a) An exception arises
                 b) Data will be displayed successfully
                 c) Uploading is interrupted
                 d) The uploaded 200 MB will be displayed

2. Why should you stop all the Task Trackers while decommissioning nodes in a Hadoop cluster?
                 a) To overcome the situation of speculative execution
                 b) To avoid external interference on the new nodes
                 c) In order to make the Namenode identify the new nodes
                 d) The JobTracker receives heartbeats from new nodes only when it is restarted
3. When does the Namenode enter safe mode?
                  a) When 80% of its metadata is filled
                  b) When the minimum replication factor is reached
                  c) Both
                  d) When the edit log is full

Tuesday, July 23, 2013

Hadoop FAQs

1. You are given a directory SampleDir of files containing the following: first.txt, _second.txt, .third.txt, #fourth.txt. If you provide SampleDir to the MR job, how many files are processed?

2. You have an external jar file of size 1.3 MB that has the required dependencies to run your MR job. What steps do you take to copy the jar file to the task tracker?

3. When a job is run, your properties files are copied to the distributed cache in order for your map jobs to access them. How do you access the property file?

4. If you have m mappers and n reducers in a given job, the shuffle and sort algorithm will result in how many copy and write operations?

5. You have 100 map tasks running, out of which 99 have completed and one task is running slow. The system replicates the slower running task on a different machine, output is collected from the first completed map task, and the rest of the map task copies are killed. What is this phenomenon?

Monday, July 22, 2013

Hadoop in ETL process

Traditional ETL architectures can no longer provide the scalability required by the business at an affordable cost. That's why many organizations are turning to Hadoop. But Hadoop alone is not a data integration solution. Performing even the simplest ETL tasks requires mastering disparate tools and writing hundreds of lines of code. A Hadoop ETL solution provides a smarter approach, turning your Hadoop environment into a complete data integration solution!

Everything you need for Hadoop ETL. No coding, No Tuning, No Kidding!

  • Connect to any data source or target
  • Exploit mainframe data
  • Develop MapReduce ETL jobs without coding
  • Jump-start your Hadoop productivity with use case accelerators to help you build common ETL tasks
  • Build, re-use, and check impact analysis with enhanced metadata capabilities
  • Optimize performance and efficiency of each individual node
  • Never tune again

Apache Hadoop has two main subprojects:

  • MapReduce - The framework that understands and assigns work to the nodes in a cluster.
  • HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive, and ZooKeeper, that extend the value of Hadoop and improve its usability.

So what’s the big deal?

Hadoop changes the economics and the dynamics of large scale computing. Its impact can be boiled down to four salient characteristics. 

About Hadoop®

Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Hadoop enables a computing solution that is:
  • Scalable – New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
  • Cost effective – Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
  • Flexible – Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
  • Fault tolerant – When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

Think Hadoop is right for you?

Eighty percent of the world’s data is unstructured, and most businesses don’t even attempt to use this data to their advantage. Imagine if you could afford to keep all the data generated by your business? Imagine if you had a way to analyze that data?