Friday, August 2, 2013

Interview questions for Freshers and Experienced

What is a SequenceFile in Hadoop?

A.  A SequenceFile contains a binary encoding of an arbitrary number of homogeneous writable objects.
B.  A SequenceFile contains a binary encoding of an arbitrary number of heterogeneous writable objects.
C.  A SequenceFile contains a binary encoding of an arbitrary number of WritableComparable objects, in sorted order.
D.  A SequenceFile contains a binary encoding of an arbitrary number of key-value pairs. Each key must be the same type. Each value must be the same type.
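As a quick illustration of option D, here is a minimal sketch that writes a SequenceFile where every key is an IntWritable and every value a Text. It uses the classic SequenceFile.createWriter signature; the output path and the key/value types are only examples.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/tmp/sample.seq");   // illustrative output path

        // One key type and one value type for the whole file, as option D states.
        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, path, IntWritable.class, Text.class);
        try {
            for (int i = 0; i < 5; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        } finally {
            writer.close();
        }
    }
}
```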

Is there a map input format in Hadoop?

A.  Yes, but only in Hadoop 0.22+.
B.  Yes, there is a special format for map files.
C.  No, but sequence file input format can read map files.
D.  Both B and C are correct answers.

What happens if mapper output does not match reducer input in Hadoop?

A.  Hadoop API will convert the data to the type that is needed by the reducer.
B.  Data input/output inconsistency cannot occur. A preliminary validation check is executed prior to the full execution of the job to ensure there is consistency.
C.  The java compiler will report an error during compilation but the job will complete with exceptions.
D.  A real-time exception will be thrown and map-reduce job will fail.

Can you provide multiple input paths to a map-reduce job in Hadoop?

A.  Yes, but only in Hadoop 0.22+.
B.  No, Hadoop always operates on one input directory.
C.  Yes, developers can add any number of input paths.
D.  Yes, but the limit is currently capped at 10 input paths.
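For reference, here is a small sketch of how multiple input paths are typically added through FileInputFormat. The directory names are illustrative, and the exact Job construction call may differ slightly between Hadoop versions.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MultiInputJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "multi-input");

        // Any number of input paths can be added, one at a time ...
        FileInputFormat.addInputPath(job, new Path("/data/logs/2013/07"));
        FileInputFormat.addInputPath(job, new Path("/data/logs/2013/08"));
        // ... or as a comma-separated list.
        FileInputFormat.addInputPaths(job, "/data/extra1,/data/extra2");

        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        // Mapper/reducer classes would be set here before submitting the job.
    }
}
```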

 
Can a custom data type be implemented for Map-Reduce processing in Hadoop?

A.  No, Hadoop does not provide techniques for custom datatypes.
B.  Yes, but only for mappers.
C.  Yes, custom data types can be implemented as long as they implement the Writable interface.
D.  Yes, but only for reducers.
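To illustrate option C, a minimal custom Writable might look like the sketch below. The class and field names are made up; if the type is also used as a map output key, it should implement WritableComparable instead so it can be sorted.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// A custom value type: it only needs to serialize and deserialize its own fields.
public class PageView implements Writable {
    private long timestamp;
    private int httpStatus;

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeLong(timestamp);
        out.writeInt(httpStatus);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        timestamp = in.readLong();
        httpStatus = in.readInt();
    }
}
```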

The Hadoop API uses basic Java types such as LongWritable, Text, and IntWritable. They have almost the same features as the default Java classes. What are these Writable data types optimized for?

A.  Writable data types are specifically optimized for network transmissions
B.  Writable data types are specifically optimized for file system storage
C.  Writable data types are specifically optimized for map-reduce processing
D.  Writable data types are specifically optimized for data retrieval

What is Writable in Hadoop?

A.  Writable is a java interface that needs to be implemented for streaming data to remote servers.
B.  Writable is a java interface that needs to be implemented for HDFS writes.
C.  Writable is a java interface that needs to be implemented for MapReduce processing.
D.  None of these answers are correct.


What is the best performance one can expect from a Hadoop cluster?

A.  The best performance expectation one can have is measured in seconds. This is because Hadoop can only be used for batch processing
B.  The best performance expectation one can have is measured in milliseconds. This is because Hadoop executes in parallel across so many machines
C.  The best performance expectation one can have is measured in minutes. This is because Hadoop can only be used for batch processing
D.  It depends on the design of the map-reduce program, how many machines are in the cluster, and the amount of data being retrieved

What is distributed cache in Hadoop?

A.  The distributed cache is a special component on the namenode that will cache frequently used data for faster client response. It is used during the reduce step.
B.  The distributed cache is a special component on the datanode that will cache frequently used data for faster client response. It is used during the map step.
C.  The distributed cache is a component that caches java objects.
D.  The distributed cache is a component that allows developers to deploy jars for Map-Reduce processing.
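As a hedged sketch of option D in practice, the Hadoop 2 Job API lets you ship side files and jars to the task nodes through the distributed cache roughly like this. The paths and file names below are hypothetical.

```java
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "with-cache");

        // Ship a small lookup file to every task node; tasks can read it locally
        // (in Hadoop 2 via context.getCacheFiles() in the mapper/reducer).
        job.addCacheFile(new URI("/apps/lookup/countries.txt#countries"));

        // Put an extra jar on the task classpath through the distributed cache.
        job.addFileToClassPath(new Path("/apps/libs/extra-deps.jar"));

        // Mapper/reducer/input/output would be configured here as usual.
    }
}
```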

 
Can you run Map-Reduce jobs directly on Avro data in Hadoop?

A.  Yes, Avro was specifically designed for data processing via Map-Reduce
B.  Yes, but additional extensive coding is required
C.  No, Avro was specifically designed for data storage only
D.  Avro specifies metadata that allows easier data access. This data cannot be used as part of map-reduce execution; it is used for input specification only.

 

Wednesday, July 31, 2013

Java Interview questions for Hadoop developer


Q1. Explain the difference between a class variable and an instance variable, and how they are declared in Java

Ans: A class variable is a variable declared with the static modifier. An instance variable is a variable declared in a class without the static modifier.
The main difference between a class variable and an instance variable is that memory for class variables is allocated only once, when the class is first loaded into memory. That means class variables do not depend on the objects of that class: however many objects exist, only one copy is created, at class-loading time.
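A short example of the difference (class and field names are illustrative):

```java
public class Counter {
    static int totalCreated = 0;  // class variable: one copy, allocated at class load time
    int id;                       // instance variable: one copy per object

    Counter() {
        totalCreated++;           // shared by all instances
        id = totalCreated;        // unique to this instance
    }

    public static void main(String[] args) {
        new Counter();
        new Counter();
        System.out.println(Counter.totalCreated); // prints 2
    }
}
```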

Q2. Explain Encapsulation, Inheritance and Polymorphism

Ans: Encapsulation is a process of binding or wrapping the data and the code that operates on the data into a single entity. This keeps the data safe from outside interference and misuse. One way to think about encapsulation is as a protective wrapper that prevents code and data from being arbitrarily accessed by other code defined outside the wrapper.
Inheritance is the process by which one object acquires the properties of another object.
The meaning of polymorphism is something like "one name, many forms". Polymorphism enables one entity to be used as a general category for different types of actions; the specific action is determined by the exact nature of the situation. The concept of polymorphism can be explained as “one interface, multiple methods”.
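A small sketch of “one interface, multiple methods” that also shows inheritance (all class names are made up):

```java
// The call site stays the same; the behavior depends on the runtime type.
abstract class Shape {
    abstract double area();
}

class Circle extends Shape {
    private final double r;
    Circle(double r) { this.r = r; }
    @Override double area() { return Math.PI * r * r; }
}

class Square extends Shape {
    private final double side;
    Square(double side) { this.side = side; }
    @Override double area() { return side * side; }
}

public class PolymorphismDemo {
    public static void main(String[] args) {
        Shape[] shapes = { new Circle(1.0), new Square(2.0) };
        for (Shape s : shapes) {
            System.out.println(s.area()); // same call, different implementations
        }
    }
}
```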

Q3. Explain garbage collection?

Ans: Garbage collection is one of the most important features of Java.
Garbage collection is also called automatic memory management, as the JVM automatically removes unused objects from memory. A user program cannot directly free an object from memory; instead it is the job of the garbage collector to automatically free the objects that are no longer referenced by the program. Every class inherits the finalize() method from java.lang.Object; the finalize() method is called by the garbage collector when it determines that no more references to the object exist. In Java, it is a good idea to explicitly assign null to a variable when it is no longer in use.

Q4. What are the similarities/differences between an abstract class and an interface?

Ans: Differences
- Interfaces provide a form of multiple inheritance; a class can extend only one other class.
- Interfaces are limited to public methods and constants with no implementation. Abstract classes can have a partial implementation, protected parts, static methods, etc.
- A class may implement several interfaces, but a class may extend only one abstract class.
- Interfaces are slower as they require extra indirection to find the corresponding method in the actual class. Abstract classes are faster.
Similarities
- Neither abstract classes nor interfaces can be instantiated.
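A compact sketch of these points (pre-Java 8 interfaces, as described above; all names are made up):

```java
// Interface: only the contract.
interface Movable {
    int MAX_SPEED = 120;      // implicitly public static final
    void move();              // implicitly public abstract
}

// Abstract class: may carry state and a partial implementation.
abstract class Vehicle {
    protected int speed;                            // protected state is allowed
    void accelerate(int delta) { speed += delta; }  // concrete helper method
    abstract void stop();                           // left to subclasses
}

// A class extends at most one abstract class but may implement several interfaces.
class Car extends Vehicle implements Movable {
    public void move() { accelerate(10); }
    void stop() { speed = 0; }
}
```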

Q5. What are the different ways to make your class multithreaded in Java?

Ans: There are two ways to create new kinds of threads:
- Define a new class that extends the Thread class
- Define a new class that implements the Runnable interface, and pass an object of that class to a Thread’s constructor.
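Both approaches in a short sketch (class names are illustrative):

```java
public class ThreadDemo {
    // Way 1: extend Thread and override run().
    static class Worker extends Thread {
        @Override public void run() {
            System.out.println("running in " + getName());
        }
    }

    // Way 2: implement Runnable and pass an instance to a Thread's constructor.
    static class Task implements Runnable {
        @Override public void run() {
            System.out.println("running in " + Thread.currentThread().getName());
        }
    }

    public static void main(String[] args) {
        new Worker().start();
        new Thread(new Task()).start();
    }
}
```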

Q6. What do you understand by synchronization? How do you synchronize a method call in Java? How do you synchronize a block of code in Java?

Ans: Synchronization is a process of controlling the access to shared resources by multiple threads in such a manner that only one thread can access one resource at a time. In a non-synchronized multithreaded application, it is possible for one thread to modify a shared object while another thread is in the process of using or updating the object’s value. Synchronization prevents this type of data corruption.
- Synchronizing a method: Put keyword synchronized as part of the method declaration
- Synchronizing a block of code inside a method: Put block of code in synchronized (this) { Some Code }
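A minimal example of both forms (the class is illustrative):

```java
public class SafeCounter {
    private int count;

    // Synchronizing a method: the whole body runs under the object's lock.
    public synchronized void increment() {
        count++;
    }

    // Synchronizing a block: only the critical section is guarded.
    public void add(int delta) {
        // non-critical work could happen here without holding the lock
        synchronized (this) {
            count += delta;
        }
    }

    public synchronized int get() {
        return count;
    }
}
```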

Q7. What is a transient variable?

Ans: A transient variable can’t be serialized. For example, if a variable is declared as transient in a Serializable class and the class is written to an ObjectStream, the value of the variable can’t be written to the stream; instead, when the object is read back from the ObjectStream, the value of the variable becomes null.
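A small sketch showing the effect (class and field names are illustrative; the transient reference comes back as null after deserialization):

```java
import java.io.*;

public class User implements Serializable {
    private String name;
    private transient String password;  // skipped during serialization

    User(String name, String password) {
        this.name = name;
        this.password = password;
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(buf);
        out.writeObject(new User("alice", "secret"));
        out.close();

        User copy = (User) new ObjectInputStream(
                new ByteArrayInputStream(buf.toByteArray())).readObject();
        System.out.println(copy.name);      // "alice"
        System.out.println(copy.password);  // null -- the transient field was not written
    }
}
```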

Q8. What is the Properties class in Java? Which class does it extend?

Ans: The Properties class represents a persistent set of properties. The Properties can be saved to a stream or loaded from a stream. Each key and its corresponding value in the property list is a String. The Properties class extends Hashtable.
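A short usage sketch (the property key and file name are just examples):

```java
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.util.Properties;

public class PropertiesDemo {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("fs.default.name", "hdfs://localhost:9000"); // keys and values are Strings

        // Persist the properties to a stream ...
        FileOutputStream out = new FileOutputStream("cluster.properties");
        props.store(out, "sample config");
        out.close();

        // ... and load them back.
        Properties loaded = new Properties();
        FileInputStream in = new FileInputStream("cluster.properties");
        loaded.load(in);
        in.close();
        System.out.println(loaded.getProperty("fs.default.name"));
    }
}
```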

Q9. Explain the concept of shallow copy vs deep copy in Java

Ans: In case of shallow copy, the cloned object also refers to the same objects the original object refers to, as only the object references get copied and not the referred objects themselves.
In case of deep copy, a clone of the object and of all the objects referred to by that object is made.

Q10. How can you make a shallow copy of an object in Java?

Ans: Use the clone() method inherited from the Object class.

Q11. How would you make a copy of an entire Java object (deep copy) with its state?

Ans: Have the class implement the Cloneable interface and call its clone() method.
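A combined sketch for Q9-Q11 (class and field names are illustrative): the default Object.clone() gives a shallow copy, while a deep copy additionally copies the referenced objects.

```java
import java.util.ArrayList;
import java.util.List;

public class Order implements Cloneable {
    List<String> items = new ArrayList<String>();

    // Shallow copy: Object.clone() copies field values, so the clone
    // shares the same 'items' list with the original.
    @Override
    public Order clone() throws CloneNotSupportedException {
        return (Order) super.clone();
    }

    // Deep copy: also copy the referenced objects.
    public Order deepCopy() {
        Order copy = new Order();
        copy.items = new ArrayList<String>(this.items);
        return copy;
    }

    public static void main(String[] args) throws Exception {
        Order original = new Order();
        original.items.add("book");

        Order shallow = original.clone();
        Order deep = original.deepCopy();

        original.items.add("pen");
        System.out.println(shallow.items.size()); // 2 -- the list is shared
        System.out.println(deep.items.size());    // 1 -- independent copy
    }
}
```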

Thursday, July 25, 2013

More questions

1. Does Hadoop require SSH?

2. I am seeing connection refused in the logs. How do I troubleshoot this?

3. Why do I see broken images in the jobdetails.jsp page?

4. How do I change the final output file name to a desired name rather than the default partition names like part-00000, part-00001?

5. When writing a new InputFormat, what is the format for the array of strings returned by InputSplit#getLocations()?

6. Does the name-node stay in safe mode till all under-replicated files are fully replicated?

Wednesday, July 24, 2013

A discussion on Hadoop

1. Consider you are uploading a 300 MB file into HDFS, and 200 MB has been successfully uploaded when another client simultaneously wants to read the uploaded data (the upload is still in progress). What happens in this situation?
                 a) An exception arises
                 b) Data will be displayed successfully
                 c) Uploading is interrupted
                 d) The uploaded 200 MB will be displayed

2. Why should you stop all the Task Trackers while decommissioning the nodes in a Hadoop cluster?
                 a) To overcome the situation of speculative execution
                 b) To avoid external interference on the new nodes
                 c) In order to make the new nodes identifiable to the Namenode
                 d) JobTracker receives heartbeats from new nodes only when it is restarted
3. When does your Namenode enter safe mode?
                  a) When 80% of its metadata is filled
                  b) When the minimum replication factor is reached
                  c) Both
                  d) When the edit log is full

Tuesday, July 23, 2013

Hadoop FAQs


1. You are given a directory SampleDir containing the following files: first.txt, _second.txt, .third.txt, #fourth.txt. If you provide SampleDir to the MR job, how many files are processed?

2. You have an external jar file of size 1.3 MB that has the required dependencies to run your MR job. What steps do you take to copy the jar file to the task tracker?



3. When a job is run, your properties file is copied to the distributed cache so that your map tasks can access it. How do you access the property file?
 
4. If you have m mappers and n reducers in a given job, how many copy and write operations will the shuffle and sort phase result in?
 
5. You have 100 map tasks running, out of which 99 have completed and one task is running slowly. The system replicates the slower task on a different machine, output is collected from the first completed map task, and the remaining duplicate attempt is killed. What is this phenomenon?

Monday, July 22, 2013

Hadoop in ETL process

Traditional ETL architectures can no longer provide the scalability required by the business at an affordable cost. That's why many organizations are turning to Hadoop. But Hadoop alone is not a data integration solution. Performing even the simplest ETL tasks requires mastering disparate tools and writing hundreds of lines of code. A Hadoop ETL solution provides a smarter approach, turning your Hadoop environment into a complete data integration solution!

Everything you need for Hadoop ETL. No coding, No Tuning, No Kidding!

  • Connect to any data source or target
  • Exploit mainframe data
  • Develop MapReduce ETL jobs without coding
  • Jump-start your Hadoop productivity with use case accelerators to help you build common ETL tasks
  • Build, re-use, and check impact analysis with enhanced metadata capabilities
  • Optimize performance and efficiency of each individual node
  • Never tune again

Apache Hadoop has two main subprojects:

  • MapReduce - The framework that understands and assigns work to the nodes in a cluster.
  • HDFS - A file system that spans all the nodes in a Hadoop cluster for data storage. It links together the file systems on many local nodes to make them into one big file system. HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
Hadoop is supplemented by an ecosystem of Apache projects, such as Pig, Hive and ZooKeeper, that extend the value of Hadoop and improve its usability.

So what’s the big deal?

Hadoop changes the economics and the dynamics of large scale computing. Its impact can be boiled down to four salient characteristics. 

About Hadoop®

Apache™ Hadoop® is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance. Rather than relying on high-end hardware, the resiliency of these clusters comes from the software’s ability to detect and handle failures at the application layer.
Hadoop enables a computing solution that is:
  • Scalable – New nodes can be added as needed, and added without needing to change data formats, how data is loaded, how jobs are written, or the applications on top.
  • Cost effective – Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the cost per terabyte of storage, which in turn makes it affordable to model all your data.
  • Flexible – Hadoop is schema-less, and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways enabling deeper analyses than any one system can provide.
  • Fault tolerant – When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

Think Hadoop is right for you?

Eighty percent of the world’s data is unstructured, and most businesses don’t even attempt to use this data to their advantage. Imagine if you could afford to keep all the data generated by your business? Imagine if you had a way to analyze that data?

Sunday, July 21, 2013


What is big data?

Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data in the world today has been created in the last two years alone. This data comes from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals to name a few. This data is big data.

Big data in action

What types of business problems can a big data platform help you address? There are multiple uses for big data in every industry – from analyzing larger volumes of data than was previously possible to drive more precise answers, to analyzing data in motion to capture opportunities that were previously lost. A big data platform will enable your organization to tackle complex problems that previously could not be solved.

Big data spans three dimensions: Volume, Velocity, Variety.


Volume: Enterprises are awash with ever-growing data of all types, easily amassing terabytes, even petabytes, of information.
  • Turn 12 terabytes of Tweets created each day into improved product sentiment analysis
  • Convert 350 billion annual meter readings to better predict power consumption
Velocity: Sometimes 2 minutes is too late. For time-sensitive processes such as catching fraud, big data must be used as it streams into your enterprise in order to maximize its value.
  • Scrutinize 5 million trade events created each day to identify potential fraud
  • Analyze 500 million daily call detail records in real-time to predict customer churn faster
Variety: Big data is any type of data - structured and unstructured data such as text, sensor data, audio, video, click streams, log files and more. New insights are found when analyzing these data types together.
  • Monitor 100’s of live video feeds from surveillance cameras to target points of interest
  • Exploit the 80% data growth in images, video and documents to improve customer satisfaction

Big data = Big Return on Investment (ROI)

While there is a lot of buzz about big data in the market, it isn’t hype. Plenty of customers are seeing tangible ROI using IBM solutions to address their big data challenges:
  • Healthcare: 20% decrease in patient mortality by analyzing streaming patient data
  • Telco: 92% decrease in processing time by analyzing networking and call data
  • Utilities: 99% improved accuracy in placing power generation resources by analyzing 2.8 petabytes of untapped data

The 5 game changing big data use cases

What is a use case?

                        A use case helps you solve a specific business challenge by using patterns or examples of technology solutions. Your use case, customized for your unique issue, provides answers to your business problem.

While much of the big data activity in the market up to now has been experimenting and learning about big data technologies, IBM has been focused on also helping organizations understand what problems big data can address.

We’ve identified the top 5 high value use cases that can be your first step into big data:

  1. Big Data Exploration

    Find, visualize, understand all big data to improve decision making. Big data exploration addresses the challenge that every large organization faces: information is stored in many different systems and silos and people need access to that data to do their day-to-day work and make important decisions.


  2. Enhanced 360° View of the Customer

    Extend existing customer views by incorporating additional internal and external information sources. Gain a full understanding of customers—what makes them tick, why they buy, how they prefer to shop, why they switch, what they’ll buy next, and what factors lead them to recommend a company to others.

  3. Security/Intelligence Extension

    Lower risk, detect fraud and monitor cyber security in real time. Augment and enhance cyber security and intelligence analysis platforms with big data technologies to process and analyze new types (e.g. social media, emails, sensors, Telco) and sources of under-leveraged data to significantly improve intelligence, security and law enforcement insight

  4. Operations Analysis

    Analyze a variety of machine and operational data for improved business results. The abundance and growth of machine data, which can include anything from IT machines to sensors, meters and GPS devices, requires complex analysis and correlation across different types of data sets. By using big data for operations analysis, organizations can gain real-time visibility into operations, customer experience, transactions and behavior.

  5. Data Warehouse Augmentation

    Integrate big data and data warehouse capabilities to increase operational efficiency. Optimize your data warehouse to enable new types of analysis. Use big data technologies to set up a staging area or landing zone for your new data before determining what data should be moved to the data warehouse. Offload infrequently accessed or aged data from warehouse and application databases using information integration software and tools.

An introduction to Apache Hadoop

Hadoop is designed to run on commodity hardware and can scale up or down without system interruption. It consists of three main functions: storage, processing and resource management.
  1. Processing – MapReduce
    Computation in Hadoop is based on the MapReduce paradigm that distributes tasks across a cluster of coordinated “nodes.” It was designed to run on commodity hardware and to scale up or down without system interruption.
  2. Storage – HDFS
    Storage is accomplished with the Hadoop Distributed File System (HDFS) – a reliable and distributed file system that allows large volumes of data to be stored and rapidly accessed across large clusters of commodity servers.
  3. Resource Management – YARN (New in Hadoop 2.0)
    YARN performs the resource management function in Hadoop 2.0 and extends MapReduce capabilities by supporting non-MapReduce workloads associated with other programming models. The YARN based architecture of Hadoop 2 is the most significant change introduced to the Hadoop project.
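To make the MapReduce paradigm concrete, here is the classic word-count job as a sketch using the standard Hadoop 2 mapreduce API; the input and output paths are passed as command-line arguments.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);              // emit (word, 1) for each token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));  // total count per word
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```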

Tuesday, July 16, 2013

Apache Hadoop mirrors at the hub...

Apache Hadoop mirrors

  http://code.google.com/p/autosetup1/downloads/detail?name=hadoop-0.20.2.tar.gz&can=2&q=
  http://www.apache.org/dyn/closer.cgi
  http://hadoop.apache.org/releases.html

JDK6
http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html#jdk-6u45-oth-JPR

JDK7 
for 32-bit download linux-x86
for 64-bit download linux-x64

  http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html

ECOSYSTEM DOWNLOADING MIRRORS

 Hive
  http://hive.apache.org/releases.html
  http://www.apache.org/dyn/closer.cgi/hive/
 Pig
  http://archive.apache.org/dist/hadoop/pig/stable/
 
 HBase
  http://www.apache.org/dyn/closer.cgi/hbase/
  http://sourceforge.net/projects/hbasemanagergui/
 Sqoop
  http://www.apache.org/dyn/closer.cgi/sqoop/1.4.3
 Flume
  http://flume.apache.org/download.html
  http://flume.apache.org/
 Chukwa
  http://incubator.apache.org/chukwa/
  
 

When to use HBase and when to use MapReduce?

Very often I get a query on when to use HBase and when to use MapReduce. HBase provides an SQL-like interface with Phoenix, and MapReduce provides a similar SQL interface with Hive. Both can be used to get insights from the data.


I would liken HBase/MapReduce to a plane/train. A train can carry a lot of material at a slow pace, while a plane can carry relatively less material at a faster pace. Depending on the amount of material to be transferred from one location to another and the urgency, either a plane or a train can be used to move the material.

Similarly, HBase (or in fact any database) provides relatively low latency (response time) at the cost of low throughput (data transferred/processed), while MapReduce provides high throughput at the cost of high latency. So, depending on the NFR (Non Functional Requirements) of the application, either HBase or MapReduce can be picked.

E-commerce or any customer-facing application requires a quick response time for the end user, and only a few records related to the customer have to be picked, so HBase fits the bill. But for all the back-end/batch processing, MapReduce can be used.

Hbase Use Cases

HBase at Pinterest

Pinterest is completely deployed on Amazon EC2. Pinterest uses a follow model where users follow other users. This requires a following feed for every user that gets updated every time a followee creates or updates a pin. This is a classic social media application problem. For Pinterest, this amounts to 100s of millions of pins per month that get fanned out as billions of writes per day.
So the ‘Following Feed’ is implemented using HBase. Some specifics:
  • They chose a wide schema where each user’s following feed is a single row in HBase. This exploits the sorting order within columns for ordering (each user wants to see the latest in his feed) and results in atomic transactions per user.
  • To optimize writes, they increased the per-region memstore size. A 512 MB memstore leads to a 40 MB HFile instead of the small 8 MB file created by the default memstore. This leads to less frequent compactions.
  • They take care of the potential for infinite columns by trimming the feed during compactions: there really is not much point having an infinite feed anyway.
  • They also had to do GC tuning (who doesn’t) opting for more frequent but smaller pauses.
Another very interesting fact: they maintain a mean time to recovery (MTTR) of less than 2 minutes. This is a great accomplishment since HBase favors consistency over availability. They achieve this by reducing various timeout settings (socket, connect, stale node, etc.) and the number of retries. They also avoid a single point of failure by using 2 clusters. To avoid NameNode failure, they keep a copy on EBS.
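As a hedged sketch of what such a wide row might look like from the HBase client side: the table name, column family, row key and reverse-timestamp qualifier below are hypothetical illustrations of the "one row per user" idea, not Pinterest's actual schema, and the code uses the HBase 0.94-era client API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class FeedWriter {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "following_feed");   // hypothetical table name

        // One row per user; each new pin becomes a column in that row, so the
        // feed can be read back in column order from a single row.
        Put put = new Put(Bytes.toBytes("user42"));
        put.add(Bytes.toBytes("feed"),                                        // column family
                Bytes.toBytes(Long.MAX_VALUE - System.currentTimeMillis()),  // reverse-timestamp qualifier: newest first
                Bytes.toBytes("pin:98765"));
        table.put(put);
        table.close();
    }
}
```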

HBase at Groupon

Groupon has two distinct use cases: deliver deals to users via email (a batch process) and provide a relevant user experience on the website. They have increasingly tuned their deals to be more accurate and relevant to individual users (personalization).
They started out with running Hadoop MapReduce (MR) jobs for email deal delivery and used MySQL for their online application – but ideally wanted the same system for both.
They now run their Relevance and Personalization system on HBase. In order to cater to the very different workload characteristics of the two systems (email, online), they run 2 HBase clusters that are replicated so they have the same content but are tuned and accessed differently.
Groupon also uses a very wide schema – one column family for ‘user history and profile’ and the other for email history.
A 10 node cluster runs HBase (apart from the 100 node Hadoop cluster). Each node has 96GB RAM, 2

HBase at Longtail Video

This company provides JW Player, an online video player used by over 2 million websites. They have lots of data which is processed by their online analytics tool. They too are completely deployed on AWS and as such use HBase and EMR from Amazon. They read data from and write data to S3.
They had the following requirements:
  • fast queries across data sets
  • support for date-range queries
  • store huge amounts of aggregated data
  • flexibility in dimensions used for rollup tables
HBase fit the bill. They use multiple clusters to partition their read- and write-intensive workloads, similar to Groupon. They are a full-fledged Python shop, so they use HappyBase and have Thrift running on all the nodes of the HBase cluster.