MapReduce: a programming framework for writing code that processes large datasets stored in a distributed filesystem like HDFS.
HDFS: shared-nothing principle
Any coordination between nodes is done at the software level, using a conventional network
File blocks are replicated on multiple machines.
The programmer has to implement two callback functions: the mapper and the reducer.
Mapper: called once for every input record; generates any number of key-value pairs.
Reducer: the framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values.
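A minimal in-memory sketch of the two callbacks and the grouping step between them, using word count as the example (the `run_job` helper and all names here are illustrative, not part of any real framework):

```python
from collections import defaultdict

def mapper(record):
    # Called once per input record; emits any number of key-value pairs.
    for word in record.split():
        yield (word, 1)

def reducer(key, values):
    # Called once per key with an iterator over all values for that key.
    yield (key, sum(values))

def run_job(records):
    # Simulate the shuffle: group mapper output by key, then reduce each group.
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    return dict(kv for k, vs in sorted(groups.items()) for kv in reducer(k, vs))
```

For example, `run_job(["a b a", "b c"])` yields the counts `{"a": 2, "b": 2, "c": 1}`.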
A MapReduce program is called a job. A real problem may be solved by chaining several jobs => workflow (Pig, Hive, Cascading, Crunch and FlumeJava)
If the reduce function is associative and commutative, the program can set a combiner function to reduce the overhead.
MapReduce has no concept of indexes (in the database sense, where an index lets you look up rows by a column's value). Without indexes, a job must scan its entire input, and joins cannot rely on index lookups.
The reducers connect to each of the mappers and download the files of sorted key-value pairs for their partition => shuffle (a confusing term)
Section 3-1 in original paper
The master pings every worker periodically.
If a worker fails:
If a map/reduce task is unfinished: re-execute it.
If a map task has completed: re-execute it anyway, since its intermediate data is on the failed worker's local disk.
If a reduce task has completed: no need to re-execute, since the result is in the global file system (e.g. Google File System).
Save snapshots of the master's state periodically.
If the master fails, pick another node to become the new master.
It's not an easy problem.
Multiple workers may run the same task; they must generate the same output.
Rely on atomic commits (Yen3: Need more study)
Map task - when the master receives a completion report:
If the master already has a record for that map task, it ignores the report.
If not, it records the task together with its intermediate filenames.
Reduce task - rely on atomic rename of the output file.
Duplicate executions of the same reduce task generate the same output filename.
The guarantee is provided by the filesystem.
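A sketch of how a reduce task can lean on atomic rename, assuming POSIX-style filesystem semantics (`os.replace` is atomic within a single filesystem); the function name is hypothetical:

```python
import os
import tempfile

def commit_reduce_output(data: str, final_path: str) -> None:
    # Write to a private temporary file first, then atomically rename it
    # into place. If duplicate executions of the same reduce task race,
    # final_path always ends up holding one complete copy, never a mix
    # of two partial writes.
    dir_name = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dir_name)
    with os.fdopen(fd, "w") as f:
        f.write(data)
    os.replace(tmp_path, final_path)  # atomic on POSIX within one filesystem
```

Because both duplicates target the same output filename, whichever rename happens last simply overwrites an identical complete file.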
Section 3-4 in original paper
Network bandwidth is a scarce resource
Google File System stores several copies in different machine (typically 3 copies)
When the scheduler allocates a map task, it takes the location information into account.
It attempts to schedule the map on a machine that has a replica of the input data.
Failing that, it attempts to schedule it on a machine on the same network switch as one containing the data.
Most input data is read locally and does not consume network bandwidth.
Section 3-6 in original paper
Straggler: a machine that takes an unusually long time to complete one of its tasks.
Reason: resource starvation.
Resolution: run backup executions of the last unfinished (in-progress) tasks to reduce the total time.
Result: in the distributed sort benchmark, the job takes 44% longer to finish when the mechanism is disabled.
Refinement - Partition Function
Section 4-1 in original paper
In the general case the default hash function partitions the data fairly, but it is not suitable for every situation. For example, if the keys are URLs, the partition function can be redefined as something like hash(hostname(urlkey)) so that all URLs from the same host land in the same partition.
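A sketch of the two partitioning strategies (function names are illustrative; a real framework supplies its own partitioner interface):

```python
from urllib.parse import urlparse

def default_partition(key: str, num_reducers: int) -> int:
    # Default: spread keys evenly by hashing the whole key.
    return hash(key) % num_reducers

def host_partition(urlkey: str, num_reducers: int) -> int:
    # Custom: hash only the hostname, so every URL from the same host
    # lands in the same reduce partition (and thus the same output file).
    return hash(urlparse(urlkey).hostname) % num_reducers
```

With `host_partition`, `http://example.com/a` and `http://example.com/b` always go to the same reducer, which `default_partition` does not guarantee.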
Refinement - Ordering Guarantees
Section 4-2 in original paper
If the keys within a partition are sorted, it is easy to generate sorted output => supports random-access lookup by key, and so on.
Refinement - Combiner Function
Section 4-3 in original paper
Combiner function: a partial reduce on the local machine. It passes the combined result to the reduce task, limiting network traffic, CPU load, etc.
The combiner function is typically the same as the reduce function.
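A sketch of a combiner for the word-count case, where the reduce function (summation) is associative and commutative so partial local sums are safe (the function name is hypothetical):

```python
from collections import Counter

def combiner(local_pairs):
    # Runs on the mapper's machine: collapse its (word, 1) pairs into
    # (word, partial_count) so far fewer pairs cross the network. The
    # reducers then sum these partial counts exactly as they would have
    # summed the raw ones.
    counts = Counter()
    for key, value in local_pairs:
        counts[key] += value
    return sorted(counts.items())
```

For example, one mapper's output `[("the", 1), ("the", 1), ("cat", 1)]` is shipped as just `[("cat", 1), ("the", 2)]`.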
Refinement - Input and Output Types
Section 4-4 in original paper
Supports different input/output types.
Supports reading data from different sources (e.g. a database); this requires implementing a Reader class.
Refinement - Skipping Bad Records
Section 4-6 in original paper
Certain records cause the map/reduce function to crash deterministically.
If the bug is in the mapper/reducer, fix it.
If the bug is in a third-party library and the result can tolerate some data loss (e.g. statistical analysis), skip the record.
The worker records the record's sequence number before the record is processed by the mapper/reducer. If the task crashes, it sends that record number to the master so a re-execution can avoid processing it again.
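A simulation of that skipping loop in a single process, assuming re-execution happens by retrying the whole task (in the paper the record number is reported to the master by a signal handler before the worker dies; all names here are illustrative):

```python
def run_with_skipping(records, mapper, max_attempts=5):
    # Simulated worker: if the mapper crashes on a record, remember its
    # sequence number and re-execute the task, skipping known-bad records.
    skip = set()
    for _ in range(max_attempts):
        output = []
        try:
            for seq, record in enumerate(records):
                if seq in skip:
                    continue
                output.extend(mapper(record))
            return output, skip
        except Exception:
            skip.add(seq)  # report the bad record number, then re-execute
    raise RuntimeError("too many bad records")
```

A mapper that always crashes on record 1 thus succeeds on the second attempt, with record 1 dropped from the output.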
Refinement - Local Execution
Section 4-7 in original paper
It's hard to debug in a distributed system, so a local-execution mode is provided for debugging (on a single machine).
Refinement - Counters
Section 4-9 in original paper
Provides a global counter facility to each job for statistical analysis.
Yen3: Need more study.
Reduce-side joins: the actual join logic is in the reducers -> secondary sort
Map-side joins: if you can make certain assumptions about your input data, it is possible to make joins faster by using map-side joins.
How to deal with multiple data sources? Use reduce-side joins or map-side joins.
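A sketch of a reduce-side join between two sources (users and click events, keyed by user id). The mappers tag each record with its source so the reducer can tell them apart after grouping; all names and the in-memory shuffle are illustrative only:

```python
from collections import defaultdict

def map_users(user):
    uid, name = user
    yield (uid, ("user", name))

def map_events(event):
    uid, url = event
    yield (uid, ("event", url))

def join_reducer(uid, tagged_values):
    # In a real framework a secondary sort would deliver the "user"
    # record first; here we simply separate the two sources after grouping.
    names, urls = [], []
    for tag, value in tagged_values:
        (names if tag == "user" else urls).append(value)
    for name in names:
        for url in urls:
            yield (uid, name, url)

def reduce_side_join(users, events):
    groups = defaultdict(list)  # simulated shuffle: group by join key
    for u in users:
        for k, v in map_users(u):
            groups[k].append(v)
    for e in events:
        for k, v in map_events(e):
            groups[k].append(v)
    out = []
    for uid in sorted(groups):
        out.extend(join_reducer(uid, groups[uid]))
    return out
```

Events whose user id never appears in the users input simply produce no joined rows, as in an inner join.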
Network bandwidth is a scarce resource.
One benefit of MapReduce is that its sequential behaviour is the same as its parallel behaviour, which makes it easy to debug.
Appendix: Google File System (GFS)
Yen3: Need to study in the near future.
A distributed file system based on the shared-nothing principle. It does not require special hardware.
A central node (the NameNode) keeps track of which file blocks are stored on which machine. Each machine runs an HDFS daemon that allows other nodes to access files stored on that machine.
Conceptually, HDFS creates one big filesystem that can use the disk space of all machines running the daemon.
Fault tolerance: to survive machine and disk failures, file blocks are replicated on multiple machines.
Comparison to RAID
Same: uses redundant data to avoid data loss.
Different: no special hardware is required (RAID may need it); just a conventional datacenter network.
Good system papers -- details from apps all the way to network.