- MapReduce: a programming framework with which you can write code to process large datasets in a distributed system such as HDFS.
- HDFS: shared-nothing principle
- Any coordination between nodes is done at the software level, using a conventional network
- File blocks are replicated on multiple machines.
- The programmer has to implement two callback functions: the mapper and the reducer
- Mapper: called once for every input record; generates any number of key-value pairs
- Reducer: the framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values.
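The two callbacks can be sketched with the classic word-count example from the paper (the driver below is only a single-machine stand-in for the framework's map/group/reduce phases):

```python
from collections import defaultdict

# Mapper: called once per input record; emits any number of (key, value) pairs.
def mapper(record):
    for word in record.split():
        yield (word, 1)

# Reducer: called once per key with an iterator over all values for that key.
def reducer(key, values):
    yield (key, sum(values))

# Simulate the framework locally: map every record, group values by key,
# then call the reducer once per key.
def run_job(records):
    groups = defaultdict(list)
    for record in records:
        for k, v in mapper(record):
            groups[k].append(v)
    return dict(kv for k in sorted(groups) for kv in reducer(k, groups[k]))

print(run_job(["a b a", "b a"]))  # {'a': 3, 'b': 2}
```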
- A MapReduce program is called a job. A real problem may be solved by several jobs chained into a workflow (Pig, Hive, Cascading, Crunch and FlumeJava)
- If the reduce function is associative and commutative, the program can set a combiner function to reduce the overhead.
- MapReduce has no concept of indexes (indexes in the database sense). Because of this, joining raw data in MapReduce requires scanning the full input.
- The reducers connect to each of the mappers and download the files of sorted key-value pairs for their partition => shuffle (a confusing term)
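A sketch of how a mapper splits its output into per-reducer partitions before the shuffle. The default is hash partitioning; `crc32` is used here only as a deterministic stand-in for the hash function, and `R` is the number of reduce tasks:

```python
import zlib

R = 3  # number of reduce tasks (illustrative)

def partition(key, r=R):
    # Default partitioning: hash(key) mod R decides which reduce task
    # receives this key, so all pairs for one key go to one reducer.
    return zlib.crc32(key.encode()) % r

def bucket_map_output(pairs, r=R):
    # Each mapper writes its pairs into R local buckets, sorted by key;
    # during the shuffle, reducer i fetches bucket i from every mapper.
    buckets = [[] for _ in range(r)]
    for key, value in pairs:
        buckets[partition(key, r)].append((key, value))
    return [sorted(b) for b in buckets]

buckets = bucket_map_output([("apple", 1), ("banana", 1), ("apple", 1)])
```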
- Section 3-1 in original paper
- The master pings every worker periodically
- Worker failure
- If a task is an unfinished map/reduce task: re-execute it
- If it is a completed map task: re-execute it anyway, since its intermediate data was on the failed worker's local disk
- If it is a completed reduce task: no need to re-execute, since the result is in the global file system (e.g. Google File System)
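The re-execution rules above can be sketched as a small decision function (the names and state strings are illustrative, not from the paper):

```python
def must_reexecute(task_type, state):
    # On worker failure: map tasks are re-executed even when completed,
    # because their intermediate output lived on the failed worker's
    # local disk and is now unreachable. Completed reduce tasks are
    # safe: their output is already in the global file system.
    if task_type == "map":
        return True
    return state != "completed"

# Only the completed reduce task survives the worker's failure.
decisions = {
    ("map", "completed"): must_reexecute("map", "completed"),
    ("reduce", "completed"): must_reexecute("reduce", "completed"),
    ("reduce", "in_progress"): must_reexecute("reduce", "in_progress"),
}
```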
- Master failure
- Save a snapshot of the master's state periodically
- If the master fails, pick another node to be the new master and restore from the last snapshot.
- It's not an easy problem.
- Multiple workers may run the same task and generate the same output
- Rely on atomic commits (Yen3: Need more study)
- Map task
- If the master already has a record for the map task, ignore the duplicate completion.
- If not, record it with the intermediate filenames.
- Reduce task - Rely on atomic rename for the reduce task.
- The same reduce task generates the same output filename.
- The guarantee is provided by filesystem.
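A sketch of the atomic-rename idiom, using POSIX `os.rename` as a stand-in for the distributed file system's atomic rename: the reduce task writes to a private temporary file and renames it to the final name only on completion, so a duplicate execution can never leave partial output under the final name:

```python
import os
import tempfile

def write_reduce_output(final_path, lines):
    # Write to a private temporary file in the same directory first,
    # so no reader ever observes a partially written output file.
    dirname = os.path.dirname(final_path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    with os.fdopen(fd, "w") as f:
        for line in lines:
            f.write(line + "\n")
    # Atomic rename: either the complete file appears under the final
    # name or nothing does. A duplicate execution of the same reduce
    # task performs the same rename and just replaces identical content.
    os.rename(tmp_path, final_path)

out_dir = tempfile.mkdtemp()
out_path = os.path.join(out_dir, "part-00000")
write_reduce_output(out_path, ["a\t3", "b\t2"])
```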
- Section 3-4 in original paper
- Network bandwidth is a scarce resource
- Google File System stores several copies in different machine (typically 3 copies)
- When the scheduler allocates a map task, it takes the location information into account
- It attempts to schedule the map task on a machine that has a replica of the input data
- Failing that, it attempts to schedule it on a machine on the same network switch as a machine containing the data
- As a result, most input data is read locally and does not consume network bandwidth.
- Section 3-6 in original paper
- Straggler: a machine that takes an unusually long time to complete one of the last few tasks.
- Reason: resource starvation.
- Solution: run backup tasks for the last unfinished (in-progress) tasks to reduce the total time.
- Result: in the distributed sort benchmark, the job takes 44% longer when the backup-task mechanism is disabled.
Refinement - Partition Function
- Section 4-1 in original paper
- In the general case, the default hash function partitions the data fairly, but it is not suitable for every situation. For example, if the key is a URL, the partition function can be redefined to hash only an attribute of the key, such as the hostname, so that all URLs from the same host end up in the same partition.
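The paper's URL example can be sketched as a custom partition function that hashes only the hostname (`crc32` is a deterministic stand-in for the hash function):

```python
import zlib
from urllib.parse import urlparse

def url_partition(url, num_reducers):
    # Partition by hostname rather than the full URL, so that all
    # pages of one site land in the same reduce partition / output file.
    host = urlparse(url).netloc
    return zlib.crc32(host.encode()) % num_reducers

# All three URLs share a host, so they map to the same partition.
partitions = {url_partition(u, 8) for u in [
    "http://example.com/a", "http://example.com/b", "http://example.com/c"]}
```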
Refinement - Ordering Guarantees
- Section 4-2 in original paper
- If the keys within a partition are sorted, it is easy to generate sorted output => supports random-access lookup and so on.
Refinement - Combiner Function
- Section 4-3 in original paper
- Combiner function: a partial reduce on the local machine. Its result is then passed to the reduce task, limiting network traffic, CPU load, etc.
- Combiner function is typically the same as the reduce function.
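A sketch for word count: the combiner pre-sums counts on the map side, so each mapper ships one pair per distinct word instead of one pair per occurrence:

```python
from collections import Counter

def mapper(record):
    for word in record.split():
        yield (word, 1)

def combiner(pairs):
    # Partial reduce on the mapper's machine: sum the counts per word
    # before anything crosses the network to the reduce tasks.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

# Six input pairs shrink to five: the two ("the", 1) pairs are merged.
combined = combiner(mapper("the cat sat on the mat"))
```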
Refinement - Input and Output Types
- Section 4-4 in original paper
- Support different input/output types.
- Supports reading data from different sources (e.g. a database). Need to implement a Reader class
Refinement - Skipping Bad Records
- Section 4-6 in original paper
- Certain records cause the map/reduce function to crash deterministically.
- If the bug is in the mapper/reducer, fix it.
- If the bug is in a third-party library and the result can tolerate some data loss (e.g. statistical analysis), skip the record.
- The worker records the record's sequence number before the record is processed by the mapper/reducer. If the task crashes, it sends the record information to the master so that the record is not processed again.
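A single-process sketch of the skipping mechanism, assuming records are identified by sequence number (all names here are illustrative): the wrapper reports which record crashed, and a re-execution passes that number in `skip`:

```python
def run_map_with_skipping(records, mapper, skip=frozenset()):
    output, failed = [], []
    for seq, record in enumerate(records):
        if seq in skip:
            continue  # the master told us this record crashed before
        try:
            output.extend(mapper(record))
        except Exception:
            # In the real system the worker signals the master with the
            # sequence number before dying; here we just collect it.
            failed.append(seq)
    return output, failed

def fragile_mapper(record):
    if record == "bad":
        raise ValueError("deterministic crash")
    return [(record, 1)]

out, failed = run_map_with_skipping(["a", "bad", "b"], fragile_mapper)
# failed == [1]; re-running with skip={1} processes only "a" and "b".
```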
Refinement - Local Execution
- Section 4-7 in original paper
- It's hard to debug a distributed system, so the framework provides a local-execution mode for debugging (on a single machine).
Refinement - Counters
- Section 4-9 in original paper
- Provides a global counter facility for each job, for statistical analysis.
- Yen3: Need more study.
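A single-process sketch of the counter facility (the class and counter names are illustrative): user code increments named counters, and the master aggregates the per-worker values for the whole job:

```python
from collections import Counter

class JobCounters:
    """Stand-in for the paper's counter facility: mappers/reducers
    increment named counters locally; the master aggregates them
    across workers (and de-duplicates duplicate task executions)."""
    def __init__(self):
        self.totals = Counter()

    def increment(self, name, amount=1):
        self.totals[name] += amount

counters = JobCounters()

def mapper(record, counters):
    for word in record.split():
        if word == word.upper():
            counters.increment("uppercase_words")
        yield (word, 1)

pairs = list(mapper("HELLO world", counters))
```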
- Reduce-side joins: the actual join logic runs in the reducers -> secondary sort
- Map-side joins: if you can make certain assumptions about your input data, it is possible to make joins faster by using map-side joins.
- How to deal with multiple data sources? Use reduce-side joins or map-side joins.
- Network bandwidth is a scarce resource.
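A sketch of a reduce-side join on two tagged inputs, here user records and activity events joined on user id (the tagging scheme and record shapes are illustrative): each mapper tags its pairs with the source, so the reducer can separate the two sides:

```python
from collections import defaultdict

def map_users(user):            # (id, name) from the users source
    yield (user[0], ("user", user[1]))

def map_events(event):          # (id, action) from the events source
    yield (event[0], ("event", event[1]))

def join_reducer(key, values):
    # All values for one user id arrive at the same reducer;
    # separate them by tag and emit the joined rows.
    users = [v for tag, v in values if tag == "user"]
    events = [v for tag, v in values if tag == "event"]
    for name in users:
        for action in events:
            yield (key, name, action)

# Simulate the shuffle: group the tagged pairs from both sources by key.
groups = defaultdict(list)
for record, fn in [((1, "alice"), map_users), ((1, "login"), map_events),
                   ((1, "logout"), map_events)]:
    for k, v in fn(record):
        groups[k].append(v)

joined = [row for k in groups for row in join_reducer(k, groups[k])]
# [(1, 'alice', 'login'), (1, 'alice', 'logout')]
```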
- One benefit of MapReduce is that the sequential behaviour is the same as the parallel behaviour, which makes it easy to debug.
Appendix: Google File System (GFS)
- Yen3: Need to study in the near future.
- A distributed file system based on the shared-nothing principle. It does not require special hardware.
- A central node (the NameNode) keeps track of which file blocks are stored on which machine. Each machine runs an HDFS daemon that allows other nodes to access files stored on that machine.
- Conceptually, HDFS creates one big file system that uses the disk space of all machines running the daemon.
- Fault tolerance: to survive machine and disk failures, file blocks are replicated on multiple machines.
- Comparison with RAID
- Same: uses redundant data to avoid data loss.
- Different: no special hardware is required (RAID may need it); just a conventional datacenter network.
- Good systems papers -- details from apps all the way down to the network.