Reading paper/book

Overview

  • MapReduce: a programming framework with which you can write code to process large datasets stored in a distributed filesystem such as HDFS.
  • HDFS: shared-nothing principle
    • Any coordination between nodes is done at the software level, using a conventional network
    • File blocks are replicated on multiple machines.
  • The programmer has to implement two callback functions: the mapper and the reducer.
  • Mapper: called once for every input record; it may generate any number of key-value pairs.
  • Reducer: the framework takes the key-value pairs produced by the mappers, collects all the values belonging to the same key, and calls the reducer with an iterator over that collection of values (see the word-count sketch below).
  • A MapReduce program is called a job. A real problem may be solved by chaining several jobs into a workflow (e.g. Pig, Hive, Cascading, Crunch, and FlumeJava).
  • If the reduce function is associative and commutative, the program can set a combiner function to reduce the overhead.
  • MapReduce has no concept of indexes (in the database sense of an index on a column). Without an index, joining raw data requires scanning the full input.
  • The reducers connect to each of the mappers and download the files of sorted key-value pairs for their partition => shuffle (a confusingly named term)

MR_01_Overview
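
A minimal word-count sketch of the mapper/reducer contract above. This is plain Python with no framework; the names map_fn, reduce_fn, and run_job are illustrative, and run_job only stands in for the grouping/sorting the framework does.

```python
from collections import defaultdict
from typing import Iterable, Iterator, Tuple

def map_fn(record: str) -> Iterator[Tuple[str, int]]:
    # Called once for every input record; may emit any number of key-value pairs.
    for word in record.split():
        yield (word, 1)

def reduce_fn(key: str, values: Iterable[int]) -> Tuple[str, int]:
    # Called once per key with an iterator over all values collected for that key.
    return (key, sum(values))

def run_job(records: Iterable[str]):
    # Stand-in for the framework: group mapper output by key, then call the reducer.
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog"]))
```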

Fault Tolerance

  • Section 3-1 in original paper
  • Worker
    • The master pings every worker periodically
    • If a worker fails
      • If a map/reduce task was still in progress: re-execute it
      • If the worker had completed map tasks: re-execute them, since the intermediate data sits on the failed worker's local disk.
      • If the worker had completed reduce tasks: no need to re-execute, since the output is already in the global file system (e.g. Google File System)
  • Master Failure
    • Save snapshots (checkpoints) of the master's state periodically
    • If the master fails, pick another node to become the new master and resume from the last checkpoint.
    • It's not an easy problem.
  • Multiple workers may run the same task; the system must guarantee they produce the same output as a single execution
    • Rely on atomic commits of task output (Yen3: Need more study)
    • Map task
      • If the master already has a completion record for the map task, it ignores the duplicate.
      • If not, it records the completion together with the intermediate filenames.
    • Reduce task - rely on an atomic rename of the output file.
      • Duplicate executions of the same reduce task generate the same output filename.
      • The atomicity guarantee is provided by the filesystem (see the sketch below).
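
A minimal sketch of the atomic-rename idea for reduce output, assuming a POSIX-like local filesystem where os.replace is atomic (in the real system the global file system provides the guarantee; the filenames and function are illustrative):

```python
import os
import tempfile

def write_reduce_output(partition: int, lines, output_dir="."):
    # Write to a private temporary file first, then atomically rename it to the
    # final name. If duplicate reduce tasks finish, they all rename to the same
    # target, and the filesystem guarantees the final file contains exactly one
    # task's complete output.
    final_path = os.path.join(output_dir, f"part-{partition:05d}")
    fd, tmp_path = tempfile.mkstemp(dir=output_dir)
    with os.fdopen(fd, "w") as f:
        for line in lines:
            f.write(line + "\n")
    os.replace(tmp_path, final_path)  # atomic rename
    return final_path
```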

Locality

  • Section 3-4 in original paper
  • Network bandwidth is a scarce resource
  • Google File System stores several copies of each file block on different machines (typically 3 copies)
  • When the scheduler allocates a map task, it takes this location information into account
    • It attempts to schedule the map task on a machine that holds a replica of the input data
    • Failing that, it attempts to schedule the task on a machine attached to the same network switch as one containing the data.
    • As a result, most input data is read locally and does not consume network bandwidth (see the sketch below).
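
A rough sketch of the scheduling preference described above. The data structures and the rack_of lookup are assumptions for illustration, not the paper's actual scheduler:

```python
def pick_worker(replica_hosts, idle_workers, rack_of):
    # 1. Prefer an idle worker that already holds a replica of the input split.
    for worker in idle_workers:
        if worker in replica_hosts:
            return worker
    # 2. Otherwise prefer an idle worker on the same rack/switch as a replica.
    replica_racks = {rack_of(host) for host in replica_hosts}
    for worker in idle_workers:
        if rack_of(worker) in replica_racks:
            return worker
    # 3. Fall back to any idle worker (the data will be read over the network).
    return next(iter(idle_workers), None)

# Example: assume the rack is encoded in the hostname, e.g. "rack1-node3" -> "rack1".
rack = lambda host: host.split("-")[0]
print(pick_worker({"rack1-node3"}, ["rack2-node1", "rack1-node7"], rack))  # rack1-node7
```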

Backup Tasks

  • Section 3-6 in original paper
  • Straggler: a machine that takes an unusually long time to complete one of the last few tasks.
    • Reason: resource starvation.
    • Resolution: run backup executions of the remaining in-progress tasks to reduce the total completion time (see the sketch below).
    • Result: in the distributed sort benchmark, the job takes 44% longer to finish when this mechanism is disabled.
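
A small sketch of the backup-task idea using threads: launch a duplicate attempt of an in-progress task and take whichever finishes first. The attempt function and timings are made up for illustration:

```python
import concurrent.futures as cf
import random
import time

def attempt(task_id: str) -> str:
    # One execution attempt; in practice one attempt may straggle badly.
    time.sleep(random.uniform(0.01, 0.05))
    return f"result-of-{task_id}"

def run_with_backup(task_id: str, copies: int = 2) -> str:
    # Schedule duplicate attempts and mark the task done as soon as any of them
    # completes (the real system then discards the redundant attempts).
    with cf.ThreadPoolExecutor(max_workers=copies) as pool:
        futures = [pool.submit(attempt, task_id) for _ in range(copies)]
        done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
        return next(iter(done)).result()

print(run_with_backup("map-0042"))
```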

Refinement - Partition Function

  • Section 4-1 in original paper
  • In the general case, the default hash partition function spreads the data fairly evenly, but it is not suitable for every situation. For example, if the keys are URLs, the partition function can be redefined as hash(hostname(urlkey)) so that all pages from the same host end up in the same output partition (see the sketch below).
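
A sketch of the URL example: partitioning by hash(hostname(url)) so that all pages from the same host land in the same partition. The choice of md5 here is just for a stable hash; it is not what the paper uses:

```python
import hashlib
from urllib.parse import urlparse

def stable_hash(s: str) -> int:
    # A stable hash (Python's built-in hash() is randomized per process).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def default_partition(key: str, num_reducers: int) -> int:
    # Default behaviour: hash the whole key.
    return stable_hash(key) % num_reducers

def host_partition(url_key: str, num_reducers: int) -> int:
    # Custom behaviour: hash only the hostname, so all URLs from the same host
    # go to the same reduce partition (and thus the same output file).
    return stable_hash(urlparse(url_key).netloc) % num_reducers

assert host_partition("http://example.com/a", 10) == host_partition("http://example.com/b", 10)
```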

Refinement - Ordering Guarantees

  • Section 4-2 in original paper
  • The keys within each partition are processed in sorted order, which makes it easy to generate sorted output per partition => supports efficient random-access lookups by key, and so on (see the sketch below).
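
A tiny sketch of why sorted per-partition output helps: with keys in sorted order, a lookup becomes a binary search (the keys and values here are made up):

```python
import bisect

# Output of one reduce partition: key-value pairs already sorted by key.
partition = [("apple", 3), ("banana", 7), ("cherry", 1), ("date", 9)]
keys = [k for k, _ in partition]

def lookup(key):
    # Sorted keys turn a random-access lookup into a binary search.
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return partition[i][1]
    return None

print(lookup("cherry"))  # 1
print(lookup("fig"))     # None
```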

Refinement - Combiner Function

  • Section 4-3 in original paper
  • Combiner function: a partial reduce performed on the local machine. The partial results are then passed to the reduce tasks, limiting network traffic, CPU load, etc.
  • The combiner function is typically the same as the reduce function (see the sketch below).

MP_02_combiner_function
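
A sketch of where the combiner fits for word count; here the combiner is literally the same logic as the reducer, and local_combine stands in for the framework running it over one mapper's local output:

```python
from collections import defaultdict

def combine_fn(key, values):
    # Same logic as the word-count reducer, but run on the mapper's machine
    # over that mapper's output only.
    return (key, sum(values))

def local_combine(mapper_output):
    groups = defaultdict(list)
    for key, value in mapper_output:
        groups[key].append(value)
    return [combine_fn(key, values) for key, values in groups.items()]

# Instead of shipping [("the", 1), ("the", 1), ("fox", 1)] across the network,
# the mapper ships the pre-aggregated [("the", 2), ("fox", 1)].
print(local_combine([("the", 1), ("the", 1), ("fox", 1)]))
```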

Refinement - Input and Output Types

  • Section 4-4 in original paper
  • Support different input/output types.
  • Supports reading data from different sources (e.g. a database) by implementing a Reader class (see the sketch below).
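
A sketch of what a pluggable Reader could look like: one implementation reads text lines (line number as key), another reads rows from a database. This is not the actual interface from the paper or from Hadoop; the class names are illustrative:

```python
import sqlite3
from abc import ABC, abstractmethod

class Reader(ABC):
    # Produces (key, value) records for the mappers, whatever the source is.
    @abstractmethod
    def records(self):
        ...

class TextLineReader(Reader):
    # Text input: key = line number, value = line contents.
    def __init__(self, path):
        self.path = path

    def records(self):
        with open(self.path) as f:
            for lineno, line in enumerate(f):
                yield (lineno, line.rstrip("\n"))

class SQLiteReader(Reader):
    # Database input: rows read straight from a table via a query.
    def __init__(self, db_path, query):
        self.db_path, self.query = db_path, query

    def records(self):
        with sqlite3.connect(self.db_path) as conn:
            for row in conn.execute(self.query):
                yield (row[0], row[1:])
```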

Refinement - Skipping Bad Records

  • Section 4-6 in original paper
  • Certain records cause the map/reduce function to crash deterministically.
    • If the bug is in the mapper/reducer, fix it.
    • If the bug is in a third-party library and the result can tolerate some data loss (e.g. statistical analysis), skip the record.
    • The worker records the record's sequence number before the record is handed to the mapper/reducer. If the task crashes, it sends that record number to the master, which tells the next re-execution to skip it (see the sketch below).
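
A rough sketch of the skipping mechanism. The master/worker communication is collapsed into an in-process set, and the record numbering and function names are assumptions:

```python
def run_map_task(records, map_fn, skip_set):
    # skip_set holds record sequence numbers that previous attempts crashed on,
    # as reported to (and handed back by) the master.
    for seq, record in enumerate(records):
        if seq in skip_set:
            continue  # acceptable data loss, e.g. for statistical analysis
        try:
            yield from map_fn(record)
        except Exception:
            # In the real system the worker reports `seq` to the master before
            # dying; the master then marks it to be skipped on re-execution.
            skip_set.add(seq)
            raise

def fragile_map(record):
    if record == "bad":
        raise ValueError("third-party library crash")
    yield (record, 1)

skip = set()
try:
    list(run_map_task(["a", "bad", "b"], fragile_map, skip))
except ValueError:
    pass  # first attempt dies; `skip` now records the offending record
print(list(run_map_task(["a", "bad", "b"], fragile_map, skip)))  # "bad" is skipped
```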

Refinement - Local Execution

  • Section 4-7 in original paper
  • It's hard to debug in a distributed system, so the framework provides a local-execution mode for debugging (running the whole job sequentially on a single machine).

Refinement - Counters

  • Section 4-9 in original paper
  • Provides global counters for each job, useful for sanity checks and statistical analysis (see the sketch below).
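
A minimal single-process sketch of global counters. In the real system each worker piggybacks its counts onto status messages and the master aggregates them; here a Counter simply stands in for that aggregate:

```python
from collections import Counter

counters = Counter()  # stands in for the master's aggregated counters

def map_fn(record):
    counters["input_records"] += 1
    if not record.strip():
        counters["empty_records"] += 1
        return
    for word in record.split():
        yield (word, 1)

for _ in map_fn("hello world"):
    pass
for _ in map_fn("   "):
    pass
print(dict(counters))  # {'input_records': 2, 'empty_records': 1}
```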

Reduce-side/Map-side joins

  • Yen3: Need more study.
  • Reduce-side joins: the actual join logic runs in the reducers -> secondary sort
  • Map-side joins: if you can make certain assumptions about your input data, it is possible to make joins faster by performing them in the mappers.
    • How to deal with multiple data sources? Use reduce-side joins or map-side joins (see the sketch below).
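
A minimal sketch of a reduce-side join: both inputs are mapped to records keyed by the join key and tagged with their source, so the reducer can pair them up. The tables and tags are made up for illustration:

```python
from collections import defaultdict

users = {1: "alice", 2: "bob"}                    # input A: user_id -> name
events = [(1, "click"), (1, "buy"), (2, "view")]  # input B: (user_id, event)

# Map phase: both inputs emit records keyed by user_id, tagged with their source.
mapped = [(uid, ("user", name)) for uid, name in users.items()]
mapped += [(uid, ("event", ev)) for uid, ev in events]

# Shuffle: group by key (the framework would do this by sorting).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: within each key, pair the user record with its events.
for uid, values in sorted(groups.items()):
    name = next(v for tag, v in values if tag == "user")
    for tag, v in values:
        if tag == "event":
            print(uid, name, v)
```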

Conclusion

  • Network bandwidth is a scarce resource.
  • One benefit of MapReduce is that its sequential behaviour is the same as its parallel behaviour, which makes it easy to debug.

Appendix: Google File System (GFS)

  • Yen3: Need to study in the near future.
  • A distributed file system based on the shared-nothing principle. It does not require special hardware.
  • A central node (the NameNode) keeps track of which file blocks are stored on which machine. Each machine runs an HDFS daemon that allows other nodes to access the files stored on that machine.
  • Conceptually, HDFS creates one big file system out of all the disk space on all machines running the daemon.
  • Fault tolerance: to survive machine and disk failures, file blocks are replicated on multiple machines.
    • Compared to RAID
      • Same: both use redundant data to avoid data loss.
      • Different: no special hardware is required (RAID may need it); a conventional datacenter network is enough.
  • Good systems papers cover details from the applications all the way down to the network.