MapReduce Secondary Sort: How It Works

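What is secondary sort?

During a MapReduce job, the shuffle phase sorts records by key several times. After the shuffle has grouped the records, however, the order of the values that share a key is not defined. When the values for each key must also arrive in sorted order, the requirement is called secondary sort.

By default, map output is sorted by key only. Secondary sort is the technique to use when the values must be sorted as well as the keys.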
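The problem can be seen in a few lines of plain Java. This is only a sketch with no Hadoop involved: the class name KeyOnlySortSketch and the four sample records are made up for illustration, and arrival order stands in for the unspecified order Hadoop gives the values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration (not Hadoop): keys come out sorted, values do not.
class KeyOnlySortSketch {

    // Groups pair[1] under pair[0]. The TreeMap orders the keys,
    // but each value list simply keeps the order in which records arrived.
    static Map<Integer, List<Integer>> groupByKey(List<int[]> records) {
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (int[] record : records) {
            grouped.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record[1]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<int[]> records = Arrays.asList(
                new int[] {40, 20}, new int[] {30, 30}, new int[] {40, 10}, new int[] {30, 10});
        // Prints {30=[30, 10], 40=[20, 10]}: sorted keys, unsorted values.
        System.out.println(groupByKey(records));
    }
}
```

Secondary sort closes this gap by making the framework's own sort order the values, rather than sorting each value list inside the reducer.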

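Analysis of MapReduce secondary sort

Secondary sort can be broken down into the following stages.

Start of the Map phase

At the start of the Map phase, the InputFormat set with job.setInputFormatClass() divides the input data set into small chunks called splits and supplies a RecordReader implementation. This example uses TextInputFormat, whose RecordReader emits the byte offset of each line within the file as the key and the text of the line as the value. That is why the custom Mapper takes <LongWritable, Text> as input. The framework then calls the custom Mapper's map() method once for each <LongWritable, Text> pair.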
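To make the <LongWritable, Text> pairs concrete, here is a rough plain-Java imitation of what the record reader hands to the mapper. It is a sketch only: LineOffsetSketch and readRecords are hypothetical names, and it assumes UTF-8 text with "\n" line endings and no split boundaries.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java imitation (not Hadoop) of the (byte offset, line) records a mapper receives.
class LineOffsetSketch {

    // Maps the starting byte offset of each line to the line text, newline excluded.
    static Map<Long, String> readRecords(String content) {
        Map<Long, String> records = new LinkedHashMap<>();
        long offset = 0;
        for (String line : content.split("\n")) {
            records.put(offset, line);
            // Advance past the line's bytes plus the one-byte "\n" terminator.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        // Each sample line is 5 bytes plus a newline, so the keys are 0, 6 and 12.
        readRecords("40 20\n40 10\n30 30\n")
                .forEach((offset, line) -> System.out.println(offset + " -> " + line));
    }
}
```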

End of the Map phase

At the end of the Map phase, the Partitioner class set with job.setPartitionerClass() partitions the Mapper's output, and each partition maps to one Reducer. Within each partition, the records are then sorted with the key comparator class set by job.setSortComparatorClass(). Partitioning followed by sorting is, in itself, already a form of secondary sort. If no key comparator class has been set with job.setSortComparatorClass(), the key's own compareTo() method is used.
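The partition-then-sort step can be imitated in plain Java. The sketch below is not the Hadoop implementation: PartitionThenSortSketch, partitionOf and the modulo rule are assumptions chosen for illustration, with int[] {first, second} standing in for the composite key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Plain-Java imitation (not Hadoop) of partitioning map output and sorting each partition.
class PartitionThenSortSketch {

    // Hypothetical partition rule: depends on the first field only.
    static int partitionOf(int first, int numPartitions) {
        return (first & Integer.MAX_VALUE) % numPartitions;
    }

    // Distributes records into partitions, then sorts each one by (first, second).
    static List<List<int[]>> partitionAndSort(List<int[]> records, int numPartitions) {
        List<List<int[]>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (int[] record : records) {
            partitions.get(partitionOf(record[0], numPartitions)).add(record);
        }
        // Stand-in for the sort comparator: first field, then second field.
        Comparator<int[]> sortComparator =
                Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]);
        for (List<int[]> partition : partitions) {
            partition.sort(sortComparator);
        }
        return partitions;
    }
}
```

With three partitions, records whose first field is 30, 40 or 50 land in partitions 0, 1 and 2 respectively, and each partition is internally ordered by both fields.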

Reduce phase

In the Reduce phase, once the Reducer has received all the map output assigned to it, it again uses the key comparator class set with job.setSortComparatorClass() to sort all of the data. It then builds a value iterator for each key, which is where grouping comes in: the grouping comparator class is set with job.setGroupingComparatorClass(). Whenever this comparator reports two keys as equal, they belong to the same group and their values go into the same value iterator; the key paired with that iterator is the first key of the group. Finally, the Reducer's reduce() method is called with each key and its value iterator. As before, the input and output types must match those declared on the custom Reducer.
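The grouping step can be pictured as one pass over the already sorted keys. The following plain-Java sketch is an approximation rather than the framework's code: GroupingSketch and group are hypothetical names, and each int[] {first, second} stands in for a composite key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java imitation (not Hadoop) of grouping sorted keys for reduce() calls.
class GroupingSketch {

    // Walks the sorted keys and starts a new group whenever the grouping
    // comparator says the current key differs from the previous one.
    // The first element of each group plays the role of the key passed to reduce().
    static List<List<int[]>> group(List<int[]> sortedKeys, Comparator<int[]> groupingComparator) {
        List<List<int[]>> groups = new ArrayList<>();
        List<int[]> current = null;
        for (int[] key : sortedKeys) {
            if (current == null
                    || groupingComparator.compare(current.get(current.size() - 1), key) != 0) {
                current = new ArrayList<>();
                groups.add(current);
            }
            current.add(key);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<int[]> sorted = Arrays.asList(
                new int[] {30, 10}, new int[] {30, 20}, new int[] {40, 5}, new int[] {40, 10});
        // Group on the first field only, as a grouping comparator would.
        for (List<int[]> g : group(sorted, Comparator.<int[]>comparingInt(p -> p[0]))) {
            System.out.println("group key " + Arrays.toString(g.get(0)) + ", size " + g.size());
        }
    }
}
```

Because the keys were sorted on both fields before grouping, the values inside each group are already in order when reduce() iterates over them.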

The following example makes the principle of secondary sort concrete.

The input file sort.txt contains:

    40 20
    40 10
    40 30
    40 5
    30 30
    30 20
    30 10
    30 40
    50 20
    50 50
    50 10
    50 60

The output file (sorted in ascending order) contains:

    30 10
    30 20
    30 30
    30 40
    --------
    40 5
    40 10
    40 20
    40 30
    --------
    50 10
    50 20
    50 50
    50 60

The output shows that the keys are sorted in ascending order and that the values sharing a key are sorted in ascending order as well. This is the result of secondary sort.
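The expected result can be checked without a cluster. The sketch below is a single-process simulation in plain Java, not the MapReduce job itself: SecondarySortSimulation and run are hypothetical names, and the dashed separator between groups is emitted only to mirror the listing above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Single-process simulation (not Hadoop) of the secondary sort result.
class SecondarySortSimulation {

    // Parses "first second" lines, sorts by (first, second), and returns the
    // output lines with "--------" between groups that share a first field.
    static List<String> run(String input) {
        List<int[]> pairs = new ArrayList<>();
        for (String line : input.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blank lines
            }
            String[] fields = trimmed.split("\\s+");
            pairs.add(new int[] {Integer.parseInt(fields[0]), Integer.parseInt(fields[1])});
        }
        pairs.sort(Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]));

        List<String> output = new ArrayList<>();
        for (int i = 0; i < pairs.size(); i++) {
            // A change in the first field marks the start of a new group.
            if (i > 0 && pairs.get(i)[0] != pairs.get(i - 1)[0]) {
                output.add("--------");
            }
            output.add(pairs.get(i)[0] + " " + pairs.get(i)[1]);
        }
        return output;
    }

    public static void main(String[] args) {
        String input = "40 20\n40 10\n40 30\n40 5\n30 30\n30 20\n30 10\n30 40\n"
                + "50 20\n50 50\n50 10\n50 60\n";
        run(input).forEach(System.out::println);
    }
}
```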

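The secondary sort workflow in detail

This example needs two levels of comparison: order by the first field, then order the records that share a first field by the second field. To do this, we build a composite key class, IntPair, with two fields. Partitioning on the first field sends all records with the same first value to the same Reducer, and the comparison within each partition orders the records by first and then by second. The workflow has the following steps.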

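1. Custom key

Every custom key should implement the WritableComparable interface, because a key must be both serializable and comparable. A key class such as IntPair provides the methods below. Note that hashCode() and equals() are overridden from Object rather than declared by the interface.

    // Deserialize: rebuild the IntPair from the binary stream
    public void readFields(DataInput in) throws IOException

    // Serialize: convert the IntPair to binary for transfer over a stream
    public void write(DataOutput out) throws IOException

    // Compare keys
    public int compareTo(IntPair o)

    // Used by the default partitioner class, HashPartitioner
    public int hashCode()

    // Equality check
    public boolean equals(Object right)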
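The write()/readFields() pair must be exact mirrors of each other: fields are read back in the order they were written. A minimal round-trip sketch using only java.io follows. PairRoundTripSketch is a hypothetical stand-in for IntPair and does not implement the Hadoop interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for a two-int key (not the Hadoop interface), showing the serialization round trip.
class PairRoundTripSketch {
    int first;
    int second;

    // Serialize: two big-endian ints, 8 bytes in total.
    void write(DataOutput out) throws IOException {
        out.writeInt(first);
        out.writeInt(second);
    }

    // Deserialize: read the fields in exactly the order write() emitted them.
    void readFields(DataInput in) throws IOException {
        first = in.readInt();
        second = in.readInt();
    }

    // Writes a pair to a byte array and reads it back into a fresh object.
    static PairRoundTripSketch roundTrip(int first, int second) {
        try {
            PairRoundTripSketch original = new PairRoundTripSketch();
            original.first = first;
            original.second = second;
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buffer));

            PairRoundTripSketch copy = new PairRoundTripSketch();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(buffer.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Swapping the two reads in readFields() would silently exchange first and second, which is why the order matters.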

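2. Custom partitioner

The custom partitioner class FirstPartitioner is the first place the key is examined. It does not sort anything: it decides which Reducer receives each key, and by using only the first field it guarantees that all keys with the same first value reach the same Reducer.

    public static class FirstPartitioner extends Partitioner<IntPair, IntWritable>

Set the Partitioner on the job with setPartitionerClass():

    job.setPartitionerClass(FirstPartitioner.class);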
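A partition index must fall in the range 0 to numPartitions - 1. A common formula, Math.abs(first * 127) % numPartitions, can break that rule, because Math.abs(Integer.MIN_VALUE) is still negative. The plain-Java sketch below compares it with a sign-bit mask. PartitionIndexSketch and both method names are hypothetical, and the multiplier 127 is just the constant used in this example.

```java
// Plain-Java comparison (not Hadoop) of two ways to derive a partition index.
class PartitionIndexSketch {

    // Math.abs variant: negative when first * 127 equals Integer.MIN_VALUE.
    static int absPartition(int first, int numPartitions) {
        return Math.abs(first * 127) % numPartitions;
    }

    // Masked variant: clearing the sign bit keeps the result non-negative for every int.
    static int maskedPartition(int first, int numPartitions) {
        return ((first * 127) & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        for (int first : new int[] {30, 40, 50, Integer.MIN_VALUE}) {
            System.out.println(first + ": abs=" + absPartition(first, 3)
                    + ", masked=" + maskedPartition(first, 3));
        }
    }
}
```

For the sample keys 30, 40 and 50 both variants agree (partitions 0, 1 and 2 with three reducers); they differ only in the overflow case.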

3. Key comparator class

This is the second comparison of the key. It sorts all the keys, ordering IntPair by both first and second. The class is a comparator and can be implemented in two ways.

1) Extend WritableComparator.

    public static class KeyComparator extends WritableComparator

The class must have a constructor and must override the following method.

    public int compare(WritableComparable w1, WritableComparable w2)

2) Implement the RawComparator interface.

With either approach, set the key comparator class on the job with setSortComparatorClass():

    job.setSortComparatorClass(KeyComparator.class);

Note: if no custom SortComparator class is set, the key's compareTo() method is used to sort the keys by default.
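The idea behind the RawComparator route is to compare keys in their serialized form, without first rebuilding key objects. The sketch below illustrates that idea in plain Java and is not a Hadoop class: RawPairCompareSketch and serialize are hypothetical, the byte layout assumes two big-endian ints as produced by writeInt(), and the parameter list only mirrors the shape of a byte-level compare.

```java
import java.nio.ByteBuffer;

// Plain-Java illustration (not a Hadoop class) of comparing serialized keys directly.
class RawPairCompareSketch {

    // Lays out (first, second) as two big-endian ints, matching DataOutput.writeInt().
    static byte[] serialize(int first, int second) {
        return ByteBuffer.allocate(8).putInt(first).putInt(second).array();
    }

    // Compares two serialized pairs starting at offsets s1 and s2.
    // l1 and l2 (the record lengths) are unused because each key is a fixed 8 bytes.
    static int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
        ByteBuffer left = ByteBuffer.wrap(b1);
        ByteBuffer right = ByteBuffer.wrap(b2);
        // Decode as signed ints so negative values order correctly.
        int byFirst = Integer.compare(left.getInt(s1), right.getInt(s2));
        if (byFirst != 0) {
            return byFirst;
        }
        return Integer.compare(left.getInt(s1 + 4), right.getInt(s2 + 4));
    }
}
```

Decoding the ints, rather than comparing bytes lexicographically, matters here: a byte-by-byte comparison would rank negative numbers after positive ones.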

4. Grouping comparator class

In the Reduce phase, when the value iterator for a key is built, all keys with the same first belong to one group and their values go into one value iterator. This comparator can be defined in two ways.

1) Extend WritableComparator.

    public static class GroupingComparator extends WritableComparator

The class must have a constructor and must override the following method.

    public int compare(WritableComparable w1, WritableComparable w2)

2) Implement the RawComparator interface.

With either approach, set the grouping class on the job with setGroupingComparatorClass():

    job.setGroupingComparatorClass(GroupingComparator.class);

Note also that if the reduce input and output types differ, the Reducer class cannot double as the Combiner, because the Combiner's output is the reduce input. In that case, define a separate Combiner class.

Code implementation

The Hadoop examples package ships with a MapReduce secondary sort program. The code below is an improved version of that example.

    package com.buaa;

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import org.apache.hadoop.io.WritableComparable;

    /**
     * @ProjectName SecondarySort
     * @PackageName com.buaa
     * @ClassName IntPair
     * @Description Wraps the key/value of each sample record into one composite key;
     *              implements WritableComparable and overrides its methods
     * @Author 劉吉超
     * @Date 2016-06-07 22:31:53
     */
    public class IntPair implements WritableComparable<IntPair> {
        private int first;
        private int second;

        // No-argument constructor, needed so the framework can create the key before readFields()
        public IntPair() {
        }

        public IntPair(int left, int right) {
            set(left, right);
        }

        public void set(int left, int right) {
            first = left;
            second = right;
        }

        // Deserialize: read the fields in the same order write() emits them
        @Override
        public void readFields(DataInput in) throws IOException {
            first = in.readInt();
            second = in.readInt();
        }

        // Serialize both fields
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(first);
            out.writeInt(second);
        }

        // Order by first, then by second
        @Override
        public int compareTo(IntPair o) {
            if (first != o.first) {
                return first < o.first ? -1 : 1;
            } else if (second != o.second) {
                return second < o.second ? -1 : 1;
            } else {
                return 0;
            }
        }

        @Override
        public int hashCode() {
            return first * 157 + second;
        }

        @Override
        public boolean equals(Object right) {
            if (right == null)
                return false;
            if (this == right)
                return true;
            if (right instanceof IntPair) {
                IntPair r = (IntPair) right;
                return r.first == first && r.second == second;
            } else {
                return false;
            }
        }

        public int getFirst() {
            return first;
        }

        public int getSecond() {
            return second;
        }
    }

    package com.buaa;

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    /**
     * @ProjectName SecondarySort
     * @PackageName com.buaa
     * @ClassName SecondarySort
     * @Description Driver, Mapper, Partitioner, GroupingComparator and Reducer for the secondary sort job
     * @Author 劉吉超
     * @Date 2016-06-07 22:40:37
     */
    public class SecondarySort {

        public static class Map extends Mapper<LongWritable, Text, IntPair, IntWritable> {
            @Override
            public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
                String line = value.toString();
                StringTokenizer tokenizer = new StringTokenizer(line);
                int left = 0;
                int right = 0;
                if (tokenizer.hasMoreTokens()) {
                    left = Integer.parseInt(tokenizer.nextToken());
                    if (tokenizer.hasMoreTokens()) {
                        right = Integer.parseInt(tokenizer.nextToken());
                    }
                    // The composite key carries both fields; the value repeats the second field
                    context.write(new IntPair(left, right), new IntWritable(right));
                }
            }
        }

        /**
         * Custom partitioner class FirstPartitioner: partitions on IntPair.first only
         */
        public static class FirstPartitioner extends Partitioner<IntPair, IntWritable> {
            @Override
            public int getPartition(IntPair key, IntWritable value, int numPartitions) {
                // Mask the sign bit rather than use Math.abs(), which stays negative for Integer.MIN_VALUE
                return ((key.getFirst() * 127) & Integer.MAX_VALUE) % numPartitions;
            }
        }

        /**
         * Custom GroupingComparator class: groups the data within a partition by IntPair.first
         */
        @SuppressWarnings("rawtypes")
        public static class GroupingComparator extends WritableComparator {
            protected GroupingComparator() {
                // true: let WritableComparator create IntPair instances for compare()
                super(IntPair.class, true);
            }

            @Override
            public int compare(WritableComparable w1, WritableComparable w2) {
                IntPair ip1 = (IntPair) w1;
                IntPair ip2 = (IntPair) w2;
                int l = ip1.getFirst();
                int r = ip2.getFirst();
                return l == r ? 0 : (l < r ? -1 : 1);
            }
        }

        public static class Reduce extends Reducer<IntPair, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(IntPair key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
                // Values arrive already ordered by second, so they are written out as-is
                for (IntWritable val : values) {
                    context.write(new Text(Integer.toString(key.getFirst())), val);
                }
            }
        }

        public static void main(String[] args) throws IOException, InterruptedException, ClassNotFoundException {
            // Load the configuration
            Configuration conf = new Configuration();

            // Delete the output path if it already exists
            Path mypath = new Path(args[1]);
            FileSystem hdfs = mypath.getFileSystem(conf);
            if (hdfs.exists(mypath)) {
                hdfs.delete(mypath, true);
            }

            // Job.getInstance() replaces the deprecated Job constructor
            Job job = Job.getInstance(conf, "secondarysort");
            // Main class
            job.setJarByClass(SecondarySort.class);

            // Input path
            FileInputFormat.setInputPaths(job, new Path(args[0]));
            // Output path
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Mapper
            job.setMapperClass(Map.class);
            // Reducer
            job.setReducerClass(Reduce.class);

            // Partitioner
            job.setPartitionerClass(FirstPartitioner.class);

            // No custom SortComparator is set in this example (job.setSortComparatorClass() is not called);
            // IntPair.compareTo() does the sorting

            // Grouping comparator
            job.setGroupingComparatorClass(GroupingComparator.class);

            // Map output key type
            job.setMapOutputKeyClass(IntPair.class);
            // Map output value type
            job.setMapOutputValueClass(IntWritable.class);

            // Reduce output key type
            job.setOutputKeyClass(Text.class);
            // Reduce output value type
            job.setOutputValueClass(IntWritable.class);

            // Input format
            job.setInputFormatClass(TextInputFormat.class);
            // Output format
            job.setOutputFormatClass(TextOutputFormat.class);

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
