Hadoop Inverted Index Lab Report (hadoop倒排索引实验报告.docx)


Introduction to Big Data Technology: Lab Report, Assignment 3
Name: 郭利强
Major: Engineering Management
Student ID: 2015E8009064028

Contents
1. Requirements
2. Environment
   2.1 Hardware
   2.2 Software
   2.3 Cluster configuration
3. Design
   3.1 Part 1 design
   3.2 Part 2 design
4. Program code
   4.1 Part 1 code
   4.2 Part 2 code
5. Input and results
   The input and output of the experiments are in the corresponding directories of the accompanying archive.

1. Requirements

Part 1: Using the secondary sort design pattern, take N IP network traffic files as input and compute, for each source IP address, the number of distinct destination IP addresses it connects to. In other words, deduplicate and count the destination IP addresses of each source IP address. (The example for this part is not reproduced here.)

Part 2: Take N files as input and generate an inverted index with detailed information. For example, given 4 input files:

    d1.txt: cat dog cat fox
    d2.txt: cat bear cat cat fox
    d3.txt: fox wolf dog
    d4.txt: wolf hen rabbit cat sheep

the job must build an inverted index in the following format:

    cat 3: 4: (d1.txt,2,4),(d2.txt,3,5),(d4.txt,1,5)

that is:

    word  number of files containing the word: total number of files: (name of a file containing the word, occurrences of the word in that file, total word count of that file), ...

2. Environment

2.1 Hardware

Processor: Intel Core i3-2350M CPU @ 2.3GHz × 4
Memory: 2GB
Disk: 60GB

2.2 Software

Operating system: Ubuntu 14.04 LTS
OS type: 32-bit
Java version: 1.7.0_85
Eclipse version: 3.8
Hadoop plugin: hadoop-eclipse-plugin-2.6.0.jar
Hadoop: 2.6.1

2.3 Cluster configuration

The cluster runs in pseudo-distributed mode with a single node.

3. Design

3.1 Part 1 design

Part 1 uses two MapReduce jobs. The first job reads the records and removes duplicates. The second job follows the secondary sort pattern: it groups records by source address and counts the destination addresses.

First job: a custom type StringPair (source address, destination address) implements WritableComparable. The map phase reads the file and emits <StringPair, NullWritable>; the reduce phase drops duplicate records and writes out the rest.

Second job:

1. The map phase reads the output of the first job, splits each value, builds a StringPair from the resulting source and destination addresses as the output key, and emits 1 as the output value.

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] records = value.toString().split("\t");
        String sourceip = records[0];
        String desip = records[1];
        context.write(new StringPair(sourceip, desip), one);
    }

2. A GroupComparator class extends WritableComparator and overrides compare so that map output is compared on StringPair.first only, which groups records by source address.

    public static class GroupComparator extends WritableComparator {
        protected GroupComparator() {
            super(StringPair.class, true);
        }

        @Override
        public int compare(WritableComparable w1, WritableComparable w2) {
            StringPair ip1 = (StringPair) w1;
            StringPair ip2 = (StringPair) w2;
            return ip1.getFirst().compareTo(ip2.getFirst());
        }
    }

3. The reduce phase sums all values in a group, which gives the number of distinct destination addresses the source address connects to.

    public void reduce(StringPair key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        statistic.set(sum);
        context.write(key.getFirst(), statistic);
    }

3.2 Part 2 design

Part 2 also uses two MapReduce jobs. The first job counts the occurrences of every word in each file and the total number of words in each file. The second job processes those counts into the inverted index.

First job:

1. The map phase overrides the map method. It uses StringTokenizer to split the text held in value into individual words, obtains the file name, and emits records in one of two formats: <fileName\001word, 1> or <fileName, 1>.

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Get the file name
        FileSplit fileSplit = (FileSplit) context.getInputSplit();
        String fileName = fileSplit.getPath().getName();
        // Emit the records used to count each word per file and the words per file
        StringTokenizer itr = new StringTokenizer(value.toString());
        for (; itr.hasMoreTokens(); ) {
            String word = removeNonLetters(itr.nextToken().toLowerCase());
            String fileWord = fileName + "\001" + word;
            if (!word.equals("")) {
                context.write(new Text(fileWord), new IntWritable(1));
                context.write(new Text(fileName), new IntWritable(1));
            }
        }
    }

2. The reduce phase sums the values to obtain the number of occurrences of each word in each file and the total word count of each file, and emits <key, sum>.

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        context.write(key, new IntWritable(sum));
    }

Second job:

1. The map phase reads the output of the first job, splits each value, and recombines the parts. The output key is the fixed Text value "index"; the output value is filename+word+count or filename+count, joined with the \001 separator.

    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String valStr = value.toString();
        String[] records = valStr.split("\t");
        context.write(new Text("index"), new Text(records[0] + "\001" + records[1]));
    }

2. The reduce phase defines four HashMaps:

    wordinfilescount: key is word + file name, value is the number of occurrences of the word in that file
    filescount: key is the file name, value is the total word count of the file
    wordinfiles: key is the word, value is the number of files the word appears in
    indexes: key is the word, value is its inverted index entry

The reducer reads each value and splits it on the separator. If the split yields 2 fields, the value is file name + file word count, and the pair goes into filescount. If it yields 3 fields, the value is file name + word + occurrences of the word in that file; the pair (file name + word, occurrences) goes into wordinfilescount, and the number of files containing the word is updated in wordinfiles. The reducer then iterates over wordinfilescount and, with the word as key, stores "word-number of files containing the word:total number of files:(file name, occurrences of the word in that file, total word count of that file)" in indexes. Finally it iterates over indexes and writes out every index entry.

    public void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Split the input to collect per-file word counts, the number of files
        // each word appears in, each file's total word count, and the file total
        for (Text val : values) {
            String valStr = val.toString();
            String[] records = valStr.split("\001");
            switch (records.length) {
            case 2:
                filescount.put(records[0], Integer.parseInt(records[1]));
                break;
            case 3:
                // The full record string (file, word, count) is used as the key
                wordinfilescount.put(valStr, Integer.parseInt(records[2]));
                if (!wordinfiles.containsKey(records[1])) {
                    wordinfiles.put(records[1], 1);
                } else {
                    wordinfiles.put(records[1], wordinfiles.get(records[1]) + 1);
                }
                break;
            }
        }
        // Build the inverted index
        for (Entry<String, Integer> entry : wordinfilescount.entrySet()) {
            String valStr = entry.getKey();
            String[] records = valStr.split("\001");
            String word = records[1];
            if (!indexes.containsKey(word)) {
                StringBuilder sb = new StringBuilder();
                sb.append(word)
                  .append("-")
                  .append(wordinfiles.get(word))
                  .append(":")
                  .append(filescount.size())
                  .append(":")
                  .append("(")
                  .append(records[0])
                  .append(",")
                  .append(entry.getValue())
                  .append(",")
                  .append(filescount.get(records[0]))
                  .append(")");
                indexes.put(word, sb.toString());
            } else {
                StringBuilder sb = new StringBuilder();
                sb.append(",(")
                  .append(records[0])
                  .append(",")
                  .append(entry.getValue())
                  .append(",")
                  .append(filescount.get(records[0]))
                  .append(")");
                indexes.put(word, indexes.get(word) + sb.toString());
            }
        }
        for (Entry<String, String> entry : indexes.entrySet()) {
            context.write(new Text(entry.getValue()), NullWritable.get());
        }
    }

4. Program code

4.1 Part 1 code

1. IpStatistics.java

    /*
     * Licensed to the Apache Software Foundation (ASF) under one
     * or more contributor license agreements.  See the NOTICE file
     * distributed with this work for additional information
     * regarding copyright ownership.  The ASF licenses this file
     * to you under the Apache License, Version 2.0 (the
     * "License"); you may not use this file except in compliance
     * with the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.Comparator;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Map.Entry;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
    import org.apache.hadoop.fs.Path;

    public class IpStatistics {

        // Map class of the first MapReduce job, used for deduplication
        public static class RemoveMapper
                extends Mapper<Object, Text, StringPair, NullWritable> {
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    String nextToken = itr.nextToken();
                    String[] records = nextToken.split(",");
                    // ASSUMPTION: the replace() arguments and the destinationip
                    // assignment were lost in this copy; the two lines below are
                    // reconstructed on the assumption that records look like <src,dst>.
                    String sourceip = records[0].replace("<", "");
                    String destinationip = records[1].replace(">", "");
                    context.write(new StringPair(sourceip, destinationip),
                            NullWritable.get());
                }
            }
        }

        // Map class of the second MapReduce job, used for counting
        public static class StatisticsMapper
                extends Mapper<Object, Text, StringPair, IntWritable> {
            IntWritable one = new IntWritable(1);

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String[] records = value.toString().split("\t");
                String sourceip = records[0];
                String desip = records[1];
                context.write(new StringPair(sourceip, desip), one);
            }
        }

        // Group by source address
        public static class GroupComparator extends WritableComparator {
            protected GroupComparator() {
                super(StringPair.class, true);
            }

            @Override
            public int compare(WritableComparable w1, WritableComparable w2) {
                StringPair ip1 = (StringPair) w1;
                StringPair ip2 = (StringPair) w2;
                return ip1.getFirst().compareTo(ip2.getFirst());
            }
        }

        // Reduce class of the first MapReduce job, deduplication
        public static class RemoveReducer
                extends Reducer<StringPair, NullWritable, Text, NullWritable> {
            public void reduce(StringPair key, Iterable<NullWritable> values,
                    Context context) throws IOException, InterruptedException {
                context.write(new Text(key.toString()), NullWritable.get());
            }
        }

        // Reduce class of the second MapReduce job, counting
        public static class StatisticsReducer extends Reducer ...
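The Part 1 design (deduplicate, sort on a composite key, group on the source address, count) can be tried without a cluster. What follows is a minimal plain-Java sketch, not the report's code: the class `SecondarySortSketch`, its `IpPair` key, and the sample addresses are hypothetical stand-ins, with a sorted set playing the role of the first job's shuffle and a change-of-source check playing the role of `GroupComparator`. It assumes Java 8 or later.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

public class SecondarySortSketch {

    /** Hypothetical stand-in for StringPair: ordered by source, then destination. */
    public static final class IpPair implements Comparable<IpPair> {
        public final String source;
        public final String destination;

        public IpPair(String source, String destination) {
            this.source = source;
            this.destination = destination;
        }

        @Override
        public int compareTo(IpPair other) {
            int bySource = source.compareTo(other.source);
            return bySource != 0 ? bySource : destination.compareTo(other.destination);
        }
    }

    /** Returns, per source address, the number of distinct destination addresses. */
    public static Map<String, Integer> countDistinctDestinations(List<IpPair> records) {
        // Job 1 stand-in: a sorted set keeps one copy of each (source, destination) pair.
        TreeSet<IpPair> unique = new TreeSet<>(records);

        // Job 2 stand-in: keys arrive sorted; a group ends when the source changes,
        // which is the comparison the grouping comparator performs on the first field.
        Map<String, Integer> counts = new LinkedHashMap<>();
        String currentSource = null;
        int sum = 0;
        for (IpPair key : unique) {
            if (currentSource != null && !key.source.equals(currentSource)) {
                counts.put(currentSource, sum);
                sum = 0;
            }
            currentSource = key.source;
            sum += 1; // each surviving key contributes the value 1
        }
        if (currentSource != null) {
            counts.put(currentSource, sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Sample addresses are invented for illustration.
        List<IpPair> records = Arrays.asList(
                new IpPair("10.0.0.1", "10.0.0.2"),
                new IpPair("10.0.0.1", "10.0.0.2"), // duplicate record
                new IpPair("10.0.0.1", "10.0.0.3"),
                new IpPair("10.0.0.2", "10.0.0.1"));
        System.out.println(countDistinctDestinations(records));
        // prints {10.0.0.1=2, 10.0.0.2=1}
    }
}
```

Because duplicates are removed before grouping, summing a constant 1 per key is enough to obtain the distinct count, which is why the report's second reducer can stay a plain sum.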

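The Part 2 index format can likewise be checked against the four-file example from the requirements. The sketch below is a plain-Java approximation under stated assumptions, not the report's Hadoop code: `InvertedIndexSketch` and `buildIndex` are hypothetical names, tokens are split on whitespace only (the report's `removeNonLetters` step is omitted), and sorted maps replace the HashMaps so that the output order is deterministic. It follows the `word-files:total:(file,count,fileTotal)` layout built by the second reducer and assumes Java 8 or later.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

public class InvertedIndexSketch {

    /** Maps each word to "word-filesWithWord:totalFiles:(file,count,fileTotal),...". */
    public static Map<String, String> buildIndex(Map<String, String> documents) {
        // Job 1 stand-in: per-file word counts and per-file total word counts.
        Map<String, Map<String, Integer>> wordCountsByFile = new TreeMap<>();
        Map<String, Integer> totalWordsByFile = new TreeMap<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            Map<String, Integer> counts = new TreeMap<>();
            int total = 0;
            for (String token : doc.getValue().toLowerCase().split("\\s+")) {
                if (token.isEmpty()) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
                total++;
            }
            wordCountsByFile.put(doc.getKey(), counts);
            totalWordsByFile.put(doc.getKey(), total);
        }

        // Job 2 stand-in: collect the postings and the file count for every word.
        Map<String, StringBuilder> postings = new TreeMap<>();
        Map<String, Integer> filesWithWord = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> file : wordCountsByFile.entrySet()) {
            for (Map.Entry<String, Integer> wc : file.getValue().entrySet()) {
                StringBuilder sb = postings.computeIfAbsent(wc.getKey(), w -> new StringBuilder());
                if (sb.length() > 0) {
                    sb.append(',');
                }
                sb.append('(').append(file.getKey())
                  .append(',').append(wc.getValue())
                  .append(',').append(totalWordsByFile.get(file.getKey()))
                  .append(')');
                filesWithWord.merge(wc.getKey(), 1, Integer::sum);
            }
        }

        // Assemble one index line per word.
        Map<String, String> index = new TreeMap<>();
        for (Map.Entry<String, StringBuilder> p : postings.entrySet()) {
            String word = p.getKey();
            index.put(word, word + "-" + filesWithWord.get(word) + ":"
                    + documents.size() + ":" + p.getValue());
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("d1.txt", "cat dog cat fox");
        documents.put("d2.txt", "cat bear cat cat fox");
        documents.put("d3.txt", "fox wolf dog");
        documents.put("d4.txt", "wolf hen rabbit cat sheep");
        System.out.println(buildIndex(documents).get("cat"));
        // prints cat-3:4:(d1.txt,2,4),(d2.txt,3,5),(d4.txt,1,5)
    }
}
```

Like the report's design, which sends every record to a single reducer under the fixed key "index", this sketch holds all counts in memory at once, so it suits small inputs only.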