上篇:
上篇中,udtf函数,只有为一行输入,一行输出。udtf是可以一行输入,多行输出的。
简述下需求:
输入开始时间,结束时间,返回每个小时的时长
直接上代码:
package com.venn.udtf;import org.apache.hadoop.hive.ql.exec.UDFArgumentException;import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;import org.apache.hadoop.hive.ql.metadata.HiveException;import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;import java.util.ArrayList;/** * Created by venn on 5/20/2018. * SplitHour : split hour */public class SplitHour extends GenericUDTF { /** * add the column name * @param args * @return * @throws UDFArgumentException */ @Override public StructObjectInspector initialize(ObjectInspector[] args) throws UDFArgumentException { if (args.length != 1) { throw new UDFArgumentLengthException("ExplodeMap takes only one argument"); } if (args[0].getCategory() != ObjectInspector.Category.PRIMITIVE) { throw new UDFArgumentException("ExplodeMap takes string as a parameter"); } ArrayListfieldNames = new ArrayList (); ArrayList fieldOIs = new ArrayList (); fieldNames.add("begintime"); fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector); fieldNames.add("endtime"); fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector); fieldNames.add("hour"); fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector); fieldNames.add("seconds"); fieldOIs.add(PrimitiveObjectInspectorFactory.javaStringObjectInspector); return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs); } /** * process the column * @param objects * @throws HiveException */ public void process(Object[] objects) throws HiveException { String [] input = objects[0].toString().split(","); // 2018-06-06 10:25:35 String beginTime = input[0]; String endTime = input[1]; String[] result = new String[4]; result[0] = beginTime; result[1] = endTime; // begintime int bhour = Integer.parseInt(beginTime.substring(11, 13)); int bmin = Integer.parseInt(beginTime.substring(14, 16)); int bsecond = Integer.parseInt(beginTime.substring(17, 19)); // endtime int ehour = Integer.parseInt(endTime.substring(11, 13)); int emin = Integer.parseInt(endTime.substring(14, 16)); int esecond = Integer.parseInt(endTime.substring(17, 19)); // 1.if begin hour equal end hour, second is : (emin - bmin) * 60 + (esecond - bsecond) if (bhour == ehour) { result[2] = String.valueOf(bhour); result[3] = String.valueOf((emin - bmin) * 60 + (esecond - bsecond)); forward(result); return; } boolean flag = true; //TODO 待优化,先输出第一个循环的时长,再循环后面的就不用判断 while (bhour != ehour) { result[2] = String.valueOf(bhour); if(flag){ flag = false; // 2. if begintime hour != endtime, the first hour, second is : 3600 - bmin * 60 - bsecond result[3] = String.valueOf(3600 - bmin * 60 - bsecond); }else { // 3. next hour is 3600 result[3] = String.valueOf(3600); } bhour += 1; // 输出到hive forward(result); } result[2] = String.valueOf(bhour); // 4. the end hour is : emin * 60 + esecond result[3] = String.valueOf( emin * 60 + esecond); forward(result); } public void close() throws HiveException { }}
udtf 函数介绍参加上篇
使用方式见上篇
样例:
hive> select split_hour( concat(begintime,',',endtime)) from viewlog where log_date=20180401 limit 10;OKbegintime endtime hour seconds2018-04-01 10:26:14 2018-04-01 10:26:21 10 72018-04-01 07:21:47 2018-04-01 07:22:23 7 362018-04-01 15:18:08 2018-04-01 15:18:11 15 32018-04-01 18:05:13 2018-04-01 18:05:28 18 152018-04-01 07:18:34 2018-04-01 07:18:52 7 182018-04-01 23:28:32 2018-04-01 23:29:44 23 722018-04-01 06:34:11 2018-04-01 06:34:17 6 62018-04-01 14:02:40 2018-04-01 14:03:33 14 532018-04-01 17:30:23 2018-04-01 17:30:26 17 32018-04-01 12:15:07 2018-04-01 12:15:11 12 42018-04-01 06:53:40 2018-04-01 07:02:09 6 3802018-04-01 06:53:40 2018-04-01 07:02:09 7 129Time taken: 2.238 seconds, Fetched: 12 row(s)
搞定