Install and run Elasticsearch

https://www.elastic.co/guide/en/elasticsearch/guide/current/running-elasticsearch.html
Installing Elasticsearch is straightforward; just follow the steps in the link above. The trickier part is installing Sense.

To install and run Sense:

Run the following command in the Kibana directory to download and install the Sense app:
./bin/kibana plugin --install elastic/sense

Windows: bin\kibana.bat plugin --install elastic/sense.
Note
You can download Sense from https://download.elastic.co/elastic/sense/sense-latest.tar.gz to install it on an offline machine.
Start Kibana.
./bin/kibana

Windows: bin\kibana.bat.
Open Sense in your web browser by going to http://localhost:5601/app/sense.

Some blog notes

  1. Elasticsearch: The Definitive Guide (Chinese edition)
    http://wiki.jikexueyuan.com/project/elasticsearch-definitive-guide-cn/330_Geo_aggs/60_Geo_aggs.html
  2. Sometimes packaging with mvn assembly:assembly fails to find classes that live in jars not managed by Maven; running mvn assembly:assembly a second time fixes it.

Path must not end with / character
This usually means the Kafka topic is null or otherwise invalid.
http://blog.csdn.net/ado1986/article/details/50147693
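
A small guard of my own (just an illustration, not from the linked post): fail fast on a null or empty topic instead of letting it become a ZooKeeper path that ends with '/'.

// Hypothetical helper: validate the topic name before wiring up producers/consumers.
public static String requireValidTopic(String topic) {
    if (topic == null || topic.trim().isEmpty()) {
        throw new IllegalArgumentException("Kafka topic must not be null or empty");
    }
    return topic.trim();
}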

Blocking queues
http://wiki.jikexueyuan.com/project/java-concurrent/blocking-queues.html
http://wiki.jikexueyuan.com/project/java-concurrency/queue-stack.html
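
For reference, a minimal producer/consumer sketch using java.util.concurrent.ArrayBlockingQueue (my own illustration, not code from the linked articles):

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BlockingQueueDemo {
    public static void main(String[] args) {
        // A bounded queue: put() blocks while full, take() blocks while empty.
        final BlockingQueue<Integer> queue = new ArrayBlockingQueue<Integer>(2);

        new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 5; i++) {
                        queue.put(i); // blocks when the queue is full
                        System.out.println("produced " + i);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }).start();

        new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 0; i < 5; i++) {
                        System.out.println("consumed " + queue.take()); // blocks when empty
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }).start();
    }
}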

Priority queue: PriorityQueue
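
A quick sketch: PriorityQueue orders elements by natural ordering (or by a Comparator you pass in), and poll() always removes the smallest element at the head.

import java.util.PriorityQueue;

public class PriorityQueueDemo {
    public static void main(String[] args) {
        PriorityQueue<Integer> pq = new PriorityQueue<Integer>();
        pq.offer(3);
        pq.offer(1);
        pq.offer(2);
        while (!pq.isEmpty()) {
            System.out.println(pq.poll()); // prints 1, then 2, then 3
        }
    }
}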

Programmers, what are you feeling lost about?
http://geek.csdn.net/news/detail/68512

A few programmer-and-Zen-master jokes
http://bbs.jointforce.com/forum.php?mod=viewthread&tid=16475&extra=page%3D1

Multithreading
http://www.cnblogs.com/skywang12345/p/3479024.html
http://www.cnblogs.com/skywang12345/p/3479063.html
http://www.cnblogs.com/skywang12345/p/3479083.html
http://www.cnblogs.com/skywang12345/p/3479202.html
http://www.cnblogs.com/skywang12345/p/3479224.html

Using CoreLocation to get the geographic location on iOS 8 / iOS 9
http://www.jianshu.com/p/ce8be56845c1

Getting the current latitude and longitude in Swift
http://www.cnblogs.com/foxting/p/4518379.html

Getting a place name from latitude and longitude
http://api.map.baidu.com/lbsapi/getpoint/index.html?qq-pf-to=pcqq.group

Java SE 7 new feature: the try-with-resources statement
http://blog.csdn.net/jackiehff/article/details/17765909
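
The gist, as a minimal sketch: any resource declared in the try (...) header that implements AutoCloseable is closed automatically, whether or not an exception is thrown.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class TryWithResourcesDemo {
    public static String firstLine(String path) throws IOException {
        // reader.close() runs automatically when the block exits, normally or abnormally.
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();
        }
    }
}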

Kafka documentation (3): translated configuration options
http://blog.csdn.net/beitiandijun/article/details/40582541

The Java engineer's road to godhood
http://blog.csdn.net/i10630226/article/details/50855118

What does "ad hoc" mean?
http://simon.blog.51cto.com/80/97504/

Query the definitions of every index and type in ES
GET _mapping

Query the definition of the nm_1605 index in ES
GET /nm_1605/_mapping

Query the definition of the news type within the nm_1605 index in ES
GET /nm_1605/_mapping/news

Specifying the JDK for a Tomcat installation
http://www.cnblogs.com/lioillioil/archive/2011/10/08/2202169.html

Kafka design and principles in detail (part 1)
http://mp.weixin.qq.com/s?__biz=MzAwNjQwNzU2NQ==&mid=2650342597&idx=1&sn=ec3bde92ae548a587bf8f7a53a060de7&scene=0#rd

Changing the hostname on CentOS
Edit the hostname in the following three files:
[root@iZ25ps5j2tjZ ~]# vi /etc/sysconfig/network
[root@iZ25ps5j2tjZ ~]# vim /etc/hostname
[root@iZ25ps5j2tjZ ~]# vim /etc/hosts

LinkedIn Camus, a data pipeline from Kafka to HDFS
http://blog.csdn.net/amghost/article/details/44258817

Kafka in practice: from Flume to Kafka
https://yq.aliyun.com/articles/33933#

A highly available Hadoop platform: Flume NG in practice, with diagrams
http://www.cnblogs.com/smartloli/p/4468708.html?spm=5176.blog33933.yqblogcon1.5.lrywB7

Integrating Flume with Kafka
http://m.blog.csdn.net/article/details?id=51125987

Failed to send producer request with correlation id 2 to broker 0 with
http://10120275.blog.51cto.com/10110275/1764526
A Java program on Windows connecting directly to Kafka on a Linux server gets this error; the fix is to change two Kafka broker settings:
advertised.host.name=121.51.250.98 # public IP
advertised.port=9092

Adding a program to the Windows 10 startup items
http://jingyan.baidu.com/album/90895e0ff3a41f64ec6b0bc3.html

Log4j hello world example
http://www.mkyong.com/logging/log4j-hello-world-example/

Kafka delete topic, "marked for deletion"
http://blog.csdn.net/wind520/article/details/48710043

Dynamically adding extension words to the IK analyzer
http://www.itzhai.com/ikanalyzer-lucene-demo-performance-test.html#直接通过Analyzer进行分词

IKAnalyzer2012_u6.jar must be used together with lucene-core-3.6.0.jar; pairing it with any other Lucene version causes all sorts of problems.

The sample data used by Spark MLlib FPGrowth
https://github.com/apache/spark/blob/master/data/mllib/sample_fpgrowth.txt

System memory 259522560 must be at least 4.718592E8. Please use a larger heap size.
http://blog.csdn.net/yizheyouye/article/details/50676022
Set -Xms256m -Xmx1024m in the IDE.

An introduction to GitHub Flavored Markdown (GFM)
http://www.jianshu.com/p/cfPxyr

Kudu, Hadoop's new database, article 1: overview
http://blog.datacanvas.io/hadoopxin-xing-shu-ju-ku-kuduxi-lie-wen-zhang-1-gai-shu/

Blocking queues
http://www.cnblogs.com/dolphin0520/p/3932906.html

A thorough fix for the "We couldn't create a new partition or locate an existing one" error when installing Windows 7, 8.1, or 10
http://www.yishimei.cn/computer/486.html

Document deduplication algorithms: SimHash and MinHash
http://blog.csdn.net/heiyeshuwu/article/details/44117473

Factorials, permutations, combinations, binomials, and multinomials
http://202.113.29.3/nankaisource/mathhands/Elementary%20mathematics/0101/010108/01010801.htm

An intuitive explanation of convolutional neural networks
http://blog.csdn.net/v_july_v/article/details/51812459

The Linux find command
find / -name "user_model" -print
http://www.cnblogs.com/peida/archive/2012/11/16/2773289.html

Integrating SLF4J and Log4j in Java


<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-api</artifactId>
    <version>1.7.13</version>
</dependency>
<dependency>
    <groupId>org.slf4j</groupId>
    <artifactId>slf4j-log4j12</artifactId>
    <version>1.7.13</version>
</dependency>
<dependency>
    <groupId>log4j</groupId>
    <artifactId>log4j</artifactId>
    <version>1.2.7</version>
</dependency>

Keeping the logs of Spring's classes out of log4j output
Set the level with log4j.logger.org.springframework=ERROR and the Spring logs disappear.
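
For context, a minimal log4j.properties showing where that line sits (the appender name and pattern are just illustrative):

# Our own code logs at DEBUG to the console.
log4j.rootLogger=DEBUG, console

# Only ERROR and above from Spring packages gets through.
log4j.logger.org.springframework=ERROR

log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yyyy-MM-dd HH:mm:ss} %-5p %c - %m%n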

Adding a third-party jar to the local Maven repository
http://blog.csdn.net/yancao952/article/details/49926623

Convert the jars bundled in the project into Maven-managed dependencies by installing them into the local Maven repository.

I went with the second approach, so every jar is managed by Maven: clean and tidy.

The steps are as follows:

1. Use the Maven command to install the jar into the local repository, for example maple-2.0.jar:

mvn install:install-file -DgroupId=com.maple -DartifactId=maple -Dversion=2.0 -Dpackaging=jar -Dfile=maple-2.0.jar

What the command's parts mean:

mvn install:install-file : the install goal

-DgroupId : the groupId

-DartifactId : the artifactId

-Dversion : the version of the jar

-Dpackaging : the type of artifact to install

-Dfile : the path of the jar to install

Once the command completes, look in the local Maven repository: the pom and the other Maven dependency metadata have been generated.

2. Add the dependency to the project's pom.xml:


<dependency>
    <groupId>com.maple</groupId>
    <artifactId>maple</artifactId>
    <version>2.0</version>
</dependency>

Spark Streaming submit command
/opt/software/spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class com.qctt.main.Main --master local --num-executors 1 --driver-memory 521m --executor-memory 521m --executor-cores 2 --name distribute --queue root --jars "/home/wangjunbo/distribute/distribute-3-jar-with-dependencies.jar" --files /home/wangjunbo/distribute/config /home/wangjunbo/distribute/distribute-3.jar config

The correct way to configure JAVA_HOME
JAVA_HOME=/usr/java/jdk1.7.0_67
PATH=$PATH:$JAVA_HOME/bin
export JAVA_HOME PATH

Replacing a specific file inside a jar
http://blog.csdn.net/giianhui/article/details/10085145

jar uvf allydata_nomiss_analysis-4.jar cn/qctt/spark/Start$1.class

Java data caching
Core Cache Stores
http://infinispan.org/tutorials/simple/remote/
http://infinispan.org/cache-store-implementations/

A beginner's tutorial on JUnit 4 unit testing
The differences between @BeforeClass, @AfterClass, @Before, and @After
http://www.jianshu.com/p/7088822e21a3

Temporarily ignoring files in git
http://www.netingcn.com/git-temporary-ignore.html

An example of using java.ext.dirs

#!/bin/sh
java -Djava.ext.dirs=/data/nomiss/distribute/lib cn.allydata.main.Controller /data/nomiss/distributemc/conf/config.properties /data/nomiss/distributemc/conf/log4j.properties /data/nomiss/distributemc/conf /data/nomiss/distributemc/conf/motan_client_zk.xml

Extracting a jar
jar xf xxx.jar

Creating a jar
jar cvf mysamlpe.jar *

Setting an execution timeout for a block of Java code
http://blog.csdn.net/a9529lty/article/details/42711029
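
A common way to do this (my sketch; the linked post may use a different approach): run the code through an ExecutorService and bound the wait with Future.get(timeout).

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutDemo {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(new Callable<String>() {
            public String call() throws Exception {
                Thread.sleep(5000); // simulate slow work
                return "done";
            }
        });
        try {
            // Wait at most 2 seconds for the result.
            System.out.println(future.get(2, TimeUnit.SECONDS));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the still-running task
            System.out.println("timed out");
        } finally {
            executor.shutdownNow();
        }
    }
}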

Temporary directories created when Spark runs (e.g. in local mode)
http://blog.csdn.net/kwu_ganymede/article/details/49094881

Java 8's ten biggest new features, explained
http://www.jb51.net/article/48304.htm

HBase cannot create a table
http://blog.csdn.net/lxpbs8851/article/details/8287471
The eventual fix was to restart HBase.

The SimHash algorithm: introduction and implementation

SimHash is the algorithm Google uses to deduplicate text at web scale (it's from Google, enough said). Its best trick is reducing an entire document to a 64-bit fingerprint; to decide whether two documents are near-duplicates, you only need to check whether the Hamming distance between their fingerprints is < n (empirically, n is usually taken as 3).

Detailed introductions:
http://yanyiwu.com/work/2014/01/30/simhash-shi-xian-xiang-jie.html
http://www.lanceyan.com/tech/arch/simhash_hamming_distance_similarity.html

A code implementation:
http://my.oschina.net/leejun2005/blog/150086

Below is my adaptation of one expert's code:
The IKAnalyzer jar is at http://pan.baidu.com/s/1nu33Q5r
The Lucene jar is fetched via Maven.
The other two dependencies are Java utility classes to write yourself: one strips HTML tags, the other reads a file's contents.

package cn.allydata.util;

import java.io.IOException;
import java.io.StringReader;
import java.math.BigInteger;
import java.util.HashMap;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import com.wjb.util.common.WjbFileUtil;
import com.wjb.util.common.WjbHtmlUtil;

/**
 * Function: simHash text-similarity check; this example supports Chinese.<br/>
 * date: 2013-8-6 1:11:48 AM <br/>
 * @author june
 * @version 0.1
 *
 * <dependency>
 *     <groupId>org.apache.lucene</groupId>
 *     <artifactId>lucene-core</artifactId>
 *     <version>3.6.0</version>
 * </dependency>
 */
public class WjbSimHash {

    private String tokens;

    private BigInteger intSimHash;

    private String strSimHash;

    private int hashbits = 64;

    public WjbSimHash(String tokens) throws IOException {
        this.tokens = tokens;
        this.intSimHash = this.simHash();
    }

    public WjbSimHash(String tokens, int hashbits) throws IOException {
        this.tokens = tokens;
        this.hashbits = hashbits;
        this.intSimHash = this.simHash();
    }

    HashMap<String, Integer> wordMap = new HashMap<String, Integer>();

    public BigInteger simHash() throws IOException {
        // The feature vector.
        int[] v = new int[this.hashbits];
        // English tokenization would look like this:
        // StringTokenizer stringTokens = new StringTokenizer(this.tokens);
        // while (stringTokens.hasMoreTokens()) {
        //     String temp = stringTokens.nextToken();
        // }
        // 1. Chinese tokenization, using IKAnalyzer 3.2.8 for demonstration only
        //    (the API has changed in newer versions).
        Analyzer analyzer = new IKAnalyzer(true);
        StringReader reader = new StringReader(this.tokens);
        TokenStream ts = analyzer.tokenStream("", reader);
        CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);

        String word = null;
        while (ts.incrementToken()) {
            word = term.toString();
            // Note that stop words get dropped here.
            // System.out.println(word);
            // 2. Hash each token into a fixed-length value, e.g. a 64-bit integer.
            BigInteger t = this.hash(word);
            for (int i = 0; i < this.hashbits; i++) {
                BigInteger bitmask = new BigInteger("1").shiftLeft(i);
                // 3. Keep an integer array of length 64 (for a 64-bit fingerprint;
                //    other widths work too). For each token's hash, walk its bits:
                //    add 1 to a slot where the bit is 1, subtract 1 where it is 0.
                //    Repeat until every token's hash has been processed.
                if (t.and(bitmask).signum() != 0) {
                    // This accumulates the vector sum over all of the document's features.
                    // In real use, weight each token (e.g. by term frequency)
                    // rather than using a flat +1/-1.
                    v[i] += 1;
                } else {
                    v[i] -= 1;
                }
            }
        }

        BigInteger fingerprint = new BigInteger("0");
        StringBuffer simHashBuffer = new StringBuffer();
        for (int i = 0; i < this.hashbits; i++) {
            // 4. Finally collapse the array: slots >= 0 become 1, the rest 0,
            //    yielding the 64-bit fingerprint/signature.
            if (v[i] >= 0) {
                fingerprint = fingerprint.add(new BigInteger("1").shiftLeft(i));
                simHashBuffer.append("1");
            } else {
                simHashBuffer.append("0");
            }
        }
        this.strSimHash = simHashBuffer.toString();
        return fingerprint;
    }

    private BigInteger hash(String source) {
        if (source == null || source.length() == 0) {
            return new BigInteger("0");
        } else {
            char[] sourceArray = source.toCharArray();
            BigInteger x = BigInteger.valueOf(((long) sourceArray[0]) << 7);
            BigInteger m = new BigInteger("1000003");
            BigInteger mask = new BigInteger("2").pow(this.hashbits).subtract(new BigInteger("1"));
            for (char item : sourceArray) {
                BigInteger temp = BigInteger.valueOf((long) item);
                x = x.multiply(m).xor(temp).and(mask);
            }
            x = x.xor(new BigInteger(String.valueOf(source.length())));
            if (x.equals(new BigInteger("-1"))) {
                x = new BigInteger("-2");
            }
            return x;
        }
    }

    public int hammingDistance(WjbSimHash other) {

        BigInteger x = this.intSimHash.xor(other.intSimHash);
        int tot = 0;

        // Count the 1 bits in x. Subtracting 1 from a binary number flips
        // every bit from the lowest set bit downward (that bit included),
        // so n & (n - 1) clears the lowest set bit. The number of times we
        // can repeat that before reaching zero is the number of set bits.
        while (x.signum() != 0) {
            tot += 1;
            x = x.and(x.subtract(new BigInteger("1")));
        }
        return tot;
    }

    public int getDistance(String str1, String str2) {
        int distance;
        if (str1.length() != str2.length()) {
            distance = -1;
        } else {
            distance = 0;
            for (int i = 0; i < str1.length(); i++) {
                if (str1.charAt(i) != str2.charAt(i)) {
                    distance++;
                }
            }
        }
        return distance;
    }

    public BigInteger getIntSimHash() {
        return this.intSimHash;
    }

    public String getStrSimHash() {
        return this.strSimHash;
    }

    public static void main(String[] args) throws IOException {
        String s = WjbFileUtil.fromFile("d:/1.txt");
        s = WjbHtmlUtil.delHTMLTag(s);
        System.out.println(s);
        WjbSimHash hash1 = new WjbSimHash(s, 64);

        System.out.println("---------------------------------");
        // The second file has the first sentence removed and two noise strings added.
        s = WjbFileUtil.fromFile("d:/2.txt", WjbFileUtil.GBK);
        s = WjbHtmlUtil.delHTMLTag(s);
        System.out.println(s);
        WjbSimHash hash2 = new WjbSimHash(s, 64);

        // System.out.println("============================");
        int dis = hash1.getDistance(hash1.strSimHash, hash2.strSimHash);
        System.out.println(hash1.hammingDistance(hash2) + " " + dis);
        long begin = System.currentTimeMillis();
        // for (int i = 0; i < 10000000; i++) {
        //     hash1.hammingDistance(hash2);
        // }
        long end = System.currentTimeMillis();
        System.out.println(end - begin);
    }
}

The cosine similarity algorithm

For a detailed introduction to cosine similarity, see: http://www.ruanyifeng.com/blog/2013/03/cosine_similarity.html
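
In short: represent each document as a term-frequency vector; for vectors A and B the similarity is cos(θ) = (A · B) / (‖A‖ ‖B‖), the dot product divided by the product of the vector norms. That is exactly what the cosine() method below computes.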

Below is my Java implementation based on that introduction:

import java.io.IOException;
import java.io.StringReader;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Set;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.wltea.analyzer.lucene.IKAnalyzer;

import com.wjb.util.common.WjbTuple2;

public class CosineTextSimilarity {

    public static Map<String, Integer> makeTermFrequency(String text) throws IOException {
        // Tokenize the text and count how often each term occurs.
        Analyzer analyzer = new IKAnalyzer(true);
        StringReader reader = new StringReader(text);
        TokenStream ts = analyzer.tokenStream("", reader);
        CharTermAttribute term = ts.getAttribute(CharTermAttribute.class);
        Map<String, Integer> tf = new HashMap<String, Integer>();
        while (ts.incrementToken()) {
            String t = term.toString();
            Integer count = tf.get(t);
            if (count == null) {
                tf.put(t, 1);
            } else {
                tf.put(t, count + 1);
            }
        }
        analyzer.close();
        reader.close();
        return tf;
    }

    /**
     * Filter by key length: a key is kept only when its length is at least {@code length}.
     * @param map
     * @param length
     * @return
     * @throws IOException
     */
    public static Map<String, Integer> filterByKeyLength(Map<String, Integer> map, int length) throws IOException {
        Map<String, Integer> m = new HashMap<String, Integer>();
        for (String key : map.keySet()) {
            if (key == null || key.trim().length() >= length) {
                m.put(key, map.get(key));
            }
        }
        return m;
    }

    public static WjbTuple2<int[], int[]> makeVector(Map<String, Integer> first, Map<String, Integer> second) {
        // The union of the two term sets defines the dimensions of both vectors.
        Set<String> keys = new HashSet<String>();
        keys.addAll(first.keySet());
        keys.addAll(second.keySet());
        int[] vector1 = new int[keys.size()];
        int[] vector2 = new int[keys.size()];
        int i = 0;
        for (String key : keys) {
            Integer count1 = first.get(key);
            if (count1 != null) {
                vector1[i] = count1;
            }
            Integer count2 = second.get(key);
            if (count2 != null) {
                vector2[i] = count2;
            }
            i++;
        }
        return new WjbTuple2<int[], int[]>(vector1, vector2);
    }

    public static double cosine(WjbTuple2<int[], int[]> tuple) {
        int[] vector1 = tuple._1;
        int[] vector2 = tuple._2;

        double sum1 = 0;  // dot product
        double sum21 = 0; // squared norm of vector1
        double sum22 = 0; // squared norm of vector2

        for (int i = 0; i < vector1.length; i++) {
            sum1 += vector1[i] * vector2[i];
            sum21 += vector1[i] * vector1[i];
            sum22 += vector2[i] * vector2[i];
        }

        return sum1 / (Math.sqrt(sum21 * sum22));
    }

    public static List<Entry> sort(Map unsortMap) {

        // Convert the Map to a List.
        List<Map.Entry> list = new LinkedList<Map.Entry>(unsortMap.entrySet());

        // Sort the list by value, descending; break ties on the key.
        Collections.sort(list, new Comparator<Map.Entry>() {
            public int compare(Map.Entry o1, Map.Entry o2) {
                String d1 = o1.getValue().toString();
                String d2 = o2.getValue().toString();
                String k1 = o1.getKey().toString();
                String k2 = o2.getKey().toString();
                if (o1.getValue() instanceof Integer) {
                    Integer nd1 = Integer.parseInt(d1);
                    Integer nd2 = Integer.parseInt(d2);
                    if (nd2 - nd1 != 0) {
                        return nd2 - nd1;
                    } else {
                        return k2.compareTo(k1);
                    }
                } else {
                    return d2.compareTo(d1);
                }
            }
        });

        return list;
    }
}

Below is the main method used for testing:

import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import com.wjb.util.common.WjbFileUtil;
import com.wjb.util.common.WjbTuple2;

public class Main {
    public static void main(String[] args) throws Exception {

        String text1 = WjbFileUtil.fromFile("d:/1.txt");
        String text2 = WjbFileUtil.fromFile("d:/2.txt", WjbFileUtil.GBK);

        System.out.println(text2);
        long begin = System.currentTimeMillis();
        Map<String, Integer> map1 = CosineTextSimilarity.makeTermFrequency(text1);
        Map<String, Integer> map2 = CosineTextSimilarity.makeTermFrequency(text2);

        // map1 = CosineTextSimilarity.filterByKeyLength(map1, 2);
        // map2 = CosineTextSimilarity.filterByKeyLength(map2, 2);

        List<Entry> list1 = CosineTextSimilarity.sort(map1);
        System.out.println(list1);
        list1 = list1.subList(0, list1.size() > 20 ? 20 : list1.size());

        List<Entry> list2 = CosineTextSimilarity.sort(map2);
        System.out.println(list2);
        list2 = list2.subList(0, list2.size() > 20 ? 20 : list2.size());

        map1 = list2Map(list1);
        map2 = list2Map(list2);

        WjbTuple2<int[], int[]> tuple = CosineTextSimilarity.makeVector(map1, map2);
        double cos = CosineTextSimilarity.cosine(tuple);

        long end = System.currentTimeMillis();

        System.out.println(end - begin);

        System.out.println(cos);
    }

    public static Map<String, Integer> list2Map(List<Entry> list) {
        Map<String, Integer> map = new HashMap<String, Integer>();
        for (Entry e : list) {
            map.put(e.getKey().toString(), (Integer) e.getValue());
        }
        return map;
    }
}