Background:
For Solr's Join feature, see https://wiki.apache.org/solr/Join. It supports joining two collections in one search; the syntax is roughly {!join fromIndex=collectionName from=fromid to=toid}*:*. Join, however, comes with a number of restrictions:
Fields or other properties of the documents being joined "from" are not available for use in processing of the resulting set of "to" documents (ie: you can not return fields in the "from" documents as if they were a multivalued field on the "to" documents)
The Join query produces constant scores for all documents that match -- scores computed by the nested query for the "from" documents are not available to use in scoring the "to" documents
In a DistributedSearch environment, you can not Join across cores on multiple nodes. If however you have a custom sharding approach, you could join across cores on the same node.
The third point says that Solr cannot join across cores that live on multiple nodes. The official documentation (this article was tested on Solr 5.5.3) also states:
"To use the movie_directors collection in Solr join queries with the movies collection, it needs to have a replica on each of the four nodes. In other words, movie_directors must have one shard and replication factor of four ... Moreover, if you add a replica to the to collection, then you also need to add a replica for the from collection."
This roughly means: when collection A is joined against collection B, and A has multiple shards spread over different machines, then for a query such as /solr/A/select?fq={!join fromIndex=B from=bid to=aid}*:* to work, B may only have one shard, and that shard must have a replica on every node that hosts a shard of A. Whenever A is extended to a new node, B needs an additional replica on that node as well.
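Concretely, under the stock rules the from collection has to be created with a single shard and a replication factor equal to the number of nodes. A minimal sketch with the Collections API, assuming four nodes and the placeholder names B (collection) and conf1 (configset):

# one shard, replicated onto all four nodes (B and conf1 are placeholders)
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=B&numShards=1&replicationFactor=4&collection.configName=conf1'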
So the question becomes: if both collection A and collection B are multi-shard collections, how can a join search across the two of them be supported?
In my own tests, joining two multi-shard collections fails with the following error:
SolrCloud join: multiple shards not yet supported dmpreportcloud
A workable idea turned up in an Apache JIRA issue: https://issues.apache.org/jira/browse/SOLR-8297
The idea is to shard the two collections in a coordinated, customized way. One variant requires the two collections to have shards with identical names placed on the same node; the other requires shards with identical hash ranges placed on the same node. The crucial point is the same in both cases: the matching shards must sit on the same node, and the data to be joined must be routed onto that node.
The modified code lives in org/apache/solr/search/join/ScoreJoinQParserPlugin.java; the relevant fragments:
/**
 * Find, by shard name, the matching shard of the from collection on the current node.
 * @param zkController
 * @param fromIndex
 * @param req
 * @return the core name of the matching local replica, or null if none is found
 */
private static String findNameMatchShard(ZkController zkController, String fromIndex, SolrQueryRequest req) {
    String fromReplica = null;
    CoreDescriptor containerDesc = req.getCore().getCoreDescriptor();
    String nodeName = zkController.getNodeName();
    // Get collection information of the "to" shard, i.e. the shard the query is running on
    DocCollection toCollection = zkController.getClusterState().getCollection(containerDesc.getCollectionName());
    Slice toShardSlice = toCollection.getActiveSlicesMap().get(
            req.getCore().getCoreDescriptor().getCoreProperty(CoreDescriptor.CORE_SHARD, null));
    // Get the number of shards of fromIndex
    int size = zkController.getClusterState().getCollection(fromIndex).getActiveSlices().size();
    if (size == 1) { // only one shard: fall back to the original single-shard lookup
        return findLocalReplica(zkController, fromIndex);
    } else {         // multiple shards: pick the "from" shard whose name matches the "to" shard's name
        String toSliceName = toShardSlice.getName();
        for (Slice slice : zkController.getClusterState().getCollection(fromIndex).getActiveSlices()) {
            if (toSliceName.equals(slice.getName())) {
                return findSliceReplica(fromIndex, nodeName, slice);
            }
        }
    }
    return fromReplica;
}
private static String findSliceReplica(String fromIndex, String nodeName, Slice slice) {
    String fromReplica = null;
    for (Replica replica : slice.getReplicas()) {
        if (replica.getNodeName().equals(nodeName)) {
            fromReplica = replica.getStr(ZkStateReader.CORE_NAME_PROP);
            // found local replica, but is it Active?
            if (replica.getState() != Replica.State.ACTIVE)
                throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                        "SolrCloud join: " + fromIndex + " has a local replica (" + fromReplica +
                        ") on " + nodeName + ", but it is " + replica.getState());
            break;
        }
    }
    return fromReplica;
}
The full change is available as a patch: http://git.oschina.net/dearbaba/Solr5.5.3-src/commit/35565f20cf09dbe9a5c3f0576d77ec11b73615ef
You can download the source, compile it, overwrite the corresponding classes in solr-core-5.5.3.jar with the two modified files, and repackage the jar.
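As a rough sketch of that repackaging step (the build/classes directory and the jar path are placeholders; add the second modified class the same way), the jar tool can update the archive in place:

# update solr-core-5.5.3.jar with the recompiled classes (paths are placeholders)
cd build/classes
jar uf /path/to/solr-core-5.5.3.jar org/apache/solr/search/join/ScoreJoinQParserPlugin*.class

The updated jar then replaces the original one in the Solr installation on every node, and the nodes are restarted.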
Once the code is in place, the collections have to be sharded in the coordinated way described above.
1. Customized sharding
Example: suppose there are three machines and two collections, dmpreportcloud and paddatacloud. Both collections need three shards with two replicas per shard. The commands to create them look roughly like this:
# create the dmpreportcloud collection (implicit router, shard1 on node .30)
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=dmpreportcloud&router.name=implicit&shards=shard1&createNodeSet=10.20.112.30:8983_solr'
# add the remaining dmpreportcloud shards
curl 'http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=shard2&collection=dmpreportcloud&createNodeSet=10.20.112.31:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=shard3&collection=dmpreportcloud&createNodeSet=10.20.112.32:8983_solr'

# create the paddatacloud collection (implicit router, shard1 on node .30)
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=paddatacloud&router.name=implicit&shards=shard1&createNodeSet=10.20.112.30:8983_solr'
# add the remaining paddatacloud shards
curl 'http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=shard2&collection=paddatacloud&createNodeSet=10.20.112.31:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=shard3&collection=paddatacloud&createNodeSet=10.20.112.32:8983_solr'
# add replicas for dmpreportcloud (each shard gets a second replica on another node)
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=dmpreportcloud&shard=shard1&node=10.20.112.31:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=dmpreportcloud&shard=shard2&node=10.20.112.32:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=dmpreportcloud&shard=shard3&node=10.20.112.30:8983_solr'
# add replicas for paddatacloud (mirroring the dmpreportcloud layout)
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=paddatacloud&shard=shard1&node=10.20.112.31:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=paddatacloud&shard=shard2&node=10.20.112.32:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=paddatacloud&shard=shard3&node=10.20.112.30:8983_solr'
Laid out this way, the shards of the two collections line up node by node: each shard of paddatacloud lives on exactly the same nodes as the dmpreportcloud shard of the same name.
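To double-check the layout, the Collections API CLUSTERSTATUS action lists every shard of every collection together with the node that hosts each replica:

curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'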
2. Join syntax for the two collections
http://10.20.112.30:8983/solr/dmpreportcloud/select?q=*:*&wt=json&indent=true&distribJoinMod=name&fq={!join fromIndex=paddatacloud from=device_id_md5 to=id}*:*
The extra parameter distribJoinMod has to be passed; its value is either name or range. With name, the feature supports collections that use the implicit routing rule and checks whether the shards of the two collections with the same name sit on the same node. With range, it supports collections that use the compositeId routing rule and checks whether the shards of the two collections cover the same hash ranges.
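For comparison, a hypothetical pair of collections built with the default compositeId router (report_comp and pad_comp are made-up names), with identical shard hash ranges and co-located replicas, would be queried the same way but with distribJoinMod=range:

http://10.20.112.30:8983/solr/report_comp/select?q=*:*&wt=json&indent=true&distribJoinMod=range&fq={!join fromIndex=pad_comp from=device_id_md5 to=id}*:*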
Note: the two collections must use the same routing rule, and the related data of corresponding shards must be routed to the same server (that is, the same Solr node).
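With the implicit router used in the example above, this co-routing can be enforced at indexing time through the _route_ parameter. A minimal sketch (the document ids, field values and the choice of shard1 are illustrative), sending a dmpreportcloud document and the paddatacloud document that references it to the same shard name:

# the "to" document: its id is the value the join matches against
curl 'http://10.20.112.30:8983/solr/dmpreportcloud/update?_route_=shard1&commit=true' \
     -H 'Content-Type: application/json' \
     -d '[{"id":"abc123"}]'
# the "from" document: device_id_md5 points at that id and is routed to the same shard
curl 'http://10.20.112.30:8983/solr/paddatacloud/update?_route_=shard1&commit=true' \
     -H 'Content-Type: application/json' \
     -d '[{"id":"pad-doc-1","device_id_md5":"abc123"}]'

Because both documents land on shard1 of their respective collections, and those shards are co-located per the layout above, the modified join can resolve device_id_md5 against id locally on each node.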