Background:
For Solr's Join feature, see https://wiki.apache.org/solr/Join. It supports joining two collections in one search; the syntax is roughly {!join fromIndex=collectionName from=fromid to=toid}*:*. Join, however, comes with a number of restrictions:
Fields or other properties of the documents being joined "from" are not available for use in processing of the resulting set of "to" documents (ie: you can not return fields in the "from" documents as if they were a multivalued field on the "to" documents)
The Join query produces constant scores for all documents that match -- scores computed by the nested query for the "from" documents are not available to use in scoring the "to" documents
In a DistributedSearch environment, you can not Join across cores on multiple nodes. If however you have a custom sharding approach, you could join across cores on the same node.
The third point says that Solr cannot join across cores that live on multiple nodes. The official documentation (this article was tested on Solr 5.5.3) also states:
"To use the movie_directors collection in Solr join queries with the movies collection, it needs to have a replica on each of the four nodes. In other words, movie_directors must have one shard and replication factor of four ... Moreover, if you add a replica to the to collection, then you also need to add a replica for the from collection."
This roughly means: when collection A is joined against collection B, and A has multiple shards spread over different machines, then for a query such as /solr/A/select?fq={!join fromIndex=B from=bid to=aid}*:* to work, B may only have one shard, and that shard must have a replica on every node that hosts a shard of A. Whenever A is extended to a new node, B needs an additional replica on that node as well.
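Concretely, under the stock rules the from collection has to be created with a single shard and a replication factor equal to the number of nodes. A minimal sketch with the Collections API, assuming four nodes and the placeholder names B (collection) and conf1 (configset):

# one shard, replicated onto all four nodes (B and conf1 are placeholders)
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=B&numShards=1&replicationFactor=4&collection.configName=conf1'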
So the question becomes: if both collection A and collection B are multi-shard collections, how can a join search across the two of them be supported?
In my own tests, joining two multi-shard collections fails with the following error:
SolrCloud join: multiple shards not yet supported dmpreportcloud
A workable idea turned up in an Apache JIRA issue: https://issues.apache.org/jira/browse/SOLR-8297
The idea is to shard the two collections in a coordinated, customized way. One variant requires the two collections to have shards with identical names placed on the same node; the other requires shards with identical hash ranges placed on the same node. The crucial point is the same in both cases: the matching shards must sit on the same node, and the data to be joined must be routed onto that node.
The modified code lives in org/apache/solr/search/join/ScoreJoinQParserPlugin.java; the relevant fragments:
/**
 * Find, by shard name, the matching shard of the from collection on the current node.
 * @param zkController
 * @param fromIndex
 * @param req
 * @return the core name of the matching local replica, or null if none is found
 */
private static String findNameMatchShard(ZkController zkController, String fromIndex, SolrQueryRequest req) {
    String fromReplica = null;
    CoreDescriptor containerDesc = req.getCore().getCoreDescriptor();
    String nodeName = zkController.getNodeName();
    // Get collection information of the "to" shard, i.e. the shard the query is running on
    DocCollection toCollection = zkController.getClusterState().getCollection(containerDesc.getCollectionName());
    Slice toShardSlice = toCollection.getActiveSlicesMap().get(
            req.getCore().getCoreDescriptor().getCoreProperty(CoreDescriptor.CORE_SHARD, null));
    // Get the number of shards of fromIndex
    int size = zkController.getClusterState().getCollection(fromIndex).getActiveSlices().size();
    if (size == 1) { // only one shard: fall back to the original single-shard lookup
        return findLocalReplica(zkController, fromIndex);
    } else {         // multiple shards: pick the "from" shard whose name matches the "to" shard's name
        String toSliceName = toShardSlice.getName();
        for (Slice slice : zkController.getClusterState().getCollection(fromIndex).getActiveSlices()) {
            if (toSliceName.equals(slice.getName())) {
                return findSliceReplica(fromIndex, nodeName, slice);
            }
        }
    }
    return fromReplica;
}
private static String findSliceReplica(String fromIndex, String nodeName, Slice slice) {
    String fromReplica = null;
    for (Replica replica : slice.getReplicas()) {
        if (replica.getNodeName().equals(nodeName)) {
            fromReplica = replica.getStr(ZkStateReader.CORE_NAME_PROP);
            // found local replica, but is it Active?
            if (replica.getState() != Replica.State.ACTIVE)
                throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
                        "SolrCloud join: " + fromIndex + " has a local replica (" + fromReplica +
                        ") on " + nodeName + ", but it is " + replica.getState());
            break;
        }
    }
    return fromReplica;
}
The full change is available as a patch: http://git.oschina.net/dearbaba/Solr5.5.3-src/commit/35565f20cf09dbe9a5c3f0576d77ec11b73615ef
You can download the source, compile it, overwrite the corresponding classes in solr-core-5.5.3.jar with the two modified files, and repackage the jar.
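As a rough sketch of that repackaging step (the build/classes directory and the jar path are placeholders; add the second modified class the same way), the jar tool can update the archive in place:

# update solr-core-5.5.3.jar with the recompiled classes (paths are placeholders)
cd build/classes
jar uf /path/to/solr-core-5.5.3.jar org/apache/solr/search/join/ScoreJoinQParserPlugin*.class

The updated jar then replaces the original one in the Solr installation on every node, and the nodes are restarted.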
Once the code is in place, the collections have to be sharded in the coordinated way described above.
1. Customized sharding
Example: suppose there are three machines and two collections, dmpreportcloud and paddatacloud. Both collections need three shards with two replicas per shard. The commands to create them look roughly like this:
# create the dmpreportcloud collection (implicit router, shard1 on node .30)
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=dmpreportcloud&router.name=implicit&shards=shard1&createNodeSet=10.20.112.30:8983_solr'
# add the remaining dmpreportcloud shards
curl 'http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=shard2&collection=dmpreportcloud&createNodeSet=10.20.112.31:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=shard3&collection=dmpreportcloud&createNodeSet=10.20.112.32:8983_solr'

# create the paddatacloud collection (implicit router, shard1 on node .30)
curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=paddatacloud&router.name=implicit&shards=shard1&createNodeSet=10.20.112.30:8983_solr'
# add the remaining paddatacloud shards
curl 'http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=shard2&collection=paddatacloud&createNodeSet=10.20.112.31:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=CREATESHARD&shard=shard3&collection=paddatacloud&createNodeSet=10.20.112.32:8983_solr'
# add replicas for dmpreportcloud (each shard gets a second replica on another node)
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=dmpreportcloud&shard=shard1&node=10.20.112.31:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=dmpreportcloud&shard=shard2&node=10.20.112.32:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=dmpreportcloud&shard=shard3&node=10.20.112.30:8983_solr'
# add replicas for paddatacloud (mirroring the dmpreportcloud layout)
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=paddatacloud&shard=shard1&node=10.20.112.31:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=paddatacloud&shard=shard2&node=10.20.112.32:8983_solr'
curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=paddatacloud&shard=shard3&node=10.20.112.30:8983_solr'
Laid out this way, the shards of the two collections line up node by node: each shard of paddatacloud lives on exactly the same nodes as the dmpreportcloud shard of the same name.
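To double-check the layout, the Collections API CLUSTERSTATUS action lists every shard of every collection together with the node that hosts each replica:

curl 'http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&wt=json'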
2. Join syntax for the two collections
http://10.20.112.30:8983/solr/dmpreportcloud/select?q=*:*&wt=json&indent=true&distribJoinMod=name&fq={!join fromIndex=paddatacloud from=device_id_md5 to=id}*:*
The extra parameter distribJoinMod has to be passed; its value is either name or range. With name, the feature supports collections that use the implicit routing rule and checks whether the shards of the two collections with the same name sit on the same node. With range, it supports collections that use the compositeId routing rule and checks whether the shards of the two collections cover the same hash ranges.
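For comparison, a hypothetical pair of collections built with the default compositeId router (report_comp and pad_comp are made-up names), with identical shard hash ranges and co-located replicas, would be queried the same way but with distribJoinMod=range:

http://10.20.112.30:8983/solr/report_comp/select?q=*:*&wt=json&indent=true&distribJoinMod=range&fq={!join fromIndex=pad_comp from=device_id_md5 to=id}*:*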
Note: the two collections must use the same routing rule, and the related data of corresponding shards must be routed to the same server (that is, the same Solr node).
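With the implicit router used in the example above, this co-routing can be enforced at indexing time through the _route_ parameter. A minimal sketch (the document ids, field values and the choice of shard1 are illustrative), sending a dmpreportcloud document and the paddatacloud document that references it to the same shard name:

# the "to" document: its id is the value the join matches against
curl 'http://10.20.112.30:8983/solr/dmpreportcloud/update?_route_=shard1&commit=true' \
     -H 'Content-Type: application/json' \
     -d '[{"id":"abc123"}]'
# the "from" document: device_id_md5 points at that id and is routed to the same shard
curl 'http://10.20.112.30:8983/solr/paddatacloud/update?_route_=shard1&commit=true' \
     -H 'Content-Type: application/json' \
     -d '[{"id":"pad-doc-1","device_id_md5":"abc123"}]'

Because both documents land on shard1 of their respective collections, and those shards are co-located per the layout above, the modified join can resolve device_id_md5 against id locally on each node.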