Monday 10 December 2018


SQL Execution via Spark Thrift Server throws "Error: java.lang.StackOverflowError"


Overview 

This article describes the "Error: java.lang.StackOverflowError (state=,code=0)" error seen while executing a very long SQL statement over a Spark Thrift Server connection. The issue and resolution were tested on an Ambari-managed 3-node Hortonworks HDP 2.6.5.0 cluster.

The test query contains 133 or more predicates in its WHERE clause. With the default configuration, such a query fails with java.lang.StackOverflowError.
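As an illustration only (the database, table and column names here are hypothetical, not taken from the original test), a query of this shape fails once the WHERE clause reaches 133 predicates:

    SELECT id, name
    FROM   test_db.test_table
    WHERE  id = 1
       OR  id = 2
       OR  id = 3
       -- ... predicates 4 through 132 omitted ...
       OR  id = 133;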



The Spark Thrift Server log file shows a java.lang.StackOverflowError with a long chain of recursive nodeToExpr calls.



When the Simba driver for Spark is used, the error seen on the client console might look different.

Root Cause


It can be noticed from the log messages that the default JVM stack size allows 132 recursive calls to nodeToExpr, i.e. 132 predicates, to be pushed onto the stack. Hence, when a 133rd predicate is added to the WHERE clause of the query, java.lang.StackOverflowError is thrown.

Resolution


We need to increase the thread stack size using the JVM -Xss option. To support queries containing many more predicates in the WHERE clause, we can set the stack size to a large value, say 1500m.


We can add a line like the following at the end of the spark-env.sh file (or increase the value if such a statement already exists).
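For example (a sketch only: the exact line from the original setup is not reproduced here, and which spark-env.sh variable the Thrift Server honours can depend on how your distribution launches it, so verify against your environment):

    # Increase the JVM thread stack size for Spark daemon processes.
    # SPARK_DAEMON_JAVA_OPTS is the standard spark-env.sh hook for daemon JVM options;
    # appending keeps any options that are already set.
    export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Xss1500m"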



After saving spark-env.sh, restart the Spark Thrift Server and the Spark History Server and refresh the Spark client configs. The same test query now runs successfully.
 


Monday 26 November 2018

Greenplum "ERROR: target lists can have at most 1664 entries"

Overview 

This article describes the error "ERROR: target lists can have at most 1664 entries". The experiment described in this article was tested on Greenplum 5.5.

Greenplum limits the number of columns that can appear in the projection (target) list of a query: a SELECT query can project at most 1664 columns. Hence, executing a SELECT query whose projection list contains more than 1664 columns leads to this error.

Consider a table with more than 832 columns. We can also hit "ERROR: target lists can have at most 1664 entries" while inserting rows, with any number of columns in the INSERT itself, into such a table, as sketched below.
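As a reproduction sketch (the table and column names are hypothetical; the original post showed its own example here), consider a table with 833 columns:

    -- Hypothetical wide table; column definitions c3 .. c832 omitted for brevity.
    CREATE TABLE wide_tab (
        c1   int,
        c2   int,
        -- ...
        c833 int
    );

    -- Inserting even a single value can fail once automatic statistics collection
    -- kicks in, because the internal statistics SELECT projects each of the 833
    -- columns twice (2 x 833 = 1666 > 1664):
    INSERT INTO wide_tab (c1) VALUES (1);
    -- ERROR:  target lists can have at most 1664 entries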










Table Statistics Collection 

Greenplum provides a feature for analyzing tables and collecting statistics, which are used to generate cost-optimised query plans. The ANALYZE statement can be used by an end user to trigger table analysis manually. Greenplum also has automatic statistics collection, which ANALYZEs a table and collects statistics whenever rows are inserted into it. The gp_autostat_mode configuration parameter decides whether automatic statistics collection is triggered or not.
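As a quick illustration (assuming a psql session connected to the Greenplum master), the mode can be inspected and changed at session level:

    -- Show the automatic statistics collection mode currently in effect:
    SHOW gp_autostat_mode;
    -- Documented values are none, on_change and on_no_stats (the default).
    -- For example, switch automatic collection off for the current session only:
    SET gp_autostat_mode = 'none';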

While collecting table statistics, the analyzer executes internal SELECT statements to estimate the size of the stored data; these SELECTs project all of the table's columns. One of the statistics-collection SELECTs projects each column twice (after applying certain functions to it).

The following is an example of the statistics-gathering SELECT under discussion, for a table named tab1 with a single column id1.
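The original post showed the actual statement here; as an illustration of its shape only (this is not the verbatim internal Greenplum query), it looks roughly like the following:

    -- Illustrative shape only: each column (here just id1) appears twice in the
    -- target list, once wrapped in a function such as pg_column_size(), so a table
    -- with N columns yields a target list of 2*N entries.
    SELECT pg_column_size(id1), id1
    FROM   tab1;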


Given Greenplum's limit of 1664 entries in the projection list, it is therefore expected to hit "ERROR: target lists can have at most 1664 entries" while inserting rows into a table with more than 832 columns (2 × 833 = 1666 > 1664).

References 

  1. https://github.com/greenplum-db/gpdb/issues/1345
  2. https://gpdb.docs.pivotal.io/550/admin_guide/topics/g-automatic-statistics-collection-parameters.html
  3. https://gpdb.docs.pivotal.io/550/ref_guide/sql_commands/ANALYZE.html

Saturday 25 March 2017

Integrating Lucidworks Fusion with SolrCloud

Lucidworks Fusion provides a frontend UI and interfaces to implement search applications easily.
Fusion can be set up in standalone mode or in cluster mode.
This article explains how Fusion can be installed in cluster mode with Hortonworks HDP components, including Ambari Infra and ZooKeeper.

Cluster nodes

Here we will have the following 6 nodes and components.


Node 0 : Ambari Server
Node 1 : Ambari Agent, NameNode, DataNode, ZooKeeper, Ambari Infra, ZooKeeper client, HDFS client
Node 2 : Ambari Agent, Secondary NameNode, DataNode, ZooKeeper, Ambari Infra, ZooKeeper client, HDFS client
Node 3 : Ambari Agent, DataNode, ZooKeeper, Ambari Infra, ZooKeeper client, HDFS client
Node 4 : Ambari Agent, DataNode, ZooKeeper client, HDFS client, Lucidworks Fusion
Node 5 : Ambari Agent, DataNode, ZooKeeper client, HDFS client, Lucidworks Fusion


This article uses HDP 2.5.0.0, Ambari 2.4.3 and Fusion 3.0.0.
The FQDNs of the nodes are as follows.

Steps

  •  Create a 6-node cluster as given above. You may also use fewer nodes, with multiple services co-located on one node.
  • Install Ambari Server on Node 0
    • The yum repository file for CentOS 6 can be downloaded from http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.4.0.1/ambari.repo
    • Install Ambari Server using the yum command (example commands are shown after this group of steps)
  • From the Ambari UI, add the following services on the respective nodes
    • Add the HDFS service
    • Add the ZooKeeper service
    • Add the Ambari Infra service
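    As an illustration of the Ambari Server installation step above (the standard Hortonworks procedure for CentOS 6; adjust the repository version to match your cluster):

        # Download the Ambari repo file, then install, set up and start Ambari Server on Node 0.
        wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.4.0.1/ambari.repo \
             -O /etc/yum.repos.d/ambari.repo
        yum install -y ambari-server
        ambari-server setup      # accept the defaults or supply your own JDK/database options
        ambari-server start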

  • Make sure all services are running (you may run the service checks)
  • Decompress Fusion into the /opt directory on Node 4 and Node 5
  • On each of Node 4 and Node 5, update /opt/fusion/3.0.0/conf/fusion.properties (a sketch of the relevant changes follows below)


      • This is because, in cluster mode, we want to use the Solr and ZooKeeper from the HDP installation instead of Fusion's embedded components
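        A sketch of the kind of changes involved (the property names follow Fusion 3.x fusion.properties conventions and the ZooKeeper quorum shown is a placeholder; verify both against the file shipped with your installation and against your cluster's hostnames):

            # Point Fusion at the HDP ZooKeeper quorum instead of its embedded ZooKeeper
            # (the hostnames below are placeholders for the Node 1-3 FQDNs):
            default.zk.connect = node1.example.com:2181,node2.example.com:2181,node3.example.com:2181

            # Remove "zookeeper" and "solr" from the group.default service list so the
            # embedded components are not started:
            # group.default = <shipped list, minus zookeeper and solr>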
          

         
       
  • Start Fusion on nodes n4 and n5, for example:
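    Assuming the default Fusion 3.0.0 layout used earlier in this article (the exact command from the original post is not reproduced here):

        # Start all Fusion services configured for this node.
        cd /opt/fusion/3.0.0
        bin/fusion start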