Monday 10 December 2018


SQL Execution via Spark Thrift Server throws "Error: java.lang.StackOverflowError"


Overview 

This article describes the "Error: java.lang.StackOverflowError (state=,code=0)" error seen while executing a very long SQL statement over a Spark Thrift Server connection. The issue and resolution were tested on an Ambari-managed 3-node Hortonworks HDP 2.6.5.0 cluster.

The test query contains 133 or more predicates in its WHERE clause. With the default configuration, such a query fails with java.lang.StackOverflowError.
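As an illustration only (the database, table and column names here are hypothetical, not taken from the original test), a query of this shape fails once the WHERE clause reaches 133 predicates:

    SELECT id, name
    FROM   test_db.test_table
    WHERE  id = 1
       OR  id = 2
       OR  id = 3
       -- ... predicates 4 through 132 omitted ...
       OR  id = 133;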



The Spark Thrift Server log file shows a java.lang.StackOverflowError with a long chain of recursive nodeToExpr calls.



When the Simba driver for Spark is used, the error seen on the client console might look different.

Root Cause


It can be noticed from the log messages that the default JVM stack size allows 132 recursive calls to nodeToExpr, i.e. 132 predicates, to be pushed onto the stack. Hence, when a 133rd predicate is added to the WHERE clause of the query, java.lang.StackOverflowError is thrown.

Resolution


We need to increase the thread stack size using the JVM -Xss option. To support queries containing many more predicates in the WHERE clause, we can set the stack size to a large value, say 1500m.


We can add a line like the following at the end of the spark-env.sh file (or increase the value if such a statement already exists).
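For example (a sketch only: the exact line from the original setup is not reproduced here, and which spark-env.sh variable the Thrift Server honours can depend on how your distribution launches it, so verify against your environment):

    # Increase the JVM thread stack size for Spark daemon processes.
    # SPARK_DAEMON_JAVA_OPTS is the standard spark-env.sh hook for daemon JVM options;
    # appending keeps any options that are already set.
    export SPARK_DAEMON_JAVA_OPTS="$SPARK_DAEMON_JAVA_OPTS -Xss1500m"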



After saving spark-env.sh, restart the Spark Thrift Server and the Spark History Server and refresh the Spark client configs. The same test query now runs successfully.
 


Monday 26 November 2018

Greenplum "ERROR: target lists can have at most 1664 entries"

Overview 

This article describes the error "ERROR: target lists can have at most 1664 entries". The experiment described in this article was tested on Greenplum 5.5.

Greenplum limits the number of columns that can appear in the projection (target) list of a query: a SELECT query can project at most 1664 columns. Hence, executing a SELECT query whose projection list contains more than 1664 columns leads to this error.

Consider a table with more than 832 columns. We can also hit "ERROR: target lists can have at most 1664 entries" while inserting rows, with any number of columns in the INSERT itself, into such a table, as sketched below.
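As a reproduction sketch (the table and column names are hypothetical; the original post showed its own example here), consider a table with 833 columns:

    -- Hypothetical wide table; column definitions c3 .. c832 omitted for brevity.
    CREATE TABLE wide_tab (
        c1   int,
        c2   int,
        -- ...
        c833 int
    );

    -- Inserting even a single value can fail once automatic statistics collection
    -- kicks in, because the internal statistics SELECT projects each of the 833
    -- columns twice (2 x 833 = 1666 > 1664):
    INSERT INTO wide_tab (c1) VALUES (1);
    -- ERROR:  target lists can have at most 1664 entries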










Table Statistics Collection 

Greenplum provides a feature for analyzing tables and collecting statistics, which are used to generate cost-optimised query plans. The ANALYZE statement can be used by an end user to trigger table analysis manually. Greenplum also has automatic statistics collection, which ANALYZEs a table and collects statistics whenever rows are inserted into it. The gp_autostat_mode configuration parameter decides whether automatic statistics collection is triggered or not.
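As a quick illustration (assuming a psql session connected to the Greenplum master), the mode can be inspected and changed at session level:

    -- Show the automatic statistics collection mode currently in effect:
    SHOW gp_autostat_mode;
    -- Documented values are none, on_change and on_no_stats (the default).
    -- For example, switch automatic collection off for the current session only:
    SET gp_autostat_mode = 'none';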

While collecting table statistics, the analyzer executes internal SELECT statements to estimate the size of the stored data; these SELECTs project all of the table's columns. One of the statistics-collection SELECTs projects each column twice (after applying certain functions to it).

The following is an example of the statistics-gathering SELECT under discussion, for a table named tab1 with a single column id1.
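The original post showed the actual statement here; as an illustration of its shape only (this is not the verbatim internal Greenplum query), it looks roughly like the following:

    -- Illustrative shape only: each column (here just id1) appears twice in the
    -- target list, once wrapped in a function such as pg_column_size(), so a table
    -- with N columns yields a target list of 2*N entries.
    SELECT pg_column_size(id1), id1
    FROM   tab1;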


Given Greenplum's limit of 1664 entries in the projection list, it is therefore expected to hit "ERROR: target lists can have at most 1664 entries" while inserting rows into a table with more than 832 columns (2 × 833 = 1666 > 1664).

References 

  1. https://github.com/greenplum-db/gpdb/issues/1345
  2. https://gpdb.docs.pivotal.io/550/admin_guide/topics/g-automatic-statistics-collection-parameters.html
  3. https://gpdb.docs.pivotal.io/550/ref_guide/sql_commands/ANALYZE.html

Saturday 25 March 2017

Integrating Lucidworks Fusion with SolrCloud

Lucidworks Fusion provides a frontend UI and interfaces to implement search applications easily.
Fusion can be set up in standalone mode or in cluster mode.
This article explains how Fusion can be installed in cluster mode with Hortonworks HDP components, including Ambari Infra and ZooKeeper.

Cluster nodes

Here we will have the following 6 nodes and components.


Node 0 : Ambari Server
Node 1 : Ambari Agent, NameNode, DataNode, ZooKeeper, Ambari Infra, ZooKeeper client, HDFS client
Node 2 : Ambari Agent, Secondary NameNode, DataNode, ZooKeeper, Ambari Infra, ZooKeeper client, HDFS client
Node 3 : Ambari Agent, DataNode, ZooKeeper, Ambari Infra, ZooKeeper client, HDFS client
Node 4 : Ambari Agent, DataNode, ZooKeeper client, HDFS client, Lucidworks Fusion
Node 5 : Ambari Agent, DataNode, ZooKeeper client, HDFS client, Lucidworks Fusion


This article uses HDP 2.5.0.0, Ambari 2.4.3 and Fusion 3.0.0.
The FQDNs of the nodes are as follows.

Steps

  •  Create a 6-node cluster as given above. You may also use fewer nodes, with multiple services co-located on one node.
  • Install Ambari Server on Node 0
    • The yum repository file for CentOS 6 can be downloaded from http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.4.0.1/ambari.repo
    • Install Ambari Server using the yum command (example commands are shown after this group of steps)
  • From the Ambari UI, add the following services on the respective nodes
    • Add the HDFS service
    • Add the ZooKeeper service
    • Add the Ambari Infra service
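    As an illustration of the Ambari Server installation step above (the standard Hortonworks procedure for CentOS 6; adjust the repository version to match your cluster):

        # Download the Ambari repo file, then install, set up and start Ambari Server on Node 0.
        wget -nv http://public-repo-1.hortonworks.com/ambari/centos6/2.x/updates/2.4.0.1/ambari.repo \
             -O /etc/yum.repos.d/ambari.repo
        yum install -y ambari-server
        ambari-server setup      # accept the defaults or supply your own JDK/database options
        ambari-server start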

  • Make sure all services are running (you may run the service checks)
  • Decompress Fusion into the /opt directory on Node 4 and Node 5
  • On each of Node 4 and Node 5, update /opt/fusion/3.0.0/conf/fusion.properties (a sketch of the relevant changes follows below)


      • This is because, in cluster mode, we want to use the Solr and ZooKeeper from the HDP installation instead of Fusion's embedded components
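        A sketch of the kind of changes involved (the property names follow Fusion 3.x fusion.properties conventions and the ZooKeeper quorum shown is a placeholder; verify both against the file shipped with your installation and against your cluster's hostnames):

            # Point Fusion at the HDP ZooKeeper quorum instead of its embedded ZooKeeper
            # (the hostnames below are placeholders for the Node 1-3 FQDNs):
            default.zk.connect = node1.example.com:2181,node2.example.com:2181,node3.example.com:2181

            # Remove "zookeeper" and "solr" from the group.default service list so the
            # embedded components are not started:
            # group.default = <shipped list, minus zookeeper and solr>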
          

         
       
  • Start Fusion on nodes n4 and n5, for example:
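    Assuming the default Fusion 3.0.0 layout used earlier in this article (the exact command from the original post is not reproduced here):

        # Start all Fusion services configured for this node.
        cd /opt/fusion/3.0.0
        bin/fusion start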