We did a few POCs on GridGain using Java to fetch data from Hive. All was well until we hit Hive tables with a huge amount of data (>20M records, 100+ columns).
The issue arose because our code reads each individual cell value to build a tabular structure in GridGain, so that we can run SQL queries against GridGain.
We use BinaryObject to stream the data into Ignite as follows:
Java ResultSet over the Hive data -> ResultSetMetaData mapped to a QueryEntity
Each cell value is stored in a BinaryObject, with its data type taken from the prior step -> BinaryObjectBuilder.setField(columnName, rs.getObject(columnName))
This step obviously takes very long with the data-set size mentioned above.
We do it so we can preserve the exact tabular structure of the Hive table, including data types, so our applications don't have to change their SQL.
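A minimal sketch of the flow above, assuming an already-started Ignite node; the synthetic Long key and the table-derived binary type name are our own choices, not part of the original code:

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.binary.BinaryObject;
import org.apache.ignite.binary.BinaryObjectBuilder;
import org.apache.ignite.cache.QueryEntity;
import org.apache.ignite.configuration.CacheConfiguration;

public class HiveToIgnite {
    /** Streams a Hive ResultSet into an Ignite cache keyed by a synthetic Long. */
    static void stream(Ignite ignite, ResultSet rs, String table) throws SQLException {
        ResultSetMetaData md = rs.getMetaData();

        // 1. Build a QueryEntity from the Hive metadata so SQL stays identical.
        QueryEntity qe = new QueryEntity("java.lang.Long", table);
        qe.setTableName(table);
        for (int i = 1; i <= md.getColumnCount(); i++)
            qe.addQueryField(md.getColumnLabel(i), md.getColumnClassName(i), null);

        CacheConfiguration<Long, BinaryObject> ccfg = new CacheConfiguration<>(table);
        ccfg.setQueryEntities(Collections.singletonList(qe));
        ignite.getOrCreateCache(ccfg);

        // 2. Stream every row as a BinaryObject -- one setField call per cell,
        //    which is the per-cell cost described above.
        try (IgniteDataStreamer<Long, BinaryObject> streamer = ignite.dataStreamer(table)) {
            streamer.keepBinary(true);
            long key = 0;
            while (rs.next()) {
                BinaryObjectBuilder b = ignite.binary().builder(table);
                for (int i = 1; i <= md.getColumnCount(); i++)
                    b.setField(md.getColumnLabel(i), rs.getObject(i));
                streamer.addData(key++, b.build());
            }
        }
    }
}
```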
Our questions are:
We are not maintaining any Java POJOs, since there are many tables and we want to keep the code generic (hence BinaryObject).
We will first fetch the schema of the file -> form the DDL as above -> stream the data as above.
The only win will be threading multiple Parquet part files to stream into the same IgniteDataStreamer object.
Is this a reasonable approach to go ahead with?
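For the part-file threading mentioned above, IgniteDataStreamer is documented as safe for concurrent addData() calls, so one shared streamer can be fed from one thread per Parquet part file. A sketch under that assumption (readPartFile is a hypothetical per-file reader, not a real API):

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.binary.BinaryObject;

public class ParallelParquetLoad {
    /** Feeds one shared streamer from one thread per Parquet part file. */
    static void load(Ignite ignite, String cache, List<String> partFiles)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(partFiles.size());
        try (IgniteDataStreamer<Long, BinaryObject> streamer = ignite.dataStreamer(cache)) {
            streamer.keepBinary(true);
            for (String part : partFiles)
                pool.submit(() -> readPartFile(ignite, streamer, part));
            pool.shutdown();
            // Wait for all part files before the streamer closes and flushes.
            pool.awaitTermination(1, TimeUnit.HOURS);
        }
    }

    /** Hypothetical: read rows from `path` with a Parquet reader and call
     *  streamer.addData(key, binaryObject) for each row. */
    static void readPartFile(Ignite ignite,
                             IgniteDataStreamer<Long, BinaryObject> streamer,
                             String path) {
        // Parquet reading elided -- see the INT96 discussion below.
    }
}
```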
We tried using AvroParquetReader, but it does not support Timestamp/INT96 values from Parquet files, which is a requirement for us.
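Parquet stores these legacy timestamps as 12-byte INT96 values: 8 bytes of nanoseconds-of-day (little-endian) followed by a 4-byte Julian day number. If a lower-level Parquet reader hands back the raw bytes, the conversion itself is small; a self-contained sketch (class and method names are ours):

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.sql.Timestamp;

public class Int96Converter {
    // Julian day number of the Unix epoch, 1970-01-01.
    private static final long JULIAN_EPOCH_DAY = 2_440_588L;
    private static final long MILLIS_PER_DAY = 86_400_000L;

    /** Converts a 12-byte Parquet INT96 timestamp to java.sql.Timestamp. */
    public static Timestamp toTimestamp(byte[] int96) {
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong();              // first 8 bytes
        long julianDay = buf.getInt() & 0xFFFFFFFFL;  // last 4 bytes

        long millis = (julianDay - JULIAN_EPOCH_DAY) * MILLIS_PER_DAY
                    + nanosOfDay / 1_000_000L;
        Timestamp ts = new Timestamp(millis);
        ts.setNanos((int) (nanosOfDay % 1_000_000_000L)); // keep full nano precision
        return ts;
    }
}
```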