Community Edition Troubleshooting


Loading ResultSet to IgniteDataStreamer via BinaryObject

  • 1.  Loading ResultSet to IgniteDataStreamer via BinaryObject

     
    Posted 27 days ago

    We did a few POCs on GridGain, using Java to fetch data from Hive. All was well until we hit Hive tables with a huge amount of data (>20 million records, 100+ columns).

     

    The issue cropped up because our code reads the data cell by cell to build a tabular structure in GridGain, so that we can run SQL queries against GridGain.

    We use BinaryObject to stream data into Ignite as follows:

     

    • Forming the DDL to create the table structure in GridGain

    From the Java ResultSet over the Hive data: ResultSetMetaData -> QueryEntity

     

    • Once the DDL is ready, we iterate over the ResultSet rows and, within each row, iterate over the columns to get the cell-level data

    Each cell value is stored in a BinaryObject along with its data type from the prior step -> BinaryObjectBuilder.setField(DataType, rs.getLong(columnName));

    This step obviously takes very long with the data set size I mentioned above.

     

    • Lastly, we add the built BinaryObject to the IgniteDataStreamer, one per ResultSet row
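
    The row-loading loop described in the steps above can be sketched roughly as follows. This is a minimal, JDK-only illustration and not the poster's actual code: `RowCopier`, `columnLabels`, `copyRows` and the `sink` callback are hypothetical names. In a real loader the sink would presumably call `BinaryObjectBuilder.setField(name, value)` for each column and `IgniteDataStreamer.addData(key, binaryObject)` for each row. The per-column loop still exists, but the metadata is resolved once per ResultSet and cells are read by position. Whether `getObject` returns suitable Java types depends on the Hive JDBC driver, which is an assumption here.

```java
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.util.function.BiConsumer;

// Hypothetical helper for illustration; not part of Ignite/GridGain or JDBC.
final class RowCopier {

    // Resolve the column labels once per ResultSet instead of once per cell.
    static String[] columnLabels(ResultSetMetaData meta) throws SQLException {
        String[] labels = new String[meta.getColumnCount()];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = meta.getColumnLabel(i + 1); // JDBC columns are 1-based
        }
        return labels;
    }

    // Reads every row by column position and hands (labels, values) to the sink.
    // Assumption: the sink builds one BinaryObject per row and adds it to the streamer.
    static long copyRows(ResultSet rs, BiConsumer<String[], Object[]> sink) throws SQLException {
        String[] labels = columnLabels(rs.getMetaData());
        long rows = 0;
        while (rs.next()) {
            Object[] values = new Object[labels.length];
            for (int i = 0; i < labels.length; i++) {
                values[i] = rs.getObject(i + 1); // assumes the driver maps SQL types sensibly
            }
            sink.accept(labels, values);
            rows++;
        }
        return rows;
    }
}
```

    The returned row count is only a convenience for logging or sanity checks.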

     

    We do this to preserve the exact tabular structure of each Hive table, including data types, so that our apps don't have to change their SQL.

     

     

    Our questions are:

    1. Is there a better way to tackle this and still maintain a replica of the Hive tables without nested iteration?

    We are not maintaining any Java POJOs, since there are many tables and we want to keep the code generic (BinaryObject).

     

    2. If we try to pull in the underlying Parquet files instead of running Hive queries, we will run into the same issue, because:

    We will first fetch the schema of the file -> form the DDL as above -> stream the data as above.

    The only win will be reading multiple Parquet part files in parallel threads and streaming them into the same IgniteDataStreamer object.

    Is this an approach worth going ahead with?

     

    3. Are there any examples of streaming Parquet files directly into GridGain while maintaining a tabular structure with correct data types?

    We tried using AvroParquetReader, but it does not support Timestamp/INT96 columns from Parquet files, which we require.
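
    For reference, an INT96 timestamp in Parquet is a 12-byte little-endian value: 8 bytes of nanoseconds within the day followed by 4 bytes of Julian day number. A minimal decoding sketch is below, assuming the raw 12 bytes have already been obtained from some lower-level Parquet reader (how to get them is not shown). `Int96Timestamps` and `toInstant` are hypothetical names, and treating the value as UTC is an assumption that depends on how the files were written.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.time.Instant;

// Hypothetical helper for illustration; not part of Parquet or Ignite/GridGain.
final class Int96Timestamps {
    private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L; // Julian day number of 1970-01-01
    private static final long SECONDS_PER_DAY = 86_400L;

    // Decodes a raw 12-byte INT96 value into an Instant, interpreting it as UTC (assumption).
    static Instant toInstant(byte[] int96) {
        if (int96.length != 12) {
            throw new IllegalArgumentException("INT96 value must be 12 bytes, got " + int96.length);
        }
        ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
        long nanosOfDay = buf.getLong(); // first 8 bytes: nanoseconds since midnight
        long julianDay = buf.getInt();   // last 4 bytes: Julian day number
        long epochSeconds = (julianDay - JULIAN_DAY_OF_EPOCH) * SECONDS_PER_DAY;
        return Instant.ofEpochSecond(epochSeconds, nanosOfDay);
    }
}
```

    The resulting Instant could then be converted to `java.sql.Timestamp` with `Timestamp.from(instant)` before being set on a BinaryObject field.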

     

     

    We are close to including GridGain in our strategic suite once we resolve this issue.

    ------------------------------
    Rahul Gaba
    Software developer
    Barclays
    ------------------------------


  • 2.  RE: Loading ResultSet to IgniteDataStreamer via BinaryObject

    Posted 17 days ago
    Hello!

    1. Where do you spend most of the processing time? Can you explain the nested iteration in any more detail? If you have just two fields you can try using IgniteBiTuple instead of BinaryObjectBuilder.

    2. Yes, you can use the same IgniteDataStreamer from multiple threads as long as it targets the same cache & you flush() it as needed.
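
    The multi-threaded pattern from point 2 might look roughly like the sketch below. It is JDK-only: `ParallelLoad`, `loadAll` and `sharedSink` are hypothetical names, and the `Consumer` stands in for the single shared IgniteDataStreamer (for example, a lambda that calls `addData` on it). Each inner list stands in for the rows of one Parquet part file. After `loadAll` returns, the caller would be expected to `flush()` or `close()` the real streamer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical helper for illustration; not part of Ignite/GridGain.
final class ParallelLoad {

    // Runs one task per part; all tasks push rows into the same thread-safe sink.
    static <T> long loadAll(List<List<T>> parts, Consumer<T> sharedSink, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Long>> pending = new ArrayList<>();
            for (List<T> part : parts) {
                Callable<Long> task = () -> {
                    long count = 0;
                    for (T row : part) {
                        sharedSink.accept(row); // assumption: the sink is safe to call concurrently
                        count++;
                    }
                    return count;
                };
                pending.add(pool.submit(task));
            }
            long total = 0;
            for (Future<Long> f : pending) {
                total += f.get(); // waits for the task and rethrows any failure
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```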

    Regards,



    ------------------------------
    Ilya Kasnacheev
    Community Support Specialist
    GridGain
    ------------------------------