Tips & Tricks for Optimizing Oracle Endeca Data Ingestion

More than once I’ve been on a client site to deal with a data build that was either taking too long or was no longer completing successfully. The handraulic analysis to figure out what is causing the issues can take a long time. The rewards, however, are tremendous. It’s not simply a matter of fixing a build that was failing: in some cases cutting the build time in half meant a job could be run overnight rather than scheduled for weekends. In other cases, verifying with the business users which attributes are loaded and how they interact with them can make their lives easier.

Below is a collection of tips I would suggest for anyone trying to speed up a data build in Integrator. I can’t swear these will work for you; every site is different. They are simply based on some experiences I’ve had, so if they help you then great.

– Reduce the number of rows and columns. As a general rule less will always be faster, so if you can filter out garbage records or redundant columns, do. Depending on your version of OES the indexer will work in batch sizes of up to 150 or 180 MB. My assumption is that the more records fit into each batch, the faster the overall ingestion process will be. Calculating average record size may not be easy, particularly with multi-assign attributes and ragged-width records. However, you can monitor the Rec/s and KB/s figures that Integrator reports, which can at least help you measure when you’re processing more records.
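To get a feel for the batch arithmetic, here is a rough sketch in Python. It measures the raw text size of each record in a small sample and estimates how many such records would fit in a 150 MB batch. The function names and the sample data are made up for illustration, and the raw text size is only a stand-in: the size of a record inside the engine is an assumption I can’t verify from the outside.

```python
import csv
import io


def average_record_bytes(rows):
    """Return the mean UTF-8 size of the records, summing every value."""
    sizes = [
        sum(len(str(value).encode("utf-8")) for value in row.values())
        for row in rows
    ]
    return sum(sizes) / len(sizes) if sizes else 0.0


def records_per_batch(avg_bytes, batch_mb=150):
    """Rough count of records that fit in one batch of batch_mb megabytes."""
    if avg_bytes <= 0:
        return 0
    return int(batch_mb * 1024 * 1024 // avg_bytes)


# Hypothetical sample extract; in practice read a slice of your source file.
sample = io.StringIO(
    "id,name,notes\n"
    "1,Ann,short\n"
    "2,Bob,a much longer free-text note\n"
)
rows = list(csv.DictReader(sample))
avg = average_record_bytes(rows)
print(f"average record size: {avg:.1f} bytes")
print(f"records per 150 MB batch: {records_per_batch(avg)}")
```

Rerunning this after dropping a redundant column gives a quick before/after comparison, which you can then check against the Rec/s figure Integrator reports.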

– Defrag the data drives. Although the physical mechanics of how the data store file is managed are not obvious, the capability of the engine to read and write to/from disk contiguously appears to be significant.

– Specify a higher number of threads. Although in Windows the dgraph process may show any number, officially the default is 2. The standard recommendation is to specify as many threads as you have CPU cores. My suggestion is to experiment with different numbers until you find an optimal point. When you create or attach the data store specify the “–vars –threads X” parameter (X = the number you want). Full disclosure: this is a stock recommendation and I’ve not tested whether it affects data ingestion and OES queries equally.

– Check whether a newer version of OES could be installed instead. On one client site running 2.3 we installed OES 7.4; it was compatible and the performance improvements were notable. As mentioned above, how batch sizes are defined has changed. With early versions of OES the batch size was always set to 150 MB; with later versions it is defined dynamically and can scale up to 180 MB. This should benefit not only small data ingestions but large ones.

– Refine the data types. Strings will almost always consume the most disk space. The larger the footprint of a record, the longer it tends to take to read or write. And for analytical purposes strings are mostly qualitative fields. You need numbers for quantitative analysis and dates for trending.

– Check the 64-bit version of OES was installed. This is obviously dependent on your hardware but 32-bit servers are becoming hard to find and the 32-bit version of the software might have been installed by accident.

– Verify that the bottleneck in the ingestion actually is the indexer. You can generally see this in Integrator when the console log only shows time ticking and all the components are done except for the Bulk Add/Replace component. There is very little other feedback on what the indexer is doing, but you can monitor the files in the generations folder to see the activity the indexer is producing. If the bottleneck isn’t the indexer then don’t waste time putting out the wrong fire!
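Watching the generations folder by hand gets tedious, so a small script can take snapshots for you. The sketch below, in Python, counts the files and total bytes under a folder and reports the change between two snapshots. The function names are my own, and the demo runs against a temporary folder; you would point it at the generations folder of your own data store, whose location I’m not assuming here.

```python
import os
import tempfile


def folder_snapshot(path):
    """Return (file_count, total_bytes) for every file under path."""
    count, total = 0, 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                total += os.path.getsize(os.path.join(root, name))
                count += 1
            except OSError:
                # A file can disappear between listing and sizing it.
                pass
    return count, total


def growth(before, after):
    """Return (files_added, bytes_added) between two snapshots."""
    return after[0] - before[0], after[1] - before[1]


# Demo against a temporary folder standing in for the real one.
with tempfile.TemporaryDirectory() as demo_dir:
    before = folder_snapshot(demo_dir)
    with open(os.path.join(demo_dir, "segment.dat"), "wb") as handle:
        handle.write(b"\0" * 2048)
    after = folder_snapshot(demo_dir)
    print(growth(before, after))
```

Taking a snapshot every minute or so during a build shows whether the indexer is still writing or has stalled.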

– Check Resource Monitor (Windows) and filter to the dgraph and javaw processes to track the amount of CPU, RAM and other resources being consumed. RAM will probably be your highest consistent resource and generally the easiest area to upgrade. In particular note the “Hard Faults/sec” figure. However, if you see steady demand on Disk, CPU or Network instead, you may have something else worth upgrading.

– Terminate or disable competing processes and services. Running multiple processes will chip away at the available resources, and the more you can make available to your dgraph process the better. On almost every server I’ve looked at I’ve found services that weren’t necessary just sitting idle and locking resources away.

– Use RAID 10 or RAID 0 (best balance for reads/writes) or a SAN. If you can align your ingestion process to read from one drive and write to another you may greatly speed things up and avoid having the heads seek back and forth. Optimizing disk is a bigger challenge to apply and, more importantly, to measure accurately, but don’t neglect it since the impact is significant.

– If your data volume is particularly high, moving your system paging file to a separate drive could also provide some benefits. Whenever your data store is larger than the available RAM there tends to be quite a bit of disk I/O.

– Run your indexes during off hours. During business hours the server will usually be dealing with user queries and juggling resources. Try to run your large data volume processes when users aren’t around.

– Run build steps in parallel. Sequential processes almost always mean there will be periods when RAM, CPU or Disk are idle. Though processes running in parallel will individually take longer to complete, you’re still likely to complete more of them in a shorter time frame. Generally they’ll queue up waiting for resources and you’ll maximize the utilization of RAM, CPU and Disk.
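If your build steps can be launched from the command line, a small wrapper is enough to run independent steps side by side. Here is a minimal sketch in Python using a thread pool to start each step as a subprocess and collect the exit codes. The step names and the placeholder commands are hypothetical; substitute whatever command launches each step at your site, and only parallelize steps that don’t depend on each other.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import subprocess
import sys


def run_step(name, command):
    """Run one build step as a subprocess and return (name, exit_code)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return name, result.returncode


def run_parallel(steps, max_workers=2):
    """Run the steps concurrently and return {name: exit_code}."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            pool.submit(run_step, name, command)
            for name, command in steps.items()
        ]
        for future in as_completed(futures):
            name, code = future.result()
            results[name] = code
    return results


# Placeholder commands: each one just starts Python and prints a message.
steps = {
    "load_customers": [sys.executable, "-c", "print('customers done')"],
    "load_orders": [sys.executable, "-c", "print('orders done')"],
}
print(run_parallel(steps))
```

The max_workers value caps how many steps compete for RAM, CPU and Disk at once, so it is worth experimenting with rather than simply setting it as high as possible.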

– Review your attribute properties and the defaults. Ingesting data without any data modeling is easy, but if all your fields are Text Searchable your index creation will need to support that. Minimizing the number of searchable attributes can significantly reduce the size of the data store. I’ve seen this in practice translate to half the disk footprint and the indexing time reduced by more than half.

– Review the cardinality of data values. In some cases attributes may be enabled for search, but the actual range of distinct values is so low that a search is almost meaningless. A search should provide an effective record filter; if the results are still going to number in the many millions, maybe that value isn’t useful for searching on at all. Don’t forget you can always use the Available Refinements to apply that kind of filtering.
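A quick way to spot those attributes is to count distinct values per column in a sample of the source data before it ever reaches Integrator. The Python sketch below does that and lists the columns at or below a threshold. The column names, the sample rows and the threshold are illustrative assumptions, not values from any real data set.

```python
import csv
import io
from collections import defaultdict


def column_cardinality(rows):
    """Map each column name to its number of distinct non-empty values."""
    distinct = defaultdict(set)
    for row in rows:
        for column, value in row.items():
            values = distinct[column]  # registers the column even if empty
            if value not in (None, ""):
                values.add(value)
    return {column: len(values) for column, values in distinct.items()}


def low_cardinality_columns(rows, max_distinct=10):
    """Return the sorted columns with at most max_distinct distinct values."""
    counts = column_cardinality(rows)
    return sorted(c for c, n in counts.items() if n <= max_distinct)


# Hypothetical sample extract.
sample = io.StringIO(
    "id,status,country\n"
    "1,open,CA\n"
    "2,closed,CA\n"
    "3,open,US\n"
    "4,open,CA\n"
)
rows = list(csv.DictReader(sample))
print(column_cardinality(rows))
print(low_cardinality_columns(rows, max_distinct=2))
```

Columns that show up in the low-cardinality list are candidates for refinement-only use rather than text search.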

– Look for duplicate or combo fields. More than once I’ve found data sets where more than one attribute held exactly the same value, particularly when records were being merged from disparate data sources. Loading both is a great way to check consistency between the source systems, but if that hasn’t been a problem then duplicating values may be offering zero return. The same goes for fields that repeat the same information. Think of First_Name, Last_Name and Full_Name. There may be some business reason to format them for display purposes, but generally I’d keep the granular values and drop the combo version. You can always concatenate values through a view if you need to; meanwhile you’ve cut in half the memory those 3 fields consumed.
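Checking for identical attributes by eye does not scale past a handful of columns. As a rough sketch, the Python below compares every pair of columns in a sample and reports the pairs whose values match on every row. The column names are invented for the example, and a match on a sample only suggests a duplicate; confirm against the full data before dropping anything.

```python
from itertools import combinations


def duplicate_column_pairs(rows):
    """Return pairs of columns whose values are equal on every row."""
    if not rows:
        return []
    columns = list(rows[0].keys())
    return [
        (left, right)
        for left, right in combinations(columns, 2)
        if all(row[left] == row[right] for row in rows)
    ]


# Hypothetical merged records from two source systems.
rows = [
    {"cust_id": "1", "customer_no": "1", "name": "Ann"},
    {"cust_id": "2", "customer_no": "2", "name": "Bob"},
]
print(duplicate_column_pairs(rows))
```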

– Avoid updates by sticking to batch inserts. Multi-Assign attributes may require special attention, and the Bulk Add/Replace will certainly replace records. In theory if you can avoid updates you can avoid the overhead of the index having to deal with noncontiguous inserts. Note I haven’t definitively verified this would make a significant difference, so this is just a suggestion.

– Use the Bulk Add/Replace Records component. This one seems obvious but know that the other components will interact through the exposed/slow OES Web Services and the Bulk Add/Replace Records component is definitely faster.

– Verify your RecordSpec is appropriately unique. This is also a question of data integrity, but in some cases I’ve seen RecordSpecs defined on a number of concatenated fields that included metric values. Not only was the field much larger than it needed to be, but it also wasn’t reliably unique. If you aren’t going to be updating records the RecordSpec may be better defined on a smaller field guaranteed to be unique e.g. NEWID().
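Before settling on a RecordSpec it is worth checking the candidate field for duplicates in the source data. Here is a minimal Python sketch that counts how often each key value occurs and returns the ones that appear more than once. The field name order_id and the sample rows are hypothetical.

```python
from collections import Counter


def duplicate_keys(rows, key_column):
    """Return {key_value: occurrences} for keys that appear more than once."""
    counts = Counter(row[key_column] for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


# Hypothetical sample where one key value is repeated.
rows = [
    {"order_id": "A1", "amount": "10.00"},
    {"order_id": "A2", "amount": "12.50"},
    {"order_id": "A1", "amount": "10.00"},
]
print(duplicate_keys(rows, "order_id"))
```

An empty result on the full extract is what you want to see; anything else means a Bulk Add/Replace load would overwrite records that share the key.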

– Another suggestion is to presort your ingestion data by your record spec and submit as a batch. I tested this and have to admit I did not find any improvement, but with traditional databases presorted records tend to be processed faster. Try it out and leave a comment if you found this did or didn’t help.

– Review your managed attributes and dimensional hierarchies. If they’re not in use or don’t serve a purpose, remove them. I’ve seen some that hadn’t been removed simply because they had been so complex to define and add in the first place. They weren’t being used, but they were complicating the indexing process.

Hopefully these tips can help you out. Rest assured you can use Oracle Endeca to ingest, index, and search many millions of records with scores of attributes in OEID (I try to draw a line between 200 – 300). But some discretion, planning and review is always advisable.
