Free Effort Estimation Tool

Time and again I find myself on projects where the effort needed to deliver on the requirements is far removed from the project budget. Tasks are added to the scope without consideration of the real effort they will require. Project overhead related to testing, deploying, clarifying requirements, and simply managing the tasks is forgotten entirely.

There are tools you can invest in to ensure projects are managed properly, and an experienced Project Manager will ensure no details are missed. A lot of the time, however, you need something quicker and easier to use. I built this little spreadsheet for those cases and share it here in the hope you find it equally helpful.


Feature List

The first tab [Features] is your starting point. List all your high-level user features here. If you have a work breakdown you can enter the individual tasks instead; the level of detail is restricted only by the information you have on hand and your preference. I tend to record use cases or user stories from the business perspective, since at the stage I’m using this tool I’m typically scoping out the project. The columns for Domain, Description, and Dependencies are all optional; adjust them as you like. Often I’ll continue tracking the project tasks in this spreadsheet and will rename those columns to track Assignee & Status.

Effort Estimates

Estimating the effort for each task is a guessing game. If it’s feasible to engage a larger group in a Planning Poker session I’ll do that. It’s not only a great way to get a more complete picture of the effort for some tasks, but also an excellent opportunity to engage the development team in the project. Whatever your process, it is essential to capture both the Optimistic and the Pessimistic estimates. If the effort (in hours) is an exact number, enter the same value in both columns. The calculated Most Likely value in this tool is based on an adjustable percentage which at its default (70%) skews slightly towards the Pessimistic estimate.
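Outside the spreadsheet, that Most Likely calculation reduces to a simple interpolation between the two estimates. A minimal Python sketch (the function name and example figures are mine, not from the workbook):

```python
def most_likely(optimistic, pessimistic, skew=0.70):
    """Interpolate between the two estimates; skew > 0.5 leans Pessimistic."""
    return optimistic + skew * (pessimistic - optimistic)

# A 10-30 hour estimate range lands at 24 hours with the default 70% skew.
print(most_likely(10, 30))  # 24.0
```

Setting skew to 0.5 gives you the plain midpoint if you'd rather not lean pessimistic.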

I’ve also included a couple of calculated columns which apply the PERT formula ((O + 4M + P) / 6) and the standard deviation ((P – O) / 6) of the estimate range. The PERT formula is a familiar approach which normally expects the Most Likely value to have been identified manually; if you do identify your own Most Likely value, the PERT column will factor those estimate ranges in for you. The standard deviation isn’t formatted, but the greater the number, the greater the uncertainty in the supplied estimate.
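In code form, those two columns look like this (a Python sketch; the function names are mine):

```python
def pert(optimistic, most_likely, pessimistic):
    """Classic PERT weighted average: (O + 4M + P) / 6."""
    return (optimistic + 4 * most_likely + pessimistic) / 6

def std_dev(optimistic, pessimistic):
    """Estimate uncertainty: (P - O) / 6. Bigger means less certain."""
    return (pessimistic - optimistic) / 6

print(pert(10, 20, 40))   # ~21.67
print(std_dev(10, 40))    # 5.0
```

A wide Optimistic-to-Pessimistic spread drives the standard deviation up, which is exactly the signal that an estimate deserves another look.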


Identify the priority for each task. I find the MoSCoW method works well, and the spreadsheet will break down all your effort and costs against those four priorities (Must Have, Should Have, Could Have and Would Like). Generally I’ll plan a project around delivering the Must & Should Haves. Everything else is either “time allowing” or simply a parking lot for a future project.


Burndown charts are often based on the number of features (or story points) completed in an iteration. I’ve simplified this to calculate velocity based on the actual hours recorded against completed tasks. As the project starts to unfold this can be a very helpful gauge for whether you’ll finish ahead of or behind schedule. It can also provide a quick ‘n dirty average when sizing any new scope-creep items under discussion. Note that the latter needs to be taken with a pretty massive grain of salt; in my experience, though, within each client project there tends to be some consistency in the form and scale of features and units of work (but quote that average with caution).
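That quick ‘n dirty average works out along these lines (a Python sketch with hypothetical figures; the spreadsheet does the equivalent for you):

```python
# Actual hours recorded against completed tasks (hypothetical figures).
completed_hours = [12, 8, 20, 15]

# Average actual hours per completed task: the quick 'n dirty velocity gauge.
avg_per_task = sum(completed_hours) / len(completed_hours)  # 13.75

# Rough sizing for ten remaining (or newly proposed) tasks.
forecast = 10 * avg_per_task  # 137.5 hours
```

As the caveat above says, quote that forecast with caution; it assumes new work resembles completed work in form and scale.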



Refined Hours

The second tab [Metrics] provides a summary of how all the hours shake out for the features under each priority, with a few automatic refinements. The numbers are all based on the Total Most Likely values.

Rows 11 through 14 factor in the overhead for ongoing Business Analysis, Project Management, Testing & Bug Fixing, and Training & Deployment. These percentages are also adjustable; I default them to 5%, 20%, 30% and 5% respectively. You could optionally capture that effort as individual tasks, as we did with features, but I like to use a percentage to keep it relative to the overall effort.
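Applied to a base of development hours, those default overhead percentages work out as follows (a Python sketch; the 400-hour base is a hypothetical figure of mine):

```python
base_hours = 400  # total Most Likely development hours (hypothetical)

# Default overhead percentages from rows 11-14; all adjustable.
overhead = {
    "Business Analysis": 0.05,
    "Project Management": 0.20,
    "Testing & Bug Fixing": 0.30,
    "Training & Deployment": 0.05,
}

# 60% combined overhead on top of the base effort.
total_hours = round(base_hours * (1 + sum(overhead.values())))
print(total_hours)  # 640
```

Keeping the overhead proportional means the estimate scales automatically as the feature list grows or shrinks.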

The impact of staffing is intended to capture the increasing overhead as more people are added to a project. Although more people will help to split the work, they will also increase the communication and coordination the project requires.

Rows 19 & 20 provide a sense of the project duration in days and weeks. Instead of Perfect Engineering Days and 5-day weeks, I like to plan around my team having 6 productive hours a day and, allowing for statutory holidays, vacation and sick days, about 4 working days per week. You can adjust all of those to suit your comfort level, but I like to make sure I’m producing an estimate for my client that I can rely on with some confidence.
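Those duration rows reduce to two divisions (a Python sketch; function name mine, defaults taken from the text above):

```python
def calendar_duration(total_hours, hours_per_day=6, days_per_week=4):
    """Convert effort hours into calendar days and weeks,
    using productive hours per day and working days per week."""
    days = total_hours / hours_per_day
    weeks = days / days_per_week
    return days, weeks

days, weeks = calendar_duration(480)
print(days, weeks)  # 80.0 20.0
```

Compare that with the 8-hour, 5-day "perfect" assumption (60 days, 12 weeks for the same 480 hours) and you can see how much padding the conservative defaults buy you.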




The third tab [Costs] provides a breakdown of the costs for the various project team members. The proportions for testing, business analysis, project management, and deployment are tied to the hours from rows 11 through 14 on the [Metrics] tab. Their division into separate roles is defined in the [Refs] tab.

The actual role titles and their hours are adjustable; I explain how the percentages of effort are split later on. I recognize I’m not capturing the full breadth of roles and contributions you might need on your project team. UX designers, for example, I might merge here with the Business Analysts or Developers. However, this tool is meant to be quick ‘n easy. You can repurpose any roles as you see fit, or extend the worksheet to introduce new roles if that works better for you.



The fourth and final tab [Refs] presents all the adjustable variables applied in the previous three tabs. When planning a project you will need to adjust these to reflect your situation, team, and client.

  • Most Likely Variance – As described before I generally skew the average between Optimistic and Pessimistic closer to the Pessimistic side. This tends to be more accurate.
  • Project Team Size – This value is intended to be adjusted to assess the impact of increasing or decreasing the number of people involved. The larger the team the greater the effort, but the lower the calendar duration.
  • Staffing Increment – This value indicates the boundary points where I’ll apply an incremental impact to effort as the team size increases.
  • Staffing Increment Impact – The impact to overall effort whenever the project team size grows by the defined increment. This is a generalization, however you can adjust it as you see fit.
  • Dev Hours per Day – The number of actual development hours that I think I can reasonably expect on average for each calendar day.
  • Dev Days per Week – The number of days I think I can safely plan for per week. You can adjust this as you like, but if you’re committing to a calendar estimate to the client I think it’s better to err on the side of caution rather than have to defend a missed deadline.
  • Business Analysis – The amount of additional ongoing requirements gathering that will take place during the development.
  • Project Mgmt Cost – The amount of time necessary to manage the project, track time, manage risks, prepare invoices, etc.
  • Testing – The percentage of overhead to cover unit testing, quality assurance and bug fixing.
  • Prod Deployment – The percentage of time related to deploying the solution to the production environment and to the client to use.
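Taken together, the staffing variables imply a calculation along these lines. This is my reading of it, not the spreadsheet's exact formula: whether the impact compounds per increment is an assumption, and the increment of 3 and 10% impact are illustrative defaults of mine.

```python
def staffed_effort(base_hours, team_size, increment=3, impact=0.10):
    """Bump total effort for each full increment of team growth beyond one
    person (hypothetical defaults; adjust to match your own [Refs] values)."""
    steps = max(0, (team_size - 1) // increment)
    return base_hours * (1 + impact) ** steps

print(staffed_effort(1000, 1))  # 1000.0 (no coordination overhead for one)
print(staffed_effort(1000, 4))  # one increment crossed, so roughly 1100
```

The useful part is playing with team_size: effort climbs while calendar duration falls, which is exactly the trade-off the Project Team Size variable is there to explore.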


This final section needs a bit more explanation since its application is not as obvious. As mentioned earlier, the actual role titles can be changed. Each section captures the roles involved in the development effort, plus the overhead for testing, project management, and deployment. Against each role are sample hourly rates used in calculating the project costs. The percentage split (column D) reflects how much of the effort will be divided between those various roles. Each section starts with 100%; the more senior roles reduce that amount, and the blue box reflects whatever remains. In the event the project team is 1 person, the developer is assumed to be a Senior Developer, so those splits will be ignored.

So, for example, if you decided your project team would include a Senior Tester for about 20% of the testing effort, then the remaining 80% would be inherited by the Tester. If testing was looking to be about 200 hours, then the Senior Tester would be calculated at 40 hours, and the remaining 160 hours would fall to the regular Tester(s). You can play around with those values to see how this works; it’s much less complicated than it sounds.
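That worked example in code form (a Python sketch of the same arithmetic):

```python
testing_hours = 200   # estimated testing effort
senior_share = 0.20   # Senior Tester's percentage split (column D)

senior_hours = testing_hours * senior_share   # 40.0 to the Senior Tester
tester_hours = testing_hours - senior_hours   # 160.0 falls to the Tester(s)
```

Multiply each role's hours by its sample hourly rate and you have the cost breakdown the [Costs] tab produces.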

In Sum

A couple of caveats are necessary. I have personally found having this spreadsheet at my disposal to be very useful and reasonably accurate. However, there are enough generalizations in the logic and calculations that it must of course be used with discretion. There are plenty of alternative options for estimating effort and costs that will produce a much more accurate project plan. More often than not, however, I have still found it necessary to have something quick ‘n dirty on hand. At the very least this tool has allowed me to produce a project effort estimate with minimal effort, and it has prompted excellent dialogue with clients on priorities and project overhead.

Enjoy! If you make awesome improvements please share what you produce with me.

Effort Estimation (Template) – ver 5

p.s. You can unprotect any sheet, the password is blank.

Posted in Professional Services

What’s New and Exciting with OEID 3.1?


Starting with Oracle Endeca Server, the sample data domain is now called Sales History instead of GettingStarted. The web service versions have moved from 2 to 3, so any Integrator graphs will need to be updated. There are a large number of changes to service names and properties, and a few EQL changes to be aware of. They’re well documented, but here are the top three changes in OES I would highlight.

There’s now a warm-cache-dd command for endeca-cmd to preload a dgraph cache, which can be very helpful, especially if you’re preparing for a demo to an executive.

Another big change for OES is the concept of idling data domains that aren’t being queried, and shifting those resources to those that are. Nice efficiency if it works as advertised, but also a nice complement to the warm-cache-dd command.

Newly introduced is the concept of “Collections”, which are containers for attributes and define how OES will index them. I’ll revisit this in a later post, but offhand this seems an odd direction. Grouping attributes into logical groups to support business semantics is one thing, but having to sort attributes into data sets starts to sound a bit like traditional data modeling of entities into columns and tables. I think there are some intriguing possibilities here; in the meantime, Base remains a valid collection.


Oracle Endeca Information Discovery Integrator is now called “Oracle Endeca Information Discovery Integrator ETL”. So that’s an important change that makes it much simpler and clearer…?

All data being loaded into OES must now have a “collection key” field specified to direct it to the appropriate container in its data domain. As mentioned, you can load everything into the “Base” collection if you want, but you have to specify something.

There is a new component called “Language Detector” you can use to detect what language you’re reading in.

There is improved support for Hadoop and Apache Hive. I haven’t tested it to see how well it works, but clearly they aren’t neglecting the ever-growing popularity of Hadoop.

They’ve introduced the “Oracle Endeca Web Acquisition Toolkit”, which could be a big win for a lot of clients. This provides a GUI for acquiring web content where no API exists. I give bonus points for this; adding those unstructured data sources is a high-value item for analysis, and the opportunity and return are only increasing.

Separately, it’s worth noting there is now an “IKM SQL to Endeca Server” module available. This provides a mechanism for writing to Oracle Endeca Server from Oracle Data Integrator 11g instead of the CloverETL-based version of Integrator. Oracle has suggested ODI is their strategic long-term tool for all data integration, so Integrator developers may want to consider familiarizing themselves with ODI. I don’t know if this module is compatible with the most recent ODI version, 12c.


Finally we get to the actual business application. So what’s new here? Well the interface is a bit different, but we’re still conceptually working with Applications.

We now have a Data Source Library that includes the options to select “Oracle BI” & “JDBC”. This appears to provide an alternative source of data for business users to create their own applications against. JDBC connections open up a number of doors.


Endeca data domains need to be specified through the Control Panel option “Endeca Servers”.

The import process has been improved for Provisioning Services, including the ability to merge two separate data sources together. Data mashups have been around for other BI tools, so filling that gap is good to see; it provides a nice capacity for end users to correlate records from different sources themselves. Provisioning Services will still present some questions around IT governance, and each data load is a one-time effort, so there’s little support for updates. Still, it’s a very convenient feature for business users.

There are a number of general bug fixes. Sorting options are improved, special characters are displayed correctly, browser support is addressed; nothing earth-shattering, but cumulatively positive. Dragging components around different containers has been fixed, which is frankly awesome since that bug was driving me nuts. More nuts, at least.

The Results Table component has some minor improvements, but this is more important than it sounds since it is a critical component for exploring and extracting data of interest. I didn’t have any hierarchical dimensions to test the Pivot Table with, but both now have a non-EQL interface and business users should find them easy to work with.

The “Add Components” menu has changed. Popular bubbly icons and no longer in folders. I think this makes them generally more intuitive.


Guided Nav is now called “Available Refinements” and allows users to switch from a custom listing to a default one. Breadcrumbs are now called “Selected Refinements” and look pretty much identical. The IFrame component is also very easy to configure, though not every site will support it.

Metrics Bars is now called the “Summarization Bar” (the Alerts component is also gone). As with the table components, you no longer have access to EQL queries; you’ll need to select from the available attributes and the aggregation functions they provide (e.g. min, max, avg). This is geared directly at making the component easier for business users; however, all custom EQL work will need to shift entirely to views.


There’s a toolbar along the base of the browser window which offers a few nice features. One is the option to select Data Sets. The second is to take a screen capture of all the components on the screen (note the little camera icon). They’ve extended that to the “Actions” menu on the components as well, so you have the option to “Save Image”. This should be popular; I’ve had a few clients asking for this very option. The third element on this toolbar is the ability to create bookmarks directly, which replaces the Bookmarks component. This toolbar is an elegant improvement.


The addition of heat maps to the improved mapping components is nice. Heat maps are a popular and effective means of intuitive information display. This screen capture isn’t the best example, but you get the idea.


Text enrichment can now be applied within Studio (Application Settings..Data Sets menu) rather than solely as part of the ingestion process. This will help empower business users to apply their subject matter expertise directly. The Extract Terms feature was pretty cool to see exposed in Studio, though I’ll confess I only got an error trying to run it. Sentiment scoring is still an ingestion-phase feature; however, Whitelist Text Tagging will let business users manage their own term substitution, which should allow much faster reaction to changing business rules.


The Tag Cloud provides a toggle between cloud & list view. Another nice touch for the people who want to see the data in a more structured sorted order.



OEID 3.1 is delivering a nice focus on the business usage of the tool. The changes are not dramatic, but elegant refinements that make the overall user experience much more comfortable. Going from 3.0 to 3.1 might be a tough sell, but from 2.x to 3.1 is definitely compelling.

The top three features I see are, firstly, the “Oracle Endeca Web Acquisition Toolkit”. I think it will make a big difference to sites wanting to load unstructured data. There’ll still be some work to do during ingestion, but the IAS process was a bit convoluted and not easy for everyone to adopt.

The second big feature is the removal of EQL from all the components. This version is really aimed at improving the ease of use factor for business users. There is still the capability to write custom EQL in views. Although it now becomes largely the domain of the developer, it is comforting to know that flexibility is still there. And it’s not such a bad thing if our more casual users can’t risk a typo or a logic error when they try their hand at writing queries.

The third big feature I see is in those Data Sets (a.k.a. Collections). They’re similar in function to Views, but I’m not entirely sure just how they might be leveraged.  Stay tuned for more information on this, I think they could prove quite valuable.

Posted in Endeca | 1 Comment

Installing or Upgrading to OEID 3.1

Oracle Endeca Information Discovery 3.1 has been released. Excited? If you’re working with 3.0 you probably don’t really need to be. So what’s new and improved? Well, there’s a fair bit: some is improved, some may be more of a nuisance, but I’ll dig into those things in my next posting. For now, how do you install everything?

If you’ve already gone through the installation of 3.0, you should be familiar with all the steps to configure WebLogic or Tomcat. If not, read my earlier posting on how to install OEID 3.0 for WebLogic; the steps for 3.1 are almost identical.


#1 – Uninstall OES version 7.5

The first step is to uninstall Oracle Endeca Server. On my Windows box this was the following command, including the path to where JRE was installed:

C:\Oracle\Middleware\EndecaServer7.5.1_1\oui\bin>setup.exe -deinstall -jreLoc C:\Java\jrockit-jre1.6.0_37-R28.2.5

Next remove the domain from WebLogic. This involves nothing more than stopping WebLogic, removing the entry for your OES domain from the file C:\Oracle\Middleware\domain-registry.xml, and then deleting the folder it referenced.

If you’ve set up a Windows service to automatically start OES, make sure you stop and disable it. Forgetting this step may cause system stability problems.


#2 – Install OES version 7.6.0

Run the setup command from whatever folder you extracted the “disk1” files to, again specifying your JRE install path.

setup.exe -jreLoc C:\Java\jrockit-jre1.6.0_37-R28.2.5

Next create the WebLogic domain by executing the following file. The steps are the same as for 3.0, so just follow the prompts.


Oracle has provided a sample data domain called “Sales History”. You can import this to have a set of data immediately compatible with OEID 3.1 & OES 7.6. I found some issues accessing this data store because it didn’t appear to have default data sets defined. More on that later; in the meantime you may find it easier to update the LoadData.grf from the Getting Started project that came with OEID 3.0. That involves nothing more complicated than creating the data domain through the endeca-cmd utility and updating the graph in Integrator to specify “Base” as the collection key. You will need to have upgraded Integrator first.


#3 – Upgrade Integrator

Run the uninstall for Integrator 3.0 and then the install for Integrator 3.1. The installer requires you to specify the path to the Eclipse package file, which is a bit of a nuisance compared to the 3.0 installation. You also need to make sure it’s the Eclipse IDE for Java Developers, Indigo version.

If you don’t have the right Eclipse package you will be advised that the RSE installation failed and that you should check your internet connection. Very likely your internet connection is fine, so instead check your version of Eclipse. The file that worked for me was “”.


#4 – Upgrade Studio

This part gets more complicated; at least it did for me, given the half dozen versions of Eclipse I went through before I resolved my “Internet Connection errors”.

If you have content, you may want to back up your existing install. Then just remove the domain from WebLogic in the same way we did the OES domain.

As before run this file to recreate the domain:


“Basic WebLogic Server Domain” is the one you want, and it will be selected by default. The process is pretty straightforward as you go through each of the steps. When you get to “Select Optional Configuration”, just make sure to select “Administration Server”. This will let you change the listening port so you don’t conflict with OES, which is already listening on port 7001.

In the domain folder create these additional folders (<studio_domain> is whatever name you picked):


Extract and copy the file “” to the ..\eid\studio\ folder.

Start WebLogic:


Log on to http://localhost:8101/console (replace 8101 with whatever port you specified earlier). Import the file “endeca-portal-weblogic-3.1.14220.ear”. These steps are the same as they were for installing OEID 3.0.

Once you finish, remember to make sure your application is started. To access it, be aware the path is now http://localhost:8101/eid (again, that may not be your port number). The default login is still and “Welcome123”.

You won’t be able to do much until you have data to work with, so your first step to run Studio will need to be creating some data sources.


#5 – Upgrade Provisioning Services (optional)

This process is also pretty much the same as it was for OEID 3.0.  Remember to remove the old provisioning service domain folder and its corresponding reference in the file: “C:\Oracle\Middleware\domain-registry.xml”. Also remove the old provisioning service application folder (..\user_projects\applications\<old_prov_service>).

When you get to “Select Optional Configuration”, select “Administration Server”. In my case the port was already set to 8201 and wasn’t defaulting to 7001, but it’s better to check to be sure. Also verify the ws port is 7001 in the file “..\user_projects\domains\prov_domain\eidProvisioningConfig\plan.xml”.

Note if you’re running an SSL environment those ports would default to 8202 and 7002 respectively.

Once you start the service you should be able to verify it through a browser.


You’ll have to update the JSON entry for Provisioning Service in the Control Panel before you can use the service.

So you should now be up and running and able to start checking out OEID 3.1 for yourself.

Posted in Endeca

Agile BI Does Not Mean Anything Goes

Recently I was in a discussion where the capacity for OEID to rapidly ingest data was highlighted as one of its highest value items. That’s a fair point. In the spirit of “Agile BI” the turnaround time for ingesting a brand new data source is indeed a key differentiator.

The risk is that rapid ingestion in the spirit of “Agile BI” invites the same pitfalls as “Agile” projects that cut corners and throw out discipline. Agile, I believe, means engaging with clients, embracing change, and building a solution through iterative improvements.

Rushing anything will not produce a quality product. Rapid data ingestion is a wonderful ability, but always ask yourself these questions:

#1- Do I understand the overall vision?

Is the data I’m loading going to be nothing more than a temporary throwaway data store? Or will it be there for the long haul, adding depth to existing records and needing to align to the existing dimensions and records?

There’s no harm in spinning up a data store and throwing it away after some quick exploration. That’s how experimentation and learning work. However making that the standard practice instead of supporting a more holistic enterprise architecture may not deliver the value your client needs. Consider how your data set might fit into the whole, it’s rarely without connections that need to be understood.

#2 – Am I delivering what the client needs?

Oracle Endeca Server will let you load data without any structure at all. You can baseline a data store, and attribute names and types can be reset anytime. Very accommodatingly, all the data can default to strings, and Studio is wonderfully intuitive for text searching.

This doesn’t mean you’ll answer all the questions your client will have. Ensuring numerical and date/time attributes are in their respective formats may ensure trending and statistical questions can be answered. Taking some time to consider the semantics and even attribute groups may make the data exploration a much richer experience for the client. Before you start madly shoveling data into a heap take pause to confirm it works for your client. The technology doesn’t require it, but there is return value from the data modeling exercise.

#3 – Am I developing like a cowboy?

This starts innocently enough, with a question being answered with a shrug and the suggestion it doesn’t matter. Maybe it doesn’t, but do this too many times and your solution won’t matter either.

There are good reasons for standards and best practices. Ignoring them for the sake of fast data ingestion can translate into such poor reporting quality that users lose confidence and trust in the system. Developers need to stay disciplined so their solution will also be measured by its reliability.

So yes, the technology in OEID enables very rapid ingest of data. Agile BI is very compelling that way.

Just never forget that quality products come from quality effort, and speed is not the only measure of success.

Posted in Endeca | 1 Comment

Information Audits are an Opportunity, Not a Police Action

The first goal when auditing information systems is typically to verify that policies exist and are enforced around the responsible management of data.  Audit controls, change tracking, security, access, integrity constraints, etc. can all be verified or recommended.  These are prudent practices for managing risks and corporate liabilities, and they are essential for an organization to maintain its customers’ trust.  They also make an audit look like an expense with no immediate business advantage.  Like paying for car insurance, it’s a cost we’d really rather not have.  Focusing exclusively on questions of governance and control, however, can fail to recognize there is more opportunity to an information audit than simply implementing new rules.

Some years ago I sat in on a meeting to discuss information sharing.  One manager stated quite definitively there was no way they were going to share a report because of “FOIPOP” laws.  This was a reference to Freedom of Information/Protection of Privacy legislation current in British Columbia at the time.  Their staunch protection of their customers was comforting; however, for the scenario being discussed there was in fact no compliance issue.  More surprisingly, their vague reference to “FOIPOP” proved sufficient to silence the other party.

Since that time our familiarity with, and adoption of, good practices for managing information has greatly matured.  Yet we still find organizations and users that don’t entirely understand their responsibilities and their rights with regard to their data assets.  We need to focus on themes of accountability and ownership, and how they apply both between the organization and its client base and between staff and their organization’s objectives.


The immediate and most familiar requirement for an Information Audit is to ensure Accountability in the collection and usage of information.  This is ultimately a question of whether the organization is managing its information with respect to their obligations to their customers.  Is the information being gathered and retained for reasonable purposes?  Is it being stored with diligence to security and access policies?  Are they holding themselves accountable to how that information is used so the rights of their customers are maintained?  Do they have any exposed liabilities or risks?  There may be market or government legislation that mandates how information can be collected, stored and used.  There are almost always internal policies and procedures that need to be defined, communicated and enforced.  And there generally needs to be technical implementations for ensuring compliance and governance that IT will often be responsible for.   Ensuring organizational accountability is the first priority to assess.

What shouldn’t be lost, however, is the accountability for supporting and enabling the organization’s success.  There is an accountability users and departments have to enable each other, which excessive zeal around protecting data can obstruct.  Sharing data isn’t a bad thing.  In the spirit of “Systems Thinking”, a good audit will also identify where policies and safeguards are unnecessary or could be streamlined to benefit the business.  The cumulative corporate data an organization collects is an immensely valuable asset that, leveraged appropriately, can empower the business.  Navigating silos and diplomatically handling internal politics is part and parcel of making sound recommendations in a complete information audit.


The second area of focus is Ownership.  This captures the responsibilities individuals and departments have around the data they collect and their role as stewards of that data: ensuring they understand their responsibilities for collecting good clean data, correcting errors, and understanding how dependent downstream systems are on data quality.  Aligning against business objectives can help clarify whether the right information is being collected and retained.  Ownership includes a look at how other parties are supported, how usage policies are communicated, and who ultimately decides how data will be shared or transformed within the organization.

This sense of ownership should not, however, lead directly to silos of information.  Part of the mandate of being a data owner is to ensure all users understand both their responsibilities and their rights with the data in a system.  Users are empowered when they understand the reason behind a policy and can access the necessary resources and experts to be able to apply appropriate judgment on how data is shared or managed.  In your own organization, do you find the hurdles to working with your organization’s data reasonable or excessive?  Ensuring corporate data is being shared appropriately and efficiently between separate departments is a critical deliverable of a good information audit.

Data is one of the most valuable assets an organization has.  Business users have serious responsibilities around its collection, storage and usage that need to be very carefully understood and respected.  The risks and liabilities are considerable.  The caveat remains, however, that an information audit should focus not only on mitigating risks but also ensuring data is being effectively leveraged. There is just as much value in recognizing when a constraint is not necessary as there is in confirming one exists.

When you assess an information system you need to go beyond the technical details and consider your data landscape more holistically.  By understanding the business, by reviewing the external and internal policies and procedures, and by engaging with business users at all levels, you will be able to deliver an assessment that will help you manage your risks and ensure you are maximizing your opportunities.

An information audit is not a police action.  This is an opportunity to ensure your organization is leveraging its data assets respectfully and effectively.

Posted in Information, Professional Services

Predictive Analytics with Endeca

The opportunities and challenges to delivering real world Predictive Analytics are exciting.  They’re not trivial efforts, and they require a level of collaboration between Business and IT that can be rare.  However when the stars align the forecasts they produce can be game changers.  And a BI solution that doesn’t change the game for its business is arguably a waste of time.

Oracle Endeca Information Discovery is not really a Predictive Analytics tool.  The text mining through Lexalytics provides one powerful data mining model, but it’s the only one, and it runs during data ingest, upstream from Studio.  That said, we’ve interfaced with R from Integrator often enough to know that during the ETL stage just about any external data mining model is effectively available.  Some level of classification and association might be suggested through Studio’s data exploration, but I’d argue this produces business questions and is a far cry from the complex algorithms behind proper cluster analysis and classification trees.  OEID is first and foremost a tool for Data Discovery.

So what good is OEID when your company wants to add Predictive Analytics?

There are three reasons, starting with “Big Data”.  If you crunch through some of the facts gathered and collected by Marcia Conner, you can see data volumes are absolutely enormous and continuing to grow.  Data mining structured and massaged data isn’t hard, but there is a growing gold mine of unstructured and social media data that is ripe for analysis.  Collecting all this data together requires robust and capable ETL processes that can handle data sets that are text rich and constantly changing.  OEID provides the Integrator Acquisition Service (IAS), the text enrichment components, and an inherent flexibility in the engine to support multi-assign attributes and ragged-width records without a long refactoring effort.  Combined, these deliver an ideal toolset to bring together data from multiple sources and formats.

The second consideration is the sheer effort involved in Bridging the Gap between business and IT.  Data Scientists are expected to know all the data mining models, but like the rest of IT they have no secret insight that automatically reveals all the business nuances.  And the data mining tools they use (e.g. R) do not always offer intuitive interfaces for business users to pick up.  OEID Studio is a visual tool.  Its intuitive and friendly interface makes it ideal to search, explore, and even extract data of interest.  That exploration can be used to feed data mining models, create training sample sets, and of course examine the output they generate.  As a tool for communicating and collaborating, Studio can be ideal for reducing the gap between IT and business, helping ensure relevancy and focus in any data mining effort.

The final major consideration is the growing demand for tools that deliver interactive and intuitive visualizations and simulations.  At heart I absolutely believe the rule that good data coupled with simple visualizations is best, and out of the box OEID offers the essential histograms, line graphs, scatter plots, and of course the ever popular pie chart.  With features like guided navigation, breadcrumbs, and the search interface, OEID provides a tool to explore your data and any generated data mining models.  For those analysts who trust the pivot table and ad hoc queries, the alerts, results table, and metrics components are easily accessible.  For anything more, the framework Studio is built upon (Liferay) is entirely extensible, supporting customizations, improved visualizations, and even methods for simulations and model output comparisons.  OEID is an ideal and flexible tool for business users to interact with the output of data mining models.

Business Users aren’t going to be able to use OEID to generate predictions.  They aren’t going to write EQL queries that perform regression analysis.  They won’t find a magic button that performs algorithmic analysis and produces coefficients, probabilities, margins of error, and all the other statistical outputs proper data mining models produce.

They will however have a tool that can handle all the varieties and volumes of real time data to feed into the data mining exercise.  They’ll have an architecture that can interface with data mining engines to ingest the results of a model.  They’ll have a powerful text mining engine in a data world that is increasingly unstructured text.  And they’ll have an interactive and visual tool that lets them actively participate in preparing, tuning, and applying the fruits of Predictive Analytics.

The only thing missing is the question:

What do you want to predict for your business?

Posted in Endeca | 2 Comments

Making Metrics Matter with Data Exploration – Part 3 of 3

(This post follows an earlier one on working with EQL to use a join instead of a WHERE clause.)

The last element of working with metrics I want to cover relates to a side effect of Data Exploration.  As you browse your data and navigate through different dimensions, you can arrive at a point where the records needed to define the metrics are no longer returned.

This is the same behavior we see with the Guided Nav component.  It’s very effective: as you explore the various dimensions, records start to be filtered out, and dimensions aren’t displayed once there are no further levels of records to explore.  This lets you nicely focus in on the records of interest.

Using the GettingStarted data domain, browse to “Fiscal Year” and select 2011 & 2010.


The metrics bar provided with this sample will display values, but we know there are some issues.  First, the sales growth isn’t reflective of 2010.  Second, though the number of orders is, there’s no alignment between those order numbers and the calculated sales growth.


The metrics bar we created earlier shows us data more contextually relevant to our current nav state.


If we remove 2011 from our breadcrumbs pane, our updated metrics bar unfortunately fails us.  Both bars do, actually, though ours arguably fails worse than the original version.


So why does this happen?  We have Coalesce calls around the attributes to handle nulls, so shouldn’t we simply see 0 values?  It seems to work for the default metrics bar, after all.

The issue for our EQL is that we’re defining record sets for our source, and those record sets are coming back empty.  Coalesce operates on an attribute value: it handles a value missing from a record, not a missing record.  If you don’t even have a record in your record set, then Coalesce has nothing to execute against.  This doesn’t seem to impact the EQL that operates by default against the Base view.
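To illustrate the distinction, here is a sketch of my own (the attribute names are hypothetical, not taken from the GettingStarted screenshots): a statement restricted to a fiscal year the current navigation state has filtered out returns no records at all, rather than one record carrying a null.

```
/* Hedged sketch, not the post's original EQL.  If every 2010 record
   has been refined away, this grouping produces no records at all,
   so the COALESCE below never has a record to execute against. */
RETURN PreviousYearSales AS
SELECT
  COALESCE(SUM("FactSales_SalesAmount"), 0) AS "SalesPrevious"
FROM "SaleState"
WHERE "FactSales_FiscalYear" = 2010
GROUP
```

The result isn’t a record with SalesPrevious of 0; it’s no record at all, which is exactly why the metrics bar reports “No results available”.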

Currently our EQL looks as follows, incorporating both the dynamic identification of fiscal years and the join-based fix for the performance issue around using a WHERE clause.  The latter change was less important for GettingStarted, but I included it for anyone wanting the reference.


There may be other solutions, but the approach I took mirrors the one used to resolve the performance issue around filtering the fiscal year.  Taking a page from the world of traditional database development, I simply made sure I always had a record set containing exactly one record that I could join to.  This way my Coalesce functions will always execute.

You could just use a local DEFINE statement, but in order to reuse the logic multiple times I created a view as follows:


This view will always and only ever provide me with a “RecordCount” of 0.
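The view definition itself appears only as an image in the original post, so as a hedged sketch, one way to build a view with that behavior is below.  The MIN(0) aggregation and the choice of Base as the source are my own assumptions; Base is used because, as noted above, the default EQL against it doesn’t seem to suffer from the empty record set problem.

```
/* Hedged sketch of the ReturnZeroCount view (assumed definition):
   aggregate the source down to a single record whose only
   attribute is the constant 0. */
DEFINE ReturnZeroCount AS
SELECT
  MIN(0) AS "RecordCount"
FROM "Base"
GROUP
```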

Applying this to my EQL, I create a Cartesian product by doing a full outer join to the ReturnZeroCount record set.  This has minimal impact, however, since it’s only a single record with a single value.  And where it’s added it only adds a value of 0, so it won’t change an actual count.  Really makes you wonder how ancient civilizations managed without the concept of 0!


Since I’m always guaranteed at least one record in my record set, I also know my metrics will always be calculated.  They may only return a value of 0, but that’s a better message than “No results available”.
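The joined EQL is again only shown as an image in the original, so the shape of the technique can be sketched roughly as follows.  Statement and attribute names here are my own stand-ins (in the real query rs1 and rs2 are built from the dynamically identified fiscal years), and the join condition is illustrative.

```
/* rs1: current-year totals (hypothetical names and filtering). */
DEFINE rs1 AS
SELECT
  SUM("FactSales_SalesAmount") AS "Sales"
FROM "SaleState"
GROUP;

/* Full outer join to the single ReturnZeroCount record: even when
   rs1 comes back empty the result still contains one record, so the
   COALESCE can substitute 0.  Adding RecordCount (always 0) to the
   sum changes nothing. */
RETURN Metrics AS
SELECT
  COALESCE(rs1."Sales", 0) + ReturnZeroCount."RecordCount" AS "TotalSales"
FROM ReturnZeroCount
FULL JOIN rs1 ON (ReturnZeroCount."RecordCount" = 0)
```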

The resulting metrics bar displays as follows:


There are two final refinements to make.  Because I no longer have a previous year, my join from rs1 to rs2 fails.  By adjusting it to a FULL JOIN as well, I can make sure I’m not excluding values for the current fiscal year.

I’ll also update the calculation of Sales Growth: since SalesPrevious may be null, I’ll default the value to 1.  This is a bit arbitrary, but you could argue it’s appropriate to report 100% sales growth when starting from zero sales.
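The final formula also appears only as an image.  One reading of “default the value to 1”, sketched in EQL with hypothetical attribute names, is to coalesce the whole ratio, so that a null SalesPrevious (which makes the division null) falls back to 1, i.e. 100% growth:

```
/* Hedged sketch: if SalesPrevious is null the whole expression is
   null, and COALESCE reports a growth of 1 (100%). */
COALESCE(("Sales" - "SalesPrevious") / "SalesPrevious", 1) AS "SalesGrowth"
```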


Now my metrics bar displays my total sales for the current year, my total number of orders for the current year, and my sales growth from the previous year: values that will correctly reflect your changing navigation state.


Hopefully these techniques will prove of use to you in your business.  Cheers!

Posted in Endeca