Wednesday, March 28, 2012

Model Information

Is it possible to get hold of the fit parameters out of the prediction models in order to use them without going back to the database.

E.g. With the linear fit model, y = ax + b, is it possible to get hold of a and b?

This way I could use the fit equation and parameters directly within my C# code, rather than making many costly connections back to the database.You can get all of the parameters from the model's content rowset. You can see the rowset by using the generic viewer (there's a viewer drop-down above the mining viewers) or by executing the query SELECT * FROM [My Model].CONTENT.

At www.sqlserverdatamining.com there is an additional plug-in viewer that makes it easier to view the content rowset as well.

You can query the content rowset through code using standard connection and command objects.|||Thanks of replying, but this still doesn't seem to give me the actual parameters of the equation. I can see them in the 'mining legend' window, though I can only get this to come up for the linear fit model.

Also it seems to be giving me strange fit parameters. I created a table of data with x = 1 - 20 and y = 2x. When I performed a linear fit on this, the equation I was given in 'Mining Legend' was Y = 21.000+1.999*(X-10.500). This seems a bit odd, especially when I pasted the data into excel and got the expected answer of a zero intercept and a gradient of 2.|||

Greetings, MagorGirl:
what you are seeing in the Mining Legend is the algebraic form of the linear fit you are looking for.
Because 1.99999*(-10.5) is -21 [up to roundoff], once you expand the parentheses you indeed get Y = 0.000 + 1.999*X
Regression formulas are represented in such form in order to highlight the mean values of the each regressor and of the regression target.
In this case 10.5 is obviously the mean of x and 21 is the mean of y.
So by looking at this format you can take in all the essential stats in one glance.

If you use a regression formula in this format for numeric prediction, the numeric resultsare going to be indentical of course to when you use the equivalent linear fit in the form of a*x+b

|||Thanks Alexei

There's still a problem with the accuracy of this fit though, the intercept does not cancel out fully unless you use quite a lot of rounding. Is there a tendency in the model not to overfit? Does the SQL regression require more data than the excel regression?

I was also looking at the neural networks model and trying to see what equation it produces.

Is there a table anywhere that contains the explanations for the different valuetypes? I've found out that in the linear regression output, valuetype of 3 = intercept and 7 is gradient (Thanks to Jamie for pointing out the code for the improved viewer). I've yet to find a table explaining all the values and the general content of the node_distribution table. Is there any more technical documentation out there, there doesn't seem to be much in books online.

Thanks|||> Is there a tendency in the model not to overfit?

That's exactly right. There is a mechanism of Bayesian priors to reduce the risk of overfitting. The more data you have the less noticeable the influence of the priors is going to be.

I am not sure where the valuetype descriptions are published...|||

I know this is not the level of detail you're looking for but you can at least see the symbolic names for the value types (and other constants used in content) in "\Program Files\Microsoft SQL Server\90\SDK\Include\oledbdm.h" - e.g.:
#define DMMVALUETYPE_COEFFICIENT ( 7 )

We hope to have more technical information available in web updates to Books Online.

|||The GenericContentTreeViewer downloadable from sqlserverdatamining.com decodes all the content node types and value types for you to make it easier to understand. I recommend giving it a try.|||Ok, so I've installed the generic viewer and I can see that some values_types are continuous, some coefficients etc. I can see how the linear fit equation can be built up from this.

I would still like to create an equation from the Neural Network model. I used the same data set as for my linear fit, y = 2x. I told the model to have zero hidden nodes. I basically get two value_types back, which are both continuos. X = 5.5 and Y = 11. Obviously this shows that y = 2x, but how do I construct an equation from these?

Thanks for all the help so far

No comments:

Post a Comment