There are two things wrong with histc:
a) histc without normalization uses normalization=%t as default, however in the help normalization=%f is defined as default.
b) histc with normalization=%t gives *TOTALLY* *WRONG* results.

HOW TO REPRODUCE THE BUG:
-------------------------
myprob=[1 2 2 3 3 3]

a) (wrong default)
histc(3,myprob)
result:
ans = 0.25 0.5 0.75
expected:
ans = 1. 2. 3.

b) wrong results:
histc(3,myprob,normalization=%t)
results:
ans = 0.25 0.5 0.75
expected:
ans = 0.1666667 0.3333333 0.5
You are right. Several points should be improved or corrected:
histc() should be able to work either in discrete mode or in continuous mode (as dsearch does). This would require a new "c"|"d" flag.
There are 2 kinds of normalization: either in bin heights (as announced in the help page, but not implemented), or in total area (as tentatively implemented, with an error).
histc() should be able to process texts as data, as dsearch() now does.
histc() should be able to return the internally defined bins when only a required number of bins is specified.
IMO, the normalization flag should be turned from boolean to integer. Then, ==0 will mean "no normalization = raw bin populations" ; ==1 will mean "height-normalized". This would also make it possible to implement normalization==2 later.
In the .sci file, the comments ending the lines
//cf = cf./(nd*cw); // Normalization in bin heights
cf = cf./nd; // Normalization in total area
are wrong.
The first line is wrong. It does not correspond to any proper normalization. It should be deleted.
For the second line: the implementation is for heights normalized against the total population, not in total area.
A normalization in total area would be implemented with
cw = cb(2:$)-cb(1:$-1); // Bin width
cf = cf.*cw; // areas of bins
cf = cf/sum(cf); // normalized areas
For this normalization, the user should be warned in the help page that the total area covers only the defined bins. Therefore, elements possibly lying outside the predefined bins are not taken into account.
Hello Samuel, thanks for your observations, you might want to start a new commit to improve histc?
I think that all 3 normalization modes could be implemented in this commit. Don't you?
For extending histc() to text processing and returning bins, I agree: this will require a little more work and a SEP, and a new commit.
Actually, the initial line in 5.5.0
cf = cf./(nd*cw);
almost implements the total-area normalization. (Mine in comment #3 is wrong; I apologize for my bad criticism in comments #1 and #3 about area normalization.) Heights h are such that the raw areas cw.*h are equal to the raw populations cf. Hence the heights hn corresponding to normalized areas are such that cw.*hn = cf/sum(cf), which gives hn = cf ./ cw / sum(cf)
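The same reasoning can be written out as a short derivation. The symbols are introduced here only for illustration: $c_i$ stands for the raw population of bin $i$ (cf), $w_i$ for its width (cw), $h_i$ and $\hat{h}_i$ for the raw and normalized heights, and $N$ for sum(cf).

```latex
\begin{align*}
w_i \, h_i &= c_i
  && \text{(raw areas equal raw populations)} \\
w_i \, \hat{h}_i &= \frac{c_i}{N}, \qquad N = \sum_j c_j
  && \text{(areas normalized to a total of 1)} \\
\hat{h}_i &= \frac{c_i}{w_i \, N}
  && \text{(normalized heights)} \\
\sum_i w_i \, \hat{h}_i &= \frac{1}{N} \sum_i c_i = 1
  && \text{(check: total area is 1)}
\end{align*}
```

Under these definitions, the expression cf./(nd*cw) gives exactly $\hat{h}_i$ whenever nd equals sum(cf), that is, when every data value falls inside a bin.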
As stated just above, the calculation of nd might be revised:
The line #37
nd = length(data); // Number of data values
a) can be set in the normalization block, or just after the dsearch() instruction. nd is not used elsewhere.
b) might become: nd = sum(cf)
Hence, support for text data will be straightforward (whereas length(data) would return the number of chars for text data), and elements outside bins won't be taken into account.
For heights normalization, either cf/size(data,"*") or cf/sum(cf) could be chosen. I personally have no argument to recommend one over the other.
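To make the difference between the two denominators concrete, here is a minimal sketch in Python (not Scilab, and not the histc() implementation). The helper bin_counts, the bin edges and the extra value 10 are all made up for illustration; the binning convention (first bin closed, the others open on the left) is an assumption.

```python
from bisect import bisect_left

# Hypothetical data: the thread's example plus one value (10) outside the bins
data = [1, 2, 2, 3, 3, 3, 10]
edges = [0.5, 1.5, 2.5, 3.5]  # three made-up bins

def bin_counts(values, edges):
    """Count values per bin: [e0,e1], (e1,e2], ... ; values outside are ignored."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v <= edges[-1]:
            k = max(bisect_left(edges, v) - 1, 0)
            counts[k] += 1
    return counts

counts = bin_counts(data, edges)

# Analogue of cf/size(data,"*"): the outlier still counts in the denominator
by_size = [c / len(data) for c in counts]
# Analogue of cf/sum(cf): only binned elements count
by_sum = [c / sum(counts) for c in counts]

print(counts)
print(sum(by_size), sum(by_sum))
```

With the first choice the heights sum to less than 1 as soon as some elements lie outside the bins (here 6/7); with the second they always sum to 1. Which behaviour is preferable is the open question raised above.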
There are two things wrong with histc:
a) histc without normalization uses normalization=%t as default, however in the help normalization=%f is defined as default.
FIXED
b) histc with normalization=%t gives TOTALLY WRONG results.
HOW TO REPRODUCE THE BUG:
myprob=[1 2 2 3 3 3]
b) wrong results:
histc(3,myprob,normalization=%t)
results:
ans = 0.25 0.5 0.75
expected:
ans = 0.1666667 0.3333333 0.5
No. [0.25 0.5 0.75] is correct. Indeed, the class width is (3-1)/3, so that the total area is 1:
--> sum([0.25 0.5 0.75]) * (3-1)/3
ans =
1.
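The same check can be reproduced from the raw data. The sketch below is in Python rather than Scilab and only illustrates the arithmetic; the variable names are hypothetical, and the simple binning rule used here is an assumption that happens to give the same counts as the report for this data set.

```python
data = [1, 2, 2, 3, 3, 3]   # myprob from the report
n_bins = 3

lo, hi = min(data), max(data)
width = (hi - lo) / n_bins  # class width (3-1)/3

# Count the data in each of the equal-width bins; the maximum goes in the last bin
counts = [0] * n_bins
for x in data:
    k = min(int((x - lo) / width), n_bins - 1)
    counts[k] += 1

# Area normalization: heights such that sum(heights) * width == 1
heights = [c / (len(data) * width) for c in counts]
total_area = sum(heights) * width

print(counts)
print([round(h, 4) for h in heights])
print(round(total_area, 4))
```

The counts are 1, 2 and 3, the heights come out as 0.25, 0.5 and 0.75, and the total area is 1, which agrees with the histc() output quoted in the report.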