histc uses wrong default and normalization=%t is wrong

Reported by Richard llom

There are two things wrong with histc:

a) 
histc without normalization uses normalization=%t as default, however in the help normalization=%f is defined as default.

b)
histc with normalization=%t gives *TOTALLY* *WRONG* results.


HOW TO REPRODUCE THE BUG:
-------------------------
myprob=[1 2 2 3 3 3]

a) (wrong default)
histc(3,myprob)
result:
 ans  =   0.25    0.5    0.75
expected:
 ans  =   1.    2.    3.

b) wrong results:
histc(3,myprob,normalization=%t)
results:
 ans  =   0.25    0.5    0.75
expected:
 ans  =   0.1666667    0.3333333    0.5
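
The three candidate outputs discussed in this report can be recomputed by hand. The following is a minimal Python sketch, not Scilab's histc code. It assumes three equal-width classes spanning [min, max] of the data, each class closed on its right edge (the first one also on its left edge). The helper name bin_counts is hypothetical.

```python
from bisect import bisect_left

def bin_counts(data, edges):
    """Count values per class; class i is (edges[i], edges[i+1]], the first also includes edges[0]."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        if x < edges[0] or x > edges[-1]:
            continue  # values outside all classes are ignored
        i = max(bisect_left(edges, x) - 1, 0)
        counts[i] += 1
    return counts

myprob = [1, 2, 2, 3, 3, 3]
nbins = 3
lo, hi = min(myprob), max(myprob)
width = (hi - lo) / nbins                                # (3-1)/3
edges = [lo + i * width for i in range(nbins)] + [hi]    # last edge set exactly to max

n = len(myprob)
raw = bin_counts(myprob, edges)               # raw populations
by_population = [c / n for c in raw]          # heights sum to 1
by_area = [c / (n * width) for c in raw]      # heights * width sum to 1

print(raw)            # [1, 2, 3]
print(by_population)  # approximately 0.1666667, 0.3333333, 0.5
print(by_area)        # approximately 0.25, 0.5, 0.75
```

Under these assumptions, the raw populations match the "expected" line of a), the population-normalized heights match the "expected" line of b), and the area-normalized heights match the output actually returned.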

Activity

scilab bot added 5.5.0 Bugzilla Major PC Statistics labels 10 years ago

scilab bot @scilabbot 10 years ago

By Samuel GOUGEON (@sgougeon)

You are right. Several points should be improved or corrected:

histc() should be able to work either in discrete mode or in continuous mode (as dsearch does). This would require a new "c"|"d" flag.

There are 2 kinds of normalization: either in bin heights (as announced in the help page, but not implemented), or in total area (as tentatively implemented, with an error).

histc() should be able to process texts as data, as dsearch() now does.

histc() should be able to return the bins that it defines internally when only the required number of bins is specified.

We could work on such an upgrade. Regards, Samuel
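
To illustrate what the discrete mode suggested above could mean: instead of counting values per interval, each value is counted against a list of exact reference values, which also works for texts. This is only a Python sketch under that assumption; the helper name count_discrete is hypothetical and it does not reproduce the signature of dsearch() or histc().

```python
from collections import Counter

def count_discrete(data, values):
    """Count how many elements of `data` are exactly equal to each entry of `values`."""
    tally = Counter(data)
    # Elements of `data` that match none of `values` are simply not reported.
    return [tally[v] for v in values]

numeric_counts = count_discrete([1, 2, 2, 3, 3, 3, 7], [1, 2, 3])
text_counts = count_discrete(["b", "a", "b", "z"], ["a", "b", "c"])

print(numeric_counts)  # [1, 2, 3]
print(text_counts)     # [1, 2, 0]
```

Interval-based (continuous) counting has no direct equivalent for texts, which is one reason a separate mode flag would be needed.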
scilab bot @scilabbot 10 years ago


By Paul BIGNIER

Hi Richard,

Thank you for that report. Indeed the help page did not match histc's functioning and the normalization was improper. Here is the fix for both problems: https://gitlab.com/scilab/legacy_codereview/-/wikis/changes/14821/

Hello Samuel, thanks for your observations, you might want to start a new commit to improve histc?

Regards, Paul
scilab bot @scilabbot 10 years ago


By Samuel GOUGEON (@sgougeon)

Hello Paul,

(In reply to Paul BIGNIER from comment #2)

https://gitlab.com/scilab/legacy_codereview/-/wikis/changes/14821/

IMO, the normalization flag should be changed from a boolean to an integer. Then ==0 would mean "no normalization = raw bin populations", and ==1 would mean "height-normalized". This would also allow normalization==2 to be implemented later.

In the .sci file, the comments ending the lines

//cf = cf./(nd*cw); // Normalization in bin heights
cf = cf./nd; // Normalization in total area

are wrong.

The first line is wrong: it does not correspond to any proper normalization, and it should be deleted. As for the second line, it implements heights normalized against the total population, not a normalization in total area.

A normalization in total area would be implemented with

cw = cb(2:$)-cb(1:$-1); // Bin width
cf = cf.*cw; // areas of bins
cf = cf/sum(cf); // normalized areas

For this normalization, the user should be warned in the help page that the total area covers only the defined bins. Therefore, elements possibly lying outside the predefined bins are not taken into account.

Hello Samuel, thanks for your observations, you might want to start a new commit to improve histc?

I think that all 3 normalization modes could be implemented in this commit. Don't you? As for extending histc() to text processing and to returning bins, I agree: this will require a little more work, a SEP, and a new commit.

Best regards, Samuel
scilab bot @scilabbot 10 years ago


By Samuel GOUGEON (@sgougeon)

Actually, the initial line in 5.5.0
cf = cf./(nd*cw);
almost implements the total-area normalization. My version in comment #3 is wrong, and I apologize for my unfounded criticism in comments #1 and #3 about area normalization. Heights h are such that the raw areas cw.*h are equal to the raw populations cf
==> heights hn corresponding to normalized areas are such that cw.*hn = cf/sum(cf)
==> hn = cf ./cw / sum(cf)

As stated just above, the calculation of nd might be revised. Line #37
nd = length(data); // Number of data values
a) can be moved into the normalization block, or just after the dsearch() instruction, since nd is not used elsewhere;
b) might become: nd = sum(cf)
Hence, support for text data would be straightforward (whereas length(data) would return the number of characters for text data), and elements outside the bins would not be taken into account.

For heights normalization, either cf/size(data,"*") or cf/sum(cf) could be chosen. I personally have no argument to recommend one over the other.
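
The integer flag proposed in comment #3 and the formula hn = cf ./cw / sum(cf) derived above can be sketched as follows. This is a Python illustration, not the Scilab implementation; the function name normalize, the mode numbering 0/1/2, and the sample numbers (including the 15 data values, 3 of them assumed to lie outside the bins) are hypothetical.

```python
def normalize(counts, edges, mode=0):
    """mode 0: raw populations; 1: heights summing to 1; 2: heights whose total area is 1."""
    total = sum(counts)  # nd = sum(cf): only elements falling inside the bins count
    if mode == 0:
        return list(counts)
    if mode == 1:
        return [c / total for c in counts]
    if mode == 2:
        widths = [b - a for a, b in zip(edges, edges[1:])]
        return [c / (w * total) for c, w in zip(counts, widths)]
    raise ValueError("mode must be 0, 1 or 2")

edges = [0, 1, 3, 6]   # bins of unequal widths 1, 2 and 3
counts = [2, 4, 6]     # populations inside the bins

heights = normalize(counts, edges, mode=1)
density = normalize(counts, edges, mode=2)
widths = [b - a for a, b in zip(edges, edges[1:])]
area = sum(h * w for h, w in zip(density, widths))
print(sum(heights), area)  # both approximately 1

# Effect of the choice of nd when some data lie outside the bins
n_data = 15
with_length = sum(c / n_data for c in counts)     # about 0.8: outside elements dilute the heights
with_sum = sum(c / sum(counts) for c in counts)   # about 1.0
print(with_length, with_sum)
```

With nd = sum(cf) the normalized heights always sum to 1 over the defined bins, while with the number of data values they sum to the fraction of data that fell inside the bins.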
scilab bot @scilabbot 8 years ago


By Pierre-Aime AGNEL

merged here e9a34075

without the improvements on NaN
scilab bot closed 8 years ago

scilab bot @scilabbot 8 years ago


By Christophe Dang Ngoc Chan

Hello,

There is one remaining problem in the 6.0.0 release with the built-in help in French. It says

"normalization scalaire booléen. normalization=%f (par défaut)" (in English: "normalization boolean scalar. normalization=%f (default)")

which is not correct. The online page in French is also erroneous

https://help.scilab.org/docs/6.0.0/fr_FR/histc.html

whereas the page in English is correct.

Regards
scilab bot @scilabbot 8 years ago


By Paul BIGNIER

A proposed fix was pushed by Samuel to address many of histc's issues: https://gitlab.com/scilab/legacy_codereview/-/wikis/changes/19045/

It hasn't been pushed in Scilab 6.0.0 but may appear in Scilab 6.0.1.

Regards, Paul
scilab bot reopened 8 years ago

scilab bot @scilabbot 7 years ago


By Nicolae Cindea

Hello,

The normalization for histplot and histc does not work in Scilab 6.0.0. Here is a simple example:

U = grand(10000, 1, 'nor', 0, 1)
H = histplot(100, U, normalization=%t)
HC = histc(100, U, normalization=%t)
S = (max(U) - min(U)) / 100 * sum(H)
SC = (max(U) - min(U)) / 100 * sum(HC)
disp(S)  // gives 0.0777328 in Scilab 6.0.0
         // gives 1. in Scilab 5.5.2
disp(SC) // gives 0.0777328 in Scilab 6.0.0
         // gives 1. in Scilab 5.5.2

I know that a bug is already open on this question, and I hope that this issue will be fixed rapidly.

Regards, Nicolae.
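
The check used in this comment (bin width times the sum of the normalized heights should be 1) can be reproduced independently of Scilab. The following Python sketch assumes 100 equal-width bins over [min(U), max(U)] and area normalization; it bins with left-closed intervals, which may differ from Scilab's convention at bin edges but does not change the total area. The fixed seed is an arbitrary choice.

```python
import random

random.seed(0)  # arbitrary seed, only for reproducibility
U = [random.gauss(0.0, 1.0) for _ in range(10000)]

nbins = 100
lo, hi = min(U), max(U)
width = (hi - lo) / nbins

counts = [0] * nbins
for x in U:
    # Clamp so that the maximum value falls into the last bin
    i = min(int((x - lo) / width), nbins - 1)
    counts[i] += 1

heights = [c / (len(U) * width) for c in counts]  # area-normalized heights
area = width * sum(heights)
print(area)  # approximately 1
```

Any implementation of the area normalization should pass this check, whatever the data; a result far from 1, such as the 0.0777328 reported for Scilab 6.0.0, indicates a wrong normalization.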
scilab bot @scilabbot 6 years ago


By Samuel GOUGEON (@sgougeon)

Urgent fix proposed for Scilab 6.0.2 in review at https://gitlab.com/scilab/legacy_codereview/-/wikis/changes/20472
scilab bot @scilabbot 6 years ago


By Samuel GOUGEON (@sgougeon)

(In reply to Nicolae Cindea from comment #8)

The normalization for histplot and histc does not work in Scilab 6.0.0. Here is a simple example:

U = grand(10000, 1, 'nor', 0, 1)
HC = histc(100, U, normalization=%t)
SC = (max(U) - min(U)) / 100 * sum(HC)
disp(SC) // gives 0.0777328 in Scilab 6.0.0
         // gives 1. in Scilab 5.5.2

It is now fixed
scilab bot @scilabbot 6 years ago


By Samuel GOUGEON (@sgougeon)

To be more explicit than in my comment #1:

There are two things wrong with histc: a) histc without normalization uses normalization=%t as default, however in the help normalization=%f is defined as default.

FIXED

b) histc with normalization=%t gives TOTALLY WRONG results.

HOW TO REPRODUCE THE BUG:

myprob=[1 2 2 3 3 3]
b) wrong results:
histc(3,myprob,normalization=%t)
results:
 ans  =   0.25    0.5    0.75
expected:
 ans  =   0.1666667    0.3333333    0.5

No. [0.25 0.5 0.75] is correct. Indeed, the class width is (3-1)/3, so that the total area is 1:
--> sum([0.25 0.5 0.75]) * (3-1)/3
 ans  =   1.

Reminder: Review : https://gitlab.com/scilab/legacy_codereview/-/wikis/changes/20472
scilab bot @scilabbot 6 years ago


By Samuel GOUGEON (@sgougeon)

Fixed in Scilab 6.0.2 since 2ea43430
scilab bot closed 6 years ago

scilab bot assigned to @sgougeon 2 years ago

