[GRASS5] Duplicates: some stats

Hello,

I have made a pass to detect C files (whether .c or .h) that are
equal to within a given percentage, i.e. files that differ in fewer
than a given percentage of their lines.

Here the pass has been made asking for a 95% equality (in other words
for files differing in less than 5% of their lines).
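
For readers who want to try this kind of pass, here is a minimal Python
sketch of a line-based comparison. It is not the tool used to produce the
numbers in this mail. The function names, the use of difflib, and the
choice to measure the differing lines against the longer of the two files
are all assumptions made for illustration.

```python
import difflib


def differing_fraction(a_lines, b_lines):
    """Fraction of lines that differ, relative to the longer file.

    Assumption: "differing lines" means lines of the longer file that
    have no counterpart in a longest-matching-blocks alignment.
    """
    longest = max(len(a_lines), len(b_lines))
    if longest == 0:
        return 0.0  # two empty files are treated as identical
    # autojunk=False disables difflib's popularity heuristic, which
    # would otherwise ignore very frequent lines in large files.
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return (longest - matched) / longest


def is_near_duplicate(a_lines, b_lines, threshold=0.05):
    """True if the files differ in less than `threshold` of their lines."""
    return differing_fraction(a_lines, b_lines) < threshold


def read_lines(path):
    """Read a source file as a list of lines (hypothetical helper)."""
    with open(path, errors="replace") as handle:
        return handle.read().splitlines()
```

Calling is_near_duplicate(read_lines(p), read_lines(q)) on every pair of
.c and .h files would give a list of pairs at the 95% level. Note that
comparing all pairs is quadratic in the number of files, so a real pass
over a large tree would probably need some pre-filtering.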

This results in 1622 duplicates.

I attach the resulting file, which lists the pairs of files found.

Interestingly, you will see that some files in src AND in src.nonGPL are
equal...
--
Thierry Laronde (Alceste) <tlaronde@polynum.org>
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C

(attachments)

grass50.duplicates.gz (14.6 KB)

Hello Thierry

On Sun, 16 Nov 2003, Thierry Laronde wrote:

> Here the pass has been made asking for a 95% equality (in other words
> for files differing in less than 5% of their lines).

See also http://grass.itc.it/pipermail/grass5/2002-March/008642.html and
http://mpa.itc.it/markus/tmp/grass.cln where the analysis is done at the
function level which is potentially more useful.

BTW responding to an earlier comment---in general I prefer e-mails sent to
both me and the mailing list as it is clearer who the message is addressed
to and who it requires attention from. Also it is quicker than waiting for
the message to be delivered through the mailing list, which can be
convenient if a short debate is going on.

Paul

Hello Paul,

On Sun, Nov 16, 2003 at 10:36:15PM +0000, Paul Kelly wrote:

> Hello Thierry
>
> On Sun, 16 Nov 2003, Thierry Laronde wrote:
>
> > Here the pass has been made asking for a 95% equality (in other words
> > for files differing in less than 5% of their lines).
>
> See also http://grass.itc.it/pipermail/grass5/2002-March/008642.html and
> http://mpa.itc.it/markus/tmp/grass.cln where the analysis is done at the
> function level which is potentially more useful.

Indeed interesting. I urge others to look at it. As suggested in that
mail, the search for "clones" can be done at several distinct levels.
The simple one I conducted, at the file level, may indicate that some
separate programs should be a single one with multiple options, or may
indicate the need for a more "atomic" program providing the feature
found at the intersection of several others.

There is one more piece of information of some value: the historical
evolution of the code. The number of clones (taking whole files as the
unit) has increased dramatically over time.

And all these numbers give an idea of the work that has to be
done...
--
Thierry Laronde (Alceste) <tlaronde@polynum.org>
Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C