[GRASS5] Re: Grass clones

Hi Giulio, hi Ettore,
(cc grass5 list)

thanks a lot for your efforts! With this mail I'm trying to get
more people interested in this topic... hi developers!

Developers: Prof. Antoniol was kind enough to make a first attempt at
GRASS clone detection. A clone is a piece of code which is very
similar to another piece of code (poor man's definition).
If a clone appears several times, it should become a library
function to simplify maintenance. In fact, detecting clones
is not so easy. For an introduction, you may read this paper:
Paolo Tonella, ITC-irst: "An Introduction to Clone Detection"
http://mpa.itc.it/grass2001/tonella2001_clones.ps

I am not re-sending the new GRASS clone analysis here due to bandwidth
limitations, but I have put this preliminary analysis online:

http://mpa.itc.it/markus/tmp/grass.cln
(550k, ASCII)

Please read on in Giulio Antoniol's mail below.

Giulio: let's wait for some comments... I'll forward if needed.

Thanks!

Markus

On Thu, Mar 14, 2002 at 04:28:49PM +0100, antoniol wrote:

Hi Markus

sorry for the delay, we are really busy at the moment ...

Here is the first shot ... I used:

grass5src_cvs_snapshot_experimentalMar_8_2002.tar.gz

Clones were extracted at the function level, i.e., the finest-grained
unit considered was the C function (we did not search for clones
between compound statements).

Out of about 20000 functions, about 16777 are longer than 5 LOC, with
something like 5000 clones among them.
Clones were computed in the most stringent way (exact matching).

when you see:

---------------------------------------------
15402 15317 12498 5391

/ponza2/grass/src.contrib/GMSL/ogl3d_linux/gsf/open.c reverse 163 168
/ponza2/grass/src.contrib/GMSL/ogl3d_linux/gsf/gsd_img.c reverse 170 175

/ponza2/grass/src.contrib/CERL/SGI/libimage/open.c reverse 163 168
/ponza2/grass/src/libes/libimage/open.c reverse 214 219
---------------------------------------------

it means that the functions named reverse in the 4 files open.c,
gsd_img.c, open.c (SGI) and open.c (libimage) are clones; the function
body starts at line 163 and ends at line 168, etc.
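
As a rough illustration of the layout above, the following Python sketch
reads such a block into records. It assumes the whole file follows the
same pattern as this excerpt: a dashed separator, one all-numeric header
line (its meaning is not spelled out here, so it is skipped), then one
"path function start end" line per clone. The names CloneMember and
parse_clusters are made up for this example.

```python
from typing import NamedTuple


class CloneMember(NamedTuple):
    path: str
    function: str
    start: int  # first line of the function body
    end: int    # last line of the function body


def parse_clusters(text):
    """Group member lines into clusters delimited by dashed separator lines."""
    clusters, current = [], []
    for raw in text.splitlines():
        line = raw.strip()
        if line and set(line) == {"-"}:  # a line made only of dashes
            if current:
                clusters.append(current)
            current = []
            continue
        fields = line.split()
        # Member lines have four fields and a non-numeric first field;
        # this also skips the all-numeric header line and blank lines.
        if (len(fields) == 4 and not fields[0].isdigit()
                and fields[2].isdigit() and fields[3].isdigit()):
            current.append(
                CloneMember(fields[0], fields[1], int(fields[2]), int(fields[3])))
    if current:
        clusters.append(current)
    return clusters


# The excerpt from this mail, used as sample input.
SAMPLE = """\
---------------------------------------------
15402 15317 12498 5391

/ponza2/grass/src.contrib/GMSL/ogl3d_linux/gsf/open.c reverse 163 168
/ponza2/grass/src.contrib/GMSL/ogl3d_linux/gsf/gsd_img.c reverse 170 175

/ponza2/grass/src.contrib/CERL/SGI/libimage/open.c reverse 163 168
/ponza2/grass/src/libes/libimage/open.c reverse 214 219
---------------------------------------------
"""

clusters = parse_clusters(SAMPLE)
for member in clusters[0]:
    print(member.function, member.path, member.start, member.end)
```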

About 2 hours were required to parse the code and produce the result. Our
C parser is not very precise; instead it is extremely robust, so there may
be functions missed in the computation.

Please consider this a very preliminary result. I did the work in
cooperation with Ettore Merlo, who wrote the clone recognizer. He told me
he will produce a finer-grained classification plus a color HTML
visualization.

We just need to find another couple of hours to process the data.

In the meantime, I take the liberty of suggesting that we cooperate on
the clone documentation and refactoring process. We are basically very
interested in obtaining data from the GRASS developers on the effects of
clones. For example, since you have a bug tracking tool, we could
correlate clones with fixed bugs: how many times did a fixed bug affect
a clone, and were the clones debugged as well?

Are there cases where automatic refactoring is possible? Or, can we use
clone information to cluster functions into libraries?

Let us know, ciao

Giulio

Markus Neteler wrote:

[...]

Some comments:

1. I don't think that we need to consider "similar" functions
initially. By "similar", I mean functions which could be merged into a
single function with the addition of extra parameters. I'm more
concerned about "exact" clones, which could simply be moved into a
library without requiring any interface changes.

I refer to these below as false matches, in the sense that they aren't
what I'm interested in at present. I accept that they could be
considered valid matches in other contexts; they may subsequently
prove useful in highlighting areas where it might be worth rethinking
the design.

2. Removing the parameterisation of identifiers would eliminate many
of the false matches, without eliminating many (any?) true matches.
There are quite a lot of functions which have identical structure, but
operate upon different global variables. E.g.

src/libes/gis/opencell.c G__reallocate_temp_buf 815 828
src/libes/gis/opencell.c G__reallocate_mask_buf 796 809

3. Alternatively, most[*] of the false matches relate to functions
within a common source file. I think that we can safely eliminate all
such cases. The cases which deserve attention are those where new
functionality was created by copy-paste-modify, and the current
developers are unaware as to the origin of the code.

[*] But not all; e.g. r.mapcalc/r.mapcalc3/r3.mapcalc have many
similarly structured functions (e.g. sin/cos/tan), but each lives in
its own source file.

4. There will be some "exact" clones which I don't expect to be found
this way, due to differences which aren't straightforward to ignore.
Specific examples include:

a) Code has been converted from pre-ANSI (K&R) C to ANSI C.

b) Code has undergone stylistic changes other than simple
reformatting, e.g. changing "if (x == 0)" to "if (!x)" etc.

c) Bugfixes, minor enhancements etc, which are equally applicable to
all copies, have only been made to certain copies.
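
To make comments 2 and 3 concrete, here is a minimal Python sketch. It
is not the clone recognizer used for the analysis; it only shows why two
functions that differ in their global variable names match once
identifiers are parameterised but not under exact token comparison, and
how clusters confined to a single source file could be dropped. The two
C bodies are hypothetical stand-ins (not the real code in opencell.c),
the keyword set is deliberately incomplete, and the leading
/ponza2/grass/ prefix is left off the paths.

```python
import re

# Not a complete C keyword list; just enough for the snippets below.
C_KEYWORDS = {"if", "else", "for", "while", "return", "sizeof"}
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokens(code):
    """Crude tokeniser: identifiers/keywords, numbers, single punctuation chars."""
    return TOKEN_RE.findall(code)


def parameterised(code):
    """Replace each identifier by a placeholder numbered by first appearance."""
    names = {}
    out = []
    for tok in tokens(code):
        if (tok[0].isalpha() or tok[0] == "_") and tok not in C_KEYWORDS:
            tok = names.setdefault(tok, "id%d" % len(names))
        out.append(tok)
    return out


# Hypothetical bodies: same structure, different global variables.
TEMP = "if (n > temp_size) { temp_buf = realloc(temp_buf, n); temp_size = n; }"
MASK = "if (n > mask_size) { mask_buf = realloc(mask_buf, n); mask_size = n; }"

exact_match = tokens(TEMP) == tokens(MASK)                # identifiers differ
param_match = parameterised(TEMP) == parameterised(MASK)  # structure is equal


def spans_multiple_files(cluster):
    """True if the cluster's members come from more than one source file."""
    return len({path for path, _func in cluster}) > 1


CLUSTERS = [
    [("src/libes/gis/opencell.c", "G__reallocate_temp_buf"),
     ("src/libes/gis/opencell.c", "G__reallocate_mask_buf")],
    [("src.contrib/CERL/SGI/libimage/open.c", "reverse"),
     ("src/libes/libimage/open.c", "reverse")],
]
cross_file = [c for c in CLUSTERS if spans_multiple_files(c)]
print(exact_match, param_match, len(cross_file))
```

A same-file filter of this kind would also discard the opencell.c pair,
but, as noted in [*], it would not help with the r.mapcalc family, where
each similar function lives in its own file.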

--
Glynn Clements <glynn.clements@virgin.net>

Markus Neteler wrote:

I am not re-sending the new GRASS clone analysis here due to bandwidth
limitations, but I have put this preliminary analysis online:

http://mpa.itc.it/markus/tmp/grass.cln
(550k, ASCII)

I've attached a couple of scripts for processing the data. One is an
awk script to add ID numbers and convert it to CSV. The other is an
SQL script which performs some simple computations (e.g. number of
code lines, number of distinct files).
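
For readers without the attachments, the same kind of processing can be
sketched in Python. This is not a transcription of the attached awk and
SQL scripts, only an assumed equivalent: it numbers each cluster, writes
CSV rows, and computes per-cluster line counts and distinct file counts.
The column names, the helper names, and counting lines as
end - start + 1 are assumptions made for this example.

```python
import csv
import io

# One cluster taken from the excerpt quoted earlier in the thread:
# (path, function, start line, end line).
CLUSTERS = [
    [
        ("/ponza2/grass/src.contrib/GMSL/ogl3d_linux/gsf/open.c", "reverse", 163, 168),
        ("/ponza2/grass/src.contrib/GMSL/ogl3d_linux/gsf/gsd_img.c", "reverse", 170, 175),
        ("/ponza2/grass/src.contrib/CERL/SGI/libimage/open.c", "reverse", 163, 168),
        ("/ponza2/grass/src/libes/libimage/open.c", "reverse", 214, 219),
    ],
]


def to_csv(clusters):
    """Return CSV text with one row per clone member, prefixed by a cluster ID."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["cluster_id", "path", "function", "start", "end"])
    for cluster_id, cluster in enumerate(clusters, start=1):
        for path, function, start, end in cluster:
            writer.writerow([cluster_id, path, function, start, end])
    return buf.getvalue()


def summarise(clusters):
    """Map cluster ID -> (total code lines, number of distinct files)."""
    summary = {}
    for cluster_id, cluster in enumerate(clusters, start=1):
        lines = sum(end - start + 1 for _p, _f, start, end in cluster)
        files = len({path for path, _f, _s, _e in cluster})
        summary[cluster_id] = (lines, files)
    return summary


csv_text = to_csv(CLUSTERS)
summary = summarise(CLUSTERS)
print(csv_text)
print(summary)
```

The CSV text could then be loaded into any SQL database to run
aggregate queries over the full data set.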

--
Glynn Clements <glynn.clements@virgin.net>

(attachments)

grass-clone.awk (285 Bytes)
grass-clone.sql (581 Bytes)