I would like to draw your attention to a new GRASS add-on, r.randomforest, which uses the scikit-learn and pandas Python packages to classify GRASS rasters. Similar to existing GRASS classification methods, it uses an imagery group and a raster of labelled pixels as the inputs for the classification. It reads the rasters row-by-row and bundles the rows, based on a user-specified row increment, before passing them to the classifier. This keeps memory requirements low while still classifying efficiently: the scikit-learn implementation is multithreaded by default, and passing single rows results in too much stop-start behaviour. The feature importance scores and out-of-bag error are displayed in the command window.
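The row-bundling idea can be sketched in plain scikit-learn (a minimal illustration on random data, not the add-on's actual code; the array shapes and the `increment` value are assumptions):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy stand-in for an imagery group: a (rows, cols, bands) predictor stack
# plus a raster of labelled pixels.
rng = np.random.default_rng(1)
stack = rng.random((100, 50, 3))
labels = (stack.sum(axis=2) > 1.5).astype(int)

clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=1)
clf.fit(stack.reshape(-1, 3), labels.ravel())

# Classify in bundles of `increment` rows: memory use stays bounded, while
# each batch is still large enough for parallel prediction to pay off.
increment = 25
result = np.empty(labels.shape, dtype=int)
for start in range(0, stack.shape[0], increment):
    block = stack[start:start + increment]
    result[start:start + increment] = clf.predict(
        block.reshape(-1, 3)).reshape(block.shape[:2])
```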
I would appreciate testing. You need to have scikit-learn and pandas installed in your Python environment, which is easy on Linux and OS X; instructions for Windows are provided in the tool's documentation.
I have another add-on that I will upload soon, r.roc, which generates ROC curves and AUROC scores for prediction models.
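For reference, scikit-learn already exposes the two quantities such a module would report; a minimal sketch with made-up observations and scores:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical observed presence/absence and model-predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.3])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # points on the ROC curve
auc = roc_auc_score(y_true, y_score)               # area under that curve
```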
On Sat, Mar 26, 2016 at 12:40 PM, Steven Pawley <dr.stevenpawley@gmail.com>
wrote:
I would like to draw your attention to a new GRASS add-on, r.randomforest,
which uses the scikit-learn and pandas Python packages to classify GRASS
rasters.
Thanks, this looks good. Please consider adding an image to the documentation to better promote the module [1], and also an example which would work with the NC SPM dataset [2]. For the addon to generate documentation on the server and work well in a few other special situations, it is advantageous to employ a lazy import technique for the non-standard dependencies; see for example v.class.ml and v.class.mlpy [3].
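The lazy import pattern amounts to moving the heavyweight import out of module scope, so the script can still be parsed (e.g. for interface description or documentation generation) without the dependency installed. A generic sketch, with `lazy_import` and the error message as hypothetical names:

```python
def lazy_import(name):
    # Import a non-standard dependency only at the point of use, with a
    # friendly message instead of a raw ImportError traceback.
    try:
        return __import__(name)
    except ImportError:
        raise SystemExit("this module requires the '%s' Python package" % name)

def main():
    sklearn = lazy_import("sklearn")   # deferred until main() actually runs
    # ... classification work would go here ...
    return sklearn.__name__
```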
Great news! I gave it a quick try (on Ubuntu 14.04, GRASS 7 master). Size of input raster layers: 1578 rows, 1436 columns.
1st try - input full map, classes 1/0
I had to stop as it took too much time. Stopping it did not stop the Python processes, however; I had to kill them manually.
2nd try - input random sample of 100 points, 1 (12) and 0 (88), with b flag
r.randomforest -b igroup=predictors@SampleSize roi=test2 output=test2_output ntrees=500 mfeatures=-1 minsplit=2 randst=1 lines=100
Group <predictors> references the following raster maps:
Traceback (most recent call last):
File "/home/paulo/.grass7/addons/scripts/r.randomforest",
line 335, in <module>
main()
File "/home/paulo/.grass7/addons/scripts/r.randomforest",
line 243, in main
class_weight = "balanced", max_features = mfeatures,
min_samples_split = minsplit, random_state = randst)
TypeError: __init__() got an unexpected keyword argument
'class_weight'
Removing raster <tmp_jNyNcqZa>
3rd try - input random sample of 100 points, 1 (12) and 0 (88), without b flag
r.randomforest igroup=predictors@SampleSize roi=test2 output=test2_output ntrees=500 mfeatures=-1 minsplit=2 randst=1 lines=100
Group <predictors> references the following raster maps:
Our OOB prediction of accuracy is: 89.0%
Raster Importance
0 bio1_wc30s@SampleSize 0.183670
1 bio2_wc30s@SampleSize 0.139914
2 bio3_wc30s@SampleSize 0.105035
3 bio4_wc30s@SampleSize 0.106413
4 bio13_wc30s@SampleSize 0.087399
5 bio14_wc30s@SampleSize 0.146495
6 dm_wc30s@SampleSize 0.104575
7 llds_wc30s@SampleSize 0.126499
Removing raster <tmp_RhTllKlA>
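The two diagnostics in this output come straight from the fitted scikit-learn estimator; a small sketch on synthetic data (band names and sizes are made up) showing where they originate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.random((300, 3))          # 300 sampled pixels, 3 "bands"
y = (X[:, 0] > 0.5).astype(int)   # only band 1 actually matters

clf = RandomForestClassifier(n_estimators=100, oob_score=True,
                             random_state=42).fit(X, y)

print("Our OOB prediction of accuracy is: %.1f%%" % (clf.oob_score_ * 100))
for name, imp in zip(["band_1", "band_2", "band_3"],
                     clf.feature_importances_):
    print(name, round(imp, 3))
```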
Questions
I am using it for species distribution modelling (presence/absence input map), but I would prefer to use the regression mode. Is there a way to force it to use regression mode?
Are you planning to implement other classification methods? It seems that if this works, it shouldn't be too hard to replace the random forest method with any of the other methods in scipy? I have for some time been thinking about using scipy, but my programming skills are not up to standard. But perhaps it is easier using your addon as a template?
On Sat, Mar 26, 2016 at 1:42 PM, Paulo van Breugel <p.vanbreugel@gmail.com>
wrote:
2nd try - input random sample of 100 points, 1 (12) and 0 (88), with b flag
r.randomforest -b igroup=predictors@SampleSize roi=test2 output=test2_output ntrees=500 mfeatures=-1 minsplit=2 randst=1 lines=100
[...]
TypeError: __init__() got an unexpected keyword argument 'class_weight'
Of course, another great addition to the module code is a test suite [1]. The basic test can run on small random data and need not even test the correctness of the result; it is enough if it just checks that the module does not fail and gives some results.
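In plain `unittest` terms, such a smoke test needs only a few lines (a generic sketch on random data, not tied to the GRASS test framework):

```python
import unittest
import numpy as np
from sklearn.ensemble import RandomForestClassifier

class SmokeTest(unittest.TestCase):
    # We deliberately do not check that predictions are "correct" -- only
    # that training and prediction run to completion and return output of
    # the expected shape.
    def test_runs_and_returns_output(self):
        rng = np.random.default_rng(0)
        X = rng.random((40, 5))
        y = rng.integers(0, 2, 40)
        clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
        self.assertEqual(clf.predict(X).shape, (40,))

suite = unittest.defaultTestLoader.loadTestsFromTestCase(SmokeTest)
result = unittest.TextTestRunner().run(suite)
```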
Interesting. I'll try to have a look when I find some time.
Speaking of ROC, it made me think of a small module I've written that calculates some scores for evaluating dichotomous forecasts: https://bitbucket.org/lrntct/r.sim.stats/
It may be of use to someone.
Regards,
Laurent
2016-03-26 10:40 GMT-06:00 Steven Pawley <dr.stevenpawley@gmail.com>:
Hello developers,
I would like to draw your attention to a new GRASS add-on, r.randomforest, which uses the scikit-learn and pandas Python packages to classify GRASS rasters. [...]
Thanks for those pointers re. the lazy import technique and documentation. I have a RandomForest diagram to explain the process, as well as some examples, so I'll update the documentation next week.
Paulo, thanks for running a few tests. It looks like there is an error with the class_weight parameter; I'll check into that.
In terms of species distribution modelling, I have been using the tool for landslide susceptibility modelling, which I believe is methodologically similar to SDM in terms of having a binary response variable. I have been doing this for the area of Alberta, using an 8000 x 14000 pixel, 17-band stack of predictors. In the case of a binary response variable, the usual approach is to run random forest in classification mode, i.e. with fully grown trees, but use the class probabilities to represent the 'species' or 'landslide' index.
I am planning to implement other methods in the scikit-learn package, which represents a trivial change to the module once the bugs are ironed out. I will probably look to create modules for SVM and logistic regression, and maybe nearest neighbours classification. Certainly open to any suggestions.
Yes, your use case will not differ methodologically from species modelling based on presence/absence. One reason I was asking for the regression randomForest is that one article (I can't remember the title, I will look it up) found that the regression approach yielded better results, even though the response variable is binary. On your help page, you write that r.randomforest performs random forest classification and regression, and that the regression mode can be used by setting the mode to the regression option. But I am not seeing that option?
Great that you are planning other methods as well. Given the model uncertainties (quite an issue in species distribution modelling), having multiple methods is really a plus, especially as it allows one to build consensus models [1] and combine them to create uncertainty maps.
Cheers,
Paulo
[1] Marmion, M., Parviainen, M., Luoto, M., Heikkinen, R.K. & Thuiller, W. 2009. Evaluation of consensus methods in predictive species distribution modelling. Diversity and Distributions 15: 59–69.
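A consensus model in this sense can be as simple as cell-wise statistics across the single-model predictions; an illustrative sketch with made-up numbers:

```python
import numpy as np

# Hypothetical suitability predictions (0-1) from three different models
# over the same six cells.
preds = np.array([
    [0.9, 0.2, 0.6, 0.1, 0.8, 0.5],   # e.g. random forest
    [0.8, 0.3, 0.7, 0.2, 0.9, 0.4],   # e.g. GLM
    [0.7, 0.1, 0.5, 0.2, 0.7, 0.6],   # e.g. SVM
])

consensus = preds.mean(axis=0)      # simple mean consensus map
uncertainty = preds.std(axis=0)     # between-model spread as uncertainty
```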
On 27-03-16 00:47, Steven Pawley wrote:
Hi Vaclaw and Paulo,
Thanks for those pointers re. lazy technique and documentation. [...]
Many thanks for this. I updated the module last night to include the ability to force regression mode, as well as some more error checking for valid combinations of input parameters. Classification mode also checks that the input labelled pixels are of CELL type. I'm not yet outputting all of the appropriate uncertainty measures, like RSQ for regression mode, but I'll add those in.
That is interesting that you had better performance when using regression. I will have to check that for my application using scikit-learn. In R, using the randomForest package, the results were pretty much identical, but my classes were already balanced, which I think is one factor that can lead to significant differences between binary classification probabilities and regression.
Yes, I will definitely use this as a template to include other methods. I only recently switched my work from R to Python, but am just submitting a paper based on R which uses a range of classifiers (random forest, GLM, GAM, and MARS), for which it was useful to evaluate the differences.
Steve
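The two routes discussed here (the class-1 probability vs. regression on the 0/1 response) can be compared directly in scikit-learn; a sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 4))
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # binary response

# Classification mode: use the class-1 probability as the index.
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
p_class = clf.predict_proba(X)[:, 1]

# Regression mode on the same 0/1 labels: the prediction is the mean of the
# per-tree outputs, which also lands in [0, 1].
reg = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
p_reg = reg.predict(X)
```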
Great, I'll check it out. It was a study by somebody else; I can't remember which one right now, but it will come back to me. But yes, the fact that for species distribution modelling the sampling is often highly unbalanced (with a large number of pseudo-absences) is likely to play a role. It sometimes seems there are almost as many different conclusions about the best method as there are publications (OK, I might exaggerate a bit here), so comparing different models is very useful. So I am very glad you are doing this (as I said, I have looked at scipy before and how it could be implemented in GRASS, but my Python skills are just not up to it).
On 27-03-16 16:58, Steven Pawley wrote:
Hello Paulo,
Many thanks for this. I updated the module last night to include the ability to force regression mode. [...]
2nd try - input random sample of 100 points, 1 (12) and 0 (88), with b flag
r.randomforest -b igroup=predictors@SampleSize roi=test2 output=test2_output ntrees=500 mfeatures=-1 minsplit=2 randst=1 lines=100
[...]
TypeError: __init__() got an unexpected keyword argument 'class_weight'
Hi Steven,
I finally found out the cause of the above-mentioned problem: I was using an old version of the scikit-learn package. At first I failed to realize this because I had actually upgraded it using:
sudo pip install --upgrade scikit-learn
Just in case others encounter the same problem: at some point I must have installed it through the Synaptic package manager (I am on Ubuntu), which provides an old version. Apparently the version installed this way overrides the version installed via pip. Removing the package in the Synaptic package manager (one can of course use apt-get remove) resolved it.
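For reference, `class_weight="balanced"` only appeared in newer scikit-learn releases, so a module can guard against a stale distro package shadowing the pip install with a simple version check at import time (a sketch; the minimum version and the message wording are assumptions):

```python
import sklearn

def as_tuple(version):
    # "0.17.1" -> (0, 17, 1); keep only the leading digits of each part,
    # so suffixes like "1b1" are truncated to their numeric prefix.
    parts = []
    for p in version.split("."):
        digits = ""
        for c in p:
            if not c.isdigit():
                break
            digits += c
        if digits:
            parts.append(int(digits))
    return tuple(parts)

# class_weight="balanced" is assumed to need scikit-learn >= 0.17; an older
# apt/Synaptic package overriding the pip install would fail this check.
if as_tuple(sklearn.__version__) < (0, 17):
    raise SystemExit("scikit-learn >= 0.17 required, found "
                     + sklearn.__version__)
```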