Wednesday 26 November 2014

Logistic regression in Mahout

$ mahout trainlogistic --input Avazu/train_rev2 \
> --output Avazu/model \
> --target click --categories 2 \
> --predictors id hour banner_pos site_id site_domain site_category app_id app_domain app_category device_id device_os device_type device_geo_country \
> --types word numeric numeric word word word word word word word word numeric word \
> --features 13 --passes 20 --rate 50


MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/hadoop/bin/hadoop and HADOOP_CONF_DIR=/etc/hadoop/conf
MAHOUT-JOB: /opt/cloudera/parcels/CDH-5.1.2-1.cdh5.1.2.p0.3/lib/mahout/mahout-examples-0.9-cdh5.1.2-job.jar
14/11/14 20:55:51 WARN driver.MahoutDriver: No trainlogistic.props found on classpath, will use command-line arguments only
Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
at java.io.BufferedReader.<init>(BufferedReader.java:98)
at java.io.BufferedReader.<init>(BufferedReader.java:109)
at org.apache.commons.csv.ExtendedBufferedReader.<init>(ExtendedBufferedReader.java:58)


The job fails because Mahout's trainlogistic runs in sequential mode inside a single JVM instead of invoking MapReduce jobs, and it reads its input from the local disk. That makes it hard to process a large data set.
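
One possible workaround, since everything has to fit in a single process, is to train on a random sample of the rows. The snippet below is only a sketch under assumptions: it assumes the CSV has a header row (which the --predictors names suggest), and the function name sample_csv, the 1% rate and the output file name are made up for illustration. I have not measured how small the sample has to be for the training run to finish.

```shell
#!/bin/sh
# sample_csv INPUT OUTPUT RATE [SEED]
# Hypothetical helper: copies the header row, then keeps each data row
# independently with probability RATE (a value between 0 and 1).
sample_csv() {
  input=$1
  output=$2
  rate=$3
  seed=${4:-42}   # fixed default seed so repeated runs pick the same rows
  awk -v rate="$rate" -v seed="$seed" '
    BEGIN { srand(seed) }
    NR == 1 { print; next }   # always keep the CSV header line
    rand() < rate { print }   # keep roughly RATE * N of the data rows
  ' "$input" > "$output"
}

# Illustrative usage (the rate and output name are assumptions):
# sample_csv Avazu/train_rev2 Avazu/train_sample 0.01
```

The sampled file could then be passed to --input in place of the full training file. Sampling discards data, so the resulting model is only an approximation of what training on the whole set would give.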

The examples in the references below are very good.

Reference:
http://stackoverflow.com/questions/25148444/mahout-random-forest-example-command-line-parameter-for-data-not-recognized
http://en.wikipedia.org/wiki/Logistic_regression
http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/
http://www.slideshare.net/tanuvir/logistic-regression-using-mahout
https://confluence.atlassian.com/display/JIRAKB/JIRA+Crashes+Due+to+OutOfMemoryError+GC+Overhead+Limit+Exceeded
