R User? [Archive] - Horse Racing Forum - PaceAdvantage.Com

traynor

03-24-2016, 06:30 PM

Anyone using R for data mining? If so, have you done any (formal or informal) benchmarking to compare execution speed of R vectors with lists (or ArrayLists) in C-based languages?

Basic question: In your experience, are R vectors faster in execution speed than lists in C-based languages? I understand that to the user, R "does stuff all at once" to the discrete elements of a vector. Does this translate to faster execution, or just less code to write?

Python is way too slow and clunky for the heavy lifting I am doing. I need something that goes faster, not necessarily that is easier to code. Multithreading is not a viable option.

Red Knave

03-25-2016, 09:21 AM

First, it's been a while, and second, I only futzed around with R for a week or so, but my answer would be that interpreted R will be slower than even poorly written C for any task. Since so many R functions are already precompiled (in C :) ), you will likely see an improvement over Python, but I have no way to confirm that. Compiling functions or using JIT can help R, but if your aim is to go faster you might as well bite the bullet and use a language with less overhead.

traynor

03-25-2016, 12:19 PM

First, it's been a while, and second, I only futzed around with R for a week or so, but my answer would be that interpreted R will be slower than even poorly written C for any task. Since so many R functions are already precompiled (in C :) ), you will likely see an improvement over Python, but I have no way to confirm that. Compiling functions or using JIT can help R, but if your aim is to go faster you might as well bite the bullet and use a language with less overhead.

Thanks for the comment. I should have explained that I am currently using a C-based language (actually languages) and trying to find a faster process. R and Python are often recommended, but many of the recommendations in data mining (for interpreted languages such as R, Python, and others) seem aimed primarily at non-programmers or knee-jerk-reflex open-source advocates. For both groups, speed and computing efficiency seem secondary to "ease of use" or "open source code."

I need the speed and computing efficiency. Recursive algorithms would be nice, but not possible in this application (without a massive amount of preliminary work that would eliminate any gain by using recursive algorithms).

Jeff P

03-25-2016, 12:35 PM

I've been using the mlogit module in R for a little over a year now.

Haven't run the type of benchmark speed tests (R vs. C or other languages) you are asking about.

My mode of operation has been pretty simple: Export selected columns and rows from my main database (JCapper) to .csv file - and from there tell R to read the .csv file, map the fields in the .csv file, and perform vector analysis for selected columns in the .csv file.

Can't say it's terribly fast. (It may not be.) But as long as each .csv file is not overly large (or vector analysis doesn't involve too many columns) the process is fast enough for my needs.

Using R in this manner has led to some valid data observations while relieving me of the need to write my own max likelihood function. (So less code.)

-jp

Red Knave

03-25-2016, 01:57 PM

Thanks for the comment. I should have explained that I am currently using a C-based language (actually languages) and trying to find a faster process.

Got it. It sounded like you were using Python.

R and Python are often recommended, but many of the recommendations in data mining (for interpreted languages such as R, Python, and others) seem aimed primarily at non-programmers or knee-jerk-reflex open-source advocates. For both groups, speed and computing efficiency seem secondary to "ease of use" or "open source code."

Right. Jeff P is a programmer but it sounds like R gives him "ease of use".

I need the speed and computing efficiency. Recursive algorithms would be nice, but not possible in this application (without a massive amount of preliminary work that would eliminate any gain by using recursive algorithms).

Sounds like more horsepower and/or profiling and detail sweating are next for you. ;)

DeltaLover

03-25-2016, 02:17 PM

Anyone using R for data mining? If so, have you done any (formal or informal)
benchmarking to compare execution speed of R vectors with lists (or ArrayLists)
in C-based languages?

Using numpy should be fast enough for most matrix manipulations. If you still find
your Python code to be sluggish, run it through a profiler and detect where exactly
the bottleneck is. A faster algorithm is usually the best answer to this type of
problem. Where I have found Python to be really slow is in I/O, especially when
reading delimited files; in this case you can easily work around it by implementing
your lower-level code in plain C extensions.
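
As a rough illustration of the "run it through a profiler" step, here is a minimal sketch using only the standard library's cProfile and pstats. The functions `slow_search`, `run_searches`, and `profile_call` and the sample data are hypothetical stand-ins, not anything from the poster's actual workload.

```python
import cProfile
import io
import pstats


def slow_search(records, target):
    """Deliberately naive linear scan, standing in for a suspected hot spot."""
    return sum(1 for r in records if r == target)


def run_searches(records, targets):
    """Hypothetical workload: many simple searches over the same data."""
    return [slow_search(records, t) for t in targets]


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buffer.getvalue()


# Made-up data: each value 0..1999 appears exactly five times.
records = list(range(2000)) * 5
counts, report = profile_call(run_searches, records, range(50))
print(report)
```

The report lists each function with its call count and cumulative time, which is usually enough to tell whether the time goes into the search itself or somewhere else (such as file I/O).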

Python is way too slow and clunky for the heavy lifting I am doing. I need
something that goes faster, not necessarily that is easier to code.
Multithreading is not a viable option.

Although it is true that a multithreaded solution is not the best way to boost
performance (because of the limitations of Python's GIL), I think that
you should consider a multi-process application (if you are going to implement it
in Python), so you can take advantage of multiple cores.
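
A minimal sketch of the multi-process idea, assuming the work can be split into independent chunks whose partial results are simply added up. The helper names (`count_matches`, `split_chunks`, `parallel_count`) are made up for illustration, and whether this helps depends on the cost of shipping data to the worker processes.

```python
from concurrent.futures import ProcessPoolExecutor


def count_matches(chunk, target):
    """Serial work done inside one worker process."""
    return sum(1 for value in chunk if value == target)


def split_chunks(data, n_chunks):
    """Split data into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]


def parallel_count(data, target, workers=4):
    """Count matches across several processes, sidestepping the GIL."""
    chunks = split_chunks(data, workers)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(count_matches, chunks, [target] * len(chunks))
        return sum(partials)
```

When run as a script, `parallel_count` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely. Each chunk is pickled and copied to a worker, so for cheap per-item work the copying can cost more than the search it parallelizes.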

traynor

03-25-2016, 03:33 PM

Got it. It sounded like you were using Python.
Right. Jeff P is a programmer but it sounds like R gives him "ease of use".
Sounds like more horsepower and/or profiling and detail sweating are next for you. ;)

Unfortunately, I think you are correct. It is like the old caveat about not using an elephant gun to swat a fly. I sometimes think I am hunting elephants with a fly swatter.

I have a friend (in the data analysis end of the horsey field) who uses Hadoop and multiple nodes. I think that is likely to be my (next) solution.

traynor

03-25-2016, 03:42 PM

I've been using the mlogit module in R for a little over a year now.

Haven't run the type of benchmark speed tests (R vs. C or other languages) you are asking about.

My mode of operation has been pretty simple: Export selected columns and rows from my main database (JCapper) to .csv file - and from there tell R to read the .csv file, map the fields in the .csv file, and perform vector analysis for selected columns in the .csv file.

Can't say it's terribly fast. (It may not be.) But as long as each .csv file is not overly large (or vector analysis doesn't involve too many columns) the process is fast enough for my needs.

Using R in this manner has led to some valid data observations while relieving me of the need to write my own max likelihood function. (So less code.)

-jp

Thanks for your comment. I think R would work well for most data sources. However, I am in a situation in which the data sources are very large and require multiple searches (piped, with each search phase dependent on the prior search results). And it is never "completed"--meaning the mining process is ongoing, so the time required for each "partial completion" stage is critical.
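
To make the "piped" structure concrete, here is a small sketch, assuming each search phase can be expressed as a filter over the survivors of the previous phase. The record layout and the helpers `stage` and `pipeline` are hypothetical; the real searches are not described in the thread.

```python
def stage(predicate):
    """Wrap a predicate as a lazy filter over an iterable of records."""
    def run(records):
        return (r for r in records if predicate(r))
    return run


def pipeline(records, stages):
    """Feed each stage the survivors of the previous one."""
    for run in stages:
        records = run(records)
    return records


# Hypothetical records: (horse, distance_furlongs, finish_position)
races = [
    ("A", 6, 1), ("B", 8, 3), ("C", 6, 2), ("D", 6, 5), ("E", 8, 1),
]
sprints_in_money = list(pipeline(races, [
    stage(lambda r: r[1] == 6),   # phase 1: six-furlong races only
    stage(lambda r: r[2] <= 3),   # phase 2: finished in the top three
]))
```

Because the stages are generators, nothing is computed until the final `list(...)` call, and each later phase only ever sees records that passed the earlier ones. Putting the most selective phase first is one cheap way to cut the total number of comparisons.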

traynor

03-25-2016, 03:53 PM

Using numpy should be fast enough for most matrix manipulations. If you still find
your Python code to be sluggish, run it through a profiler and detect where exactly
the bottleneck is. A faster algorithm is usually the best answer to this type of
problem. Where I have found Python to be really slow is in I/O, especially when
reading delimited files; in this case you can easily work around it by implementing
your lower-level code in plain C extensions.

Although it is true that a multithreaded solution is not the best way to boost
performance (because of the limitations of Python's GIL), I think that
you should consider a multi-process application (if you are going to implement it
in Python), so you can take advantage of multiple cores.

Thanks for your comment. The bottleneck is not so much the algorithms (much optimizing has been done already) as it is the number of searches that must be made to match patterns. (Relatively) simple searches, performed many times.

DeltaLover

03-25-2016, 03:56 PM

Thanks for your comment. The bottleneck is not so much the algorithms (much optimizing has been done already) as it is the number of searches that must be made to match patterns. (Relatively) simple searches, performed many times.

For multiple searches performed many times:

Think of applying:
(1) Dynamic programming - memoization
(2) Bitwise matching
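
A minimal sketch of both suggestions together, assuming each record can be summarized as a set of yes/no attributes packed into one integer. The flag names and the functions `matches` and `count_pattern` are invented for illustration; memoization here uses the standard library's `functools.lru_cache`.

```python
from functools import lru_cache

# Hypothetical attribute flags; each record is summarized as one integer.
FAST_EARLY = 1 << 0
WON_LAST = 1 << 1
TURF = 1 << 2
LAYOFF = 1 << 3


def matches(record_mask, required, forbidden=0):
    """True if every required bit is set and no forbidden bit is set."""
    return (record_mask & required) == required and (record_mask & forbidden) == 0


@lru_cache(maxsize=None)
def count_pattern(masks, required, forbidden=0):
    """Count matching records; repeated identical queries hit the cache.

    masks must be hashable (a tuple) so lru_cache can key on it.
    """
    return sum(1 for m in masks if matches(m, required, forbidden))


masks = (
    FAST_EARLY | WON_LAST,
    FAST_EARLY | TURF,
    FAST_EARLY | WON_LAST | LAYOFF,
    TURF,
)
first = count_pattern(masks, FAST_EARLY | WON_LAST)
second = count_pattern(masks, FAST_EARLY | WON_LAST)  # served from the cache
no_layoff = count_pattern(masks, FAST_EARLY | WON_LAST, LAYOFF)
```

Each pattern test is two integer AND operations, regardless of how many attributes the pattern involves. The cache only pays off when the same query recurs on the same data; for large datasets it would probably be better to key the cache on a dataset identifier than on the tuple of masks itself, since hashing a long tuple is not free.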

traynor

03-25-2016, 04:43 PM

For multiple searches performed many times:

Think of applying:
(1) Dynamic programming - memoization
(2) Bitwise matching

Thanks for the suggestions. (1) is, in principle, what I have been doing.