PDA

View Full Version : Excel Data Scrape


WORKFORCE
01-15-2016, 01:52 PM
Good evening,

I am wondering if there’s someone out there who can help me in my quest to scrape data from a website. Having researched the web for a little bit of guidance I believe this type of scraping is difficult due to the website I intend to use.

The website I am trying to scrape data from is www.racingpost.com.

The specific page I want to scrape data from is:
1. The results page
http://www.racingpost.com/horses2/results/home.sd

2. The race page
http://www.racingpost.com/horses/result_home.sd?race_id=641471&r_date=2016-01-15&popup=yes#results_top_tabs=re_&results_bottom_tabs=ANALYSIS

Once at the results page I need to obtain the following information:
· Racecourse
· Going
· Distance
· Class
· Finishing Time of the race

Once at the race page I need to obtain the following information:
· Horses name
· Beaten distances
· Age
· Weight
· Trainer
· OR

My intended goal is to be able to load up Excel each evening and run a process that which can grab this information to be saved in one sheet and one excel file so that I can open each day individually if needed.

Kind regards

MJC922
01-15-2016, 08:02 PM
You don't say how much progress (if any) you've made to begin with but the starting point with Excel is to click the Data tab > Get External Data - From Web

Paste your URL in the address bar such as:

http://www.racingpost.com/horses2/results/home.sd?r_date=2016-01-14

Click Go, Click Import> Click OK

At this point you should see data being pulled into the sheet. So this will get you started but getting it to work the way you want is probably not going to be trivial. You'll probably have to get your feet wet with VBA as notice the URL will have to change on the fly to indicate the date, you'll also have to probably do something with the data like dump it into another sheet or save it as a CSV. If you can code and have time to spend this is probably not much more than a one weekend project. If you can't code then it could take a long time and depending upon how much you want to learn might be better off paying someone to code to meet your requirements.

Longshot6977
01-16-2016, 07:18 PM
Here is a link to a good detailed procedure from poster CBEDO. This should get you started after a few modifications.

http://216.92.33.211/forum/showthread.php?t=61958

MJC922
01-17-2016, 10:09 AM
Yeah I see in that thread there were some questions around making the query dynamic. The best way to go down that path is start recording a macro, run through all of the steps I gave, then click 'stop recording' and open the macro up in 'edit' mode to see the VBA.

Notice there you will see the date which looks something like this:

With ActiveSheet.QueryTables.Add(Connection:= _
"URL;http://www.racingpost.com/horses2/results/home.sd?r_date=2016-01-14"

Wherever the date is located you will want to inject a new date string in the proper format such as:

With ActiveSheet.QueryTables.Add(Connection:= _
"URL;http://www.racingpost.com/horses2/results/home.sd?r_date=" & MyNewDate & ""

MyNewDate could be a string variable assigned from a custom function, a function you create for yourself which contains for example the Date() function (in VBA this returns todays date) minus a parameter that you feed it to subtract x number of days, and the 'x' would best be supplied by a counter inside the loop such as d = d +1. Within the function you would also then build the date string in the proper format.

So this macro would be configured to loop for best results i.e. you have it pull in the data starting from todays date, save it to another sheet or a csv, then increment the date back to yesterday using the custom function described and so on, so you keep going back one day until you have a year's worth of data or whatever you need. Put a timer in there to wait if you do this so you don't kill the server with a ton of requests all at once, maybe wait 10 seconds in-between asking for each day.

In my experience scraping a third-party site like this is something I would only look at as a way to get a quick archive of historical data for research purposes and not much more. That is, I wouldn't recommend putting all kinds of time into coding around scraping a third-party site on a daily basis for a long-term solution. Third-party sites eventually change the format of their web pages and suddenly everything will stop working for you even if it shifts just one cell over, the data you pulled into cell C20 today could be in G28 tomorrow, which requires modding all of the code you're using to 'handle' data in various ways. So it's not a good sustainable solution.

Following the research if you want something sustainable then I would look at coding for Trackmaster's (or another major providers) csv result charts, which aren't going to be free but the format will remain stable.

I think it's really unfortunate people have to resort to scraping in the first place. Racing should hand out a database with a years worth of data for research purposes. We have all of these people looking at fantasy sports, many would love to take a look at this sport in a serious way. I don't expect a 'current' db to be handed out every day unless people are paying for it. However all of the data from 2013 to 2014 let's say should be freely available IMO.