How to extract data from HTML file

White Cat
Padawan Learner
Posts: 212
Joined: 2002-08-29 03:48pm
Location: A thousand km from the centre of the universe

Post by White Cat »

I'm struggling with getting some Web data into a usable format. I have a whole bunch of individual Web pages, with one record per file, and I want to extract the data to a single CSV (or Excel/Access) file. The structure of the HTML files looks something like this:

Code: Select all

<td width="25%"><span id="lblField1Title">Field 1 Title:</span></td>
<td width="75%"><span id="lblField1" class="field1_description">Field 1 data</span></td>
*more code*
<td width="25%"><span id="lblField2Title">Field 2 Title:</span></td>
<td width="75%"><span id="lblField2" class="field2_description">Field 2 data</span></td>
And so on. The end result should look like this:

"Field 1 data","Field 2 data", ... "Field 5 data"

Anyone know of a good method for extracting data like this? The toughest problem I've run into is that the extraction tools I've found assume that your data is nicely arranged with one record per row, instead of one record per file with the data scattered around within, like I'm dealing with.
Alferd Packer
Sith Marauder
Posts: 3703
Joined: 2002-07-19 09:22pm
Location: Slumgullion Pass

Post by Alferd Packer »

Well, first figure out whether the text that lies immediately before and after each field you want to grab is uniform across your files. For example, let's say you know that in each file, the word "description" appears two characters before the data you want to grab, and the tag </span> appears immediately after it.

You would then write a program that reads each line of the HTML file and, when it finds the word "description", skips the two characters after it and records everything up to and including </span>, then truncates the last seven characters (the </span>) from the text. Append the output to a text file, and move on to the next HTML file.
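As a rough illustration of that approach, here is a minimal Python sketch. It assumes the markers from the sample in the first post, i.e. that every data field is preceded by `description">` and followed by `</span>`; the function name `extract_all_between` and the `sample` string are made up for the example.

```python
def extract_all_between(text, before, after):
    """Return every substring found between a `before` marker and the next `after` marker."""
    results = []
    pos = 0
    while True:
        start = text.find(before, pos)
        if start == -1:
            break  # no more fields in this file
        start += len(before)  # skip past the marker itself
        end = text.find(after, start)
        if end == -1:
            break  # opening marker without a closing one; give up
        results.append(text[start:end])
        pos = end + len(after)
    return results


# Hypothetical sample, modelled on the HTML in the first post.
sample = (
    '<td width="25%"><span id="lblField1Title">Field 1 Title:</span></td>\n'
    '<td width="75%"><span id="lblField1" class="field1_description">Field 1 data</span></td>\n'
    '<td width="25%"><span id="lblField2Title">Field 2 Title:</span></td>\n'
    '<td width="75%"><span id="lblField2" class="field2_description">Field 2 data</span></td>\n'
)

fields = extract_all_between(sample, 'description">', "</span>")
print(fields)  # ['Field 1 data', 'Field 2 data']
```

Searching the whole file as one string, rather than line by line, also copes with a field whose data wraps onto a second line.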
General Zod
Never Shuts Up
Posts: 29205
Joined: 2003-11-18 03:08pm
Location: The Clearance Rack

Post by General Zod »

Couldn't you just open the raw data file in a browser, then copy and paste the rendered layout from the browser into Excel? It sounds as if you don't want to bother with all of the code, and opening the file in a browser renders the page as it's supposed to be displayed on the Web, without any of the markup.
phongn
Rebel Leader
Posts: 18487
Joined: 2002-07-03 11:11pm

Post by phongn »

Assuming that this is well-formed HTML, you could probably try writing some .NET code or something to parse and extract the data you want.
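For illustration, here is a minimal sketch of the parse-and-extract idea in Python rather than .NET, using the standard library's `html.parser`. It assumes the fields can be identified by the span ids from the sample in the first post (`lblField1`, `lblField2`) and that those spans contain no nested spans; the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class SpanCollector(HTMLParser):
    """Collect the text of every <span> whose id is in `wanted_ids`."""

    def __init__(self, wanted_ids):
        super().__init__()
        self.wanted_ids = set(wanted_ids)
        self.current_id = None  # id of the wanted span we are inside, if any
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            span_id = dict(attrs).get("id")
            if span_id in self.wanted_ids:
                self.current_id = span_id
                self.values[span_id] = ""

    def handle_data(self, data):
        if self.current_id is not None:
            self.values[self.current_id] += data

    def handle_endtag(self, tag):
        # Assumption: wanted spans do not contain nested spans.
        if tag == "span":
            self.current_id = None


def extract_fields(html_text, wanted_ids):
    """Return the field values in the order given by `wanted_ids` ('' if missing)."""
    parser = SpanCollector(wanted_ids)
    parser.feed(html_text)
    parser.close()
    return [parser.values.get(span_id, "").strip() for span_id in wanted_ids]


# Hypothetical sample, modelled on the HTML in the first post.
sample = (
    '<td width="25%"><span id="lblField1Title">Field 1 Title:</span></td>\n'
    '<td width="75%"><span id="lblField1" class="field1_description">Field 1 data</span></td>\n'
    '<td width="25%"><span id="lblField2Title">Field 2 Title:</span></td>\n'
    '<td width="75%"><span id="lblField2" class="field2_description">Field 2 data</span></td>\n'
)
print(extract_fields(sample, ["lblField1", "lblField2"]))
```

Unlike plain string matching, this does not depend on the exact attribute order or spacing inside the tags, and `html.parser` tolerates HTML that is not strictly well-formed.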
Beowulf
The Patrician
Posts: 10619
Joined: 2002-07-04 01:18am
Location: 32ULV

Post by Beowulf »

If it's well formed as XML, you could use an XSLT stylesheet to transform it, and then concatenate all the records together. Or you could use an XPath query to locate and retrieve the correct data.
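A minimal sketch of the XPath route, assuming the page really does parse as XML. It uses Python's standard `xml.etree.ElementTree`, which supports only a limited XPath subset, and the span ids from the sample in the first post; the wrapping `<table>`/`<tr>` elements and the function name are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical well-formed sample, modelled on the HTML in the first post.
sample = """<table>
<tr>
<td width="25%"><span id="lblField1Title">Field 1 Title:</span></td>
<td width="75%"><span id="lblField1" class="field1_description">Field 1 data</span></td>
</tr>
<tr>
<td width="25%"><span id="lblField2Title">Field 2 Title:</span></td>
<td width="75%"><span id="lblField2" class="field2_description">Field 2 data</span></td>
</tr>
</table>"""


def xpath_fields(xml_text, span_ids):
    """Look up each span by id with an XPath query and return its text ('' if missing)."""
    root = ET.fromstring(xml_text)  # raises ParseError if the input is not well-formed
    row = []
    for span_id in span_ids:
        node = root.find(f".//span[@id='{span_id}']")
        row.append("".join(node.itertext()).strip() if node is not None else "")
    return row


print(xpath_fields(sample, ["lblField1", "lblField2"]))
```

If the files are XHTML with a namespace declaration, the element names in the query would need to be namespace-qualified; ordinary tag-soup HTML will fail at the `ET.fromstring` step.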
White Cat
Padawan Learner
Posts: 212
Joined: 2002-08-29 03:48pm
Location: A thousand km from the centre of the universe

Post by White Cat »

Thanks to everyone who offered advice. I ended up using a program called TextHarvest.

In case anyone else here ever needs to do something similar, here's the script I wrote for this purpose:

Code: Select all

FullPage = SetFromFile $ActualIFN
Field1 = Parse FullPage 'text before Field1 data' 'text after Field1 data'
Field2 = Parse FullPage 'text before Field2 data' 'text after Field2 data'
Field3 = Parse FullPage 'text before Field3 data' 'text after Field3 data'
Field4 = Parse FullPage 'text before Field4 data' 'text after Field4 data'

OutCSV ',' 'Init'
OutCSV $ActualIFN ;Outputs the filename to the CSV
OutCSV Field1
OutCSV Field2
OutCSV Field3
OutCSV Field4
OutCSV '' 'Done'
However, something about the program (not sure whether it's a bug or designed to work that way) causes this script to run once for every line in the input file, so if (for example) your HTML source files are 50 lines long, your output CSV file will have 50 copies of each record. To solve this, find a piece of text that only appears once per file (e.g. "DOCTYPE") and put it in TextHarvest's "Keep" line, preceded by a slash.
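For anyone without TextHarvest, the same idea can be sketched in plain Python: read each file whole, pull out the text between a "before" and an "after" marker for each field, and write one CSV row per file, starting with the filename. This is only a sketch under assumptions: the marker strings in `FIELD_MARKERS` are guesses based on the sample HTML in the first post, and the names `parse` and `harvest` are made up for the example.

```python
import csv
from pathlib import Path

# Assumed (before, after) marker pairs, one per field; adjust to the real pages.
FIELD_MARKERS = [
    ('class="field1_description">', "</span>"),
    ('class="field2_description">', "</span>"),
]


def parse(page, before, after):
    """Return the text between the first `before` and the next `after`, or '' if absent."""
    start = page.find(before)
    if start == -1:
        return ""
    start += len(before)
    end = page.find(after, start)
    return page[start:end] if end != -1 else ""


def harvest(input_dir, output_csv, pattern="*.html"):
    """Write one quoted CSV row per matching file: filename, then each field."""
    with open(output_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, quoting=csv.QUOTE_ALL)
        for path in sorted(Path(input_dir).glob(pattern)):
            # Read the file as a single string, so each file yields exactly one row.
            page = path.read_text(encoding="utf-8", errors="replace")
            row = [path.name] + [parse(page, b, a) for b, a in FIELD_MARKERS]
            writer.writerow(row)
```

Calling `harvest("pages", "records.csv")` (both names hypothetical) would then produce one row per file. Because each file is processed once as a whole, the duplicate-rows problem described above does not come up.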