How to extract data from HTML file

White Cat · Post by **White Cat** » 2007-07-24 04:06am

I'm struggling with getting some Web data into a usable format. I have a whole bunch of individual Web pages, with one record per file, and I want to extract the data to a single CSV (or Excel/Access) file. The structure of the HTML files looks something like this:

Code: Select all

<td width="25%"><span id="lblField1Title">Field 1 Title:</span></td>
<td width="75%"><span id="lblField1" class="field1_description">Field 1 data</span></td>
*more code*
<td width="25%"><span id="lblField2Title">Field 2 Title:</span></td>
<td width="75%"><span id="lblField2" class="field2_description">Field 2 data</span></td>

And so on. The end result should look like this:

"Field 1 data","Field 2 data", ... "Field 5 data"

Anyone know of a good method for extracting data like this? The toughest problem I've run into is that the extraction tools I've found assume that your data is nicely arranged with one record per row, instead of one record per file with the data scattered around within, like I'm dealing with.

Alferd Packer · Post by **Alferd Packer** » 2007-07-24 03:18pm

Well, you might want to figure out if the text lies before and after the fields you want to grab is uniform. For example, let's say you know that in each file, the word "description" appears two characters before the data you want to grab, and the phrase </span> appears immediately after the record.

You would then write a program that reads each line of the HTML file, and when it finds the word "description," it would record what came two characters after it, stopping when it had read in </span>, then truncating the last seven characters (the </span>) from the text. Append the output to a text file, and move on to the next HTML file.

General Zod · Post by **General Zod** » 2007-07-24 03:52pm

Couldn't you just open the raw data file in a browser then copy and paste the layout from the browser itself into Excel? Since it sounds as if you don't want to bother with all of the code. Reasonably, opening it in a browser would render the page as it's supposed to be displayed when viewed over the web, without any of the junk data.

phongn · Post by **phongn** » 2007-07-24 09:40pm

Assuming that this is well-formed HTML, you could probably try writing some .NET code or something to parse and extract the data you want.

Post by **Beowulf** » 2007-07-24 10:00pm

If it's well formed as XML, you could use an XSLT stylesheet to transform it, and then concatenate all the records together. Or you could use an XPath query to locate and retrieve the correct data.

White Cat · Post by **White Cat** » 2007-07-29 01:13am

Thanks to everyone who offered advice. I ended up using a program called TextHarvest.

In case anyone else here ever needs to do something similar, here's the script I wrote for this purpose:

Code: Select all

FullPage = SetFromFile $ActualIFN
Field1 = Parse FullPage 'text before Field1 data' 'text after Field1 data'
Field2 = Parse FullPage 'text before Field2 data' 'text after Field2 data'
Field3 = Parse FullPage 'text before Field3 data' 'text after Field3 data'
Field4 = Parse FullPage 'text before Field4 data' 'text after Field4 data'

OutCSV ',' 'Init'
OutCSV $ActualIFN ;Outputs the filename to the CSV
OutCSV Field1
OutCSV Field2
OutCSV Field3
OutCSV Field4
OutCSV '' 'Done'

However, something about the program (not sure whether it's a bug or designed to work that way) causes this code to run once for every line in the program, so if (for example) your HTML source files are 50 lines long, your output CSV file will have 50 copies of each record. To solve this, find a piece of text that only appears once per file (e.g. "DOCTYPE") and put it in TextHarvest's "Keep" line, preceded by a slash.