![]() |
|
|
|
|
1
18th October 14:33
External User
Posts: 1
|
I have a very large text file that gets dumped into a directoory every
now and then. It is typically around 750MB long, at least, and my question is: What is the best method to parse this thing and insert the data into a postgres db? I have tried using file(), fget*() and some others, all with limited success. It goes through OK (I am sending it to a background process on the server and using a callback function to email me when done) but it is really knocking the machine hard, besides taking a real long time to finish. Is there a better way of approaching this? Any help would be greatly appreciated. --Paul All Email originating from UWC is covered by disclaimer http://www.uwc.ac.za/portal/uwc2006/...imer/index.htm |
|
|
|
|
4
18th October 14:33
External User
Posts: 1
|
In addition to Martin's good suggestions (and also assuming you're running
php-cli via cron), you could use nice to stop it consuming too many resources: http://en.wikipedia.org/wiki/Nice_%28Unix%29 Edward |
|
|
5
18th October 14:34
External User
Posts: 1
|
This is the current approach that I am taking, was just really wondering
if there was some kind of voodoo that would speed things up a bit. Thanks both for your responses, appreciate it! --Paul All Email originating from UWC is covered by disclaimer http://www.uwc.ac.za/portal/uwc2006/...imer/index.htm |
|
|
6
18th October 14:34
External User
Posts: 1
|
Post some samples of the data you are parsing and a sample of the code
you've written to parse them. If you're parsing 750 megs of data then it's quite likely you could squeeze some performance out of the parse routines themselves. Cheers, Rob. -- .................................................. .......... SwarmBuy.com - http://www.swarmbuy.com Leveraging the buying power of the masses! .................................................. .......... |
|
|
7
18th October 14:34
External User
Posts: 1
|
Today's dataset is in a CSV (tab separated) , so I am using fgetcsv, it
looks like this (geo data): 936374 Roodepoort Roodepoort Roodeport-Maraisburg -26.1666667 27.8666667 P PPL ZA ZA 06 0 1759 Africa/Johannesburg 2004-05-11 Code: [SNIP] $row = 1; $handle = fopen($csvfile, "r"); while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) { $num = count($data); $row++; $insarr = array('userid' => $userid, 'geonameid' => $data[0], 'name' => $data[1], 'asciiname' => $data[2], 'alternatenames' => $data[3], 'latitude' => $data[4], 'longitude' => $data[5], 'featureclass' => $data[6], 'featurecode' => $data[7], 'countrycode' => $data[8], 'cc2' => $data[9], 'admin1code' => $data[10], 'admin2code' => $data[11], 'population' => $data[12], 'elevation' => $data[13], 'gtopo30' => $data[14], 'timezoneid' => $data[15], 'moddate' => $data[16] ); $this->objDbGeo->insertRecord($insarr); //$arr[] = $data; } fclose($handle); --Paul All Email originating from UWC is covered by disclaimer http://www.uwc.ac.za/portal/uwc2006/...imer/index.htm |
|
|
9
18th October 14:34
External User
Posts: 1
|
You can probably speed this up A LOT if you can batch multiple queries.
For instance, in MySQL you can do: INSERT INTO some_table ( x, y, z ) VALUES ( 1, 2, 3 ), ( 2, 3, 4 ), ( 5, 6, 7 ), ... If you do these in batches of 1000 you should be able to make a big time savings. Since you're useing fgetcsv() it's doubtful you can improve the file access/parse. Your bottleneck is most likely the database inserts. Cheers, Rob. -- .................................................. .......... SwarmBuy.com - http://www.swarmbuy.com Leveraging the buying power of the masses! .................................................. .......... |
|