Mombu the Php Forum

Go Back   Mombu the Php Forum > Php > Very Large text file parsing
User Name
Password
REGISTER NOW! Mark Forums Read




Reply
1 18th October 14:33
pscott
External User
 
Posts: 1
Default Very Large text file parsing



I have a very large text file that gets dumped into a directoory every
now and then. It is typically around 750MB long, at least, and my
question is:

What is the best method to parse this thing and insert the data into a
postgres db?

I have tried using file(), fget*() and some others, all with limited
success. It goes through OK (I am sending it to a background process on
the server and using a callback function to email me when done) but it
is really knocking the machine hard, besides taking a real long time to
finish.

Is there a better way of approaching this? Any help would be greatly
appreciated.

--Paul


All Email originating from UWC is covered by disclaimer http://www.uwc.ac.za/portal/uwc2006/...imer/index.htm
  Reply With Quote


 


2 18th October 14:33
martin
External User
 
Posts: 1
Default Very Large text file parsing



First, which is your approach? I suspect that you are doing this with a
cron job through php-cli.

Now, to avoid using to many resources, try with fopen() and fgets().
Also work with persistent connections, so you don't have that overhead.

The problem with file() is that it will load all the file to memory, and
you don't want 700+MB in memory.

--
21:50:04 up 2 days, 9:07, 0 users, load average: 0.92, 0.37, 0.18
---------------------------------------------------------
Lic. Martín Marqués | SELECT 'mmarques' ||
Centro de Telemática | '@' || 'unl.edu.ar';
Universidad Nacional | DBA, Programador,
del Litoral | Administrador
---------------------------------------------------------
  Reply With Quote
3 18th October 14:33
lammi
External User
 
Posts: 1
Default Very Large text file parsing


better split that file in small parts. it will take hours to parse a
750 mb file, it's not a good idea to allow a script to run for such a
long time. if you're on a multiprocessor machine you may use a single
processor for that job.

is that file updated from day to day and you need to insert the
changes into your database?
  Reply With Quote
4 18th October 14:33
edward
External User
 
Posts: 1
Default Very Large text file parsing


In addition to Martin's good suggestions (and also assuming you're running
php-cli via cron), you could use nice to stop it consuming too many
resources:

http://en.wikipedia.org/wiki/Nice_%28Unix%29

Edward
  Reply With Quote
5 18th October 14:34
pscott
External User
 
Posts: 1
Default Very Large text file parsing


This is the current approach that I am taking, was just really wondering
if there was some kind of voodoo that would speed things up a bit.

Thanks both for your responses, appreciate it!

--Paul


All Email originating from UWC is covered by disclaimer http://www.uwc.ac.za/portal/uwc2006/...imer/index.htm
  Reply With Quote
6 18th October 14:34
robert
External User
 
Posts: 1
Default Very Large text file parsing


Post some samples of the data you are parsing and a sample of the code
you've written to parse them. If you're parsing 750 megs of data then
it's quite likely you could squeeze some performance out of the parse
routines themselves.

Cheers,
Rob.
--
.................................................. ..........
SwarmBuy.com - http://www.swarmbuy.com

Leveraging the buying power of the masses!
.................................................. ..........
  Reply With Quote
7 18th October 14:34
pscott
External User
 
Posts: 1
Default Very Large text file parsing


Today's dataset is in a CSV (tab separated) , so I am using fgetcsv, it
looks like this (geo data):

936374 Roodepoort Roodepoort Roodeport-Maraisburg -26.1666667 27.8666667
P PPL ZA ZA 06 0 1759 Africa/Johannesburg 2004-05-11

Code:
[SNIP]
$row = 1;
$handle = fopen($csvfile, "r");
while (($data = fgetcsv($handle, 1000, "\t")) !== FALSE) {
$num = count($data); $row++;
$insarr = array('userid' => $userid,
'geonameid' => $data[0],
'name' => $data[1],
'asciiname' => $data[2],
'alternatenames' => $data[3],
'latitude' => $data[4],
'longitude' => $data[5],
'featureclass' => $data[6],
'featurecode' => $data[7],
'countrycode' => $data[8],
'cc2' => $data[9],
'admin1code' => $data[10],
'admin2code' => $data[11],
'population' => $data[12],
'elevation' => $data[13],
'gtopo30' => $data[14],
'timezoneid' => $data[15],
'moddate' => $data[16] );
$this->objDbGeo->insertRecord($insarr);
//$arr[] = $data;
}
fclose($handle);

--Paul

All Email originating from UWC is covered by disclaimer http://www.uwc.ac.za/portal/uwc2006/...imer/index.htm
  Reply With Quote
8 18th October 14:34
martin
External User
 
Posts: 1
Default Very Large text file parsing


Adding to Robert's comment, try not to copy variables, but to reference
them, so you will use less resources.

Also free DB results after query.

--
21:50:04 up 2 days, 9:07, 0 users, load average: 0.92, 0.37, 0.18
---------------------------------------------------------
Lic. MartÃ*n Marqués | SELECT 'mmarques' ||
Centro de Telemática | '@' || 'unl.edu.ar';
Universidad Nacional | DBA, Programador,
del Litoral | Administrador
---------------------------------------------------------
  Reply With Quote
9 18th October 14:34
robert
External User
 
Posts: 1
Default Very Large text file parsing


You can probably speed this up A LOT if you can batch multiple queries.
For instance, in MySQL you can do:

INSERT INTO some_table
( x, y, z )
VALUES
( 1, 2, 3 ),
( 2, 3, 4 ),
( 5, 6, 7 ),
...

If you do these in batches of 1000 you should be able to make a big time
savings. Since you're useing fgetcsv() it's doubtful you can improve the
file access/parse. Your bottleneck is most likely the database inserts.

Cheers,
Rob.
--
.................................................. ..........
SwarmBuy.com - http://www.swarmbuy.com

Leveraging the buying power of the masses!
.................................................. ..........
  Reply With Quote
10 18th October 14:34
martin
External User
 
Posts: 1
Default Very Large text file parsing


What's $num and $row for?


Those objDbGeo->insertRecord() do some sort of control over the data
that is passed in the array?

If not, you should just use the COPY command of PostgreSQL (you are
using PostgreSQL if I remember correctly) or simply do a bash script
using psql and the \copy command.

--
21:50:04 up 2 days, 9:07, 0 users, load average: 0.92, 0.37, 0.18
---------------------------------------------------------
Lic. MartÃ*n Marqués | SELECT 'mmarques' ||
Centro de Telemática | '@' || 'unl.edu.ar';
Universidad Nacional | DBA, Programador,
del Litoral | Administrador
---------------------------------------------------------
  Reply With Quote
Reply


Thread Tools
Display Modes




666