Mombu the Programming Forum

Go Back   Mombu the Programming Forum > Programming > Word-processing challenge anyone?
User Name
Password
REGISTER NOW! Mark Forums Read




Reply
1 18th September 20:47
david frank
External User
 
Posts: 1
Default Word-processing challenge anyone?



This reply to a PL/I ancient topic from 13 Feb 2005 is being "re-freshed" to
allow new responses here instead of
in comp.lang.fortran.
One of those respondees of course being Robin Vowels who as a PL/I advocate
is again posting that he has
solution(s) without posting proof as usual, and a Ruby advocate who is being
invited to re-post his Ruby solution for
the "bible word count problem" of two years ago, here now that the topic is
again "visible"


<snip>
  Reply With Quote


 


2 18th September 20:47
eric i.
External User
 
Posts: 1
Default Word-processing challenge anyone?



I've attached a Ruby solution below. It takes 2.593646 seconds on a
2.33 GHz Intel Core 2 Duo (although only one core is used). This is
an interpreted speed. There are at least four highly active parallel
efforts to create a compiled Ruby underway as I type this. They might
produce faster
times.

My code not only keeps track of the unique words, it counts how many
time each appears in the text. Then, after the timing result is
produced, the code outputs information about the most frequent word(s)
(which happens to be "the") and the least frequent words (4004 appear
only once, from "abaddon" to "zuzims"). But again, that output does
not affect the timing.

Also, since I wanted to be able to access the data from the web
directly, there's a little bit of code to allow it to skip the non-
included material at the top and bottom of the file.

Eric

====

Are you interested in on-site Ruby training that uses well-designed,
real-world, hands-on exercises? http://LearnRuby.com

========

# Reads a file containing the text of the Bible and, after processing
# the data slightly, prints out how many total words and how many
# unique words it contained. See http://tinyurl.com/354hry for the
# full problem description.

# This solution is offered by LearnRuby.com (http://learnruby.com).

# If there is a file named "kjv12.txt" in the current directory, the
# data will be read from that file. Otherwise, the data will be read
# from the URI "http://patriot.net/users/bmcgin/kjv12.txt".

start_time = Time.now

Bible_Filename = "kjv12.txt"
Bible_URI = "http://patriot.net/users/bmcgin/kjv12.txt"

input = begin
open Bible_Filename
rescue
require 'open-uri'
puts "NOTE: time taken is invalid since it includes web
access\n\n"
open Bible_URI
end

state = :skip_top
word_count = 0
words_seen = Hash.new(0)

input.each_line do |line|
state = rocess if
state == :skip_top && line =~ /Book\s+01\s+Genesis/

next unless state == rocess

state = :skip_bottom if line =~ /022:021.*Amen\./

# remove apostrophe between letters
mod_line = line.gsub /([[:alpha:]])'([[:alpha:]])/, '\1\2'

# convert sequences of non-letters to single spaces, remove white
# space at either end, and convert letters to lower case
mod_line.gsub!(/[^[:alpha:]]+/, ' ').strip!.downcase!

words = mod_line.split
word_count += words.size
words.each { |word| words_seen[word] += 1 }
end

input.close

puts "Number of words: %d" % word_count
puts "Number of unique words: %d" % words_seen.size

end_time = Time.now

puts "Time taken to compute: %f seconds" % (end_time - start_time)

#
# Extra information, just for the fun of it...
#

# figure out the counts for the most and least frequent words
word_counts = words_seen.values
top_word_count = word_counts.max
bottom_word_count = word_counts.min

# put together a list of the most frequent word(s) and the least
# frequent word(s)
top_words = words_seen.select { |word, count|
count == top_word_count
}.map { |e| e[0] }
bottom_words = words_seen.select { |word, count|
count == bottom_word_count
}.map { |e| e[0] }

# output information about the most and least frequent words
puts("\nThe following %d most frequent word(s) each appeared %d
time(s):" %
[top_words.size, top_word_count])
puts top_words.sort.join("\n").gsub(/^/, ' ')

puts "\nThe following %d least frequent word(s) each appeared %d
time(s):" %
[bottom_words.size, bottom_word_count]
puts bottom_words.sort.join("\n").gsub(/^/, ' ')
====
  Reply With Quote


 


3 18th September 20:50
robin
External User
 
Posts: 1
Default Word-processing challenge anyone?


This is a translation to PL/I of DF's Fortran code.
Like that code, it's ASCII specific.

(nofofl, nosize):
word_counts: proc options (main);
dcl (hashbits value(17), maxw value ((2**hashbits)), wlen value (30)) fixed binary (31);
dcl line char (80) var, sword(wlen) char (1), ch char (1);
dcl (ich, i, k, n, nchars, (nc, collisions, total, unique) init (0), odd, even) fixed bin (31);
dcl 1 wc(maxw) static,
2 word char(wlen),
2 count fixed binary (31);
dcl letters char (52) init ('abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPWQSTUV WXYZ');

wc.word = ' ' ; wc.count = 0;

on endfile (sysin) go to compact;

more:
nc = 0;
get edit (line) (L);
line = translate(line, 'abcdefghijklmnopqrstuvwxyz',
'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
ml:
do forever; nc = nc+1 ;
if nc > length(line) then go to more;
nc = search(line, letters, nc); /* skip until alpha start */
if nc = 0 then go to more; /* no alpha chars at end of line. */
ch = substr(line,nc,1);
ich = unspec(ch);
k = 1 ; sword(1) = ch; /* found start new word */
odd = ich ; even = 1; /* init. hash with 1st char */
il:
do forever; nc = nc+1;
if nc > length(line) then leave il;
ch = substr(line,nc,1);
if ch = '''' then iterate il; /* delete ' from word */
if ch < 'A' | ch > 'z' then leave il; /* end word */
if ch > 'Z' & ch < 'a' then leave il;
ich = unspec (ch);
k = k+1 ; sword(k) = ch;
if (iand(k,1) = 0) then
even = ieor(isll(even,5),ich); /* accum even hash */
else
odd = ieor(isll(odd, 5),ich); /* accum odd hash */
end;

n = ieor(isrl(odd*even,hashbits+2),odd*even); /* hash product pieces */
n = iand(n,maxw-1); /* positive index */ do forever;
n = n+1 ; if (n > maxw-1) then n = 1; /* reset index */
if (wc(n).count = 0) then
do; wc(n).word = substr(string(sword),1,k) ; wc(n).count = 1 ; iterate ml; end;
/* initial entry */
else if substr(string(sword),1,k) = wc(n).word then
do; wc(n).count = wc(n).count+1 ; iterate ml; end; /* count occurrences */
else
collisions = collisions+1;
end;
end;

compact:
n = 0;
do i = 1 to maxw; /* make entries contiguous from wc(1: */
if (wc(i).count = 0) then iterate;
n = n+1 ; wc(n) = wc(i);
total = total + wc(i).count ; unique = unique+1;
end;

call qsort(0,n-1); /* quicksort wc(1:n) entries */

put skip edit ( (wc(i).count, trim(wc(i).word) do i=1 to n)) (f(5), x(1), a, skip);
put skip list ( 'total words =', total );
put skip list ( 'unique words =', unique);
put skip data (collisions);

qsort: proc (l,r) recursive;
dcl 1 tempwc,
2 word char(wlen),
2 count fixed binary (31);
dcl sword char(wlen);
dcl ( l, r, i,j ) fixed binary (31);

i = l ; j = r ; sword = wc((l+r+2)/2).word;
do while (i <= j);
do while (wc(i+1).word < sword & i < r);
i = i+1; end;
do while (sword < wc(j+1).word & j > l);
j = j-1;
end;
if (i <= j) then
do;
tempwc = wc(i+1);
wc(i+1) = wc(j+1);
wc(j+1) = tempwc; /* swap words,counts */
i = i+1;
j = j-1;
end;
end;
if (l < j) then call qsort(l, j);
if (i < r) then call qsort(i, r);
end qsort;
end word_counts;
/*
.......
3 zuph
5 zur
1 zuriel
5 zurishaddai
1 zuzims
total words = 789781
unique words = 12691
COLLISIONS= 4318;
*/
  Reply With Quote
4 18th September 20:50
david frank
External User
 
Posts: 1
Default Word-processing challenge anyone?


<snip source>
Congratulations,
very interesting translation of my
http://home.earthlink.net/~dave_gemini/wc.f90 fortran source
that NO-ONE expected to see after almost 3yrs from the original challenge.

I note you have removed any benchmark timing, surely the code will beat
Eric's RUBY version = 2.6 sec ??

Prove that PL/I "once upon a time" supported EASY distribution of a windows
exe program by
making your exe available to run on our PCs, even tho I sense that no
longer is supported with the
"web-sphere PL/I system"

OTOH, I can EASILY make my windows exe (1 file) available on request for
anyone's use, and note that it will process
ANY text file

Perhaps someone will confirm your source is valid for their compiler..
  Reply With Quote
5 18th September 20:50
gerard schildberger
External User
 
Posts: 1
Default Word-processing challenge anyone?


|> This is a translation to PL/I of DF's Fortran code.
|> Like that code, it's ASCII specific.
| <snip source>
| Congratulations,
| very interesting translation of my
| http://home.earthlink.net/~dave_gemini/wc.f90 fortran source
| that NO-ONE expected to see after almost 3yrs from the original challenge. |
| > .......
| > 3 zuph
| > 5 zur
| > 1 zuriel
| > 5 zurishaddai
| > 1 zuzims
| > total words = 789781
| > unique words = 12691
| > COLLISIONS= 4318;
| > */
|
| I note you have removed any benchmark timing, surely the code will beat
| Eric's RUBY version = 2.6 sec ??

Benchmark timings from different CPUs are meaningless unless
all benchmarks are done on the same CPU, same operating
systems, same harddrive(s), etc. Saying that I ran some
code on my computer, and it ran in 2.2 seconds. So what?
I'd have to run the Fortran code (with a particular Fortran
compiler) also.

I also wish you would write the code to not be ASCII dependent.
__________________________________________________ ____Gerard S.


| Prove that PL/I "once upon a time" supported EASY distribution of a windows
| exe program by
| making your exe available to run on our PCs, even tho I sense that no
| longer is supported with the
| "web-sphere PL/I system"
|
| OTOH, I can EASILY make my windows exe (1 file) available on request for
| anyone's use, and note that it will process
| ANY text file
|
| Perhaps someone will confirm your source is valid for their compiler..
  Reply With Quote
6 18th September 20:51
robin
External User
 
Posts: 1
Default Word-processing challenge anyone?


I didn't remove anything. I just didn't implement it, as it's irrelevant.

As I posted previously, time is irrelevant on different PCs.


It's a function of the linker, not the compiler.


No, your code WON'T process any text file.
It will ONLY handle ASCII.


Perhaps someone will confirm that YOUR source is valid
for their compiler.
But YOUR source code is non-portable, so it 's not guaranteed to
compile on every Fortran compiler.
  Reply With Quote
Reply


Thread Tools
Display Modes




666