1 10th August 01:02
reuben grinberg
General Purpose parsing



I'm trying to write a parser in Prolog, and all the tutorials on the net
assume that I'm only interested in a limited number of tokens:

sentence --> verb, noun.
verb(hammering).
noun(hammer).
noun(nail).

I'd like to pick up tokens that fit a certain regular expression:
[a-zA-Z][a-zA-Z0-9]*

I've tried a number of different things.

First, I tried:
id --> [S], {regmatch("^[a-zA-Z][a-zA-Z0-9]*$", S)}.

| ?- phrase(id, "bob").
no

So then I tried this, thinking that the problem might have to do with
the difference between strings and atoms:

id --> [C], {atom_chars(C,S), regmatch("^[a-zA-Z][a-zA-Z0-9]*$", S)}.

again:

| ?- phrase(id, "bob").
no
When I trace through, it doesn't even look like regmatch is being called:
| ?- phrase(id, "bob").
1 1 Call: id([98,111,98],[]) ?
2 2 Call: 'C'([98,111,98],_1148,[]) ?
2 2 Fail: 'C'([98,111,98],_1148,[]) ?
1 1 Fail: id([98,111,98],[]) ?
no

I tried a different tack, which was to manually encode the regular
expression:
id --> [C], {letter([C])}.
id --> [C], restid, {letter([C])}.
restid --> [C], {letter([C])}.
restid --> [C], {num([C])}.

restid --> [C], restid, {letter([C])}.
restid --> [C], restid, {num([C])}.

letter("a").
....
num("9").

This seems to work:
| ?- phrase(id, "bob").
yes
| ?- phrase(id, "bob ").
no


But now I have the following problem.
I've added another rule:
bind --> id, ["="], id.

Now when I try this rule out it doesn't work:
| ?- phrase(bind, "a=b").
no

I've pasted in the trace for this below, and it's really freaking long
for some reason.

I'd appreciate any advice on the correct way to parse arbitrary tokens
using regular expressions, and on why I can't get my 'bind' rule to work.

Thanks,
Reuben Grinberg


| ?- phrase(bind, "a=b").
1 1 Call: bind([97,61,98],[]) ?
2 2 Call: id([97,61,98],_1107) ?
3 3 Call: 'C'([97,61,98],_1773,_1107) ?
3 3 Exit: 'C'([97,61,98],97,[61,98]) ?
4 3 Call: letter([97]) ?
? 4 3 Exit: letter([97]) ?
? 2 2 Exit: id([97,61,98],[61,98]) ?
5 2 Call: 'C'([61,98],[61],_1101) ?
5 2 Fail: 'C'([61,98],[61],_1101) ?
2 2 Redo: id([97,61,98],[61,98]) ?
4 3 Redo: letter([97]) ?
4 3 Fail: letter([97]) ?
6 3 Call: 'C'([97,61,98],_1779,_1780) ?
6 3 Exit: 'C'([97,61,98],97,[61,98]) ?
7 3 Call: restid([61,98],_1107) ?
8 4 Call: 'C'([61,98],_3730,_1107) ?
8 4 Exit: 'C'([61,98],61,[98]) ?
9 4 Call: letter([61]) ?
9 4 Fail: letter([61]) ?
10 4 Call: 'C'([61,98],_3730,_1107) ?
10 4 Exit: 'C'([61,98],61,[98]) ?
11 4 Call: num([61]) ?
11 4 Fail: num([61]) ?
12 4 Call: 'C'([61,98],_3736,_3737) ?
12 4 Exit: 'C'([61,98],61,[98]) ?
13 4 Call: restid([98],_1107) ?
14 5 Call: 'C'([98],_5717,_1107) ?
14 5 Exit: 'C'([98],98,[]) ?
15 5 Call: letter([98]) ?
? 15 5 Exit: letter([98]) ?
? 13 4 Exit: restid([98],[]) ?
16 4 Call: letter([61]) ?
16 4 Fail: letter([61]) ?
13 4 Redo: restid([98],[]) ?
15 5 Redo: letter([98]) ?
15 5 Fail: letter([98]) ?
17 5 Call: 'C'([98],_5717,_1107) ?
17 5 Exit: 'C'([98],98,[]) ?
18 5 Call: num([98]) ?
18 5 Fail: num([98]) ?
19 5 Call: 'C'([98],_5723,_5724) ?
19 5 Exit: 'C'([98],98,[]) ?
20 5 Call: restid([],_1107) ?
21 6 Call: 'C'([],_7704,_1107) ?
21 6 Fail: 'C'([],_7704,_1107) ?
22 6 Call: 'C'([],_7704,_1107) ?
22 6 Fail: 'C'([],_7704,_1107) ?
23 6 Call: 'C'([],_7710,_7711) ?
23 6 Fail: 'C'([],_7710,_7711) ?
24 6 Call: 'C'([],_7710,_7711) ?
24 6 Fail: 'C'([],_7710,_7711) ?
20 5 Fail: restid([],_1107) ?
25 5 Call: 'C'([98],_5723,_5724) ?
25 5 Exit: 'C'([98],98,[]) ?
26 5 Call: restid([],_1107) ?
27 6 Call: 'C'([],_7704,_1107) ?
27 6 Fail: 'C'([],_7704,_1107) ?
28 6 Call: 'C'([],_7704,_1107) ?
28 6 Fail: 'C'([],_7704,_1107) ?
29 6 Call: 'C'([],_7710,_7711) ?
29 6 Fail: 'C'([],_7710,_7711) ?
30 6 Call: 'C'([],_7710,_7711) ?
30 6 Fail: 'C'([],_7710,_7711) ?
26 5 Fail: restid([],_1107) ?
13 4 Fail: restid([98],_1107) ?
31 4 Call: 'C'([61,98],_3736,_3737) ?
31 4 Exit: 'C'([61,98],61,[98]) ?
32 4 Call: restid([98],_1107) ?
33 5 Call: 'C'([98],_5717,_1107) ?
33 5 Exit: 'C'([98],98,[]) ?
34 5 Call: letter([98]) ?
? 34 5 Exit: letter([98]) ?
? 32 4 Exit: restid([98],[]) ?
35 4 Call: num([61]) ?
35 4 Fail: num([61]) ?
32 4 Redo: restid([98],[]) ?
34 5 Redo: letter([98]) ?
34 5 Fail: letter([98]) ?
36 5 Call: 'C'([98],_5717,_1107) ?
36 5 Exit: 'C'([98],98,[]) ?
37 5 Call: num([98]) ?
37 5 Fail: num([98]) ?
38 5 Call: 'C'([98],_5723,_5724) ?
38 5 Exit: 'C'([98],98,[]) ?
39 5 Call: restid([],_1107) ?
40 6 Call: 'C'([],_7704,_1107) ?
40 6 Fail: 'C'([],_7704,_1107) ?
41 6 Call: 'C'([],_7704,_1107) ?
41 6 Fail: 'C'([],_7704,_1107) ?
42 6 Call: 'C'([],_7710,_7711) ?
42 6 Fail: 'C'([],_7710,_7711) ?
43 6 Call: 'C'([],_7710,_7711) ?
43 6 Fail: 'C'([],_7710,_7711) ?
39 5 Fail: restid([],_1107) ?
44 5 Call: 'C'([98],_5723,_5724) ?
44 5 Exit: 'C'([98],98,[]) ?
45 5 Call: restid([],_1107) ?
46 6 Call: 'C'([],_7704,_1107) ?
46 6 Fail: 'C'([],_7704,_1107) ?
47 6 Call: 'C'([],_7704,_1107) ?
47 6 Fail: 'C'([],_7704,_1107) ?
48 6 Call: 'C'([],_7710,_7711) ?
48 6 Fail: 'C'([],_7710,_7711) ?
49 6 Call: 'C'([],_7710,_7711) ?
49 6 Fail: 'C'([],_7710,_7711) ?
45 5 Fail: restid([],_1107) ?
32 4 Fail: restid([98],_1107) ?
7 3 Fail: restid([61,98],_1107) ?
2 2 Fail: id([97,61,98],_1107) ?
1 1 Fail: bind([97,61,98],[]) ?
no


 


2 10th August 01:07
markus triska



Reuben Grinberg <reuben.grinberg@aya.yale.edu> writes:


In your first attempt, S will only be a single element of the list (of
character codes). You need to accumulate more elements for "bob".


Idea for a shorter version:

id --> [C], { between(0'a, 0'z, C) }, id_r.
id_r --> [].
id_r --> [C], { between(0'a, 0'z, C) ; between(0'0, 0'9, C)}, id_r.
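The version above covers lower-case names only. A variant that also accepts upper-case letters, as in the regular expression [a-zA-Z][a-zA-Z0-9]*, and hands back the matched name as an atom might look like the sketch below. The names identifier//1, identifier_rest//1, letter/1 and digit/1 are made up for illustration; the integer/1 checks are there so that a non-code element such as the atom let simply fails instead of raising an error in the comparison.

```prolog
% identifier(-Atom): matches [a-zA-Z][a-zA-Z0-9]* over a list of character
% codes and returns the matched text as an atom.
identifier(Atom) -->
    [C], { letter(C) },
    identifier_rest(Cs),
    { atom_codes(Atom, [C|Cs]) }.

% Collect the remaining letters and digits; the longest match is tried first.
identifier_rest([C|Cs]) --> [C], { letter(C) ; digit(C) }, identifier_rest(Cs).
identifier_rest([])     --> [].

letter(C) :- integer(C), C >= 0'a, C =< 0'z.
letter(C) :- integer(C), C >= 0'A, C =< 0'Z.
digit(C)  :- integer(C), C >= 0'0, C =< 0'9.
```

On a system where double-quoted text is read as a list of character codes (the SICStus default), the query phrase(identifier(A), "Bob42") should bind A to 'Bob42', while phrase(identifier(A), "9x") should fail.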


We have:

%?- X = "=".
%@% X = [61]

So the rule is actually:

bind --> id, [[61]], id.

Try instead:

bind --> id, [0'=], id.

Because:

%?- X = 0'=.
%@% X = 61;
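Another way to write the same rule, sketched here under the assumption that double-quoted text is read as a list of character codes: a string literal used directly as a grammar body is already a terminal list, so "=" without the surrounding brackets matches the single code 61. The helper predicates lower/1 and digit/1 are illustrative names, used instead of between/3 so the example needs no library.

```prolog
:- set_prolog_flag(double_quotes, codes).

% "=" is the terminal list [61]; ["="] would be [[61]], a list inside a list.
bind --> id, "=", id.

id --> [C], { lower(C) }, id_r.
id_r --> [].
id_r --> [C], { lower(C) ; digit(C) }, id_r.

lower(C) :- integer(C), C >= 0'a, C =< 0'z.
digit(C) :- integer(C), C >= 0'0, C =< 0'9.
```

With these clauses loaded, phrase(bind, "a=b") should succeed and phrase(bind, "a=") should fail, because the second id needs at least one letter.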

All the best,
Markus

--
comp.lang.prolog FAQ: http://www.logic.at/prolog/faq/
3 10th August 01:07
reuben grinberg


Thanks for your reply Markus!

Your suggestions fixed my problem and the id definition you suggested is
much better than the one I had.

Could you explain to me what the 0' syntax is? I'm using SICStus and I
searched the documentation for 0' and didn't find anything. Is 0'a an
atom, a string, or a single character?

Also, do you know why I can't use spaces now in my phrase?

expr --> [let], id.

| ?- phrase(expr, "let a").
no
| ?- phrase(expr, "leta").
yes


Also, I'm unclear about why my code is working differently than this
snippet:
a --> b, c.
b --> [the].
c --> [dog].

| ?- phrase(a, "the dog").
no
| ?- phrase(a, "thedog").
no
| ?- phrase(b, "the").
no
| ?- phrase(b, [the]).
yes
| ?- phrase(a, [the, dog]).
yes


But this doesn't work with my code:
| ?- phrase(expr, [let, a]).
! Domain error in argument 1 of >= /2
! expected expression, found let
! goal: let>=97


I'm guessing some of the problems I'm having have to do with Prolog
trying to parse and tokenize at the same time. I have a predicate that
tokenizes exactly the way I want:
hasktok(X,Y) :- tokenize(
"+|([a-zA-Z][a-zA-Z0-9]*|\\(|\\)|\\\\|\\.|=|",
X,Y).

But because my expr, bind, etc. won't take lists of atoms in phrase, this
doesn't work:

| ?- hasktok("let a", Y).
Y = [let,a] ? ;
Y = [le,t,a] ? ;
Y = [l,et,a] ? ;
Y = [l,e,t,a] ?
yes
| ?- hasktok("let a", Y), phrase(expr, Y).
! Domain error in argument 1 of >= /2
! expected expression, found let
! goal: let>=97


Any advice you have would be much appreciated!

Thanks,
Reuben Grinberg
4 13th August 07:05
markus triska


Reuben Grinberg <reuben.grinberg@aya.yale.edu> writes:

0'X denotes the ASCII/Unicode code point of character X.


Ask Prolog:

%?- atom(0'a).
%@% No

%?- string(0'a).
%@% No

%?- number(0'a).
%@% Yes

%?- 97 =:= 0'a.
%@% Yes

The space character (code 32) lies neither between 0'a and 0'z nor
between 0'0 and 0'9, so id/2 does not accept it.
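One way to accept the blank while staying at the character-code level is to consume it explicitly in the grammar. The following is only a sketch: it assumes double-quoted text is read as a list of character codes, writes the keyword as the string literal "let" rather than the atom let, and uses the made-up helpers blanks1//0, blanks//0, lower/1 and digit/1.

```prolog
:- set_prolog_flag(double_quotes, codes).

% The keyword, at least one space, then an identifier.
expr --> "let", blanks1, id.

% One or more space characters (code 32).
blanks1 --> [32], blanks.
blanks  --> [32], blanks.
blanks  --> [].

id --> [C], { lower(C) }, id_r.
id_r --> [C], { lower(C) ; digit(C) }, id_r.
id_r --> [].

lower(C) :- integer(C), C >= 0'a, C =< 0'z.
digit(C) :- integer(C), C >= 0'0, C =< 0'9.
```

With this grammar phrase(expr, "let a") should succeed, and "leta" is rejected because the separating blank is now required.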

The rule expr --> [let], id. mixes atoms (`let') with character codes (id/2).


"leta" is a valid id. "leta" is shorthand notation for a list of
character codes:

%?- Xs = "leta".
%@% Xs = [108, 101, 116, 97]


Your grammar generates lists of character codes. The snippet generates
lists of atoms.


It's common to first tokenise (= convert character codes to atoms and
compound terms), and then parse based on these tokens for clarity. You
can do it simultaneously too of course, but not by arbitrarily
intermingling the phases. DCGs are suitable for both phases: For
tokenising, since strings are lists. For parsing, since lists of terms
are - well, also lists. So it's list processing in both cases.
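A minimal sketch of the two phases, both written as DCGs. It assumes double-quoted text is read as a list of character codes, and the token shapes (let, '=', id(Name)), the tree shapes and all predicate names are choices made for this illustration, not something from the thread. The cuts make the tokeniser commit to the longest identifier, so "let a" yields one token list rather than several.

```prolog
:- set_prolog_flag(double_quotes, codes).

% Phase 1: character codes -> tokens.
tokens([T|Ts]) --> blanks, token(T), !, tokens(Ts).
tokens([])     --> blanks.

token('=') --> "=".
token(T)   --> word(Cs), { atom_codes(A, Cs), classify(A, T) }.

% The keyword let stays a plain atom; every other word becomes id(Name).
classify(let, let) :- !.
classify(A, id(A)).

word([C|Cs])      --> [C], { letter(C) }, word_rest(Cs).
word_rest([C|Cs]) --> [C], { letter(C) ; digit(C) }, !, word_rest(Cs).
word_rest([])     --> [].

blanks --> [32], !, blanks.
blanks --> [].

letter(C) :- C >= 0'a, C =< 0'z.
letter(C) :- C >= 0'A, C =< 0'Z.
digit(C)  :- C >= 0'0, C =< 0'9.

% Phase 2: tokens -> parse tree.
expr(let(B))       --> [let], bind(B).
bind(bind(X, Y))   --> [id(X), '=', id(Y)].

% parse(+Codes, -Tree): run both phases in sequence.
parse(Codes, Tree) :-
    phrase(tokens(Ts), Codes),
    phrase(expr(Tree), Ts).
```

For example, parse("let a=b", T) should give T = let(bind(a,b)), with the intermediate token list [let, id(a), '=', id(b)].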

--
comp.lang.prolog FAQ: http://www.logic.at/prolog/faq/
5 23rd August 06:23
reuben grinberg
External User
 
Posts: 1
Default General Purpose parsing


Thanks a lot for your help. I got it working.