1 11th December 17:40
bloodfart
External User
 
Posts: 1
Default Filter string to remove non-utf-8 characters


I've been working on this all day. I feel very frustrated (and
humiliated).

I'm simply trying to write a function that cleans a string using
vbscript so it doesn't have any characters outside of the utf-8
character range.

It seems simple enough, but I'm stumped.

Please, please help!!!
2 11th December 17:41
evertjan


What character numbers are outside utf-8?
<http://www.mail-archive.com/regexp-dev@jakarta.apache.org/msg00175.html>


--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)
3 11th December 17:46
bloodfart


Perhaps I'm not describing my problem properly. I want to get rid of
all characters that will make an rss feed in utf-8 crash. Examples are:
, , , etc. I'm using .asp to create my rss page, so I would like
it to be a vbscript function.

Is this more clear?

Thank you for the help.
4 11th December 17:47
bloodfart


BTW, I tried this:

testVar = "asd asdf... "

'Create a regular expression object
Dim regEx
Set regEx = New RegExp

'The Global property tells the RegExp engine to find ALL matching
'substrings, instead of just the first instance. We need this to be
'true.
regEx.Global = True

'Our pattern tells us what to find in the string... In this case, we
'find anything that isn't a numerical character, or a lowercase or
'uppercase alphabetic character
regEx.Pattern = "[^0-9a-zA-Z]"

'Use the Replace function of RegExp to clean the username. The Replace
'function takes the string to search (using the Pattern above as the
'search criteria), and the string to replace any found strings with.
'In this case, we want to replace our matches with nothing (""),
'as the matching characters will be the ones we don't want in our
'username.
Dim username
username = regEx.Replace(testVar, "")

But, writing testVar still contains the "".

Thanks!
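
One thing worth checking in the snippet above: RegExp.Replace does not change the string passed to it, it returns a new one. The cleaned text therefore ends up in username, while testVar keeps its original content. A minimal sketch of that behaviour, assuming the script is run under cscript (WScript.Echo stands in for Response.Write) and using the same sample string purely for illustration:

```vbscript
Option Explicit

Dim regEx, testVar, username
testVar = "asd asdf... "

Set regEx = New RegExp
regEx.Global = True
regEx.Pattern = "[^0-9a-zA-Z]"

' Replace returns the cleaned copy; testVar itself is left untouched.
username = regEx.Replace(testVar, "")

WScript.Echo username   ' asdasdf
WScript.Echo testVar    ' still the original, unfiltered string
```

Writing out username rather than testVar shows the filtered result.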
7 14th December 05:05
afro-man


I may be wrong here, but the Pattern in the RegExp.Pattern property is
the pattern you want to replace, not the one you want to keep. You
might also consider adding the RegExp.Test value so that you do not
need to run the replace against every string, but only those that have
the pattern in them that you wish to exclude.

Also, the Pattern should have multiple values separated by a | - with
them run together like that it is looking for that exact string, which
will never match anything...

Your RegExp.Pattern should be the things you want to get rid of
separated by the | marker.

That should work...
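
For reference, in the VBScript RegExp object a bracketed class such as [^0-9a-zA-Z] matches any single character outside the listed ranges, so no | separators are needed inside the brackets. A minimal sketch, assuming cscript as the host (hence WScript.Echo); the sample strings are illustrative only:

```vbscript
Option Explicit

Dim re
Set re = New RegExp
re.Global = True

' Character class: matches ANY ONE character that is not 0-9, a-z or A-Z.
re.Pattern = "[^0-9a-zA-Z]"

WScript.Echo re.Test("abc123")            ' False: nothing to strip
WScript.Echo re.Test("abc 123!")          ' True: the space and "!" match
WScript.Echo re.Replace("abc 123!", "")   ' abc123
```

Calling Test first is optional: Replace simply returns the input unchanged when nothing matches.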
8 14th December 05:11
evertjan


aren't those utf-8 characters?

--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)
9 14th December 05:13
anthony jones


UTF-8 is an encoding scheme for the Unicode character set, which is massive,
and I doubt you are trying to send any characters outside its range.

I can think of a couple of things you might be trying to do.

1) Write an XML file without the benefit of using MSXML properly.
2) Sending an XML stream from ASP but you're getting the encoding wrong.

Give us more details of your task and we can help. Stripping out characters
like , O and ? is not the answer.

Anthony.
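
If the second case applies, one common approach is to have ASP emit UTF-8, declare that encoding in the XML prolog, and escape only the characters that are markup in XML. The following is a rough sketch rather than a definitive fix, assuming classic ASP on IIS 5.1 or later (where Response.CodePage is available); XmlEscape and the sample description value are hypothetical names introduced here for illustration:

```vbscript
<%@ Language="VBScript" CodePage="65001" %>
<%
Option Explicit

' Ask ASP to convert its internal Unicode strings to UTF-8 on output.
Response.CodePage = 65001
Response.Charset = "utf-8"
Response.ContentType = "text/xml"

' Hypothetical helper: escape only the characters that are markup in XML.
' The ampersand must be handled first so later entities are not re-escaped.
Function XmlEscape(ByVal s)
    s = Replace(s, "&", "&amp;")
    s = Replace(s, "<", "&lt;")
    s = Replace(s, ">", "&gt;")
    XmlEscape = s
End Function

Dim description
description = "Caf" & ChrW(233) & " & bar"   ' sample data, illustrative only

Response.Write "<?xml version=""1.0"" encoding=""utf-8""?>" & vbCrLf
Response.Write "<description>" & XmlEscape(description) & "</description>"
%>
```

With the encoding declared consistently, accented and other international characters can stay in the feed instead of being stripped.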
10 14th December 05:17
tgetz


Sorry Everyone, I'm new to RSS and XML. Here is exactly what I'm trying
to do:


I am trying to do a bulk upload to Google Base. They require RSS 2.0
feeds.
I'm using classic ASP (VBScript) to pull my data stored in SQL 2000.

It seemed pretty simple, as I just created the file to their specs and it
worked. Here is an example:

<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Google Jobs</title>
<link>http://www.google.com/support/jobs/</link>
<description>Information about job openings at Google
Inc.</description>
<item>
<title>HR Analyst - Mountain View</title>
<link>http://www.google.com/support/jobs/bin/topic.py?dep_id=1077&amp;loc_id=1116</link>
<description>We have an immediate need for an experienced analytical HR
professional.
The ideal candidate has a proven record of developing analytical
frameworks to make
fact-based decisions.</description>
</item>
</channel>
</rss>

However, I'm uploading 50,000+ records and every so often I get an
invalid character (examples: , , ) that crashes my feed. The
culprit is the description field. This data is added by international
users.


As mentioned, I'm new to RSS (and XML), but my understanding is, I need
to convert the data to UTF-8.

It seems the simplest method would be to create a function in vbscript
and just filter the string. However, if I could accomplish the same
thing in TSQL I would be just as happy.

I tried the following, but asc() errors when it encounters one of the
above invalid characters:


function filterForXML(strString)

for i = 1 to Len(strString)
charCode = Asc(Mid(strString, i, 1))
if charCode < 32 or charCode >= 127 then
strString = left(strString,i-1) & right(strString,Len(strString)-i)
end if
next

filterForXML = strString

end function


Does anyone have any suggestions?
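
A possible explanation for the Asc() error, offered as a hedged reading of the code above: the loop bound Len(strString) is evaluated once, but the string gets shorter each time a character is removed, so i eventually points past the end, Mid returns an empty string, and Asc("") raises a runtime error. Removing a character in place also skips the one that slides into its position. If stripping is still wanted, a minimal sketch that builds a new string instead and keeps the same rule (drop codes below 32 and from 127 up) could look like this, assuming cscript as the host; AscW is swapped in for Asc so the check does not depend on the system code page:

```vbscript
Option Explicit

' Returns a copy of strString containing only characters with codes 32..126.
Function FilterForXML(ByVal strString)
    Dim i, ch, code, result
    result = ""
    For i = 1 To Len(strString)
        ch = Mid(strString, i, 1)
        code = AscW(ch)
        ' AscW returns a signed 16-bit value, so U+8000 and above come back negative.
        If code < 0 Then code = code + 65536
        If code >= 32 And code < 127 Then
            result = result & ch
        End If
    Next
    FilterForXML = result
End Function

' Sample input is illustrative: an accented letter and a control character.
WScript.Echo FilterForXML("Caf" & ChrW(233) & " au lait" & Chr(7))   ' Caf au lait
```

Note that this throws away every accented or non-Latin character, and it does not escape &, < or >, which still need handling before the text goes into the feed.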