Python Forum - Discussion Board

Subject: [python-chinese] incomplete multibyte sequence

Friday, March 12, 2004, 18:35

Anthony Liu antonyliu2002 at yahoo.com
Fri Mar 12 18:35:51 HKT 2004

I am trying to process a huge Chinese document.  It is
a single plain-text file of nearly 4 MB.

I always get an "incomplete multibyte sequence" error
when I try to convert the sentences to unicode.

I think the reason is that the Chinese document uses
both ASCII punctuation and 2-byte Chinese punctuation.

For example, the same document can contain both , and
, and both < > and 《》.

Is there any way I can get around this?  Don't ask me
to fix the Chinese document!



[Imported from the Mailman archive: http://www.zeuux.org/pipermail/zeuux-python]

Friday, March 12, 2004, 21:51

Zoom.Quiet zoomq at infopro.cn
Fri Mar 12 21:51:09 HKT 2004

Hello Anthony,

Can you try processing it line by line,
instead of all at once?
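
One way to try this is sketched below: read the file in binary mode, decode one line at a time, and collect the line numbers that fail instead of stopping at the first error. It is only an illustration, written for Python 3 rather than the Python 2 used in this thread. The helper name decode_lines, the GBK encoding, and the in-memory sample are assumptions, not details of the actual document.

```python
import io


def decode_lines(stream, encoding="gbk"):
    """Yield (line_number, text, raw) per line; text is None if decoding fails."""
    # Splitting on b"\n" is safe for GBK: trail bytes are never 0x0A.
    for number, raw in enumerate(stream, start=1):
        try:
            yield number, raw.decode(encoding), raw
        except UnicodeDecodeError:
            yield number, None, raw


# Hypothetical sample: line 2 holds a lone GBK lead byte (0xCA) before the newline.
sample = "第一行\n".encode("gbk") + b"\xca\n" + "第三行\n".encode("gbk")

bad_lines = [number for number, text, raw in decode_lines(io.BytesIO(sample))
             if text is None]
print(bad_lines)
```

With a real file, something like open(path, "rb") would take the place of the io.BytesIO object; the lines that come back as None are the ones worth inspecting byte by byte.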


-- 
Best regards,
 Zoom.Quiet                            

 /=======================================\
]Time is unimportant, only life important![
 \=======================================/




Saturday, March 13, 2004, 00:37

Rui Luo lplusplus at hotmail.com
Sat Mar 13 00:37:10 HKT 2004

For example, with the line below:
s.accept();
it blocks on that line forever.  I have tried Ctrl-C, and I have
tried closing the socket from another thread, but neither works.
How can I end my program "politely"?

Thanks





Saturday, March 13, 2004, 02:52

Anthony Liu antonyliu2002 at yahoo.com
Sat Mar 13 02:52:46 HKT 2004

Yes, I can read it sentence by sentence, but the
problem is how to know where the unicode error is
going to occur.




Saturday, March 13, 2004, 04:02

John Li johnli at ahlt.net
Sat Mar 13 04:02:23 HKT 2004

It's a problem with the unicode converter or the source
text.  To find out which, analyze the bytes that cause
the problem.  What are the decimal or hex values?
Then we can see whether in fact it is a valid or invalid
sequence.
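
To get at those values, one option is to let the decoder itself report where it stopped. The sketch below is written for Python 3 and is an illustration only: the function name find_bad_bytes, the GBK encoding, and the truncated sample are assumptions. It returns the byte offset of the first failure, the hex values of the bytes around it, and the decoder's stated reason.

```python
def find_bad_bytes(data, encoding="gbk", context=4):
    """Return (offset, hex_of_nearby_bytes, reason) for the first failure, else None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the bytes the decoder rejected.
        lo = max(err.start - context, 0)
        hi = min(err.end + context, len(data))
        return err.start, data[lo:hi].hex(), err.reason
    return None


good = "才子。".encode("gbk")   # three 2-byte characters, six bytes
truncated = good[:-1]           # hypothetical damage: last character cut in half

print(find_bad_bytes(good))
print(find_bad_bytes(truncated))
```

A character cut off at the end of a chunk, as in the truncated sample, is the kind of input that tends to produce the message in the subject line; a failure in the middle of the data points more toward the source text or a wrong encoding name.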

Also, please fix your Chinese emails.  If Yahoo is the
problem, please consider switching to another free
email service.  Thanks!

John





Saturday, March 13, 2004, 06:17

Anthony Liu antonyliu2002 at yahoo.com
Sat Mar 13 06:17:28 HKT 2004

Can you see the Chinese characters now?

Yes, what you say makes very good sense.  

The following 2 lines attempt to split a Chinese
sentence at punctuation marks.

str = "世界名著《红楼梦》的作者曹雪芹是前清有名的才子。"
alist=re.split('《|》|。', str)

It works fine, and alist will contain 3 chunks of the
sentence as expected. 

But if I convert str to unicode before I call re.split,
like so:

str = unicode(str, 'gbk')

then the regular expression passed to re.split won't
match anything.

I tried unicoding the punctuations in the regular
expression as well, like so:

leftbk = '《'
rightbk= '》'
fullstop = '。'

pattern = '\'' + leftbk + '|' + rightbk + '|' + fullstop + '\''

alist=re.split(pattern, str)

It does not work.  I am kind of at my wit's end.









Saturday, March 13, 2004, 13:00

John Li johnli at ahlt.net
Sat Mar 13 13:00:03 HKT 2004

> Can you see the Chinese characters now?

Thanks!
 

----------------------------------------

import wx, re

str = '世界名著《红楼梦》的作者曹雪芹是前清有名的才子。'
str = unicode(str, 'gbk')
leftbk = unicode('《', 'gbk')
rightbk = unicode('》', 'gbk')
fullstop = unicode('。', 'gbk')
pattern = leftbk + u'|' + rightbk + u'|' + fullstop
alist=re.split(pattern, str)
for x in alist:
	print x.encode('gbk')
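
For readers on Python 3, where str is already unicode and unicode() no longer exists, a rough equivalent of the snippet above might look like the following. This is a sketch, not code from the thread: it assumes the text arrives as GBK bytes, and the names raw, text, and parts are made up for illustration. Empty strings are filtered out because the sentence ends with a delimiter.

```python
import re

# Stand-in for bytes read from a GBK-encoded file (assumption for this sketch).
raw = "世界名著《红楼梦》的作者曹雪芹是前清有名的才子。".encode("gbk")

text = raw.decode("gbk")      # bytes -> text, the job unicode(str, 'gbk') did
pattern = "《|》|。"            # a text pattern to match against text

# re.split leaves an empty string after the final 。, so drop empty pieces.
parts = [part for part in re.split(pattern, text) if part]
for part in parts:
    print(part)
```

The point is the same as in the Python 2 version: the pattern and the string being split have to be of the same kind, both text or both bytes.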

--John




Saturday, March 13, 2004, 13:04

John Li johnli at ahlt.net
Sat Mar 13 13:04:27 HKT 2004

> import wx, re
> 
Sorry!  Please leave out import wx.

John 


Saturday, March 13, 2004, 16:18

Anthony Liu antonyliu2002 at yahoo.com
Sat Mar 13 16:18:56 HKT 2004

Yes, John, thank you very much for your hint and the
sample code.  I did not realize that I need to use
u'|' for the or operator.




Sunday, March 14, 2004, 00:46

John Li johnli at ahlt.net
Sun Mar 14 00:46:27 HKT 2004

> Yes, John, thank you very much for your hint and the
> sample code.  I did not realize that I need to use
> u'|' for the or operator.
> 
Actually, I don't think that's essential--I was just being
careful.  I think that if any one of the strings is unicode,
then the rest will be automatically converted to unicode,
so that the resulting string is unicode.
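
For what it's worth, in Python 2 that automatic conversion goes through the default ASCII codec, so it is fine for a plain '|' but raises UnicodeDecodeError for a byte string holding GBK punctuation. Python 3 does no implicit conversion at all. The sketch below is an illustration written for Python 3, not code from this thread, and its variable names are made up; it shows the explicit decode that Python 3 asks for.

```python
import re

separator = b"|"    # ASCII-only bytes
left = "《"          # text (str) in Python 3

try:
    pattern = left + separator     # mixing str and bytes
    mixed_ok = True
except TypeError:
    mixed_ok = False               # Python 3 refuses to guess an encoding

# Decoding explicitly gives an all-text pattern that re can apply to text.
pattern = left + separator.decode("ascii") + "》"
parts = re.split(pattern, "名著《红楼梦》的作者")
print(mixed_ok, parts)
```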




Sunday, March 14, 2004, 02:07

Anthony Liu antonyliu2002 at yahoo.com
Sun Mar 14 02:07:50 HKT 2004

> I think that if any one of the strings is
> unicode,
> then the rest will be automatically converted to
> unicode,
> so that the resulting string is unicode.

Hi, John, are you serious in saying the above?



