Printing contents of a tsv file (with UTF-8) in Python

The code below works fine in a file I named tsv_test.py:

import csv

class ReadUTF8():

    def unicode_csv_reader(self, utf8_data, dialect=csv.excel_tab, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]


    def load_deck_data(self):
        filename = 'lexicon.tsv'
        reader = self.unicode_csv_reader(open(filename))
        for field1, field2, field3, field4 in reader:
            print field1, field2, field3, field4

ReadUTF8().load_deck_data()

But when I copy/paste it into my project (a Kivy project), it breaks. The code and the error are below:

class StudyScreenManagement(ScreenManager):

    def unicode_csv_reader(self, utf8_data, dialect=csv.excel_tab, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]


    def load_deck_data(self):
        filename = 'lexicon.tsv'
        reader = self.unicode_csv_reader(open(filename))
        for field1, field2, field3, field4 in reader:
            print field1, field2, field3, field4

I doubt this is relevant, but just in case, here is the relevant part of the .kv file:

Button:
    text: 'Lexicon'
    on_press: app.root.load_deck_data()

Output:

 File "/Users/bearnun/code/mingyu/mingyuKivy/mingyu_controllers.py", line 14, in load_deck_data
 for field1, field2, field3, field4 in reader:
 ValueError: need more than 1 value to unpack

::Side note::

I tried just printing 'field1' in both cases. With that change the output for both is:

[u'\u4b03', u'\u98d2', u'[sa4]', u'/variant of \u98af|\u98d2[sa4]/']
[u'\u4b20', u'\u4b20', u'[fei1]', u'/old variant of \u970f[fei1]/']

My desired output:

䬃 飒 [sa4] /variant of 颯|飒[sa4]/
䬠 䬠 [fei1] /old variant of 霏[fei1]/

[Edit below]

Contents of lexicon.tsv:

䬃   飒   [sa4]   /variant of 颯|飒[sa4]/
䬠   䬠   [fei1]  /old variant of 霏[fei1]/

Apparently, I am receiving a list instead of a generator, so if in load_deck_data() I change:

for field1, field2, field3, field4 in reader:
    print field1, field2, field3, field4

to:

for line in reader:
    print ''.join(line)

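my project works fine. Of course, this does not work in the small snippet that worked originally.

I would love to know why I'm getting a generator in one place, but a list in another. :)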

Apparently, I am receiving a list instead of a generator, so if in load_deck_data() I change:

for field1, field2, field3, field4 in reader:
    print field1, field2, field3, field4

to:

for line in reader:
    print ''.join(line)

my project works fine.

Look at this example:

data = [
    ['a', 'b', 'c', 'd'],
    ['e'],
]

def mygen(x):
    for item in x:
        yield item

for line in mygen(data):
    print ''.join(line)

--output:--
abcd
e

for col1, col2, col3, col4 in mygen(data):
    print col1, col2, col3, col4


--output:--
a b c d

Traceback (most recent call last):
  File "1.py", line 13, in <module>
    for col1, col2, col3, col4 in mygen(data):
ValueError: need more than 1 value to unpack

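In the first for-in loop, you are asking, "Please retrieve all the elements in the list and join them together." In the second for-in loop, you are demanding, "Retrieve four elements from the list!" See the difference? In the first case, the list can contain 0 to n elements, and there will be no error. In the second case, the list must have exactly 4 elements; with fewer or more, unpacking raises a ValueError.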
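
If some rows may be short, one option is to check the length before unpacking. The following is only a sketch, written in Python 3 syntax (the code in this thread is Python 2); the helper name four_field_rows is made up for illustration:

```python
def four_field_rows(rows):
    """Yield only the rows that have exactly four fields; report the rest."""
    for number, row in enumerate(rows, start=1):
        if len(row) != 4:
            # %r formats the row with repr(), so you see exactly what was read.
            print("skipping row %d: %r" % (number, row))
            continue
        yield row

data = [
    ['a', 'b', 'c', 'd'],
    ['e'],
]

# Only the four-field row reaches the unpacking loop.
for col1, col2, col3, col4 in four_field_rows(data):
    print(col1, col2, col3, col4)
```

This does not fix the underlying data; it only makes the offending rows visible instead of stopping at the first one.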

I would love to know why I'm getting a generator in one place, but a list in another.

Simple. You aren't. csv.reader() returns a list of strings for each row, which means your generator function yields a list of strings on each iteration.

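I think you changed the data in your file. In one file you have tab-delimited data, so csv.reader() returns a list of four items for each line, which can be unpacked into four variables. Your other file has data that is not tab-delimited, which causes csv.reader() to read the whole line as one item. The list of strings that csv.reader() returns then contains only one item, and a one-item list cannot be unpacked into four variables.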
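
To see the difference concretely, here is a minimal sketch (Python 3 syntax, unlike the Python 2 code in this thread; the sample strings and the parse helper are invented for illustration) that feeds one tab-delimited line and one space-delimited line to csv.reader() with the same tab dialect:

```python
import csv
import io

# Hypothetical sample data: the first line uses real tabs,
# the second uses runs of spaces that merely look like tabs.
tabbed = "a\tb\tc\td\n"
spaced = "a   b   c   d\n"

def parse(text):
    """Return the rows csv.reader() produces with the tab dialect."""
    return list(csv.reader(io.StringIO(text, newline=""), dialect=csv.excel_tab))

tab_rows = parse(tabbed)
space_rows = parse(spaced)

print(tab_rows)    # one row with four fields
print(space_rows)  # one row with a single field holding the whole line
```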

I tried just printing 'field1' in both cases. With that change the output for both is:

[u'\u4b03', u'\u98d2', u'[sa4]', u'/variant of \u98af|\u98d2[sa4]/']
[u'\u4b20', u'\u4b20', u'[fei1]', u'/old variant of \u970f[fei1]/']

Instead of print field1, if you do print repr(field1), I think you will get:

"[u'\u4b03', u'\u98d2', u'[sa4]', u'/variant of \u98af|\u98d2[sa4]/']"

Note the outer quotes, which mean that your tsv file really does contain the following on a single line:

[䬃, 飒, [sa4], /variant of 颯|飒[sa4]/]

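There are no tabs separating anything, so the whole line, which looks like a list, is read in as one item, and csv.reader() therefore returns a list containing that one item. You mistook the single item for a Python list because Python does not display the quotes when you print a string. For example, there is no difference in the output of the following two print statements: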

>>> print "[1, 2, 3]"
[1, 2, 3]
>>> print [1, 2, 3]
[1, 2, 3]

print can fool you in other cases too, because a string can contain non-printable characters, and the output of print will not show them:

>>> print "hello\bworld"
hellworld

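The bottom line is: you can never know what the original thing was by looking at the output of print. Whenever you want to know exactly what the original thing is, always use: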

print repr(some_string)

Now, look at the results:

>>> print repr([1, 2, 3])
[1, 2, 3]
>>> print repr('[1, 2, 3]')
'[1, 2, 3]'
>>> print repr('hello\bworld')
'hello\x08world'

The output tells you exactly what the original thing was.
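
The same trick applies to the tab question here. A small sketch in Python 3 syntax (the sample strings are invented) showing that print can hide the difference between a tab and a run of spaces, while repr() exposes it:

```python
with_tab = "1\t2"
with_spaces = "1       2"

# Depending on the terminal's tab stops, these two can look identical.
print(with_tab)
print(with_spaces)

# repr() shows the tab as the escape sequence \t; the spaces stay spaces.
print(repr(with_tab))
print(repr(with_spaces))
```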

With the following tab-delimited lexicon.tsv file:

1   2   3   €
䬃   飒   [sa4]   /variant of 颯|飒[sa4]/

After clicking the Lexicon button, the code below does not cause an error:

from kivy.app import App
from kivy.uix.screenmanager import ScreenManager, Screen
import csv

class StudyScreenManager(ScreenManager):

    def unicode_csv_reader(self, utf8_data, dialect=csv.excel_tab, **kwargs):
        csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
        for row in csv_reader:
            yield [unicode(cell, 'utf-8') for cell in row]


    def load_deck_data(self):
        filename = 'lexicon.tsv'
        reader = self.unicode_csv_reader(open(filename))
        for field1, field2, field3, field4 in reader:
            print field1, field2, field3, field4


class HistoryScreen(Screen):
    pass

class MathScreen(Screen):
    pass

class MyApp(App):
    def build(self):
        sm = StudyScreenManager()
        sm.add_widget(HistoryScreen(name='history'))
        sm.add_widget(MathScreen(name='math'))

        return sm

MyApp().run()

my.kv:

<HistoryScreen>:  #the 'root' of the following widget hierarchy:
    BoxLayout:
        Button:
            text: 'Lexicon'
            on_press: app.root.load_deck_data()  #self=Button, root=HistoryScreen, app.root=the Widget returned by build()
        Button:
            text: "Next"
            on_press: root.manager.current = "math"

<MathScreen>: #the 'root' of the following widget hierarchy:
    BoxLayout:
        Button:
            text: 'Lexicon'
            on_press: app.root.load_deck_data()
        Button:
            text: 'Previous'
            on_press: root.manager.current = "history"

After clicking the Lexicon button, this is the output I see in a UTF-8 aware terminal window:

1 2 3 €
䬃 飒 [sa4] /variant of 颯|飒[sa4]/
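
For readers on Python 3, where print is a function and unicode() no longer exists, a rough equivalent might look like the sketch below. It assumes the file is UTF-8 and tab-delimited; the function name read_tsv is hypothetical, and the demo writes a temporary file rather than touching a real lexicon.tsv:

```python
import csv
import os
import tempfile

def read_tsv(path):
    """Yield each row of a UTF-8, tab-delimited file as a list of str."""
    # The csv module documentation recommends newline='' for files passed
    # to csv.reader(); encoding='utf-8' decodes the text while reading.
    with open(path, encoding="utf-8", newline="") as handle:
        for row in csv.reader(handle, dialect=csv.excel_tab):
            yield row

# Demo with a throwaway file containing one of the rows shown above.
fd, demo_path = tempfile.mkstemp(suffix=".tsv")
with os.fdopen(fd, "w", encoding="utf-8", newline="") as handle:
    handle.write("䬃\t飒\t[sa4]\t/variant of 颯|飒[sa4]/\n")

rows = list(read_tsv(demo_path))
os.remove(demo_path)

for field1, field2, field3, field4 in rows:
    print(field1, field2, field3, field4)
```

Because the file is opened in text mode with an explicit encoding, each cell is already a str and no per-cell decoding step is needed.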