Make a text file by repeating words based on frequency

I know this question may not fit Stack Overflow's question standards, but I have been doing coding exercises for a few months to parse and analyze text, having never programmed before, and I have gotten help from this forum.

I ran a frequency analysis on several XML files and stored the results (word counts) in a MySQL DB.

I want to make a text file in which each word is repeated according to its frequency (e.g. breakfast, 6 => breakfast breakfast breakfast breakfast breakfast breakfast), with a space between repetitions, writing words from lowest frequency (the start of the text) to highest ('a' or 'the' would be the most frequent and would form the last part of the body).

Please share some ideas, libraries, or coding examples. Thank you.

import re
import requests
import MySQLdb as mdb
from xml.etree import ElementTree
from collections import Counter




### MYSQL ###

db = mdb.connect(host="****", user="****", passwd="****", db="****")

cursor = db.cursor()
sql = "DROP TABLE IF EXISTS Table1"
cursor.execute(sql)
db.commit()
sql = "CREATE TABLE Table1(Id INT PRIMARY KEY AUTO_INCREMENT, keyword TEXT, frequency INT)"
cursor.execute(sql)
db.commit()



### XML PARSING ###
def main(n=1000):

    # A list of feeds to process and the xpath used to extract descriptions
    feeds = [
        {'url': 'http://www.nyartbeat.com/list/event_type_print_painting.en.xml', 'xpath': './/Description'},
        {'url': 'http://feeds.feedburner.com/FriezeMagazineUniversal?format=xml', 'xpath': './/description'},
        {'url': 'http://www.artandeducation.net/category/announcement/feed/', 'xpath': './/description'},
        {'url': 'http://www.blouinartinfo.com/rss/visual-arts.xml', 'xpath': './/description'},
        {'url': 'http://feeds.feedburner.com/ContemporaryArtDaily?format=xml', 'xpath': './/description'}
    ]



    # A place to hold all feed results
    results = []

    # Loop all the feeds
    for feed in feeds:
        # Append feed results together
        results = results + process(feed['url'], feed['xpath'])

    # Join all results into a big string (process() returns str, so no bytes artifacts)
    contents = " ".join(results)

    # Collapse runs of whitespace into single spaces
    contents = re.sub(r'\s+', ' ', contents)

    # Remove everything that is not a character or whitespace
    contents = re.sub('[^A-Za-z ]+', '', contents)

    # Create a list of lower case words that are at least 1 character long
    words = [w.lower() for w in contents.split() if len(w) >= 1]


    # Count the words
    word_count = Counter(words)

    # Clean the content a little
    filter_words = ['art', 'artist']
    for word in filter_words:
        if word in word_count:
            del word_count[word]



    # Add the top-n words and their counts to the DB
    sql = """INSERT INTO Table1 (keyword, frequency) VALUES (%s, %s)"""
    for word, count in word_count.most_common(n):
        cursor.execute(sql, (word, count))
    db.commit()

def process(url, xpath):
    """
    Downloads a feed url and extracts the results with a variable path
    :param url: string
    :param xpath: string
    :return: list
    """
    contents = requests.get(url)
    root = ElementTree.fromstring(contents.content)
    # Return element text as str (encoding to bytes here would make str() in the
    # caller produce b'...' literals under Python 3)
    return [element.text if element.text is not None else '' for element in root.findall(xpath)]





if __name__ == "__main__":
    main()

word_count.most_common(n), which you already use in your for loop, returns a list of (word, count) tuples:

Let's store it in a variable:

words = word_count.most_common(n)
# Ex: [('a',5),('apples',2),('the',4)]

Using itemgetter, sort it by count, ascending:

from operator import itemgetter
words = sorted(words, key=itemgetter(1))
# words = [('apples', 2), ('the', 4), ('a', 5)]
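
Since Counter.most_common() already returns entries sorted from most to least common, an equivalent shortcut (a sketch; ties may come out in a different order, which doesn't matter here) is to just reverse that list instead of calling sorted:

words = word_count.most_common(n)[::-1]
# words = [('apples', 2), ('the', 4), ('a', 5)]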

Now iterate over the entries and append each word to a list, repeated count times:

out = []
for word, count in words:
    out += [word]*count
# out = ['apples', 'apples', 'the', 'the', 'the', 'the', 'a', 'a', 'a', 'a', 'a']

The next line turns the list into one long string:

final = " ".join(out)
# final = "apples apples the the the the a a a a a"
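
If you'd rather skip the intermediate list, the repetition and the join can be collapsed into a single expression (a sketch equivalent to the loop above):

final = " ".join(" ".join([word] * count) for word, count in words)
# final = "apples apples the the the the a a a a a"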

Now just write it to a file:

with open("filename.txt","w+") as f:
    f.write(final)
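
One note: your cleaning step already strips everything but ASCII letters, but if you ever keep accented words from the feeds, it is safer to pass an explicit encoding when opening the file:

with open("filename.txt", "w", encoding="utf-8") as f:
    f.write(final)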

Putting it all together, the code looks like this:

from operator import itemgetter

words = word_count.most_common(n)
words = sorted(words, key=itemgetter(1))

out = []
for word, count in words:
    out += [word]*count

final = " ".join(out)

with open("filename.txt","w+") as f:
    f.write(final)
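
And since the counts are already stored in MySQL, an alternative (a sketch assuming the Table1 schema and MySQLdb connection from your question) is to let the database do the sorting instead of Counter:

import MySQLdb as mdb

db = mdb.connect(host="****", user="****", passwd="****", db="****")
cursor = db.cursor()

# Lowest frequency first, so the most frequent words end up last in the file
cursor.execute("SELECT keyword, frequency FROM Table1 ORDER BY frequency ASC")

out = []
for word, count in cursor.fetchall():
    out += [word] * count

with open("filename.txt", "w") as f:
    f.write(" ".join(out))

db.close()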