Reading certain letters after a specified string from a text file

Question

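I want to extract the characters and digits that follow the very specific string data-permalink=" in a huge text file (50 MB). Ideally, the output should be written to a simple (separate) text file that looks like this: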

34k89 456ij 233a4 ...

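The data-permalink=" part always stays exactly the same (as it usually does in source code), but the id inside it can be any combination of characters and digits. It looked simple at first, but because the string is not at the start of a line and the required output is not a separate word, I could not come up with a working solution in the time I had. I am running out of time and need a solution or a hint right away, so any help is much appreciated.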

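Example of the data in the source file:

random content above ....

I know C++ or Python best, so a solution in one of those languages would be great.

I tried something like this: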

#include <iostream>
#include <string>
#include <fstream>
using namespace std;

int main()
{
    ifstream in ("data.txt");
    if(in.fail())
    {
        cout<<"error";
    }
    else
    {
        char c;
        while(in.get(c))
        {
            if(c=="data-permalink=")
                cout<<"lol this is awesome"
            else
                cout<<" ";
        }
    }
    return 0;
}

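This was just a random attempt to see whether the structure works; it is nowhere near a solution. It probably also gives you an idea of how bad I am at this right now.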

Answer 1

Well, nowadays 50 MB basically counts as "small". With data this small, you can read the whole file into a std::string and then do a linear search.

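So the algorithm is:

1. Open the files and check that they could be opened
2. Read the complete file into a std::string
3. Do a linear search for the string data-permalink=" (including the opening quote)
4. Remember the start position of the permalink
5. Search for the closing "
6. Use std::string's substr function to create the output permalink string
7. Write it to the output file
8. Go back to step 3 and continue the search after the closing quote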
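
Since the question also mentions Python, the steps above could be sketched roughly as follows. This is only a minimal illustration, not the code from this answer: the names MARKER, extract_permalinks and extract_to_file are made up for the example, and the encoding handling is an assumption about the input file.

```python
MARKER = 'data-permalink="'  # search string, including the opening quote


def extract_permalinks(text):
    """Return every id that follows MARKER, up to the next double quote."""
    ids = []
    pos = 0
    while True:
        start = text.find(MARKER, pos)      # step 3: linear search
        if start == -1:
            break                           # no further occurrence
        start += len(MARKER)                # step 4: id starts after the marker
        end = text.find('"', start)         # step 5: closing quote
        if end == -1:
            break                           # unterminated attribute, stop
        ids.append(text[start:end])         # step 6: cut out the id
        pos = end + 1                       # step 8: continue after the quote
    return ids


def extract_to_file(source_path, link_path):
    """Read the whole source file and write one id per line (hypothetical helper)."""
    # Assumption: undecodable bytes may be replaced, since only ASCII ids matter.
    with open(source_path, encoding="utf-8", errors="replace") as src:
        data = src.read()
    with open(link_path, "w", encoding="utf-8") as dst:
        for permalink in extract_permalinks(data):
            dst.write(permalink + "\n")


if __name__ == "__main__":
    sample = 'abc data-permalink="34k89" xyz <a data-permalink="456ij">'
    print(extract_permalinks(sample))
```

Reading the whole file at once keeps the search loop simple; for a file of this size that should not be a problem on a typical machine.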

I created a 70 MB test file with random data.

The whole procedure takes less than 1 second, even with the slow linear search.

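But be careful: you want to parse an HTML file. Because of potentially nested structures, a plain text search will most likely not work in general. For that, you should use an existing HTML parser.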
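
As a rough idea of what the parser route could look like, here is a small Python sketch built on the standard library's html.parser module. The class name PermalinkCollector and the helper collect_permalinks are invented for this example, and it assumes the ids really are stored in data-permalink attributes of HTML tags.

```python
from html.parser import HTMLParser


class PermalinkCollector(HTMLParser):
    """Collects the value of every data-permalink attribute it sees."""

    def __init__(self):
        super().__init__()
        self.permalinks = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        for name, value in attrs:
            if name == "data-permalink" and value is not None:
                self.permalinks.append(value)


def collect_permalinks(html_text):
    """Parse html_text and return the data-permalink values in document order."""
    collector = PermalinkCollector()
    collector.feed(html_text)
    collector.close()
    return collector.permalinks


if __name__ == "__main__":
    page = '<div data-permalink="34k89"><a data-permalink="456ij">x</a></div>'
    print(collect_permalinks(page))
```

Unlike a plain text search, this only reports the attribute when it appears on a tag, so the same characters inside ordinary text content are ignored.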

Anyway, here is one of many possible solutions.

#include <iostream>
#include <fstream>
#include <string>
#include <random>
#include <iterator>
#include <algorithm>

std::string randomSourceCharacters{ " abcdefghijklmnopqrstuvwxyz" };
// Backslashes must be escaped: "\t" would be a tab character, not part of the path
const std::string sourceFileName{ "r:\\test.txt" };
const std::string linkFileName{ "r:\\links.txt" };

void createRandomData() {
    std::random_device randomDevice;
    std::mt19937 randomGgenerator(randomDevice());
    std::uniform_int_distribution<> randomCharacterDistribution(0, randomSourceCharacters.size() - 1);
    std::uniform_int_distribution<> randomLength(10, 30);

    if (std::ofstream ofs{ sourceFileName }; ofs) {


        for (size_t i{}; i < 1000000; ++i) {

            const int prefixLength{ randomLength(randomGgenerator) };
            const int linkLength{ randomLength(randomGgenerator) };
            const int suffixLength{ randomLength(randomGgenerator) };

            for (int k{}; k < prefixLength; ++k)
                ofs << randomSourceCharacters[randomCharacterDistribution(randomGgenerator)];
            ofs << "data-permalink=\"";

            for (int k{}; k < linkLength; ++k)
                ofs << randomSourceCharacters[randomCharacterDistribution(randomGgenerator)];
            ofs << "\"";
            for (int k{}; k < suffixLength; ++k)
                ofs << randomSourceCharacters[randomCharacterDistribution(randomGgenerator)];

        }
    }
    else std::cerr << "\nError: Could not open source file '" << sourceFileName << "' for writing\n";
}


int main() {
    // Please uncomment if you want to create a file with test data
    // createRandomData();


    // Open source file for reading and check, if file could be opened
    if (std::ifstream ifs{ sourceFileName }; ifs) {

        // Open link file for writing and check, if file could be opened
        if (std::ofstream ofs{ linkFileName }; ofs) {

            // Read the complete 50MB file into a string
            std::string data(std::istreambuf_iterator<char>(ifs), {});

            const std::string searchString{ "data-permalink=\"" };
            const std::string permalinkEndString{ "\"" };

            // Do a linear search
            for (size_t posBegin{}; posBegin < data.length(); ) {

                // Search for the begin of the permalink
                if (posBegin = data.find(searchString, posBegin); posBegin != std::string::npos) {

                    const size_t posStartForEndSearch = posBegin + searchString.length() ;

                    // Search for the end of the permalink
                    if (size_t posEnd = data.find(permalinkEndString, posStartForEndSearch); posEnd != std::string::npos) {

                        // Output result
                        const size_t lengthPermalink{ posEnd - posStartForEndSearch };
                        const std::string output{ data.substr(posStartForEndSearch, lengthPermalink) };
                        ofs << output << '\n';
                        posBegin = posEnd + 1;
                    }
                    else break;
                }
                else break;
            }
        }
        else std::cerr << "\nError: Could not open link file '" << linkFileName << "' for writing\n";
    }
    else std::cerr << "\nError: Could not open source file '" << sourceFileName << "' for reading\n";
}
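
Another of the many possible solutions, sketched here in Python because the question mentions it, is a regular expression. This is a different technique from the find/substr loop above, and the names PERMALINK_RE and find_permalinks are made up for the example.

```python
import re

# One capture group: everything between the opening quote and the next quote.
PERMALINK_RE = re.compile(r'data-permalink="([^"]*)"')


def find_permalinks(text):
    """Return all captured ids in the order they appear in text."""
    return PERMALINK_RE.findall(text)


if __name__ == "__main__":
    sample = 'a data-permalink="34k89" b data-permalink="233a4" c'
    print(find_permalinks(sample))
```

The same caveat about HTML applies here: the pattern matches the raw text and knows nothing about tags or nesting.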

Edit

If you need unique links, you can store the results in a std::unordered_set and output them afterwards.

#include <iostream>
#include <fstream>
#include <string>
#include <iterator>
#include <algorithm>
#include <unordered_set>

// Backslashes must be escaped: "\t" would be a tab character, not part of the path
const std::string sourceFileName{ "r:\\test.txt" };
const std::string linkFileName{ "r:\\links.txt" };

int main() {

    // Open source file for reading and check, if file could be opened
    if (std::ifstream ifs{ sourceFileName }; ifs) {

        // Open link file for writing and check, if file could be opened
        if (std::ofstream ofs{ linkFileName }; ofs) {

            // Read the complete 50MB file into a string
            std::string data(std::istreambuf_iterator<char>(ifs), {});

            const std::string searchString{ "data-permalink=\"" };
            const std::string permalinkEndString{ "\"" };

            // Here we will store unique results
            std::unordered_set<std::string> result{};

            // Do a linear search
            for (size_t posBegin{}; posBegin < data.length(); ) {

                // Search for the begin of the permalink
                if (posBegin = data.find(searchString, posBegin); posBegin != std::string::npos) {

                    const size_t posStartForEndSearch = posBegin + searchString.length();

                    // Search for the end of the permalink
                    if (size_t posEnd = data.find(permalinkEndString, posStartForEndSearch); posEnd != std::string::npos) {

                        // Store result; the set silently drops duplicates
                        const size_t lengthPermalink{ posEnd - posStartForEndSearch };
                        const std::string output{ data.substr(posStartForEndSearch, lengthPermalink) };
                        result.insert(output);

                        posBegin = posEnd + 1;
                    }
                    else break;
                }
                else break;
            }
            for (const std::string& link : result)
               ofs << link << '\n';

        }
        else std::cerr << "\nError: Could not open link file '" << linkFileName << "' for writing\n";
    }
    else std::cerr << "\nError: Could not open source file '" << sourceFileName << "' for reading\n";
}
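
One thing to keep in mind is that the iteration order of a std::unordered_set is unspecified, so the links may come out in a different order than they appear in the file. If the original order matters, a Python sketch of the same idea could rely on the fact that dict keys keep their insertion order; the function name unique_in_order is made up for this example.

```python
def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    # dict preserves insertion order (Python 3.7+), so the keys stay in file order.
    return list(dict.fromkeys(items))


if __name__ == "__main__":
    print(unique_in_order(["34k89", "456ij", "34k89", "233a4", "456ij"]))
```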
