c# Best type/collection/list/dataset to handle super large data (csv/tab files)

Question

I'm building a WPF (MVVM) application that handles very large csv files. We are talking about 1GB to 10GB.

I open the file with File.ReadLines and parse it into a list of the following class:

public class FileLine
{
    public DateTime Time { get; set; } 
    public string Message { get; set; } //Usually around 256 characters
    public string Info1 { get; set; } //Exact 56 characters
    public string Info2 { get; set; } //Exact 4 characters
    //and so on
}

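...and then I do all sorts of data manipulation, queries, charts... you name it... all using Linq.

We are testing with a 1.8GB file, and once it is open the process takes about 2GB of memory.

Eventually, when my client needs to open his 10GB file, this will not be possible, because it would take 12GB+ of memory. What is the best type/collection/list/dataset for this kind of work?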

Answer 1

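When I had to do something like this before, I handled it with a container object that holds a list of dictionaries. At the time I thought the limit would (or should) be 2^32 elements, but a collection-size exception was thrown well before reaching 2^32 elements, while there were still many GB of RAM free. Assuming you want a dictionary, something like the following should work until you really do run out of physical and virtual memory. I remember that the servers where I worked a few years ago had 512GB of RAM, and I'm sure they have more now, but that is a separate story.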

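    using System;
    using System.Collections.Generic;

    // Spreads entries over several dictionaries so that no single
    // dictionary has to grow past the per-collection size limit.
    public class MyHugeDictionary<TKey, TValue>
    {
        private readonly List<Dictionary<TKey, TValue>> allDict;
        private Dictionary<TKey, TValue> currDictionary;

        public MyHugeDictionary()
        {
            allDict = new List<Dictionary<TKey, TValue>>();
            currDictionary = new Dictionary<TKey, TValue>();
            allDict.Add(currDictionary);
        }

        // True if the key is present in any of the underlying dictionaries.
        public bool ItemExists(TKey key)
        {
            foreach (Dictionary<TKey, TValue> dict in allDict)
            {
                if (dict.ContainsKey(key))
                {
                    return true;
                }
            }
            return false;
        }

        public void Add(TKey key, TValue value)
        {
            if (ItemExists(key))  // find if the item is in any other dictionary first
            {
                // handle dups...
                return;
            }

            try
            {
                currDictionary.Add(key, value);
            }
            catch (OutOfMemoryException) when (currDictionary.Count > 0)
            {
                // Assumption: the current dictionary could not grow any further.
                // Look up the actual exception for your runtime; on .NET an
                // oversized backing array usually surfaces as OutOfMemoryException.
                currDictionary = new Dictionary<TKey, TValue>();
                allDict.Add(currDictionary);

                // If this throws as well, memory really is exhausted:
                // oops, game over for real now :(
                currDictionary.Add(key, value);
            }
        }
    }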

Answer 2

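After some discussion, the best approach is to read the file, process it, then discard everything else and keep only the results.

Another possibility is to use a database. That would work, but it adds a lot of complexity.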
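As a minimal sketch of the "read, process, keep only the results" idea: the class below streams the file with File.ReadLines and keeps nothing but a small per-day counter. The names StreamingSummary and CountLinesPerDay are made up for illustration, and the parsing assumes a comma-separated file with the timestamp in the first column and no quoted commas, which may not match the real file layout.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

public static class StreamingSummary
{
    // Counts lines per calendar day while reading the file one line at a time.
    // Only the result dictionary stays in memory, not the parsed lines.
    public static Dictionary<DateTime, long> CountLinesPerDay(string path)
    {
        var counts = new Dictionary<DateTime, long>();

        foreach (string line in File.ReadLines(path))
        {
            // Assumption: timestamp is the first comma-separated column.
            int comma = line.IndexOf(',');
            string first = comma < 0 ? line : line.Substring(0, comma);

            if (!DateTime.TryParse(first, CultureInfo.InvariantCulture,
                                   DateTimeStyles.None, out DateTime time))
            {
                continue; // skip a header row or a malformed line
            }

            // TryGetValue leaves 'current' at 0 when the day is not present yet.
            counts.TryGetValue(time.Date, out long current);
            counts[time.Date] = current + 1;
        }

        return counts;
    }
}
```

Memory use then depends on the size of the result (one entry per day here) rather than on the size of the file. The trade-off is that each new query means another pass over the file.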

Answer 3

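See this:

https://github.com/aumcode/nfx/tree/master/Source/NFX/ApplicationModel/Pile
https://www.infoq.com/articles/Big-Memory-Part-3

You can store whatever you want, with no pauses. The problems with big collections are:

a. They are not really designed to hold a lot of entries (e.g. a dictionary never shrinks back to zero size).
b. When you have too many objects you get GC stalls/pauses.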

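Check out the links above. What we did is "hide" the data from the GC, as described in the article. This way you can use the LocalCache class like a dictionary to store millions of objects.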
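To illustrate the general idea of keeping data out of the GC's way, here is a rough sketch that packs strings into a few large byte arrays and hands back small struct handles. This is not the NFX Pile or LocalCache API; PackedStringStore, Handle and the segment size are all hypothetical names and values chosen for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Illustration only: NOT the NFX Pile / LocalCache API.
// Stores string payloads as UTF-8 bytes inside a few large arrays, so the
// GC tracks a handful of big buffers instead of millions of string objects.
public sealed class PackedStringStore
{
    private readonly int segmentSize;
    private readonly List<byte[]> segments = new List<byte[]>();
    private int used; // bytes used in the last segment

    public PackedStringStore(int segmentSize = 64 * 1024 * 1024)
    {
        this.segmentSize = segmentSize;
    }

    // Value-type handle: an array of these is a single object for the GC.
    public readonly struct Handle
    {
        public readonly int Segment, Offset, Length;
        public Handle(int segment, int offset, int length)
        {
            Segment = segment; Offset = offset; Length = length;
        }
    }

    // Copies the string's bytes into the current segment and returns a handle.
    public Handle Add(string value)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        if (bytes.Length > segmentSize)
            throw new ArgumentException("Value does not fit in one segment.", nameof(value));

        // Start a new segment when none exists yet or the current one is full.
        if (segments.Count == 0 || used + bytes.Length > segmentSize)
        {
            segments.Add(new byte[segmentSize]);
            used = 0;
        }

        Buffer.BlockCopy(bytes, 0, segments[segments.Count - 1], used, bytes.Length);
        var handle = new Handle(segments.Count - 1, used, bytes.Length);
        used += bytes.Length;
        return handle;
    }

    // Rebuilds the string on demand; the temporary string is short-lived.
    public string Get(Handle h) =>
        Encoding.UTF8.GetString(segments[h.Segment], h.Offset, h.Length);
}
```

In the scenario from the question, a field such as Message could be stored as a Handle and decoded only when a row is actually displayed. The cost is extra CPU for encoding and decoding on each access.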

For big-memory apps in .NET, remember to enable 64 bit and to set the GC to server mode in your app config file.
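A small startup check can confirm that both settings actually took effect. The sketch below uses Environment.Is64BitProcess and GCSettings.IsServerGC; the RuntimeCheck class name is made up for the example. In a .NET Framework desktop app, server GC is normally switched on with `<gcServer enabled="true"/>` under `<runtime>` in the app config, while 64-bit execution depends on the build's platform target.

```csharp
using System;
using System.Runtime;

public static class RuntimeCheck
{
    // Reports whether the process is 64-bit and whether server GC is active.
    public static string Describe()
    {
        return $"64-bit process: {Environment.Is64BitProcess}, " +
               $"server GC: {GCSettings.IsServerGC}";
    }
}
```

Logging RuntimeCheck.Describe() once at startup makes it obvious when a build or config change has silently put the app back into 32-bit or workstation GC mode.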
