Optimize String Parsing
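I need to parse data files in the "txf" format. The files may contain more than 1000 entries. Since the format is well defined, like JSON, I want to build a generic parser, like a JSON parser, that can serialise and deserialise txf files.
Unlike JSON, the markup has no way to tell an object from an array. If entries with the same tag occur more than once, we need to treat them as an array.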
# marks the start of an object.
$ marks a member of an object.
/ marks the end of an object.
Here is a sample "txf" file:
#Employees
$LastUpdated=2015-02-01 14:01:00
#Employee
$Id=1
$Name=Employee 01
#Departments
$LastUpdated=2015-02-01 14:01:00
#Department
$Id=1
$Name=Department Name
/Department
/Departments
/Employee
#Employee
/Employee
/Employees
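I was able to create a generic TXF parser using NSScanner, but as the number of entries grows the performance needs more tuning.
I wrote the resulting Foundation object out as a plist and compared the performance of my parser against the plist parser. My parser is about 10 times slower than the plist parser.
Although the plist file is 5 times the size of the txf file and has more markup characters, I feel there is still a lot of room for optimisation.
Any help in this direction is highly appreciated.
EDIT: Including the parsing code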
static NSString *const kArray = @"TXFArray";
static NSString *const kBodyText = @"TXFText";
@interface TXFParser ()
/*Temporary variable to hold values of an object*/
@property (nonatomic, strong) NSMutableDictionary *dict;
/*An array to hold the hierarchical data of all nodes encountered while parsing*/
@property (nonatomic, strong) NSMutableArray *stack;
@end
@implementation TXFParser
#pragma mark - Getters
- (NSMutableArray *)stack{
if (!_stack) {
_stack = [NSMutableArray new];
}return _stack;
}
#pragma mark -
- (id)objectFromString:(NSString *)txfString{
[txfString enumerateLinesUsingBlock:^(NSString *string, BOOL *stop) {
if ([string hasPrefix:@"#"]) {
[self didStartParsingTag:[string substringFromIndex:1]];
}else if([string hasPrefix:@"$"]){
[self didFindKeyValuePair:[string substringFromIndex:1]];
}else if([string hasPrefix:@"/"]){
[self didEndParsingTag:[string substringFromIndex:1]];
}else{
//[self didFindBodyValue:string];
}
}]; return self.dict;
}
#pragma mark -
- (void)didStartParsingTag:(NSString *)tag{
[self parserFoundObjectStartForKey:tag];
}
- (void)didFindKeyValuePair:(NSString *)tag{
NSArray *components = [tag componentsSeparatedByString:@"="];
NSString *key = [components firstObject];
NSString *value = [components lastObject];
if (key.length) {
self.dict[key] = value?:@"";
}
}
- (void)didFindBodyValue:(NSString *)bodyString{
if (!bodyString.length) return;
bodyString = [bodyString stringByTrimmingCharactersInSet:[NSCharacterSet illegalCharacterSet]];
if (!bodyString.length) return;
self.dict[kBodyText] = bodyString;
}
- (void)didEndParsingTag:(NSString *)tag{
[self parserFoundObjectEndForKey:tag];
}
#pragma mark -
- (void)parserFoundObjectStartForKey:(NSString *)key{
self.dict = [NSMutableDictionary new];
[self.stack addObject:self.dict];
}
- (void)parserFoundObjectEndForKey:(NSString *)key{
NSDictionary *dict = self.dict;
//Remove the last value of stack
[self.stack removeLastObject];
//Load the previous object as dict
self.dict = [self.stack lastObject];
//The stack has contents, then we need to append objects
if ([self.stack count]) {
[self addObject:dict forKey:key];
}else{
//This is root object,wrap with key and assign output
self.dict = (NSMutableDictionary *)[self wrapObject:dict withKey:key];
}
}
#pragma mark - Add Objects after finding end tag
- (void)addObject:(id)dict forKey:(NSString *)key{
//If there is no value, bailout
if (!dict) return;
//Check if the dict already has a value for key array.
NSMutableArray *array = self.dict[kArray];
//If array key is not found look for another object with same key
if (array) {
//Array found add current object after wrapping with key
NSDictionary *currentDict = [self wrapObject:dict withKey:key];
[array addObject:currentDict];
}else{
id prevObj = self.dict[key];
if (prevObj) {
/*
There is a prev value for the same key. That means we need to wrap that object in a collection.
1. Remove the object from dictionary,
2. Wrap it with its key
3. Add the prev and current value to array
4. Save the array back to dict
*/
[self.dict removeObjectForKey:key];
NSDictionary *prevDict = [self wrapObject:prevObj withKey:key];
NSDictionary *currentDict = [self wrapObject:dict withKey:key];
self.dict[kArray] = [@[prevDict,currentDict] mutableCopy];
}else{
//Simply add object to dict
self.dict[key] = dict;
}
}
}
/*Wraps Object with a key for the serializer to generate txf tag*/
- (NSDictionary *)wrapObject:(id)obj withKey:(NSString *)key{
if (!key ||!obj) {
return @{};
}
return @{key:obj};
}
EDIT 2:
Sample TXF file with more than 1000 entries.
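Have you considered using pull-style reads and recursive processing? That would remove the need to read the whole file into memory, and also the need to manage your own stack to keep track of how deep you are in the parse.
Below is an example in Swift. It works with your sample "txf", but not with the dropbox version, because some of your "members" span multiple lines. If that is a requirement, it can easily be implemented in the switch/case "$" section. However, I don't see your own code handling this either. Also, the example does not follow proper Swift error handling yet (the parse method would need an additional NSError parameter).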
import Foundation
extension String
{
public func indexOfCharacter(char: Character) -> Int? {
if let idx = find(self, char) {
return distance(self.startIndex, idx)
}
return nil
}
func substringToIndex(index:Int) -> String {
return self.substringToIndex(advance(self.startIndex, index))
}
func substringFromIndex(index:Int) -> String {
return self.substringFromIndex(advance(self.startIndex, index))
}
}
func parse(aStreamReader:StreamReader, parentTagName:String) -> Dictionary<String,AnyObject> {
var dict = Dictionary<String,AnyObject>()
while let line = aStreamReader.nextLine() {
let firstChar = first(line)
let theRest = dropFirst(line)
switch firstChar! {
case "$":
if let idx = theRest.indexOfCharacter("=") {
let key = theRest.substringToIndex(idx)
let value = theRest.substringFromIndex(idx+1)
dict[key] = value
} else {
println("no = sign")
}
case "#":
let subDict = parse(aStreamReader,theRest)
var list = dict[theRest] as? [Dictionary<String,AnyObject>]
if list == nil {
dict[theRest] = [subDict]
} else {
list!.append(subDict)
dict[theRest] = list! // arrays are value types, so store the updated copy back
}
case "/":
if theRest != parentTagName {
println("mismatch... [\(theRest)] != [\(parentTagName)]")
} else {
return dict
}
default:
println("mismatch... [\(line)]")
}
}
println("shouldn't be here...")
return dict
}
var data : Dictionary<String,AnyObject>?
if let aStreamReader = StreamReader(path: "/Users/taoufik/Desktop/QuickParser/QuickParser/file.txf") {
if var line = aStreamReader.nextLine() {
let tagName = line.substringFromIndex(advance(line.startIndex, 1))
data = parse(aStreamReader, tagName)
}
aStreamReader.close()
}
println(JSON(data!))
The StreamReader class used here is borrowed code and is not part of Foundation.
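On the multi-line members mentioned above: one way to keep the pull-style approach is to put a small reader with one line of lookahead in front of the parser, so that the parser only ever sees complete "logical" lines. The sketch below is in C++ and rests on an assumption the thread does not confirm, namely that a continuation line never starts with '#', '$' or '/'. The names MemberLineReader, startsWithMarker and logicalLines are made up for illustration.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// True when the line begins with one of the three txf markers.
bool startsWithMarker(const std::string& line) {
    return !line.empty() && (line[0] == '#' || line[0] == '$' || line[0] == '/');
}

// Pull-style reader that folds continuation lines into the preceding '$' line.
// ASSUMPTION: any line without a marker that follows a '$' line (including a
// blank line) belongs to that member's value.
class MemberLineReader {
public:
    explicit MemberLineReader(std::istream& in) : in_(in) {
        hasPending_ = static_cast<bool>(std::getline(in_, pending_));
    }

    // Stores the next logical line in `out`; returns false at end of input.
    bool next(std::string& out) {
        if (!hasPending_) return false;
        out = pending_;
        hasPending_ = static_cast<bool>(std::getline(in_, pending_));
        // Only member lines can continue; absorb the unmarked lines after them.
        while (hasPending_ && !out.empty() && out[0] == '$' &&
               !startsWithMarker(pending_)) {
            out += '\n';
            out += pending_;
            hasPending_ = static_cast<bool>(std::getline(in_, pending_));
        }
        return true;
    }

private:
    std::istream& in_;
    std::string pending_;   // one line of lookahead
    bool hasPending_;
};

// Convenience wrapper for in-memory text, mainly useful for experiments.
std::vector<std::string> logicalLines(const std::string& text) {
    std::istringstream in(text);
    MemberLineReader reader(in);
    std::vector<std::string> lines;
    std::string line;
    while (reader.next(line)) lines.push_back(line);
    return lines;
}
```

Only one extra line is held in memory at a time, so the file is still never loaded as a whole. If the real files mark continuations differently, only startsWithMarker and the loop condition would need to change.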
Edit
- See the full code at https://github.com/tofi9/QuickParser
- Pull-style line-by-line reading in Objective-C: How to read data from NSFileHandle line by line?
Edit 2
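I rewrote the above in C++11 and got it to run in less than 0.05 seconds (release mode) on a 2012 MBA i5, using the updated file on dropbox. I suspect NSDictionary and NSArray must carry some penalty. The code below can be compiled into an Objective-C project (the file needs to have the extension .mm):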
#include <iostream>
#include <sstream>
#include <string>
#include <fstream>
#include <map>
#include <vector>
#include <chrono> // std::chrono is used by the benchmark class below
using namespace std;
class benchmark {
private:
typedef std::chrono::high_resolution_clock clock;
typedef std::chrono::milliseconds milliseconds;
clock::time_point start;
public:
benchmark(bool startCounting = true) {
if(startCounting)
start = clock::now();
}
void reset() {
start = clock::now();
}
double elapsed() {
milliseconds ms = std::chrono::duration_cast<milliseconds>(clock::now() - start);
double elapsed_secs = ms.count() / 1000.0;
return elapsed_secs;
}
};
struct obj {
map<string,string> properties;
map<string,vector<obj>> subObjects;
};
obj parse(ifstream& stream, string& parentTagName) {
obj obj;
string line;
while (getline(stream, line))
{
if (line.empty()) continue; // skip blank lines; substr(1) below would throw std::out_of_range
auto firstChar = line[0];
auto rest = line.substr(1);
switch (firstChar) {
case '$': {
auto idx = rest.find_first_of('=');
if (idx == string::npos) {
ostringstream o;
o << "no = sign: " << line;
throw o.str();
}
auto key = rest.substr(0,idx);
auto value = rest.substr(idx+1);
obj.properties[key] = value;
break;
}
case '#': {
auto subObj = parse(stream, rest);
obj.subObjects[rest].push_back(subObj);
break;
}
case '/':
if(rest != parentTagName) {
ostringstream o;
o << "mismatch end of object " << rest << " != " << parentTagName;
throw o.str();
} else {
return obj;
}
break;
default:
ostringstream o;
o << "mismatch line " << line;
throw o.str();
break;
}
}
// Throw a std::string so that catch(string) in main() also handles this case
throw string("I don't know why I'm here. Probably because the file is missing an end of object marker");
}
void visualise(obj& obj, int indent = 0) {
for(auto& property : obj.properties) {
cout << string(indent, '\t') << property.first << " = " << property.second << endl;
}
for(auto& subObjects : obj.subObjects) {
for(auto& subObject : subObjects.second) {
cout << string(indent, '\t') << subObjects.first << ": " << endl;
visualise(subObject, indent + 1);
}
}
}
int main(int argc, const char * argv[]) {
try {
obj result;
benchmark b;
ifstream stream("/Users/taoufik/Desktop/QuickParser/QuickParser/Members.txf");
string line;
if (getline(stream, line))
{
string tagName = line.substr(1);
result = parse(stream, tagName);
}
cout << "elapsed " << b.elapsed() << " s" << endl; // elapsed() returns seconds
visualise(result);
}catch(string s) {
cout << "error " << s;
}
return 0;
}
Edit 3
See the link for the complete C++ code: https://github.com/tofi9/TxfParser
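The question also asks for serialisation, which the code above does not cover. A tree of this shape can be written back out with the same recursion in reverse. The following is only a sketch: Node mirrors the obj struct above, and the names Node, serialise and toTxf are hypothetical, not taken from the linked repository.

```cpp
#include <map>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Same shape as the obj struct above: members plus child objects grouped by tag.
struct Node {
    std::map<std::string, std::string> properties;
    std::map<std::string, std::vector<Node>> children;
};

// Writes one object as "#tag", its "$key=value" members, its children, "/tag".
void serialise(const Node& node, const std::string& tag, std::ostream& out) {
    out << '#' << tag << '\n';
    for (const auto& property : node.properties) {
        out << '$' << property.first << '=' << property.second << '\n';
    }
    for (const auto& group : node.children) {
        // Every child in a group is emitted under the group's tag, which is
        // how repeated tags (arrays) come out again.
        for (const auto& child : group.second) {
            serialise(child, group.first, out);
        }
    }
    out << '/' << tag << '\n';
}

// Convenience wrapper that returns the txf text as a string.
std::string toTxf(const Node& root, const std::string& rootTag) {
    std::ostringstream out;
    serialise(root, rootTag, out);
    return out.str();
}
```

Because std::map keeps its keys sorted, members and child groups are written in alphabetical order rather than in their original file order, and members always precede children. If a byte-for-byte round trip matters, an order-preserving container would be needed instead.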
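I did some work on your github source. With the following 2 changes I got an overall improvement of 30%, though most of that improvement comes from "Optimisation 1".
Optimisation 1 - based on your data, the following works.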
+ (int)locate:(NSString*)inString check:(unichar) identifier
{
int ret = -1;
for (int i = 0 ; i < inString.length; i++){
if (identifier == [inString characterAtIndex:i]) {
ret = i;
break;
}
}
return ret;
}
- (void)didFindKeyValuePair:(NSString *)tag{
#if 0
NSArray *components = [tag componentsSeparatedByString:@"="];
NSString *key = [components firstObject];
NSString *value = [components lastObject];
#else
int locate = [TXFParser locate:tag check:'='];
if (locate < 0) return; // no '=' found; substringToIndex: would raise NSRangeException
NSString *key = [tag substringToIndex:locate];
NSString *value = [tag substringFromIndex:locate+1];
#endif
if (key.length) {
self.dict[key] = value?:@"";
}
}
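Splitting at the first '=' also changes behaviour slightly compared with the componentsSeparatedByString: version. For a member such as "Formula=a=b", taking firstObject and lastObject yields the value "b", and for a member with no '=' at all the key and the value come out identical. To make the intended rule explicit, here is a small C++ sketch; the function name splitMember is made up for illustration.

```cpp
#include <string>

// Splits "key=value" at the FIRST '='. Returns false when there is no '='
// or the key is empty; key and value are left untouched in that case.
bool splitMember(const std::string& member, std::string& key, std::string& value) {
    const std::string::size_type pos = member.find('=');
    if (pos == std::string::npos || pos == 0) return false;
    key = member.substr(0, pos);
    value = member.substr(pos + 1);   // everything after the first '=', may be empty
    return true;
}
```

With this rule "Formula=a=b" gives the key "Formula" and the value "a=b", and a line without '=' is reported to the caller instead of being stored.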
Optimisation 2:
- (id)objectFromString:(NSString *)txfString{
[txfString enumerateLinesUsingBlock:^(NSString *string, BOOL *stop) {
#if 0
if ([string hasPrefix:@"#"]) {
[self didStartParsingTag:[string substringFromIndex:1]];
}else if([string hasPrefix:@"$"]){
[self didFindKeyValuePair:[string substringFromIndex:1]];
}else if([string hasPrefix:@"/"]){
[self didEndParsingTag:[string substringFromIndex:1]];
}else{
//[self didFindBodyValue:string];
}
#else
unichar identifier = ([string length]>0)?[string characterAtIndex:0]:0;
if (identifier == '#') {
[self didStartParsingTag:[string substringFromIndex:1]];
}else if(identifier == '$'){
[self didFindKeyValuePair:[string substringFromIndex:1]];
}else if(identifier == '/'){
[self didEndParsingTag:[string substringFromIndex:1]];
}else{
//[self didFindBodyValue:string];
}
#endif
}]; return self.dict;
}
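The idea behind this second change is to read the first character once and branch on it, instead of running up to three prefix comparisons per line. Expressed in C++, in the same spirit as the rewrite in the other answer, it might look like the sketch below. The names LineKind, ClassifiedLine and classify are hypothetical.

```cpp
#include <string>

enum class LineKind { ObjectStart, Member, ObjectEnd, Other };

struct ClassifiedLine {
    LineKind kind;
    std::string text;   // the line without its marker (the whole line for Other)
};

// Looks at the first character exactly once and dispatches on it.
ClassifiedLine classify(const std::string& line) {
    if (line.empty()) return {LineKind::Other, line};
    switch (line[0]) {
        case '#': return {LineKind::ObjectStart, line.substr(1)};
        case '$': return {LineKind::Member, line.substr(1)};
        case '/': return {LineKind::ObjectEnd, line.substr(1)};
        default:  return {LineKind::Other, line};
    }
}
```

The empty-line check plays the same role as the length test before characterAtIndex:0 in the Objective-C version.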
Hope that helps.