Regex to grab name and value from text that could be in different languages

Question

I am trying to extract the spec names and spec values from a short product description like this one:

Brand name: Lenovo‏, Model: IdeaPad 320‏, Size: 15.6''‏, CPU: Intel Core i3 - U‏, The Operating System: Free Dos‏, Capacity: 500GB‏, GPU: Intel‏, Memory Size: 4GB‏, Resolution: 1366x768‏, Optical Drive: DVD-RW (Dual Layer)‏, Color: Red‏, Connection Ports: HDMI‏, USB 3.0‏, USB Type-C‏, Features: HDD 5400RPM‏, Intel Skylake Processor‏, Full Keyboard‏, Bluetooth‏, Warranty: 1 Year‏,

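I'm not very good at regex and I'm still new to it, but I tried the pattern below. I managed to match only the spec name; when I add another pattern for the value, it does not work for all spec values because of the different characters and possibilities. What I want to achieve is: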

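Get a full match that covers the spec name and its value.
Have the first group of the full match contain the spec name and the second group contain the spec value.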

This is the pattern I came up with, but it only handles English text:

((^[a-zA-Z]+|\s[a-zA-Z]+))+:( +[a-zA-Z0-9]+)

Here is the same text in another language, Hebrew. The text should be read from right to left.

יצרן: Lenovo‏, דגם: IdeaPad 320‏, גודל: 15.6''‏, מעבד: Intel Core i3 - U‏, מערכת הפעלה: ללא מערכת הפעלה‏, נפח: 500GB‏, כרטיס מסך: Intel‏, גודל זכרון: 4GB‏, רזולוציה: 1366x768‏, כונן אופטי: (DVD-RW (Dual Layer‏, צבע: אדום‏, חיבורים: HDMI‏, USB 3.0‏, USB Type-C‏, תכונות: דיסק קשיח 5400Rpm‏, מעבד Skylake‏, מקלדת מלאה‏, Bluetooth‏, משך אחריות: שנה‏,

Answer 1

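This seems to work. I'm not sure about the Hebrew, but the English looks fine. Breaking down the spec text relies on the assumption that a "spec name" contains no comma and a spec value contains no colon. (The latter could perhaps be overcome.)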

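import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Product {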

  static String prodEng = "Brand name: Lenovo‏, Model: IdeaPad 320‏, Size: 15.6\"‏, CPU: Intel Core i3 - U‏, The Operating System: Free Dos‏, Capacity: 500GB‏, GPU: Intel‏, Memory Size: 4GB‏, Resolution: 1366x768‏, Optical Drive: DVD-RW (Dual Layer), Color: Red‏, Connection Ports: HDMI‏, USB 3.0‏, USB Type-C‏, Features: HDD 5400RPM‏, Intel Skylake Processor‏, Full Keyboard‏, Bluetooth‏, Warranty: 1 Year‏,";

  static String prodHeb = "יצרן: Lenovo‏, דגם: IdeaPad 320‏, גודל: 15.6''‏, מעבד: Intel Core i3 - U‏, מערכת הפעלה: ללא מערכת הפעלה‏, נפח: 500GB‏, כרטיס מסך: Intel‏, גודל זכרון: 4GB‏, רזולוציה: 1366x768‏, כונן אופטי: (DVD-RW (Dual Layer‏, צבע: אדום‏, חיבורים: HDMI‏, USB 3.0‏, USB Type-C‏, תכונות: דיסק קשיח 5400Rpm‏, מעבד Skylake‏, מקלדת מלאה‏, Bluetooth‏, משך אחריות: שנה‏,";

  private String specStr;
  private Map<String,String> specMap = new LinkedHashMap<>();
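  // Group 1: spec name (no comma). Group 2: spec value (no colon), ended by the last comma before the next name.
  private Pattern pat = Pattern.compile( "([^,]+?):\\s*([^:]+)?,\\s*" );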

  public Product( String spec ){
    this.specStr = spec;
    decompose( this.specStr );
  }

  private void decompose( String s ){
    Matcher mat = pat.matcher( s );
    // find() resumes after the previous match, so no manual position tracking is needed
    while( mat.find() ){
      String key = mat.group( 1 );
      String val = mat.group( 2 );
      specMap.put( key, val );
    }
  }

  public Map<String,String> getSpecs(){
    return specMap;
  }

  public static void main(String[] args) throws Exception {
    Product pe = new Product( prodEng );
    pe.getSpecs().forEach( (k, v) -> { System.out.println( k + ": " + v ); } );
    Product ph = new Product( prodHeb );
    ph.getSpecs().forEach( (k, v) -> { System.out.println( k + ": " + v ); } );
  }
}
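
As for the "no colon in the value" restriction mentioned above, here is a rough sketch of one way it might be relaxed. It is a variation, not the code above: a value ends only at a comma that is followed by the next "name:" or by the end of the input. The class name SpecParser, the method parse and the "Sync Time: 10:30" entry are made up for illustration. It also assumes the invisible character after each value in the sample text is U+200F (RIGHT-TO-LEFT MARK) and strips it before matching. The pattern uses no letter classes, so it should not care which language the names and values are written in.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SpecParser {

  // Group 1: spec name (no comma, no colon).
  // Group 2: spec value, matched lazily up to a comma that is followed
  // either by the next "name:" or by the end of the input.
  private static final Pattern SPEC = Pattern.compile(
      "\\s*([^,:]+?)\\s*:\\s*(.*?)\\s*,\\s*(?=[^,:]+:|$)");

  static Map<String, String> parse(String text) {
    // Assumption: the mark trailing each value is U+200F; remove it first.
    String cleaned = text.replace("\u200F", "");
    Map<String, String> specs = new LinkedHashMap<>();
    Matcher m = SPEC.matcher(cleaned);
    while (m.find()) {
      specs.put(m.group(1), m.group(2));
    }
    return specs;
  }

  public static void main(String[] args) {
    // "Sync Time: 10:30" is a hypothetical entry, added to show a colon inside a value.
    String sample = "Model: IdeaPad 320\u200F, Connection Ports: HDMI\u200F, "
        + "USB 3.0\u200F, Sync Time: 10:30\u200F,";
    parse(sample).forEach((k, v) -> System.out.println(k + " -> " + v));
  }
}
```

For the sample in main this prints "Model -> IdeaPad 320", "Connection Ports -> HDMI, USB 3.0" and "Sync Time -> 10:30". A value that itself contains ", something:" would still be split at that point, so the approach is only as good as that assumption.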

java

regex

regex-group