Extract all capitalized words from a document using PowerShell

Question

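I want to extract all capitalized words from a document using PowerShell. As far as I can tell, everything works until the last line of code. Is my regex wrong, or is my whole approach wrong?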

#Extract content of Microsoft Word Document to text
$word = New-Object -comobject Word.Application
$word.Visible = $True 
$doc = $word.Documents.Open("D:\Deleteme\test.docx") 
$sel = $word.Selection
$paras = $doc.Paragraphs

$path = "D:\deleteme\words.txt"

foreach ($para in $paras) 
{ 
    $para.Range.Text | Out-File -FilePath $path -Append
}

#Find all capitalized words :( Everything works except this. I want to extract all Capitalized words
$capwords = Get-Content $path | Select-string -pattern "/\b[A-Z]+\b/g"

Answer 1

I modified your script and was able to get all the capitalized words in my test document.

$word = New-Object -comobject Word.Application
$word.Visible = $True 
$doc = $word.Documents.Open("D:\WordTest\test.docx") 
$sel = $word.Selection
$paras = $doc.Paragraphs

$path = "D:\WordTest\words.txt"

foreach ($para in $paras) 
{ 
    $para.Range.Text | Out-File -FilePath $path -Append
}

# Get all words in the content
$AllWords = (Get-Content $path)

# Split all words into an array
$WordArray = ($AllWords).split(' ')

# Create array for capitalized words to capture them during ForEach loop
$CapWords = @()

# ForEach loop for each word in the array
foreach($SingleWord in $WordArray){

    # Perform a check to see if the word is fully capitalized
    $Check = $SingleWord -cmatch '\b[A-Z]+\b'
    
    # If check is true, remove special characters and put it into the $CapWords array
    if($Check -eq $True){
        $SingleWord = $SingleWord -replace '[\W]', ''
        $CapWords += $SingleWord
    }
}

I left the output as an array of capitalized words, but if you want a single string you can always join it back together:

$CapString = $CapWords -join " "

Answer 2

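PowerShell stores regexes as strings and has no syntax for regex literals such as /.../, nor for trailing matching options such as g.
PowerShell is case-insensitive by default and requires opt-in for case sensitivity (-CaseSensitive in the case of Select-String).
- Without it, [A-Z] is effectively the same as [A-Za-z] and therefore matches both uppercase and lowercase (English) letters.
The equivalent of the g option is Select-String's -AllMatches switch, which looks for all matches on each input line (by default, it only looks for the first).
What Select-String outputs are not strings, i.e. not the matching lines directly, but wrapper objects of type [Microsoft.PowerShell.Commands.MatchInfo] with metadata about each match.
- Instances of that type have a .Matches property that contains an array of [System.Text.RegularExpressions.Match] instances, whose .Value property contains the text of each match (whereas the .Line property contains the matching line in full).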
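
To see the effect of -CaseSensitive and -AllMatches in isolation, here is a minimal sketch that runs Select-String against an in-memory string instead of the Word export. The sample sentence and variable names are made up for illustration.

```powershell
# Hypothetical sample line; not taken from the question's document.
$sample = 'The NASA and ESA teams met in Paris.'

# Default (case-insensitive): [A-Z] also matches lowercase letters,
# so every word on the line is reported.
$insensitive = ($sample | Select-String -AllMatches -Pattern '\b[A-Z]+\b').Matches.Value

# Case-sensitive: only the all-uppercase words remain.
$sensitive = ($sample | Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b').Matches.Value

# Without -AllMatches, only the first match on the line is returned.
$firstOnly = ($sample | Select-String -CaseSensitive -Pattern '\b[A-Z]+\b').Matches.Value
```

With this input, $sensitive should hold NASA and ESA, while $firstOnly holds only NASA.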

To put it all together:

$capwords = Get-Content -Raw $path |
  Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b' |
    ForEach-Object { $_.Matches.Value }

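Note the use of -Raw with Get-Content, which greatly speeds up processing, because the entire file content is read as a single, multi-line string; in essence, Select-String then sees the entire content as a single "line". This optimization is possible because you are not interested in line-by-line processing and only care about what the regex captured, across all lines.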
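
The difference between the two read modes can be checked with a small sketch. The temporary file and its contents are hypothetical stand-ins for words.txt.

```powershell
# Hypothetical temp file standing in for words.txt.
$tmp = [System.IO.Path]::GetTempFileName()
Set-Content -Path $tmp -Value 'FIRST line', 'SECOND line', 'THIRD line'

# Without -Raw: one string per line, so Select-String would receive three input objects.
$lines = @(Get-Content $tmp)

# With -Raw: a single string that still contains the newlines.
$raw = Get-Content -Raw $tmp

# One input object, yet -AllMatches still finds the matches on every line of it.
$rawWords = ($raw | Select-String -CaseSensitive -AllMatches -Pattern '\b[A-Z]+\b').Matches.Value

Remove-Item $tmp
```

Here $lines has three elements, $raw is one [string], and $rawWords should contain FIRST, SECOND and THIRD.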

As an aside:

$_.Matches.Value takes advantage of PowerShell's member-access enumeration, which you can similarly use to avoid having to loop explicitly over the paragraphs in $paras:

# Use member-access enumeration on collection $paras to get the .Range
# property values of all collection elements and access their .Text
# property value.
$paras.Range.Text | Out-File -FilePath $path
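
Member-access enumeration can be tried without Word installed. The sketch below uses custom objects as hypothetical stand-ins for paragraph objects; only the .Range and .Text property names come from the code above.

```powershell
# Hypothetical stand-ins for Word paragraph objects: each has a .Range with a .Text.
$fakeParas = @(
    [pscustomobject]@{ Range = [pscustomobject]@{ Text = 'First PARAGRAPH' } }
    [pscustomobject]@{ Range = [pscustomobject]@{ Text = 'Second PARAGRAPH' } }
)

# Explicit loop, as in the question.
$viaLoop = foreach ($p in $fakeParas) { $p.Range.Text }

# Member-access enumeration: the array itself has no .Range property,
# so PowerShell applies the property access to each element instead.
$viaEnumeration = $fakeParas.Range.Text
```

Both variables should end up holding the same two strings.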

.NET API alternative:

The [regex]::Matches() .NET method allows for a more concise and better-performing alternative:

$capwords = [regex]::Matches((Get-Content -Raw $path), '\b[A-Z]+\b').Value

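Note that, in contrast with PowerShell, the .NET regex APIs are case-sensitive by default, so no opt-in is required.

.Value again takes advantage of member-access enumeration in order to extract the matching text from all returned match-information objects.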
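
As a quick check of the default case sensitivity, here is a sketch that calls [regex]::Matches() on an in-memory string, once with defaults and once with the IgnoreCase option. The sample text is made up for illustration.

```powershell
# Hypothetical sample text spanning two lines.
$text = "Send the PDF to HR`nbefore the deadline, ASAP."

# Case-sensitive by default: only all-uppercase words match.
$upper = [regex]::Matches($text, '\b[A-Z]+\b').Value

# Case-insensitivity is opt-in here; PowerShell converts the string
# 'IgnoreCase' to the [System.Text.RegularExpressions.RegexOptions] enum.
$anyCase = [regex]::Matches($text, '\b[A-Z]+\b', 'IgnoreCase').Value
```

With this input, $upper should contain PDF, HR and ASAP, while $anyCase contains every word in the text.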
