Extract medical marker name, values and units from analysed image?

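I am using Amazon Textract to analyse anonymised blood tests. Each report consists of markers, their values, units, and reference intervals.

I want to extract them into a dictionary like this: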

{"globulin": [2.8, gidL, [1.0, 4.0]], "cholesterol": [161, mg/dL, [120, 240]], .... }

Here is an example of such OCR-generated text:

Name:
Date Perfermed
$/6/2010
DOBESevState:
Date Collected:
05/03/201004.00 PN
Date Lac Meat: 05/03/2010 10.45 A
Eraminer:
PTM
Date Received: $/7/2010 12:13.11A
Tukit No.
8028522035
Abeormal
Normal
Range
CARDLAC RISK
CHOLESTEROL
161.00
120.00 240.00 mg/dL
CHOLESTEROLHDL RATIO
2.39
1.250 5.00
HIGH DENSITY LIPOPROTEINCHDL)
67.30
35.00 75.00 me/dL
LOW DENSITY LIPOPROTEIN (LDL)
78.70
60.00 a 190.00 midI.
TRIGLYCERIDES
75.00
10.00 a 200.00 made
CHEMISTRIES
ALBUMIN
4.40
3.50 5.50 pidl
ALKALINE PHOSPHATASE
49.00
30.00 120.00 UAL
BLOOD UREA NITROGEN (BUN)
17.00
6.00 2500 meidL
CREATININE
0,85
060 1.50 matdL
FRUCTOSAMINE
182
1.20 1.79 mmoV/l
GAMMA GLUTAMYUTRANSFERASE
9.00
2.00 65.00 UIL
GLOBULIN
2.80
1.00 4.00 gidL.
GLUCOSE
61.00
70.00 125.00 me/dl.
HEMOGLOBIN AIC
5.10
3.00 6.00 %
SGOT (AST)
25.00
0.00 41.00 UM
SOPI (ALT)
22.00
0.00 45.00 IMI
TOTAL BILIRUBIN
0.52
0.10 1.20 mmeldi.
TOTAL PROTEIN
720
6.00 8.50 gidl.
1. This sample lab report shows both normal and abnormal results. as well as
acceptable reference ranges for each testing category.

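Please advise on the best way to extract this information. I have already tried Amazon Comprehend Medical: it does the job, but not for all images. I have also tried SpaCy: https://github.com/NLPatVCU/medaCy https://towardsdatascience.com/named-entity-recognition-with-nltk-and-spacy-8c4a7d88e7da

This is probably not a good application of NLP, because the text is not in any natural language. Rather, it is structured data that can be extracted using rules. Writing rules is definitely one way to solve this problem.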

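  1. You can first try fuzzy matching the category names, i.e. "CARDIAC RISK" and "CHEMISTRIES", against the OCR result, so that the lines can be partitioned into their respective categories.

  2. If you are sure that each entry takes exactly 3 lines, you can simply partition them by newline and extract the data from there.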

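  3. Once you have split them into entries, parse each entry's lines for the value, the reference range, and the unit.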
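
The rule-based idea in steps 2 and 3 can be sketched for a single entry as follows. This is only a minimal illustration using the standard library: the function name `parse_entry` and the regular expression are assumptions made for this example, and it presumes each entry is exactly three lines (name, value, then range with an optional unit).

```python
import re

# Matches an OCR'd number such as "161.00", "0,85" or "182".
NUMBER = re.compile(r"\d+(?:[.,]\d+)?")


def to_float(text: str) -> float:
    """Convert an OCR'd number to float, accepting ',' as a decimal mark."""
    return float(text.replace(",", "."))


def parse_entry(name: str, value: str, range_line: str):
    """Parse one three-line entry into (name, value, unit_tokens, range).

    Hypothetical helper: assumes the third line holds two range numbers
    followed (or interleaved) with unit text and stray OCR letters.
    """
    ref_range = []
    unit_tokens = []
    for token in range_line.split():
        # The first two purely numeric tokens form the reference range;
        # everything else (unit, stray letters) is kept unchanged.
        if NUMBER.fullmatch(token) and len(ref_range) < 2:
            ref_range.append(to_float(token))
        else:
            unit_tokens.append(token)
    return name.strip(), to_float(value.strip()), unit_tokens, ref_range


print(parse_entry("CHOLESTEROL", "161.00", "120.00 240.00 mg/dL"))
print(parse_entry("LOW DENSITY LIPOPROTEIN (LDL)", "78.70", "60.00 a 190.00 midI."))
```

Keeping the non-numeric tokens in a list rather than a single string means an entry with no unit simply yields an empty list, and stray letters such as the `a` between the two range values are preserved for later inspection instead of being silently dropped.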

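Here is some sample code that I ran on the data you provided. It requires the fuzzyset package, which you can get by running python3 -m pip install fuzzyset. Since some entries have no unit, I slightly modified your desired output format and made the unit a list, so that it can easily be empty. The list also stores any random letters found on the third line.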

from fuzzyset import FuzzySet

### Load data
with open("ocr_result.txt") as f:
    data = f.read()

lines = data.split("\n")


### Create fuzzy set
CATEGORIES = ("CARDIAC RISK", "chemistries")
fs = FuzzySet(lines)


### Get the line ranges of each category
cat_ranges = [0] * (len(CATEGORIES) + 1)
for i, cat in enumerate(CATEGORIES):
    match = fs.get(cat)[0]
    match_idx = lines.index(match[1])
    cat_ranges[i] = match_idx

last_idx = lines.index(fs.get("sample lab report")[0][1])
cat_ranges[-1] = last_idx


### Read lines in each category
def _to_float(s: str) -> float:
    """
    Attempt to convert a string value to float
    """
    try:
        f = float(s)
    except ValueError:
        if "," in s:
            s = s.replace(",", ".")
            f = float(s)
        else:
            raise ValueError(f"Cannot convert {s} to float.")
    return f


result = {}
for i, cat in enumerate(CATEGORIES):
    result[cat] = {}

    # Ignore the line of the category itself
    s = slice(cat_ranges[i] + 1, cat_ranges[i + 1])
    lines_in_cat = lines[s]

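    # Every entry must span exactly 3 lines: name, value, range + unit
    if len(lines_in_cat) % 3 != 0:
        raise ValueError(
            f"Expected 3 lines per entry in {cat!r}, got {len(lines_in_cat)} lines."
        )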

    for j in range(0, len(lines_in_cat), 3):
        _name = lines_in_cat[j]
        _value = lines_in_cat[j + 1]
        # split() without arguments also discards repeated spaces
        _line_3 = lines_in_cat[j + 2].split()

        # Convert value to float
        _value = _to_float(_value)

        # Process line 3: the first two numeric tokens are the range,
        # everything else is kept as unit text
        _range = []
        _unit = []
        for token in _line_3:
            if token[0].isdigit() and len(_range) < 2:
                _range.append(_to_float(token))
            else:
                _unit.append(token)

        result[cat][_name] = [_value, _unit, _range]

print(result)

Output:

{'CARDIAC RISK': {'CHOLESTEROL': [161.0, ['mg/dL'], [120.0, 240.0]], 'CHOLESTEROLHDL RATIO': [2.39, [], [1.25, 5.0]], 'HIGH DENSITY LIPOPROTEINCHDL)': [67.3, ['me/dL'], [35.0, 75.0]], 'LOW DENSITY LIPOPROTEIN (LDL)': [78.7, ['a', 'midI.'], [60.0, 190.0]], 'TRIGLYCERIDES': [75.0, ['a', 'made'], [10.0, 200.0]]}, 'chemistries': {'ALBUMIN': [4.4, ['pidl'], [3.5, 5.5]], 'ALKALINE PHOSPHATASE': [49.0, ['UAL'], [30.0, 120.0]], 'BLOOD UREA NITROGEN (BUN)': [17.0, ['meidL'], [6.0, 2500.0]], 'CREATININE': [0.85, ['matdL'], [60.0, 1.5]], 'FRUCTOSAMINE': [182.0, ['mmoV/l'], [1.2, 1.79]], 'GAMMA GLUTAMYUTRANSFERASE': [9.0, ['UIL'], [2.0, 65.0]], 'GLOBULIN': [2.8, ['gidL.'], [1.0, 4.0]], 'GLUCOSE': [61.0, ['me/dl.'], [70.0, 125.0]], 'HEMOGLOBIN AIC': [5.1, ['%'], [3.0, 6.0]], 'SGOT (AST)': [25.0, ['UM'], [0.0, 41.0]], 'SOPI (ALT)': [22.0, ['IMI'], [0.0, 45.0]], 'TOTAL BILIRUBIN': [0.52, ['mmeldi.'], [0.1, 1.2]], 'TOTAL PROTEIN': [720.0, ['gidl.'], [6.0, 8.5]]}}
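
If the flat dictionary from the question is preferred, the nested result above could be reshaped afterwards. The sketch below is one assumption about how that might look: `flatten_result` is a hypothetical helper that lower-cases the marker names, joins the unit tokens into a single string, and drops the category level, so two markers with the same name in different categories would overwrite each other.

```python
def flatten_result(result: dict) -> dict:
    """Reshape {category: {name: [value, unit_tokens, range]}} into
    {name: [value, unit_string, range]} (hypothetical helper)."""
    flat = {}
    for entries in result.values():
        for name, (value, unit_tokens, ref_range) in entries.items():
            # An empty token list becomes an empty unit string.
            flat[name.lower()] = [value, " ".join(unit_tokens), ref_range]
    return flat


# A two-entry excerpt of the output shown above.
sample = {
    "CARDIAC RISK": {"CHOLESTEROL": [161.0, ["mg/dL"], [120.0, 240.0]]},
    "chemistries": {"GLOBULIN": [2.8, ["gidL."], [1.0, 4.0]]},
}
print(flatten_result(sample))
```

Note that this only changes the shape of the data. OCR misreads such as `gidL.` for the unit are carried over unchanged and would still need separate cleaning.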