How to get the ID using a specific word in a regular expression?

Posted by learning in Dev


My string:

<div class="sect1" id="s9781473910270.i101">       
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p> 
</div>
</div>           
<div class="sect1" id="s9781473910270.i103">
<p>sometext [ref*summation]</p>
</div>

<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
</div>
<p>fig1.2 [label*somefigure]</p>
<p>sometext [ref*somefigure]</p>
</div>

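Goal: in the string above, label*string and ref*string are cross-references. Each [ref*string] must be replaced with an a element that has class and href attributes. The href is the ID of the div that contains the matching label*, and the class of the a element is derived from the class of that div.

As mentioned above, the class and href of the a element come from the class name and ID of the related div. However, if a div with class="metadata" is present, it must be ignored: its class name and ID must not be used.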

Expected output:

<div class="sect1" id="s9781473910270.i101">       
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
<p>text</p> 
</div>
</div>             
<div class="sect1" id="s9781473910270.i103">
<p>sometext <a class="section-ref" href="s9781473910270.i102">1.2</a></p>
</div>


<div class="figure" id="s9781473910270.i220">
<div class="metadata" id="s9781473910270.i221">
<p>fig1.2 [label*somefigure]</p>
</div>
<p>sometext <a class="fig-ref" href="s9781473910270.i220">fig 1.2</a></p>          
</div>

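How can this be done in a simpler way, without using a DOM parser?

My idea is to store each label*string together with its ID in an array, then loop over the ref strings. When a ref string matches a label*string, the [ref*string] is replaced using the ID and class stored for that label. So I tried this regular expression to get label*string and its related ID and class name.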
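The idea described above can be sketched without a DOM parser by scanning the markup as a flat stream of tokens. This is a minimal sketch under stated assumptions, not a robust HTML parser: it assumes well-nested `<div>` tags, double-quoted attributes, and that the display text of a link is the first word of the text run directly before the label. The function names `collectLabels` and `replaceRefs` and the class map are hypothetical, chosen for illustration.

```php
<?php
// Sketch only: assumes well-nested <div> tags and double-quoted attributes.

/** Map each label name to the id, class and display text of its nearest non-metadata div. */
function collectLabels(string $html): array
{
    // One token per match: an opening div, a closing div, or "text[label*name]".
    $tokens = '~<div\b([^>]*)>|</div>|([^<>\[]*)\[label\*([^\]]+)\]~';
    preg_match_all($tokens, $html, $matches, PREG_SET_ORDER);

    $stack  = []; // currently open divs, innermost last
    $labels = [];

    foreach ($matches as $m) {
        if ($m[0] === '</div>') {
            array_pop($stack);
        } elseif (isset($m[3])) {
            // Label token: walk outwards to the nearest div that has an id
            // and a class, skipping any div whose class list contains "metadata".
            for ($i = count($stack) - 1; $i >= 0; $i--) {
                $div = $stack[$i];
                if ($div['id'] !== '' && $div['classes'] !== []
                    && !in_array('metadata', $div['classes'], true)) {
                    $words = preg_split('~\s+~', trim($m[2]));
                    $labels[$m[3]] = [
                        'id'    => $div['id'],
                        'class' => $div['classes'][0],
                        'text'  => $words[0],
                    ];
                    break;
                }
            }
        } else {
            // Opening div: remember its id and its list of classes.
            preg_match('~(?<![\w-])id="([^"]*)"~', $m[1], $id);
            preg_match('~(?<![\w-])class="([^"]*)"~', $m[1], $class);
            $stack[] = [
                'id'      => $id[1] ?? '',
                'classes' => preg_split('~\s+~', $class[1] ?? '', -1, PREG_SPLIT_NO_EMPTY),
            ];
        }
    }
    return $labels;
}

/** Replace every [ref*name] that has a known label with an <a> element. */
function replaceRefs(string $html, array $labels, array $classRel): string
{
    return preg_replace_callback('~\[ref\*([^\]]+)\]~', function ($m) use ($labels, $classRel) {
        if (!isset($labels[$m[1]])) {
            return $m[0]; // unknown reference: leave the text untouched
        }
        $label = $labels[$m[1]];
        $class = $classRel[$label['class']] ?? $label['class'];
        return sprintf('<a class="%s" href="%s">%s</a>',
            htmlspecialchars($class), htmlspecialchars($label['id']), htmlspecialchars($label['text']));
    }, $html);
}

$html = <<<'HTML'
<div class="sect1" id="s9781473910270.i101">
<div class="sect2" id="s9781473910270.i102">
<h1 class="title">1.2 Summations and Products[label*summation]</h1>
</div>
</div>
<div class="sect1" id="s9781473910270.i103">
<p>sometext [ref*summation]</p>
</div>
HTML;

$classRel = ['sect2' => 'section-ref', 'figure' => 'fig-ref'];
echo replaceRefs($html, collectLabels($html), $classRel), "\n";
```

A token scan like this is fragile compared with a real parser: comments, unquoted attributes, or unbalanced tags would break the div stack, so it is only reasonable when the input markup is tightly controlled.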

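Casimir et Hippolyte

This approach uses the HTML structure to retrieve the needed elements with DOMXPath. Regular expressions are then used in a second step to extract information from text nodes or attributes: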

$classRel = ['sect2'  => 'section-ref',
             'figure' => 'fig-ref'];

libxml_use_internal_errors(true);

$dom = new DOMDocument;
$dom->loadHTML($html); // $html holds the markup from the question; or use $dom->loadHTMLFile($url);

$xp = new DOMXPath($dom);

// make a custom php function available for the XPath query
// (it isn't really necessary, but it is more rigorous than writing
// "contains(@class, 'myClass')" )
$xp->registerNamespace("php", "http://php.net/xpath");

function hasClass($classNode, $className) {
    if (!empty($classNode))
        return in_array($className, preg_split('~\s+~', $classNode[0]->value, -1, PREG_SPLIT_NO_EMPTY));
    return false;
}

$xp->registerPHPFunctions('hasClass');


// The XPath query will find the first ancestor of a text node with '[label*'
// that is a div tag with an id and a class attribute,
// if the class attribute doesn't contain the "metadata" class.

$labelQuery = <<<'EOD'
//text()[contains(., 'label*')]
/ancestor::div
[@id and @class and not(php:function('hasClass', @class, 'metadata'))][1]
EOD;

$idNodeList = $xp->query($labelQuery);

$links = [];

// For each div node, a new link node is created in the associative array $links.
// The keys are labels. 
foreach($idNodeList as $divNode) {

    // The pattern extracts the first text part in group 1 and the label in group 2
    if (preg_match('~(\S+) .*? \[label\* ([^]]+) ]~x', $divNode->textContent, $m)) {
        $links[$m[2]] = $dom->createElement('a');
        $links[$m[2]]->setAttribute('href', $divNode->getAttribute('id'));
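        // fall back to the div's own class when it has no entry in $classRel
        $divClass = $divNode->getAttribute('class');
        $links[$m[2]]->setAttribute('class', $classRel[$divClass] ?? $divClass);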
        $links[$m[2]]->nodeValue = $m[1];
    }
}


if ($links) { // if $links is empty no need to do anything

    $refNodeList = $xp->query("//text()[contains(., '[ref*')]");

    foreach ($refNodeList as $refNode) {
        // split the text with square brackets parts, the reference name is preserved in a capture
        $parts = preg_split('~\[ref\*([^]]+)]~', $refNode->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);

        // create a fragment to receive text parts and links
        $frag = $dom->createDocumentFragment();

        foreach ($parts as $k=>$part) {
            if ($k%2 && isset($links[$part])) { // delimiters are always odd items
                $clone = $links[$part]->cloneNode(true);
                $frag->appendChild($clone);
            } elseif ($part !== '') {
                $frag->appendChild($dom->createTextNode($part));
            }
        }

        $refNode->parentNode->replaceChild($frag, $refNode);
    }
}

$result = '';

$childNodes = $dom->getElementsByTagName('body')->item(0)->childNodes;

foreach ($childNodes as $childNode) {
    $result .= $dom->saveXML($childNode);
}

echo $result;
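
A comment in the code above says the custom `hasClass` function is more rigorous than `contains(@class, 'myClass')`. The following sketch illustrates why, and shows a commonly used pure XPath 1.0 expression that tests for a whole class token without a PHP callback. The sample markup and variable names are made up for illustration.

```php
<?php
$dom = new DOMDocument;
$dom->loadHTML('<div class="metadata-extra" id="a"></div><div class="figure metadata" id="b"></div>');
$xp = new DOMXPath($dom);

// Naive substring test: also matches the unrelated class "metadata-extra".
$naive = $xp->query("//div[contains(@class, 'metadata')]");

// Whole-token test: pad the normalized class list with spaces,
// then look for " metadata " surrounded by spaces.
$exact = $xp->query("//div[contains(concat(' ', normalize-space(@class), ' '), ' metadata ')]");

echo $naive->length, ' ', $exact->length, "\n"; // prints "2 1"
```

Either the padded `concat` expression or a registered PHP function avoids the false positive; the registered function is arguably easier to read, at the cost of a dependency on `registerPHPFunctions`.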

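The replacement loop in the code above relies on `PREG_SPLIT_DELIM_CAPTURE` placing the captured reference names at the odd indexes of the split result. A small standalone illustration, using a made-up input string:

```php
<?php
$text  = 'see [ref*summation] and [ref*somefigure].';
$parts = preg_split('~\[ref\*([^]]+)]~', $text, -1, PREG_SPLIT_DELIM_CAPTURE);

// Even indexes hold plain text, odd indexes hold the captured reference names.
foreach ($parts as $k => $part) {
    echo $k, ' ', ($k % 2 ? 'ref ' : 'text'), ' ', var_export($part, true), "\n";
}
```

Because the pattern has exactly one capture group, text and reference names strictly alternate, which is what makes the `$k % 2` test in the answer safe.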

Edited on 2021-02-21
