Example: Identify possible untagged URIs
- Identify things that should be tagged as links but aren't
- In this document set, the link tag is <web>.
- Problem: find strings that look like URI (URL) syntax
- Solution:
  - Define “looks like URI syntax” as “contains the substring :/”
  - Expect false hits
  - Tokenize strings (delimited by whitespace)
  - Report every token that contains :/ and does not appear inside a <web> element
  - With each token, report an XPath (location path) that helps us find the text easily
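
The steps above can be sketched roughly as follows. This is a minimal illustration in Python using only the standard library, not necessarily the tooling used for this document set. The function name `find_untagged_uris`, the constants `LINK_TAG` and `MARKER`, and the sample document are all hypothetical. The reported path points at the element that directly contains the text, with a positional predicate on every step.

```python
import xml.etree.ElementTree as ET
from collections import Counter

LINK_TAG = "web"   # link element name used in this document set
MARKER = ":/"      # crude "looks like URI syntax" test; false hits are expected


def find_untagged_uris(xml_text):
    """Return (token, xpath) pairs for tokens containing MARKER outside <web>."""
    root = ET.fromstring(xml_text)
    hits = []

    def check(text, path):
        # Tokenize on whitespace and keep tokens that contain the marker.
        for token in (text or "").split():
            if MARKER in token:
                hits.append((token, path))

    def walk(elem, path):
        if elem.tag == LINK_TAG:
            return  # already tagged: skip the element and everything inside it
        check(elem.text, path)
        seen = Counter()  # position of each child among same-named siblings
        for child in elem:
            seen[child.tag] += 1
            walk(child, f"{path}/{child.tag}[{seen[child.tag]}]")
            # Text following a child belongs to the parent, even after <web>.
            check(child.tail, path)

    walk(root, f"/{root.tag}[1]")
    return hits


if __name__ == "__main__":
    sample = """<doc>
  <p>See <web>http://example.org/a</web> for details.</p>
  <p>Also try ftp://example.org/b and the ratio 3:/4 here.</p>
</doc>"""
    for token, path in find_untagged_uris(sample):
        print(path, token)
    # /doc[1]/p[2] ftp://example.org/b
    # /doc[1]/p[2] 3:/4
```

The second hit (3:/4) is a false hit of the kind the definition leads us to expect; the reported path makes it quick to inspect and dismiss.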