Text Similarity/Comparison Question

Mon Mar 25 19:10:36 EDT 2019

Hello.  I am working on a project where one system (System A) contains seven text fields (unstructured data for comments).
I have concatenated all of the fields into a single field.

There is a second system (System B) containing two unstructured fields that capture text comments.  I have concatenated these fields into a single field,
just as I did for the first system.  This system contains highly sensitive, restricted data.

The issue that I'm trying to solve is that there should not be any text data from System B (sensitive narratives, investigative IDs, etc.) in System A.
In essence, I am trying to find the following three items in the System A text:
1) Direct references to investigations (e.g. "Investigation number ABC123")
2) Language that refers to an investigation (e.g. "Jane Doe is under investigation")
3) Actual cut-and-paste segments where someone copied text verbatim from System B into the System A commentary fields.
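
To make items 1 and 3 concrete, here is a minimal sketch of the kind of check I have in mind, using only the standard library. The ID pattern (three capital letters followed by three digits) is only an assumption based on the "ABC123" example, and the function names, the sample strings, and the 5-word window are hypothetical choices for illustration, not anything taken from the real systems.

```python
import re

# Assumed ID format: three capital letters followed by three digits ("ABC123").
# The real format used in System B may well differ.
INVESTIGATION_ID = re.compile(r"\b[A-Z]{3}\d{3}\b")


def find_investigation_ids(text):
    """Return every substring of text that matches the assumed ID pattern."""
    return INVESTIGATION_ID.findall(text)


def word_ngrams(text, n):
    """Return the set of lowercase word n-grams (shingles) found in text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def shared_ngrams(text_a, text_b, n=5):
    """Return the word n-grams that occur in both texts (possible verbatim copies)."""
    return word_ngrams(text_a, n) & word_ngrams(text_b, n)


# Made-up sample comments, for illustration only.
system_a = "Customer called. Investigation number ABC123 was opened after the branch visit."
system_b = "Case notes: investigation number ABC123 was opened after the branch visit on Monday."

print(find_investigation_ids(system_a))
print(sorted(shared_ngrams(system_a, system_b)))
```

A System A comment that shares several consecutive n-grams with a System B comment would be a candidate for a cut-and-paste segment. Item 2 is not covered by this sketch, since it depends on meaning rather than on exact strings.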

It seems that I may need different text similarity (comparing System A text against System B text) or search techniques for one or more of the three items.
I was thinking that cosine similarity might be useful, but I thought I would solicit some advice, as I am fairly new to text analytics in Python.
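
For reference, this is roughly what I mean by cosine similarity: a minimal sketch over raw word counts, using only the standard library. The tokenization (lowercased `\w+` matches), the unweighted counts, and the function names are my own assumptions for illustration; a real pipeline might use TF-IDF weighting or a dedicated library instead.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Count lowercase word tokens in text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a, b = term_counts(text_a), term_counts(text_b)
    # Only words present in both texts contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Made-up sample comments, for illustration only.
score = cosine_similarity("Jane Doe is under investigation", "Jane Doe called")
print(round(score, 4))
```

Because this treats each comment as a bag of words, it ignores word order, so a high score suggests shared vocabulary rather than proving a verbatim copy.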

Thank you in advance.

Kenneth R Adams
Compliance Technology and Analytics
TAS - Text Analytics as a Service
Wells Fargo & Co. |  401 South Tryon Street, Twenty-sixth Floor | Charlotte, NC 28202
MAC: D1050-262
Cell: 704-408.5157

Kenneth.R.Adams at WellsFargo.com
