Usage
The typical usage pattern looks like the following. The example below finds the sentences (identified by Genia) that are contained in a ce:para XML element.
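# Imports needed by this example. The query and utility functions used below
# (GetAQAnnotations, FilterType, ContainedIn, ...) are assumed to be already
# imported from the library.
import pyspark.sql
from pyspark import StorageLevel
# Get a SparkSession
spark = pyspark.sql.SparkSession.builder.getOrCreate()
# Set a reasonable value for shuffle partitions (helps with join performance)
spark.conf.set("spark.sql.shuffle.partitions", 4)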
# Read in some annotations. This reads in the Original Markup annotations (think XML).
omAnnots = GetAQAnnotations(spark.read.parquet("./tests/resources/om/"),
                            props=["*"],
                            decodeProps=["*"],
                            numPartitions=int(spark.conf.get("spark.sql.shuffle.partitions"))) \
    .persist(StorageLevel.DISK_ONLY)
# Read in some more annotations. This reads in NLP annotations generated by Genia.
geAnnots = GetAQAnnotations(spark.read.parquet("./tests/resources/genia/"),
                            props=["orig", "lemma", "pos"],
                            decodeProps=["orig", "lemma"],
                            numPartitions=int(spark.conf.get("spark.sql.shuffle.partitions"))) \
    .persist(StorageLevel.DISK_ONLY)
# Find those sentence annotations contained in a ce:para (XML element)
sentenceParaAnnots = ContainedIn(FilterType(geAnnots,'sentence'),FilterType(omAnnots,'ce:para'))
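To make the idea of containment concrete, here is a minimal pure-Python sketch of what "contained in" could mean for offset-based annotations. The `Annot` class, its field names, and the `contained_in` helper are hypothetical and are not the library's schema or API. The sketch assumes that an annotation is contained in another when both belong to the same document and its offsets fall within the other's offsets.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annot:
    """Hypothetical stand-in for one annotation row (not the library's schema)."""
    doc_id: str
    annot_type: str
    start: int
    end: int


def contained_in(left, right):
    """Return the annotations in `left` that lie within some annotation in `right`.

    Assumption: containment means same document and start/end offsets
    inside the container's offsets (inclusive).
    """
    return [
        a for a in left
        if any(b.doc_id == a.doc_id and b.start <= a.start and a.end <= b.end
               for b in right)
    ]


paras = [Annot("doc1", "ce:para", 0, 100)]
sentences = [
    Annot("doc1", "sentence", 0, 40),     # inside the paragraph
    Annot("doc1", "sentence", 41, 100),   # inside the paragraph
    Annot("doc1", "sentence", 101, 150),  # starts after the paragraph ends
    Annot("doc2", "sentence", 0, 40),     # different document
]

inside = contained_in(sentences, paras)
print([(a.doc_id, a.start, a.end) for a in inside])
```

The nested loop is only for illustration. On real data the same comparison would be expressed as a Spark join, which is why the shuffle partition setting above matters.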
Query examples
To get you started, below are more examples using the various query functions. They can be combined to express more complex queries.
# Find the annotations "simple" that are within 25 characters after an annotation "the".
results = After(FilterProperty(geAnnots,"orig","simple"), FilterProperty(geAnnots,"orig","the"),25)
# Find the annotations "the" that are within 25 characters before an annotation "simple"
results = Before(FilterProperty(geAnnots,"orig","the"), FilterProperty(geAnnots,"orig","simple"),25)
# Find the annotations "simple" that are after an annotation "the" and before an annotation "end"
results = Between(FilterProperty(geAnnots,"orig","simple"), FilterProperty(geAnnots,"orig","the"), FilterProperty(geAnnots,"orig","end"))
# find those annotations "sentence" that are contained in "ce:para"
results = ContainedIn(FilterType(geAnnots,"sentence"), FilterType(omAnnots,"ce:para"))
# find those annotations "sentence" that contain an annotation "simple"
results = Contains(FilterType(geAnnots,"sentence"), FilterProperty(geAnnots,"orig","simple"))
# filter the property field "orig" with the value "simulator". A single value or an array of values can be used for the filter comparison.
results = FilterProperty(geAnnots,"orig","simulator")
# filter the annotation sets with the value "ge"
results = FilterSet(geAnnots,"ge")
# filter the annotation types with the value "sentence"
results = FilterType(geAnnots,"sentence")
# return the "sentence" annotations that follow a "sentence" containing an annotation "simple" (the anchor), within the container annotation "ce:para"
sentences = Contains(FilterType(geAnnots,"sentence"),FilterProperty(geAnnots,"orig","simple"))
results = Following(FilterType(geAnnots,"sentence"), sentences,FilterType(omAnnots,"ce:para"))
# return the "sentence" annotations that precede a "sentence" containing an annotation "simple" (the anchor), within the container annotation "ce:para"
sentences = Contains(FilterType(geAnnots,"sentence"),FilterProperty(geAnnots,"orig","simple"))
results = Preceding(FilterType(geAnnots,"sentence"), sentences,FilterType(omAnnots,"ce:para"))
# find the annotations "basic" that are in the same document as the annotation "simple" and also match values on the specified property "parentId"
results = MatchProperty(FilterProperty(geAnnots,"orig","basic"), FilterProperty(geAnnots,"orig","simple"),"parentId")
# filter the property field "orig" using the regular expression "^sim*"
results = RegexProperty(geAnnots,"orig","^sim*")
# find the annotations "the" that are followed, within 50 characters, by an annotation "simple"
results = Sequence(FilterProperty(geAnnots,"orig","the"), FilterProperty(geAnnots,"orig","simple"), 50)
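A note on the pattern "^sim*" used in the RegexProperty example above: in standard regular expression syntax, `*` applies only to the preceding character, so the pattern means "starts with `si`, followed by zero or more `m`". The sketch below uses Python's `re` module as a stand-in. Which regex engine RegexProperty uses, and whether it matches anywhere in the value or against the whole value, is an assumption here, and the word list is made up for illustration.

```python
import re

# Made-up sample values for the "orig" property.
words = ["simple", "simulator", "single", "basic"]

# The pattern from the example: "si" at the start, then zero or more "m".
loose = re.compile(r"^sim*")
matched_loose = [w for w in words if loose.search(w)]

# A stricter prefix pattern that requires the literal prefix "sim".
strict = re.compile(r"^sim")
matched_strict = [w for w in words if strict.search(w)]

print(matched_loose)
print(matched_strict)
```

If the intent is "values starting with sim", a pattern such as "^sim" or "^sim.*" may be closer to what is wanted.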