How To Split Text Into Sentences In C#

As a programmer, I can recall countless instances where I had to split a chunk of text into its constituent sentences and do it efficiently.

Using the Split method of Regex was not an option because, in my experience, the regular expression engine was too slow for this job.

string[] sentences = Regex.Split(text, @"(?<=[\.!\?:…\r\n])\s+"); // Slow

The Split method of String met my efficiency requirements, but that did not seem like an option because it eliminated the punctuation marks.

string[] sentences = text.Split(new char[] { '.', '!', '?', ':', '…', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries); // Trims punctuation marks (not cool)

In other words, assuming the following value for text…

string text = "My name is John. What is your name? Show me the time? It is 9:48PM. Okay, bye now!";

The Split method of String returns:

sentences[0] = "My name is John";
sentences[1] = " What is your name"; // leading space kept, question mark dropped, and so on

Terminal punctuation marks such as the period (.), exclamation mark (!), question mark (?), colon (:), optionally the semicolon (;), the ellipsis (…), and preferably the newline (\n) are the delimiters the Split methods use to determine sentence boundaries. The problem with the preferred Split method, that of class String, is that it eliminates the delimiters.
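The loss of delimiters described above can be reproduced with a short console program. This is only a sketch: the class name SplitDemo and the method name NaiveSplit are made up for illustration, and the delimiter list is the one from the String.Split snippet in this article.

```csharp
using System;

public static class SplitDemo
{
    // Hypothetical helper: splits on single punctuation characters,
    // exactly like the String.Split call shown in the article.
    public static string[] NaiveSplit(string text)
    {
        char[] delimiters = { '.', '!', '?', ':', '…', '\r', '\n' };
        return text.Split(delimiters, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string text = "My name is John. What is your name? Show me the time? It is 9:48PM. Okay, bye now!";

        // Brackets make leading spaces visible in the output.
        foreach (string s in NaiveSplit(text))
        {
            Console.WriteLine("[" + s + "]");
        }
        // Prints:
        // [My name is John]
        // [ What is your name]
        // [ Show me the time]
        // [ It is 9]
        // [48PM]
        // [ Okay, bye now]
    }
}
```

Besides dropping the punctuation, this approach also cuts the time 9:48PM in two, because the colon is treated as a delimiter wherever it appears.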

It turns out there is a clever way, one I have not yet seen mentioned on Stack Overflow, to split text that is both fast and workable. It uses the StringBuilder class.

StringBuilder sb = new StringBuilder(text);
sb.Replace(". ", ".$$$").Replace("! ", "!$$$").Replace("? ", "?$$$").Replace(": ", ":$$$").Replace("; ", ";$$$").Replace("… ", "…$$$");
// Normalize line endings first so that "\r\n" is marked once, not once for "\r" and again for "\n".
sb.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", "\n$$$");

Here, the delimiter is not the punctuation mark alone but the punctuation mark followed by a space. This matters because we do not want an acronym with internal periods, or a time such as 9:48PM, where a colon separates the hours from the minutes, to be split into segments. The space following the punctuation mark is then replaced with a “code string”, in this case three successive dollar signs. The code string should be a sequence that does not otherwise occur in the text.
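To see what the replacement step produces before any splitting happens, here is a minimal sketch that prints the intermediate string. The class name MarkerDemo and the method name MarkBoundaries are assumptions made for this example. Only the space-based replacements are shown; line breaks are left out to keep the output on one line.

```csharp
using System;
using System.Text;

public static class MarkerDemo
{
    // Hypothetical helper: replaces the space after each punctuation mark
    // with the "$$$" code string and returns the marked text.
    public static string MarkBoundaries(string text)
    {
        var sb = new StringBuilder(text);
        sb.Replace(". ", ".$$$")
          .Replace("! ", "!$$$")
          .Replace("? ", "?$$$")
          .Replace(": ", ":$$$")
          .Replace("; ", ";$$$")
          .Replace("… ", "…$$$");
        return sb.ToString();
    }

    public static void Main()
    {
        string text = "My name is John. What is your name? Show me the time? It is 9:48PM. Okay, bye now!";
        Console.WriteLine(MarkBoundaries(text));
        // Prints:
        // My name is John.$$$What is your name?$$$Show me the time?$$$It is 9:48PM.$$$Okay, bye now!
    }
}
```

The colon in 9:48PM is untouched because it is followed by a digit rather than a space, and every punctuation mark stays attached to its sentence.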

Now we can convert the StringBuilder back to a string and split it with the Split method of String…

string[] sentences = sb.ToString().Split(new[] { "$$$" }, StringSplitOptions.RemoveEmptyEntries); // the string[] overload also compiles on .NET Framework

… and we don’t care that the Split method of String eliminates the delimiter because we want the 3-dollar sign delimiter eliminated. That’s how easy and fast it is to divide text into sentences.
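Putting the steps together, the technique could be wrapped in a small reusable method along these lines. Treat it as a sketch rather than a finished implementation: the names SentenceSplitter, SplitIntoSentences and the marker parameter are hypothetical, and the check that rejects text already containing the marker is my own addition, not part of the original approach.

```csharp
using System;
using System.Text;

public static class SentenceSplitter
{
    // Hypothetical helper. The marker defaults to the article's "$$$".
    public static string[] SplitIntoSentences(string text, string marker = "$$$")
    {
        if (text == null) throw new ArgumentNullException(nameof(text));
        if (string.IsNullOrEmpty(marker))
            throw new ArgumentException("The marker must not be empty.", nameof(marker));

        // Assumption: a marker that already occurs in the text would create
        // false sentence boundaries, so such input is rejected.
        if (text.Contains(marker))
            throw new ArgumentException("The marker already occurs in the text.", nameof(marker));

        var sb = new StringBuilder(text);

        // Replace the space after each punctuation mark with the marker.
        foreach (string mark in new[] { ".", "!", "?", ":", ";", "…" })
        {
            sb.Replace(mark + " ", mark + marker);
        }

        // Normalize line endings, then mark each line break once.
        sb.Replace("\r\n", "\n").Replace("\r", "\n").Replace("\n", "\n" + marker);

        // Split on the marker; the punctuation stays with each sentence.
        return sb.ToString().Split(new[] { marker }, StringSplitOptions.RemoveEmptyEntries);
    }

    public static void Main()
    {
        string text = "My name is John. What is your name? Show me the time? It is 9:48PM. Okay, bye now!";
        foreach (string sentence in SplitIntoSentences(text))
        {
            Console.WriteLine(sentence);
        }
        // Prints:
        // My name is John.
        // What is your name?
        // Show me the time?
        // It is 9:48PM.
        // Okay, bye now!
    }
}
```

Sentences that end at a line break keep their trailing newline character, which matches the behaviour of the original replacement chain; call Trim() on each entry if that is not wanted.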

Similar Questions:

To split text into sentences in Python, use NLTK's sent_tokenize(text), where text is a string, to split that string into a list of sentences. You need to call nltk.download("punkt") the first time the code is executed.

In Python, use the sent_tokenize() method to split a document or paragraph into sentences.

With the help of the nltk.tokenize.word_tokenize() method, it is possible to extract the tokens from a string of characters … It returns word and punctuation tokens, not the syllables of a single word.

Tokenization is the process of separating a chunk of text into smaller units called tokens. Tokens can be words, characters or subwords. Therefore, tokenization can be broadly classified into 3 types: word, character, and subword (n-gram characters) tokenization.
