All,
I have imported the sotu corpus from quanteda in R. I am somewhat new to dfm objects and want to separate the doc_id column into a name column and a year column. If this were a tibble, this code would work:
library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- as_tibble(data_corpus_sotu)
sotusubsetted <- sotu %>%
  separate(doc_id, c("name", "year"), "-")
However, since I am new to dfm objects and regex, I am not sure whether there is an equivalent process if I load the data as:
library(quanteda)
library(quanteda.corpora)
library(tidyverse)
sotu <- corpus(data_corpus_sotu)
sotudfm <- dfm(sotu)
Is there some equivalent way to do this with dfm objects?
The safest method is also one that works for any core quanteda object, whether a corpus, tokens, or dfm object. It uses the accessor functions rather than addressing the internals of the corpus or dfm objects directly, which is strongly discouraged. You can access the internals, but your code could break in the future if those object structures change. In addition, our accessor functions are generally also the most efficient method.
For this task, you want to use the docnames() function for accessing the document IDs; this works for the corpus as well as for the dfm.
library("quanteda")
## Package version: 2.1.2
data("data_corpus_sotu", package = "quanteda.corpora")
data.frame(doc_id = docnames(data_corpus_sotu[1:5])) %>%
  tidyr::separate(doc_id, c("name", "year"), "-")
##         name  year
## 1 Washington  1790
## 2 Washington 1790b
## 3 Washington  1791
## 4 Washington  1792
## 5 Washington  1793
data.frame(doc_id = docnames(dfm(data_corpus_sotu[1:5]))) %>%
  tidyr::separate(doc_id, c("name", "year"), "-")
##         name  year
## 1 Washington  1790
## 2 Washington 1790b
## 3 Washington  1791
## 4 Washington  1792
## 5 Washington  1793
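Note that splitting on "-" leaves year as a character column, and some IDs carry a letter suffix such as 1790b. If a numeric year is needed, one possible approach is to strip the non-digit characters with base R regex functions. This is only a sketch: the ids vector below is a hand-typed stand-in for the output of docnames(), and it assumes every ID contains exactly one hyphen.

```r
# Hypothetical stand-in for docnames(data_corpus_sotu)
ids <- c("Washington-1790", "Washington-1790b", "Washington-1791")

# Everything before the first hyphen is the name
name <- sub("-.*$", "", ids)

# Everything after the first hyphen is the year part, possibly with a suffix
year_chr <- sub("^[^-]*-", "", ids)

# Drop any non-digit characters (e.g. the "b" in "1790b") and convert
year <- as.integer(gsub("[^0-9]", "", year_chr))

data.frame(name = name, year = year)
```

Dropping the suffix means the two 1790 addresses share the same year value, so keep the original doc_id if you need to tell them apart.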
You could also have taken this from the "President" and "Date" docvar fields:
data.frame(
  name = data_corpus_sotu$President,
  year = lubridate::year(data_corpus_sotu$Date)
) %>%
  head()
##         name year
## 1 Washington 1790
## 2 Washington 1790
## 3 Washington 1791
## 4 Washington 1792
## 5 Washington 1793
## 6 Washington 1794
Created on 2021-02-13 by the reprex package (v1.0.0)
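If the goal is to keep the name and year with the dfm itself rather than in a separate data frame, one option is to assign them as document variables through the docvars() accessor. The sketch below uses a small made-up corpus (the texts and the corp and dfmat names are illustrative, not part of the sotu data) and tokenizes before calling dfm(), which should also work in quanteda versions newer than the 2.1.2 shown above.

```r
library(quanteda)

# Toy corpus; the element names become the docnames (illustrative data only)
corp <- corpus(c(
  "Washington-1790"  = "Fellow citizens of the Senate",
  "Washington-1790b" = "In meeting you again",
  "Adams-1797"       = "I was for some time apprehensive"
))
dfmat <- dfm(tokens(corp))

# Split each document name on the literal hyphen
parts <- strsplit(docnames(dfmat), "-", fixed = TRUE)

# Attach the pieces as docvars using the replacement accessor
docvars(dfmat, "name") <- vapply(parts, `[`, character(1), 1)
docvars(dfmat, "year") <- vapply(parts, `[`, character(1), 2)

docvars(dfmat)
```

Once stored as docvars, the new fields can be used for subsetting or grouping, for example with dfm_subset(dfmat, name == "Washington").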